Eric J Pedersen
August 6th, 2016
So you have a GAM:
gam.check(), part 2
library(mgcv)
library(magrittr)
library(ggplot2)
library(dplyr)
library(statmod)
source("quantile_resid.R")
With any model, how accurate your predictions are depends on how good the model is. To see how gam.check() helps, we simulate data where the true functions are known:
set.seed(2)
n = 400
x1 = rnorm(n)
x2 = rnorm(n)
# true mean on the linear-predictor scale
y_val = 1 + 2*cos(pi*x1) + 2/(1 + exp(-5*x2))
# responses under three distributions
y_norm = y_val + rnorm(n, 0, 0.5)
y_negbinom = rnbinom(n, mu = exp(y_val), size = 10)
y_binom = rbinom(n, 1, prob = exp(y_val)/(1 + exp(y_val)))
The basis size k sets the maximum degrees of freedom per term: s(x, k=10), or s(x, y, k=100) for a two-dimensional smooth. If k is too small, the fitted smooth is forced to be less wiggly than the true function, so start by checking it:
norm_model_1 = gam(y_norm ~ s(x1, k=4) + s(x2, k=4), method = "REML")
gam.check(norm_model_1)
Method: REML Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-0.0003467788,0.0005154578]
(score 736.9402 & scale 2.252304).
Hessian positive definite, eigenvalue range [0.000346021,198.5041].
Model rank = 7 / 7
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x1) 3.000 1.002 0.125 0.00
s(x2) 3.000 2.914 1.045 0.79
The very low k-index and p-value for s(x1) suggest its basis is too small, so increase k for x1:
norm_model_2 = gam(y_norm ~ s(x1, k=12) + s(x2, k=4), method = "REML")
gam.check(norm_model_2)
Method: REML Optimizer: outer newton
full convergence after 11 iterations.
Gradient range [-8.009729e-05,7.329675e-05]
(score 345.3111 & scale 0.2706205).
Hessian positive definite, eigenvalue range [0.9677905,198.63].
Model rank = 15 / 15
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x1) 11.000 10.837 0.989 0.42
s(x2) 3.000 2.984 0.861 0.00
Now s(x2) is flagged instead, so increase its basis size as well:
norm_model_3 = gam(y_norm ~ s(x1, k=12) + s(x2, k=12), method = "REML")
gam.check(norm_model_3)
Method: REML Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-1.136198e-08,6.812328e-13]
(score 334.2084 & scale 0.2485446).
Hessian positive definite, eigenvalue range [2.812271,198.6868].
Model rank = 23 / 23
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x1) 11.000 10.848 0.976 0.28
s(x2) 11.000 7.948 0.946 0.12
# compare the estimated smooths from all three fits
layout(matrix(1:6, ncol=2, byrow=TRUE))
plot(norm_model_1); plot(norm_model_2); plot(norm_model_3)
layout(1)
gam.check() also creates four diagnostic plots:

1. Quantile-quantile plot of residuals: if the model is right, these should follow the 1-1 line.
2. Histogram of residuals.
3. Residuals vs. linear predictor.
4. Observed vs. fitted values.

gam.check() uses deviance residuals by default.
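In newer versions of mgcv, gam.check() also takes a type argument if you want the plots on a different residual scale (a quick sketch, using the model from above):

# Pearson residuals instead of the default deviance residuals
gam.check(norm_model_3, type = "pearson")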
# Gaussian model for the normal data
norm_model = gam(y_norm ~ s(x1, k=12) + s(x2, k=12), method = "REML")
gam.check(norm_model)

# Poisson model for the negative binomial data (the wrong distribution)
pois_model = gam(y_negbinom ~ s(x1, k=12) + s(x2, k=12), family = poisson, method = "REML")
gam.check(pois_model)

# negative binomial model (the distribution the data were simulated from)
negbin_model = gam(y_negbinom ~ s(x1, k=12) + s(x2, k=12), family = nb, method = "REML")
gam.check(negbin_model)
Exercises:

1. Work with the y_negbinom data. Try fitting both Poisson and negative binomial models, with different degrees of freedom. Using model summaries and gam.check(), determine which model requires more degrees of freedom to fit well. How do the fitted functions vary between the two?

2. Using the y_binom data, fit a smooth model with family=binomial, and use gam.check() to determine the appropriate degrees of freedom. What do the gam.check() plots tell you about model fit?
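A possible starting point for the binomial exercise (a sketch, not a solution; binom_model is an illustrative name, and checking and adjusting k is the exercise):

# fit a binomial GAM, then use gam.check() to judge the basis sizes
binom_model = gam(y_binom ~ s(x1, k=4) + s(x2, k=4),
                  family = binomial, method = "REML")
gam.check(binom_model)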
The left-side plots from gam.check() (the Q-Q plot and histogram of residuals) can be helpful here, but for non-normal families they are often hard to read. Randomized quantile residuals, computed by the rqresiduals() function from quantile_resid.R, are easier to interpret: if the model is right, they should look like standard normal noise.
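quantile_resid.R itself is not reproduced here. As a minimal sketch of the idea (rqresiduals_pois is a hypothetical stand-in for the Poisson case, following Dunn & Smyth 1996):

# Sketch only: maps each observed count's CDF interval to a standard
# normal quantile (the randomized quantile residual of Dunn & Smyth 1996)
rqresiduals_pois = function(model) {
  mu = fitted(model)                       # fitted Poisson means
  y  = model$y                             # observed counts
  a  = ppois(y - 1, mu)                    # CDF just below the observation
  b  = ppois(y, mu)                        # CDF at the observation
  u  = runif(length(y), min = a, max = b)  # randomize within the interval
  qnorm(u)                                 # ~ N(0,1) if the model is right
}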
pois_model = gam(y_negbinom ~ s(x1, k=12) + s(x2, k=12), family = poisson, method = "REML")
pois_resid = residuals(pois_model, type = "deviance")
pois_rqresid = rqresiduals(pois_model)
# deviance vs. randomized quantile residuals for the (misspecified) Poisson fit
layout(matrix(1:2, nrow=1))
plot(pois_model$linear.predictors, pois_resid)
plot(pois_model$linear.predictors, pois_rqresid)

negbin_model = gam(y_negbinom ~ s(x1, k=12) + s(x2, k=12), family = nb, method = "REML")
negbin_rqresid = rqresiduals(negbin_model)
# Poisson vs. negative binomial randomized quantile residuals
layout(matrix(1:2, nrow=1))
plot(pois_model$linear.predictors, pois_rqresid)
plot(negbin_model$linear.predictors, negbin_rqresid)
Exercise: use the rqresiduals() function to check the residuals from your binomial model (from the previous exercise) for remaining patterns.
Concurvity is a nonlinear analogue of collinearity. It measures, for each smooth term, how well that term could be approximated by:

concurvity(model, full=TRUE): some combination of all the other smooth terms.
concurvity(model, full=FALSE): each of the other smooth terms in the model (useful for identifying which terms are causing issues).

The output below (from model fits not shown here) illustrates concurvity ranging from negligible to severe; a sketch of code that produces this kind of output follows the tables:

[1] "concurvity(m1)"
para s(x1_cc) s(x2_cc)
worst 0 0.12 0.12
observed 0 0.07 0.04
estimate 0 0.04 0.04
[1] "concurvity(m1, full=TRUE)"
para s(x1_cc) s(x2_cc)
worst 0 0.32 0.32
observed 0 0.03 0.19
estimate 0 0.08 0.19
[1] "concurvity(m1, full=TRUE)"
para s(x1_cc) s(x2_cc)
worst 0 0.84 0.84
observed 0 0.22 0.57
estimate 0 0.28 0.60
[1] "concurvity(m1, full=TRUE)"
para s(x1_cc) s(x2_cc)
worst 0 1.0 1.00
observed 0 1.0 1.00
estimate 0 0.3 0.97
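The data-generating code behind those tables isn't shown; as a sketch of how such output is produced (x1_cc and x2_cc here are illustrative stand-ins with built-in dependence, not the original covariates):

# simulate two covariates where x2_cc is largely determined by x1_cc
x1_cc = rnorm(n)
x2_cc = 0.8 * x1_cc + 0.2 * rnorm(n)
y_cc  = 2 * cos(pi * x1_cc) + rnorm(n, 0, 0.5)
m_cc  = gam(y_cc ~ s(x1_cc) + s(x2_cc), method = "REML")
concurvity(m_cc, full = TRUE)    # each term vs. all other terms combined
concurvity(m_cc, full = FALSE)   # term-by-term breakdown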
cor(data) is not sufficient: use the concurvity(model) function.

Make sure to test your model! GAMs are powerful, but with great power…
You should at least:
- Check that your smooth terms have enough basis functions (k)
- Test whether you have the right distribution
- Make sure there are no patterns left in your residuals
- If you have time series, grouped, spatial, etc. data, check for dependencies
We'll get into testing for dependence later on in the extended examples.
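Until then, a quick first look at serial dependence, assuming your observations have a meaningful order (a sketch using the Gaussian model from above):

# autocorrelation of deviance residuals as a rough check for serial dependence
acf(residuals(norm_model, type = "deviance"))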