Eric J Pedersen
August 6th, 2016
So you have a GAM:
gam.check()gam.check() part 2library(mgcv)
library(magrittr)
library(ggplot2)
library(dplyr)
library(statmod)
source("quantile_resid.R")
With all models, how accurate your predictions will be depends on how good the model is
set.seed(2)
n = 400
x1 = rnorm(n)
x2 = rnorm(n)
y_val =1 + 2*cos(pi*x1) + 2/(1+exp(-5*(x2)))
y_norm = y_val + rnorm(n, 0, 0.5)
y_negbinom = rnbinom(n, mu = exp(y_val),size=10)
y_binom = rbinom(n,1,prob = exp(y_val)/(1+exp(y_val)))
k per terms(x, k=10) or s(x, y, k=100)k)norm_model_1 = gam(y_norm~s(x1,k=4)+s(x2,k=4),method= "REML")
gam.check(norm_model_1)
Method: REML   Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-0.0003467788,0.0005154578]
(score 736.9402 & scale 2.252304).
Hessian positive definite, eigenvalue range [0.000346021,198.5041].
Model rank =  7 / 7 
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
         k'   edf k-index p-value
s(x1) 3.000 1.002   0.125    0.00
s(x2) 3.000 2.914   1.045    0.79
norm_model_2 = gam(y_norm~s(x1,k=12)+s(x2,k=4),method= "REML")
gam.check(norm_model_2)
Method: REML   Optimizer: outer newton
full convergence after 11 iterations.
Gradient range [-8.009729e-05,7.329675e-05]
(score 345.3111 & scale 0.2706205).
Hessian positive definite, eigenvalue range [0.9677905,198.63].
Model rank =  15 / 15 
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
          k'    edf k-index p-value
s(x1) 11.000 10.837   0.989    0.42
s(x2)  3.000  2.984   0.861    0.00
norm_model_3 = gam(y_norm~s(x1,k=12)+s(x2,k=12),method= "REML")
gam.check(norm_model_3)
Method: REML   Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-1.136198e-08,6.812328e-13]
(score 334.2084 & scale 0.2485446).
Hessian positive definite, eigenvalue range [2.812271,198.6868].
Model rank =  23 / 23 
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
          k'    edf k-index p-value
s(x1) 11.000 10.848   0.976    0.28
s(x2) 11.000  7.948   0.946    0.12
layout(matrix(1:6,ncol=2,byrow = T))
plot(norm_model_1);plot(norm_model_2);plot(norm_model_3)
layout(1)
gam.check() creates 4 plots: 
Quantile-quantile plots of residuals. If the model is right, should follow 1-1 line
Histogram of residuals
Residuals vs. linear predictor
Observed vs. fitted values
gam.check() uses deviance residuals by default
norm_model = gam(y_norm~s(x1,k=12)+s(x2,k=12),method= "REML")
gam.check(norm_model)
pois_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12),family=poisson,method= "REML")
gam.check(pois_model)
negbin_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12),family=nb,method= "REML")
gam.check(negbin_model)
Work with the y_negbinom data. Try fitting both Poisson and negative
binomial model, with different degrees of freedom. Using model summaries and
gam.check, determine which model requires more degrees of freedom to fit well.
How do the fitted functions vary between the two?
Using the y_binomial data, fit a smooth model with family=binomial, and
use gam.check() to determine the appropriate degrees of freedom. What do the
gam.check() plots tell you about model fit?
gam.check left side can be helpfulrqresidualspois_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12),family=poisson,method= "REML")
pois_resid = residuals(pois_model, type="deviance")
pois_rqresid = rqresiduals(pois_model)
layout(matrix(1:2, nrow=1))
plot(pois_model$linear.predictors,pois_resid)
plot(pois_model$linear.predictors,pois_rqresid)
negbin_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12),family=nb,method= "REML")
negbin_rqresid = rqresiduals(negbin_model)
layout(matrix(1:2, nrow=1))
plot(pois_model$linear.predictors,pois_rqresid)
plot(negbin_model$linear.predictors,negbin_rqresid)
Use the rqresiduals function to test the model residuals for patterns
from your binomial model from the previous exercise
Nonlinear measure, similar to co-linearity
Measures, for each smooth term, how well this term could be approximated by
concurvity(model, full=TRUE): some combination of all other smooth termsconcurvity(model, full=FALSE): Each of the other smooth terms in the model 
(useful for identifying which terms are causing issues)[1] "concurvity(m1)"
         para s(x1_cc) s(x2_cc)
worst       0     0.12     0.12
observed    0     0.07     0.04
estimate    0     0.04     0.04
[1] "concurvity(m1, full=TRUE)"
         para s(x1_cc) s(x2_cc)
worst       0     0.32     0.32
observed    0     0.03     0.19
estimate    0     0.08     0.19
[1] "concurvity(m1, full=TRUE)"
         para s(x1_cc) s(x2_cc)
worst       0     0.84     0.84
observed    0     0.22     0.57
estimate    0     0.28     0.60
[1] "concurvity(m1, full=TRUE)"
         para s(x1_cc) s(x2_cc)
worst       0      1.0     1.00
observed    0      1.0     1.00
estimate    0      0.3     0.97
cor(data) not sufficient: use the concurvity(model) functionMake sure to test your model! GAMs are powerful, but with great power…
You should at least:
Check if your smooths are sufficiently smooth
Test if you have the right distribution
Make sure there's no patterns left in your data
If you have time series, grouped, spatial, etc. data, check for dependencies
We'll get into testing for dependence later on in the extended examples.