class: inverse, middle, left, my-title-slide, title-slide .title[ # Generalized Additive Models ] .subtitle[ ## a data-driven approach to estimating regression models ] .author[ ### Gavin Simpson ] .institute[ ### Department of Animal & Veterinary Sciences · Aarhus University ] .date[ ### 1400–2000 CET (1300–1900 UTC) Wednesday 22nd January, 2025 ] --- class: inverse middle center big-subsection # Day 3 ??? --- # Logistics ## Slides Slidedeck: [bit.ly/physalia-gam-3](https://bit.ly/physalia-gam-3) Sources: [bit.ly/physalia-gam](https://bit.ly/physalia-gam) Direct download a ZIP of everything: [bit.ly/physalia-gam-zip](https://bit.ly/physalia-gam-zip) Unpack the zip & remember where you put it --- # Matters arising 1. Basis functions for `s(x, z)` 2. Testing for non-linearity beyond a linear effect 3. Derivatives of smooths --- # Basis functions for `s(x, z)` Code for this is in `day-2/test.R` .center[ <img src="index_files/figure-html/tprs-2d-basis-1.svg" width="90%" /> ] --- # Basis functions for `s(x, z)` Panels 1–15 are the basis functions .center[ <img src="resources/wood-2ed-fig-5-12-2-d-tprs-basis-funs.png" width="45%" /> ] .small[ Source: Wood SN (2017) ] --- # Beyond linearity How do we test if we need something more than a linear effect? -- Include a linear parametric term but remove it from the null space of the spline `m = c(2, 0)` implies 2<sup>nd</sup> derivative penalty but zero null space Null space is the set of functions in the basis that the penalty doesn't affect --- # Beyond linearity .row[ .col-6[ ```r n <- 100 set.seed(2) df <- tibble(x = runif(n), y = x + x^2 * 0.2 + rnorm(n) * 0.1) model <- gam(y ~ x + s(x, m = c(2, 0)), data = df, method = "REML") ``` ] .col-6[ .small[ ```r summary(model) ``` ``` ## ## Family: gaussian ## Link function: identity ## ## Formula: ## y ~ x + s(x, m = c(2, 0)) ## ## Parametric coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.02133 0.05315 -0.401 0.689 ## x 1.18249 0.10564 11.193 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df F p-value ## s(x) 0.9334 8 0.304 0.076 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.91 Deviance explained = 91.1% ## -REML = -70.567 Scale est. 
= 0.012767  n = 100
```
]
]
]

---

# Beyond linearity

```r
draw(model, parametric = TRUE)
```

<img src="index_files/figure-html/beyond-linearity-plot-1.svg" width="95%" style="display: block; margin: auto;" />

---

# Derivatives of smooths

Derivatives of smooths are useful for a number of statistical problems

* The 1st derivative at a specified value of the covariate is the GAM equivalent of the `\(\beta\)` in a (G)LM
* Can also be used to investigate *where* in the range of the covariate `\(\hat{y}_i\)` changes

---

# Derivatives of smooths

Use the `derivatives()` function in {gratia}

Provides

* First derivative
* Second derivative
* Finite differences
    * Forward
    * Backward
    * Central

If you want to estimate the *n*th derivative, you need an order `\(n+1\)` derivative penalty

To estimate the second derivative you need `m = 3`, say, for a TPRS

You also need to set the finite difference `eps` to a much larger value than the default

---

# Derivatives of smooths

Simulated motorcycle collision data

```r
data(mcycle, package = "MASS")
m <- gam(accel ~ s(times), data = mcycle, method = "REML")
sm_plt <- draw(m, residuals = TRUE)
sm_plt
```

<img src="index_files/figure-html/draw-mcycle-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Derivatives of `s(times)`

```r
fd <- derivatives(m, type = "central", unconditional = TRUE)
fd
```

```
## # A tibble: 100 × 9
##    .smooth  .by   .fs   .derivative   .se .crit .lower_ci .upper_ci times
##    <chr>    <chr> <chr>       <dbl> <dbl> <dbl>     <dbl>     <dbl> <dbl>
##  1 s(times) <NA>  <NA>       -2.16   3.30  1.96     -8.62      4.30  2.40
##  2 s(times) <NA>  <NA>       -2.12   3.28  1.96     -8.55      4.31  2.96
##  3 s(times) <NA>  <NA>       -1.98   3.22  1.96     -8.29      4.34  3.52
##  4 s(times) <NA>  <NA>       -1.72   3.10  1.96     -7.80      4.37  4.07
##  5 s(times) <NA>  <NA>       -1.35   2.93  1.96     -7.09      4.39  4.63
##  6 s(times) <NA>  <NA>       -0.877  2.69  1.96     -6.16      4.40  5.19
##  7 s(times) <NA>  <NA>       -0.304  2.41  1.96     -5.03      4.42  5.75
##  8 s(times) <NA>  <NA>        0.370  2.10  1.96     -3.74      4.48  6.30
##  9 s(times) <NA>  <NA>        1.08   1.79  1.96     -2.42      4.59  6.86
## 10 s(times) <NA>  <NA>        1.70   1.54  1.96     -1.32      4.73  7.42
## # ℹ 90 more rows
```

---

# Derivatives of `s(times)`

```r
fd_plt <- draw(fd) + labs(title = "First derivative s(times)")
sm_plt + fd_plt + plot_layout(ncol = 2)
```

<img src="index_files/figure-html/draw-derivatives-times-smooth-1.svg" width="90%" style="display: block; margin: auto;" />
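---

# First derivative at a point

Earlier we said the 1st derivative at a specified covariate value is the GAM analogue of a `\(\beta\)` in a (G)LM. A minimal sketch of evaluating it at a single point; this assumes a recent {gratia} in which `derivatives()` accepts a `data` argument of evaluation points, and `times = 20` is just an arbitrary illustrative value

```r
# first derivative of s(times) evaluated at times = 20 only
derivatives(m, data = data.frame(times = 20),
            type = "central", unconditional = TRUE)
```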
---

# Second derivative of `s(times)` — wrong!

```r
fd2 <- derivatives(m, order = 2, eps = 0.1, type = "central",
                   unconditional = TRUE)
fd2_plt <- draw(fd2)
fd_plt + fd2_plt + plot_layout(ncol = 2)
```

<img src="index_files/figure-html/draw-derivatives-times-smooth-2-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Second derivative of `s(times)` — right!

```r
m2 <- gam(accel ~ s(times, m = 3), data = mcycle, method = "REML")
fd2 <- derivatives(m2, order = 2, eps = 0.1, type = "central",
                   unconditional = TRUE)
fd2_plt <- draw(fd2) + labs(title = "Second derivative s(times)")
fd_plt + fd2_plt + plot_layout(ncol = 2)
```

<img src="index_files/figure-html/draw-derivatives-times-smooth-2-right-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Change indicators

```r
fds <- derivatives(m, type = "central", unconditional = TRUE,
*                  interval = "simultaneous")
m |>                                               # take the model, and
  smooth_estimates(select = "s(times)") |>         # evaluate the smooth
  add_sizer(derivatives = fds, type = "change") |> # add change indicator
  draw()                                           # plot
```

<img src="index_files/figure-html/change-indicators-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Change indicators

```r
m |>                                              # take the model, and
  smooth_estimates(select = "s(times)") |>        # evaluate the smooth
  add_sizer(derivatives = fds, type = "sizer") |> # add sizer change indicator
  draw()                                          # plot
```

<img src="index_files/figure-html/change-indicators-2-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Defining change points

Significant change is where the simultaneous interval on the derivative *excludes* 0

<img src="index_files/figure-html/change-indicators-3-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Today's topics

* Model checking, selection, and visualisation
* How do we do inference with GAMs?
* Go beyond simple GAMs to include smooth interactions and models with multiple smooths

---

# Smooth interactions

Two ways to fit smooth interactions

1. Bivariate (or higher order) thin plate splines
    * `s(x, z, bs = 'tp')`
    * Isotropic; single smoothness parameter for the smooth
    * Sensitive to scales of `x` and `z`
2. Tensor product smooths
    * Separate marginal basis for each smooth, separate smoothness parameters
    * Invariant to scales of `x` and `z`
    * Use for interactions when variables are in different units
    * `te(x, z)`

---

# Smooth interactions

<!-- -->

---

# Smooth interactions

```r
df
```

```
## # A tibble: 500 × 3
##          y       x       z
##      <dbl>   <dbl>   <dbl>
##  1  0.136  0.0244  0.531  
##  2  0.203  0.0216  0.00325
##  3  0.358  0.0440  0.252  
##  4 -0.0966 0.0382  0.155  
##  5  0.437  0.00274 0.879  
##  6  0.499  0.00996 0.204  
##  7  0.465  0.0343  0.584  
##  8  0.553  0.0450  0.804  
##  9  0.441  0.00229 0.265  
## 10  0.425  0.00448 0.406  
## # ℹ 490 more rows
```

```r
m_tprs <- gam(y ~ s(x, z), data = df, method = "REML")
m_te   <- gam(y ~ te(x, z), data = df, method = "REML")
```

---

# Smooth interactions

```r
truth_plt + (draw(m_tprs) + coord_cartesian()) + draw(m_te) +
  plot_layout(ncol = 3)
```

<img src="index_files/figure-html/draw-tprs-vs-tensor-product-1.svg" width="95%" />

---

# Smooth interactions

```r
layout(matrix(1:3, ncol = 3))
persp(xs, zs, truth)
vis.gam(m_tprs)
vis.gam(m_te)
layout(1)
```

<!-- -->

---

# Tensor product smooths

There are multiple ways to build tensor products in *mgcv*

1. `te(x, z)`
2. `t2(x, z)`
3. `ti(x) + ti(z) + ti(x, z)`

`te()` is the most general form but not usable in `gamm4::gamm4()` or *brms*

`t2()` is an alternative implementation that does work in `gamm4::gamm4()` or *brms*

`ti()` fits pure smooth interactions, where the main effects of `x` and `z` have been removed from the basis (see the sketch after the next slide)

---

# Tensor product smooths

.center[
<img src="resources/wood-gams-2ed-fig-5-17-tensor-product.svg" width="50%" />
]
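---

# Tensor product smooths: `ti()` sketch

A minimal sketch of the `ti()` decomposition using the simulated `df` from the previous slides; separating the marginal main effects from the pure interaction means each component gets its own smoothness parameter and its own test

```r
# main effects of x and z plus their pure interaction;
# compare with the te() fit m_te from earlier
m_ti <- gam(y ~ ti(x) + ti(z) + ti(x, z), data = df, method = "REML")
summary(m_ti)
```

If the `ti(x, z)` term turns out to be effectively null, a purely additive model may suffice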
---

# Factor smooth interactions

Two ways for factor smooth interactions

1. `by` variable smooths
    * entirely separate smooth function for each level of the factor
    * each has its own smoothness parameter
    * centred (no group means) so include factor as a fixed effect
    * `y ~ f + s(x, by = f)`
2. `bs = 'fs'` basis
    * smooth function for each level of the factor
    * share a common smoothness parameter
    * fully penalised; include group means
    * closer to random effects
    * `y ~ s(x, f, bs = 'fs')`

---

# `by` smooths

Three options for the `by` argument

1. a continuous vector — varying coefficient model
2. a factor vector — a factor-smooth interaction
3. an ordered factor vector — an ANOVA-like factor-smooth interaction

---

# Numeric `by`

In a parametric model we have terms like

`$$\beta_j x_{ij}$$`

with a fixed, parametric effect of `\(\mathbf{x}\)` on `\(\mathbf{y}\)`

With a numeric `by` we can make the effect of `\(\mathbf{x}\)` on `\(\mathbf{y}\)` vary smoothly as a function of other covariates

`$$f_j(z_i) x_{ij}$$`

(a short sketch of this follows the constrained-interaction slide)

---

# Factor `by`

Two options for factor `by` models

1. a factor vector — a factor-smooth interaction, and
2. an ordered factor vector — an ANOVA-like factor-smooth interaction

---

# Factor `by`

A factor `by` variable results in a smooth being created for each level of the factor `f`

``` r
gam(y ~ f + s(x, by = f), data = data, ....)
```

Each of the smooths is still centred about 0 so doesn't include the group mean

Must include the parametric factor effect of `f` to model those group means

---

# Ordered factor `by`

This form also results in a smooth being created for each level of `f`, but it does so as

1. the smooth effect of `\(x\)` for the reference group, plus
2. difference smooths for each other level of the factor (groups), where the difference is with respect to the reference group's smooth

``` r
data$o <- ordered(data$f)
contrasts(data$o) <- "contr.treatment"
gam(y ~ o + s(x) + s(x, by = o), data = data, ....)
```

---

# Constrained factor interactions

This is like the factor `by` variant, but fits

1. an average or global smooth of `\(x\)`, plus
2. a difference smooth for each level of factor `f`

It also includes the group means, so you don't need to add the parametric effects

``` r
gam(y ~ s(x) + s(x, f, bs = "sz"), data = data, ....)
```
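---

# Numeric `by` sketch

The promised sketch of the varying coefficient model from the numeric `by` slide; the data and the coefficient function here are simulated purely for illustration

```r
# coefficient of x varies smoothly with z: y = f(z) * x + noise
set.seed(1)
n_vc <- 300
vc <- data.frame(x = runif(n_vc), z = runif(n_vc))
vc <- transform(vc, y = (2 * sin(pi * z)) * x + rnorm(n_vc, sd = 0.2))

# s(z, by = x) multiplies a smooth of z by the covariate x,
# so draw() shows the estimated coefficient function f(z)
m_vc <- gam(y ~ s(z, by = x), data = vc, method = "REML")
```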
---

# Example

Weights of chicks and effect of diet

``` r
data(ChickWeight)
cw <- ChickWeight |>
  as_tibble() |>
  janitor::clean_names()
cw
```

```
## # A tibble: 578 × 4
##    weight  time chick diet 
##     <dbl> <dbl> <ord> <fct>
##  1     42     0 1     1    
##  2     51     2 1     1    
##  3     59     4 1     1    
##  4     64     6 1     1    
##  5     76     8 1     1    
##  6     93    10 1     1    
##  7    106    12 1     1    
##  8    125    14 1     1    
##  9    149    16 1     1    
## 10    171    18 1     1    
## # ℹ 568 more rows
```

---

# Factor `by`

``` r
m_cw_f <- gam(weight ~ diet + s(time, by = diet),
              data = cw, family = tw(), method = "REML")
draw(m_cw_f)
```

<img src="index_files/figure-html/chick-factor-by-1.svg" style="display: block; margin: auto;" />

---

# Ordered factor `by`

``` r
cw <- cw |>
  mutate(dieto = ordered(diet))
contrasts(cw$dieto) <- "contr.treatment"
m_cw_o <- gam(weight ~ dieto + s(time) + s(time, by = dieto),
              data = cw, family = tw(), method = "REML")
draw(m_cw_o)
```

<img src="index_files/figure-html/chick-ordered-factor-by-1.svg" style="display: block; margin: auto;" />

---

# Constrained Factor Interaction

``` r
m_cw_z <- gam(weight ~ s(time) + s(time, diet, bs = "sz"),
              data = cw, family = tw(), method = "REML")
draw(m_cw_z)
```

<img src="index_files/figure-html/chick-factor-sz-1.svg" style="display: block; margin: auto;" />

---

# Random effects

When fitted with REML or ML, smooths can be viewed as just fancy random effects

The inverse is true too: random effects can be viewed as smooths

If you have simple random effects you can fit those in `gam()` and `bam()` without needing the more complex GAMM functions `gamm()` or `gamm4::gamm4()`

These two models are equivalent

```r
m_nlme <- lme(travel ~ 1, data = Rail, ~ 1 | Rail, method = "REML")

m_gam  <- gam(travel ~ s(Rail, bs = "re"), data = Rail, method = "REML")
```

---

# Random effects

The random effect basis `bs = 're'` is not as computationally efficient as *nlme* or *lme4* for fitting

* complex random effects terms, or
* random effects with many levels

Instead see `gamm()` and `gamm4::gamm4()`

* `gamm()` fits using `lme()`
* `gamm4::gamm4()` fits using `lmer()` or `glmer()`

For non-Gaussian models use `gamm4::gamm4()`
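---

# Random effects: variance components

A quick check on that equivalence, as a sketch assuming the models from the previous slide: `gam.vcomp()` rescales the smoothing parameter of the `bs = "re"` term onto the standard deviation scale that `lme()` reports

```r
# random effect & residual standard deviations from the GAM...
gam.vcomp(m_gam)
# ...which should match those from the lme() fit
nlme::VarCorr(m_nlme)
```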
---
class: inverse center middle subsection

# Model checking

---

# Model checking

So you have a GAM:

- How do you know you have the right degrees of freedom? `gam.check()`
- Diagnosing model issues: `gam.check()` part 2

---

# GAMs are models too

How accurate your predictions are depends on how good the model is

<img src="index_files/figure-html/misspecify-1.svg" width="95%" />

---
class: inverse center middle subsection

# How do we test how well our model fits?

---

# Simulated data

`y` varies sinusoidally with `x1`

`y` varies sigmoidally with `x2`

Simulate response data from Gaussian, negative binomial, & binomial distributions

```r
set.seed(2)
n <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y_val <- 1 + 2*cos(pi*x1) + 2/(1+exp(-5*(x2)))
y_norm <- y_val + rnorm(n, 0, 0.5)
y_negbinom <- rnbinom(n, mu = exp(y_val), size = 10)
y_binom <- rbinom(n, 1, prob = exp(y_val) / (1 + exp(y_val)))
```

---

# Simulated data

<!-- -->

---
class: inverse middle center subsection

# gam.check() part 1: do you have the right functional form?

---

# How well does the model fit?

- Many choices: k, family, type of smoother, …
- How do we assess how well our model fits?

---

# Basis size *k*

- Set `k` per term
- e.g. `s(x, k = 10)` or `s(x, y, k = 100)`
- Penalty removes "extra" wiggliness
    - *up to a point!*
- (But computation is slower with bigger `k`)

---

# Checking basis size

```r
norm_model_1 <- gam(y_norm ~ s(x1, k = 4) + s(x2, k = 4), method = 'REML')
gam.check(norm_model_1)
```

```
## 
## Method: REML   Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-0.0003467788,0.0005154578]
## (score 736.9402 & scale 2.252304).
## Hessian positive definite, eigenvalue range [0.000346021,198.5041].
## Model rank =  7 / 7 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##         k'  edf k-index p-value    
## s(x1) 3.00 1.00    0.13  <2e-16 ***
## s(x2) 3.00 2.91    1.04    0.83    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Checking basis size

```r
norm_model_2 <- gam(y_norm ~ s(x1, k = 12) + s(x2, k = 4), method = 'REML')
gam.check(norm_model_2)
```

```
## 
## Method: REML   Optimizer: outer newton
## full convergence after 11 iterations.
## Gradient range [-5.658609e-06,5.392657e-06]
## (score 345.3111 & scale 0.2706205).
## Hessian positive definite, eigenvalue range [0.967727,198.6299].
## Model rank =  15 / 15 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##          k'   edf k-index p-value   
## s(x1) 11.00 10.84    0.99    0.38   
## s(x2)  3.00  2.98    0.86    0.01 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Checking basis size

```r
norm_model_3 <- gam(y_norm ~ s(x1, k = 12) + s(x2, k = 12), method = 'REML')
gam.check(norm_model_3)
```

```
## 
## Method: REML   Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-1.136192e-08,6.901146e-13]
## (score 334.2084 & scale 0.2485446).
## Hessian positive definite, eigenvalue range [2.812271,198.6868].
## Model rank =  23 / 23 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##          k'   edf k-index p-value
## s(x1) 11.00 10.85    0.98    0.31
## s(x2) 11.00  7.95    0.95    0.15
```

---

# Checking basis size

<!-- -->

---

# Checking basis size

Two common ways that the `k.check()` (`gam.check()`) test will reject the null hypothesis of adequate basis size

1. unmodelled auto-correlation (see the sketch on the next slide)
2. an incorrectly specified mean-variance relationship (the wrong `family`)
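---

# Checking basis size: autocorrelation

For the first of those, a quick sketch of a check: look at the autocorrelation function of the deviance residuals (here for `norm_model_3`); spikes well outside the confidence bands point to unmodelled serial structure rather than a genuinely too-small basis

```r
# ACF of the deviance residuals; for these iid simulated data
# we expect no autocorrelation beyond lag 0
acf(resid(norm_model_3, type = "deviance"),
    main = "ACF of deviance residuals")
```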
---

# Checking basis size (alt)

Model with `k` set to be too low for each smooth

```r
norm_model_1 <- gam(y_norm ~ s(x1, k = 4) + s(x2, k = 4), method = "REML")
k.check(norm_model_1)
```

```
##       k'      edf   k-index p-value
## s(x1)  3 1.001710 0.1250378    0.00
## s(x2)  3 2.913909 1.0446399    0.85
```

An alternative way of assessing whether the basis size is sufficient is to take the deviance residuals from the model and fit a GAM to them with `family = quasi(link = "identity", variance = "constant")` and `k` doubled

???

Remember we had this model where we had set `k` to be too low for each of the two smooths

---

# Checking basis size (alt)

```r
res <- resid(norm_model_1, type = "deviance")
res_model <- gam(res ~ s(x1, k = 12) + s(x2, k = 12), method = "REML",
                 family = quasi(link = "identity", variance = "constant"))
edf(res_model)
```

```
## # A tibble: 2 × 2
##   .smooth  .edf
##   <chr>   <dbl>
## 1 s(x1)   10.8 
## 2 s(x2)    7.29
```

---

# Checking basis size (alt)

```r
draw(res_model)
```

<img src="index_files/figure-html/alt-basis-dim-check-3-1.svg" width="95%" style="display: block; margin: auto;" />

---
class: inverse middle center subsection

# Model diagnostics

---
class: inverse middle center subsection

# Using gam.check() part 2: visual checks

---

# gam.check() plots

`gam.check()` creates 4 plots:

1. Quantile-quantile plots of residuals; if the model is right, these should follow the 1:1 line
2. Histogram of residuals
3. Residuals vs. linear predictor
4. Observed vs. fitted values

`gam.check()` uses deviance residuals by default

---

# Gaussian data, Gaussian model

```r
norm_model <- gam(y_norm ~ s(x1, k = 12) + s(x2, k = 12), method = "REML")
gam.check(norm_model, rep = 500)
```

<img src="index_files/figure-html/gam_check_plots1-1.svg" width="90%" style="display: block; margin: auto;" />

---

# Negative binomial data, Poisson model

```r
pois_model <- gam(y_negbinom ~ s(x1, k = 12) + s(x2, k = 12),
                  family = poisson, method = "REML")
gam.check(pois_model, rep = 500)
```

<img src="index_files/figure-html/gam_check_plots2-1.svg" width="90%" style="display: block; margin: auto;" />

---

# NB data, NB model

```r
negbin_model <- gam(y_negbinom ~ s(x1, k = 12) + s(x2, k = 12),
                    family = nb, method = "REML")
gam.check(negbin_model, rep = 500)
```

<img src="index_files/figure-html/gam_check_plots3-1.svg" width="90%" style="display: block; margin: auto;" />

---

# NB data, NB model

```r
appraise(negbin_model, method = 'simulate')
```

<!-- -->

---

# DHARMa — alternative diagnostics

The {DHARMa} 📦 by Florian Hartig provides additional diagnostics for a wide range of models

DHARMa generates randomised quantile residuals from models

Randomised quantile residuals are uniformly distributed

These residuals are then used in model diagnostics

---

# DHARMa — alternative diagnostics

1. Simulate new response data from the fitted model for each observation.
2. For each observation, calculate the empirical cumulative distribution function (ECDF) for the simulated observations, which describes the possible values at the predictor combination of the observed value, assuming the fitted model is correct.
3. The residual is then defined as the value of the ECDF at the value of the observed data, so a residual of 0 means that all simulated values are larger than the observed value, and a residual of 0.5 means half of the simulated values are larger than the observed value.
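---

# DHARMa residuals by hand

A hand-rolled sketch of those three steps for the mis-specified Poisson model from earlier; it skips the extra randomisation DHARMa applies for discrete responses, so treat it as illustrative only

```r
set.seed(42)
nsim <- 250
mu <- fitted(pois_model)
# step 1: simulate new responses from the fitted model
sims <- replicate(nsim, rpois(length(mu), lambda = mu))
# steps 2 & 3: evaluate each observation's simulated ECDF
# at the observed value
rqr <- vapply(seq_along(mu),
              function(i) ecdf(sims[i, ])(y_negbinom[i]),
              numeric(1))
hist(rqr) # far from uniform here, as the Poisson model is wrong
```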
---

# DHARMa — alternative diagnostics

<img src="resources/dharma-randomised-residuals.png" width="50%" style="display: block; margin: auto;" />

---

# DHARMa — overdispersion

A quick test for overdispersion is given by `testDispersion()`

For the Poisson GAM fitted to Negative binomial data:

```r
library("mgcViz")
library("DHARMa")
testDispersion(pois_model, plot = FALSE)
```

```
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 6.9332, p-value < 2.2e-16
## alternative hypothesis: two.sided
```

---

# DHARMa — overdispersion

A quick test for overdispersion is given by `testDispersion()`

For the Negative binomial GAM fitted to Negative binomial data:

```r
testDispersion(negbin_model, plot = FALSE)
```

```
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.92591, p-value = 0.736
## alternative hypothesis: two.sided
```

---

# DHARMa — randomised quantile residuals

To generate just the randomised quantile residuals we use `simulateResiduals()`

Do this to compute and store the residuals once, rather than for each test

```r
resids <- simulateResiduals(fittedModel = pois_model, plot = FALSE)
```

---

# DHARMa — randomised quantile residuals

```r
plot(resids)
```

<img src="index_files/figure-html/dharma-plots-possion-1.svg" width="90%" style="display: block; margin: auto;" />

---

# DHARMa — randomised quantile residuals

```r
resids <- simulateResiduals(fittedModel = negbin_model, plot = FALSE)
plot(resids)
```

<img src="index_files/figure-html/dharma-plots-negbin-1.svg" width="90%" style="display: block; margin: auto;" />

---

# Larks revisited

```
## # A tibble: 6 × 8
##   QUADRICULA TET   crestlark linnet      x       y     e     n
##   <chr>      <chr> <fct>     <fct>   <dbl>   <dbl> <dbl> <dbl>
## 1 NG56       E     <NA>      <NA>   551000 4669000   551  4669
## 2 NG56       J     <NA>      <NA>   553000 4669000   553  4669
## 3 NG56       P     <NA>      <NA>   555000 4669000   555  4669
## 4 NG56       U     <NA>      <NA>   557000 4669000   557  4669
## 5 NG56       Z     <NA>      <NA>   559000 4669000   559  4669
## 6 NG66       E     <NA>      <NA>   561000 4669000   561  4669
```

``` r
crest <- gam(crestlark ~ s(e, n, k = 100),
             data = larks, family = binomial, method = "REML")
```

---

# Larks revisited

``` r
resids <- simulateResiduals(fittedModel = crest, plot = FALSE)
plot(resids)
```

<!-- -->

Looks pretty good to me 🤩

---
class: inverse center middle subsection

# Model selection

---

# Model selection

Model (or variable) selection — an important area of theoretical and applied interest

- In statistics we aim for a balance between *fit* and *parsimony*
- In applied research we seek the set of covariates with strongest effects on `\(y\)`

We seek a subset of covariates that improves *interpretability* and *prediction accuracy*

---
class: inverse center middle

# Shrinkage & additional penalties

---

# Shrinkage & additional penalties

Smoothing parameter estimation allows selection of a wide range of potentially complex functions for smooths...

But, cannot remove a term entirely from the model because the penalties used act only on the *range space* of a spline basis. The *null space* of the basis is unpenalised.

- **Null space** — the basis functions that are smooth (constant, linear)
- **Range space** — the basis functions that are wiggly

---

# Shrinkage & additional penalties

**mgcv** has two ways to penalise the null space, i.e. to do selection

- *double penalty approach* via `select = TRUE`
- *shrinkage approach* via special bases for
    - thin plate splines (shrinkage version of the default, `s(..., bs = 'ts')`)
    - cubic splines (`s(..., bs = 'cs')`)

The **double penalty** tends to work best, but applies to all smooths *and* doubles the number of smoothness parameters to estimate

Other shrinkage/selection approaches *are available* in other software

---

# Empirical Bayes...?

The `\(\mathbf{S}_j\)` can be viewed as prior precision matrices, with `\(\lambda_j\)` scaling them; together they imply improper Gaussian priors on the spline coefficients.
The impropriety derives from `\(\mathbf{S}_j\)` not being of full rank (zeroes in `\(\mathbf{\Lambda}_j\)`). Both the double penalty and shrinkage smooths remove the impropriety from the Gaussian prior --- # Empirical Bayes...? - **Double penalty** — makes no assumption as to how much to shrink the null space. This is determined from the data via estimation of `\(\lambda_j^{*}\)` - **Shrinkage smooths** — assumes null space should be shrunk less than the wiggly part Marra & Wood (2011) show that the double penalty and the shrinkage smooth approaches - performed significantly better than alternatives in terms of *predictive ability*, and - performed as well as alternatives in terms of variable selection --- # Example - Simulate Poisson counts - 4 known functions (left) - 2 spurious covariates (`runif()` & not shown) ```r ## an example of automatic model selection via null space penalization n <- 200 dat <- data_sim("eg1", n = n, scale = .15, dist = "poisson", seed = 3) ## simulate data set.seed(21) dat <- dat %>% mutate(x4 = runif(n, 0, 1), x5 = runif(n, 0, 1), f4 = rep(0, n), f5 = rep(0, n)) ## spurious ``` ```r b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5), data = dat, family = poisson, method = 'REML', select = TRUE) ``` --- # Example <!-- --> --- # Example .smaller[ ```r summary(b) ``` ``` ## ## Family: poisson ## Link function: log ## ## Formula: ## y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5) ## ## Parametric coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.21564 0.04087 29.74 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df Chi.sq p-value ## s(x0) 1.698e+00 9 4.554 0.0555 . ## s(x1) 1.937e+00 9 64.813 <2e-16 *** ## s(x2) 6.089e+00 9 158.767 <2e-16 *** ## s(x3) 7.431e-05 9 0.000 0.4534 ## s(x4) 6.127e-01 9 0.894 0.2190 ## s(x5) 8.089e-01 9 4.205 0.0217 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.555 Deviance explained = 52.7% ## -REML = 429.49 Scale est. = 1 n = 200 ``` ] --- # Example ```r draw(b, scales = 'fixed') ``` <img src="index_files/figure-html/shrinkage-example-plot-1.svg" width="95%" style="display: block; margin: auto;" /> --- class: inverse middle center subsection # Example --- # Galveston Bay .row[ .col-6[ Cross Validated question > I have a dataset of water temperature measurements taken from a large waterbody at irregular intervals over a period of decades. 
(Galveston Bay, TX if you're interested)

<https://stats.stackexchange.com/q/244042/1390>
]

.col-6[
.center[
<img src="resources/cross-validated.png" width="1223" />
]
]
]

---

# Galveston Bay

.small[

``` r
galveston <- read_csv("https://bit.ly/gam-galveston") |>
  mutate(datetime = as.POSIXct(paste(DATE, TIME),
                               format = '%m/%d/%y %H:%M', tz = "CST6CDT"),
         STATION_ID = factor(STATION_ID),
         DoY = as.numeric(format(datetime, format = '%j')),
         ToD = as.numeric(format(datetime, format = '%H')) +
           (as.numeric(format(datetime, format = '%M')) / 60))
galveston
```

```
## # A tibble: 15,276 × 13
##    STATION_ID DATE     TIME   LATITUDE LONGITUDE  YEAR MONTH   DAY SEASON
##    <fct>      <chr>    <time>    <dbl>     <dbl> <dbl> <dbl> <dbl> <chr> 
##  1 13296      6/20/91  11:04      29.5     -94.8  1991     6    20 Summer
##  2 13296      3/17/92  09:30      29.5     -94.8  1992     3    17 Spring
##  3 13296      9/23/91  11:24      29.5     -94.8  1991     9    23 Fall  
##  4 13296      9/23/91  11:24      29.5     -94.8  1991     9    23 Fall  
##  5 13296      6/20/91  11:04      29.5     -94.8  1991     6    20 Summer
##  6 13296      12/17/91 10:15      29.5     -94.8  1991    12    17 Winter
##  7 13296      6/29/92  11:17      29.5     -94.8  1992     6    29 Summer
##  8 13305      3/24/87  11:53      29.6     -95.0  1987     3    24 Spring
##  9 13305      4/2/87   13:39      29.6     -95.0  1987     4     2 Spring
## 10 13305      8/11/87  15:25      29.6     -95.0  1987     8    11 Summer
## # ℹ 15,266 more rows
## # ℹ 4 more variables: MEASUREMENT <dbl>, datetime <dttm>, DoY <dbl>, ToD <dbl>
```
]

---

# Galveston Bay model description

$$
`\begin{align}
\begin{split}
\mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\
& \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\
& \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i)
\end{split}
\end{align}`
$$

* `\(\alpha\)` is the model intercept,
* `\(f_1(\text{ToD}_i)\)` is a smooth function of time of day,
* `\(f_2(\text{DoY}_i)\)` is a smooth function of day of year,
* `\(f_3(\text{Year}_i)\)` is a smooth function of year,
* `\(f_4(\text{x}_i, \text{y}_i)\)` is a 2D smooth of longitude and latitude,

---

# Galveston Bay model description

$$
`\begin{align}
\begin{split}
\mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\
& \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\
& \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i)
\end{split}
\end{align}`
$$

* `\(f_5(\text{DoY}_i, \text{Year}_i)\)` is a tensor product smooth of day of year and year,
* `\(f_6(\text{x}_i, \text{y}_i, \text{ToD}_i)\)` is a tensor product smooth of location & time of day,
* `\(f_7(\text{x}_i, \text{y}_i, \text{DoY}_i)\)` is a tensor product smooth of location & day of year, and
* `\(f_8(\text{x}_i, \text{y}_i, \text{Year}_i)\)` is a tensor product smooth of location & year

---

# Galveston Bay model description

$$
`\begin{align}
\begin{split}
\mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\
& \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\
& \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i)
\end{split}
\end{align}`
$$

Effectively, the first four smooths are the main effects of

1. time of day,
2. season,
3. long-term trend,
4.
spatial variation --- # Galveston Bay model description $$ `\begin{align} \begin{split} \mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\ & \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\ & \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i) \end{split} \end{align}` $$ whilst the remaining tensor product smooths model smooth interactions between the stated covariates, which model 5. how the seasonal pattern of temperature varies over time, 6. how the time of day effect varies spatially, 7. how the seasonal effect varies spatially, and 8. how the long-term trend varies spatially --- # Galveston Bay — full model ```r knots <- list(DoY = c(0.5, 366.5)) m <- bam(MEASUREMENT ~ s(ToD, k = 10) + s(DoY, k = 12, bs = "cc") + s(YEAR, k = 30) + s(LONGITUDE, LATITUDE, k = 100, bs = "ds", m = c(1, 0.5)) + ti(DoY, YEAR, bs = c("cc", "tp"), k = c(12, 15)) + ti(LONGITUDE, LATITUDE, ToD, d = c(2,1), bs = c("ds", "tp"), m = list(c(1, 0.5), NA), k = c(20, 10)) + ti(LONGITUDE, LATITUDE, DoY, d = c(2,1), bs = c("ds", "cc"), m = list(c(1, 0.5), NA), k = c(25, 12)) + ti(LONGITUDE, LATITUDE, YEAR, d = c(2,1), bs = c("ds", "tp"), m = list(c(1, 0.5), NA), k = c(25, 15)), data = galveston, method = "fREML", knots = knots, nthreads = c(6, 1), discrete = FALSE) ``` --- # Galveston Bay — simpler model ```r m.sub <- bam(MEASUREMENT ~ s(ToD, k = 10) + s(DoY, k = 12, bs = "cc") + s(YEAR, k = 30) + s(LONGITUDE, LATITUDE, k = 100, bs = "ds", m = c(1, 0.5)) + ti(DoY, YEAR, bs = c("cc", "tp"), k = c(12, 15)), data = galveston, method = "fREML", knots = knots, nthreads = c(4, 1), discrete = TRUE) ``` --- # Galveston Bay — simpler model? ```r AIC(m, m.sub) ``` ``` ## df AIC ## m 484.7684 58544.03 ## m.sub 238.2766 59193.50 ``` --- # Galveston Bay — simpler model? .smaller[ ```r anova(m, m.sub, test = "F") ``` ``` ## Analysis of Deviance Table ## ## Model 1: MEASUREMENT ~ s(ToD, k = 10) + s(DoY, k = 12, bs = "cc") + s(YEAR, ## k = 30) + s(LONGITUDE, LATITUDE, k = 100, bs = "ds", m = c(1, ## 0.5)) + ti(DoY, YEAR, bs = c("cc", "tp"), k = c(12, 15)) + ## ti(LONGITUDE, LATITUDE, ToD, d = c(2, 1), bs = c("ds", "tp"), ## m = list(c(1, 0.5), NA), k = c(20, 10)) + ti(LONGITUDE, ## LATITUDE, DoY, d = c(2, 1), bs = c("ds", "cc"), m = list(c(1, ## 0.5), NA), k = c(25, 12)) + ti(LONGITUDE, LATITUDE, YEAR, ## d = c(2, 1), bs = c("ds", "tp"), m = list(c(1, 0.5), NA), ## k = c(25, 15)) ## Model 2: MEASUREMENT ~ s(ToD, k = 10) + s(DoY, k = 12, bs = "cc") + s(YEAR, ## k = 30) + s(LONGITUDE, LATITUDE, k = 100, bs = "ds", m = c(1, ## 0.5)) + ti(DoY, YEAR, bs = c("cc", "tp"), k = c(12, 15)) ## Resid. Df Resid. Dev Df Deviance F Pr(>F) ## 1 14684 38759 ## 2 15022 41769 -338.25 -3009.8 3.3979 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1
```
]

---

# Galveston Bay — full model summary

.small[

```r
summary(m)
```

```
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## MEASUREMENT ~ s(ToD, k = 10) + s(DoY, k = 12, bs = "cc") + s(YEAR, 
##     k = 30) + s(LONGITUDE, LATITUDE, k = 100, bs = "ds", m = c(1, 
##     0.5)) + ti(DoY, YEAR, bs = c("cc", "tp"), k = c(12, 15)) + 
##     ti(LONGITUDE, LATITUDE, ToD, d = c(2, 1), bs = c("ds", "tp"), 
##         m = list(c(1, 0.5), NA), k = c(20, 10)) + ti(LONGITUDE, 
##     LATITUDE, DoY, d = c(2, 1), bs = c("ds", "cc"), m = list(c(1, 
##         0.5), NA), k = c(25, 12)) + ti(LONGITUDE, LATITUDE, YEAR, 
##     d = c(2, 1), bs = c("ds", "tp"), m = list(c(1, 0.5), NA), 
##     k = c(25, 15))
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.61913    0.02078    1040   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf Ref.df         F p-value    
## s(ToD)                        1.000   1.00    25.979 8.8e-07 ***
## s(DoY)                        9.896  10.00 13430.960 < 2e-16 ***
## s(YEAR)                      28.026  28.75    75.994 < 2e-16 ***
## s(LONGITUDE,LATITUDE)        69.104  99.00     6.713 < 2e-16 ***
## ti(DoY,YEAR)                131.299 140.00    34.690 < 2e-16 ***
## ti(LONGITUDE,LATITUDE,ToD)   50.398 171.00     0.981 < 2e-16 ***
## ti(LONGITUDE,LATITUDE,DoY)   92.991 240.00     1.301 < 2e-16 ***
## ti(LONGITUDE,LATITUDE,YEAR)  91.791 287.00     1.305 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.94   Deviance explained = 94.2%
## fREML =  29799  Scale est. = 2.6187    n = 15276
```
]

---

# Galveston Bay — full model plot

```r
plot(m, pages = 1, scheme = 2, shade = TRUE)
```

<!-- -->

---

# Galveston Bay — full model plot

```r
draw(m, scales = "free", rug = FALSE, n = 50) +
  plot_layout(widths = 1) &
  theme(strip.text.x = element_text(size = 8))
```

<img src="index_files/figure-html/galveston-full-model-draw-1.svg" width="90%" style="display: block; margin: auto;" />

---

# Galveston Bay — predict

``` r
pdata <- data_slice(m, ToD = 12, DoY = 180,
                    YEAR = evenly(YEAR, by = 1),
                    LONGITUDE = evenly(LONGITUDE, n = 50),
                    LATITUDE  = evenly(LATITUDE, n = 50))
fv <- fitted_values(m, data = pdata)

# set fitted values to NA for grid points that are too far from the data
ind <- too_far(pdata$LONGITUDE, pdata$LATITUDE,
               galveston$LONGITUDE, galveston$LATITUDE, dist = 0.1)
fv <- fv |>
  mutate(.fitted = if_else(ind, NA_real_, .fitted))
```

---

# Galveston Bay — plot

``` r
plt <- ggplot(fv, aes(x = LONGITUDE, y = LATITUDE)) +
  geom_raster(aes(fill = .fitted)) +
  facet_wrap(~ YEAR, ncol = 12) +
  scale_fill_viridis_c(name = expression(degree*C), option = "plasma",
                       na.value = "transparent") +
  coord_quickmap() +
  scale_x_continuous(guide = guide_axis(n.dodge = 2,
                                        check.overlap = TRUE)) +
  theme(legend.position = "top")
plt
```

---

# Galveston Bay — plot

<!-- -->

---

# Galveston Bay — plot trends

``` r
ds <- data_slice(m, ToD = 12, DoY = c(1, 90, 180, 270),
                 YEAR = evenly(YEAR, n = 250),
                 LONGITUDE = -94.8751, LATITUDE = 29.50866)

fv <- fitted_values(m, data = ds, scale = "response")

plt2 <- ggplot(fv, aes(x = YEAR, y = .fitted, group = factor(DoY))) +
  geom_ribbon(aes(ymin = .lower_ci, ymax = .upper_ci), fill = "black",
              alpha = 0.2) +
  geom_line() +
  facet_wrap(~ DoY, scales = "free_y") +
  labs(x = NULL, y = expression(Temperature ~ (degree * C)))
plt2
```

---

# Galveston Bay — plot trends

<!-- -->

---

# Next steps

Read Simon Wood's book!
Lots more material on our ESA GAM Workshop site <https://noamross.github.io/mgcv-esa-workshop/>

Noam Ross' free GAM Course <https://noamross.github.io/gams-in-r-course/>

Noam also maintains a list of [GAM Resources](https://github.com/noamross/gam-resources)

A couple of papers:

.smaller[
1. Simpson, G.L., 2018. Modelling Palaeoecological Time Series Using Generalised Additive Models. Frontiers in Ecology and Evolution 6, 149. https://doi.org/10.3389/fevo.2018.00149
2. Pedersen, E.J., Miller, D.L., Simpson, G.L., Ross, N., 2019. Hierarchical generalized additive models in ecology: an introduction with mgcv. PeerJ 7, e6876. https://doi.org/10.7717/peerj.6876
]

Also see my blog: [fromthebottomoftheheap.net](http://fromthebottomoftheheap.net)

---

# Reuse

* HTML Slide deck [bit.ly/physalia-gam-3](https://bit.ly/physalia-gam-3) © Simpson (2020-2022) [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/)
* RMarkdown [Source](https://bit.ly/physalia-gam)

---

# References

- [Marra & Wood (2011) *Computational Statistics and Data Analysis* **55** 2372–2387.](http://doi.org/10.1016/j.csda.2011.02.004)
- [Marra & Wood (2012) *Scandinavian Journal of Statistics, Theory and Applications* **39**(1), 53–74.](http://doi.org/10.1111/j.1467-9469.2011.00760.x)
- [Nychka (1988) *Journal of the American Statistical Association* **83**(404) 1134–1143.](http://doi.org/10.1080/01621459.1988.10478711)
- Wood (2017) *Generalized Additive Models: An Introduction with R*. Chapman and Hall/CRC. (2nd Edition)
- [Wood (2013a) *Biometrika* **100**(1) 221–228.](http://doi.org/10.1093/biomet/ass048)
- [Wood (2013b) *Biometrika* **100**(4) 1005–1010.](http://doi.org/10.1093/biomet/ast038)
- [Wood et al (2016) *JASA* **111** 1548–1563](https://doi.org/10.1080/01621459.2016.1180986)