Generalized additive models and gratia

class: inverse, middle, left, my-title-slide, title-slide

.title[
# Generalized additive models and gratia
]
.author[
### Gavin Simpson
]
.institute[
### Aarhus University
]
.date[
### May 23, 2024
]

---

class: inverse middle center huge-subsection

# GAMs

---
class: inverse middle center subsection

# Motivating example

---

# HadCRUT4 time series

???

Hadley Centre NH temperature record ensemble

How would you model the trend in these data?

---

# (Generalized) Linear Models

`$$y_i \sim \mathcal{N}(\mu_i, \sigma^2)$$`

`$$\mu_i = \beta_0 + \beta_1 \mathtt{year}_{i} + \beta_2 \mathtt{year}^2_{1i} + \cdots + \beta_j \mathtt{year}^j_{i}$$`

Assumptions

1. linear effects of covariates are good approximation of the true effects
2. conditional on the values of covariates, `$y_i | \mathbf{X} \sim \mathcal{N}(\mu_i, \sigma^2)$`
3. this implies all observations have the same *variance*
4. `$y_i | \mathbf{X} = \mathbf{x}$` are *independent*

An **additive** model addresses the first of these

---
class: inverse center middle subsection

# Why bother with anything more complex?

---

# Is this linear?

---

# Polynomials perhaps&hellip;

---

# Polynomials perhaps&hellip;

We can keep on adding ever more powers of `$\boldsymbol{x}$` to the model &mdash; model selection problem

**Runge phenomenon** &mdash; oscillations at the edges of an interval &mdash; means simply moving to higher-order polynomials doesn't always improve accuracy

---
class: inverse middle center subsection

# GAMs offer a solution

---

# HadCRUT data set

```r
library("readr")
library("dplyr")
URL <-  "https://bit.ly/hadcrutv4"
gtemp <- read_table(URL, col_types = 'nnnnnnnnnnnn', col_names = FALSE) |>
  select(num_range('X', 1:2)) %>% setNames(nm = c('Year', 'Temperature'))
```

[File format](https://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/series_format.html)

---

# HadCRUT data set

```r
gtemp
```

```
## # A tibble: 172 × 2
##     Year Temperature
##    <dbl>       <dbl>
##  1  1850      -0.336
##  2  1851      -0.159
##  3  1852      -0.107
##  4  1853      -0.177
##  5  1854      -0.071
##  6  1855      -0.19 
##  7  1856      -0.378
##  8  1857      -0.405
##  9  1858      -0.4  
## 10  1859      -0.215
## # ℹ 162 more rows
```

---

# Fitting a GAM

```r
library("mgcv")
m <- gam(Temperature ~ s(Year), data = gtemp, method = 'REML')
summary(m)
```

.smaller[

```
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Temperature ~ s(Year)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.020477   0.009731  -2.104   0.0369 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Year) 7.837   8.65 145.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.88   Deviance explained = 88.6%
## -REML = -91.237  Scale est. = 0.016287  n = 172
```
]

---

# Fitted GAM

---
class: inverse middle center big-subsection

# GAMs

---

# Generalized Additive Models

<br />

![](resources/tradeoff-slider.png)

.references[Source: [GAMs in R by Noam Ross](https://noamross.github.io/gams-in-r-course/)]

???

GAMs are an intermediate-complexity model

* can learn from data without needing to be informed by the user
* remain interpretable because we can visualize the fitted features

---

# How is a GAM different?

`$$\begin{align*}
y_i &\sim \mathcal{D}(\mu_i, \theta) \\ 
\mathbb{E}(y_i) &= \mu_i = g(\eta_i)^{-1}
\end{align*}$$`

We model the mean of data as a sum of linear terms:

`$$\eta_i = \beta_0 +\sum_j \color{red}{ \beta_j x_{ji}}$$`

A GAM is a sum of _smooth functions_ or _smooths_

`$$\eta_i = \beta_0 + \sum_j \color{red}{f_j(x_{ji})}$$`

---

# How did `gam()` *know*?

---
class: inverse
background-image: url('./resources/rob-potter-398564.jpg')
background-size: contain

# What magic is this?

.footnote[
<a style="background-color:black;color:white;text-decoration:none;padding:4px 6px;font-family:-apple-system, BlinkMacSystemFont, &quot;San Francisco&quot;, &quot;Helvetica Neue&quot;, Helvetica, Ubuntu, Roboto, Noto, &quot;Segoe UI&quot;, Arial, sans-serif;font-size:12px;font-weight:bold;line-height:1.2;display:inline-block;border-radius:3px;" href="https://unsplash.com/@robpotter?utm_medium=referral&amp;utm_campaign=photographer-credit&amp;utm_content=creditBadge" target="_blank" rel="noopener noreferrer" title="Download free do whatever you want high-resolution photos from Rob Potter"><span style="display:inline-block;padding:2px 3px;"><svg xmlns="http://www.w3.org/2000/svg" style="height:12px;width:auto;position:relative;vertical-align:middle;top:-1px;fill:white;" viewBox="0 0 32 32"><title></title><path d="M20.8 18.1c0 2.7-2.2 4.8-4.8 4.8s-4.8-2.1-4.8-4.8c0-2.7 2.2-4.8 4.8-4.8 2.7.1 4.8 2.2 4.8 4.8zm11.2-7.4v14.9c0 2.3-1.9 4.3-4.3 4.3h-23.4c-2.4 0-4.3-1.9-4.3-4.3v-15c0-2.3 1.9-4.3 4.3-4.3h3.7l.8-2.3c.4-1.1 1.7-2 2.9-2h8.6c1.2 0 2.5.9 2.9 2l.8 2.4h3.7c2.4 0 4.3 1.9 4.3 4.3zm-8.6 7.5c0-4.1-3.3-7.5-7.5-7.5-4.1 0-7.5 3.4-7.5 7.5s3.3 7.5 7.5 7.5c4.2-.1 7.5-3.4 7.5-7.5z"></path></svg></span><span style="display:inline-block;padding:2px 3px;">Rob Potter</span></a>
]

---
class: inverse
background-image: url('resources/wiggly-things.png')
background-size: contain

---

# Fitting a GAM in R

```r
model <- gam(y ~ s(x1) + s(x2) + te(x3, x4), # formuala describing model
             data = my_data_frame,           # your data
             method = "REML",                # or "ML"
             family = gaussian)              # or something more exotic
```

`s()` terms are smooths of one or more variables

`te()` terms are the smooth equivalent of *main effects + interactions*

`$$\eta_i = \beta_0 + f(\mathtt{Year}_i)$$`

```r
library("mgcv")
gam(Temperature ~ s(Year, k = 10), data = gtemp, method = 'REML')
```

---

# Generalized additive models

A GAM is a model where the effects of covariates can be represented as smooth functions learned from the data

`\begin{align*}
y_i &\sim    \mathcal{D}(\mu_i, \boldsymbol{\phi}) \\
g(\mu_i) &=  \mathbf{A}_i\boldsymbol{\gamma} + \sum_{j=1} f_j(x_{ji})
\end{align*}`

The smooth functions `$f_j()$` are set up as penalized splines

---

# Splines formed from basis functions

---

# Weight basis functions &#8680; spline

---

# Maximise penalised log-likelihood &#8680; &beta;

???

Fitting a GAM involves finding the weights for the basis functions that produce a spline that fits the data best, subject to some constraints

---
class: inverse middle center big-subsection

# gratia 📦

---

# gratia 📦

{gratia} is a package for working with GAMs

Mostly limited to GAMs fitted with {mgcv} & {gamm4}

Follows (sort of) Tidyverse principles

* return tibbles (data frames)

* suitable for plotting with {ggplot2}

* graphics use {ggplot2} and {patchwork}

---

# `draw()`

The main function is `draw()`

Has methods for many of the function outputs in {gratia} and for models

---

# Plotting smooths

```r
data(mcycle, package = "MASS")
m <- gam(accel ~ s(times), data = mcycle, method = "REML")
draw(m)
```

---

# Plotting smooths

```r
df <- data_sim("eg1", n = 1000, seed = 42)

m_sim <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
  data = df, method = "REML")
draw(m_sim)
```

---

# Plotting smooths

```r
draw(m_sim)
```

---

# Plotting smooths

```r
draw(m_sim,
     residuals = TRUE,           # add partial residuals
     overall_uncertainty = TRUE, # include uncertainty due to constant
     unconditional = TRUE,       # correct for smoothness selection
     rug = FALSE)                # turn off rug plot
```

---

# Plotting smooths

.center[
<img src="index_files/figure-html/draw-mcycle-options-plot-1.svg" width="95%" style="display: block; margin: auto;" />
]

---

# New in 0.9.0

Posterior simulation

<https://gavinsimpson.github.io/gratia/articles/posterior-simulation.html>

* `smooth_samples()`
    * posterior of individual smooths &mdash; model uncertainty
* `fitted_samples()`
    * posterior expectations &mdash; model uncertainty
* `predicted_samples()`
    * simulated response data &mdash; sampling variation
* `posterior_samples()`
    * uncertainty in expectation & sampling variation

---

# Ocean chlorophyll-*a* &mdash; setup

Model chlorophyll-*a* in ocean samples

```r
library("gratia")
library("ggplot2")
library("dplyr")
```

```r
library("gamair")
data(chl, package = "gamair")
```

---

# Ocean chlorophyll-*a*

Use a spline on the sphere basis `bs = "sos"` for spatial smooth

```r
ctrl <- gam.control(nthreads = 4)
chl_m1 <- gam(
  chl ~ s(lat, lon, bs = "sos", m = -1, k = 150) +
    s(jul.day, bs = "cr", k = 20) +
    s(bath, k = 10),
  data = chl,
  method = "REML",
  control = ctrl,
  family = tw()
)
```

---

# Ocean chlorophyll-*a*

Model overview is tidy summary of the model &mdash; uses `summary.gam()` under the hood

```r
overview(chl_m1)
```

```
## 
## Generalized Additive Model with 3 terms
## 
##   term       type      k    edf statistic p.value
##   <chr>      <chr> <dbl>  <dbl>     <dbl> <chr>  
## 1 s(lat,lon) SOS     149 136.        41.9 < 0.001
## 2 s(jul.day) CRS      19  18.0      154.  < 0.001
## 3 s(bath)    TPRS      9   7.76      13.3 < 0.001
```

---

# Ocean chlorophyll-*a* &mdash; `draw()`

```r
crs <- "+proj=ortho +lat_0=20 +lon_0=-37" # centre on -37 longitude
draw(chl_m1, crs = crs, dist = 0.05, rug = FALSE)
```

---

# Ocean chlorophyll-*a* &mdash; model checking

```r
appraise(chl_m1, method = "simulate")
```

---

# Ocean chlorophyll-*a* &mdash; GAMLSS

Tweedie LSS model with 3 linear predictors

```r
chl_m2 <- gam(
  list(
    chl ~ s(lat, lon, bs = "sos", m = -1, k = 150) + # location
      s(jul.day, bs = "cr", k = 20) +
      s(bath, k = 10),
    ~ s(lat, lon, bs = "sos", m = -1, k = 100) +     # power
      s(jul.day, bs = "cr", k = 20) +
      s(bath, k = 10),
    ~ s(lat, lon, bs = "sos", m = -1, k = 100) +     # scale
      s(jul.day, bs = "cr", k = 20) +
      s(bath, k = 10)
  ),
  data = chl,
  method = "REML",
  control = ctrl,
  family = twlss()
)
```

---

# Ocean chlorophyll-*a*

```r
overview(chl_m2)
```

```
## 
## Generalized Additive Model with 9 terms
## 
##   term         type      k    edf statistic p.value 
##   <chr>        <chr> <dbl>  <dbl>     <dbl> <chr>   
## 1 s(lat,lon)   SOS     149 139.    14864.   < 0.001 
## 2 s(jul.day)   CRS      19  18.2    2672.   < 0.001 
## 3 s(bath)      TPRS      9   7.85    131.   < 0.001 
## 4 s.1(lat,lon) SOS      99  58.8     438.   < 0.001 
## 5 s.1(jul.day) CRS      19  15.2     304.   < 0.001 
## 6 s.1(bath)    TPRS      9   1.56      5.25 0.039308
## 7 s.2(lat,lon) SOS      99  85.6    1695.   < 0.001 
## 8 s.2(jul.day) CRS      19  10.3     406.   < 0.001 
## 9 s.2(bath)    TPRS      9   7.11    247.   < 0.001
```

---

# Ocean chlorophyll-*a* &mdash; model checking

```r
appraise(chl_m2, method = "simulate")
```

---

# Ocean chlorophyll-*a*

```r
draw(chl_m2, crs = crs, dist = 0.05, rug = FALSE)
```

---

# Ocean chlorophyll-*a*

```r
chl_m2 |>
  smooth_estimates(select = "s(lat,lon)", n = 50) |>
  add_confint()
```

```
## # A tibble: 2,500 × 9
##    .smooth    .type .by   .estimate   .se   lat   lon .lower_ci .upper_ci
##    <chr>      <chr> <chr>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
##  1 s(lat,lon) SOS   <NA>    -0.0582 0.475  30.2 -79.9    -0.989    0.873 
##  2 s(lat,lon) SOS   <NA>    -0.118  0.312  30.2 -76.6    -0.729    0.493 
##  3 s(lat,lon) SOS   <NA>    -0.378  0.239  30.2 -73.3    -0.847    0.0904
##  4 s(lat,lon) SOS   <NA>    -0.788  0.212  30.2 -70.1    -1.20    -0.373 
##  5 s(lat,lon) SOS   <NA>    -1.11   0.192  30.2 -66.8    -1.48    -0.730 
##  6 s(lat,lon) SOS   <NA>    -1.42   0.249  30.2 -63.5    -1.91    -0.932 
##  7 s(lat,lon) SOS   <NA>    -1.95   0.307  30.2 -60.2    -2.55    -1.35  
##  8 s(lat,lon) SOS   <NA>    -2.04   0.296  30.2 -56.9    -2.62    -1.46  
##  9 s(lat,lon) SOS   <NA>    -2.03   0.256  30.2 -53.6    -2.54    -1.53  
## 10 s(lat,lon) SOS   <NA>    -2.06   0.227  30.2 -50.4    -2.50    -1.61  
## # ℹ 2,490 more rows
```

---

# Ocean chlorophyll-*a* &mdash; using the model

What is the average chlorophyll-*a* in a box of the ocean 40&ndash;50&deg; N, 40&ndash;50&deg; W?

To answer this note

1. we don't (in this data set) have the ocean depth at each point in the box,
2. we want to ignore the day of year (`jul.day`) effect

So we'll only use the three spatial smooths & the intercept when predicting

---

# Ocean chlorophyll-*a* &mdash; using the model

```r
# at what values of covariates to predict at
ds <- data_slice(
  chl_m2,
  lat = evenly(lat, lower = 40, upper = 50, by = 0.5),
  lon = evenly(lon, lower = -50, upper = -40, by = 0.5)
)
# which model terms to include
use <- c("(Intercept)", "s(lat,lon)", "s.1(lat,lon)", "s.2(lat,lon)")
# generate predictions
fv <- fitted_values(
  chl_m2,
  data = ds,
  terms = use
)
# average predicted chl-a over the grid cells
fv |>
  filter(.parameter == "location") |>
  dplyr::summarise(chl_a = mean(.fitted))
```

```
## # A tibble: 1 × 1
##   chl_a
##   <dbl>
## 1  1.07
```

---

# Ocean chlorophyll-*a* &mdash; using the model

What is the average chlorophyll-*a* in a box of the ocean 40&ndash;50&deg; N, 40&ndash;50&deg; W?

Answer:

```
## # A tibble: 1 × 1
##   chl_a
##   <dbl>
## 1  1.07
```

What about the uncertainty in that estimate?

---

# Ocean chl-*a* &mdash; posterior simulation

Simulate large number of draws from the posterior

```r
fs <- fitted_samples(
  chl_m2,
  data = ds,            # where to predict; our box
  terms = use,          # include only these model terms
  n = 10000,            # how many posterior samples?
  method = "gaussian",  # Gaussian approximation is default
  unconditional = TRUE, # include uncertainty for unkown smoothing parameters
  n_cores = 4,          # use more cores for RNG
  seed = 342            # stochastic, so set a seed
)
```

---

# Ocean chl-*a* &mdash; posterior simulation

Repeat the _same_ `dplyr` code to average `$\widehat{\mathtt{chl}}_i$` over the box

Only difference is we do this separately for each posterior sample (`.draw`) & then we summarise posterior `median_qi()`

```r
library("ggdist")                    # for median_qi()
fs |>
  group_by(.draw) |>                 # do averaging for each posterior sample
  dplyr::summarise(chl_a = mean(.fitted)) |>
  median_qi()                        # median and 0.025 & 0.975 quantiles
```

```
## # A tibble: 1 × 6
##   chl_a .lower .upper .width .point .interval
##   <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1  1.07  0.870   1.33   0.95 median qi
```

---

# Some tracking-related GAM resources

* Bravington _et al_ (2021) [Variance Propagation for Density Surface Models](https://doi.org/10.1007/s13253-021-00438-2)

* DSMs involve a detection function (distance sampling) and a GAM

* Paper shows how to propagate uncertainty from detection function to the GAM

---

# Harbour porpoise

Example from Bravington _et al_ (2021)

---

# Some tracking-related GAM resources

* Presence-only data
    * Imhomogeneous Poisson Point Process
    * Log-Gaussian Cox Process
    * Dovers et al (2024) [Fitting log-Gaussian Cox processes using generalized additive model software](https://doi.org/10.1080/00031305.2024.2316725)
    * Also `ppgam` package

* Step selection
    * Animal movement & habitat preferences
    * Klappstein *et al* (2024) [Step selection analysis with non-linear and random effects in mgcv](https://doi.org/10.1101/2024.01.05.574363)

---
class: inverse center middle subsection

# Posteriors and prediction

---

# Posteriors and prediction

The following slides explain what's going on inside `fitted_values()`, `fitted_samples()`, and related posterior sampling functions.

---

# 🐡🐠🐟🦐 Species richness & 🦐 biomass

The example comes from trawl data from off the coast of Newfoundland and Labrador, Canada

* Counts of species richness at each trawl location
* Shrimp biomass at each trawl location
* Annual trawls 2005&ndash;2014

---

# 🐡🐠🐟🦐 Species richness

.row[
.col-6[

```r
shrimp <- read_csv("data/trawl_nl.csv")
```

```r
m_rich <- gam(richness ~ s(year),
              family = poisson,
              method = "REML",
              data = shrimp)
```
]
.col-6[
<img src="index_files/figure-html/richness-violin-1.svg" style="display: block; margin: auto;" />
]
]

---

# 🐡🐠🐟🦐 Species richness

```r
draw(m_rich)
```

---

# Spatio-temporal data

🦐 biomass at each trawl

<img src="index_files/figure-html/biom-space-time-plot-1.png" style="display: block; margin: auto;" />
---

# Spatio-temporal model

```r
m_spt <- gam(shrimp ~ te(x, y, year, d = c(2,1),
                         bs = c('tp', 'cr'), k = c(20, 5)),
             data = shrimp,
             family = tw(),
             method = "REML")
```

---

# Predicting with `predict()`

`plot.gam()` and `gratia::draw()` show the component functions of the model on the link scale

Prediction allows us to evaluate the model at known values of covariates on the response scale

Use the standard function `predict()`

Provide `newdata` with a data frame of values of covariates

---

# `predict()`

```r
new_year <- with(shrimp, tibble(year = seq(min(year), max(year), length.out = 100)))
pred <- predict(m_rich, newdata = new_year, se.fit = TRUE, type = 'link')
pred <- bind_cols(new_year, as_tibble(as.data.frame(pred)))
pred
```

```
## # A tibble: 100 × 3
##     year   fit  se.fit
##    <dbl> <dbl>   <dbl>
##  1 2005   3.05 0.0100 
##  2 2005.  3.05 0.00901
##  3 2005.  3.06 0.00830
##  4 2005.  3.06 0.00792
##  5 2005.  3.06 0.00786
##  6 2005.  3.06 0.00807
##  7 2006.  3.07 0.00844
##  8 2006.  3.07 0.00887
##  9 2006.  3.07 0.00926
## 10 2006.  3.08 0.00955
## # ℹ 90 more rows
```

---

# `predict()` &rarr; response scale

```r
ilink <- inv_link(m_rich)                         # inverse link function
crit <- qnorm((1 - 0.89) / 2, lower.tail = FALSE) # or just `crit <- 2`
pred <- mutate(pred, richness = ilink(fit),
               lwr = ilink(fit - (crit * se.fit)), # lower...
               upr = ilink(fit + (crit * se.fit))) # upper credible interval
pred
```

```
## # A tibble: 100 × 6
##     year   fit  se.fit richness   lwr   upr
##    <dbl> <dbl>   <dbl>    <dbl> <dbl> <dbl>
##  1 2005   3.05 0.0100      21.1  20.8  21.4
##  2 2005.  3.05 0.00901     21.2  20.9  21.5
##  3 2005.  3.06 0.00830     21.2  20.9  21.5
##  4 2005.  3.06 0.00792     21.3  21.0  21.6
##  5 2005.  3.06 0.00786     21.4  21.1  21.6
##  6 2005.  3.06 0.00807     21.4  21.1  21.7
##  7 2006.  3.07 0.00844     21.5  21.2  21.8
##  8 2006.  3.07 0.00887     21.6  21.3  21.9
##  9 2006.  3.07 0.00926     21.6  21.3  22.0
## 10 2006.  3.08 0.00955     21.7  21.4  22.0
## # ℹ 90 more rows
```

---

# `predict()` &rarr; plot

Tidy objects like this are easy to plot with `ggplot()`

```r
ggplot(pred, aes(x = year)) + scale_x_continuous(breaks = 2005:2014) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
    geom_line(aes(y = richness)) + labs(y = "Species richness", x = NULL)
```

---

# `predict()` for space and time

This idea is very general;  spatiotemporal model needs a grid of x,y coordinates for each year

```r
sp_new <- with(shrimp, expand.grid(x = evenly(x, n = 100), y = evenly(y, n = 100),
                                   year = unique(year)))
sp_pred <- predict(m_spt, newdata = sp_new, se.fit = TRUE) # link scale is default
sp_pred <- bind_cols(as_tibble(sp_new), as_tibble(as.data.frame(sp_pred)))
sp_pred
```

```
## # A tibble: 100,000 × 5
##          x        y  year   fit se.fit
##      <dbl>    <dbl> <dbl> <dbl>  <dbl>
##  1 428238. 5078244.  2005  3.34  1.06 
##  2 436886. 5078244.  2005  3.32  1.06 
##  3 445535. 5078244.  2005  3.31  1.05 
##  4 454183. 5078244.  2005  3.29  1.04 
##  5 462831. 5078244.  2005  3.26  1.03 
##  6 471479. 5078244.  2005  3.24  1.02 
##  7 480127. 5078244.  2005  3.21  1.01 
##  8 488775. 5078244.  2005  3.18  1.00 
##  9 497423. 5078244.  2005  3.15  0.994
## 10 506071. 5078244.  2005  3.12  0.985
## # ℹ 99,990 more rows
```

---

# `predict()` &rarr; response scale

```r
ilink <- inv_link(m_spt)
too_far <- exclude.too.far(sp_pred$x, sp_pred$y, shrimp$x, shrimp$y, dist = 0.1)
sp_pred <- sp_pred %>% mutate(biomass = ilink(fit),
                              biomass = case_when(too_far ~ NA_real_,
                                                  TRUE ~ biomass))
sp_pred
```

```
## # A tibble: 100,000 × 6
##          x        y  year   fit se.fit biomass
##      <dbl>    <dbl> <dbl> <dbl>  <dbl>   <dbl>
##  1 428238. 5078244.  2005  3.34  1.06       NA
##  2 436886. 5078244.  2005  3.32  1.06       NA
##  3 445535. 5078244.  2005  3.31  1.05       NA
##  4 454183. 5078244.  2005  3.29  1.04       NA
##  5 462831. 5078244.  2005  3.26  1.03       NA
##  6 471479. 5078244.  2005  3.24  1.02       NA
##  7 480127. 5078244.  2005  3.21  1.01       NA
##  8 488775. 5078244.  2005  3.18  1.00       NA
##  9 497423. 5078244.  2005  3.15  0.994      NA
## 10 506071. 5078244.  2005  3.12  0.985      NA
## # ℹ 99,990 more rows
```

---

# `predict()` &rarr; plot

```r
ggplot(sp_pred, aes(x = x, y = y, fill = biomass)) + geom_raster() +
    scale_fill_viridis_c(option = "plasma") + facet_wrap(~ year, ncol = 5) + coord_equal()
```

---

# Visualizing the trend?

We have this model

.smaller[

```r
m_spt
```

```
## 
## Family: Tweedie(p=1.686) 
## Link function: log 
## 
## Formula:
## shrimp ~ te(x, y, year, d = c(2, 1), bs = c("tp", "cr"), k = c(20, 
##     5))
## 
## Estimated degrees of freedom:
## 70.4  total = 71.38 
## 
## REML score: 19102.91
```
]

How would you visualize the average change in biomass over time?

---

# Welcome back old friend

One way is to  decompose the spatio-temporal function in main effects plus interaction

```r
m_ti <- gam(shrimp ~ ti(x, y, year, d = c(2, 1), bs = c("tp", "cr"), k = c(20, 5)) +
              s(x, y, bs = "tp", k = 20) +
                s(year, bs = "cr", k = 5),
            data = shrimp, family = tw, method = "REML")
```

and predict from the model using only the marginal effect of `s(year)`

---

# `predict()` with `exclude`

.row[
.col-6[
We can exclude the spatial & spatiotemporal terms from predictions using `exclude`

**Step 1** run `gratia::smooths()` on model & note the names of the smooth you *don't* want &rarr;
]
.col-6[
.smaller[

```r
smooths(m_ti)
```

```
## [1] "ti(x,y,year)" "s(x,y)"       "s(year)"
```
]
]
]

---

# `predict()` with `exclude` &mdash; Step 2 *predict*

Prediction data only needs dummy values for `x` and `y`

```r
ti_new <- with(shrimp, expand.grid(x = mean(x), y = mean(y), year = evenly(year, n = 100)))

ti_pred <- predict(m_ti, newdata = ti_new, se.fit = TRUE,
*                  exclude = c("ti(x,y,year)", "s(x,y)"))

ti_pred <- bind_cols(as_tibble(ti_new), as_tibble(as.data.frame(ti_pred))) %>%
    mutate(biomass = ilink(fit),
           lwr = ilink(fit - (crit * se.fit)),
           upr = ilink(fit + (crit * se.fit)))
```

`exclude` takes a character vector of terms to exclude &mdash; `predict()` sets the contributions of those terms to 0

Could also use `terms` to select only the named terms

```r
predict(m_ti, newdata = ti_new, se.fit = TRUE, terms = c("(Intercept)", "s(year)"))
```

---

# `predict()` with `exclude`&mdash; Step 3 *plot it!*

```r
ggplot(ti_pred, aes(x = year)) + geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3) +
    geom_line(aes(y = biomass)) + labs(y = "Biomass", x = NULL)
```

---

# Using `fitted_values()`

```r
ti_pred2 <- fitted_values(m_ti, data = ti_new,
                          scale = "response",
*                         exclude = c("ti(x,y,year)", "s(x,y)"))

ggplot(ti_pred2, aes(x = year)) + geom_ribbon(aes(ymin = .lower_ci, ymax = .upper_ci), alpha = 0.3) +
  geom_line(aes(y = .fitted)) + labs(y = "Biomass", x = NULL)
```

---
class: inverse middle center subsection

# Posterior simulation

---

# Posterior simulation

.center[

```r
knitr::include_graphics("resources/miller-bayesian-gam-interpretation-fig.svg")
```

<img src="resources/miller-bayesian-gam-interpretation-fig.svg" width="80%" style="display: block; margin: auto;" />
]

.smaller[
Miller (2021) Bayesian Views of Generalized Additive Modelling. [*arXiv*:1902.01330v3](http://arxiv.org/abs/1902.01330v3)
]

---

# Posterior distributions

Each line is a draw from the *posterior distribution* of the smooth

Remember the coefficients for each basis function?: `$\beta_j$`

Together they are distributed *multivariate normal* with

* mean vector given by `$\hat{\beta}_j$`
* covariance matrix `$\boldsymbol{\hat{V}}_{\beta}$`

`$$\text{MVN}(\boldsymbol{\hat{\beta}}, \boldsymbol{\hat{V}}_{\beta})$$`

The model as a whole has a posterior distribution too

We can simulate data from the model by taking draws from the posterior distribution

---

# Posterior simulation for a smooth

Sounds fancy but it's only just slightly more complicated than using `rnorm()`

To do this we need a few things:

1. The vector of model parameters for the smooth, `$\boldsymbol{\hat{\beta}}$`
2. The covariance matrix of those parameters, `$\boldsymbol{\hat{V}}_{\beta}$`
3. A matrix `$\boldsymbol{X}_p$` that maps parameters to the linear predictor for the smooth

`$$\boldsymbol{\hat{\eta}}_p = \boldsymbol{X}_p \boldsymbol{\hat{\beta}}$$`

Let's do this for `m_rich`

---

# Posterior sim for a smooth &mdash; step 1

The vector of model parameters for the smooth, `$\boldsymbol{\hat{\beta}}$`

```r
sm_year <- get_smooth(m_rich, "s(year)")     # extract the smooth object from model
idx <- gratia:::smooth_coef_indices(sm_year) # indices of the coefs for this smooth
idx
```

```
## [1]  2  3  4  5  6  7  8  9 10
```

```r
beta <- coef(m_rich)                         # vector of model parameters
beta[idx]                                    # coefs for this smooth
```

```
##   s(year).1   s(year).2   s(year).3   s(year).4   s(year).5   s(year).6   s(year).7   s(year).8 
## -0.17559264  1.13222927 -0.46532056  5.90630566  0.18400060 -1.09147043 -0.20021520 -0.44434784 
##   s(year).9 
## -0.02398653
```

---

# Posterior sim for a smooth &mdash; step 2

The covariance matrix of the model parameters, `$\boldsymbol{\hat{V}}_{\beta}$`

```r
Vb <- vcov(m_rich) # default is the bayesian covariance matrix
Vb
```

.small[

```
##               (Intercept)     s(year).1     s(year).2     s(year).3     s(year).4     s(year).5     s(year).6     s(year).7     s(year).8     s(year).9
## (Intercept)  1.027059e-05  1.578864e-06 -9.032418e-06 -1.307231e-06 -4.622411e-05 -9.668346e-06  1.481563e-05  8.523791e-07  4.057120e-06  7.715888e-08
## s(year).1    1.578864e-06  4.766242e-02 -1.705627e-01  1.280727e-01 -1.447873e-01  2.579066e-02 -7.928522e-02  1.444655e-02  3.437523e-02  6.254985e-03
## s(year).2   -9.032418e-06 -1.705627e-01  7.441849e-01 -6.341764e-01  9.230513e-01 -3.008818e-01  5.476052e-01 -1.972615e-01 -1.370834e-01 -2.260069e-02
## s(year).3   -1.307231e-06  1.280727e-01 -6.341764e-01  1.756373e+00 -1.488830e+00  8.995848e-01  2.440806e-02 -2.444633e-01  6.839307e-03 -2.921669e-03
## s(year).4   -4.622411e-05 -1.447873e-01  9.230513e-01 -1.488830e+00  2.743191e+00 -2.018595e+00  1.612778e+00 -5.371137e-01 -1.362334e-01 -1.741728e-02
## s(year).5   -9.668346e-06  2.579066e-02 -3.008818e-01  8.995848e-01 -2.018595e+00  2.276558e+00 -1.671047e+00  5.120318e-01  3.881950e-02 -4.407525e-03
## s(year).6    1.481563e-05 -7.928522e-02  5.476052e-01  2.440806e-02  1.612778e+00 -1.671047e+00  2.357642e+00 -1.045468e+00 -1.807873e-01 -1.797243e-02
## s(year).7    8.523791e-07  1.444655e-02 -1.972615e-01 -2.444633e-01 -5.371137e-01  5.120318e-01 -1.045468e+00  5.391215e-01  8.494867e-02  8.167830e-03
## s(year).8    4.057120e-06  3.437523e-02 -1.370834e-01  6.839307e-03 -1.362334e-01  3.881950e-02 -1.807873e-01  8.494867e-02  3.836358e-02  6.579738e-03
## s(year).9    7.715888e-08  6.254985e-03 -2.260069e-02 -2.921669e-03 -1.741728e-02 -4.407525e-03 -1.797243e-02  8.167830e-03  6.579738e-03  1.683112e-03
```
]

---

# Posterior sim for a smooth &mdash; step 3

A matrix `$\boldsymbol{X}_p$` maps parameters to the linear predictor for the smooth

We get `$\boldsymbol{X}_p$` using the `predict()` method with `type = "lpmatrix"`

```r
new_year <- with(shrimp, tibble(year = evenly(year, n = 100)))
Xp <- predict(m_rich, newdata = new_year, type = 'lpmatrix')
dim(Xp)
```

```
## [1] 100  10
```

---

# Posterior sim for a smooth &mdash; step 4

Take only the columns of `$\boldsymbol{X}_p$` that are involved in the smooth of `year`

```r
Xp <- Xp[, idx, drop = FALSE]
dim(Xp)
```

```
## [1] 100   9
```

---

# Posterior sim for a smooth &mdash; step 5

Simulate parameters from the posterior distribution of the smooth of `year`

```r
set.seed(42)
beta_sim <- rmvn(n = 20, beta[idx], Vb[idx, idx, drop = FALSE])
dim(beta_sim)
```

```
## [1] 20  9
```

Simulating many sets (20) of new model parameters from the estimated parameters and their uncertainty (covariance)

Result is a matrix where each row is a set of new model parameters, each consistent with the fitted smooth

---

# Posterior sim for a smooth &mdash; step 6

.row[
.col-6[
Form `$\boldsymbol{\hat{\eta}}_p$`, the posterior draws for the smooth

```r
sm_draws <- Xp %*% t(beta_sim)
dim(sm_draws)
```

```
## [1] 100  20
```

```r
matplot(sm_draws, type = 'l')
```

A bit of rearranging is needed to plot with `ggplot()`
]

.col-6[
<img src="index_files/figure-html/richness-posterior-draws-1.svg" style="display: block; margin: auto;" />
]

]

Or use `smooth_samples()`

---

# Posterior sim for a smooth &mdash; steps 1&ndash;6

```r
sm_post <- smooth_samples(m_rich, 's(year)', n = 20, seed = 42)
draw(sm_post)
```

---

# Posterior simulation from the model

Simulating from the posterior distribution of the model requires 1 modification of the recipe for a smooth and one extra step

We want to simulate new values for all the parameters in the model, not just the ones involved in a particular smooth

Additionally, we could simulate *new response data* from the model and the simulated parameters (**not shown** below)

---

# Posterior simulation from the model

```r
beta <- coef(m_rich)   # vector of model parameters
Vb <- vcov(m_rich)     # default is the bayesian covariance matrix
Xp <- predict(m_rich, type = "lpmatrix")
set.seed(42)
beta_sim <- rmvn(n = 1000, beta, Vb) # simulate parameters
eta_p <- Xp %*% t(beta_sim)        # form linear predictor values
mu_p <- inv_link(m_rich)(eta_p)    # apply inverse link function

mean(mu_p[1, ]) # mean of posterior for the first observation in the data
```

```
## [1] 21.10123
```

```r
quantile(mu_p[1, ], probs = c(0.025, 0.975))
```

```
##     2.5%    97.5% 
## 20.70134 21.49528
```

---

# Posterior simulation from the model

```r
ggplot(tibble(richness = mu_p[587, ]), aes(x = richness)) +
    geom_histogram() + labs(title = "Posterior richness for obs #587")
```

---

# Posterior simulation from the model

Or easier using `fitted_samples()`

```r
rich_post <- fitted_samples(m_rich, n = 1000, data = shrimp, seed = 42)
ggplot(filter(rich_post, .row == 587), aes(x = .fitted)) +
    geom_histogram() + labs(title = "Posterior richness for obs #587", x = "Richness")
```

---

# Why is this of interest?

Say you wanted to get an estimate for the total biomass of shrimp over the entire region of the trawl survey for 2007

You could predict for the spatial grid for `year == 2007` using code shown previously and sum the predicted biomass values over all the grid cells

**Easy**

But what if you also wanted the uncertainty in that estimate?

**Hard**

**Math** 😱😱 "something, something, delta method, something" 😱😱

---

# Posterior simulation makes this easy

1. Take a draw from the posterior distribution of the model
2. Use the posterior draw to predict biomass for each grid cell
3. Sum the predicted biomass values over all grid cells
4. Store the total biomass value
5. Repeat 1&ndash;4 a lot of times to get posterior distribution for total biomass
6. Summarize the total biomass posterior
    * Estimated total biomass is the mean of the total biomass posterior
	* Uncertainty is some lower/upper tail probability quantiles of the posterior

---

# Let's do it

```r
sp_new <- with(shrimp, expand.grid(x = evenly(x, n = 100), y = evenly(y, n = 100),
                                   year = 2007))
Xp <- predict(m_spt, newdata = sp_new, type = "lpmatrix")

## work out now which points are too far now
too_far <- exclude.too.far(sp_new$x, sp_new$y, shrimp$x, shrimp$y, dist = 0.1)

beta <- coef(m_spt)                  # vector of model parameters
Vb <- vcov(m_spt)                    # default is the bayesian covariance matrix
set.seed(42)
beta_sim <- rmvn(n = 1000, beta, Vb) # simulate parameters
eta_p <- Xp %*% t(beta_sim)          # form linear predictor values
mu_p <- inv_link(m_spt)(eta_p)       # apply inverse link function
```

Columns of `mu_p` contain the expected or mean biomass for each grid cell per area trawled

Sum the columns of `mu_p` and summarize

---

# Summarize the expected biomass

```r
mu_copy <- mu_p              # copy mu_p
mu_copy[too_far, ] <- NA     # set cells too far from data to be NA
total_biomass <- colSums(mu_copy, na.rm = TRUE)  # total biomass over the region

mean(total_biomass)
```

```
## [1] 1561644
```

```r
quantile(total_biomass, probs = c(0.025, 0.975))
```

```
##    2.5%   97.5% 
## 1404871 1750293
```

---

# Summarize the expected biomass

```r
ggplot(tibble(biomass = total_biomass), aes(x = biomass)) +
    geom_histogram()
```

---

# With `fitted_samples()`

.row[

.col-7[

```r
bio_post <- fitted_samples(m_spt, n = 1000,
                           data = sp_new[!too_far, ],
                           seed = 42) |>
    group_by(.draw) |>
    summarise(total = sum(.fitted),
              .groups = "drop_last")

with(bio_post, mean(total))
```

```
## [1] 1561644
```

```r
with(bio_post, quantile(total, probs = c(0.025, 0.975)))
```

```
##    2.5%   97.5% 
## 1404871 1750293
```
]

.col-5[

```r
ggplot(bio_post, aes(x = total)) +
    geom_histogram() +
    labs(x = "Total biomass")
```

]

---
class: inverse center middle subsection

# Soap films

---

# Soap films

For the narrative description of what is going on in this example, see [my blog post](https://fromthebottomoftheheap.net/2016/03/27/soap-film-smoothers/) on soap film smooths, from where the code below was sourced & modernified a little

---

# Soap film smooths

```r
fsb <- fs.boundary()
m <- 300
n <- 150 
xm <- seq(-1, 4, length = m)
yn <- seq(-1, 1, length = n)
xx <- rep(xm, n)
yy <- rep(yn, rep(m, n))
tru <- matrix(fs.test(xx, yy), m, n) ## truth
truth <- data.frame(x = xx, y = yy, value = as.vector(tru))

p <- ggplot(truth, aes(x = x, y = y)) +
  geom_raster(aes(fill = value)) +
  geom_contour(aes(z = value), binwidth = 0.5, colour = "white") +
  geom_path(data = as.data.frame(fsb), aes(x = x, y = y)) +
  scale_fill_viridis(na.value = NA) +
  theme(legend.position = "top", legend.key.width = unit(2.5, "cm"))
```

---

# Soap film smooths

---

# Soap films &mdash; wrong way

```r
library("sf")
outline <- read_sf("data/42721_Cosmeston_Lake_lake_polyline.shp")
depth <- read_sf("data/d17_42721_xyz.shp")

come_p <- ggplot(outline) +
  geom_sf() +
  geom_sf(data = depth, aes(colour = depth)) +
  coord_sf() +
  labs(y = "Northing", x = "Easting") +
  scale_color_viridis()
```

---