class: inverse, middle, left, my-title-slide, title-slide

# Estimating trends in messy time series

## — using generalized additive models

### Gavin Simpson

### Statistics & Biostatistics Seminar Series • University of Waterloo • Department of Statistics & Actuarial Science • March 10, 2022

---

### Land acknowledgment

* This work was done at URegina on Treaty 4 lands

### Funding

.row[

.col-6[
.center[![:scale 70%](./resources/NSERC_C.svg)]
]

.col-6[
.center[![:scale 70%](./resources/fgsr-logo.jpg)]
]

]

### Data

* Woolway and colleagues for archiving the lake surface water data
* Lake ice data from [Global Lake and River Ice Phenology Database](http://nsidc.org/data/G01377.html)
* Lake ice study joint work with Stefano Mezzini

<!--
### Slides

* HTML Slide deck [bit.ly/hifmb-talk](http://bit.ly/hifmb-talk) © Simpson (2020) [![Creative Commons Licence](https://i.creativecommons.org/l/by/4.0/88x31.png)](http://creativecommons.org/licenses/by/4.0/)
-->

???

This research was conducted while I was at the University of Regina, which is situated on the territories of the nay-hi-yuh-wuk (Cree; nêhiyawak), uh-nish-i-naa-payk (Salteaux; Anihšināpēk), Dakota, Lakota, and Nakoda, and the homeland of the Métis/Michif Nation.

---
class: inverse
background-image: url('./resources/franki-chamaki-z4H9MYmWIMA-unsplash.jpg')
background-size: cover

# 

.footnote[
<a style="background-color:black;color:white;text-decoration:none;padding:4px 6px;font-family:-apple-system, BlinkMacSystemFont, "San Francisco", "Helvetica Neue", Helvetica, Ubuntu, Roboto, Noto, "Segoe UI", Arial, sans-serif;font-size:12px;font-weight:bold;line-height:1.2;display:inline-block;border-radius:3px" href="https://unsplash.com/@franki?utm_medium=referral&utm_campaign=photographer-credit&utm_content=creditBadge" target="_blank" rel="noopener noreferrer" title="Download free do whatever you want high-resolution photos from Franki Chamaki"><span style="display:inline-block;padding:2px 3px"><svg xmlns="http://www.w3.org/2000/svg" style="height:12px;width:auto;position:relative;vertical-align:middle;top:-2px;fill:white" viewBox="0 0 32 32"><title>unsplash-logo</title><path d="M10 9V0h12v9H10zm12 5h10v18H0V14h10v9h12v-9z"></path></svg></span><span style="display:inline-block;padding:2px 3px">Franki Chamaki</span></a>
]

???

We learn from data because it can highlight our preconceptions and biases

---

# Learning from data

![](index_files/figure-html/lm-plot-1.svg)<!-- -->

???

Learning from data could be as simple as fitting a linear regression model...

---
class: inverse
background-image: url('./resources/deep-learning-turned-up-to-11.jpg')
background-size: contain

# 

???

Or as complex as fitting a sophisticated multi-layered neural network trained on huge datasets or corpora

---

# Learning involves trade-offs

.row[

.col-6[
.center[![:scale 75%](resources/low-bias-low-variance.jpg)]
]

.col-6[
.center[![:scale 75%](resources/interpretable-black-box.jpg)]
]

]

???

Learning from data involves trade-offs

We can have models that fit our data well — low bias — but which are highly variable, or we can fit models that have lower variance, but these tend to have higher bias, i.e. fit the data less well

A linear regression model is very interpretable, but unless the underlying relationship is linear it will fit poorly

Deep learning may fit data incredibly well, but the model is very difficult to interpret and understand

---

# Generalized Additive Models

<br />

![](resources/tradeoff-slider.png)

.references[Source: [GAMs in R by Noam Ross](https://noamross.github.io/gams-in-r-course/)]

???

GAMs are an intermediate-complexity model

* can learn from data without needing to be informed by the user
* remain interpretable because we can visualize the fitted features

---

# HadCRUT4 time series

![](index_files/figure-html/hadcrut-temp-example-1.svg)<!-- -->

---

# Polynomials

![](index_files/figure-html/hadcrut-temp-polynomial-1.svg)<!-- -->

---

# How is a GAM different?

In LM we model the mean of the data as a sum of linear terms:

`$$y_i = \beta_0 + \sum_j \color{red}{ \beta_j x_{ji}} + \epsilon_i$$`

A GAM is a sum of _smooth functions_ or _smooths_

`$$y_i = \beta_0 + \sum_j \color{red}{s_j(x_{ji})} + \epsilon_i$$`

where `\(\epsilon_i \sim N(0, \sigma^2)\)`

---

# More generally

`$$y_i \sim \mathcal{D}(\mu_i, \phi)$$`

`$$g(\mu_i) = \boldsymbol{A}_i\boldsymbol{\gamma} + \sum_j f_j(x_{ji})$$`

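---

# In practice: {mgcv}

A minimal sketch of what this looks like in R with {mgcv}, not the HadCRUT4 model itself; the data are simulated here, and `df`, `x`, and `y` are placeholder names.

```r
library("mgcv")

## simulated example data: a wiggly function plus Gaussian noise
set.seed(42)
df <- data.frame(x = seq(0, 1, length.out = 200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

## linear model: a single straight-line term in x
m_lm <- lm(y ~ x, data = df)

## GAM: replace the linear term with a smooth function of x
m_gam <- gam(y ~ s(x), data = df, method = "REML")

summary(m_gam)
```
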
---
class: inverse middle center subsection

# GAMs use splines

---
class: inverse
background-image: url('resources/wiggly-things.png')
background-size: contain

???

---

# Splines formed from basis functions

![](index_files/figure-html/basis-functions-1.svg)<!-- -->

???

Splines are built up from basis functions

Here I'm showing a cubic regression spline basis with 10 knots/functions

We weight each basis function to get a spline. Here all the basis functions have the same weight so they would fit a horizontal line

---

# Weight basis functions ⇨ spline

.center[![](resources/basis-fun-anim.gif)]

???

But if we choose different weights we get a more wiggly spline

Each of the splines I showed you earlier was generated from the same basis functions, just with different weights

---

# How do GAMs learn from data?

![](index_files/figure-html/example-data-figure-1.svg)<!-- -->

???

How does this help us learn from data?

Here I'm showing a simulated data set, where the data are drawn from the orange function, with noise. We want to learn the orange function from the data

---

# Maximise penalised log-likelihood ⇨ β

.center[![](resources/gam-crs-animation.gif)]

???

Fitting a GAM involves finding the weights for the basis functions that produce a spline that fits the data best, subject to some constraints

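---

# Basis functions in code

A small sketch of the idea behind the previous slides: evaluate a cubic regression spline basis with {mgcv} and form a spline as a weighted sum of the basis functions. The weights below are arbitrary, chosen purely for illustration.

```r
library("mgcv")

## evaluate a 10-function cubic regression spline basis over [0, 1]
xs <- data.frame(x = seq(0, 1, length.out = 200))
sm <- smoothCon(s(x, bs = "cr", k = 10), data = xs, knots = NULL)[[1]]
X  <- sm$X                       # 200 x 10 matrix of basis function evaluations

## a spline is X %*% beta; different weights give differently wiggly curves
set.seed(1)
beta   <- runif(ncol(X), -2, 2)  # arbitrary illustrative weights
spline <- X %*% beta

matplot(xs$x, X, type = "l", lty = 1)  # the basis functions
lines(xs$x, spline, lwd = 3)           # one weighted combination
```
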
---
class: inverse middle center subsection

# Avoid overfitting our sample

---
class: inverse middle center subsection

# Use a wiggliness penalty — avoid fitting too wiggly models

---

# Wiggliness

`$$\int_{\mathbb{R}} [f^{\prime\prime}]^2 dx = \boldsymbol{\beta}^{\mathsf{T}}\mathbf{S}\boldsymbol{\beta} = \large{W}$$`

`\(\mathbf{S}\)` is a *penalty* matrix

`\(\boldsymbol{\beta}^{\mathsf{T}}\mathbf{S}\boldsymbol{\beta}\)` is a *quadratic* penalty

---

# Making wiggliness matter

`\(W\)` measures **wiggliness**

(log) likelihood measures closeness to the data

We use a **smoothing parameter** `\(\lambda\)` to define the trade-off, to find the spline coefficients `\(\boldsymbol{\beta}\)` (and `\(\boldsymbol{\gamma}\)`) that maximize the **penalized** log-likelihood

`$$\mathcal{L}_p = \log(\text{Likelihood}) - \lambda W$$`

---

# HadCRUT4 time series

![](index_files/figure-html/hadcrut-temp-penalty-1.svg)<!-- -->

---

# Picking the right wiggliness

.pull-left[
Two ways to think about how to optimize `\(\lambda\)`:

* Predictive: Minimize out-of-sample error
* Bayesian: Put priors on our basis coefficients
]

.pull-right[
Many methods: AIC, Mallows' `\(C_p\)`, GCV, ML, REML

* **Practically**, use **REML**, because of numerical stability
* Hence `gam(..., method = 'REML')`
]

.center[
![Animation of derivatives](./resources/remlgcv.png)
]

---

# Maximum allowed wiggliness

We set **basis complexity** or "size" `\(k\)`

This is the _maximum wiggliness_; it can be thought of as the number of small functions that make up a curve

Once smoothing is applied, curves have fewer **effective degrees of freedom (EDF)**

EDF < `\(k\)`

---

# Maximum allowed wiggliness

`\(k\)` must be *large enough*; the `\(\lambda\)` penalty does the rest

*Large enough* — the space of functions representable by the basis includes the true function or a close approximation to it

Bigger `\(k\)` increases computational cost

In **mgcv**, default `\(k\)` values are arbitrary — after choosing the model terms, this is the key user choice

**Must be checked!** — `gam.check()`

---

# Knots?

Can avoid the knot placement problem using low-rank thin plate regression splines

---

# Thin plate splines

Default spline in {mgcv} is `bs = "tp"`

Thin plate splines — knot at every unique data value

Penalty on wiggliness (default is 2nd derivative)

As many basis functions as data

This is wasteful

---

# Low rank thin plate splines

Thin plate regression spline

Form the full TPS basis

Take the wiggly basis functions and eigen-decompose them

Concentrates the *information* in the TPS basis into as few new basis functions as possible

Retain the `k` eigenvectors with the `k` largest eigenvalues as the new basis

Fit the GAM with those new basis functions

---

# TPRS basis & penalty

.center[
<img src="index_files/figure-html/unnamed-chunk-1-1.png" width="100%" />
]

---

# CRS basis & penalty

.center[
<img src="index_files/figure-html/unnamed-chunk-2-1.png" width="100%" />
]

---

# Fitted GAM

![](index_files/figure-html/hadcrutemp-fitted-gam-1.svg)<!-- -->

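---

# Smoothness selection in code

A sketch of the previous ideas with {mgcv}, continuing the simulated `df` from the earlier sketch; the basis size `k = 20` is just an illustrative choice.

```r
## k sets the maximum wiggliness; REML estimates the smoothing parameter
m <- gam(y ~ s(x, k = 20), data = df, method = "REML")

m$sp        # estimated smoothing parameter(s), lambda
summary(m)  # the EDF of the smooth will be (much) smaller than k - 1

## is k large enough? small p-values / k-index well below 1 suggest increasing k
gam.check(m)
```
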
---
class: inverse middle center subsection

# Why GAMs?

---

# Why GAMs?

Originally motivated by messy irregularly-spaced time series data

--

.center[
![](/home/au690221/work/presentations/2022/waterloo/slides/resources/frontiers-paper-title.png)<!-- -->
]

Simpson (2018, Frontiers in Ecology & Evolution)

---

# Why GAMs?

Originally motivated by messy irregularly-spaced time series data

.row[

.col-6[

* sediment cores
* samples irregularly spaced in time

]

.col-6[

<img src="/home/au690221/work/presentations/2022/waterloo/slides/resources/tom_core.jpg" width="75%" style="display: block; margin: auto;" />

]

]

---

# Why GAMs?

Originally motivated by messy irregularly-spaced time series data

Interest in the "trends" themselves, not (necessarily) the temporal dependence

(Relatively) easy to fit — user-friendly tools

Familiar

---
class: inverse middle center subsection

# Galveston Bay

---

# Galveston Bay

.row[

.col-6[
Cross Validated question

> I have a dataset of water temperature measurements taken from a large waterbody at irregular intervals over a period of decades. (Galveston Bay, TX if you’re interested)

]

.col-6[
.center[
![](/home/au690221/work/presentations/2022/waterloo/slides/resources/cross-validated.png)<!-- -->
]
]

]

<https://stats.stackexchange.com/q/244042/1390>

---

# Galveston Bay

```
## # A tibble: 15,276 × 13
##    STATION_ID DATE     TIME   LATITUDE LONGITUDE  YEAR MONTH   DAY SEASON
##    <fct>      <chr>    <time>    <dbl>     <dbl> <dbl> <dbl> <dbl> <chr> 
##  1 13296      6/20/91  11:04      29.5     -94.8  1991     6    20 Summer
##  2 13296      3/17/92  09:30      29.5     -94.8  1992     3    17 Spring
##  3 13296      9/23/91  11:24      29.5     -94.8  1991     9    23 Fall  
##  4 13296      9/23/91  11:24      29.5     -94.8  1991     9    23 Fall  
##  5 13296      6/20/91  11:04      29.5     -94.8  1991     6    20 Summer
##  6 13296      12/17/91 10:15      29.5     -94.8  1991    12    17 Winter
##  7 13296      6/29/92  11:17      29.5     -94.8  1992     6    29 Summer
##  8 13305      3/24/87  11:53      29.6     -95.0  1987     3    24 Spring
##  9 13305      4/2/87   13:39      29.6     -95.0  1987     4     2 Spring
## 10 13305      8/11/87  15:25      29.6     -95.0  1987     8    11 Summer
## # … with 15,266 more rows, and 4 more variables: MEASUREMENT <dbl>,
## #   datetime <dttm>, DoY <dbl>, ToD <dbl>
```

---

# Galveston Bay model description

$$
`\begin{align}
\begin{split}
\mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\
    & \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\
    & \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i)
\end{split}
\end{align}`
$$

* `\(\alpha\)` is the model intercept,
* `\(f_1(\text{ToD}_i)\)` is a smooth function of time of day,
* `\(f_2(\text{DoY}_i)\)` is a smooth function of day of year,
* `\(f_3(\text{Year}_i)\)` is a smooth function of year,
* `\(f_4(\text{x}_i, \text{y}_i)\)` is a 2D smooth of longitude and latitude,

---

# Galveston Bay model description

$$
`\begin{align}
\begin{split}
\mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\
    & \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\
    & \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i)
\end{split}
\end{align}`
$$

* `\(f_5(\text{DoY}_i, \text{Year}_i)\)` is a tensor product smooth of day of year and year,
* `\(f_6(\text{x}_i, \text{y}_i, \text{ToD}_i)\)` is a tensor product smooth of location & time of day,
* `\(f_7(\text{x}_i, \text{y}_i, \text{DoY}_i)\)` is a tensor product smooth of location & day of year, and
* `\(f_8(\text{x}_i, \text{y}_i, \text{Year}_i)\)` is a tensor product smooth of location & year

---

# Galveston Bay model description

$$
`\begin{align}
\begin{split}
\mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\
    & \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\
    & \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i)
\end{split}
\end{align}`
$$

Effectively, the first four smooths are the main effects of

1. time of day,
2. season,
3. long-term trend,
4. spatial variation

---

# Galveston Bay model description

$$
`\begin{align}
\begin{split}
\mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\
    & \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\
    & \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i)
\end{split}
\end{align}`
$$

whilst the remaining tensor product smooths model smooth interactions between the stated covariates, which model

5. how the seasonal pattern of temperature varies over time,
6. how the time of day effect varies spatially,
7. how the seasonal effect varies spatially, and
8. how the long-term trend varies spatially

---

# Galveston Bay — full model

```r
knots <- list(DoY = c(0.5, 366.5))
m <- bam(MEASUREMENT ~
             s(ToD, k = 10) +
             s(DoY, k = 12, bs = 'cc') +
             s(YEAR, k = 30) +
             s(LONGITUDE, LATITUDE, k = 100, bs = 'ds', m = c(1, 0.5)) +
             ti(DoY, YEAR, bs = c('cc', 'tp'), k = c(12, 15)) +
             ti(LONGITUDE, LATITUDE, ToD, d = c(2, 1), bs = c('ds', 'tp'),
                m = list(c(1, 0.5), NA), k = c(20, 10)) +
             ti(LONGITUDE, LATITUDE, DoY, d = c(2, 1), bs = c('ds', 'cc'),
                m = list(c(1, 0.5), NA), k = c(25, 12)) +
             ti(LONGITUDE, LATITUDE, YEAR, d = c(2, 1), bs = c('ds', 'tp'),
                m = list(c(1, 0.5), NA), k = c(25, 15)),
         data = galveston, method = 'fREML', knots = knots,
         nthreads = c(4, 1), discrete = TRUE)
```

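---

# Galveston Bay &mdash; predicting

A hedged sketch of how predictions like those in the following plots can be produced from the fitted model `m`; the grid values here are illustrative, not the ones used for the figures.

```r
## predict the long-term trend at one location, holding time of day and
## day of year fixed; column names match those used in the model above
new_df <- expand.grid(YEAR = seq(1990, 2015, by = 0.25),
                      DoY  = 180, ToD = 12,
                      LONGITUDE = -94.8, LATITUDE = 29.5)

p <- predict(m, newdata = new_df, se.fit = TRUE)

new_df <- transform(new_df,
                    fitted = p$fit,
                    lower  = p$fit - 1.96 * p$se.fit,
                    upper  = p$fit + 1.96 * p$se.fit)
```
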
---

# Galveston Bay — full model plot

![](index_files/figure-html/galveston-full-model-plot-1.svg)<!-- -->

---

# Galveston Bay — plot

![](index_files/figure-html/galveston-full-predict-plot-1.svg)<!-- -->

---

# Galveston Bay — plot

.center[![](resources/galveston-animation.gif)]

---

# Galveston Bay — plot trends

![](index_files/figure-html/galveston-trends-by-month-1.svg)<!-- -->

---
class: inverse middle center subsection

# Climate change affecting lake temperatures?

---

# 

.references[
Data: Woolway *et al* (2019) *Climatic Change* **155**, 81–94 [doi: 10/c7z9](http://doi.org/c7z9)
]

![](index_files/figure-html/plot-blelham-1.svg)<!-- -->

---
class: inverse middle center subsection

# Why worry about minimum temperatures?

---

# Why worry about minimum temperatures?

Annual minimum temperature is a strong control on many in-lake processes (e.g. Hampton *et al* 2017)

Extreme events can have long-lasting effects on lake ecology — mild winter in Europe 2006–7 (e.g. Straile *et al* 2010)

Reduction in habitat or refugia for cold-adapted species

* Arctic charr (*Salvelinus alpinus*)
* Opossum shrimp (*Mysis salemaai*)

.references[
Hampton *et al* (2017). Ecology under lake ice. *Ecology Letters* **20**, 98–111. [doi: 10/f3tpzh](http://doi.org/f3tpzh)

Straile *et al* (2010). Effects of a half a millennium winter on a deep lake — a shape of things to come? *Global Change Biology* **16**, 2844–2856. [doi: 10/bx6t4d](http://doi.org/bx6t4d)
]

---

# Multiple time series ⇨ HGAM

![](index_files/figure-html/plot-all-lakes-1.svg)<!-- -->

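---

# HGAMs in code

A minimal sketch of a hierarchical GAM for several lakes, assuming a hypothetical data frame `lakes` with columns `temperature`, `year`, and a factor `lake`; the basis sizes are illustrative.

```r
## a common smooth trend in year, plus lake-specific smooth deviations from it;
## the factor-smooth ("fs") basis gives each lake its own curve with a shared
## smoothness penalty
m_hgam <- gam(temperature ~ s(year, k = 20) +
                  s(year, lake, bs = "fs", k = 10),
              data = lakes, method = "REML")
```
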
---
class: inverse center middle large-subsection

# Central Limit Theorem

???

The central limit theorem shows us that the Gaussian or normal distribution is the sampling distribution for many sample statistics, including sample means, as sample sizes become large

The central limit theorem underlies much of the theory that justifies the statistics you learn about in your statistics courses, and supports the use of the Gaussian or normal distribution

---

# Annual minimum temperature

![](index_files/figure-html/plot-minima-1.svg)<!-- -->

---
class: inverse middle center large-subsection

# Block Minima

---

# Fisher–Tippett–Gnedenko theorem

> The **maximum** of a sample of *iid* random variables after proper renormalization can only converge in distribution to one of three possible distributions: the *Gumbel* distribution, the *Fréchet* distribution, or the *Weibull* distribution.

.row[

.col-4[
.center[![:scale 90%](./resources/fisher.jpg)
.smaller[Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:%E0%A4%B0%E0%A5%8B%E0%A4%A8%E0%A4%BE%E0%A4%A1%E0%A5%8D.jpg)]]
]

.col-4[
.center[![:scale 84%](./resources/tippett.jpg)]
.smaller[Source: [ral.ucar.edu](https://ral.ucar.edu/projects/extremes/Extremes/history.html)]
]

.col-4[
.center[![:scale 93%](./resources/gnedenko.png)]
.smaller[Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Gnedenko.png)]
]

]

---
class: inverse middle center large-subsection

# Block Minima…?

---

# Negate the minima ⇨ maxima

![](index_files/figure-html/plot-minima-negated-1.svg)<!-- -->

---
class: inverse middle center large-subsection

# Three Distributions…?

---

# Generalised extreme value distribution

In 1978 Daniel McFadden demonstrated the common functional form for all three distributions — the GEVD

`$$G(y) = \exp \left \{ - \left [ 1 + \xi \left ( \frac{y - \mu}{\sigma} \right ) \right ]_{+}^{-1/\xi} \right \}$$`

.row[

.col-5[
Three parameters to estimate

* location `\(\mu\)`,
* scale `\(\sigma\)`, and
* shape `\(\xi\)`
]

.col-7[
Three distributions

* Gumbel distribution when `\(\xi\)` = 0,
* Fréchet distribution when `\(\xi\)` > 0, &
* Weibull distribution when `\(\xi\)` < 0
]

]

---
class: inverse middle center subsection

# Fit Distributional GAM using GEV for response

---
class: inverse middle center subsection

# Model μ, σ, ξ with smooths of Year

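---

# GEV GAM in code

One way to fit this kind of distributional GAM in R is with {mgcv}'s `gevlss()` family; this is a hedged sketch, not necessarily the exact model behind the following figures. Here `minima` is a hypothetical data frame of negated annual minima (`neg_min`) by `year`, and the basis sizes are illustrative.

```r
## three linear predictors, one per GEV parameter, each a smooth of year:
## location (mu), the (log) scale, and the shape (xi)
m_gev <- gam(list(neg_min ~ s(year, k = 15),  # location
                          ~ s(year, k = 10),  # scale
                          ~ s(year, k = 5)),  # shape
             data = minima, family = gevlss(), method = "REML")
```
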
src="resources/hpam-change-in-dates.svg" width="550" height="550" /> ] --- # Summary GAMs & penalized splines are a good choice for modelling messy spatio-temporal data Models are Bayesian (empirical or fully Bayes) — posterior Penalised splines are in a wide range of software * *mgcv*, *brms*, *bamlss*, *BayesX*, *JAGS* Can handle relatively large data: `mgcv::bam()` Tools for evaluating fitted models * *mgcViz*, *gratia*