Applying statistical thinking to palaeo data through GAMs

class: inverse, middle, left, my-title-slide, title-slide

.title[
# Applying statistical thinking to palaeo data through GAMs
]
.author[
### Gavin Simpson
]
.institute[
### Aarhus University
]
.date[
### July 19, 2023
]

---

.row[

.col-6[
### Slides & code

[bit.ly/inqua-talk-2023](https://bit.ly/inqua-talk-2023)

&copy; Simpson (2020&ndash;2023) [![Creative Commons Licence](https://i.creativecommons.org/l/by/4.0/88x31.png)](http://creativecommons.org/licenses/by/4.0/)
]

.col-6[

.center[

<img src="resources/bit.ly_inqua-talk-2023.png" width="60%" style="display: block; margin: auto;" />
]

]
]

### Land acknowledgment

* This work was started at University of Regina, on Treaty 4 lands

### Funding

.row[

.col-4[
.center[![:scale 80%](./resources/NSERC_C.svg)]
]

.col-4[
.center[![:scale 100%](./resources/fgsr-logo.jpg)]
]

.col-4[
.center[![:scale 100%](./resources/AUFF_logo_en.png)]
]

]

???

Thanks to Quinn and Jack for inviting me to give this talk

All my slides and code are on GitHub. The QR code here will take you to the slides - it will be shown again at the end as well. There will be other QR codes in the slides that will take you to extra information that I don't have time to cover today

This research was started while I was at the University of Regina, which is situated on the territories of the nay-hi-yuh-wuk (Cree; nêhiyawak), uh-nish-i-naa-payk (Salteaux; Anihšināpēk), Dakota, Lakota, and Nakoda, and the homeland of the Métis/Michif Nation.

My work has been funded by NSERC, Faculty of Graduate Studies & Research at University of Regina, and Aarhus University Research Fund

---
class: inverse
background-image: url("resources/PS1920-1_0-750_sediment-core_hg.jpg")
background-position: right center
background-size: contain

# Statistical thinking & palaeo

Palaeo data are challenging

Important to grapple with these "problems"

Approaching data from the point of view of modelling process in our systems statistically

Estimate quantities of interest & quantify their uncertainty

.footnote[
    Image: Hannes Grobe, [CC BY 3.0](https://creativecommons.org/licenses/by/3.0), via Wikimedia Commons
]

???

We all know palaeo data are challenging to analyse, for a whole host of reasons & they don't play well with formal statistical time series methods

However, it is important to grapple with these problems and address them statistically

When I talk about statistical thinking and applying it to palaeo data, I'm thinking about having researchers approach the data from the view point of modelling them statistically, and using statistical models to estimate quantities of interest from a statistical model

---
class: inverse middle center subsection

# Generalized additive models

???

One such statistical model that I have found useful is the Generalized additive model; 
for example, they trivially handle the irregularity in time of most sequences

GAMs are a broad model class that are suited to modelling palaeo data

GAMs are models that represent the effects of variables on the response using smooth functions

---

# GAMs are built from splines

.center[![](resources/spline-anim.gif)]

???

GAMs use splines for the smooth functions

---

# Splines are formed from basis functions

![](index_files/figure-html/basis-functions-1.svg)

???

Splines are built up from basis functions

Here I'm showing a cubic regression spline basis with 10 knots/functions

If we weight each basis function and sum them up, we get a spline

---

# Weight basis functions &#8680; spline

.center[![](resources/basis-fun-anim.gif)]

???

If we choose different weights we get more wiggly splines

Each of the splines I showed you earlier are all generated from the same basis functions but using different weights

---

# Maximise penalised log-likelihood &#8680; &beta;

.center[![](resources/gam-crs-animation.gif)]

???

Fitting a GAM to data involves finding the weights for the basis functions that produce a spline that fits the data best, subject to some constraints to avoid over fitting

---
class: inverse middle center subsection

# Beyond the mean

---
background-image: url("resources/7sq9ip.jpg")
background-position: right center
background-size: auto

# Distributional GAMs

.row[

.col-7[

We mostly model the mean

But other parameters are important

...and of ecological interest

Also a way to handle the varying amount of time averaging (precision) in non-annually laminated sediments

]

.col-5[
&nbsp;
]

]

---

# Variance

Intuitive & key descriptor of ecosystem state

.row[
.col-6[
.center[
<img src="./resources/scheffer-2012-high-resilience.jpg" width="828" />
]
]

.col-6[
.center[
<img src="./resources/scheffer-2012-low-resilience.jpg" width="828" />
]
]
.col-12[
.small[Source: Scheffer *et al* Science (2012)]
]
]

???

The variance is an important measure of ecosystem state

Ecologists have linked variance with coexistence of populations and variance can be interpreted in terms of resilience theory

Here I'm showing two cartoons;

A: in the high resilience state the ecosystem returns rapidly to the equilibrium following perturbation and hence has low variance

B: in the low resilience state, where the basin of attraction has become shallower, the ecosystem takes longer to return to equilibrium following perturbation and hence exhibits high variance

Variance, however, is much harder to estimate from data

---

# Distributional GAMs

.row[

.col-6[

See my talk @ INQUA 2019 in Dublin
]

.col-6[
<img src="resources/bit.ly_inquaresilience.png" width="960" />
]

]

---
class: inverse middle center subsection

# Rates of change

---

# Rates of change

Knowing how quickly systems changed in the past can be used a to put current or future rates of change in context

Long history of RoC analysis in palaeo (Birks & Gordon, 1985) & recent updates ([RRatepol 📦 Ondřej Mottl](https://doi.org/10.1016/j.revpalbo.2021.104483))

An alternative is to use the _derivatives_ of temporal smooths

???

Most options are not grounded in a statistical model &mdash; making inference difficult

---

# Finite differences

We don't have an equation for splines, so mathy derivatives are out&hellip;

But we can use _finite differences_ to estimate them

???

We do actually have an equation but we'd need to figure it out and that is tedious and hard

---

# Rate of change

For a single variable, this is relatively simple

[Taricco, _et al_ (2016) _Scientific Data_ **3**, 160042](https://doi.org/10.1038/sdata.2016.42)

???

Gulf of Taranto (Ionian Sea)

Oxygen isotope composition δ18O of planktonic foraminifera in one of the cores extracted from the Gallipoli Terrace

---

# Estimate the trend

![](index_files/figure-html/display-foram-gam-1.svg)

---

# Compute derivatives = RoC

![](index_files/figure-html/display-foram-derivs-1.svg)

???

Above the 0 line d18O is increasingm below the line it is decreasing

We can compute the uncertainty in the derivative (RoC) that arises from the uncertainty in the estimated trend

Where the uncertainty band on the derivative excludes 0 we might also infer the istopes are changing in a way that is inconsistent with no change

---

# Compositional rate of change

Compositional data are counts from a total &mdash; closed compositional data

Can't use a Hierarchical GAM for these data

Multinomial or Dirichlet models would be the correct statsy way to go

But the number of taxa is problematic

???

Doing this with compositional data is a bit more difficult

Because the data are close compositional due to the fixed sample total count we can't use hierarchical GAMs

Fitting a multinomial or Dirichlet GAM would be the way to go, but we typically have too many taxa

---

# Dimension reduction

Need dimension reduction, but not PCA or CA etc

We don't want to break non-linear trends over multiple axes

Use a topic model, which

1. finds groups of taxa that tend to occur together &mdash; *associations*
2. models each sample as different proportions of the *associations*

More ecologically realistic

???

So we need to reduce the dimensionalty in some way

PCA and CA are not useful as they split up trends into separate, uncorrelated components

Instead we could use a topic model

After application of the topic model we are typically fitting a Dirichlet model to the proportions of a few (<= 10) associations of species

---

# Topic models &mdash; moar

PaleoEcoGen Seminar 2022 [slides](https://bit.ly/paleoecogen-gavin) & [recording](https://bit.ly/paleoecogen-video)

.row[
.col-6[
<img src="resources/bit.ly_paleoecogen-gavin.png" width="960" />
]

.col-6[
<img src="resources/bit.ly_paleoecogen-video.png" width="960" />

]
]

???

I don't have time to explain topic models here, but I recently gave in the PAGES PaleoEcoGen seminar series that includes an explanation of them. The QR codes are links to the slides and the zoom recording of that talk

---

# Dirichlet GAM

.row[

.col-6[
% sand, silt clay

Such data lie on a simplex; `$k$` classes have `$k-1$` dimensions

Fitted using brms 📦 & Stan
]

.col-6[
<img src="resources/page1-1500px-Dirichlet.pdf.jpg" width="2000" />
]
]

.small[

```
##              1         2            3            4            5            6
## 1 0.0001269863 0.7735201 0.1929165150 0.0331824691 0.0001269863 0.0001269862
## 2 0.0001229692 0.7295627 0.2575480141 0.0125203925 0.0001229692 0.0001229691
## 3 0.0001198470 0.9130680 0.0809658414 0.0056065835 0.0001198469 0.0001198469
## 4 0.0001056662 0.8557656 0.1438117623 0.0001056661 0.0001056660 0.0001056660
## 5 0.0001294743 0.9895075 0.0001294743 0.0099746407 0.0001294742 0.0001294742
## 6 0.0001267428 0.9451821 0.0543109606 0.0001267427 0.0001267427 0.0001267426
```
]

???

Having reduced the many taxa to a few associations, we model their proportions with a Dirichlet GAM

The classic example of Dirichlet distributed data is % sand, silt, and clay. You only need two of these to deduce all three proportions

Technically we say such data (samples) lie on a simplex with 1 fewer dimensions than "types"

---

# Abernethy Forest

Birks & Mathewes (1978) _New Phytologist_ **80**, 455-484.

???

To illustrate the approach I'll use the classic late glacial pollen data from Abernethy Forest, Scotland, of Birks and Mathewes

---

# Trends in species associations

![](index_files/figure-html/plot-dirichlet-trends-1.svg)

???

Firstly, we estimate the trends in the relative composition of each association of taxa using the Dirichlet GAM

---

# Compute derivatives of trends

![](index_files/figure-html/plot-dirichlet-derivatives-1.svg)

???

Then we compute the derivatives of those trends using finite differences

---

# Take the absolute values |d|

![](index_files/figure-html/plot-dirichlet-abs-derivatives-1.svg)

???

We take the absolute value of the derivatve as decreases are as important as increases

---

# Sum |d| at each time point - RoC

![](index_files/figure-html/plot-dirichlet-roc-1.svg)

???

And then we sum the absolute derivative to get our estimate of the rate of change or turnover
---

![](index_files/figure-html/compare-with-rratepol-1.svg)

???

If we compare with a basic run of RRatepol, we see similar features in the rate of change, but there are differences in some of the detail that are worthy of further investigation

---
class: inverse middle center subsection

# Correlating time series

???

The last example I want to show is how we can use GAMs to estimate the correlation between time series

---

# How correlated are variables in time?

![](index_files/figure-html/copula-gamlss-load-plot-data-1.svg)

???

Here I'm showing a stable isotope record of 13C and 15N in bulk organic matter from Small Water, a small corrie lake on the edge of the English Lake District

We were interested in investigating biogeochemical changes due to atmospheric deposition over the last couple of hundred years

---

# Copulas

Copula distributional GAM

Copulas bind two models/distributions together

Define a general way to think about dependence between random variables

Starting to be used in ecology:

* Popovic, Hui, Warton, 2018. J. Multivar. Anal. 165, 86–100. doi: 10/dzx9
* Anderson et al 2019. Ecol. Evol. 44, 182. doi: 10/dzzb

???

To correlate the two records we need to model them and their correlation over time

We will use a copula distributional GAM for this

The new bit, the copula, is a function that binds distributions or models together in a way that estimates the dependence between the two data series

Copulas are a general way to think about the dependence between variables that extends the idea of bivariate normality and Pearson's product moment correlation

---

# Examples of copulas

![](index_files/figure-html/copula-examples-1.svg)

???

Here I am showing different copulas to link two normally distributed data sets

The Normal and t copula are familiar and relate directly to the correlation coefficients

The Clayton and Gumbel copulas are different; they model differences in tail dependences, where the data can be uncorrelated in one tail and correlated in the other

---

# Copula distributional GAMs

Model the parameters (mean and variance) of each stable isotope time series, plus the copula parameter `$\theta$` as an extra linear predictor

.small[
`$$\begin{align}
F(y_{\mathsf{\delta^{15}N}_{i}}, y_{\mathsf{\delta^{13}C}_{i}} | \vartheta^{k} ) & = \mathcal{C}(F_{\mathsf{\delta^{15}N}_{i}}(y_{\mathsf{\delta^{15}N}_{i}} | \mu_{\mathsf{\delta^{15}N}_{i}}, \sigma_{\mathsf{\delta^{15}N}_{i}}), F_{\mathsf{\delta^{13}C}_{i}}(y_{\mathsf{\delta^{13}C}_{i}} | \mu_{\mathsf{\delta^{13}C}_{i}}, \sigma_{\mathsf{\delta^{13}C}_{i}}), \theta) \\
y_{\mathsf{\delta^{15}N}_{i}} & \sim \mathcal{N}(\mu_{\mathsf{\delta^{15}N}_{i}}, \sigma_{\mathsf{\delta^{15}N}_{i}}) \\
y_{\mathsf{\delta^{13}C}_{i}} & \sim \mathcal{N}(\mu_{\mathsf{\delta^{13}C}_{i}}, \sigma_{\mathsf{\delta^{13}C}_{i}})
\end{align}$$`
]

`$$\begin{align}
\log(\mu_{\mathsf{\delta^{15}N}_{i}}) & = \beta^{\mu_{\mathsf{\delta^{15}N}}}_0 + f^{\mu_{\mathsf{\delta^{15}N}}}(\text{Year}_i) \\
\log(\mu_{\mathsf{\delta^{13}C}_{i}}) & = \beta^{\mu_{\mathsf{\delta^{13}C}}}_0 + f^{\mu_{\mathsf{\delta^{13}C}}}(\text{Year}_i) \\
\log(\sigma_{\mathsf{\delta^{15}N}_{i}}) & = \beta^{\sigma_{\mathsf{\delta^{15}N}}}_0 \\
\log(\sigma_{\mathsf{\delta^{13}C}_{i}}) & = \beta^{\sigma_{\mathsf{\delta^{13}C}}}_0 \\
g(\theta_i)   & = \beta^{\theta}_0 + f^{\theta}(\text{Year}_i)
\end{align}$$`

???

This is the nastiest slides I'll show, but the important point is in these lines at the bottom

We note that I am modelling the means of both records with smooths of time, that I am assuming no change in variance (precision), and most importantly for this example, that I am modelling the correlation between to two series using a smooth function of time

---

# Copulas &mdash; AIC

.row[

.col-6[

Fit a range of copula types

* Bivariate normal
* Frank
* Ali-Mikhail-Haq
* Farlie-Gumbel-Morgenstern
* Plackett
* Student t
* Hougaard
* Clayton (0, 270)
* Gumbel (0, 90, 180, 270)
* Joe (0, 90)

AIC to select best-fitting model

]

.col-6[

.small[

|model      |       df|       AIC|
|:----------|--------:|---------:|
|m_cop_G270 | 17.41895| -24.21843|
|m_cop_G90  | 17.50362| -22.18304|
|m_cop_N    | 18.47071| -21.46334|
|m_cop_F    | 18.16605| -20.85806|
|m_cop_PL   | 17.62977| -20.79832|
|m_cop_T    | 18.39205| -16.31174|
|m_cop_AMH  | 17.64876| -14.93747|
|m_cop_FGM  | 17.49500| -14.81073|
|m_cop_C270 | 17.31024| -13.66620|
|m_cop_J90  | 17.24088| -13.40605|
]

]

???

There are many kinds of copula to choose from and selecting just one that matches your data makes model selection more difficult

Instead we fit many different kinds of copula and choose the one (or several) that are most compatible with the data, say using AIC

Here we see that Gumbel copula (rotated 270 degrees) gives the best fit, but several other copulas do similarly

---

# Estimated correlation

![](index_files/figure-html/plot-copula-1.svg)

???

We modelled `$\theta$` but is is easier to convert this to equivalent Kendall's `$\tau$`

Here I'm showing the results from the top few copula models, which all suggest the same thing, that the two series become gradually uncorrelated over time, which we would interpret as indications that the two major biogeochemical cycles in the lake became decoupled due to the input of nitrogen to the ctachment from atmospheric deposition

---

# Summary I

GAMs & their extensions are one way to adress many common questions asked of palaeo data

GAMs are relatively simple & familiar

Incredibly powerful models but remain interprettable

---

# Summary II

Don't be afraid to think beyond the usual quantitiative methods

Reach out to colleagues for help, collaborate, etc.

As a community we need to take statistical training more seriously for our ECRs

---

# Thank you & More 📷

📨 gavin@anivet.au.dk

💻 [github.com/gavinsimpson/inqua23](https://github.com/gavinsimpson/inqu23)

.center[

<img src="resources/bit.ly_inqua-talk-2023.png" width="45%" style="display: block; margin: auto;" />
]