class: inverse, middle, left, my-title-slide, title-slide .title[ # Applying statistical thinking to palaeo data through GAMs ] .author[ ### Gavin Simpson ] .institute[ ### Aarhus University ] .date[ ### July 19, 2023 ] --- .row[ .col-6[ ### Slides & code [bit.ly/inqua-talk-2023](https://bit.ly/inqua-talk-2023) © Simpson (2020–2023) [![Creative Commons Licence](https://i.creativecommons.org/l/by/4.0/88x31.png)](http://creativecommons.org/licenses/by/4.0/) ] .col-6[ .center[ <img src="resources/bit.ly_inqua-talk-2023.png" width="60%" style="display: block; margin: auto;" /> ] ] ] ### Land acknowledgment * This work was started at University of Regina, on Treaty 4 lands ### Funding .row[ .col-4[ .center[![:scale 80%](./resources/NSERC_C.svg)] ] .col-4[ .center[![:scale 100%](./resources/fgsr-logo.jpg)] ] .col-4[ .center[![:scale 100%](./resources/AUFF_logo_en.png)] ] ] ??? Thanks to Quinn and Jack for inviting me to give this talk All my slides and code are on GitHub. The QR code here will take you to the slides - it will be shown again at the end as well. There will be other QR codes in the slides that will take you to extra information that I don't have time to cover today This research was started while I was at the University of Regina, which is situated on the territories of the nay-hi-yuh-wuk (Cree; nêhiyawak), uh-nish-i-naa-payk (Salteaux; Anihšināpēk), Dakota, Lakota, and Nakoda, and the homeland of the Métis/Michif Nation. My work has been funded by NSERC, Faculty of Graduate Studies & Research at University of Regina, and Aarhus University Research Fund --- class: inverse background-image: url("resources/PS1920-1_0-750_sediment-core_hg.jpg") background-position: right center background-size: contain # Statistical thinking & palaeo Palaeo data are challenging Important to grapple with these "problems" Approaching data from the point of view of modelling process in our systems statistically Estimate quantities of interest & quantify their uncertainty .footnote[ Image: Hannes Grobe, [CC BY 3.0](https://creativecommons.org/licenses/by/3.0), via Wikimedia Commons ] ??? We all know palaeo data are challenging to analyse, for a whole host of reasons & they don't play well with formal statistical time series methods However, it is important to grapple with these problems and address them statistically When I talk about statistical thinking and applying it to palaeo data, I'm thinking about having researchers approach the data from the view point of modelling them statistically, and using statistical models to estimate quantities of interest from a statistical model --- class: inverse middle center subsection # Generalized additive models ??? One such statistical model that I have found useful is the Generalized additive model; for example, they trivially handle the irregularity in time of most sequences GAMs are a broad model class that are suited to modelling palaeo data GAMs are models that represent the effects of variables on the response using smooth functions --- # GAMs are built from splines .center[![](resources/spline-anim.gif)] ??? GAMs use splines for the smooth functions --- # Splines are formed from basis functions ![](index_files/figure-html/basis-functions-1.svg)<!-- --> ??? Splines are built up from basis functions Here I'm showing a cubic regression spline basis with 10 knots/functions If we weight each basis function and sum them up, we get a spline --- # Weight basis functions ⇨ spline .center[![](resources/basis-fun-anim.gif)] ??? If we choose different weights we get more wiggly splines Each of the splines I showed you earlier are all generated from the same basis functions but using different weights --- # Maximise penalised log-likelihood ⇨ β .center[![](resources/gam-crs-animation.gif)] ??? Fitting a GAM to data involves finding the weights for the basis functions that produce a spline that fits the data best, subject to some constraints to avoid over fitting --- class: inverse middle center subsection # Beyond the mean --- background-image: url("resources/7sq9ip.jpg") background-position: right center background-size: auto # Distributional GAMs .row[ .col-7[ We mostly model the mean But other parameters are important ...and of ecological interest Also a way to handle the varying amount of time averaging (precision) in non-annually laminated sediments ] .col-5[ ] ] --- # Variance Intuitive & key descriptor of ecosystem state .row[ .col-6[ .center[ <img src="./resources/scheffer-2012-high-resilience.jpg" width="828" /> ] ] .col-6[ .center[ <img src="./resources/scheffer-2012-low-resilience.jpg" width="828" /> ] ] .col-12[ .small[Source: Scheffer *et al* Science (2012)] ] ] ??? The variance is an important measure of ecosystem state Ecologists have linked variance with coexistence of populations and variance can be interpreted in terms of resilience theory Here I'm showing two cartoons; A: in the high resilience state the ecosystem returns rapidly to the equilibrium following perturbation and hence has low variance B: in the low resilience state, where the basin of attraction has become shallower, the ecosystem takes longer to return to equilibrium following perturbation and hence exhibits high variance Variance, however, is much harder to estimate from data --- # Distributional GAMs .row[ .col-6[ See my talk @ INQUA 2019 in Dublin ] .col-6[ <img src="resources/bit.ly_inquaresilience.png" width="960" /> ] ] --- class: inverse middle center subsection # Rates of change --- # Rates of change Knowing how quickly systems changed in the past can be used a to put current or future rates of change in context Long history of RoC analysis in palaeo (Birks & Gordon, 1985) & recent updates ([RRatepol 📦 Ondřej Mottl](https://doi.org/10.1016/j.revpalbo.2021.104483)) An alternative is to use the _derivatives_ of temporal smooths ??? Most options are not grounded in a statistical model — making inference difficult --- # Finite differences We don't have an equation for splines, so mathy derivatives are out… But we can use _finite differences_ to estimate them <img src="index_files/figure-html/finite-differences-plot-1.svg" width="90%" style="display: block; margin: auto;" /> ??? We do actually have an equation but we'd need to figure it out and that is tedious and hard --- # Rate of change For a single variable, this is relatively simple <img src="index_files/figure-html/display-foram-data-1.svg" width="80%" style="display: block; margin: auto;" /> [Taricco, _et al_ (2016) _Scientific Data_ **3**, 160042](https://doi.org/10.1038/sdata.2016.42) ??? Gulf of Taranto (Ionian Sea) Oxygen isotope composition δ18O of planktonic foraminifera in one of the cores extracted from the Gallipoli Terrace --- # Estimate the trend ![](index_files/figure-html/display-foram-gam-1.svg)<!-- --> --- # Compute derivatives = RoC ![](index_files/figure-html/display-foram-derivs-1.svg)<!-- --> ??? Above the 0 line d18O is increasingm below the line it is decreasing We can compute the uncertainty in the derivative (RoC) that arises from the uncertainty in the estimated trend Where the uncertainty band on the derivative excludes 0 we might also infer the istopes are changing in a way that is inconsistent with no change --- # Compositional rate of change Compositional data are counts from a total — closed compositional data Can't use a Hierarchical GAM for these data Multinomial or Dirichlet models would be the correct statsy way to go But the number of taxa is problematic ??? Doing this with compositional data is a bit more difficult Because the data are close compositional due to the fixed sample total count we can't use hierarchical GAMs Fitting a multinomial or Dirichlet GAM would be the way to go, but we typically have too many taxa --- # Dimension reduction Need dimension reduction, but not PCA or CA etc We don't want to break non-linear trends over multiple axes Use a topic model, which 1. finds groups of taxa that tend to occur together — *associations* 2. models each sample as different proportions of the *associations* More ecologically realistic ??? So we need to reduce the dimensionalty in some way PCA and CA are not useful as they split up trends into separate, uncorrelated components Instead we could use a topic model -- After application of the topic model we are typically fitting a Dirichlet model to the proportions of a few (<= 10) associations of species --- # Topic models — moar PaleoEcoGen Seminar 2022 [slides](https://bit.ly/paleoecogen-gavin) & [recording](https://bit.ly/paleoecogen-video) .row[ .col-6[ <img src="resources/bit.ly_paleoecogen-gavin.png" width="960" /> ] .col-6[ <img src="resources/bit.ly_paleoecogen-video.png" width="960" /> ] ] ??? I don't have time to explain topic models here, but I recently gave in the PAGES PaleoEcoGen seminar series that includes an explanation of them. The QR codes are links to the slides and the zoom recording of that talk --- # Dirichlet GAM .row[ .col-6[ % sand, silt clay Such data lie on a simplex; `\(k\)` classes have `\(k-1\)` dimensions Fitted using brms 📦 & Stan ] .col-6[ <img src="resources/page1-1500px-Dirichlet.pdf.jpg" width="2000" /> ] ] .small[ ``` ## 1 2 3 4 5 6 ## 1 0.0001269863 0.7735201 0.1929165150 0.0331824691 0.0001269863 0.0001269862 ## 2 0.0001229692 0.7295627 0.2575480141 0.0125203925 0.0001229692 0.0001229691 ## 3 0.0001198470 0.9130680 0.0809658414 0.0056065835 0.0001198469 0.0001198469 ## 4 0.0001056662 0.8557656 0.1438117623 0.0001056661 0.0001056660 0.0001056660 ## 5 0.0001294743 0.9895075 0.0001294743 0.0099746407 0.0001294742 0.0001294742 ## 6 0.0001267428 0.9451821 0.0543109606 0.0001267427 0.0001267427 0.0001267426 ``` ] ??? Having reduced the many taxa to a few associations, we model their proportions with a Dirichlet GAM The classic example of Dirichlet distributed data is % sand, silt, and clay. You only need two of these to deduce all three proportions Technically we say such data (samples) lie on a simplex with 1 fewer dimensions than "types" --- # Abernethy Forest <img src="index_files/figure-html/abernethy-strat-plot-1.svg" width="90%" style="display: block; margin: auto;" /> Birks & Mathewes (1978) _New Phytologist_ **80**, 455-484. ??? To illustrate the approach I'll use the classic late glacial pollen data from Abernethy Forest, Scotland, of Birks and Mathewes --- # Trends in species associations ![](index_files/figure-html/plot-dirichlet-trends-1.svg)<!-- --> ??? Firstly, we estimate the trends in the relative composition of each association of taxa using the Dirichlet GAM --- # Compute derivatives of trends ![](index_files/figure-html/plot-dirichlet-derivatives-1.svg)<!-- --> ??? Then we compute the derivatives of those trends using finite differences --- # Take the absolute values |d| ![](index_files/figure-html/plot-dirichlet-abs-derivatives-1.svg)<!-- --> ??? We take the absolute value of the derivatve as decreases are as important as increases --- # Sum |d| at each time point - RoC ![](index_files/figure-html/plot-dirichlet-roc-1.svg)<!-- --> ??? And then we sum the absolute derivative to get our estimate of the rate of change or turnover --- # ![](index_files/figure-html/compare-with-rratepol-1.svg)<!-- --> ??? If we compare with a basic run of RRatepol, we see similar features in the rate of change, but there are differences in some of the detail that are worthy of further investigation --- class: inverse middle center subsection # Correlating time series ??? The last example I want to show is how we can use GAMs to estimate the correlation between time series --- # How correlated are variables in time? ![](index_files/figure-html/copula-gamlss-load-plot-data-1.svg)<!-- --> ??? Here I'm showing a stable isotope record of 13C and 15N in bulk organic matter from Small Water, a small corrie lake on the edge of the English Lake District We were interested in investigating biogeochemical changes due to atmospheric deposition over the last couple of hundred years --- # Copulas Copula distributional GAM Copulas bind two models/distributions together Define a general way to think about dependence between random variables Starting to be used in ecology: * Popovic, Hui, Warton, 2018. J. Multivar. Anal. 165, 86–100. doi: 10/dzx9 * Anderson et al 2019. Ecol. Evol. 44, 182. doi: 10/dzzb ??? To correlate the two records we need to model them and their correlation over time We will use a copula distributional GAM for this The new bit, the copula, is a function that binds distributions or models together in a way that estimates the dependence between the two data series Copulas are a general way to think about the dependence between variables that extends the idea of bivariate normality and Pearson's product moment correlation --- # Examples of copulas ![](index_files/figure-html/copula-examples-1.svg)<!-- --> ??? Here I am showing different copulas to link two normally distributed data sets The Normal and t copula are familiar and relate directly to the correlation coefficients The Clayton and Gumbel copulas are different; they model differences in tail dependences, where the data can be uncorrelated in one tail and correlated in the other --- # Copula distributional GAMs Model the parameters (mean and variance) of each stable isotope time series, plus the copula parameter `\(\theta\)` as an extra linear predictor .small[ `$$\begin{align} F(y_{\mathsf{\delta^{15}N}_{i}}, y_{\mathsf{\delta^{13}C}_{i}} | \vartheta^{k} ) & = \mathcal{C}(F_{\mathsf{\delta^{15}N}_{i}}(y_{\mathsf{\delta^{15}N}_{i}} | \mu_{\mathsf{\delta^{15}N}_{i}}, \sigma_{\mathsf{\delta^{15}N}_{i}}), F_{\mathsf{\delta^{13}C}_{i}}(y_{\mathsf{\delta^{13}C}_{i}} | \mu_{\mathsf{\delta^{13}C}_{i}}, \sigma_{\mathsf{\delta^{13}C}_{i}}), \theta) \\ y_{\mathsf{\delta^{15}N}_{i}} & \sim \mathcal{N}(\mu_{\mathsf{\delta^{15}N}_{i}}, \sigma_{\mathsf{\delta^{15}N}_{i}}) \\ y_{\mathsf{\delta^{13}C}_{i}} & \sim \mathcal{N}(\mu_{\mathsf{\delta^{13}C}_{i}}, \sigma_{\mathsf{\delta^{13}C}_{i}}) \end{align}$$` ] `$$\begin{align} \log(\mu_{\mathsf{\delta^{15}N}_{i}}) & = \beta^{\mu_{\mathsf{\delta^{15}N}}}_0 + f^{\mu_{\mathsf{\delta^{15}N}}}(\text{Year}_i) \\ \log(\mu_{\mathsf{\delta^{13}C}_{i}}) & = \beta^{\mu_{\mathsf{\delta^{13}C}}}_0 + f^{\mu_{\mathsf{\delta^{13}C}}}(\text{Year}_i) \\ \log(\sigma_{\mathsf{\delta^{15}N}_{i}}) & = \beta^{\sigma_{\mathsf{\delta^{15}N}}}_0 \\ \log(\sigma_{\mathsf{\delta^{13}C}_{i}}) & = \beta^{\sigma_{\mathsf{\delta^{13}C}}}_0 \\ g(\theta_i) & = \beta^{\theta}_0 + f^{\theta}(\text{Year}_i) \end{align}$$` ??? This is the nastiest slides I'll show, but the important point is in these lines at the bottom We note that I am modelling the means of both records with smooths of time, that I am assuming no change in variance (precision), and most importantly for this example, that I am modelling the correlation between to two series using a smooth function of time --- # Copulas — AIC .row[ .col-6[ Fit a range of copula types * Bivariate normal * Frank * Ali-Mikhail-Haq * Farlie-Gumbel-Morgenstern * Plackett * Student t * Hougaard * Clayton (0, 270) * Gumbel (0, 90, 180, 270) * Joe (0, 90) AIC to select best-fitting model ] .col-6[ .small[ |model | df| AIC| |:----------|--------:|---------:| |m_cop_G270 | 17.41895| -24.21843| |m_cop_G90 | 17.50362| -22.18304| |m_cop_N | 18.47071| -21.46334| |m_cop_F | 18.16605| -20.85806| |m_cop_PL | 17.62977| -20.79832| |m_cop_T | 18.39205| -16.31174| |m_cop_AMH | 17.64876| -14.93747| |m_cop_FGM | 17.49500| -14.81073| |m_cop_C270 | 17.31024| -13.66620| |m_cop_J90 | 17.24088| -13.40605| ] ] ] ??? There are many kinds of copula to choose from and selecting just one that matches your data makes model selection more difficult Instead we fit many different kinds of copula and choose the one (or several) that are most compatible with the data, say using AIC Here we see that Gumbel copula (rotated 270 degrees) gives the best fit, but several other copulas do similarly --- # Estimated correlation ![](index_files/figure-html/plot-copula-1.svg)<!-- --> ??? We modelled `\(\theta\)` but is is easier to convert this to equivalent Kendall's `\(\tau\)` Here I'm showing the results from the top few copula models, which all suggest the same thing, that the two series become gradually uncorrelated over time, which we would interpret as indications that the two major biogeochemical cycles in the lake became decoupled due to the input of nitrogen to the ctachment from atmospheric deposition --- # Summary I GAMs & their extensions are one way to adress many common questions asked of palaeo data GAMs are relatively simple & familiar Incredibly powerful models but remain interprettable --- # Summary II Don't be afraid to think beyond the usual quantitiative methods Reach out to colleagues for help, collaborate, etc. As a community we need to take statistical training more seriously for our ECRs --- # Thank you & More 📷 📨 gavin@anivet.au.dk 💻 [github.com/gavinsimpson/inqua23](https://github.com/gavinsimpson/inqu23) .center[ <img src="resources/bit.ly_inqua-talk-2023.png" width="45%" style="display: block; margin: auto;" /> ]