class: inverse, middle, left, my-title-slide, title-slide

.title[
# Dissimilarity & Unconstrained ordination
]

.author[
### Gavin Simpson
]

.date[
### August 6, 2024
]

---
class: inverse middle center big-subsection

# Welcome

---

# Logistics

* Dissimilarity & unconstrained ordination
* Constrained ordination
* Permutation tests

---

# Today's topics

* Dissimilarity
* Unconstrained ordination
* Principal Components Analysis (PCA)
* Correspondence Analysis (CA)
* Principal Coordinates Analysis (PCO or PCoA)
* Non-metric Multidimensional Scaling (NMDS)
* Practical tips for working & plotting ordinations using **vegan**

---

# Multivariate data

---

# Multivariate data

Multivariate != multiple regression

Multivariate means we have two or more *response* variables

We are interested in learning about the common patterns or modes of variation among those multiple response variables

Multivariate data require special statistical methods

---

# Multivariate data in biology

Community composition — many species as the responses

In modern biology we have OTUs and ASVs

In chemistry we have metabolites or spectra or masses or …

All of these constitute multivariate data

---

# Species composition

Species composition is the set of species in a site or a sample

Typically comes with a measure of abundance

"abundance" could also be whether each species is present or absent in the sample

Relative abundance expresses the abundance of each species as its proportion of the total abundance in each sample

---

# Data box

<img src="./resources/data-box-cropped.png" width="50%" style="display: block; margin: auto;" />

---

# Q and R mode

Measuring dependence between variables (descriptors; species, environment) is typically done using coefficients like Pearson's `\(r\)`

Hence this mode of analysis is *R* mode

*Q* mode relates to methods that focus on dependencies among objects (samples)

Often difficult to identify the mode; PCA starts from a dispersion matrix of variables (R mode) but provides an ordination of the samples (Q mode)

---
class: inverse middle center big-subsection

# Dissimilarity

---

# Measuring association — binary data

.row[
.col-6[
<table>
<tr> <td> </td> <td colspan=3>Object <i>j</i></td> </tr>
<tr> <td rowspan=3>Object <i>i</i></td> <td> </td> <td> + </td> <td> - </td> </tr>
<tr> <td> + </td> <td> a </td> <td> b </td> </tr>
<tr> <td> - </td> <td> c </td> <td> d </td> </tr>
</table>
]
.col-6[
Dissimilarity based on the number of species present only in `\(i\)` ( `\(b\)` ), or only in `\(j\)` ( `\(c\)` ), or present in both ( `\(a\)` ), or absent in both ( `\(d\)` ).
]
]
.row[
.col-6[
Jaccard similarity
`$$s_{ij} = \frac{a}{a + b + c}$$`
Jaccard dissimilarity
`$$d_{ij} = \frac{b + c}{a + b + c}$$`
]
.col-6[
Simple matching coefficient
`$$s_{ij} = \frac{a + d}{a + b + c + d}$$`
Simple matching dissimilarity
`$$d_{ij} = \frac{b + c}{a + b + c + d}$$`
]
]
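
---

# Binary coefficients in vegan

Not part of the original deck — a minimal sketch of the Jaccard dissimilarity computed from a made-up presence–absence matrix; `vegdist()` is covered properly later.

``` r
library("vegan")

## presence (1) / absence (0) of five species at two sites
## here a = 2, b = 1, c = 1, d = 1
x <- rbind(site_i = c(1, 1, 0, 1, 0),
           site_j = c(1, 0, 1, 1, 0))

## Jaccard dissimilarity: (b + c) / (a + b + c) = 2 / 4 = 0.5
vegdist(x, method = "jaccard", binary = TRUE)
```

`binary = TRUE` converts the data to presence–absence first; note that `\(d\)` plays no part in the Jaccard coefficient — joint absences are skipped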

---

# Dissimilarity

.row[
.col-6[
![](slides_files/figure-html/dissimilarity-1.svg)<!-- -->
]
.col-6[
`$$d_{ij} = \sqrt{\sum\limits^m_{k=1}(x_{ik} - x_{jk})^2}$$`

`$$d_{ij} = \sum\limits^m_{k=1}|x_{ik} - x_{jk}|$$`

`$$d_{ij} = \frac{\sum\limits^m_{k=1}|x_{ik} - x_{jk}|}{\sum\limits^m_{k=1}(x_{ik} + x_{jk})}$$`
]
]

---

# Measuring association — quantitative data

* Euclidean distance dominated by large values
* Manhattan distance less affected by large values
* Bray-Curtis treats all species with equal weight
* Similarity ratio (Steinhaus-Marczewski `\(\equiv\)` Jaccard) less dominated by extremes
* Chord distance, used for proportional data; _signal-to-noise_ measure

.row[
.col-6[
Similarity ratio
`\(s_{ij} = \frac{\sum\limits^m_{k=1}x_{ik}x_{jk}}{\sum\limits^m_{k=1}x_{ik}^2 + \sum\limits^m_{k=1}x_{jk}^2 - \sum\limits^m_{k=1}x_{ik}x_{jk}}\)`
]
.col-6[
Chord distance
`\(d_{ij} = \sqrt{\sum\limits^m_{k=1}(\sqrt{p_{ik}} - \sqrt{p_{jk}})^2}\)`
]
]

---

# Measuring association — mixed data

.row[
.col-8[
.small[
* `\(s_{ijk}\)` is the similarity between sites `\(i\)` and `\(j\)` for the `\(k\)`th variable.
* Weights `\(w_{ijk}\)` are typically 0 or 1 depending on whether the comparison is valid for variable `\(k\)`. Can also use variable weighting with `\(w_{ijk}\)` between 0 and 1.
* `\(w_{ijk}\)` is zero if the `\(k\)`th variable is missing for one or both of `\(i\)` or `\(j\)`.
* For binary variables `\(s_{ijk}\)` is the Jaccard coefficient.
* For categorical data `\(s_{ijk}\)` is 1 if `\(i\)` and `\(j\)` have the same category, 0 otherwise.
* For quantitative data `\(s_{ijk} = 1 - |x_{ik} - x_{jk}| / R_k\)`, where `\(R_k\)` is the range of variable `\(k\)`.
]
]
.col-4[
Gower's coefficient
`$$s_{ij} = \frac{\sum\limits^m_{k=1} w_{ijk}s_{ijk}}{\sum\limits^m_{k=1} w_{ijk}}$$`
]
]
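
---

# Gower's coefficient in R

Not in the original deck — a minimal sketch using `daisy()` from the recommended **cluster** package, which implements Gower's coefficient for mixed data; the tiny data frame is hypothetical.

``` r
library("cluster")

## hypothetical mixed data: quantitative, binary, categorical
df <- data.frame(ph   = c(4.2, 6.5, 5.1),
                 bare = factor(c("yes", "no", "yes")),
                 lith = factor(c("granite", "basalt", "granite")))

## Gower dissimilarity, 1 - s_ij; missing values get w_ijk = 0
daisy(df, metric = "gower")
```

By default factors are handled by simple matching; see the `type` argument for treating binary variables asymmetrically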

---

# Double zero problem

<img src="./resources/response-curve.png" width="90%" style="display: block; margin: auto;" />

---

# Double zero problem

If a species is present at sites `\(i\)` and `\(j\)` we take this as a reflection of the two sites' similarity

If a species is *absent* from one of `\(i\)` or `\(j\)` we take this as a reflection of some ecological difference between the two sites

But, what if the species is absent from *both* `\(i\)` and `\(j\)`?

--

The species could be absent for different reasons

* too hot in `\(i\)` and too cold in `\(j\)`
* too dry in `\(i\)` and too wet in `\(j\)`

---

# Double zero problem

We may choose not to attach ecological meaning to a joint or *double* absence when computing association

Coefficients that skip double zeroes are *asymmetric*

Coefficients that do not skip double zeroes are *symmetric*

---

# Metrics

Three types of quantitative distance coefficient

1. Metric
2. Semi-metric
3. Non-metric

The groups pertain to how the coefficient obeys certain metric rules

Some use the term "distance coefficient" to refer only to those that are metric, using "dissimilarity" for the semi- and non-metrics

---

# Metric coefficients

Obey four rules

1. minimum = 0: if `\(i = j\)` then `\(d_{ij} = 0\)`
2. positivity: if `\(i \neq j\)` then `\(d_{ij} \gt 0\)`
3. symmetry: `\(d_{ij} = d_{ji}\)`
4. triangle inequality: `\(d_{ij} + d_{jk} \ge d_{ik}\)`

The last rule states that the *sum* of 2 sides of a triangle in Euclidean space is necessarily `\(\ge\)` the 3rd side

---

# Other types

Semi-metrics do not obey the triangle inequality rule

Non-metrics do not obey the positivity rule

---

# Species abundance paradox

|   | y1| y2| y3|
|:--|--:|--:|--:|
|x1 |  0|  4|  8|
|x2 |  0|  1|  1|
|x3 |  1|  0|  0|

Euclidean distances:

|   |       x1|       x2|       x3|
|:--|--------:|--------:|--------:|
|x1 | 0.000000| 7.615773| 9.000000|
|x2 | 7.615773| 0.000000| 1.732051|
|x3 | 9.000000| 1.732051| 0.000000|

---

# Species abundance paradox

|   | y1| y2| y3|
|:--|--:|--:|--:|
|x1 |  0|  4|  8|
|x2 |  0|  1|  1|
|x3 |  1|  0|  0|

Chi-square distances:

|   |        x1|        x2|       x3|
|:--|---------:|---------:|--------:|
|x1 | 0.0000000| 0.3600411| 4.009249|
|x2 | 0.3600411| 0.0000000| 4.020779|
|x3 | 4.0092486| 4.0207794| 0.000000|
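
---

# Species abundance paradox — in R

Not in the original deck — a minimal sketch reproducing the two tables above. Note the paradox: x1 and x2 share no species with x3, yet Euclidean distance places x2 closest to x3. Going via `decostand()` is one way to get the chi-square distances.

``` r
library("vegan")

x <- matrix(c(0, 4, 8,
              0, 1, 1,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("x", 1:3), paste0("y", 1:3)))

dist(x)                           # Euclidean distances
dist(decostand(x, "chi.square"))  # chi-square distances
```

Recent versions of **vegan** also offer `vegdist(x, method = "chisq")`, though check its scaling against the `decostand()` route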

---

# Transformations

* Can transform the variables (e.g. species) or the samples to improve the gradient separation of the dissimilarity coefficient.
* No transformation of variables or samples leads to a situation of quantity domination — big values dominate `\(d_{ij}\)`.
* Normalise samples — gives all samples equal weight.
* Normalise variables;
    * gives all variables equal weight,
    * inflates the influence of rare species.
* Double (_Wisconsin_) transformation; normalise variables then samples — see the sketch on the next slide.
* Noy-Meir _et al_. (1975) _J. Ecology_ **63**; 779--800.
* Faith _et al_. (1987) _Vegetatio_ **69**; 57--68.
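
---

# Transformations — in R

Not in the original deck — a quick sketch of these standardisations with vegan's `decostand()` (introduced properly on the next slide) and `wisconsin()`.

``` r
library("vegan")
data(varespec)

## normalise samples: relative abundance per site
prop  <- decostand(varespec, method = "total")

## normalise variables: species scaled to unit maximum
spmax <- decostand(varespec, method = "max")

## double (Wisconsin) standardisation: species maxima, then site totals
wisc  <- wisconsin(varespec)
```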

---

# Dissimilarity

Two key functions

1. `vegdist()`
2. `decostand()`

``` r
library("vegan") # provides vegdist() & decostand()
data(varespec)
euc_dij <- vegdist(varespec, method = "euclidean")
bc_dij <- vegdist(varespec)
hell_dij <- vegdist(decostand(varespec, method = "hellinger"),
                    method = "euclidean")
```

---
class: inverse middle center big-subsection

# Ordination

---
class: inverse middle center big-subsection

# ordnung

???

The word "ordination" comes from the German word "Ordnung"

---

# Ordination

Putting things in order is exactly what we do in ordination

* we arrange our samples along gradients by fitting lines and planes through the data that describe the main patterns in those data
* we map data to lower dimensions reflecting how similar the samples are to one another in terms of the variables measured

Three families of models

1. Linear
2. Unimodal
3. Distance-based

---
class: inverse middle center big-subsection

# Unconstrained

---

# Unconstrained

What is .text-highlight[unconstrained]?

First we look for major variation, then relate it to environmental variation

vs. *constrained* ordination, where we only want to see what can be explained by environmental variables of interest

*How well do we explain the main patterns in the species data?* vs *How large are the patterns we can explain with the measured data?*

---

# Ordination methods

Principal Components Analysis (PCA) is a linear method — most useful for environmental data or sometimes with species data and short gradients

Correspondence Analysis (CA) is a unimodal method — most useful for species data, especially where non-linear responses are observed

Principal Coordinates Analysis (PCO) and Non-metric Multidimensional Scaling (NMDS) — can be used for any kind of data

---
class: inverse middle center huge-subsection

# PCA

---

# Principal Components Analysis

Instead of doing many regressions, do one with all the responses

No explanatory variables — uncover latent, underlying gradients

PCA fits a line through our cloud of data in such a way that it maximises the variance in the data captured by that line (i.e. minimises the distance between the fitted line and the observations)

Fit a second line to form a plane, and so on, until we have one PCA axis for every dimension of the data

Each of these subsequent axes is uncorrelated with previous axes — they are .text-highlight[orthogonal] — the variance each axis explains is uncorrelated

---

# Principal Components Analysis

.center-fig[
![:scale 75% © HappyWaldo CC BY-SA](./resources/Sherrin_footy.jpg)
]

---

# Principal Components Analysis

.center[
![](resources/pca-anim.gif)
]

.row[
.col-4[
![](slides_files/figure-html/pc-anim-plt1-1.svg)<!-- -->
]
.col-4[
![](slides_files/figure-html/pc-anim-plt2-1.svg)<!-- -->
]
.col-4[
![](slides_files/figure-html/pc-anim-plt3-1.svg)<!-- -->
]
]

---

# Principal Components Analysis

.center[![:scale 64%](resources/pca-anim.gif)]

---

# Load vegan

**vegan** is an add-on package

``` r
## install.packages("vegan") # Only need if you've never installed before
library("vegan")
data(varespec)
data(varechem)
```

**vegan** comes with a number of data sets which we'll use to get started

---

# Vegetation in lichen pastures — species

``` r
class(varespec)
```

```
## [1] "data.frame"
```

``` r
dim(varespec) # number of samples, species
```

```
## [1] 24 44
```

``` r
head(varespec[, 1:6], n = 5)
```

```
##    Callvulg Empenigr Rhodtome Vaccmyrt Vaccviti Pinusylv
## 18     0.55    11.13     0.00     0.00    17.80     0.07
## 15     0.67     0.17     0.00     0.35    12.13     0.12
## 24     0.10     1.55     0.00     0.00    13.47     0.25
## 27     0.00    15.13     2.42     5.92    15.97     0.00
## 23     0.00    12.68     0.00     0.00    23.73     0.03
```

`varespec` is a data frame

* *Variables* are the columns (here the species)
* *Observations* are the rows (the samples, sites, etc)

???

Data are cover values of 44 understorey species recorded at 24 locations in lichen pastures within dry *Pinus sylvestris* forests

This arrangement is back-to-front if you are used to OTU-like data

---

# Vegetation in lichen pastures — chemistry

Also have associated soil physical and chemical measurements at the 24 sites

``` r
head(varechem)
```

```
##       N    P     K    Ca    Mg    S    Al   Fe    Mn   Zn  Mo Baresoil Humdepth
## 18 19.8 42.1 139.9 519.4  90.0 32.3  39.0 40.9  58.1  4.5 0.3     43.9      2.2
## 15 13.4 39.1 167.3 356.7  70.7 35.2  88.1 39.0  52.4  5.4 0.3     23.6      2.2
## 24 20.2 67.7 207.1 973.3 209.1 58.1 138.0 35.4  32.1 16.8 0.8     21.2      2.0
## 27 20.6 60.8 233.7 834.0 127.2 40.7  15.4  4.4 132.0 10.7 0.2     18.7      2.9
## 23 23.8 54.5 180.6 777.0 125.8 39.5  24.2  3.0  50.1  6.6 0.3     46.0      3.0
## 19 22.8 40.9 171.4 691.8 151.4 40.8 104.8 17.6  43.6  9.1 0.4     40.5      3.8
##     pH
## 18 2.7
## 15 2.8
## 24 3.0
## 27 2.8
## 23 2.7
## 19 2.7
```

???

`Baresoil` is the estimated cover of bare soil

`Humdepth` is the depth of the humus layer

---

# Vegetation in lichen pastures — PCA

PCA is fitted using `rda()`

* Provide a data frame of observations on one or more variables
* To scale all variables to be μ = 0, σ<sup>2</sup> = 1: `scale = TRUE`

``` r
pca <- rda(decostand(varespec, method = "hellinger"), scale = TRUE)
pca
```

```
## Call: rda(X = decostand(varespec, method = "hellinger"), scale = TRUE)
## 
##               Inertia Rank
## Total              44     
## Unconstrained      44   23
## Inertia is correlations
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 8.603 5.134 4.576 3.714 3.245 2.779 2.626 2.221 
## (Showing 8 of 23 unconstrained eigenvalues)
```

---

# Vegetation in lichen pastures — PCA

PCA of the covariance matrix — default is `scale = FALSE`

``` r
rda(decostand(varespec, method = "hellinger"), scale = FALSE)
```

```
## Call: rda(X = decostand(varespec, method = "hellinger"), scale = FALSE)
## 
##               Inertia Rank
## Total          0.3647     
## Unconstrained  0.3647   23
## Inertia is variance
## 
## Eigenvalues for unconstrained axes:
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.14586 0.07908 0.02866 0.02446 0.02209 0.01263 0.01179 0.00873 
## (Showing 8 of 23 unconstrained eigenvalues)
```

How **vegan** scales the eigenvalues is different to *Canoco*

---

# Vegetation in lichen pastures — PCA

``` r
biplot(pca, scaling = "symmetric")
```

<img src="slides_files/figure-html/vare-pca-1c-1.svg" width="95%" />

---

# PCA biplots

.row[
.col-6[
* Sample (species) points plotted close together have similar species compositions (occur together)
* In PCA, species scores often drawn as arrows — point in direction of increasing abundance
* Species arrows with small angles to an axis are highly correlated with that axis
]
.col-6[

``` r
biplot(pca, scaling = "symmetric")
```

![](slides_files/figure-html/vare-pca-2-1.svg)<!-- -->
]
]

---

# Eigenvalues λ

.row[
.col-6[
Eigenvalues are the amount of variance (inertia) explained by each axis

``` r
head(eigenvals(pca), 5)
```

```
##      PC1      PC2      PC3      PC4      PC5 
## 8.602826 5.133623 4.575623 3.713926 3.244925
```
]
.col-6[

``` r
screeplot(pca, bstick = TRUE, type = "l", main = NULL)
```

![](slides_files/figure-html/vare-pca-4-1.svg)<!-- -->
]
]

---

# Eigenvalues λ

The `summary()` method provides additional information

``` r
summary(eigenvals(pca))
```

```
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Eigenvalue            8.6028 5.1336 4.5756 3.71393 3.24492 2.77919 2.62560
## Proportion Explained  0.1955 0.1167 0.1040 0.08441 0.07375 0.06316 0.05967
## Cumulative Proportion 0.1955 0.3122 0.4162 0.50059 0.57434 0.63750 0.69718
##                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Eigenvalue            2.22100 1.67857 1.65627 1.30432 1.03472 0.91190 0.87323
## Proportion Explained  0.05048 0.03815 0.03764 0.02964 0.02352 0.02073 0.01985
## Cumulative Proportion 0.74765 0.78580 0.82344 0.85309 0.87660 0.89733 0.91718
##                         PC15   PC16    PC17    PC18     PC19     PC20     PC21
## Eigenvalue            0.7612 0.6336 0.52021 0.51641 0.408841 0.267175 0.206336
## Proportion Explained  0.0173 0.0144 0.01182 0.01174 0.009292 0.006072 0.004689
## Cumulative Proportion 0.9345 0.9489 0.96070 0.97243 0.981727 0.987799 0.992488
##                           PC22     PC23
## Eigenvalue            0.179715 0.150795
## Proportion Explained  0.004084 0.003427
## Cumulative Proportion 0.996573 1.000000
```
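
---

# PCA — cross-checking with prcomp()

Not in the original deck — a sketch cross-checking `rda()` against base R's `prcomp()`; for a correlation-matrix PCA the eigenvalues should agree (up to numerical precision).

``` r
hell <- decostand(varespec, method = "hellinger")

pc <- prcomp(hell, scale. = TRUE)
head(pc$sdev^2, 5)      # variances of the principal components

head(eigenvals(pca), 5) # eigenvalues from rda(..., scale = TRUE)
```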

---
class: inverse middle center subsection

# Correspondence Analysis

---

# Correspondence Analysis

Correspondence analysis (CA) is very similar to PCA — a weighted form of PCA

The row and column sums are used as weights and this has the effect of turning the analysis into one of relative composition

The weighting is a trick to get linear-based software to fit non-linear responses

These non-linear responses are assumed to be unimodal Gaussian curves, all with equal height and tolerance widths, and equally spaced optima

---

# Correspondence Analysis

![](slides_files/figure-html/unnamed-chunk-1-1.svg)<!-- -->

---

# Correspondence Analysis

Correspondence analysis (CA) is very similar to PCA — a weighted form of PCA

The row and column sums are used as weights and this has the effect of turning the analysis into one of relative composition

The weighting is a trick to get linear-based software to fit non-linear responses

These non-linear responses are assumed to be unimodal Gaussian curves, all with equal height and tolerance widths, and equally spaced optima

So, not very realistic, but it is surprisingly robust at times to violation of this assumption

---

# Vegetation in lichen pastures — CA biplots

.row[
.col-7[
- Have two sets of scores
    1. Species scores
    2. Site scores
- Sample (species) points plotted close together have similar species compositions (occur together)
- In CA, species scores drawn as points — these are the fitted optima along the gradients
- Abundance of species declines in concentric circles away from the optima
]
.col-5[

``` r
ca <- cca(varespec)
plot(ca)
```

![](slides_files/figure-html/vare-ca-1-1.svg)<!-- -->
]
]

---

# Vegetation in lichen pastures — CA biplots

- Species scores plotted as weighted averages of site scores, or
- Site scores plotted as weighted averages of species scores, or
- A symmetric plot

![](slides_files/figure-html/vare-ca-2-1.svg)<!-- -->

---
class: inverse middle center huge-subsection

# vegan

---

# Vegan basics

- The majority of vegan functions work with a single vector, or more commonly an entire data frame
- This data frame may contain the species abundances
- Where subsidiary data is used/required, these too are supplied as data frames
- For example; the environmental constraints in a CCA
- It is not a problem if you have all your data in a single file/object; just subset it into two data frames after reading it into R

``` r
spp <- allMyData[, 1:20] ## columns 1-20 contain the species data
env <- allMyData[, 21:26] ## columns 21-26 contain the environmental data
```

---

<!--
# Simple vegan usage

First we start with a simple correspondence analysis (CA) to illustrate the basic features

Here I am using one of the in-built data sets on lichen pastures

For various reasons to fit a CA we use the `cca()` function

Store the fitted CA in `ca1` and print it to view the results

``` r
ca1 <- cca(varespec)
ca1
```

```
## Call: cca(X = varespec)
## 
##               Inertia Rank
## Total           2.083     
## Unconstrained   2.083   23
## Inertia is scaled Chi-square
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
## 0.5249 0.3568 0.2344 0.1955 0.1776 0.1216 0.1155 0.0889 
## (Showing 8 of 23 unconstrained eigenvalues)
```

---
-->

# `scores()` & `scaling`

- When we draw the results of many ordinations we display 2 or more sets of data
- Can't display all of these and maintain relationships between the scores
- Solution; scale one set of scores relative to the other
- Controlled via the `scaling` argument

---

# Scaling

How we scale scores is controlled via the `scaling` argument

- `scaling = 1` — Focus on sites, scale site scores by `\(\lambda_i\)`
- `scaling = 2` — Focus on species, scale species scores by `\(\lambda_i\)`
- `scaling = 3` — Symmetric scaling, scale both scores by `\(\sqrt{\lambda_i}\)`
- `scaling = -1` — As above, but for `rda()` get correlation scores
- `scaling = -2` — for `cca()` multiply results by `\(\sqrt{(1/(1-\lambda_i))}\)`
- `scaling = -3` — this is Hill's scaling
- `scaling < 0` — For `rda()` divide species scores by species' `\(\sigma\)`
- `scaling = 0` — raw scores

---

# No one can remember all that…

Use the text names instead:

* `scaling = "none"` means `scaling = 0`
* `scaling = "sites"` means `scaling = 1`
* `scaling = "species"` means `scaling = 2`
* `scaling = "symmetric"` means `scaling = 3`

For PCA (`rda()`) use `correlation = TRUE` to get the correlation scores (negative scaling)

For CA (`cca()`) use `hill = TRUE` to get Hill's scaling scores (negative scaling)

---

# Extractor functions — scores()

Don't rummage around in the objects returned by *vegan* functions — unless you know what you're doing

``` r
head(scores(pca, choices = 1:3, display = "species",
            scaling = "species", correlation = TRUE))
```

```
##                 PC1        PC2         PC3
## Callvulg -0.2896130  0.1338259  0.18460432
## Empenigr  0.5732742  0.2057975 -0.24778607
## Rhodtome  0.9262883 -0.2248872 -0.28211297
## Vaccmyrt  0.8558251 -0.2508985 -0.33438598
## Vaccviti  0.6800662  0.5703573  0.02059066
## Pinusylv  0.2530739  0.2280044  0.80172865
```

---

# Basic ordination plots — plot()

.row[
.col-6[
- `choices = 1:2` — which axes?
- `scaling = 3` — scaling to use
- `display = c("sites","species")` — which scores (default is both)
- `type = "text"` — display scores using labels or points (`"points"`)
- Other graphics arguments can be supplied but apply to all scores
]
.col-6[

``` r
plot(ca1, choices = c(1,3), scaling = "symmetric")
```

![](slides_files/figure-html/ca_biplot-1.svg)<!-- -->
]
]

---
class: inverse center middle subsection

# Distance-based methods

---
class: inverse center middle huge-subsection

# PCO

---

# Principal Coordinates Analysis

PCoA (or PCO, or **metric** multidimensional scaling (MDS)) finds a mapping to Euclidean space of `\(n\)` objects using the `\(n\)` by `\(n\)` matrix of dissimilarities `\(d_{ij}\)`

PCoA is an eigen decomposition like PCA

* first axis is the best 1D mapping of the dissimilarities
* subsequent axes are orthogonal to the first, but improve the mapping by smaller & smaller amounts

Can use *any* dissimilarity coefficient (with a big **but**)

PCoA on a Euclidean distance matrix ⇨ PCA (without species scores)

---

# Principal Coordinates Analysis

The big **but** is that not all dissimilarity coefficients can be represented in Euclidean space

If the dissimilarity matrix is metric we're OK — *usually*

If not metric, we get negative eigenvalues ⇨ these correspond to distances in imaginary space

Distortion can be measured as

`$$\frac{\sum \left | \lambda^{-} \right |}{\sum \left | \lambda \right |}$$`
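
---

# PCoA — measuring the distortion

Not in the original deck — a minimal sketch computing the distortion ratio above from the eigenvalues of an uncorrected PCoA of the Bray-Curtis dissimilarities (the same fit is shown again on the following slides).

``` r
pco_raw <- wcmdscale(vegdist(varespec), eig = TRUE)
ev <- eigenvals(pco_raw)

## sum of |negative eigenvalues| over sum of |all eigenvalues|
sum(abs(ev[ev < 0])) / sum(abs(ev))
```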

---

# PCoA — correcting negative λ

1. Could square root transform the `\(d_{ij}\)`
2. Add a sufficiently large constant to `\(d_{ij}\)` or `\(d_{ij}^2\)`
    * **Lingoes** method: `\(\hat{d}_{ij} = \sqrt{d_{ij}^2 + 2c_1}\)` for `\(i \neq j\)` where `\(c_1\)` is `\(\mathsf{max} \left ( \left | \lambda_i^{-} \right | \right )\)`
    * **Cailliez** method: `\(\hat{d}_{ij} = d_{ij} + c_2\)` for `\(i \neq j\)` where `\(c_2\)` is computed from a special matrix formed during the PCoA calculations

---

# Principal Coordinates Analysis

Default dissimilarity in `vegdist()` is Bray-Curtis

``` r
pco1 <- wcmdscale(vegdist(varespec), eig = TRUE)
round(eigenvals(pco1), 3)
```

```
##  [1]  1.755  1.133  0.443  0.370  0.245  0.196  0.175  0.128  0.097  0.076
## [11]  0.064  0.058  0.039  0.017  0.005  0.000 -0.006 -0.013 -0.025 -0.038
## [21] -0.048 -0.054 -0.074
```

``` r
pco2 <- wcmdscale(vegdist(varespec), eig = TRUE, add = "lingoes")
round(eigenvals(pco2), 3)
```

```
##  [1] 1.829 1.208 0.517 0.444 0.319 0.270 0.249 0.203 0.171 0.150 0.138 0.132
## [13] 0.114 0.091 0.079 0.074 0.068 0.061 0.049 0.037 0.026 0.020
```

*Note there's one fewer dimension after correction*

---

# Principal Coordinates Analysis

Can plot the PCoA using `plot()` and add *species scores*

.row[
.col-6[

``` r
pco <- wcmdscale(vegdist(varespec), eig = TRUE)

## plot
plot(pco)

## get PCoA scores
scrs <- scores(pco, choices = 1:2)

## take WA of PCoA scores,
## weight by abundance
spp_scrs <- wascores(scrs, varespec,
                     expand = FALSE)

## add
points(spp_scrs, col = "red", pch = 19)
```
]
.col-6[
![](slides_files/figure-html/pco-example-2-1.svg)<!-- -->
]
]
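
---

# PCoA — Cailliez correction

Not in the original deck — the Lingoes correction was shown above; `wcmdscale()` also accepts `add = "cailliez"`, a quick sketch:

``` r
pco3 <- wcmdscale(vegdist(varespec), eig = TRUE, add = "cailliez")
round(eigenvals(pco3), 3) # no negative eigenvalues should remain
```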

---
class: inverse center middle huge-subsection

# NMDS

---

# Non-Metric Multidimensional Scaling

NMDS finds a low-dimensional mapping that preserves as best as possible the .text-highlight[rank order] of the original dissimilarities `\(d_{ij}\)`

The solution with minimal `stress` is sought; a measure of how well the NMDS mapping fits the `\(d_{ij}\)`

Stress is the sum of squared residuals of a monotonic regression between the distances in NMDS space, `\(d^{*}_{ij}\)`, & the `\(d_{ij}\)`

This monotonic (non-linear) regression can cope with non-linear responses in species data

Iterative solution; *convergence is not guaranteed*

Must fit solutions separately for each dimensionality

---

# Non-Metric Multidimensional Scaling

- Use an appropriate dissimilarity metric that gives good gradient separation `rankindex()`, e.g. Bray-Curtis, Jaccard, Kulczynski
- Wisconsin transformation useful; standardize species to equal maxima, then sites to equal totals `wisconsin()`
- Use many random starts and look at the fits with lowest stress (`try` & `trymax`)
- Only conclude a solution is reached if the lowest-stress solutions are similar (Procrustes rotation — see the sketch at the end of this section)
- Fit NMDS for 1, 2, 3, … dimensions; stop after a sudden drop in stress observed in a screeplot
- NMDS solutions can be rotated; common to rotate to PCs
- Also scale axes in half-change units

???

Half-change units: samples separated by a distance of 1 correspond, on average, to a 50% turnover in composition

---

# NMDS in vegan

**vegan** implements all these ideas via the `metaMDS()` wrapper

``` r
data(dune)
set.seed(10)
(sol <- metaMDS(dune, trace = FALSE))
```

```
## 
## Call:
## metaMDS(comm = dune, trace = FALSE)
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune
## Distance: bray
## 
## Dimensions: 2
## Stress:     0.1183186
## Stress type 1, weak ties
## Best solution was repeated 6 times in 20 tries
## The best solution was from try 2 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'dune'
```

---

# NMDS in vegan

If no convergent solutions, continue iterations from the previous best solution

``` r
(sol <- metaMDS(dune, previous.best = sol, trace = FALSE))
```

```
## 
## Call:
## metaMDS(comm = dune, trace = FALSE, previous.best = sol)
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune
## Distance: bray
## 
## Dimensions: 2
## Stress:     0.1183186
## Stress type 1, weak ties
## Best solution was repeated 16 times in 40 tries
## The best solution was from try 2 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'dune'
```

---

# NMDS in vegan

``` r
layout(matrix(1:2, ncol = 2))
plot(sol, main = "Dune NMDS plot")
stressplot(sol, main = "Shepard plot")
layout(1)
```

![](slides_files/figure-html/nmds3-1.svg)<!-- -->

???

Shepard diagram which is a plot of ordination distances and monotone or linear fit line against original dissimilarities. In addition, it displays two correlation-like statistics on the goodness of fit in the graph. The nonmetric fit is based on stress S and defined as R2 = 1-S*S. The "linear fit" is the squared correlation between fitted values and ordination distances. For monoMDS, the "linear fit" and R2 from "stress type 2" are equal.

---

# Checking dimensionality *k*

Fit NMDS solutions for a number of `k`

.row[
.col-6[

``` r
k_vec <- 1:10
stress <- numeric(length(k_vec))
dune_dij <- metaMDSdist(dune, trace = FALSE)
set.seed(25)
for (i in seq_along(k_vec)) {
    sol <- metaMDSiter(dune_dij, k = i,
                       trace = FALSE)
    stress[i] <- sol$stress
}
```

Need to use the helper functions to do this right

``` r
plot(k_vec, stress, type = "b",
     ylab = "Stress", xlab = "Dimensions")
```
]
.col-6[
![](slides_files/figure-html/plot_nmds-scree-1.svg)<!-- -->
]
]

---

# NMDS — Goodness of fit

A goodness of fit statistic `\(g_i\)` can be computed for the observations; defined such that `\(\sum_{i=1}^{n} g_i^2 = S^2\)`

``` r
(g <- goodness(sol))
```

```
##  [1] 0.0008559796 0.0014569292 0.0006071222 0.0010337685 0.0008109554
##  [6] 0.0006932209 0.0009813765 0.0007284320 0.0007316829 0.0009052744
## [11] 0.0004102242 0.0004846332 0.0008688444 0.0011563520 0.0013120348
## [16] 0.0007051008 0.0008226499 0.0005374015 0.0008933316 0.0006139805
```

``` r
sum(g^2)
```

```
## [1] 1.513378e-05
```

``` r
sol$stress^2
```

```
## [1] 1.513378e-05
```

``` r
all.equal(sqrt(sum(g^2)), sol$stress)
```

```
## [1] TRUE
```
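
---

# Comparing NMDS solutions — Procrustes

Not in the original deck — the earlier checklist said to compare lowest-stress solutions with Procrustes rotation; a minimal sketch with vegan's `procrustes()` & `protest()`:

``` r
set.seed(1)
sol1 <- metaMDS(dune, trace = FALSE)
set.seed(2)
sol2 <- metaMDS(dune, trace = FALSE)

pro <- procrustes(sol1, sol2, symmetric = TRUE)
plot(pro)           # arrows show where the two solutions disagree

protest(sol1, sol2) # permutation test of the Procrustes correlation
```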

---

# Supplementary data

If we have other data collected at the same sites, say about the environment, we can investigate relationships between the main components of variation in species composition and those environmental variables

Two main **vegan** functions for this

1. `envfit()` and helpers `vectorfit()` and `factorfit()`
2. `ordisurf()`

`envfit()` fits vectors or planes while `ordisurf()` fits smooth, potentially non-linear surfaces

---

# Vector fitting

`envfit()` fits vectors using a regression

`$$\hat{y_i} = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$$`

where:

* `\(x_{pi}\)` are the site scores for the `\(i\)`<sup>th</sup> site on axis 1, 2, …, `\(p\)`
* Usually `\(p = 2\)` as we're fitting vectors into 2D ordination plots
* `\(y_i\)` is the value of the environment at the `\(i\)`<sup>th</sup> site
* `envfit()` can handle a matrix of environmental variables

*Note no intercept; internally center `\(\boldsymbol{y}\)` and `\(\boldsymbol{x_p}\)`*

For categorical `\(\boldsymbol{y}\)`, `envfit()` finds averages of scores for each level of the category

---

# Vector fitting

.row[
.col-6[
`envfit()` with formula

``` r
set.seed(42)
ev <- envfit(pca ~ ., data = varechem,
             choices = 1:2,
             scaling = "symmetric",
             permutations = 1000)
ev
```

or data frames

``` r
envfit(pca, varechem, choices = 1:2,
       scaling = "symmetric",
       permutations = 1000)
```
]
.col-6[

```
## 
## ***VECTORS
## 
##               PC1      PC2     r2   Pr(>r)    
## N        -0.80597 -0.59196 0.0384 0.676324    
## P         0.78221 -0.62301 0.0335 0.696304    
## K         0.94909 -0.31501 0.0394 0.638362    
## Ca        0.87606  0.48220 0.1607 0.153846    
## Mg        0.83079  0.55658 0.1774 0.121878    
## S         0.86667 -0.49888 0.0044 0.962038    
## Al       -0.64541 -0.76384 0.3168 0.017982 *  
## Fe       -0.56956 -0.82195 0.3536 0.018981 *  
## Mn        0.96802 -0.25088 0.0730 0.439560    
## Zn        0.96899  0.24711 0.0321 0.712288    
## Mo       -0.99233 -0.12362 0.0630 0.491508    
## Baresoil  0.51078  0.85971 0.5280 0.000999 ***
## Humdepth  0.81116  0.58483 0.2937 0.026973 *  
## pH       -0.34311 -0.93929 0.0881 0.333666    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
```
]
]

---

# Vector fitting

.row[
.col-6[
Values in `PC1` etc are *direction cosines* for vectors of unit length

`r2` is the squared correlation, `\(R^2\)`, between `\(\boldsymbol{y}\)` and the projection of the axis scores on to the vector

For factors it is `\(R^2 = 1 - \mathsf{SS_{W}} / \mathsf{SS_{T}}\)`

Permutation test shuffles the `\(\boldsymbol{y}\)` to generate the distribution of `\(R^2\)` under H<sub>0</sub>
]
.col-6[

```
## 
## ***VECTORS
## 
##               PC1      PC2     r2   Pr(>r)    
## N        -0.80597 -0.59196 0.0384 0.676324    
## P         0.78221 -0.62301 0.0335 0.696304    
## K         0.94909 -0.31501 0.0394 0.638362    
## Ca        0.87606  0.48220 0.1607 0.153846    
## Mg        0.83079  0.55658 0.1774 0.121878    
## S         0.86667 -0.49888 0.0044 0.962038    
## Al       -0.64541 -0.76384 0.3168 0.017982 *  
## Fe       -0.56956 -0.82195 0.3536 0.018981 *  
## Mn        0.96802 -0.25088 0.0730 0.439560    
## Zn        0.96899  0.24711 0.0321 0.712288    
## Mo       -0.99233 -0.12362 0.0630 0.491508    
## Baresoil  0.51078  0.85971 0.5280 0.000999 ***
## Humdepth  0.81116  0.58483 0.2937 0.026973 *  
## pH       -0.34311 -0.93929 0.0881 0.333666    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
```
]
]

---

# Vector fitting

.row[
.col-6[

``` r
plot(ev, add = FALSE) # oops bug!
```
]
.col-6[
![](slides_files/figure-html/plot-envfit-1-1.svg)<!-- -->
]
]

---

# Vector fitting

.row[
.col-6[

``` r
plot(pca, display = "sites", type = "n",
     scaling = "symmetric")
points(pca, display = "sites",
       scaling = "symmetric")
plot(ev, add = TRUE)
```
]
.col-6[
![](slides_files/figure-html/plot-envfit-2-1.svg)<!-- -->
]
]

---

# Smooth surfaces

`envfit()` fitted vectors, linear planes, to ordinations.

`ordisurf()` fits smooth surfaces using a GAM via package **mgcv**

`$$\hat{y_i} = f(x_{1i}, x_{2i})$$`

where `\(f()\)` is a *bivariate* smooth function of a pair of axis scores `\(x_{1i}\)` and `\(x_{2i}\)`

`ordisurf()` exposes a lot of functionality from `mgcv::gam()` and the smooths it can fit

---

## Smooth surfaces

Fitting a 10 basis function isotropic surface

.row[
.col-6[

``` r
surf <- ordisurf(pca ~ Ca, data = varechem,
                 knots = 10, isotropic = TRUE,
                 main = NULL)
```

`ordisurf()` plots by default

`surf` contains the fitted `gam()` model
]
.col-6[
![](slides_files/figure-html/ordisurf-1-1.svg)<!-- -->
]
]

---

# Smooth surfaces

``` r
summary(surf)
```

```
## 
## Family: gaussian
## Link function: identity
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   569.66      40.27   14.14 1.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df    F p-value  
## s(x1,x2) 3.517      9 1.34  0.0187 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.344   Deviance explained = 44.4%
## -REML = 158.56  Scale est. = 38924     n = 24
```

---

# Rarefaction

Species richness increases with sample size (effort)

Rarefaction gives the expected number of species rarefied from `\(N\)` to `\(n\)` individuals

`$$\hat{S}_n = \sum_{i=1}^S (1 - q_i) \; \mathsf{where} \; q_i = \frac{\binom{N - x_i}{n}}{\binom{N}{n}}$$`

`\(x_i\)` is the count of species `\(i\)` and `\(\binom{N}{n}\)` is a binomial coefficient — the number of ways to choose `\(n\)` from `\(N\)`
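
---

# Rarefaction — checking the formula

Not in the original deck — a quick sketch evaluating `\(\hat{S}_n\)` by hand with `choose()` for one BCI sample and comparing against vegan's `rarefy()`:

``` r
data(BCI, package = "vegan")

x <- as.numeric(BCI[1, ]) # counts for sample 1
x <- x[x > 0]             # drop absent species
N <- sum(x)               # individuals in the sample
n <- 100                  # rarefy to 100 individuals

## expected richness from the formula
sum(1 - choose(N - x, n) / choose(N, n))

## should agree with vegan
rarefy(x, sample = n)
```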

---

# Rarefaction

.row[
.col-6[

``` r
data(BCI, package = "vegan")
rs <- rowSums(BCI)
quantile(rs)
```

```
##    0%   25%   50%   75%  100% 
## 340.0 409.0 428.0 443.5 601.0
```

``` r
Srar <- rarefy(BCI, min(rs))
head(Srar)
```

```
##        1        2        3        4        5        6 
## 84.33992 76.53165 79.11504 82.46571 86.90901 78.50953
```

``` r
rarecurve(BCI, sample = min(rs))
```
]
.col-6[
![](slides_files/figure-html/rarefy-1.svg)<!-- -->
]
]

---

# Rarefaction

With rarefaction we can be accused of throwing away data

Yet we need to do something about the often large differences in variances of species and samples owing to sampling effort

---

# High-throughput data

Revolutions in biology & biotechnology have led to exponential increases in our capacity to generate data arising from the counting of biological molecules

* DNA sequencing,
* RNA-Seq — sequence RNA molecules in populations of cells or tissues,
* ChIP-Seq — sequence DNA molecules that are bound to particular proteins,
* …

Relatively cheaply, today we can generate data sets with thousands of variables on tens to hundreds of samples

---

# High-throughput data

Counts of such molecules present a statistical challenge

* the counts typically have a large *dynamic range*, covering many orders of magnitude
* over this very large range we observe changes in both variance (spread about the mean) and also in distribution of the data — **heterogeneity**
* like other count data, observations are integers and distributions are skewed
* the biotech methodology imparts systematic sampling biases into the data that we need to account for in an analysis — typically called **normalization** in this sub-field

---

# Sequence depth & size factors

The number of reads for each sample is a kind of *effort* variable

All else equal, the more reads we generate the more species (OTUs, ASVs, etc) we would expect to identify

If the number of reads differs between samples (libraries) then, all else being equal, we might assume the counts in different samples are proportional to one another, following some proportionality factor `\(s\)`

A simple estimate for `\(s\)` might be the total reads per sample

But we can do better than this …

---

# Sequence depth & size factors

.row[
.col-8[
A small dataset of 5 genes in 2 samples

Two views:

1. estimate `\(s_j\)` as `\(\sum_{i=1}^{m} x_{ij}\)`, blue line is the ratio of the `\(s_j\)`
2. instead, estimate `\(s_j\)` such that their ratio is the red line

In 1 we would conclude C is downregulated & A, B, D, & E are upregulated

2 is more parsimonious

`DESeq2::estimateSizeFactorsForMatrix()`
]
.col-4[
![](resources/chap7-rnaseq-normalization-1.png)<!-- -->

.smaller[Holmes & Huber (2019) Modern Statistics for Modern Biology]
]
]
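
---

# Size factors — a sketch

Not in the original deck — a minimal, hypothetical example of median-of-ratios size factors; the counts are invented so that every gene except `geneC` simply doubles between the two samples.

``` r
## genes in rows, samples in columns (DESeq2's convention)
counts <- matrix(c( 4,  8,
                   10, 20,
                   20, 42,
                    8, 16,
                    6, 12),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", LETTERS[1:5]),
                                 c("s1", "s2")))

## median-of-ratios size factors (requires DESeq2 from Bioconductor)
DESeq2::estimateSizeFactorsForMatrix(counts)

## naive library-size estimate, for comparison
colSums(counts) / mean(colSums(counts))
```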

---

# Variance stabilizing transformations

The square-root transformation is known as the _variance stabilizing transformation_ for the Poisson distribution

Taking the square root of observations that are distributed Poisson leads to *homoscedasticity*

Can construct variance stabilizing transformations for other distributions

High-throughput count data typically show extra-Poisson variation

`DESeq2::vst()`

---

# Regularized log transform

This transforms counts to a log<sub>2</sub>-like scale via a simple model

`$$\log_2(q_{ij}) = \sum_k x_{jk} \beta_{ik}$$`

where `\(q_{ij}\)` are the transformed data and `\(x_{jk}\)` is a particular design matrix with a dummy variable for the `\(j\)`th sample (for `\(i = 1, 2, \dots, m\)` variables)

Priors on the `\(\beta_{ik}\)` make the model identifiable

The _rlog_ transformation is ~ variance stabilizing, but handles highly varying size factors across samples

`DESeq2::rlog()`

---

# Transformations

Comparison of three transformations

![](resources/chap7-rnaseq-meansd-1.png)<!-- -->

.smaller[Holmes & Huber (2019) Modern Statistics for Modern Biology]

???

Figure 8.12: Per-gene standard deviation (sd, taken across samples) against the rank of the mean, for the shifted logarithm, the variance-stabilizing transformation (vst) and the rlog. Note that for the leftmost 2,500 genes, the counts are all zero, and hence their standard deviation is zero. The mean-sd dependence becomes more interesting for genes with non-zero counts. Note also the high value of the standard deviation for genes that are weakly detected (but not with all zero counts) when the shifted logarithm is used, and compare to the relatively flat shape of the mean-sd relationship for the variance-stabilizing transformation.

---

# Implications

Size factors can be used in GLMs etc to normalize samples via an `offset()` term in the formula

For ordination we can't easily use the size factors

Need to use _vst_ or _rlog_ instead

_rlog_ creates data that can't be used in CCA, so if you use it, RDA or db-RDA are the ordination methods to use
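
---

# VST then ordination — a sketch

Not in the original deck — a minimal sketch, assuming `counts` is a realistically sized genes-by-samples integer matrix and **DESeq2** is installed:

``` r
## variance-stabilise the counts; DESeq2::vst() is a faster,
## subsampling-based version for large matrices
vsd <- DESeq2::varianceStabilizingTransformation(counts)

## vegan wants samples in rows, so transpose before ordinating
ord <- rda(t(vsd))
```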

---

# Practicalities

---

# Rarefied counts

Can take a random draw of `\(n^{\prime}\)` out of the `\(N\)` individuals in a sample in proportion to species abundances

This yields rarefied counts which some people then go on to analyse

Be careful though as this is a random process — what if you took a different random sample?

---

# Rarefied counts

`avgdist()` in *vegan* tries to get around this by

1. taking a random draw to get rarefied counts
2. computing the dissimilarity between samples on the basis of the rarefied counts

Repeat that many times and then, as your dissimilarity `\(d_{ij}\)`, take the average of the many `\(d_{ij}^*\)` values you generated above — see the sketch on the next slide

See also the help page for other suggestions `?avgdist`
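
---

# Averaged rarefied dissimilarities — a sketch

Not in the original deck — a minimal example of `avgdist()` on the BCI data, rarefying to the smallest sample total:

``` r
data(BCI, package = "vegan")

set.seed(42) # the rarefaction draws are random
avg_dij <- avgdist(BCI, sample = min(rowSums(BCI)),
                   iterations = 50, dmethod = "bray")
```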

---
class: inverse middle center big-subsection

# plotting

---

# Better plotting

Ordination diagrams are often messy — the `plot()` methods are designed to get a quick plot of results

To produce better plots you need to know some base graphics skills *and* make use of some *vegan* helpers

---

# Points and labels

`ordipointlabel()` can draw points and labels

.row[
.col-6[

``` r
set.seed(10)
ordipointlabel(pca, display = "sites",
               scaling = "symmetric")
```

Iteratively finds space to draw labels so they don't overplot
]
.col-6[
![](slides_files/figure-html/ordipointlabel-1-1.svg)<!-- -->
]
]

---

# Points and labels

`ordipointlabel()` can also add to an existing plot

.row[
.col-6[

``` r
plot(pca, display = "sites",
     scaling = "symmetric", type = "n")
points(pca, display = "sites",
       scaling = "symmetric",
       pch = 19, col = "#025196")
set.seed(10)
ordipointlabel(pca, display = "sites",
               scaling = "symmetric",
               add = TRUE)
```
]
.col-6[
![](slides_files/figure-html/ordipointlabel-2-1.svg)<!-- -->
]
]

---

# Points and labels

`ordipointlabel()` can also add to an existing plot

.row[
.col-6[

``` r
disp <- "species"
scl <- "symmetric"
plot(pca, display = disp, scaling = scl,
     type = "n")
points(pca, display = disp, scaling = scl,
       pch = 19, col = "#025196")
set.seed(10)
ordipointlabel(pca, display = disp,
               scaling = scl, add = TRUE)
```
]
.col-6[
![](slides_files/figure-html/ordipointlabel-3-1.svg)<!-- -->
]
]

---

# Points and labels

How successful `ordipointlabel()` is depends on how much you plot & how big you plot it

.row[
.col-6[

``` r
disp <- c("sites", "species")
scl <- "symmetric"
plot(pca, display = disp, scaling = scl,
     type = "n")
points(pca, display = disp[1], scaling = scl,
       pch = 19, col = "#025196")
points(pca, display = disp[2], scaling = scl,
       pch = 19, col = "#fdb338")
set.seed(10)
ordipointlabel(pca, display = disp,
               scaling = scl, add = TRUE,
               col = c(1,1), cex = c(0.7, 0.7))
```
]
.col-6[
![](slides_files/figure-html/ordipointlabel-4-1.svg)<!-- -->
]
]

---

# Building up by layers

With base graphics you are in control of *everything*

.row[
.col-6[

``` r
data(dune, dune.env)
col_vec <- c("red", "blue", "orange", "grey")
disp <- "sites"
scl <- "symmetric"

ord <- rda(decostand(dune, method = "hellinger"))
plot(ord, type = "n", scaling = scl,
     display = disp)
cols <- with(dune.env, col_vec[Management])
points(ord, display = disp, scaling = scl,
       pch = 19, col = cols, cex = 2)
lvl <- with(dune.env, levels(Management))
legend("topright", legend = lvl, bty = "n",
       col = col_vec, pch = 19)
```
]
.col-6[
![](slides_files/figure-html/layer-plot-1-1.svg)<!-- -->
]
]

---

# Other utilities — `ordihull()`

Convex hulls around groups of data

.row[
.col-6[

``` r
disp <- "sites"
scl <- "symmetric"

plot(ord, type = "n", scaling = scl,
     display = disp)
ordihull(ord, groups = dune.env$Management,
         col = col_vec, scaling = scl, lwd = 2)
ordispider(ord, groups = dune.env$Management,
           col = col_vec, scaling = scl,
           label = TRUE)
points(ord, display = disp, scaling = scl,
       pch = 21, col = "red", bg = "yellow")
```
]
.col-6[
![](slides_files/figure-html/hull-1-1.svg)<!-- -->
]
]

---

# Other utilities — `ordiellipse()`

Draws ellipsoid hulls & standard error & deviation ellipses

.row[
.col-6[

``` r
disp <- "sites"
scl <- "symmetric"

plot(ord, type = "n", scaling = scl,
     display = disp)
## ellipsoid hull
ordiellipse(ord, groups = dune.env$Management,
            kind = "ehull", col = col_vec,
            scaling = scl, lwd = 2)
## standard error of centroid ellipse
ordiellipse(ord, groups = dune.env$Management,
            draw = "polygon", col = col_vec,
            scaling = scl, lwd = 2)
ordispider(ord, groups = dune.env$Management,
           col = col_vec, scaling = scl,
           label = TRUE)
points(ord, display = disp, scaling = scl,
       pch = 21, col = "red", bg = "yellow")
```
]
.col-6[
![](slides_files/figure-html/ellipse-1-1.svg)<!-- -->
]
]

---

# Any love for *ggplot*?

The **ggvegan** package is in development

.row[
.col-6[

``` r
## library('remotes')
## install_github("gavinsimpson/ggvegan")
library('ggvegan')
library('ggplot2')
autoplot(ord)
```
]
.col-6[
![](slides_files/figure-html/ggvegan-1.svg)<!-- -->
]
]

---

# Any love for *ggplot*?

The **ggvegan** package is in development

``` r
# Install src from github
remotes::install_github("gavinsimpson/ggvegan")

# or binary from R-universe:
# Enable repository from gavinsimpson
options(repos = c(
    gavinsimpson = 'https://gavinsimpson.r-universe.dev',
    CRAN = 'https://cloud.r-project.org'))
# Download and install ggvegan in R
install.packages('ggvegan')
```

---

# ggvegan

.row[
.col-6[

``` r
ford <- fortify(ord, axes = 1:2)
## not yet a tibble
head(ford, 4)
```

```
##     score    label         PC1         PC2
## 1 species Achimill  0.22413552 -0.05704791
## 2 species Agrostol -0.41424457  0.18006829
## 3 species Airaprae  0.03506944 -0.15464367
## 4 species Alopgeni -0.13865500  0.31844362
```

``` r
library('dplyr') # for filter() & the pipe
blue <- "#025196"
filter(ford, score == "sites") %>%
    ggplot(aes(x = PC1, y = PC2)) +
    geom_hline(yintercept = 0, colour = blue) +
    geom_vline(xintercept = 0, colour = blue) +
    geom_point() +
    coord_fixed()
```
]
.col-6[
![](slides_files/figure-html/ggvegan-2-1.svg)<!-- -->
]
]