class: inverse, middle, left, my-title-slide, title-slide

.title[
# Dissimilarity & Unconstrained ordination
]

.author[
### Gavin Simpson
]

.date[
### August 6, 2024
]

---
class: inverse middle center big-subsection

# Welcome

---

# Logistics

* Dissimilarity & unconstrained ordination
* Constrained ordination
* Permutation tests

---

# Today's topics

* Dissimilarity
* Unconstrained ordination
* Principal Components Analysis (PCA)
* Correspondence Analysis (CA)
* Principal Coordinates Analysis (PCO or PCoA)
* Non-metric Multidimensional Scaling (NMDS)
* Practical tips for working & plotting ordinations using **vegan**

---

# Multivariate data

---

# Multivariate data

Multivariate != multiple regression

Multivariate means we have two or more *response* variables

We are interested in learning about the common patterns or modes of variation among those multiple response variables

Multivariate data require special statistical methods

---

# Multivariate data in biology

Community composition — many species as the responses

In modern biology we have OTUs and ASVs

In chemistry we have metabolites or spectra or masses or …

All of these constitute multivariate data

---

# Species composition

Species composition is the set of species in a site or a sample

Typically comes with a measure of abundance

"abundance" could also be whether each species is present or absent in the sample

Relative abundance expresses the abundance of each species as its proportion of the total abundance in each sample

---

# Data box

<img src="./resources/data-box-cropped.png" width="50%" style="display: block; margin: auto;" />

---

# Q and R mode

Measuring dependence between variables (descriptors; species, environment) is typically done using coefficients like Pearson's `\(r\)`

Hence this mode of analysis is *R* mode

*Q* mode relates to methods that focus on dependencies among objects (samples)

Often difficult to identify the mode; PCA starts from a dispersion matrix of variables (R mode) but provides an ordination of the samples (Q mode)

---
class: inverse middle center big-subsection

# Dissimilarity

---

# Measuring association — binary data

.row[
.col-6[
<table>
<tr> <td> </td> <td colspan=3>Object <i>j</i></td> </tr>
<tr> <td rowspan=3>Object <i>i</i></td> <td> </td> <td> + </td> <td> - </td> </tr>
<tr> <td> + </td> <td> a </td> <td> b </td> </tr>
<tr> <td> - </td> <td> c </td> <td> d </td> </tr>
</table>
]
.col-6[
Dissimilarity based on the number of species present only in `\(i\)` ( `\(b\)` ), or only in `\(j\)` ( `\(c\)` ), or present in both ( `\(a\)` ), or absent in both ( `\(d\)` ).
]
]
.row[
.col-6[
Jaccard similarity
`$$s_{ij} = \frac{a}{a + b + c}$$`
Jaccard dissimilarity
`$$d_{ij} = \frac{b + c}{a + b + c}$$`
]
.col-6[
Simple matching coefficient
`$$s_{ij} = \frac{a + d}{a + b + c + d}$$`
Simple matching dissimilarity
`$$d_{ij} = \frac{b + c}{a + b + c + d}$$`
]
]
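
---

# Binary coefficients in vegan

Not part of the original deck — a minimal sketch of the Jaccard dissimilarity computed from a made-up presence–absence matrix; `vegdist()` is covered properly later.

``` r
library("vegan")

## presence (1) / absence (0) of five species at two sites
## here a = 2, b = 1, c = 1, d = 1
x <- rbind(site_i = c(1, 1, 0, 1, 0),
           site_j = c(1, 0, 1, 1, 0))

## Jaccard dissimilarity: (b + c) / (a + b + c) = 2 / 4 = 0.5
vegdist(x, method = "jaccard", binary = TRUE)
```

`binary = TRUE` converts the data to presence–absence first; note that `\(d\)` plays no part in the Jaccard coefficient — joint absences are skipped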

---

# Dissimilarity

.row[
.col-6[
![](slides_files/figure-html/dissimilarity-1.svg)<!-- -->
]
.col-6[
`$$d_{ij} = \sqrt{\sum\limits^m_{k=1}(x_{ik} - x_{jk})^2}$$`

`$$d_{ij} = \sum\limits^m_{k=1}|x_{ik} - x_{jk}|$$`

`$$d_{ij} = \frac{\sum\limits^m_{k=1}|x_{ik} - x_{jk}|}{\sum\limits^m_{k=1}(x_{ik} + x_{jk})}$$`
]
]

---

# Measuring association — quantitative data

* Euclidean distance dominated by large values
* Manhattan distance less affected by large values
* Bray-Curtis treats all species with equal weight
* Similarity ratio (Steinhaus-Marczewski `\(\equiv\)` Jaccard) less dominated by extremes
* Chord distance, used for proportional data; _signal-to-noise_ measure

.row[
.col-6[
Similarity ratio
`\(s_{ij} = \frac{\sum\limits^m_{k=1}x_{ik}x_{jk}}{\sum\limits^m_{k=1}x_{ik}^2 + \sum\limits^m_{k=1}x_{jk}^2 - \sum\limits^m_{k=1}x_{ik}x_{jk}}\)`
]
.col-6[
Chord distance
`\(d_{ij} = \sqrt{\sum\limits^m_{k=1}(\sqrt{p_{ik}} - \sqrt{p_{jk}})^2}\)`
]
]

---

# Measuring association — mixed data

.row[
.col-8[
.small[
* `\(s_{ijk}\)` is the similarity between sites `\(i\)` and `\(j\)` for the `\(k\)`th variable.
* Weights `\(w_{ijk}\)` are typically 0 or 1 depending on whether the comparison is valid for variable `\(k\)`. Can also use variable weighting with `\(w_{ijk}\)` between 0 and 1.
* `\(w_{ijk}\)` is zero if the `\(k\)`th variable is missing for one or both of `\(i\)` or `\(j\)`.
* For binary variables `\(s_{ijk}\)` is the Jaccard coefficient.
* For categorical data `\(s_{ijk}\)` is 1 if `\(i\)` and `\(j\)` have the same category, 0 otherwise.
* For quantitative data `\(s_{ijk} = 1 - |x_{ik} - x_{jk}| / R_k\)`, where `\(R_k\)` is the range of variable `\(k\)`.
]
]
.col-4[
Gower's coefficient
`$$s_{ij} = \frac{\sum\limits^m_{k=1} w_{ijk}s_{ijk}}{\sum\limits^m_{k=1} w_{ijk}}$$`
]
]
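
---

# Gower's coefficient in R

Not in the original deck — a minimal sketch using `daisy()` from the recommended **cluster** package, which implements Gower's coefficient for mixed data; the tiny data frame is hypothetical.

``` r
library("cluster")

## hypothetical mixed data: quantitative, binary, categorical
df <- data.frame(ph   = c(4.2, 6.5, 5.1),
                 bare = factor(c("yes", "no", "yes")),
                 lith = factor(c("granite", "basalt", "granite")))

## Gower dissimilarity, 1 - s_ij; missing values get w_ijk = 0
daisy(df, metric = "gower")
```

By default factors are handled by simple matching; see the `type` argument for treating binary variables asymmetrically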

---

# Double zero problem

<img src="./resources/response-curve.png" width="90%" style="display: block; margin: auto;" />

---

# Double zero problem

If a species is present at sites `\(i\)` and `\(j\)` we take this as a reflection of the two sites' similarity

If a species is *absent* from one of `\(i\)` or `\(j\)` we take this as a reflection of some ecological difference between the two sites

But, what if the species is absent from *both* `\(i\)` and `\(j\)`?

--

The species could be absent for different reasons

* too hot in `\(i\)` and too cold in `\(j\)`
* too dry in `\(i\)` and too wet in `\(j\)`

---

# Double zero problem

We may choose not to attach ecological meaning to a joint or *double* absence when computing association

Coefficients that skip double zeroes are *asymmetric*

Coefficients that do not skip double zeroes are *symmetric*

---

# Metrics

Three types of quantitative distance coefficient

1. Metric
2. Semi-metric
3. Non-metric

The groups pertain to how the coefficient obeys certain metric rules

Some use the term "distance coefficient" to refer only to those that are metric, using "dissimilarity" for the semi- and non-metrics

---

# Metric coefficients

Obey four rules

1. minimum = 0: if `\(i = j\)` then `\(d_{ij} = 0\)`
2. positivity: if `\(i \neq j\)` then `\(d_{ij} \gt 0\)`
3. symmetry: `\(d_{ij} = d_{ji}\)`
4. triangle inequality: `\(d_{ij} + d_{jk} \ge d_{ik}\)`

The last rule states that the *sum* of 2 sides of a triangle in Euclidean space is necessarily `\(\ge\)` the 3rd side

---

# Other types

Semi-metrics do not obey the triangle inequality rule

Non-metrics do not obey the positivity rule

---

# Species abundance paradox

|   | y1| y2| y3|
|:--|--:|--:|--:|
|x1 |  0|  4|  8|
|x2 |  0|  1|  1|
|x3 |  1|  0|  0|

Euclidean distances:

|   |       x1|       x2|       x3|
|:--|--------:|--------:|--------:|
|x1 | 0.000000| 7.615773| 9.000000|
|x2 | 7.615773| 0.000000| 1.732051|
|x3 | 9.000000| 1.732051| 0.000000|

---

# Species abundance paradox

|   | y1| y2| y3|
|:--|--:|--:|--:|
|x1 |  0|  4|  8|
|x2 |  0|  1|  1|
|x3 |  1|  0|  0|

Chi-square distances:

|   |        x1|        x2|       x3|
|:--|---------:|---------:|--------:|
|x1 | 0.0000000| 0.3600411| 4.009249|
|x2 | 0.3600411| 0.0000000| 4.020779|
|x3 | 4.0092486| 4.0207794| 0.000000|
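
---

# Species abundance paradox — in R

Not in the original deck — a minimal sketch reproducing the two tables above. Note the paradox: x1 and x2 share no species with x3, yet Euclidean distance places x2 closest to x3. Going via `decostand()` is one way to get the chi-square distances.

``` r
library("vegan")

x <- matrix(c(0, 4, 8,
              0, 1, 1,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("x", 1:3), paste0("y", 1:3)))

dist(x)                           # Euclidean distances
dist(decostand(x, "chi.square"))  # chi-square distances
```

Recent versions of **vegan** also offer `vegdist(x, method = "chisq")`, though check its scaling against the `decostand()` route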

---

# Transformations

* Can transform the variables (e.g. species) or the samples to improve the gradient separation of the dissimilarity coefficient.
* No transformation of variables or samples leads to a situation of quantity domination — big values dominate `\(d_{ij}\)`.
* Normalise samples — gives all samples equal weight.
* Normalise variables;
    * gives all variables equal weight,
    * inflates the influence of rare species.
* Double (_Wisconsin_) transformation; normalise variables then samples — see the sketch on the next slide.
* Noy-Meir _et al_. (1975) _J. Ecology_ **63**; 779--800.
* Faith _et al_. (1987) _Vegetatio_ **69**; 57--68.
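
---

# Transformations — in R

Not in the original deck — a quick sketch of these standardisations with vegan's `decostand()` (introduced properly on the next slide) and `wisconsin()`.

``` r
library("vegan")
data(varespec)

## normalise samples: relative abundance per site
prop  <- decostand(varespec, method = "total")

## normalise variables: species scaled to unit maximum
spmax <- decostand(varespec, method = "max")

## double (Wisconsin) standardisation: species maxima, then site totals
wisc  <- wisconsin(varespec)
```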

---

# Dissimilarity

Two key functions

1. `vegdist()`
2. `decostand()`

``` r
library("vegan") # provides vegdist() & decostand()
data(varespec)
euc_dij <- vegdist(varespec, method = "euclidean")
bc_dij <- vegdist(varespec)
hell_dij <- vegdist(decostand(varespec, method = "hellinger"),
                    method = "euclidean")
```

---
class: inverse middle center big-subsection

# Ordination

---
class: inverse middle center big-subsection

# ordnung

???

The word "ordination" comes from the German word "Ordnung"

---

# Ordination

Putting things in order is exactly what we do in ordination

* we arrange our samples along gradients by fitting lines and planes through the data that describe the main patterns in those data
* we map data to lower dimensions reflecting how similar the samples are to one another in terms of the variables measured

Three families of models

1. Linear
2. Unimodal
3. Distance-based

---
class: inverse middle center big-subsection

# Unconstrained

---

# Unconstrained

What is .text-highlight[unconstrained]?

First we look for major variation, then relate it to environmental variation

vs. *constrained* ordination, where we only want to see what can be explained by environmental variables of interest

*How well do we explain the main patterns in the species data?* vs *How large are the patterns we can explain with the measured data?*

---

# Ordination methods

Principal Components Analysis (PCA) is a linear method — most useful for environmental data or sometimes with species data and short gradients

Correspondence Analysis (CA) is a unimodal method — most useful for species data, especially where non-linear responses are observed

Principal Coordinates Analysis (PCO) and Non-metric Multidimensional Scaling (NMDS) — can be used for any kind of data

---
class: inverse middle center huge-subsection

# PCA

---

# Principal Components Analysis

Instead of doing many regressions, do one with all the responses

No explanatory variables — uncover latent, underlying gradients

PCA fits a line through our cloud of data in such a way that it maximises the variance in the data captured by that line (i.e. minimises the distance between the fitted line and the observations)

Fit a second line to form a plane, and so on, until we have one PCA axis for every dimension of the data

Each of these subsequent axes is uncorrelated with previous axes — they are .text-highlight[orthogonal] — the variance each axis explains is uncorrelated

---

# Principal Components Analysis

.center-fig[
![:scale 75% © HappyWaldo CC BY-SA](./resources/Sherrin_footy.jpg)
]

---

# Principal Components Analysis

.center[
![](resources/pca-anim.gif)
]

.row[
.col-4[
![](slides_files/figure-html/pc-anim-plt1-1.svg)<!-- -->
]
.col-4[
![](slides_files/figure-html/pc-anim-plt2-1.svg)<!-- -->
]
.col-4[
![](slides_files/figure-html/pc-anim-plt3-1.svg)<!-- -->
]
]

---

# Principal Components Analysis

.center[![:scale 64%](resources/pca-anim.gif)]

---

# Load vegan

**vegan** is an add-on package

``` r
## install.packages("vegan") # Only need if you've never installed before
library("vegan")
data(varespec)
data(varechem)
```

**vegan** comes with a number of data sets which we'll use to get started

---

# Vegetation in lichen pastures — species

``` r
class(varespec)
```

```
## [1] "data.frame"
```

``` r
dim(varespec) # number of samples, species
```

```
## [1] 24 44
```

``` r
head(varespec[, 1:6], n = 5)
```

```
##    Callvulg Empenigr Rhodtome Vaccmyrt Vaccviti Pinusylv
## 18     0.55    11.13     0.00     0.00    17.80     0.07
## 15     0.67     0.17     0.00     0.35    12.13     0.12
## 24     0.10     1.55     0.00     0.00    13.47     0.25
## 27     0.00    15.13     2.42     5.92    15.97     0.00
## 23     0.00    12.68     0.00     0.00    23.73     0.03
```

`varespec` is a data frame

* *Variables* are the columns (here the species)
* *Observations* are the rows (the samples, sites, etc)

???

Data are cover values of 44 understorey species recorded at 24 locations in lichen pastures within dry *Pinus sylvestris* forests

This arrangement is back-to-front if you are used to OTU-like data

---

# Vegetation in lichen pastures — chemistry

Also have associated soil physical and chemical measurements at the 24 sites

``` r
head(varechem)
```

```
##       N    P     K    Ca    Mg    S    Al   Fe    Mn   Zn  Mo Baresoil Humdepth
## 18 19.8 42.1 139.9 519.4  90.0 32.3  39.0 40.9  58.1  4.5 0.3     43.9      2.2
## 15 13.4 39.1 167.3 356.7  70.7 35.2  88.1 39.0  52.4  5.4 0.3     23.6      2.2
## 24 20.2 67.7 207.1 973.3 209.1 58.1 138.0 35.4  32.1 16.8 0.8     21.2      2.0
## 27 20.6 60.8 233.7 834.0 127.2 40.7  15.4  4.4 132.0 10.7 0.2     18.7      2.9
## 23 23.8 54.5 180.6 777.0 125.8 39.5  24.2  3.0  50.1  6.6 0.3     46.0      3.0
## 19 22.8 40.9 171.4 691.8 151.4 40.8 104.8 17.6  43.6  9.1 0.4     40.5      3.8
##     pH
## 18 2.7
## 15 2.8
## 24 3.0
## 27 2.8
## 23 2.7
## 19 2.7
```

???

`Baresoil` is the estimated cover of bare soil

`Humdepth` is the depth of the humus layer

---

# Vegetation in lichen pastures — PCA

PCA is fitted using `rda()`

* Provide a data frame of observations on one or more variables
* To scale all variables to be μ = 0, σ<sup>2</sup> = 1: `scale = TRUE`

``` r
pca <- rda(decostand(varespec, method = "hellinger"), scale = TRUE)
pca
```

```
## Call: rda(X = decostand(varespec, method = "hellinger"), scale = TRUE)
## 
##               Inertia Rank
## Total              44     
## Unconstrained      44   23
## Inertia is correlations
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 8.603 5.134 4.576 3.714 3.245 2.779 2.626 2.221 
## (Showing 8 of 23 unconstrained eigenvalues)
```

---

# Vegetation in lichen pastures — PCA

PCA of the covariance matrix — default is `scale = FALSE`

``` r
rda(decostand(varespec, method = "hellinger"), scale = FALSE)
```

```
## Call: rda(X = decostand(varespec, method = "hellinger"), scale = FALSE)
## 
##               Inertia Rank
## Total          0.3647     
## Unconstrained  0.3647   23
## Inertia is variance
## 
## Eigenvalues for unconstrained axes:
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.14586 0.07908 0.02866 0.02446 0.02209 0.01263 0.01179 0.00873 
## (Showing 8 of 23 unconstrained eigenvalues)
```

How **vegan** scales the eigenvalues is different to *Canoco*

---

# Vegetation in lichen pastures — PCA

``` r
biplot(pca, scaling = "symmetric")
```

<img src="slides_files/figure-html/vare-pca-1c-1.svg" width="95%" />

---

# PCA biplots

.row[
.col-6[
* Sample (species) points plotted close together have similar species compositions (occur together)
* In PCA, species scores often drawn as arrows — point in direction of increasing abundance
* Species arrows with small angles to an axis are highly correlated with that axis
]
.col-6[

``` r
biplot(pca, scaling = "symmetric")
```

![](slides_files/figure-html/vare-pca-2-1.svg)<!-- -->
]
]

---

# Eigenvalues λ

.row[
.col-6[
Eigenvalues are the amount of variance (inertia) explained by each axis

``` r
head(eigenvals(pca), 5)
```

```
##      PC1      PC2      PC3      PC4      PC5 
## 8.602826 5.133623 4.575623 3.713926 3.244925
```
]
.col-6[

``` r
screeplot(pca, bstick = TRUE, type = "l", main = NULL)
```

![](slides_files/figure-html/vare-pca-4-1.svg)<!-- -->
]
]

---

# Eigenvalues λ

The `summary()` method provides additional information

``` r
summary(eigenvals(pca))
```

```
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Eigenvalue            8.6028 5.1336 4.5756 3.71393 3.24492 2.77919 2.62560
## Proportion Explained  0.1955 0.1167 0.1040 0.08441 0.07375 0.06316 0.05967
## Cumulative Proportion 0.1955 0.3122 0.4162 0.50059 0.57434 0.63750 0.69718
##                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Eigenvalue            2.22100 1.67857 1.65627 1.30432 1.03472 0.91190 0.87323
## Proportion Explained  0.05048 0.03815 0.03764 0.02964 0.02352 0.02073 0.01985
## Cumulative Proportion 0.74765 0.78580 0.82344 0.85309 0.87660 0.89733 0.91718
##                         PC15   PC16    PC17    PC18     PC19     PC20     PC21
## Eigenvalue            0.7612 0.6336 0.52021 0.51641 0.408841 0.267175 0.206336
## Proportion Explained  0.0173 0.0144 0.01182 0.01174 0.009292 0.006072 0.004689
## Cumulative Proportion 0.9345 0.9489 0.96070 0.97243 0.981727 0.987799 0.992488
##                           PC22     PC23
## Eigenvalue            0.179715 0.150795
## Proportion Explained  0.004084 0.003427
## Cumulative Proportion 0.996573 1.000000
```
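
---

# PCA — cross-checking with prcomp()

Not in the original deck — a sketch cross-checking `rda()` against base R's `prcomp()`; for a correlation-matrix PCA the eigenvalues should agree (up to numerical precision).

``` r
hell <- decostand(varespec, method = "hellinger")

pc <- prcomp(hell, scale. = TRUE)
head(pc$sdev^2, 5)      # variances of the principal components

head(eigenvals(pca), 5) # eigenvalues from rda(..., scale = TRUE)
```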

---
class: inverse middle center subsection

# Correspondence Analysis

---

# Correspondence Analysis

Correspondence analysis (CA) is very similar to PCA — a weighted form of PCA

The row and column sums are used as weights and this has the effect of turning the analysis into one of relative composition

The weighting is a trick to get linear-based software to fit non-linear responses

These non-linear responses are assumed to be unimodal Gaussian curves, all with equal height and tolerance widths, and equally spaced optima

---

# Correspondence Analysis

![](slides_files/figure-html/unnamed-chunk-1-1.svg)<!-- -->

---

# Correspondence Analysis

Correspondence analysis (CA) is very similar to PCA — a weighted form of PCA

The row and column sums are used as weights and this has the effect of turning the analysis into one of relative composition

The weighting is a trick to get linear-based software to fit non-linear responses

These non-linear responses are assumed to be unimodal Gaussian curves, all with equal height and tolerance widths, and equally spaced optima

So, not very realistic, but it is surprisingly robust at times to violation of this assumption

---

# Vegetation in lichen pastures — CA biplots

.row[
.col-7[
- Have two sets of scores
    1. Species scores
    2. Site scores
- Sample (species) points plotted close together have similar species compositions (occur together)
- In CA, species scores drawn as points — these are the fitted optima along the gradients
- Abundance of species declines in concentric circles away from the optima
]
.col-5[

``` r
ca <- cca(varespec)
plot(ca)
```

![](slides_files/figure-html/vare-ca-1-1.svg)<!-- -->
]
]

---

# Vegetation in lichen pastures — CA biplots

- Species scores plotted as weighted averages of site scores, or
- Site scores plotted as weighted averages of species scores, or
- A symmetric plot

![](slides_files/figure-html/vare-ca-2-1.svg)<!-- -->

---
class: inverse middle center huge-subsection

# vegan

---

# Vegan basics

- The majority of vegan functions work with a single vector, or more commonly an entire data frame
- This data frame may contain the species abundances
- Where subsidiary data is used/required, these too are supplied as data frames
- For example; the environmental constraints in a CCA
- It is not a problem if you have all your data in a single file/object; just subset it into two data frames after reading it into R

``` r
spp <- allMyData[, 1:20] ## columns 1-20 contain the species data
env <- allMyData[, 21:26] ## columns 21-26 contain the environmental data
```

---

<!--
# Simple vegan usage

First we start with a simple correspondence analysis (CA) to illustrate the basic features

Here I am using one of the in-built data sets on lichen pastures

For various reasons to fit a CA we use the `cca()` function

Store the fitted CA in `ca1` and print it to view the results

``` r
ca1 <- cca(varespec)
ca1
```

```
## Call: cca(X = varespec)
## 
##               Inertia Rank
## Total           2.083     
## Unconstrained   2.083   23
## Inertia is scaled Chi-square
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
## 0.5249 0.3568 0.2344 0.1955 0.1776 0.1216 0.1155 0.0889 
## (Showing 8 of 23 unconstrained eigenvalues)
```

---
-->

# `scores()` & `scaling`

- When we draw the results of many ordinations we display 2 or more sets of data
- Can't display all of these and maintain relationships between the scores
- Solution; scale one set of scores relative to the other
- Controlled via the `scaling` argument

---

# Scaling

How we scale scores is controlled via the `scaling` argument

- `scaling = 1` — Focus on sites, scale site scores by `\(\lambda_i\)`
- `scaling = 2` — Focus on species, scale species scores by `\(\lambda_i\)`
- `scaling = 3` — Symmetric scaling, scale both scores by `\(\sqrt{\lambda_i}\)`
- `scaling = -1` — As above, but for `rda()` get correlation scores
- `scaling = -2` — for `cca()` multiply results by `\(\sqrt{(1/(1-\lambda_i))}\)`
- `scaling = -3` — this is Hill's scaling
- `scaling < 0` — For `rda()` divide species scores by species' `\(\sigma\)`
- `scaling = 0` — raw scores

---

# No one can remember all that…

Use the text names instead:

* `scaling = "none"` means `scaling = 0`
* `scaling = "sites"` means `scaling = 1`
* `scaling = "species"` means `scaling = 2`
* `scaling = "symmetric"` means `scaling = 3`

For PCA (`rda()`) use `correlation = TRUE` to get the correlation scores (negative scaling)

For CA (`cca()`) use `hill = TRUE` to get Hill's scaling scores (negative scaling)

---

# Extractor functions — scores()

Don't rummage around in the objects returned by *vegan* functions — unless you know what you're doing

``` r
head(scores(pca, choices = 1:3, display = "species",
            scaling = "species", correlation = TRUE))
```

```
##                 PC1        PC2         PC3
## Callvulg -0.2896130  0.1338259  0.18460432
## Empenigr  0.5732742  0.2057975 -0.24778607
## Rhodtome  0.9262883 -0.2248872 -0.28211297
## Vaccmyrt  0.8558251 -0.2508985 -0.33438598
## Vaccviti  0.6800662  0.5703573  0.02059066
## Pinusylv  0.2530739  0.2280044  0.80172865
```

---

# Basic ordination plots — plot()

.row[
.col-6[
- `choices = 1:2` — which axes?
- `scaling = 3` — scaling to use
- `display = c("sites","species")` — which scores (default is both)
- `type = "text"` — display scores using labels or points (`"points"`)
- Other graphics arguments can be supplied but apply to all scores
]
.col-6[

``` r
plot(ca1, choices = c(1,3), scaling = "symmetric")
```

![](slides_files/figure-html/ca_biplot-1.svg)<!-- -->
]
]

---
class: inverse center middle subsection

# Distance-based methods

---
class: inverse center middle huge-subsection

# PCO

---

# Principal Coordinates Analysis

PCoA (or PCO, or **metric** multidimensional scaling (MDS)) finds a mapping to Euclidean space of `\(n\)` objects using the `\(n\)` by `\(n\)` matrix of dissimilarities `\(d_{ij}\)`

PCoA is an eigen decomposition like PCA

* first axis is the best 1D mapping of the dissimilarities
* subsequent axes are orthogonal to the first, but improve the mapping by smaller & smaller amounts

Can use *any* dissimilarity coefficient (with a big **but**)

PCoA on a Euclidean distance matrix ⇨ PCA (without species scores)

---

# Principal Coordinates Analysis

The big **but** is that not all dissimilarity coefficients can be represented in Euclidean space

If the dissimilarity matrix is metric we're OK — *usually*

If not metric, we get negative eigenvalues ⇨ these correspond to distances in imaginary space

Distortion can be measured as

`$$\frac{\sum \left | \lambda^{-} \right |}{\sum \left | \lambda \right |}$$`
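
---

# PCoA — measuring the distortion

Not in the original deck — a minimal sketch computing the distortion ratio above from the eigenvalues of an uncorrected PCoA of the Bray-Curtis dissimilarities (the same fit is shown again on the following slides).

``` r
pco_raw <- wcmdscale(vegdist(varespec), eig = TRUE)
ev <- eigenvals(pco_raw)

## sum of |negative eigenvalues| over sum of |all eigenvalues|
sum(abs(ev[ev < 0])) / sum(abs(ev))
```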

---

# PCoA — correcting negative λ

1. Could square root transform the `\(d_{ij}\)`
2. Add a sufficiently large constant to `\(d_{ij}\)` or `\(d_{ij}^2\)`
    * **Lingoes** method: `\(\hat{d}_{ij} = \sqrt{d_{ij}^2 + 2c_1}\)` for `\(i \neq j\)` where `\(c_1\)` is `\(\mathsf{max} \left ( \left | \lambda_i^{-} \right | \right )\)`
    * **Cailliez** method: `\(\hat{d}_{ij} = d_{ij} + c_2\)` for `\(i \neq j\)` where `\(c_2\)` is computed from a special matrix formed during the PCoA calculations

---

# Principal Coordinates Analysis

Default dissimilarity in `vegdist()` is Bray-Curtis

``` r
pco1 <- wcmdscale(vegdist(varespec), eig = TRUE)
round(eigenvals(pco1), 3)
```

```
##  [1]  1.755  1.133  0.443  0.370  0.245  0.196  0.175  0.128  0.097  0.076
## [11]  0.064  0.058  0.039  0.017  0.005  0.000 -0.006 -0.013 -0.025 -0.038
## [21] -0.048 -0.054 -0.074
```

``` r
pco2 <- wcmdscale(vegdist(varespec), eig = TRUE, add = "lingoes")
round(eigenvals(pco2), 3)
```

```
##  [1] 1.829 1.208 0.517 0.444 0.319 0.270 0.249 0.203 0.171 0.150 0.138 0.132
## [13] 0.114 0.091 0.079 0.074 0.068 0.061 0.049 0.037 0.026 0.020
```

*Note there's one fewer dimension after correction*

---

# Principal Coordinates Analysis

Can plot the PCoA using `plot()` and add *species scores*

.row[
.col-6[

``` r
pco <- wcmdscale(vegdist(varespec), eig = TRUE)

## plot
plot(pco)

## get PCoA scores
scrs <- scores(pco, choices = 1:2)

## take WA of PCoA scores,
## weight by abundance
spp_scrs <- wascores(scrs, varespec,
                     expand = FALSE)

## add
points(spp_scrs, col = "red", pch = 19)
```
]
.col-6[
![](slides_files/figure-html/pco-example-2-1.svg)<!-- -->
]
]
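
---

# PCoA — Cailliez correction

Not in the original deck — the Lingoes correction was shown above; `wcmdscale()` also accepts `add = "cailliez"`, a quick sketch:

``` r
pco3 <- wcmdscale(vegdist(varespec), eig = TRUE, add = "cailliez")
round(eigenvals(pco3), 3) # no negative eigenvalues should remain
```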

---
class: inverse center middle huge-subsection

# NMDS

---

# Non-Metric Multidimensional Scaling

NMDS finds a low-dimensional mapping that preserves as best as possible the .text-highlight[rank order] of the original dissimilarities `\(d_{ij}\)`

The solution with minimal `stress` is sought; a measure of how well the NMDS mapping fits the `\(d_{ij}\)`

Stress is the sum of squared residuals of a monotonic regression between the distances in NMDS space, `\(d^{*}_{ij}\)`, & the `\(d_{ij}\)`

This monotonic (non-linear) regression can cope with non-linear responses in species data

Iterative solution; *convergence is not guaranteed*

Must fit solutions separately for each dimensionality

---

# Non-Metric Multidimensional Scaling

- Use an appropriate dissimilarity metric that gives good gradient separation `rankindex()`, e.g. Bray-Curtis, Jaccard, Kulczynski
- Wisconsin transformation useful; standardize species to equal maxima, then sites to equal totals `wisconsin()`
- Use many random starts and look at the fits with lowest stress (`try` & `trymax`)
- Only conclude a solution is reached if the lowest-stress solutions are similar (Procrustes rotation — see the sketch at the end of this section)
- Fit NMDS for 1, 2, 3, … dimensions; stop after a sudden drop in stress observed in a screeplot
- NMDS solutions can be rotated; common to rotate to PCs
- Also scale axes in half-change units

???

Half-change units: samples separated by a distance of 1 correspond, on average, to a 50% turnover in composition

---

# NMDS in vegan

**vegan** implements all these ideas via the `metaMDS()` wrapper

``` r
data(dune)
set.seed(10)
(sol <- metaMDS(dune, trace = FALSE))
```

```
## 
## Call:
## metaMDS(comm = dune, trace = FALSE)
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune
## Distance: bray
## 
## Dimensions: 2
## Stress:     0.1183186
## Stress type 1, weak ties
## Best solution was repeated 6 times in 20 tries
## The best solution was from try 2 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'dune'
```

---

# NMDS in vegan

If no convergent solutions, continue iterations from the previous best solution

``` r
(sol <- metaMDS(dune, previous.best = sol, trace = FALSE))
```

```
## 
## Call:
## metaMDS(comm = dune, trace = FALSE, previous.best = sol)
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune
## Distance: bray
## 
## Dimensions: 2
## Stress:     0.1183186
## Stress type 1, weak ties
## Best solution was repeated 16 times in 40 tries
## The best solution was from try 2 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'dune'
```

---

# NMDS in vegan

``` r
layout(matrix(1:2, ncol = 2))
plot(sol, main = "Dune NMDS plot")
stressplot(sol, main = "Shepard plot")
layout(1)
```

![](slides_files/figure-html/nmds3-1.svg)<!-- -->

???

Shepard diagram which is a plot of ordination distances and monotone or linear fit line against original dissimilarities. In addition, it displays two correlation-like statistics on the goodness of fit in the graph. The nonmetric fit is based on stress S and defined as R2 = 1-S*S. The "linear fit" is the squared correlation between fitted values and ordination distances. For monoMDS, the "linear fit" and R2 from "stress type 2" are equal.

---

# Checking dimensionality *k*

Fit NMDS solutions for a number of `k`

.row[
.col-6[

``` r
k_vec <- 1:10
stress <- numeric(length(k_vec))
dune_dij <- metaMDSdist(dune, trace = FALSE)
set.seed(25)
for (i in seq_along(k_vec)) {
    sol <- metaMDSiter(dune_dij, k = i,
                       trace = FALSE)
    stress[i] <- sol$stress
}
```

Need to use the helper functions to do this right

``` r
plot(k_vec, stress, type = "b",
     ylab = "Stress", xlab = "Dimensions")
```
]
.col-6[
![](slides_files/figure-html/plot_nmds-scree-1.svg)<!-- -->
]
]

---

# NMDS — Goodness of fit

A goodness of fit statistic `\(g_i\)` can be computed for the observations; defined such that `\(\sum_{i=1}^{n} g_i^2 = S^2\)`

``` r
(g <- goodness(sol))
```

```
##  [1] 0.0008559796 0.0014569292 0.0006071222 0.0010337685 0.0008109554
##  [6] 0.0006932209 0.0009813765 0.0007284320 0.0007316829 0.0009052744
## [11] 0.0004102242 0.0004846332 0.0008688444 0.0011563520 0.0013120348
## [16] 0.0007051008 0.0008226499 0.0005374015 0.0008933316 0.0006139805
```

``` r
sum(g^2)
```

```
## [1] 1.513378e-05
```

``` r
sol$stress^2
```

```
## [1] 1.513378e-05
```

``` r
all.equal(sqrt(sum(g^2)), sol$stress)
```

```
## [1] TRUE
```
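
---

# Comparing NMDS solutions — Procrustes

Not in the original deck — the earlier checklist said to compare lowest-stress solutions with Procrustes rotation; a minimal sketch with vegan's `procrustes()` & `protest()`:

``` r
set.seed(1)
sol1 <- metaMDS(dune, trace = FALSE)
set.seed(2)
sol2 <- metaMDS(dune, trace = FALSE)

pro <- procrustes(sol1, sol2, symmetric = TRUE)
plot(pro)           # arrows show where the two solutions disagree

protest(sol1, sol2) # permutation test of the Procrustes correlation
```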

---

# Supplementary data

If we have other data collected at the same sites, say about the environment, we can investigate relationships between the main components of variation in species composition and those environmental variables

Two main **vegan** functions for this

1. `envfit()` and helpers `vectorfit()` and `factorfit()`
2. `ordisurf()`

`envfit()` fits vectors or planes while `ordisurf()` fits smooth, potentially non-linear surfaces

---

# Vector fitting

`envfit()` fits vectors using a regression

`$$\hat{y_i} = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$$`

where:

* `\(x_{pi}\)` are the site scores for the `\(i\)`<sup>th</sup> site on axis 1, 2, …, `\(p\)`
* Usually `\(p = 2\)` as we're fitting vectors into 2D ordination plots
* `\(y_i\)` is the value of the environment at the `\(i\)`<sup>th</sup> site
* `envfit()` can handle a matrix of environmental variables

*Note no intercept; internally center `\(\boldsymbol{y}\)` and `\(\boldsymbol{x_p}\)`*

For categorical `\(\boldsymbol{y}\)`, `envfit()` finds averages of scores for each level of the category

---

# Vector fitting

.row[
.col-6[
`envfit()` with formula

``` r
set.seed(42)
ev <- envfit(pca ~ ., data = varechem,
             choices = 1:2,
             scaling = "symmetric",
             permutations = 1000)
ev
```

or data frames

``` r
envfit(pca, varechem, choices = 1:2,
       scaling = "symmetric",
       permutations = 1000)
```
]
.col-6[

```
## 
## ***VECTORS
## 
##               PC1      PC2     r2   Pr(>r)    
## N        -0.80597 -0.59196 0.0384 0.676324    
## P         0.78221 -0.62301 0.0335 0.696304    
## K         0.94909 -0.31501 0.0394 0.638362    
## Ca        0.87606  0.48220 0.1607 0.153846    
## Mg        0.83079  0.55658 0.1774 0.121878    
## S         0.86667 -0.49888 0.0044 0.962038    
## Al       -0.64541 -0.76384 0.3168 0.017982 *  
## Fe       -0.56956 -0.82195 0.3536 0.018981 *  
## Mn        0.96802 -0.25088 0.0730 0.439560    
## Zn        0.96899  0.24711 0.0321 0.712288    
## Mo       -0.99233 -0.12362 0.0630 0.491508    
## Baresoil  0.51078  0.85971 0.5280 0.000999 ***
## Humdepth  0.81116  0.58483 0.2937 0.026973 *  
## pH       -0.34311 -0.93929 0.0881 0.333666    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
```
]
]

---

# Vector fitting

.row[
.col-6[
Values in `PC1` etc are *direction cosines* for vectors of unit length

`r2` is the squared correlation, `\(R^2\)`, between `\(\boldsymbol{y}\)` and the projection of the axis scores on to the vector

For factors it is `\(R^2 = 1 - \mathsf{SS_{W}} / \mathsf{SS_{T}}\)`

Permutation test shuffles the `\(\boldsymbol{y}\)` to generate the distribution of `\(R^2\)` under H<sub>0</sub>
]
.col-6[

```
## 
## ***VECTORS
## 
##               PC1      PC2     r2   Pr(>r)    
## N        -0.80597 -0.59196 0.0384 0.676324    
## P         0.78221 -0.62301 0.0335 0.696304    
## K         0.94909 -0.31501 0.0394 0.638362    
## Ca        0.87606  0.48220 0.1607 0.153846    
## Mg        0.83079  0.55658 0.1774 0.121878    
## S         0.86667 -0.49888 0.0044 0.962038    
## Al       -0.64541 -0.76384 0.3168 0.017982 *  
## Fe       -0.56956 -0.82195 0.3536 0.018981 *  
## Mn        0.96802 -0.25088 0.0730 0.439560    
## Zn        0.96899  0.24711 0.0321 0.712288    
## Mo       -0.99233 -0.12362 0.0630 0.491508    
## Baresoil  0.51078  0.85971 0.5280 0.000999 ***
## Humdepth  0.81116  0.58483 0.2937 0.026973 *  
## pH       -0.34311 -0.93929 0.0881 0.333666    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
```
]
]

---

# Vector fitting

.row[
.col-6[

``` r
plot(ev, add = FALSE) # oops bug!
```
]
.col-6[
![](slides_files/figure-html/plot-envfit-1-1.svg)<!-- -->
]
]

---

# Vector fitting

.row[
.col-6[

``` r
plot(pca, display = "sites", type = "n",
     scaling = "symmetric")
points(pca, display = "sites",
       scaling = "symmetric")
plot(ev, add = TRUE)
```
]
.col-6[
![](slides_files/figure-html/plot-envfit-2-1.svg)<!-- -->
]
]

---

# Smooth surfaces

`envfit()` fitted vectors, linear planes, to ordinations.

`ordisurf()` fits smooth surfaces using a GAM via package **mgcv**

`$$\hat{y_i} = f(x_{1i}, x_{2i})$$`

where `\(f()\)` is a *bivariate* smooth function of a pair of axis scores `\(x_{1i}\)` and `\(x_{2i}\)`

`ordisurf()` exposes a lot of functionality from `mgcv::gam()` and the smooths it can fit

---

## Smooth surfaces

Fitting a 10 basis function isotropic surface

.row[
.col-6[

``` r
surf <- ordisurf(pca ~ Ca, data = varechem,
                 knots = 10, isotropic = TRUE,
                 main = NULL)
```

`ordisurf()` plots by default

`surf` contains the fitted `gam()` model
]
.col-6[
![](slides_files/figure-html/ordisurf-1-1.svg)<!-- -->
]
]

---

# Smooth surfaces

``` r
summary(surf)
```

```
## 
## Family: gaussian
## Link function: identity
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   569.66      40.27   14.14 1.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df    F p-value  
## s(x1,x2) 3.517      9 1.34  0.0187 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.344   Deviance explained = 44.4%
## -REML = 158.56  Scale est. = 38924     n = 24
```

---

# Rarefaction

Species richness increases with sample size (effort)

Rarefaction gives the expected number of species rarefied from `\(N\)` to `\(n\)` individuals

`$$\hat{S}_n = \sum_{i=1}^S (1 - q_i) \; \mathsf{where} \; q_i = \frac{\binom{N - x_i}{n}}{\binom{N}{n}}$$`

`\(x_i\)` is the count of species `\(i\)` and `\(\binom{N}{n}\)` is a binomial coefficient — the number of ways to choose `\(n\)` from `\(N\)`
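
---

# Rarefaction — checking the formula

Not in the original deck — a quick sketch evaluating `\(\hat{S}_n\)` by hand with `choose()` for one BCI sample and comparing against vegan's `rarefy()`:

``` r
data(BCI, package = "vegan")

x <- as.numeric(BCI[1, ]) # counts for sample 1
x <- x[x > 0]             # drop absent species
N <- sum(x)               # individuals in the sample
n <- 100                  # rarefy to 100 individuals

## expected richness from the formula
sum(1 - choose(N - x, n) / choose(N, n))

## should agree with vegan
rarefy(x, sample = n)
```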

---

# Rarefaction

.row[
.col-6[

``` r
data(BCI, package = "vegan")
rs <- rowSums(BCI)
quantile(rs)
```

```
##    0%   25%   50%   75%  100% 
## 340.0 409.0 428.0 443.5 601.0
```

``` r
Srar <- rarefy(BCI, min(rs))
head(Srar)
```

```
##        1        2        3        4        5        6 
## 84.33992 76.53165 79.11504 82.46571 86.90901 78.50953
```

``` r
rarecurve(BCI, sample = min(rs))
```
]
.col-6[
![](slides_files/figure-html/rarefy-1.svg)<!-- -->
]
]

---

# Rarefaction

With rarefaction we can be accused of throwing away data

Yet we need to do something about the often large differences in variances of species and samples owing to sampling effort

---

# High-throughput data

Revolutions in biology & biotechnology have led to exponential increases in our capacity to generate data arising from the counting of biological molecules

* DNA sequencing,
* RNA-Seq — sequence RNA molecules in populations of cells or tissues,
* ChIP-Seq — sequence DNA molecules that are bound to particular proteins,
* …

Relatively cheaply, today we can generate data sets with thousands of variables on tens to hundreds of samples

---

# High-throughput data

Counts of such molecules present a statistical challenge

* the counts typically have a large *dynamic range*, covering many orders of magnitude
* over this very large range we observe changes in both variance (spread about the mean) and also in distribution of the data — **heterogeneity**
* like other count data, observations are integers and distributions are skewed
* the biotech methodology imparts systematic sampling biases into the data that we need to account for in an analysis — typically called **normalization** in this sub-field

---

# Sequence depth & size factors

The number of reads for each sample is a kind of *effort* variable

All else equal, the more reads we generate the more species (OTUs, ASVs, etc) we would expect to identify

If the number of reads differs between samples (libraries) then, all else being equal, we might assume the counts in different samples are proportional to one another, following some proportionality factor `\(s\)`

A simple estimate for `\(s\)` might be the total reads per sample

But we can do better than this …

---

# Sequence depth & size factors

.row[
.col-8[
A small dataset of 5 genes in 2 samples

Two views:

1. estimate `\(s_j\)` as `\(\sum_{i=1}^{m} x_{ij}\)`, blue line is the ratio of the `\(s_j\)`
2. instead, estimate `\(s_j\)` such that their ratio is the red line

In 1 we would conclude C is downregulated & A, B, D, & E are upregulated

2 is more parsimonious

`DESeq2::estimateSizeFactorsForMatrix()`
]
.col-4[
![](resources/chap7-rnaseq-normalization-1.png)<!-- -->

.smaller[Holmes & Huber (2019) Modern Statistics for Modern Biology]
]
]
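
---

# Size factors — a sketch

Not in the original deck — a minimal, hypothetical example of median-of-ratios size factors; the counts are invented so that every gene except `geneC` simply doubles between the two samples.

``` r
## genes in rows, samples in columns (DESeq2's convention)
counts <- matrix(c( 4,  8,
                   10, 20,
                   20, 42,
                    8, 16,
                    6, 12),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", LETTERS[1:5]),
                                 c("s1", "s2")))

## median-of-ratios size factors (requires DESeq2 from Bioconductor)
DESeq2::estimateSizeFactorsForMatrix(counts)

## naive library-size estimate, for comparison
colSums(counts) / mean(colSums(counts))
```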

---

# Variance stabilizing transformations

The square-root transformation is known as the _variance stabilizing transformation_ for the Poisson distribution

Taking the square root of observations that are distributed Poisson leads to *homoscedasticity*

Can construct variance stabilizing transformations for other distributions

High-throughput count data typically show extra-Poisson variation

`DESeq2::vst()`

---

# Regularized log transform

This transforms counts to a log<sub>2</sub>-like scale via a simple model

`$$\log_2(q_{ij}) = \sum_k x_{jk} \beta_{ik}$$`

where `\(q_{ij}\)` are the transformed data and `\(x_{jk}\)` is a particular design matrix with a dummy variable for the `\(j\)`th sample (for `\(i = 1, 2, \dots, m\)` variables)

Priors on the `\(\beta_{ik}\)` make the model identifiable

The _rlog_ transformation is ~ variance stabilizing, but handles highly varying size factors across samples

`DESeq2::rlog()`

---

# Transformations

Comparison of three transformations

![](resources/chap7-rnaseq-meansd-1.png)<!-- -->

.smaller[Holmes & Huber (2019) Modern Statistics for Modern Biology]

???

Figure 8.12: Per-gene standard deviation (sd, taken across samples) against the rank of the mean, for the shifted logarithm, the variance-stabilizing transformation (vst) and the rlog. Note that for the leftmost 2,500 genes, the counts are all zero, and hence their standard deviation is zero. The mean-sd dependence becomes more interesting for genes with non-zero counts. Note also the high value of the standard deviation for genes that are weakly detected (but not with all zero counts) when the shifted logarithm is used, and compare to the relatively flat shape of the mean-sd relationship for the variance-stabilizing transformation.

---

# Implications

Size factors can be used in GLMs etc to normalize samples via an `offset()` term in the formula

For ordination we can't easily use the size factors

Need to use _vst_ or _rlog_ instead

_rlog_ creates data that can't be used in CCA, so if you use it, RDA or db-RDA are the ordination methods to use
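
---

# VST then ordination — a sketch

Not in the original deck — a minimal sketch, assuming `counts` is a realistically sized genes-by-samples integer matrix and **DESeq2** is installed:

``` r
## variance-stabilise the counts; DESeq2::vst() is a faster,
## subsampling-based version for large matrices
vsd <- DESeq2::varianceStabilizingTransformation(counts)

## vegan wants samples in rows, so transpose before ordinating
ord <- rda(t(vsd))
```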

---

# Practicalities

---

# Rarefied counts

Can take a random draw of `\(n^{\prime}\)` out of the `\(N\)` individuals in a sample in proportion to species abundances

This yields rarefied counts which some people then go on to analyse

Be careful though as this is a random process — what if you took a different random sample?

---

# Rarefied counts

`avgdist()` in *vegan* tries to get around this by

1. taking a random draw to get rarefied counts
2. computing the dissimilarity between samples on the basis of the rarefied counts

Repeat that many times and then, as your dissimilarity `\(d_{ij}\)`, take the average of the many `\(d_{ij}^*\)` values you generated above — see the sketch on the next slide

See also the help page for other suggestions `?avgdist`
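
---

# Averaged rarefied dissimilarities — a sketch

Not in the original deck — a minimal example of `avgdist()` on the BCI data, rarefying to the smallest sample total:

``` r
data(BCI, package = "vegan")

set.seed(42) # the rarefaction draws are random
avg_dij <- avgdist(BCI, sample = min(rowSums(BCI)),
                   iterations = 50, dmethod = "bray")
```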

---
class: inverse middle center big-subsection

# plotting

---

# Better plotting

Ordination diagrams are often messy — the `plot()` methods are designed to get a quick plot of results

To produce better plots you need to know some base graphics skills *and* make use of some *vegan* helpers

---

# Points and labels

`ordipointlabel()` can draw points and labels

.row[
.col-6[

``` r
set.seed(10)
ordipointlabel(pca, display = "sites",
               scaling = "symmetric")
```

Iteratively finds space to draw labels so they don't overplot
]
.col-6[
![](slides_files/figure-html/ordipointlabel-1-1.svg)<!-- -->
]
]

---

# Points and labels

`ordipointlabel()` can also add to an existing plot

.row[
.col-6[

``` r
plot(pca, display = "sites",
     scaling = "symmetric", type = "n")
points(pca, display = "sites",
       scaling = "symmetric",
       pch = 19, col = "#025196")
set.seed(10)
ordipointlabel(pca, display = "sites",
               scaling = "symmetric",
               add = TRUE)
```
]
.col-6[
![](slides_files/figure-html/ordipointlabel-2-1.svg)<!-- -->
]
]

---

# Points and labels

`ordipointlabel()` can also add to an existing plot

.row[
.col-6[

``` r
disp <- "species"
scl <- "symmetric"
plot(pca, display = disp, scaling = scl,
     type = "n")
points(pca, display = disp, scaling = scl,
       pch = 19, col = "#025196")
set.seed(10)
ordipointlabel(pca, display = disp,
               scaling = scl, add = TRUE)
```
]
.col-6[
![](slides_files/figure-html/ordipointlabel-3-1.svg)<!-- -->
]
]

---

# Points and labels

How successful `ordipointlabel()` is depends on how much you plot & how big you plot it

.row[
.col-6[

``` r
disp <- c("sites", "species")
scl <- "symmetric"
plot(pca, display = disp, scaling = scl,
     type = "n")
points(pca, display = disp[1], scaling = scl,
       pch = 19, col = "#025196")
points(pca, display = disp[2], scaling = scl,
       pch = 19, col = "#fdb338")
set.seed(10)
ordipointlabel(pca, display = disp,
               scaling = scl, add = TRUE,
               col = c(1,1), cex = c(0.7, 0.7))
```
]
.col-6[
![](slides_files/figure-html/ordipointlabel-4-1.svg)<!-- -->
]
]

---

# Building up by layers

With base graphics you are in control of *everything*

.row[
.col-6[

``` r
data(dune, dune.env)
col_vec <- c("red", "blue", "orange", "grey")
disp <- "sites"
scl <- "symmetric"

ord <- rda(decostand(dune, method = "hellinger"))
plot(ord, type = "n", scaling = scl,
     display = disp)
cols <- with(dune.env, col_vec[Management])
points(ord, display = disp, scaling = scl,
       pch = 19, col = cols, cex = 2)
lvl <- with(dune.env, levels(Management))
legend("topright", legend = lvl, bty = "n",
       col = col_vec, pch = 19)
```
]
.col-6[
![](slides_files/figure-html/layer-plot-1-1.svg)<!-- -->
]
]

---

# Other utilities — `ordihull()`

Convex hulls around groups of data

.row[
.col-6[

``` r
disp <- "sites"
scl <- "symmetric"

plot(ord, type = "n", scaling = scl,
     display = disp)
ordihull(ord, groups = dune.env$Management,
         col = col_vec, scaling = scl, lwd = 2)
ordispider(ord, groups = dune.env$Management,
           col = col_vec, scaling = scl,
           label = TRUE)
points(ord, display = disp, scaling = scl,
       pch = 21, col = "red", bg = "yellow")
```
]
.col-6[
![](slides_files/figure-html/hull-1-1.svg)<!-- -->
]
]

---

# Other utilities — `ordiellipse()`

Draws ellipsoid hulls & standard error & deviation ellipses

.row[
.col-6[

``` r
disp <- "sites"
scl <- "symmetric"

plot(ord, type = "n", scaling = scl,
     display = disp)
## ellipsoid hull
ordiellipse(ord, groups = dune.env$Management,
            kind = "ehull", col = col_vec,
            scaling = scl, lwd = 2)
## standard error of centroid ellipse
ordiellipse(ord, groups = dune.env$Management,
            draw = "polygon", col = col_vec,
            scaling = scl, lwd = 2)
ordispider(ord, groups = dune.env$Management,
           col = col_vec, scaling = scl,
           label = TRUE)
points(ord, display = disp, scaling = scl,
       pch = 21, col = "red", bg = "yellow")
```
]
.col-6[
![](slides_files/figure-html/ellipse-1-1.svg)<!-- -->
]
]

---

# Any love for *ggplot*?

The **ggvegan** package is in development

.row[
.col-6[

``` r
## library('remotes')
## install_github("gavinsimpson/ggvegan")
library('ggvegan')
library('ggplot2')
autoplot(ord)
```
]
.col-6[
![](slides_files/figure-html/ggvegan-1.svg)<!-- -->
]
]

---

# Any love for *ggplot*?

The **ggvegan** package is in development

``` r
# Install src from github
remotes::install_github("gavinsimpson/ggvegan")

# or binary from R-universe:
# Enable repository from gavinsimpson
options(repos = c(
    gavinsimpson = 'https://gavinsimpson.r-universe.dev',
    CRAN = 'https://cloud.r-project.org'))
# Download and install ggvegan in R
install.packages('ggvegan')
```

---

# ggvegan

.row[
.col-6[

``` r
ford <- fortify(ord, axes = 1:2)
## not yet a tibble
head(ford, 4)
```

```
##     score    label         PC1         PC2
## 1 species Achimill  0.22413552 -0.05704791
## 2 species Agrostol -0.41424457  0.18006829
## 3 species Airaprae  0.03506944 -0.15464367
## 4 species Alopgeni -0.13865500  0.31844362
```

``` r
library('dplyr') # for filter() & the pipe
blue <- "#025196"
filter(ford, score == "sites") %>%
    ggplot(aes(x = PC1, y = PC2)) +
    geom_hline(yintercept = 0, colour = blue) +
    geom_vline(xintercept = 0, colour = blue) +
    geom_point() +
    coord_fixed()
```
]
.col-6[
![](slides_files/figure-html/ggvegan-2-1.svg)<!-- -->
]
]