Abstract

This practical will use the PONDS dataset to demonstrate methods of indirect gradient analysis (PCA, CA, and NMDS) of species and environmental data. The file pondsenv.csv contains the species data (48 taxa) and pondsenv.csv contains the transformed environmental variables (15 variables) for 30 sites. You will use vegan to analyse these data using a variety of indirect ordination graphical display techniques.

## Loading required package: nlme

## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.

Principal Components Analysis

Principal Components Analysis (PCA) is a common statistical methods and is implemented in R by two functions in the standard installation (prcomp() and princomp()), as well as in numerous guises as part of other add-on packages. In this practical you will use the implimentation provided in the vegan package by Jari Oksanen. This is because you will be using vegan for the direct gradient analysis class and the vegan package provides a rich set of analysis and graphical tools that can be applied to a variety of ordination techniques. As such, you only have to learn a single set of commands to run a wide variety of analyses.

Start R and load the vegan package for use. Read in the two data sets:

library("vegan")

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-7

url_env <- "https://bit.ly/pondsenv"
pondsenv <- read.csv(url(url_env))

url_diat <- "https://bit.ly/pondsdiat"
pondsdiat <- read.csv(url(url_diat))

PCA is fitted using the rda() function, which was designed to implement Redundancy Analysis (RDA). RDA is the constrained form of PCA so running rda() without any constraints yields PCA. Run a PCA of the Ponds environmental data using rda() and display the results. The scale argument to rda() scales the variables to zero mean and unit standard deviation. This results in a PCA on a correlation rather than covariance matrix. This is appropriate in situations where the variables are measured in different units, as is the case with the hydrochemical data analysed here. It may also be appropriate in situations where species abundance (counts etc.) are being analysed and you wish to focus on explaining variation in all species not just the most abundant ones.

pondspca <- rda(pondsenv, scale = TRUE)
pondspca

## Call: rda(X = pondsenv, scale = TRUE)
## 
##               Inertia Rank
## Total              15     
## Unconstrained      15   15
## Inertia is correlations 
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10  PC11  PC12  PC13 
## 4.964 3.094 2.329 1.428 0.772 0.626 0.534 0.398 0.309 0.293 0.128 0.069 0.029 
##  PC14  PC15 
## 0.019 0.010

The ouput shows the results of the PCA, displaying the eigenvalues for each axis. Note that these values differ from the ones report by CANOCO, which scales the total variance (called inertia in vegan) to 1, rda() does not. To achieve a comparable set of eigenvalues simply divide each eigenvalue by the total inertia. This conveniently, is what the summary() method for eigenvals() does:

summary(eigenvals(pondspca))

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Eigenvalue            4.9639 3.0937 2.3289 1.42778 0.77180 0.62609 0.53372
## Proportion Explained  0.3309 0.2062 0.1553 0.09519 0.05145 0.04174 0.03558
## Cumulative Proportion 0.3309 0.5372 0.6924 0.78763 0.83908 0.88082 0.91640
##                           PC8    PC9    PC10     PC11     PC12     PC13
## Eigenvalue            0.39793 0.3090 0.29284 0.127767 0.069340 0.028661
## Proportion Explained  0.02653 0.0206 0.01952 0.008518 0.004623 0.001911
## Cumulative Proportion 0.94293 0.9635 0.98305 0.991570 0.996193 0.998103
##                           PC14     PC15
## Eigenvalue            0.018654 0.009794
## Proportion Explained  0.001244 0.000653
## Cumulative Proportion 0.999347 1.000000

What are the values of \(\lambda_1\) and \(\lambda_2\), the eigenvalues for axes one and two?

How much of the total variance in the environmental data is explained by axes one and two individually and cummulatively?

For a longer print out of the results of the PCA and to display the species and site scores a summary() method is available. This truncates the output to the first six axes but this can be changed using the axes argument.

# output not show as it is large!
summary(pondspca, scaling = "species")

The use of the scaling argument is redundant in this case as the default is scaling = "species" but it is included here to illustrate a common argument that can be specified by the user.

The meaningful components

To assess the likely importance of the axes, it is useful to both plot a scree plot and to compare the sizes of the actual PCA axes with the sizes expected under a random (null) model, such as the broken stick distribution. A scree plot of the results of a PCA can be produced using the screeplot() method. The broken-stick distribution is simple to calculate and the expected values of the pieces of the broken stick are given by

\[E_j=\frac{1}{n}\sum_{x=j}^n \frac{1}{x}\]

where \(E_j\) is the expected value of the \(j^{th}\) piece, and \(n\) is the number of pieces (axes in this case). We can generate the null hypothesis values using the bstick argument.

screeplot(pondspca, type = "lines", bstick = TRUE)

How many axes does the scree plot suggest are significant?

Compare the variances of the pieces expected under the broken-stick model with the eigenvalues obtained from the results of the PCA of the Ponds environmental data. The broken-stick distribution represents the null model, so significant axes are those that have an eigenvalue that exceeds the expected value for the \(j^{th}\) piece of the broken-stick distribution.

How many axes does the broken-stick distribution suggest are significant?

PCA Biplots

Two types of biplot may be used to visualise the results of a PCA; a distance biplot and a correlation biplot. In this analysis of the Ponds environmental data, the focus is on the correlations between the environmental variables (“species” in common parlance, even though environmental data are being analysed!), therefore, a correlation biplot is appropriate. The type of biplot produced is determined by the scaling applied to one or both of the species and site scores. For a correlation biplot, scaling "species" is used, in which the scores for the \(k^{th}\) species (the \(k^{th}\) eigenvector) are scaled to length \(\sqrt{\lambda_k}\) [^The scores for the \(k^{th}\) are then proportional to the square root of the \(k^{th}\) eigenvalue.], whilst the sites are left unscaled. Biplots are easily obtained using the plot() method of rda().

plot(pondspca, scaling = "species")

Species are traditionally represented by biplot arrows, which we can do for rda() objects using the biplot() method

biplot(pondspca, scaling = "species")

What are the main chemical gradients represented by axes one and two?

Are there any outliers sites on axes one or two?

Interpretting ordination diagrams can be improved by enhancing the plot with additional information, commonly in the form of response surfaces. Function ordisurf() generates response surfaces using a generalized additive model (GAM) to predict a response variable for the ordination configuration. The irregular configuration is then interpolated to a regular grid using linear interpolation routines.

plot(pondspca, scaling = "species", display = "sites")
ordisurf(pondspca ~ TP, data = pondsenv, add = TRUE)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Estimated degrees of freedom:
## 2.3  total = 3.3 
## 
## REML score: 10.03019

Firstly, the biplot is redisplayed with the species suppressed (display = "sites"), then a response surface for the variables TP (Total Phosphorus) produced using ordisurf(). The argument add = TRUE is used to add the response surface to the existing plot without clearing the graphics device first.

Make response surface plots for selected variables

Which variables respond non-linearly across the ordination?

Correspondence Analysis

Species often show unimodal responses to environmental gradients. PCA assumes a linear response model and as such may not be best suited to the analysis of species data exhibiting unimodal responses as PCA is unlikely to fit the data adequately. Correspondence Analysis (CA) is an indirect ordination technique that assumes an idealised unimodal response in species. The response is idealised in that it assumes that species responses are symmetrical, of equal height and width and are equally spaced.

A CA is performed using function cca(), also in package vegan. CA can produced unstable results when there are rare species or odd samples in the data set, therefore, rare species tend to be downweighted. Function downweight() provides this capability, replicating the behaviour of CANOCO.

pondsca <- cca(downweight(pondsdiat))
pondsca

## Call: cca(X = downweight(pondsdiat))
## 
##               Inertia Rank
## Total           4.996     
## Unconstrained   4.996   29
## Inertia is scaled Chi-square 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
## 0.6728 0.5162 0.4199 0.3619 0.3408 0.3254 0.2913 0.2663 
## (Showing 8 of 29 unconstrained eigenvalues)

summary(eigenvals(pondsca))

## Importance of components:
##                          CA1    CA2     CA3     CA4     CA5     CA6    CA7
## Eigenvalue            0.6728 0.5162 0.41994 0.36191 0.34085 0.32535 0.2913
## Proportion Explained  0.1347 0.1033 0.08406 0.07244 0.06823 0.06513 0.0583
## Cumulative Proportion 0.1347 0.2380 0.32206 0.39450 0.46273 0.52785 0.5862
##                           CA8     CA9    CA10    CA11   CA12    CA13    CA14
## Eigenvalue            0.26632 0.22195 0.21172 0.20888 0.1659 0.14270 0.13265
## Proportion Explained  0.05331 0.04443 0.04238 0.04181 0.0332 0.02856 0.02655
## Cumulative Proportion 0.63946 0.68389 0.72627 0.76808 0.8013 0.82985 0.85640
##                          CA15    CA16    CA17    CA18    CA19    CA20    CA21
## Eigenvalue            0.11605 0.10483 0.08433 0.07646 0.06781 0.06392 0.05188
## Proportion Explained  0.02323 0.02098 0.01688 0.01530 0.01357 0.01280 0.01038
## Cumulative Proportion 0.87963 0.90061 0.91749 0.93280 0.94637 0.95917 0.96955
##                           CA22     CA23     CA24     CA25     CA26    CA27
## Eigenvalue            0.035947 0.034768 0.023473 0.019797 0.014481 0.01244
## Proportion Explained  0.007195 0.006959 0.004699 0.003963 0.002899 0.00249
## Cumulative Proportion 0.976747 0.983707 0.988406 0.992368 0.995267 0.99776
##                           CA28      CA29
## Eigenvalue            0.009305 0.0019015
## Proportion Explained  0.001863 0.0003806
## Cumulative Proportion 0.999619 1.0000000

What are the values of \(\lambda_{1-4}\), the eigenvalues for axes one to four?

How much of the total variance (inertia) in the species data is explained by axes 1 and 2, individually and combined?

The meaningful components

As with PCA, a scree plot is a useful way of visualising the important components for further analysis. Comprison of the eignevalues of the CA with those expected under the broken-stick model is also useful.

screeplot(pondsca, bstick = TRUE, type = "lines")

How many axes are iportant when compared with the null model?

CA Biplots

Biplots, are generated by plotting the site and species scores obtained from CA of the species matrix. As with PCA, these scores can be scaled to focus the resulting diagram on relationships among sites (scaling = "sites") or species (scaling = "species) or some compromise of the two (scaling = "symmetric"). So-called biplot and Hill’s scaling are the two types of scaling that can be applied to three scalings mentioned above. These two types dictate how information on the species data is gleaned from the resulting biplot. With biplot scaling, the biplot rule applies and is most suited for short gradients. Hill’s scaling equalises the average niche breadth for all axes and allows, for long gradients, interpretation via the distance rule. The plot() method for cca() produces a biplot of the results of a CA

plot(pondsca)

Are there outliers in the plot (species or samples)?

Is there an arch apparent in the biplot?

It is useful to enhance these plots with further information to aid interpretation. One useful addition is to overlay measures of diversity on to the biplot. Diversity indices can be calculated using the diversity() function of vegan. One particular measure of diveristy is Hill’s \(\mathrm{N_2}\), which gives the effective number of occurrences of a species across all sites or the effective number of species in an individual plot. Hill’s \(\mathrm{N_2}\) is equivalent to the Inverse Simpson diversity measure we tell diversity() to use. The third argument is what Jari Oksanen refers to as the MARGIN; rows (1) or columns (2). For the effective number of occurences of a particular species then the calculation should be over the columns, but rows should be used to calculate the effective numbers of species per sample.

spp.n2 <- renyi(t(pondsdiat), scales = 2, hill = TRUE)
site.n2 <- renyi(pondsdiat, scales = 2, hill = TRUE)

Having calculated the relevant diversity information, this can be used to augment the biplot. A simple way to use these data is to scale the plotting symbol according to the Hill’s \(\mathrm{N_2}\) value, using the cex graphical parameter. The identify function is used to label the plotting symbols after plotting.

ca.plot <- plot(pondsca, type = "n")
points(pondsca, display = "species", cex = 0.3 * spp.n2)
identify(ca.plot, what = "species", col = "red", ps = 10)
plot(pondsca, type = "n")
points(pondsca, display = "sites", cex = 0.5 * site.n2)
identify(ca.plot, what = "sites", col = "red", ps = 10)

Are outlying species common or rare?

Are outlying samples dominated by a few, rare, species?

oldpar <- par(mfrow = c(1,2), pty = "s")
plot(pondsca, scaling = "species",
    main = "Inter-species distance & Biplot scaling")
plot(pondsca, scaling = "sites", hill = TRUE,
    main = "Inter-sample distance & Hills scaling")

par(oldpar)

What effect does the choice of scaling have on the ordination plots? Use the code below to display the biplots with two different scalings.

Non-metric multidimensional scaling

Non-metric multidimensional scaling can be performed using metaMDS(). Function metaMDS() in vegan implements helper functions that start the NMDS iterative algorithm at \(k\) randomly selected start points and choose the best model fit (i.e. that reduces the stress the most). metaMDS() only requires a matrix of data as the function internally calculates the specified dissimilarity matrix for you. We will fit a NMDS model to the Ponds environmental data using Euclidean distances.

library(MASS)
set.seed(123456)
euclid.dis <- vegdist(pondsenv, "euclidean")
nmds.env <- metaMDS(pondsenv, distance = "euclidean", trymax = 50)

## 'comm' has negative data: 'autotransform', 'noshare' and 'wascores' set to FALSE

## Run 0 stress 0.1739185 
## Run 1 stress 0.2151342 
## Run 2 stress 0.1739173 
## ... New best solution
## ... Procrustes: rmse 0.0005701495  max resid 0.002644136 
## ... Similar to previous best
## Run 3 stress 0.1739184 
## ... Procrustes: rmse 0.0004964127  max resid 0.002297515 
## ... Similar to previous best
## Run 4 stress 0.1785826 
## Run 5 stress 0.2199966 
## Run 6 stress 0.188833 
## Run 7 stress 0.1888522 
## Run 8 stress 0.1888292 
## Run 9 stress 0.1785827 
## Run 10 stress 0.1739184 
## ... Procrustes: rmse 0.001842324  max resid 0.008514986 
## ... Similar to previous best
## Run 11 stress 0.1785824 
## Run 12 stress 0.1888298 
## Run 13 stress 0.2148062 
## Run 14 stress 0.1739171 
## ... New best solution
## ... Procrustes: rmse 0.0001568914  max resid 0.0007246686 
## ... Similar to previous best
## Run 15 stress 0.2400498 
## Run 16 stress 0.173918 
## ... Procrustes: rmse 0.0005095874  max resid 0.002357942 
## ... Similar to previous best
## Run 17 stress 0.173918 
## ... Procrustes: rmse 0.001562874  max resid 0.00722221 
## ... Similar to previous best
## Run 18 stress 0.1739171 
## ... Procrustes: rmse 0.001092042  max resid 0.005046018 
## ... Similar to previous best
## Run 19 stress 0.2147383 
## Run 20 stress 0.1739186 
## ... Procrustes: rmse 0.001825222  max resid 0.008436809 
## ... Similar to previous best
## *** Best solution repeated 5 times

nmds.env

## 
## Call:
## metaMDS(comm = pondsenv, distance = "euclidean", trymax = 50) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     pondsenv 
## Distance: euclidean 
## 
## Dimensions: 2 
## Stress:     0.1739171 
## Stress type 1, weak ties
## Best solution was repeated 5 times in 20 tries
## The best solution was from try 14 (random start)
## Scaling: centring, PC rotation 
## Species: scores missing

NMDS maps the observed distances on to the ordination space in a non-linear fashion. How well this mapping is achieved can be visualised using stressplot(), which draws a Shepard plot and the fit of the NMDS as a stepped line. stressplot() also displays two correlation statistics for the goodness of the fit. The correlation based on stress is \(R^2=1-S^2\), and the ``fit-based’’ correlation is the correlation between the fitted values, \(\theta(d)\) and the original distances, \(d\), which is the correlation between the stepped line and the points.

stressplot(nmds.env, euclid.dis)

To draw the ordination diagram for the NMDS model, vegan provides a plot() method:

plot(nmds.env, type = "text")

## species scores not available

Comparing ordinations using Procrustes rotation

Two ordinations can be very similar but this similarity may be masked as a result of the two ordinations having different scalings, orientations and signs. Procrustes rotation is a good way of of comparing ordination configurations. vegan has function procrustes(), which performs Procrustes rotation.

pondsenv.pro <- procrustes(nmds.env, pondspca, symmetric = TRUE)
summary(pondsenv.pro)

## 
## Call:
## procrustes(X = nmds.env, Y = pondspca, symmetric = TRUE) 
## 
## Number of objects: 30    Number of dimensions: 2 
## 
## Procrustes sum of squares:  
##  0.2453721 
## Procrustes root mean squared error: 
##  0.09043821 
## Quantiles of Procrustes errors:
##         Min          1Q      Median          3Q         Max 
## 0.007707713 0.027312647 0.050119512 0.078506790 0.358473373 
## 
## Rotation matrix:
##            [,1]       [,2]
## [1,] -0.5073434  0.8617440
## [2,] -0.8617440 -0.5073434
## 
## Translation of averages:
##              [,1]          [,2]
## [1,] -2.20516e-18 -6.468687e-18
## 
## Scaling of target:
## [1] 0.8686932

The plot() method for procrustes() can produce two kinds of plots; 1) the ordination digram showing the comparison between the two configurations, and 2) a residuals plot

par(mfrow = c(1,2))
plot(pondsenv.pro, kind = "1")
plot(pondsenv.pro, kind = "2")

par(mfrow = c(1,1))

The protest() method allows you to test whether the two configurations are significantly similar to one another by means of a permutation test.

set.seed(123456)
pondsenv.prot <- protest(nmds.env, pondspca)
pondsenv.prot

## 
## Call:
## protest(X = nmds.env, Y = pondspca) 
## 
## Procrustes Sum of Squares (m12 squared):        0.2454 
## Correlation in a symmetric Procrustes rotation: 0.8687 
## Significance:  0.001 
## 
## Permutation: free
## Number of permutations: 999

Birds Data

This data set is on bird assemblages in a montane forest in the Velka Fatra Mountains (Slovak Republic).

library("readr")
library("tidyr")
library("dplyr")
library("tibble")

We can load the bird data and process it into two data sets

the species abundances, and
the associated environmental variables.

birds_url <- "https://bit.ly/maed-birds"

birds <- read_csv(birds_url, na = "", skip = 1)

## New names:
## Rows: 43 Columns: 51
## ── Column specification
## ──────────────────────────────────────────────────────────────── Delimiter: "," chr
## (4): ...1, ...40, Rocks, Expos dbl (46): AegiCaud, AlauArve, AnthTriv, CertFami,
## ColuPalu, CorvCora, CucuCa... lgl (1): ...39
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the
## column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...39`
## • `` -> `...40`

birds_spp <- birds %>%
    mutate(across(AegiCaud:OriOri, ~ replace_na(.x, 0))) %>%
    column_to_rownames("...1")
birds_spp <- birds_spp[, 1:36]

birds_env <- birds %>%
    select(c("...1", Altit:Expos)) %>%
    mutate(Rocks = factor(Rocks), Expos = factor(Expos)) %>%
    column_to_rownames("...1")

The aim of the analysis is to find a good ordination of the species data (choose your ordination method, transformation, dissimilarity, etc according to what features of the species data you want to focus on). Then, having found a good ordination, use the environmental data to try to explain the ordination axes or mapping you have created.

The variables are:

Altitude
Forest cover
Forest density
Amount of broad leaf trees
E2 layer (scrub layer cover)
E2 layer conifers
E1 layer (herb layer cover)
E1 height
Slope
Presence of rocks
Exposure

Unconstrained ordination

Gavin Simpson

August 6th, 2024

Abstract

Principal Components Analysis

The meaningful components

PCA Biplots

Correspondence Analysis

The meaningful components

CA Biplots

Non-metric multidimensional scaling

Comparing ordinations using Procrustes rotation

Birds Data