class: inverse, middle, left, my-title-slide, title-slide
.title[
# Other stuff
]
.author[
### Gavin L. Simpson
]
.date[
### August 8th, 2024
]

---
class: inverse middle center big-subsection

# Welcome

---

# Today's topics

* PERMANOVA
* Distance-based RDA
* Diagnostics
* High-throughput count data

---
class: inverse middle center big-subsection

# PERMANOVA

---

# MANOVA

MANOVA is the multivariate form of ANOVA

* Multivariate response data
* Categorical predictor variables

Decompose variation in the responses into

1. variation within groups
2. variation between groups

Test to see if 2. is unusually large relative to H<sub>0</sub>

---

# PERMANOVA

Doing that test requires lots of assumptions that rarely hold for ecological data

PERMANOVA: Permutational multivariate analysis of variance

Avoids most of these issues through the use of permutation tests

Directly decomposes a dissimilarity matrix into

1. variation within groups
2. variation between groups

---

# PERMANOVA *sensu stricto*

*vegan* has four different ways to do essentially this kind of analysis

1. `adonis()` — implements Anderson (2001)
2. `adonis2()` — implements McArdle & Anderson (2001)
3. `dbrda()` — implementation based on McArdle & Anderson (2001)
4. `capscale()` — implements Legendre & Anderson (1999)

Be careful with `adonis()` as it allows only sequential tests

A difference between the functions is how they treat negative eigenvalues

---

# The PERMANOVA idea

.center[
![](slides_files/figure-html/permanova-idea-plot-1.svg)<!-- -->
]

---

# PERMANOVA — `adonis2()`

```r
data(dune, dune.env)
adonis2(dune ~ Management * A1, data = dune.env, by = "terms")
```

```
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dune ~ Management * A1, data = dune.env, by = "terms")
##               Df SumOfSqs      R2      F Pr(>F)   
## Management     3   1.4686 0.34161 3.2629  0.002 **
## A1             1   0.4409 0.10256 2.9387  0.023 * 
## Management:A1  3   0.5892 0.13705 1.3090  0.190   
## Residual      12   1.8004 0.41878                 
## Total         19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# PERMANOVA — `adonis2()`

```r
data(dune, dune.env)
adonis2(dune ~ A1 * Management, data = dune.env, by = "terms")
```

```
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dune ~ A1 * Management, data = dune.env, by = "terms")
##               Df SumOfSqs      R2      F Pr(>F)    
## A1             1   0.7230 0.16817 4.8187  0.001 ***
## Management     3   1.1865 0.27600 2.6362  0.005 ** 
## A1:Management  3   0.5892 0.13705 1.3090  0.219    
## Residual      12   1.8004 0.41878                  
## Total         19   4.2990 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
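---

# PERMANOVA — `adonis2()`

With `by = "terms"` the tests are sequential, so the order of terms matters (compare the two previous slides)

For a single overall test of the whole model, where term order is irrelevant, we can set `by = NULL`. A sketch (output not shown):

```r
## omnibus test of all terms together
adonis2(dune ~ Management * A1, data = dune.env, by = NULL)
```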
---

# PERMANOVA — `adonis2()`

```r
data(dune, dune.env)
adonis2(dune ~ Management * A1, data = dune.env, by = "margin")
```

```
## Permutation test for adonis under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dune ~ Management * A1, data = dune.env, by = "margin")
##               Df SumOfSqs      R2     F Pr(>F)
## Management:A1  3   0.5892 0.13705 1.309  0.221
## Residual      12   1.8004 0.41878            
## Total         19   4.2990 1.00000            
```

--

The interaction is the only term that isn't *marginal* to other terms; not significant

---

# PERMANOVA — `adonis2()`

```r
adonis2(dune ~ Management + A1, data = dune.env, by = "margin")
```

```
## Permutation test for adonis under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dune ~ Management + A1, data = dune.env, by = "margin")
##            Df SumOfSqs      R2      F Pr(>F)   
## Management  3   1.1865 0.27600 2.4828  0.004 **
## A1          1   0.4409 0.10256 2.7676  0.031 * 
## Residual   15   2.3895 0.55583                 
## Total      19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# The dispersion problem

Anderson (2001) noted that PERMANOVA could confound *location* & *dispersion* effects

If one or more groups are more variable — more dispersed around the centroid — than the others, this can result in a false detection of a difference of means — a *location* effect

The same problem affects *t* tests and ANOVA

Warton *et al.* (2012); Anderson & Walsh (2013); Anderson *et al.* (2017)

---

# Dispersion

.center[
![](slides_files/figure-html/permanova-idea-plot-1.svg)<!-- -->
]

---

# Test for dispersion effects

Marti Anderson (2006) developed a test for multivariate dispersions — PERMDISP2

1. Calculate how far each observation is from its group median (or centroid)
2. Take the absolute values of these distances-to-medians
3. Do an ANOVA on the absolute distances with the *groups* as covariates
4. Test the H<sub>0</sub> of equal absolute distances to median among groups using a permutation test

In *vegan* this is `betadisper()`

---

# Test for dispersion effects

.row[
.col-6[
.not-so-small[

```r
data(varespec)
dis <- vegdist(varespec) # Bray-Curtis distances

## First 16 sites grazed, remaining 8 sites ungrazed
groups <- factor(c(rep(1, 16), rep(2, 8)),
    labels = c("grazed", "ungrazed"))

mod <- betadisper(dis, groups)
mod
```

```
## 
## Homogeneity of multivariate dispersions
## 
## Call: betadisper(d = dis, group = groups)
## 
## No. of Positive Eigenvalues: 15
## No. of Negative Eigenvalues: 8
## 
## Average distance to median:
##   grazed ungrazed 
##   0.3926   0.2706 
## 
## Eigenvalues for PCoA axes:
## (Showing 8 of 23 eigenvalues)
##  PCoA1  PCoA2  PCoA3  PCoA4  PCoA5  PCoA6  PCoA7  PCoA8 
## 1.7552 1.1334 0.4429 0.3698 0.2454 0.1961 0.1751 0.1284 
```
]
]
.col-6[

```r
boxplot(mod)
```

![](slides_files/figure-html/permdisp-plot-1.svg)<!-- -->
]
]

---

# Test for dispersions

.row[
.col-6[
.smaller[

```r
set.seed(25)
permutest(mod)
```

```
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df  Sum Sq  Mean Sq      F N.Perm Pr(>F)  
## Groups     1 0.07931 0.079306 4.6156    999  0.045 *
## Residuals 22 0.37801 0.017182                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]
]
.col-6[

```r
plot(mod)
```

![](slides_files/figure-html/permdisp-plot-it-1.svg)<!-- -->
]
]
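---

# Test for dispersion effects

*vegan* also provides a `TukeyHSD()` method for `betadisper` objects, giving parametric pairwise comparisons of the group dispersions. A sketch (output not shown):

```r
## Tukey's Honest Significant Differences between group dispersions
mod_hsd <- TukeyHSD(mod)
plot(mod_hsd)
```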
---

# Test for dispersions

```r
set.seed(4)
permutest(mod, pairwise = TRUE)
```

```
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df  Sum Sq  Mean Sq      F N.Perm Pr(>F)  
## Groups     1 0.07931 0.079306 4.6156    999  0.036 *
## Residuals 22 0.37801 0.017182                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Pairwise comparisons:
## (Observed p-value below diagonal, permuted p-value above diagonal)
##           grazed ungrazed
## grazed             0.043
## ungrazed 0.04295        
```

---

# Test for locations with non-equal dispersion?

Marti Anderson & colleagues (2017) have proposed a solution that is related to the Behrens-Fisher problem

This is in Primer but not yet in *vegan*

<https://github.com/vegandevs/vegan/issues/344>

---
class: inverse middle center big-subsection

# Distance-based RDA

---

# Distance-based RDA

Multiple models that all do something similar

1. `adonis()` (deprecated)
2. `adonis2()`
3. `capscale()`
4. `dbrda()`

They all do essentially the same thing, but they do it differently & have slightly different behaviour

---

# Distance-based RDA

Distance-based RDA (db-RDA) is a constrained form of principal coordinates analysis (PCO)

It is similar to RDA but allows for non-Euclidean dissimilarity indices

In *vegan*, db-RDA is implemented in `dbrda()`

---

# Constrained analysis of principal coordinates

`capscale()` is *another* constrained form of PCO, due to Legendre & Anderson (1999)

It is *very* similar to `dbrda()` but handles the negative eigenvalue problem in a different way — it ignores them!

---

# Constrained analysis of principal coordinates

`capscale()` works by

1. converting the response data into dissimilarities
2. applying PCO to the dissimilarities & taking the PCO sample (site) scores as *new* response data
3. fitting `rda()` to the *new* response data, with the predictor variables as constraints

Essentially, we embed the dissimilarities in a Euclidean space using PCO, and then we use RDA on this highly transformed response data
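---

# Constrained analysis of principal coordinates

A rough sketch of that recipe done by hand, for illustration only: `capscale()` handles the weighting & scaling details internally, and Bray-Curtis is an assumption here

```r
data(dune, dune.env)
dij <- vegdist(dune, method = "bray") # 1. dissimilarities
pco <- cmdscale(dij, k = nrow(dune) - 1, eig = TRUE) # 2. PCO
## 3. RDA of the PCO site scores on the constraints
## (cmdscale() returns only the positive-eigenvalue axes)
rda(pco$points ~ Management * A1, data = dune.env)
```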
---

# Distance-based RDA

db-RDA foregoes step 2., and directly decomposes the dissimilarities into components explained by each term in the model

Negative eigenvalues resulting from non-metric dissimilarity coefficients are handled via

1. a square-root transform of the dissimilarities, or
2. adding a constant to the dissimilarities using methods `"lingoes"` (default, preferred) or `"cailliez"`

db-RDA is based on the ideas in McArdle & Anderson (2001)

--

Err… isn't that what `adonis2()` was developed to do?

--

*Yes*, but…

---

# Distance-based RDA

`adonis2()` was a ground-up redevelopment of the `adonis()` implementation and as such it retains many of the arguments and concepts of PERMANOVA, just updated to use the direct decomposition of dissimilarities

`dbrda()` inherits from `rda()` and `cca()` and as a result has an expanded set of capabilities

`dbrda()` can use `Condition()` in the formula to fit partial db-RDA

`Condition()` is often needed to provide correct restricted permutation tests

---

# Distance-based RDA

The `dbrda()` equivalent of the earlier `adonis2()` model is

```r
data(dune, dune.env)
dune_dbrda <- dbrda(dune ~ Management * A1, data = dune.env,
    method = "bray")
```

We must specify `method = "bray"` because the two functions have different default `method` values

---
class: inverse middle center subsection

# Ohraz Case Study

---

# Restricted permutations | Ohraz

Analyse the Ohraz data

Case study 5 of Leps & Smilauer

Repeated observations of composition from an experiment

* Factorial design (3 replicates)
* Treatments: fertilisation, mowing, *Molinia* removal

Test one of the hypotheses

> There are *no* directional changes in species composition in time that are common to all treatments or specific treatments

---

# Restricted permutations | Ohraz

```r
spp <- read_csv(url("https://bit.ly/ohraz-spp")) %>%
    rename(label = "...1") %>%
    janitor::clean_names()

molinia <- spp %>%
    select(label:molicaer)

spp <- spp %>%
    select(-molicaer) %>%
    column_to_rownames("label")

env <- read_csv(url("https://bit.ly/ohraz-env")) %>%
    rename(label = "...1") %>%
    mutate(across(c(mowing:removal, plotid), ~ factor(.x))) %>%
    column_to_rownames("label")
```

---

# Distance-based RDA | Ohraz

```r
ohraz_dbrda <- dbrda(spp ~ year +
    year:mowing + year:fertilizer + year:removal +
    Condition(plotid), data = env, method = "bray", add = "lingoes")
h <- how(within = Within(type = "none"),
    plots = Plots(strata = env$plotid, type = "free"))
set.seed(42)
anova(ohraz_dbrda, permutations = h, model = "reduced")
```

```
## Permutation test for dbrda under reduced model
## Plots: env$plotid, plot permutation: free
## Permutation: none
## Number of permutations: 199
## 
## Model: dbrda(formula = spp ~ year + year:mowing + year:fertilizer + year:removal + Condition(plotid), data = env, add = "lingoes", method = "bray")
##          Df Variance      F Pr(>F)   
## Model     4   96.846 5.3777  0.005 **
## Residual 68  306.146                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
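---

# Distance-based RDA | Ohraz

We can also test each constrained axis in turn, reusing the same restricted permutation design. A sketch (output not shown):

```r
## sequential significance test of individual constrained axes
set.seed(42)
anova(ohraz_dbrda, permutations = h, model = "reduced", by = "axis")
```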
---
class: inverse middle center big-subsection

# Other stuff

---

# Diagnostics for constrained ordinations

**vegan** provides a series of diagnostics to help assess the model fit

* `goodness()`
* `inertcomp()`
* `spenvcor()`
* `intersetcor()`
* `vif.cca()`

---

# Diagnostics | goodness of fit

`goodness()` computes a goodness of fit statistic for species or sites, controlled by argument `display`

Gives the cumulative proportion of variance explained by each axis

```r
upr <- cca(varespec ~ ., data = varechem)
lwr <- cca(varespec ~ 1, data = varechem)
set.seed(1)
mods <- ordistep(lwr, scope = formula(upr), trace = 0)
head(goodness(mods))
```

```
##                  CCA1        CCA2      CCA3
## Callvulg 0.0062471656 0.318907619 0.8254657
## Empenigr 0.1164701677 0.137604904 0.1953245
## Rhodtome 0.0999089739 0.169697909 0.1824153
## Vaccmyrt 0.2361482843 0.240516323 0.2406730
## Vaccviti 0.1523704591 0.156502301 0.2110550
## Pinusylv 0.0009244423 0.004802076 0.0060096
```

---

# Diagnostics | inertia decomposition

`inertcomp()` decomposes the variance in samples or species into partial, constrained, and unconstrained components

* `statistic = "explained"` (default) gives the decomposition in terms of variance
* `statistic = "distance"` gives the decomposition in terms of the residual distance

```r
head(inertcomp(mods, proportional = TRUE))
```

```
##                CCA        CA
## Callvulg 0.8254657 0.1745343
## Empenigr 0.1953245 0.8046755
## Rhodtome 0.1824153 0.8175847
## Vaccmyrt 0.2406730 0.7593270
## Vaccviti 0.2110550 0.7889450
## Pinusylv 0.0060096 0.9939904
```
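---

# Diagnostics | variance inflation factors

`vif.cca()` returns variance inflation factors for the constraints, a quick check for redundant constraints. A sketch (output not shown):

```r
## VIFs for the constraining variables; as a rule of thumb, values
## above 10 suggest a constraint is strongly dependent on the others
vif.cca(mods)
```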
---

# Diagnostics | species-environment correlations

`spenvcor()` returns the (weighted) correlation between the weighted average-based and the linear combination-based sets of site scores

A *poor* measure of goodness of fit. Sensitive to

* outliers (like all correlations)
* overfitting (using too many constraints)

Better models can have poorer species-environment correlations

```r
spenvcor(mods)
```

```
##      CCA1      CCA2      CCA3 
## 0.8554793 0.8131627 0.8792221
```

---

# Diagnostics | interset correlations

`intersetcor()` returns the (weighted) correlation between the weighted average-based site scores and each constraint variable

Another *poor* diagnostic

* correlation based
* focuses on a single constraint–axis combination at a time

```r
intersetcor(mods)
```

```
##          CCA1       CCA2      CCA3
## Al  0.7356445 -0.1304293 0.4260453
## P  -0.3588931 -0.6109601 0.4478786
## K  -0.3767902 -0.1339051 0.7759566
```

Vector fitting (`envfit()`) or biplot scores (`scores(model, display = "bp")`) are better alternatives

---
class: inverse middle center big-subsection

# High-throughput data

---

# High-throughput data

Revolutions in biology & biotechnology have led to exponential increases in our capacity to generate data arising from the counting of biological molecules

* DNA sequencing,
* RNA-Seq — sequence RNA molecules in populations of cells or tissues,
* ChIP-Seq — sequence DNA molecules that are bound to particular proteins,
* …

Relatively cheaply, today we can generate data sets with thousands of variables on tens to hundreds of samples

---

# High-throughput data

Counts of such molecules present a statistical challenge

* the counts typically have a large *dynamic range*, covering many orders of magnitude
* over this very large range we observe changes in both the variance (spread about the mean) and the distribution of the data — **heterogeneity**
* like other count data, observations are integers and distributions are skewed
* the biotech methodology imparts systematic sampling biases into the data that we need to account for in an analysis — typically called **normalization** in this sub-field

---

# Sequence depth & size factors

The number of reads for each sample is a kind of *effort* variable

All else being equal, the more reads we generate, the more species (OTUs, ASVs, etc) we would expect to identify

If the number of reads differs between samples (libraries) then, all else being equal, we might assume the counts in different samples are proportional to one another, following some proportionality factor `\(s\)`

A simple estimate for `\(s\)` might be the total reads per sample

But we can do better than this …

---

# Sequence depth & size factors

.row[
.col-8[
A small dataset of 5 genes in 2 samples

Two views:

1. estimate `\(s_j\)` as `\(\sum_{i=1}^{m} x_{ij}\)` — the blue line is the ratio of the `\(s_j\)`
2. instead, estimate the `\(s_j\)` such that their ratio is the red line

Under view 1. we would conclude that C is downregulated & A, B, D, & E are upregulated

View 2. is more parsimonious

`DESeq2::estimateSizeFactorsForMatrix()`
]
.col-4[
<img src="resources/chap7-rnaseq-normalization-1.png" width="667" />

.smaller[Holmes & Huber (2019) Modern Statistics for Modern Biology]
]
]
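---

# Sequence depth & size factors

A minimal sketch of the median-of-ratios idea behind `DESeq2::estimateSizeFactorsForMatrix()`, assuming a hypothetical `counts` matrix with variables in rows & samples in columns:

```r
## log geometric mean of each variable across samples
## (-Inf for variables with any zero count)
log_geo_means <- rowMeans(log(counts))

## size factor: per-sample median ratio to the geometric means,
## computed on the log scale & ignoring zero counts
size_factors <- apply(log(counts), 2, function(lc) {
    r <- lc - log_geo_means
    exp(median(r[is.finite(r)]))
})
```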
---

# Variance stabilizing transformations

The square-root transformation is known as the _variance stabilizing transformation_ for the Poisson distribution

Taking the square root of Poisson-distributed observations leads to *homoscedasticity*

Can construct variance stabilizing transformations for other distributions

High-throughput count data typically show extra-Poisson variation

`DESeq2::varianceStabilizingTransformation()`

---

# Regularized log transform

This transforms counts to a log<sub>2</sub>-like scale via a simple model

`$$\log_2(q_{ij}) = \sum_k x_{jk} \beta_{ik}$$`

where `\(q_{ij}\)` are the transformed data and `\(x_{jk}\)` is a particular design matrix with a dummy variable for the `\(j\)`th sample (for the `\(i = 1, 2, \dots, m\)` variables)

Priors on the `\(\beta_{ik}\)` make the model identifiable

The _rlog_ transformation is ~ variance stabilizing, but handles highly varying size factors across samples

`DESeq2::rlogTransformation()`

---

# Transformations

Comparison of three transformations

<img src="resources/chap7-rnaseq-meansd-1.png" width="2667" />

.smaller[Holmes & Huber (2019) Modern Statistics for Modern Biology]

???

Figure 8.12: Per-gene standard deviation (sd, taken across samples) against the rank of the mean, for the shifted logarithm, the variance-stabilizing transformation (vst) and the rlog. Note that for the leftmost 2,500 genes, the counts are all zero, and hence their standard deviation is zero. The mean-sd dependence becomes more interesting for genes with non-zero counts. Note also the high value of the standard deviation for genes that are weakly detected (but not with all zero counts) when the shifted logarithm is used, and compare to the relatively flat shape of the mean-sd relationship for the variance-stabilizing transformation.

---

# Implications

Size factors can be used in GLMs etc to normalize samples via an `offset()` term in the formula

For ordination we can't easily use the size factors

Need to use _vst_ or _rlog_ instead

_rlog_ creates data (including negative values) that can't be used in CCA, so if you use it, RDA or db-RDA are the ordination methods to use

---

# Practicalities

```r
# install.packages("BiocManager")
# BiocManager::install("DESeq2")
library("DESeq2")

dds_asv <- DESeqDataSetFromMatrix(countData = orig_asv, # response count data
    colData = meta_data, # other data including predictors
    design = ~ treatment) # experimental design

## Make sure that the control is the reference level
dds_asv$treatment <- relevel(dds_asv$treatment, ref = "CON")

# compute the regularized log transform
rld <- rlog(dds_asv, blind = FALSE)

# compute the variance stabilizing transform
vst <- vst(dds_asv, blind = FALSE)

# extract rlog transformed data & get ready to go into *vegan*
cts <- t(assay(rld))
cts_df <- as.data.frame(cts)
```

---

# Rarefied counts

Can take a random draw `\(n^{\prime}\)` out of the `\(N\)` individuals in a sample in proportion to species abundances

This yields rarefied counts, which some people then go on to analyse

Be careful though, as this is a random process — what if you took a different random sample?
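---

# Rarefied counts

A single rarefied draw can be generated with *vegan*'s `rrarefy()`. A sketch, assuming a hypothetical `asv_counts` matrix of integer counts with samples in rows:

```r
## rarefy every sample down to the smallest sample total;
## rerunning gives a different random draw
min_depth <- min(rowSums(asv_counts))
rare_counts <- rrarefy(asv_counts, sample = min_depth)
```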
---

# Rarefied counts

`avgdist()` in *vegan* tries to get around this by

1. taking a random draw to get rarefied counts
2. computing the dissimilarity between samples on the basis of the rarefied counts

Repeat that many times, then take as your dissimilarity `\(d_{ij}\)` the average of the many `\(d_{ij}^*\)` values generated above

See also the help page `?avgdist` for other suggestions

---

# References

.smaller[
* Anderson, M.J., 2001. A new method for non-parametric multivariate analysis of variance. Austral Ecol. 26, 32–46
* Anderson, M.J., 2006. Distance-based tests for homogeneity of multivariate dispersions. Biometrics 62, 245–253
* Anderson, M.J., Walsh, D.C.I., 2013. PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: What null hypothesis are you testing? Ecol. Monogr. 83, 557–574
* Anderson, M.J., Walsh, D.C.I., Robert Clarke, K., Gorley, R.N., Guerra-Castro, E., 2017. Some solutions to the multivariate Behrens-Fisher problem for dissimilarity-based analyses. Aust. N. Z. J. Stat. 59, 57–79
* Blanchet, F.G., Legendre, P., Borcard, D., 2008. Forward selection of explanatory variables. Ecology 89, 2623–2632
* Legendre, P., Anderson, M.J., 1999. Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecol. Monogr. 69, 1–24
* McArdle, B.H., Anderson, M.J., 2001. Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology 82, 290–297
]