In this example, we will use a data set from Baxter et al (2016; Genome Medicine), who investigated the use of non-invasive tests for colorectal cancer. They studied the microbiomes of a number of patients and investigated whether there were detectable differences in microbial composition in stool samples between patients with cancer, versus adenomas, versus healthy individuals.
Load the packages we will use and the OTU data:
# load vegan, dplyr
library("vegan")
library("dplyr")
# load the data
baxter <- read.table(
"https://raw.githubusercontent.com/michaelgreenacre/CODAinPractice/master/Baxter_OTU_table.txt",
header = TRUE
)
We need to remove the first three columns of the data as provided:
baxter <- baxter |>
as_tibble() |>
select(-c(label:numOtus))
There are 490 samples and 335 OTUs in the data:
dim(baxter)
## [1] 490 335
The associated meta data are loaded using
meta <- read.table(
"https://raw.githubusercontent.com/michaelgreenacre/CODAinPractice/master/Baxter_Metadata.txt",
header = TRUE
) |> as_tibble()
dim(meta)
## [1] 490 27
names(meta)
## [1] "sample" "fit_result" "Site" "Dx_Bin" "dx"
## [6] "Hx_Prev" "Hx_of_Polyps" "Age" "Gender" "Smoke"
## [11] "Diabetic" "Hx_Fam_CRC" "Height" "Weight" "BMI"
## [16] "White" "Native" "Black" "Pacific" "Asian"
## [21] "Other" "Ethnic" "NSAID" "Abx" "Diabetes_Med"
## [26] "stage" "Location"
meta
## # A tibble: 490 × 27
## sample fit_result Site Dx_Bin dx Hx_Prev Hx_of_Polyps Age Gender Smoke
## <int> <int> <chr> <chr> <chr> <int> <int> <int> <chr> <int>
## 1 2003650 0 UMic… HighR… norm… 0 1 64 m NA
## 2 2005650 0 UMic… HighR… norm… 0 1 61 m 0
## 3 2007660 26 UMic… HighR… norm… 0 1 47 f 0
## 4 2009650 10 Toro… Adeno… aden… 0 1 81 f 1
## 5 2013660 0 UMic… Normal norm… 0 0 44 f 0
## 6 2015650 0 Dana… HighR… norm… 0 1 51 f 1
## 7 2017660 7 Dana… Cancer canc… 1 1 78 m 1
## 8 2019651 19 UMic… Normal norm… 0 0 59 m 0
## 9 2023680 0 Dana… HighR… norm… 1 1 63 f 1
## 10 2025653 1509 UMic… Cancer canc… 1 1 67 m 1
## # ℹ 480 more rows
## # ℹ 17 more variables: Diabetic <int>, Hx_Fam_CRC <int>, Height <int>,
## # Weight <int>, BMI <dbl>, White <int>, Native <int>, Black <int>,
## # Pacific <int>, Asian <int>, Other <int>, Ethnic <int>, NSAID <int>,
## # Abx <int>, Diabetes_Med <int>, stage <int>, Location <chr>
The data represent one of three types of cell (variable
dx):
normal,cancer, andadenoma (benign)We need to convert ths variable to a factor so it can be used in the
ordination method. We also centre and standardize the Age
variable
meta <- meta |>
mutate(
dx = factor(dx),
std_Age = (Age - mean(Age)) / sd(Age)
)
The species data are counts of the OTUs. We will look at two ordination methods in this brief example:
CCA on the fourth-root transformed percentage abundances, and
RDA on the CLR-transformed percentage abundances.
We begin by converting the data to proportions
baxter_prop <- baxter |> decostand(method = "total")
and then to stabilise the variance of the data we could use a square root transform, but we follow Greenacre (2021; Annual Review of Statistics and its Applications) and use a quarter root transformation
baxter_qroot <- baxter_prop^0.25
The aim of this brief example is to explore the effects of
dx (diagnosis) and std_Age on the OTU
composition. We can fit this model using
baxter_cca <- cca(baxter_qroot ~ dx + std_Age, data = meta)
If we plot these data, we notice a considerable outlier, sample 371:
plot(baxter_cca, display = "sites", type = "text")
We can delete this observation and refit the model
baxter_qroot <- baxter_qroot[-371, ]
meta <- meta[-371, ]
baxter_cca <- cca(baxter_qroot ~ dx + std_Age, data = meta)
plot(baxter_cca, display = "sites", type = "text")
The two covariates in the model explain a relatively small amount of the variation in the OTUs
baxter_cca
##
## Call: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
##
## Inertia Proportion Rank
## Total 1.50842 1.00000
## Constrained 0.02156 0.01430 3
## Unconstrained 1.48685 0.98570 334
##
## Inertia is scaled Chi-square
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3
## 0.011218 0.007582 0.002765
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.10613 0.05698 0.04028 0.03674 0.02812 0.02662 0.02429 0.02082
## (Showing 8 of 334 unconstrained eigenvalues)
but such small proportions of variation explained are not necessarily an issue; with such a large number of taxa, even 490 samples are very sparsely scattered around the 335 dimensional space represented by the OTUs.
We can use the anova() method to test the overall
model
set.seed(2026-4-29)
anova(baxter_cca, parallel = 4)
## Permutation test for cca under reduced model
## Permutation: free
## Number of permutations: 999
##
## Model: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
## Df ChiSquare F Pr(>F)
## Model 3 0.02156 2.3448 0.001 ***
## Residual 485 1.48685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
the individual axes of the CCA
set.seed(2026-4-29)
anova(baxter_cca, by = "axis", parallel = 4)
## Permutation test for cca under reduced model
## Forward tests for axes
## Permutation: free
## Number of permutations: 999
##
## Model: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
## Df ChiSquare F Pr(>F)
## CCA1 1 0.01122 3.6592 0.001 ***
## CCA2 1 0.00758 2.4777 0.001 ***
## CCA3 1 0.00276 0.9052 0.747
## Residual 485 1.48685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and the marginal effects of the covariates
set.seed(2026-4-29)
anova(baxter_cca, by = "margin", parallel = 4)
## Permutation test for cca under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
##
## Model: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
## Df ChiSquare F Pr(>F)
## dx 2 0.01395 2.2748 0.001 ***
## std_Age 1 0.00731 2.3833 0.001 ***
## Residual 485 1.48685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To visualise the effects of the covariates, we can use a couple of vegan’s plotting commands
baxter_cca |> plot(
display = c("sites", "bp", "cn"),
scaling = "sites"
)
ordiellipse(
baxter_cca, meta$dx,
kind = "se", conf = 0.95,
col = palette.colors(3),
scaling = "sites", lwd = 2, label = TRUE
)
It’s not an especially pretty plot, but we see that the first axis captures the contrast between cancerous and normal/adenoma samples, while the second axis reflects the age of the patient.
To look at the CLR transformation, we start with the original data,
baxter and use decostand() to apply the
transformation. We use a pseudocount of 1.
pseudo <- min(baxter_prop[baxter_prop > 0])
baxter_clr <- baxter_prop |>
decostand(method = "clr", pseudocount = pseudo)
baxter_clr <- baxter_clr[-371, ]
We fit the same model as we did with CCA:
baxter_rda <- rda(baxter_clr ~ dx + std_Age, data = meta)
and apply the same tests; for brevity, I just show the results for the marginal effects:
set.seed(2026-4-29)
anova(baxter_rda, by = "margin", parallel = 4)
## Permutation test for rda under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
##
## Model: rda(formula = baxter_clr ~ dx + std_Age, data = meta)
## Df Variance F Pr(>F)
## dx 2 4.27 2.0126 0.001 ***
## std_Age 1 2.63 2.4809 0.001 ***
## Residual 485 515.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A plot that is superficially the same as the one we achieved using CCA is obtained from the RDA results:
baxter_rda |> plot(
display = c("sites", "bp", "cn"),
scaling = "sites"
)
ordiellipse(
baxter_rda, meta$dx,
kind = "se", conf = 0.95,
col = palette.colors(3),
scaling = "sites", lwd = 2, label = TRUE
)