Baxter microbiome example

In this example, we will use a data set from Baxter et al (2016; Genome Medicine), who investigated the use of non-invasive tests for colorectal cancer. They studied the microbiomes of a number of patients and investigated whether there were detectable differences in microbial composition in stool samples between patients with cancer, versus adenomas, versus healthy individuals.

Load the packages we will use and the OTU data:

# load vegan, dplyr
library("vegan")
library("dplyr")

# load the data
baxter <- read.table(
  "https://raw.githubusercontent.com/michaelgreenacre/CODAinPractice/master/Baxter_OTU_table.txt",
  header = TRUE
  )

We need to remove the first three columns of the data as provided:

baxter <- baxter |>
  as_tibble() |>
  select(-c(label:numOtus))

There are 490 samples and 335 OTUs in the data:

dim(baxter)

## [1] 490 335

The associated meta data are loaded using

meta <- read.table(
  "https://raw.githubusercontent.com/michaelgreenacre/CODAinPractice/master/Baxter_Metadata.txt",
  header = TRUE
) |> as_tibble()

dim(meta)

## [1] 490  27

names(meta)

##  [1] "sample"       "fit_result"   "Site"         "Dx_Bin"       "dx"          
##  [6] "Hx_Prev"      "Hx_of_Polyps" "Age"          "Gender"       "Smoke"       
## [11] "Diabetic"     "Hx_Fam_CRC"   "Height"       "Weight"       "BMI"         
## [16] "White"        "Native"       "Black"        "Pacific"      "Asian"       
## [21] "Other"        "Ethnic"       "NSAID"        "Abx"          "Diabetes_Med"
## [26] "stage"        "Location"

meta

## # A tibble: 490 × 27
##     sample fit_result Site  Dx_Bin dx    Hx_Prev Hx_of_Polyps   Age Gender Smoke
##      <int>      <int> <chr> <chr>  <chr>   <int>        <int> <int> <chr>  <int>
##  1 2003650          0 UMic… HighR… norm…       0            1    64 m         NA
##  2 2005650          0 UMic… HighR… norm…       0            1    61 m          0
##  3 2007660         26 UMic… HighR… norm…       0            1    47 f          0
##  4 2009650         10 Toro… Adeno… aden…       0            1    81 f          1
##  5 2013660          0 UMic… Normal norm…       0            0    44 f          0
##  6 2015650          0 Dana… HighR… norm…       0            1    51 f          1
##  7 2017660          7 Dana… Cancer canc…       1            1    78 m          1
##  8 2019651         19 UMic… Normal norm…       0            0    59 m          0
##  9 2023680          0 Dana… HighR… norm…       1            1    63 f          1
## 10 2025653       1509 UMic… Cancer canc…       1            1    67 m          1
## # ℹ 480 more rows
## # ℹ 17 more variables: Diabetic <int>, Hx_Fam_CRC <int>, Height <int>,
## #   Weight <int>, BMI <dbl>, White <int>, Native <int>, Black <int>,
## #   Pacific <int>, Asian <int>, Other <int>, Ethnic <int>, NSAID <int>,
## #   Abx <int>, Diabetes_Med <int>, stage <int>, Location <chr>

The data represent one of three types of cell (variable dx):

normal,
cancer, and
adenoma (benign)

We need to convert ths variable to a factor so it can be used in the ordination method. We also centre and standardize the Age variable

meta <- meta |>
  mutate(
    dx = factor(dx),
    std_Age = (Age - mean(Age)) / sd(Age)
  )

The species data are counts of the OTUs. We will look at two ordination methods in this brief example:

CCA on the fourth-root transformed percentage abundances, and
RDA on the CLR-transformed percentage abundances.

CCA

We begin by converting the data to proportions

baxter_prop <- baxter |> decostand(method = "total")

and then to stabilise the variance of the data we could use a square root transform, but we follow Greenacre (2021; Annual Review of Statistics and its Applications) and use a quarter root transformation

baxter_qroot <- baxter_prop^0.25

The aim of this brief example is to explore the effects of dx (diagnosis) and std_Age on the OTU composition. We can fit this model using

baxter_cca <- cca(baxter_qroot ~ dx + std_Age, data = meta)

If we plot these data, we notice a considerable outlier, sample 371:

plot(baxter_cca, display = "sites", type = "text")

We can delete this observation and refit the model

baxter_qroot <- baxter_qroot[-371, ]
meta <- meta[-371, ]
baxter_cca <- cca(baxter_qroot ~ dx + std_Age, data = meta)

plot(baxter_cca, display = "sites", type = "text")

The two covariates in the model explain a relatively small amount of the variation in the OTUs

baxter_cca

## 
## Call: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
## 
##               Inertia Proportion Rank
## Total         1.50842    1.00000     
## Constrained   0.02156    0.01430    3
## Unconstrained 1.48685    0.98570  334
## 
## Inertia is scaled Chi-square
## 
## Eigenvalues for constrained axes:
##     CCA1     CCA2     CCA3 
## 0.011218 0.007582 0.002765 
## 
## Eigenvalues for unconstrained axes:
##     CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8 
## 0.10613 0.05698 0.04028 0.03674 0.02812 0.02662 0.02429 0.02082 
## (Showing 8 of 334 unconstrained eigenvalues)

but such small proportions of variation explained are not necessarily an issue; with such a large number of taxa, even 490 samples are very sparsely scattered around the 335 dimensional space represented by the OTUs.

We can use the anova() method to test the overall model

set.seed(2026-4-29)
anova(baxter_cca, parallel = 4)

## Permutation test for cca under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
##           Df ChiSquare      F Pr(>F)    
## Model      3   0.02156 2.3448  0.001 ***
## Residual 485   1.48685                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

the individual axes of the CCA

set.seed(2026-4-29)
anova(baxter_cca, by = "axis", parallel = 4)

## Permutation test for cca under reduced model
## Forward tests for axes
## Permutation: free
## Number of permutations: 999
## 
## Model: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
##           Df ChiSquare      F Pr(>F)    
## CCA1       1   0.01122 3.6592  0.001 ***
## CCA2       1   0.00758 2.4777  0.001 ***
## CCA3       1   0.00276 0.9052  0.747    
## Residual 485   1.48685                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and the marginal effects of the covariates

set.seed(2026-4-29)
anova(baxter_cca, by = "margin", parallel = 4)

## Permutation test for cca under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## Model: cca(formula = baxter_qroot ~ dx + std_Age, data = meta)
##           Df ChiSquare      F Pr(>F)    
## dx         2   0.01395 2.2748  0.001 ***
## std_Age    1   0.00731 2.3833  0.001 ***
## Residual 485   1.48685                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To visualise the effects of the covariates, we can use a couple of vegan’s plotting commands

baxter_cca |> plot(
  display = c("sites", "bp", "cn"),
  scaling = "sites"
)
ordiellipse(
  baxter_cca, meta$dx,
  kind = "se", conf = 0.95,
  col = palette.colors(3),
  scaling = "sites", lwd = 2, label = TRUE
)

It’s not an especially pretty plot, but we see that the first axis captures the contrast between cancerous and normal/adenoma samples, while the second axis reflects the age of the patient.

CLR transformation

To look at the CLR transformation, we start with the original data, baxter and use decostand() to apply the transformation. We use a pseudocount of 1.

pseudo <- min(baxter_prop[baxter_prop > 0])
baxter_clr <- baxter_prop |>
  decostand(method = "clr", pseudocount = pseudo)
baxter_clr <- baxter_clr[-371, ]

We fit the same model as we did with CCA:

baxter_rda <- rda(baxter_clr ~ dx + std_Age, data = meta)

and apply the same tests; for brevity, I just show the results for the marginal effects:

set.seed(2026-4-29)
anova(baxter_rda, by = "margin", parallel = 4)

## Permutation test for rda under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = baxter_clr ~ dx + std_Age, data = meta)
##           Df Variance      F Pr(>F)    
## dx         2     4.27 2.0126  0.001 ***
## std_Age    1     2.63 2.4809  0.001 ***
## Residual 485   515.07                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A plot that is superficially the same as the one we achieved using CCA is obtained from the RDA results:

baxter_rda |> plot(
  display = c("sites", "bp", "cn"),
  scaling = "sites"
)
ordiellipse(
  baxter_rda, meta$dx,
  kind = "se", conf = 0.95,
  col = palette.colors(3),
  scaling = "sites", lwd = 2, label = TRUE
)

Baxter microbiome example

Gavin Simpson (Based on Greenacre)

April 29, 2026

CCA

CLR transformation