On the first day we discussed cluster analysis and classification as being unsupervised and supervised methods respectively. In cluster analysis we attempt to identify group structure in the data, but we don’t know the groups ahead of time. In classification, we do know the groups (hence it is supervised) and we try to find prediction rules from among the variables that best predict to which group individual samples belong.

Classification is a large area of statistics and computer science with tens if not hundreds of classification modelling methods available. Linear discriminant analysis is the classical multivariate technique for finding prediction rules that best explain and hence predict the known groups.

LDA finds axes that are linear combinations of the variables that best predict the grouping variable. We can think of these axes or linear combinations as being those directions in the multivariate space of the variables that best separate the group centroids. Axes subsequent to the first LDA axis are subject to the usual constraint that they be orthogonal to earlier axes. If this sounds familiar, it is. LDA is a special case of Canonical Correspondence Analysis, with the groups being the species.

We can fit an LDA using the lda() function of the MASS package, which ships with R. To illustrate the method, we’ll use a data set of morphological measurements on the fruits of two species of Betula (Birch)

  1. Betula pubescens (tree birch), and
  2. Betula nana (dwarf birch)

The fruits from these species are readily found in lake sediments and are used to help reconstruct climate in the late-glacial and Holocene periods. However, many of the fruits we find in these old sediments lack their wings, with only the fruit body being preserved. If we can predict which species of Betula a fruit body belongs, we could improve climate reconstructions.

Here we use a data set of 50 fruits from each species where seven (7) morphological measurements have been made on the fruit bodies. We will use LDA to build a model to predict to which species each of the 100 fruits belongs, and then we will use this LDA model to predict for 45 fossil, wingless Betula fruits from Eigebakken, a late-glacial site in south-west Norway. Fossil fruits 1-15 and 40-45 are from the Allerød interstadial, fossil fruits 16-22 and from the Younger Dryas, and fruits 23-39 are from the early Holocene.

library("vegan")
## Loading required package: permute
library("readr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("MASS")
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
modern <- read_csv("https://bit.ly/betula-modern") %>%
  mutate(species = factor(species))
## Rows: 90 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): species
## dbl (7): var1, var2, var3, var4, var5, var6, var7
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fossil <- read_csv("https://bit.ly/betula-fossil") %>%
  mutate(period = rep(c("Alerod", "Younger Dryas", "early Holocene", "Alerod"),
    times = c(15, 7, 17, 6)))
## Rows: 45 Columns: 7
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): var1, var2, var3, var4, var5, var6, var7
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
m_lda <- lda(species ~ ., data = modern)
m_lda
## Call:
## lda(species ~ ., data = modern)
## 
## Prior probabilities of groups:
##      nana pubescens 
## 0.5555556 0.4444444 
## 
## Group means:
##              var1   var2    var3 var4  var5  var6  var7
## nana      153.780 104.32 51.9360 48.5 95.98 38.16 30.68
## pubescens 186.725 109.50 59.6425 46.4 66.60 31.45  9.45
## 
## Coefficients of linear discriminants:
##                LD1
## var1  0.0003303159
## var2  0.0343776969
## var3  0.0527724852
## var4 -0.0320033244
## var5 -0.0428463635
## var6 -0.0296381642
## var7 -0.0742924453
pred <- predict(m_lda, newdata = fossil)

pred$class
##  [1] nana      nana      nana      nana      nana      nana      nana     
##  [8] nana      nana      nana      nana      nana      nana      nana     
## [15] nana      nana      nana      nana      nana      nana      nana     
## [22] nana      pubescens pubescens pubescens pubescens pubescens pubescens
## [29] pubescens nana      pubescens pubescens pubescens pubescens pubescens
## [36] pubescens pubescens pubescens nana      nana      nana      nana     
## [43] nana      nana      nana     
## Levels: nana pubescens
round(pred$posterior, 3)
##     nana pubescens
## 1  1.000     0.000
## 2  0.999     0.001
## 3  0.992     0.008
## 4  0.999     0.001
## 5  0.999     0.001
## 6  1.000     0.000
## 7  1.000     0.000
## 8  1.000     0.000
## 9  1.000     0.000
## 10 1.000     0.000
## 11 1.000     0.000
## 12 1.000     0.000
## 13 0.997     0.003
## 14 0.981     0.019
## 15 1.000     0.000
## 16 1.000     0.000
## 17 0.999     0.001
## 18 1.000     0.000
## 19 0.996     0.004
## 20 1.000     0.000
## 21 1.000     0.000
## 22 1.000     0.000
## 23 0.002     0.998
## 24 0.006     0.994
## 25 0.009     0.991
## 26 0.064     0.936
## 27 0.004     0.996
## 28 0.000     1.000
## 29 0.007     0.993
## 30 0.657     0.343
## 31 0.000     1.000
## 32 0.050     0.950
## 33 0.289     0.711
## 34 0.050     0.950
## 35 0.120     0.880
## 36 0.000     1.000
## 37 0.032     0.968
## 38 0.001     0.999
## 39 1.000     0.000
## 40 1.000     0.000
## 41 1.000     0.000
## 42 0.998     0.002
## 43 1.000     0.000
## 44 1.000     0.000
## 45 1.000     0.000