Dissimilarity, cluster analysis,

class: inverse, middle, left, my-title-slide, title-slide

.title[
# Dissimilarity, cluster analysis,
]
.author[
### Gavin Simpson
]
.date[
### September 19, 2022
]

---

class: inverse middle center big-subsection

# Welcome

---

# Logistics

---

# Today's topics

* Diversity

* Dissimilarity

* Cluster analysis

---
class: inverse middle center big-subsection

# Diversity

---

# Diversity metrics

**vegan** has many functions for computing diversity metrics

.row[
.col-6[
Three popular ones are

1. Shannon-Weaver

2. Simpson

3. Inverse Simpson

`$p_i$` is proportion of species `$i$`

`$b$` is the base, usually `$e$`

`$S$` is number of species (richness)
  
]
.col-6[
`$$H = - \sum_{i=1}^{S} p_i \log_b p_i$$`

`$$D_1 = 1 - \sum_{i=1}^{S} p_i^2$$`

`$$D_2 = \frac{1}{\sum_{i=1}^{S} p_i^2}$$`
]
]

---

# Diversity metrics

```r
data(BCI)
H <- diversity(BCI)
head(H)
```

```
##        1        2        3        4        5        6 
## 4.018412 3.848471 3.814060 3.976563 3.969940 3.776575
```

```r
D1 <- diversity(BCI, index = "simpson")
head(D1)
```

```
##         1         2         3         4         5         6 
## 0.9746293 0.9683393 0.9646078 0.9716117 0.9678267 0.9627557
```

```r
D2 <- diversity(BCI, index = "invsimpson", base = 2)
head(D2)
```

```
##        1        2        3        4        5        6 
## 39.41555 31.58488 28.25478 35.22577 31.08166 26.84973
```

---

# Diversity metrics

## Richness

```r
head(specnumber(BCI)) # species richness
```

```
##   1   2   3   4   5   6 
##  93  84  90  94 101  85
```

```r
head(rowSums(BCI > 0)) # simple
```

```
##   1   2   3   4   5   6 
##  93  84  90  94 101  85
```

## Pielou's Evenness `$J$`

```r
J <- H / log(specnumber(BCI))
head(J)
```

```
##         1         2         3         4         5         6 
## 0.8865579 0.8685692 0.8476046 0.8752597 0.8602030 0.8500724
```

---

# Diversity &mdash; R&eacute;nyi entropy & Hill's numbers

R&eacute;nyi's *generalized entropy*

`$$H_a = \frac{1}{1-a} \log \sum_{i = 1}^{S} p_i^a$$`

where `$a$` is the *order* of the entropy

Corresponding Hill's numbers are

`$$N_a = \exp{(H_a)}$$`

---

# Diversity &mdash; R&eacute;nyi entropy & Hill's numbers

```r
R <- renyi(BCI, scales = 2)
head(R)
```

```
##        1        2        3        4        5        6 
## 3.674161 3.452678 3.341263 3.561778 3.436618 3.290256
```

```r
N2 <- renyi(BCI, scales = 2, hill = TRUE)
head(N2) # inverse simpson
```

```
##        1        2        3        4        5        6 
## 39.41555 31.58488 28.25478 35.22577 31.08166 26.84973
```

---

# Diversity &mdash; R&eacute;nyi entropy & Hill's numbers

```r
k <- sample(nrow(BCI), 6)
R <- renyi(BCI[k,])
plot(R)
```

---

# Rarefaction

Species richness increases with sample size (effort)

Rarefaction gives the expected number of species rarefied from `$N$` to `$n$` individuals

`$$\hat{S}_n = \sum_{i=1}^S (1 - q_i) \; \mathsf{where} \; q_i = \frac{\binom{N - x_i}{n}}{\binom{N}{n}}$$`

`$x_i$` is count of species `$i$` and `$\binom{N}{n}$` is a binomial coefficient &mdash; the number of ways to choose `$n$` from `$N$`

---

# Rarefaction

.row[
.col-6[

```r
rs <- rowSums(BCI)
quantile(rs)
```

```
##    0%   25%   50%   75%  100% 
## 340.0 409.0 428.0 443.5 601.0
```

```r
Srar <- rarefy(BCI, min(rs))
head(Srar)
```

```
##        1        2        3        4        5        6 
## 84.33992 76.53165 79.11504 82.46571 86.90901 78.50953
```

```r
rarecurve(BCI, sample = min(rs))
```
]

.col-6[
![](slides_files/figure-html/rarefy-1.svg)
]
]

---
class: inverse middle center big-subsection

# Dissimilarity

---

# Measuring association &mdash; binary data

.row[

.col-6[

<table>
    <tr>
        <td>&nbsp;</td>
        <td colspan=3>Object <i>j</i></td>
    </tr>
    <tr>
        <td rowspan=3>Object <i>i</i></td>
        <td> &nbsp; </td>
        <td> + </td>
        <td> - </td>
    </tr>
    <tr>
        <td> + </td>
        <td> a </td>
        <td> c </td>
    </tr>
    <tr>
        <td> - </td>
        <td> b </td>
        <td> d </td>
    </tr>
</table>

]

.col-6[
Dissimilarity based on the number of species present only in `$i$` ( `$b$` ), or `$j$` ( `$c$` ), or in present in both ( `$a$` ), or absent in both ( `$d$` ).

]
]

.row[

.col-6[

Jaccard similarity

`$$s_{ij} = \frac{a}{a + b + c}$$`

Jaccard dissimilarity

`$$d_{ij} = \frac{b + c}{a + b + c}$$`
]

.col-6[
Simple matching coefficient

`$$s_{ij} = \frac{a + d}{a + b + c + d}$$`

Simple matching coefficient

`$$d_{ij} = \frac{b + c}{a + b + c + d}$$`
]
]

---

# Dissimilarity

.row[

.col-6[

![](slides_files/figure-html/dissimilarity-1.svg)
]

.col-6[

`$$d_{ij} = \sqrt{\sum\limits^m_{k=1}(x_{ik} - x_{jk})^2}$$`

`$$d_{ij} = \sum\limits^m_{k=1}|x_{ik} - x_{jk}|$$`

`$$d_{ij} = \frac{\sum\limits^m_{k=1}|x_{ik} - x_{jk}|}{\sum\limits^m_{k=1}(x_{ik} + x_{jk})}$$`
]
]

---

# Measuring association &mdash; quantitative data

* Euclidean distance dominated by large values.
* Manhattan distance less affected by large values.
* Bray-Curtis sensitive to extreme values.
* Similarity ratio (Steinhaus-Marczewski `$\equiv$` Jaccard) less dominated by extremes.
* Chord distance, used for proportional data; _signal-to-noise_ measure.

.row[

.col-6[
Similarity ratio

`$d_{ij} = \frac{\sum\limits^m_{k=1}x_{ik}x_{jk}}{\left(\sum\limits^m_{k=1}x_{ik}^2 + \sum\limits^m_{k=1}x_{jk}^2 - \sum\limits^m_{k=1}x_{ik}x_{jk}\right)^2}$`

]

.col-6[
Chord distance

`$d_{ij} = \sqrt{\sum\limits^m_{k=1}(\sqrt{p_{ik}} - \sqrt{p_{jk}})^2}$`
]
]

---

# Measuring association &mdash; mixed data

.row[

.col-8[
.small[
* `$s_{ijk}$` is similarity between sites `$i$` and `$j$` for the `$k$`th variable.
* Weights `$w_{ijk}$` are typically 0 or 1 depending on whether the comparison is valid for variable `$k$`. Can also use variable weighting with `$w_{ijk}$` between 0 and 1.
* `$w_{ijk}$` is zero if the `$k$`th variable is missing for one or both of `$i$` or `$j$`.
* For binary variables `$s_{ijk}$` is the Jaccard coefficient.
* For categorical data `$s_{ijk}$` is 1 of `$i$` and `$k$` have same category, 0 otherwise.
* For quantitative data `$s_{ijk} = (1 - |x_{ik} - x_{jk}|) / R_k$`
]
]

.col-4[

Gower's coefficient

`$$s_{ij} = \frac{\sum\limits^m_{i=1} w_{ijk}s_{ijk}}{\sum\limits^m_{i=1} w_{ijk}}$$`
]
]

---

# Transformations

* Can transform the variables (e.g. species) or the samples to improve the gradient separation of the dissimilarity coefficient.
* No transformation of variables or samples leads to a situation of quantity domination &mdash; big values dominate `$d_{ij}$`.
* Normalise samples &mdash;gives all samples equal weight.
* Normalise variables;
    * gives all variables equal weight,
    * inflates the influence of rare species.
* Double (_Wisconsin_) transformation; normalise variables then samples.
* Noy-Meir _et al_. (1975) _J. Ecology_ **63**; 779--800.
* Faith _et al_. (1987) _Vegetatio_ **69**; 57--68.

---

# Dissimilarity

Two key functions

1. `vegdist()`
2. `decostand()`

```r
data(varespec)

euc_dij <- vegdist(varespec, method = "euclidean")

bc_dij <- vegdist(varespec)

hell_dij <- vegdist(decostand(varespec, method = "hellinger"),
                    method = "euclidean")
```

---
class: inverse middle center big-subsection

# Cluster analysis

---

# Basic aim of cluster analysis

* Partition a set of data (objects, samples) into groups known as clusters.
* Partitions formed that minimise a stated mathematical criterion, e.g. sum of squares (SS)
    * Minimise within groups SS &rarr; maximising between group SS
* Cluster analysis is a compromise however:
    * With 50 objects there are `$10^{80}$` possible ways of partitioning the objects
* Compromise is made in selecting a clustering scheme that reduces the number of partitions to a reasonable value.
* Commonest approaches impose a hierarchy and then either fuse (_agglomerative_) or split (_divisive_) samples and clusters.

---

# A taxonomy of clusterings

* Clustering techniques can be characterised in many ways:

<table>
<tr>
<td>Formal</td>
<td>Informal</td>
</tr>
<tr>
<td>Hierarchical</td>
<td>Non-hierarchical</td>
</tr>
<tr>
<td>Quantitative</td>
<td>Qualitative</td>
</tr>
<tr>
<td>Agglomerative</td>
<td>Divisive</td>
</tr>
<tr>
<td>Polythetic</td>
<td>Monothetic</td>
</tr>
<tr>
<td>Sharp</td>
<td>Fuzzy</td>
</tr>
<tr>
<td>Supervised</td>
<td>Unsupervised</td>
</tr>
<tr>
<td>Useful</td>
<td>Not useful</td>
</tr>
</table>

---

# A cautionary tale

* Cluster analysis _per se_ is _unsupervised_; we want to find groups in our data.
* We will define _classification_ as a _supervised_ procedure where we know _a priori_ what the groups are.

---

# A cautionary tale

> The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other "statistical" innovation (with the possible exception of multiple regression techniques).
>  R.M. Cormack (1970) _J. Roy. Stat. Soc. A_ **134(3)**; 321--367.

---
class: inverse middle center big-subsection

# Hierarchical cluster analysis

---

# Agglomerative hierarchical clustering

* Agglomerative methods start with all observations in separate clusters and fuse the two most similar observations.
* Fusion continues, with the two most similar observations and/or clusters being fused at each step.
* Five main stages in this analysis:
    * calculate matrix of _(dis)similarities_ `$d_{ij}$` between all pairs of `$m$` objects,
    * fuse objects into groups using chosen _clustering strategy_,
    * display the results graphically via a _dendrogram_ or superimposed on to an ordination,
    * check for distortion,
    * validate the results

---

# Clustering strategies

* Different strategies can be used to determine the dissimilarity (_distance_) between a sample and a cluster or two clusters.

* Single link or nearest neighbour;
    * the distance between the closest members of two clusters,
    * finds the minimum spanning tree, the shortest tree connecting all points,
    * can produce _chaining_, producing groups of unequal size.

* Complete link or furthest neighbour;
    * the distance between the farthest members of two clusters,
    * produces compact clusters of roughly equal size,
    * may make compact groups even when none exist

---

# Clustering strategies

* Centroid;
    * the distance between the centroids (the centre of gravity) of two clusters,
    * centroid is the point that is the average of the coordinates (variables) of all objects in the cluster
    * can produce reversals

* Unweighted group average
    * the average of the distances between the samples in one cluster and the samples of another,
    * intermediate between single and complete link methods,
    * maximises the \alert{cophenetic correlation},
    * may make compact clusters where none exist,

---

# Clustering strategies
* Minimum variance (Ward's method)
    * fuse two groups if and only if neither group will combine with any other group to produce a lower within group \alert{sum of squares}
    * forms compact clusters of equal size; even where none exist,

---

# Clustering strategies

![](slides_files/figure-html/clustering-strategies-figure-1.svg)

---

# Prehistoric dogs from Thailand

* Archaeological digs at prehistoric sites in Thailand have produced a collection of canine mandibles, covering period 3500BC to today
* Origins of the prehistoric dog uncertain
    * Could possibly descend from the golden jackal or the wolf
    * Wolf not native to Thailand; nearest indigenous wolves in western China or Indian subcontinent
* Data are mean values of each of six measurements of specimens from 7 canine groups

---

# Prehistoric dogs from Thailand

![](slides_files/figure-html/canine-1.svg)

---

# Graphical display; dendrograms

.row[

.col-7[
* The height is the dissimilarity at which the fusion was made.
* The length of the _stem_ represents the dissimilarities between clusters.
* _Nodes_ represent clusters; internal or terminal.
* Configuration of nodes and stems is known as the tree _topology_.
* Can flip two adjacent stems without affecting topology.
* Form a clustering by cutting the dendrogram.

]

.col-5[

![](slides_files/figure-html/canine-minimum-1.svg)
]
]

---

# Graphical display; heatmap

.row[
.col-7[

* A heatmap is another graphical display of data structure.
* Begin by reordering the rows (sites) and columns (variables) of data matrix.
* Reorder using cluster analysis on rows and columns separately.
* Central panel shows the values of the variables for each site as a shading.

]

.col-5[
![](slides_files/figure-html/heatmap-1.svg)
]
]

---

# Test for distortion

.row[
.col-8[

* The hierarchical cluster analysis represents the underlying original dissimilarities
* Check the dendrogram for distortion
* Calculate the _cophenetic_ distances; the heights on the dendrogram where samples are fused
* Calculate the Pearson correlation between the original dissimilarities and cophenetic distances; _cophenetic correlation_
* Cophenetic correlation = 0.712

]

.col-4[

![](slides_files/figure-html/cophenetic-1.svg)
]
]

---
class: inverse middle center big-subsection

# _k_-means clustering

---

# _k_-means clustering

* Hierarchical clustering has a legacy of history: once formed, clusters cannot be changed even if it would be sensible to do so.
* `$k$`-means is an iterative procedure that produces a non-hierarchical cluster analysis.
* If algorithm started from a hierarchical cluster analysis, it will be optimised.
* Best suited with _centroid_, _group-average_ or _minimum variance_ linkage methods.
* Computationally difficult; cannot be sure that an optimal solution is found.

---

# _k_-means clustering

* Given `$n$` objects in an `$m$`-dimensional space, find a partition into `$k$` groups (clusters) such that the objects within each cluster are more similar to one another than to objects in the other clusters.
* `$k$`-means minimises the within group sums of squares (WSS).
* `$k$` is chosen by the user; a scree plot of the WSS for `$k = 1,\dots,a$` where `$a$` is some small number.
* Even with modest `$n$` cannot evaluate all possible partitions to find one with lowest WSS.
* As such, algorithms have been developed that rearrange existing partitions and keep the new one only if it is an improvement.

---

# _k_-means clustering

<table>
<tr>
<td> `$n$` </td>
<td> `$k$` </td>
<td>Number of possible partitions</td>
</tr>
<tr>
<td>15</td>
<td>3</td>
<td>2,375,101</td>
</tr>
<tr>
<td>20</td>
<td>4</td>
<td>45,232,115,901</td>
</tr>
<tr>
<td>25</td>
<td>8</td>
<td>690,223,721,118,368,580</td>
</tr>
<tr>
<td>100</td>
<td>5</td>
<td> `$10^{68}$` </td>
</tr>
</table>

---

# _k_-means clustering algorithm

* `$k$`-means algorithm proceeds as follows:
    1. Find some initial partition of the individuals into `$k$` groups. May be provided by previous hierarchical cluster analysis,
    2. Calculate the change in the clustering criterion (e.g. WSS) by moving each individual from its current cluster to another,
    3. Make the change that leads to the greatest improvement in the value of the clustering criterion,
    4. Repeat steps 2 and 3 until no move of an individual leads to an improvement in the clustering criterion.
* If variables are on different scales, they should be standardised before applying `$k$`-means.
* Display results on an ordination; no hierarchy so no dendrogram.

---

# k-means clustering: Ponds

* Cluster 30 shallow ponds and pools from south east UK on basis of water chemistry.
* Run `$k$`-means for `$k = 1,\dots,10$` and collect WSS.
* No clear elbow in scree plot of WSS; change of slope at `$k = 4$`.
* Display results of `$k$`-means with `$k = 4$` on an NMDS of the Ponds data set.

---

# k-means clustering: Ponds

![](slides_files/figure-html/ponds-k-means-1.svg)![](slides_files/figure-html/ponds-k-means-2.svg)
---
class: inverse middle center big-subsection

# Fuzzy clustering

---

# Fuzzy clustering

* Some objects may clearly belong to some groups. For others, group membership is much less obvious.
* In fuzzy clustering, objects are not assigned to a particular cluster.
* Instead, each observation has a membership function indicating the _strength of membership_ in all or some groups.
* Membership function; values _between_ 0 and 1 in fuzzy clustering.
* In previous methods discussed, strength of membership has been either 0 or 1; these are _crisp_ methods.
* For continuous data, a weighted sums of squares criterion leads to a fuzzy `$k$`-means algorithm (often known as fuzzy `$c$`-means; Bezdek (1974) _J. Mathematical Biology_ 1, 57--71).
* Introduction for ecologists in Equihua (1990; _J. Ecology_ 78, 519--534).

---

# Fuzzy clustering: Ponds

* As for `$k$`-means, with results drawn on an NMDS plot.
* Segments used to show the membership function for each Pond.
* Clearly many intermediate ponds.

---

# Fuzzy clustering: Ponds

![](slides_files/figure-html/fuzzy-c-means-1.svg)

---
class: inverse middle center big-subsection

# Model-based clustering

---

# Model-based clustering

* The agglomerative, `$k$`-means and fuzzy clustering techniques described thus far are largely heuristic methods, not based on a formal statistical model.
* This makes estimating the numbers of clusters or deciding between particular methods difficult, and formal inference is precluded.
* This is not too great a problem as cluster analysis is largely an exploratory procedure.
* _Mixture models_ or _model-based_ clustering is a group of techniques that address these issues.
* Several methods for model-based clustering have been proposed; the one proposed by Scott & Symons (1971) and extended by Banfield & Raftery (1993) and Fraley & Raftery (1999, 2002) has been the most successful.

---

# Model-based clustering

* Assume that the population from which observations arise consists of `$c$` sub-populations, each corresponding to a cluster.
* Further assume that the density of a `$q$`-dimensional observation `$\mathbf{x}^{\mathrm{T}} = (x_1,\dots,x_q)$` from the `$j$`th sub-population is `$f_j(\mathbf{x}, \vartheta_j)$`, `$j = 1,\dots,c$`, for some unknown vector of parameters `$\vartheta$`.
* Next, introduce a vector `$\gamma = (\gamma_1,\dots,\gamma_n)$` where `$\gamma_i = j$` if `$\mathbf{x}_i$` is from the `$j$`th sub-population; the `$\gamma_i$` label each observation `$i = 1,\dots,n$`.
* The clustering problem now becomes one of choosing `$\vartheta = (\vartheta_1, \dots, \vartheta_c)$` and `$\gamma$` to maximise a likelihood function:

`$$L(\vartheta,\gamma) = \prod\limits^n_{i=1} f_{\gamma_i}(\mathbf{x}_i, \vartheta_{\gamma_i})$$`

---

# Model-based clustering

* If `$f_j(\mathbf{x}, \vartheta_j)$` with mean vector `$\mu_j$` and covariance matrix `$\Sigma_j$` the following log-likelihood results:
`$$L(\vartheta,\gamma) = -\frac{1}{2}\sum\limits^c_{j=1} \mathrm{trace}(\mathbf{W}_j\Sigma_j^{-1} + n \log|\Sigma_j|)$$`
* For a particular, constrained covariance matrix `$\Sigma_j$` we choose `$\gamma$` to minimise `$\mathrm{trace}(\mathbf{W})$`, i.e. the within group sums of squares.
* This tends to produce spherical clusters of generally equal sizes.
---

# Model-based clustering

* An alternative form for `$\Sigma_j$` leads to minimising `$|\mathbf{W}|$`, yielding clusters with the same elliptical slope.
* Unconstrained covariance matrices are also allowed
* Banfield and Raftery (1993) also consider criteria that allow clusters to be spherical but with different volumes, or to have different size, shape and/or orientation.

---

# Mixture models; a simple example

.row[

.col-7[

* 50 observations drawn from each of two bivariate normal distributions
* Density 1;
    * mean vector: `$\mu_x = 1.0$`, `$\mu_y = 1.0$`
    * covariance: `$\sigma_x = 1.0$`, `$\sigma_y = 1.0$`
* Density 2;
    * mean vector: `$\mu_x = 4.0$`, `$\mu_y = 5.0$`
    * covariance: `$\sigma_x = 2.0$`, `$\sigma_y = 0.5$`
* Fit a two component mixture model to retrieve the clustered data

]

.col-5[

![](slides_files/figure-html/simple-mixture-1-1.svg)
]
]

---

# Mixture models; a simple example

<table>
<tr>
<th>Cluster</th>
<th>Means</th>
<th> $ \Sigma $ </th>
</tr>
<tr>
<td>Cluster 1</td>
<td>(4.137, 5.077)</td>
<td>(2.027, 0.407)</td>
</tr>
<tr>
<td>Cluster 2</td>
<td>(1.012, 1.002)</td>
<td>(0.918, 0.869)</td>
</tr>
</table>

---

# Mixture models; a simple example

![](slides_files/figure-html/simple-mixture-2-1.svg)

---

# Model-based clustering; Ponds

.row[

.col-7[
* Fit a series of mixture models allowing spherical, diagonal and elliptical clusters, with combinations of equal or varying shape, volume and orientation.
* Used the _mclust_ package for; model selection via BIC.
* BIC selects a 2 component mixture, ellipsoidal with equal shape, varying volume.
]

.col-5[

![](slides_files/figure-html/mclust-1.svg)

]
]

---

# Mdel-based clustering; Ponds

![](slides_files/figure-html/mclust-pairs-1.svg)

---

# Model-based clustering; Ponds

![](slides_files/figure-html/mclust-proj-1.svg)

---

# Model-based clustering

* Mixture models and model-based clustering provide a formal statistical approach to cluster analysis.
* They are computationally and theoretically complex compared to agglomerative methods.
* Precision of parameter estimates may be dependent on having large sample sizes.
* The widespread availability of computer code to fit these methods in software such as **R** will hopefully lead to more convincing applications of cluster analysis than has often been the case in the past.
* Model selection tools; AIC, BIC.
* Mixture models for mixed data require non-Gaussian mixtures; _latent class analysis_

---

# Choice of clustering method

* Some opt for single link: finds distinct clusters but prone to chaining and sensitive to sampling pattern.
* Most opt for average linkage or minimum variance methods: chops data more evenly, finds spherical clusters.
* All are dependent on choosing an appropriate and ecologically meaningful dissimilarity.
* Small changes in data can cause large visual changes in the dendrogram

---

# Choice of clustering method

* Hierarchical clustering may be optimised for a chosen `$k$` using `$k$`-means.
* Fuzzy clustering may fail as well, but at least it shows the uncertainty.
* Mixture models are theoretically complex but that complexity pays in terms of model selection and choice of number of clusters

---
class: inverse middle center big-subsection

# Indicator species analysis

---

# Detection of indicator species

* A basic concept and tradition in ecology and biogeography.
* Species characteristic of e.g. particular habitat, geographical region, vegetation type.
* Adds ecological meaning to groups of sites delineated by cluster analysis.
* _Indicator species_ are species that are indicative of particular groups of sites.
* _Good_ indicator species should be found mostly in a single group _and_ be present be present at most sites in that group.
* This is an important duality: _faithfulness_ and _constancy_.

---

# INDVAL

* _INDVAL_ method of Dufrene and Legendre (1997; _Ecological Monographs_ 67, 345--366) is a well respected approach for identifying indicator species.
* INDVAL can derive indicator species for any clustering of objects.
* Indicator species values based only on within-species abundances and occurrence comparisons. Values not affected by abundances of other species.
* Significance of indicator values for each species is determined via a randomisation procedure.

---

# INDVAL

* _Specificity_ is `$A_{ij} = \mu_{\mathrm{species}_{ij}} / \mu_{\mathrm{species}_i}$`.
* _Fidelity_ is `$B_{ij} = \mu_{\mathrm{sites}_{ij}} / \mu_{\mathrm{sites}_i}$`.
* `$A_{ij}$` is maximum when species `$i$` is present only in group `$j$`.
* `$B_{ij}$` is maximum when species `$i$` is present in all sites within group `$j$`.
* `$\mathrm{INDVAL}_{ij} = (A_{ij} \times B_{ij} \times 100)\%$`.
* Indicator value for species `$i$` is the largest value of `$\mathrm{INDVAL}_{ij}$` observed over all `$j$` groups: `$\mathrm{INDVAL}_i = \mathrm{max}(\mathrm{INDVAL}_{ij})$`.
* `$\mathrm{INDVAL}_i$` will be 100% when individuals of species `$i$` are observed at all sites belonging only to group `$j$`.
* A random reallocation of sites among groups is used to test the significance of `$\mathrm{INDVAL}_i$`.

---
class: inverse middle center big-subsection

# Interpretting clusters

---

# Silhouette widths

* Deciding how many clusters should be retained is problematic
* Little good theory to guide the choice
* Silhouette plots offer a simple graphical means for assessing quality of a clustering
* Silhouette width `$s_i$` of object `$i$` is
    `$$s_i = \frac{b_i - a_i}{\mathrm{max}(a_i, b_i)}$$`
* `$a_i$` is average distance of object `$i$` to all other objects in same cluster
* `$b_i$` is smallest distance of object `$i$` to another cluster
* Hence, maximal value of `$s_i$` will be found where the intra-cluster distance `$a$` is much smaller that the inter-cluster distance `$b$`

---

# Silhouette widths: Ponds `$k$`-means

.row[

.col-7[
* `silhouette()` in the _cluster_ package
* Provide a vector of cluster memberships and the dissimilarity matrix used to cluster the data
* Returns object containing the nearest neighbouring cluster and the silhouette width `$s_i$` for each observation.
* Low values of `$s_i$` indicate observation lies between two clusters
* High values of `$s_i$`, close to 1, indicate well-clustered objects
]

.col-5[

![](slides_files/figure-html/ponds-km-silhouette-1.svg)
]
]

---

# Calinski-Harabasz criterion

Calinski, T. and Harabasz, J. (1974) A dendrite method for cluster analysis. _Commun. Stat._ **3**: 1--27.

* The Calinski-Harabasz (1974) criterion is suggested as a simple means to determine the number of clusters to retain for analysis
* Function `cascadeKM()` in _vegan_ for `$k$`-means
* Criterion computed as
    `$$\frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(n-k)}$$`
* `$\mathrm{SSW}$` \& `$\mathrm{SSB}$` within cluster and between cluster sums of squares
* `$k$` is number of clusters, `$n$` is number of observations

---

# Calinski-Harabasz criterion

![](slides_files/figure-html/ponds-km-calinski-harabasz-1.svg)

---

# Relating classifications to external sets of variables

* Cluster analysis is not an end in and of itself. It is a means to an end.
* Interpretation of results of cluster analysis can be aided through the use of external data, e.g. environmental data used to aid interpretation of species-derived clusters.
* Basic EDA graphical approaches, such as boxplots, coded scatterplots.
* Discriminants analysis.
* Indicator species analysis.

---
class: inverse middle center big-subsection

# Comparing different cluster analyses

---

# Comparing cluster analyses

.row[

.col-8[
* If several cluster analysis methods give similar results we can more confident in these results.
* Simplest way of comparing cluster analyses is via a _cross classification table_.
* Several coefficients can be also be used, such as _Rand index_ or _Cohen's `$\kappa$`_.
* Example is cross classification table for Ponds data set comparing single link and minium variance clusterings.
]

.col-4[
<table>
<tr>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td colspan=4>Ward</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td rowspan=4>Single</td>
<td>1</td>
<td>6</td>
<td>4</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>6</td>
<td>0</td>
<td>8</td>
<td>0</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
</table>
]
]

---

# Rand index

* Rand index is a measure of the similarity between two cluster analysis partitions.
* Define several values:
    * `$a$`, the number of pairs of samples in the same cluster in both partitions,
    * `$b$`, the number of pairs of samples in different clusters in both partitions,
    * `$c$`, the number of pairs of samples in same cluster in partition 1 but in different clusters in partition 2,
    * `$d$`, the number of pairs of samples in same cluster in partition 2 but in different clusters in partition 1.

---

# Rand index

* `$a$` and `$b$` are the number of agreements between the two partitions.
* `$c$` and `$d$` are the number of disagreements between the two partitions.
  `$$R = \frac{a + b}{a + b + c + d}$$`
* The Rand index takes values between 0 and 1, with 0 indicating no agreement on any pairs of samples, and 1 indicating that the partitions are exactly the same

---

# Rand index

* The standard Rand index has been shown to increase as the number of clusters increase.
* The possible range of values for the Rand index is also restricted.
* Rand index should be corrected for chance; i.e. it should measure the agreement between two partitions over and above that expected by chance.
* The corrected Rand index takes a value of 0 when the partitions are selected at random (row and column sums are fixed).
* Maximum value is again 1.

---

# Cohen's `$\kappa$`

* Cohen's `$\kappa$` is a measure of inter-rater reliability.

`$$\kappa = \frac{\mathrm{Pr}(o) - \mathrm{Pr}(e)}{1 - \mathrm{Pr}(e)}$$`

* `$\mathrm{Pr}(o) = \sum\limits^n_{i=1}\mathrm{Pr}_{ii}$`, the sum proportion agreements
* `$\mathrm{Pr}(e) = \sum\limits^n_{i=1}\mathrm{Pr}_i \times \mathrm{Pr}_i$`, the chance expected agreements.
* `$\kappa = 1$` when two partitions are in complete agreement.
* `$\kappa \approx 0$` when observed agreement approximately the same as would be expected by chance $ `\(\mathrm{Pr}(o) \approx \mathrm{Pr}(e)$` \)

---

# Suggested readings

B. Everitt, S. Landau & M. Leese, 2001, Cluster analysis 4th Edition. Arnold

A.D. Gordon, 1999, Classification. Chapman & Hall

L. Kaufman & P.J. Rousseeuw, 1990, Finding groups in data. An introduction to cluster analysis. Wiley

P. Legendre & L. Legendre, 1998, Numerical ecology. Elsevier (Third English Edition)