Abstract

This practical will use diatom and hydro-chemistry data for a series of 30 ponds and pools from southern England, illustrating hierarchical, k-means and fuzzy clustering techniques. The ponds data set consists of mean annual chemistry for 30 lowland ponds and pools in south-east England spanning a range of water chemistry along a total phosphorus (TP) gradient. In addition, diatom data from the surface sediments of these ponds and pools are available. The chemistry and diatom data were collected as part of a surface-sediment diatom and water chemistry data set used to generate a transfer function for inferring past TP values from fossil diatom assemblages preserved in lake sediment cores.

Tasks

  1. Use R to perform a nearest neighbour, furthest neighbour, unweighted group average, weighted group average and minimum variance cluster analysis of the standardised water chemistry, and compare the resulting classifications;
  2. Perform a k-means cluster analysis of the standardised water chemistry data and compare this classification with the agglomerative methods used in 1;
  3. Perform a cluster analysis of the Vltava river valley data.

Hierarchical agglomerative cluster analysis

Hierarchical agglomerative cluster analysis methods start with all samples in separate clusters and sequentially fuse the two most similar clusters until all samples are in a single cluster. A variety of measures of similarity (dissimilarity or distance coefficients) can be employed, as can a range of ways in which the clusters are fused (e.g. single link, group average link, minimum variance).

The functions hclust() and agnes() (in package cluster) are the two standard functions for performing hierarchical agglomerative cluster analysis in R. The dist() function is used to create distance matrices using a variety of distance (or dissimilarity) coefficients. First, read in the ponds data set:

ponds_url <- "https://bit.ly/pondsfull"
ponds <- read.csv(url(ponds_url))
names(ponds)
##  [1] "AC001A"       "AC013A"       "AC013E"       "AM011A"       "AM012A"      
##  [6] "AS001A"       "AU002A"       "AU003B"       "CC001A"       "CC002A"      
## [11] "CC9997"       "CM004A"       "CO001A"       "CY002A"       "CY003A"      
## [16] "CY009A"       "CY011A"       "FR001A"       "FR002A"       "FR002C"      
## [21] "FR006A"       "FR006E"       "FR009B"       "FR018A"       "FR019A"      
## [26] "GO013A"       "NA004A"       "NA007A"       "NA022A"       "NA042A"      
## [31] "NA114A"       "NI009A"       "NI014A"       "NI015A"       "NI083A"      
## [36] "NI196A"       "NI9969"       "NI9971"       "OP001A"       "ST001A"      
## [41] "ST002A"       "ST010A"       "SU016A"       "SY002A"       "SY003A"      
## [46] "SY003C"       "SY010A"       "UN9992"       "pH"           "Conductivity"
## [51] "Alkalinity"   "TP"           "SiO2"         "NO3"          "Na"          
## [56] "K"            "Mg"           "Ca"           "Cl"           "SO4"         
## [61] "Chla"         "Secchi"       "Maxdepth"
rownames(ponds)
##  [1] "4"   "7"   "31"  "34"  "37"  "42"  "50"  "53"  "57"  "58"  "65"  "69" 
## [13] "73"  "74"  "76"  "79"  "82"  "83"  "85"  "86"  "98"  "100" "101" "105"
## [25] "107" "108" "112" "113" "114" "120"
str(ponds)
## 'data.frame':    30 obs. of  63 variables:
##  $ AC001A      : num  0 0.36 0.9 0.17 0 ...
##  $ AC013A      : num  0.55 3.4 1.08 0.52 6.84 ...
##  $ AC013E      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AM011A      : num  0.74 1.07 0.9 0.69 2.54 0 0 0.94 0.34 0.34 ...
##  $ AM012A      : num  0.92 8.05 5.39 0.35 2.34 0.73 0 0 0.34 0 ...
##  $ AS001A      : num  1.66 0.36 0 0 0.19 0.36 2.4 2.83 0 1.18 ...
##  $ AU002A      : num  4.6 0 0 0 0 ...
##  $ AU003B      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CC001A      : num  0 2.15 0 9.69 0 0 0.2 0 0 0 ...
##  $ CC002A      : num  0 3.4 0.18 7.96 0 0 0.2 2.26 0 0 ...
##  $ CC9997      : num  0 0 0.18 15.23 0 ...
##  $ CM004A      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CO001A      : num  0.18 3.94 0.72 0.52 2.15 0.91 2.4 0 0 4.06 ...
##  $ CY002A      : num  1.11 1.97 0 3.46 0.59 ...
##  $ CY003A      : num  0 3.04 0 2.77 0.19 0 1.2 0.94 0.17 1.35 ...
##  $ CY009A      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CY011A      : num  0 0.72 0 9.17 0 0 0.2 0.75 0 0 ...
##  $ FR001A      : num  3.13 3.22 2.15 2.77 1.17 ...
##  $ FR002A      : num  5.53 0 7.54 0 0.98 18.4 0 0 0 0 ...
##  $ FR002C      : num  7 0.18 37.52 1.38 1.95 ...
##  $ FR006A      : num  53.22 1.07 18.13 3.11 8.4 ...
##  $ FR006E      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FR009B      : num  0 2.68 0 0 33.59 ...
##  $ FR018A      : num  0 1.25 0.72 4.67 1.37 7.47 0.2 3.95 0.34 9.65 ...
##  $ FR019A      : num  0 0.18 1.44 0.52 0 0 6.01 1.88 1.21 0.51 ...
##  $ GO013A      : num  0 1.07 0.54 0 2.34 0 3.01 0.75 0 0.85 ...
##  $ NA004A      : num  0 0.36 0 0.17 0 0 1.4 0.38 0.52 0 ...
##  $ NA007A      : num  0 0.89 0 0.35 0.98 0.36 6.41 2.26 0.52 0.34 ...
##  $ NA022A      : num  0 0 0 0 0 0 0 1.7 0 0.17 ...
##  $ NA042A      : num  0 0 0 0.17 0 0 1 0.19 0.86 0.85 ...
##  $ NA114A      : num  4.97 0 1.44 0 0 0 3.61 0.38 8.62 4.57 ...
##  $ NI009A      : num  0.37 1.97 0 0.17 1.95 0 0.4 1.32 1.55 0.85 ...
##  $ NI014A      : num  0 0.72 1.98 0 1.56 0.18 0 0 0 0 ...
##  $ NI015A      : num  0.18 0.72 0 2.08 0.39 0 0.2 0.38 0.17 0 ...
##  $ NI083A      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NI196A      : num  0 0 0 0 0.59 0 0.4 1.13 0.52 0 ...
##  $ NI9969      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NI9971      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OP001A      : num  0 0 0 0 0 0 0 0 7.76 0 ...
##  $ ST001A      : num  0.18 2.5 3.59 6.92 0.98 0 0 1.13 0 0 ...
##  $ ST002A      : num  0 3.22 0 0.52 0.39 0 0 2.45 0 0 ...
##  $ ST010A      : num  0.55 7.87 1.79 10.03 1.56 ...
##  $ SU016A      : num  0 0 0 0.35 0 0 0 0.19 0 0 ...
##  $ SY002A      : num  0 0 0 0 0 0 6.21 0 0 2.2 ...
##  $ SY003A      : num  3.13 3.94 0.54 0 0 0 1.4 1.7 0 1.86 ...
##  $ SY003C      : num  0.92 0 0.18 0 0 0 0 0 0 0 ...
##  $ SY010A      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ UN9992      : num  0 0 0 0 0 0 0 0 0 4.57 ...
##  $ pH          : num  8.01 8 8.16 7.83 7.91 ...
##  $ Conductivity: num  2.68 2.76 2.7 2.65 2.57 ...
##  $ Alkalinity  : num  0.539 0.546 0.602 0.35 0.376 ...
##  $ TP          : num  1.89 2.22 2.8 2.66 2.04 ...
##  $ SiO2        : num  1.27 1.61 1.63 1.57 1.51 ...
##  $ NO3         : num  -0.0333 0.1957 -0.0301 0.1511 0.7469 ...
##  $ Na          : num  2.59 3.09 2.79 3.09 2.64 ...
##  $ K           : num  1.9 2.39 2.38 2.16 1.71 ...
##  $ Mg          : num  2.66 2.78 2.59 2.48 2.21 ...
##  $ Ca          : num  3.55 3.73 3.65 3.45 3.54 ...
##  $ Cl          : num  2.79 3.22 2.8 3.05 2.74 ...
##  $ SO4         : num  3.21 3.29 3.08 2.99 2.77 ...
##  $ Chla        : num  1.52 2.3 2.43 2 1.19 ...
##  $ Secchi      : num  0.837 0.775 1.342 0.548 1.483 ...
##  $ Maxdepth    : num  1.73 1.1 1.1 1 1.41 ...

The ponds object you have just created contains observations on 30 ponds. Columns 1 through 48 of the ponds object contain the diatom counts of surface samples taken from each of the 30 ponds. The remaining columns contain the mean water chemistry data for each pond taken over the period represented by the diatom surface samples.

For the sake of simplicity during the remainder of the practical, we will subset ponds into a new object containing just the standardised water chemistry data. As its name suggests, the subset() function can be used to return a subset of a data frame. The scale() function can be used to centre (using the center argument) and/or scale (using the scale argument) the columns of a data frame or matrix.

pondsenv <- subset(ponds, select = pH:Maxdepth)
pondsenv <- scale(pondsenv, scale = TRUE, center = FALSE)

Using subset() we choose which columns to retain via the select argument, stating the names of the columns. Note that we have scaled the columns of the pondsenv object but we have not centred them (i.e. subtracted the column mean from each value in that column). Because center = FALSE, scale() divides each column by its root mean square, sqrt(sum(x^2)/(n - 1)), rather than by its standard deviation (see ?scale). We now proceed to perform a nearest neighbour clustering of the pondsenv data:
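To see exactly what scale(x, scale = TRUE, center = FALSE) does, here is a small self-contained check on a made-up matrix (per ?scale, when center = FALSE each column is divided by its root mean square rather than its standard deviation):

```r
# Illustration on made-up data: with center = FALSE, scale() divides each
# column by its root mean square, sqrt(sum(x^2)/(n - 1))
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
xs <- scale(x, scale = TRUE, center = FALSE)
rms <- apply(x, 2, function(col) sqrt(sum(col^2) / (length(col) - 1)))
all.equal(as.vector(xs), as.vector(sweep(x, 2, rms, "/")))  # TRUE
```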

env.ed <- dist(pondsenv)
pond.slink <- hclust(env.ed, method = "single")
pond.slink
## 
## Call:
## hclust(d = env.ed, method = "single")
## 
## Cluster method   : single 
## Distance         : euclidean 
## Number of objects: 30
plot(pond.slink, hang = -0.01, main = "Ponds Hydrochemistry Data",
     sub = "Single Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.slink, k = 4, border = "red")

cor(coph.slink <- cophenetic(pond.slink), env.ed)
## [1] 0.5318913

In the first line, dist() is used to create a Euclidean distance matrix and store it in env.ed for use later in the practical, when different clustering techniques are compared. hclust() performs the actual cluster analysis and requires as arguments a dissimilarity matrix, such as that produced by dist(), and the name of the clustering method to use. The resulting pond.slink object contains the results needed to produce the dendrogram of the cluster analysis but, in keeping with R's philosophy, only a brief summary of the object is printed when the object's name is entered at the prompt.

The dendrogram of the cluster analysis is produced using plot(), with the hang argument instructing the plot method to draw the labels of the nodes of the dendrogram at the baseline of the plot. rect.hclust() is used to cut the dendrogram into k clusters (in this case 4) and draw rectangles around the resulting k clusters on the dendrogram. Finally, the cophenetic correlation is calculated, using the cophenetic() function to first generate the cophenetic distances between ponds from the dendrogram 1, and then the cor() function to calculate the Pearson product-moment correlation between the original Euclidean distances and the cophenetic distances. The cophenetic correlation is one measure of how well the dendrogram represents the original distance matrix and, therefore, how well it represents the true distances between samples.
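As a self-contained aside, the same two-step recipe of cophenetic() followed by cor() works for any hclust result; here it is applied to the built-in USArrests data rather than the ponds data:

```r
# Cophenetic correlation on a built-in data set, independent of the ponds data
d  <- dist(scale(USArrests))
cl <- hclust(d, method = "single")
cor(cophenetic(cl), d)  # a single value in [-1, 1]; higher = more faithful
```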

An alternative way of visualising how faithfully a dendrogram represents the original dissimilarity matrix is via a Shepard-like plot, in which the original distances (or dissimilarities) are plotted on the x-axis and the corresponding cophenetic distances on the y-axis. A one-to-one line is drawn through the points to aid interpretation; deviations from this line indicate how faithfully the original distances are represented by the dendrogram. If the dendrogram fully represented the original distances, all the points would fall on the one-to-one line.

plot(env.ed, coph.slink)
abline(a = 0, b = 1)

What does the Shepard-like plot indicate about the faithfulness of the nearest neighbour dendrogram in representing the original Euclidean distances?

Repeat the above commands to perform the remaining clustering methods on the pondsenv data set. The commands needed are reproduced below. Try to come to a consensus as to how many clusters there are in the water chemistry data.

# complete link
pond.clink <- hclust(env.ed, method = "complete")
plot(pond.clink, hang = -0.01, main = "Ponds Hydrochemistry Data",
     sub = "Complete Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.clink, k = 4, border = "red")
cor(coph.clink <- cophenetic(pond.clink), env.ed)
## [1] 0.6543283
plot(env.ed, coph.clink)
abline(a = 0, b = 1)
# average link
pond.avg <- hclust(env.ed, method = "average")
plot(pond.avg, hang = -0.01, main = "Ponds Hydrochemistry Data",
     sub = "Average Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.avg, k = 4, border = "red")
cor(coph.avg <- cophenetic(pond.avg), env.ed)
## [1] 0.7093294
plot(env.ed, coph.avg)
abline(a = 0, b = 1)
# Ward's minimum variance clustering
pond.min <- hclust(env.ed, method = "ward.D")
plot(pond.min, hang = -0.01, main = "Ponds Hydrochemistry Data",
     sub = "Minimum Variance", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.min, k = 4, border = "red")
cor(coph.min <- cophenetic(pond.min), env.ed)
## [1] 0.5672972
plot(env.ed, coph.min)
abline(a = 0, b = 1)
# weighted average link
require("cluster")
## Loading required package: cluster
pond.wavg <- agnes(env.ed, method = "weighted")
pond.wavg <- as.hclust(pond.wavg)
plot(pond.wavg, hang = -0.01, main = "Ponds Hydrochemistry Data",
     sub = "Weighted Average Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.wavg, k = 4, border = "red")
cor(coph.wavg <- cophenetic(pond.wavg), env.ed)
## [1] 0.7481309
plot(env.ed, coph.wavg)
abline(a = 0, b = 1)

The final method applied, weighted group average clustering, is not implemented in base R. Instead, the agnes() function from the cluster package is used. The as.hclust() function converts the output from agnes() into an object of class "hclust", so that the plot(), rect.hclust() and cophenetic() functions used above can be applied to it.

All cluster methods are designed to find clusters in any given data-set and as such, it is important to produce a number of clusterings of a data-set using a range of cluster techniques to try to find some consensus or agreement between the results of the various techniques employed. There are a number of ways in which this can be done. One simple way is to calculate the cophenetic correlation between all available clusterings of a given data-set using the cor() function.

cor(cbind(coph.slink, coph.clink, coph.avg, coph.wavg, coph.min, env.ed))
##            coph.slink coph.clink  coph.avg coph.wavg  coph.min    env.ed
## coph.slink  1.0000000  0.3545052 0.6246113 0.5416554 0.1905182 0.5318913
## coph.clink  0.3545052  1.0000000 0.7001381 0.8023645 0.4645943 0.6543283
## coph.avg    0.6246113  0.7001381 1.0000000 0.8110084 0.5285280 0.7093294
## coph.wavg   0.5416554  0.8023645 0.8110084 1.0000000 0.4231211 0.7481309
## coph.min    0.1905182  0.4645943 0.5285280 0.4231211 1.0000000 0.5672972
## env.ed      0.5318913  0.6543283 0.7093294 0.7481309 0.5672972 1.0000000

cbind() was used to bind the vectors of distances into a matrix, because cor() expects to be passed a matrix-like object as its argument.

Are any of the clusterings closely correlated with each other and with the original Euclidean distances?

An alternative approach is to calculate the percentage of samples assigned to the same cluster and the Rand coefficient. The compareMatchedClasses() function in package e1071 can be used to calculate these statistics. Given two vectors of cluster memberships it compares those two classifications; given a matrix of memberships, as below, it compares every pair of columns and reports the results in matrix form.

# install.packages("e1071")
require(e1071)
## Loading required package: e1071
mem.slink <- cutree(pond.slink, k = 4)
mem.clink <- cutree(pond.clink, k = 4)
mem.avg <- cutree(pond.avg, k = 4)
mem.wavg <- cutree(pond.wavg, k = 4)
mem.min <- cutree(pond.min, k = 4)
members <- cbind(mem.slink, mem.clink, mem.avg, mem.wavg, mem.min)
compareMatchedClasses(members)
## $diag
##      [,1]     [,2]      [,3]      [,4]      [,5]
## [1,]   NA 0.326087 0.5000000 0.4901961 0.3095238
## [2,]   NA       NA 0.5283019 0.5576923 0.4722222
## [3,]   NA       NA        NA 0.9666667 0.4166667
## [4,]   NA       NA        NA        NA 0.3846154
## [5,]   NA       NA        NA        NA        NA
## 
## $kappa
##      [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]   NA 0.03778677 0.09866667 0.09734513 0.03791469
## [2,]   NA         NA 0.21411625 0.26445264 0.27848101
## [3,]   NA         NA         NA 0.90936556 0.14864865
## [4,]   NA         NA         NA         NA 0.12359551
## [5,]   NA         NA         NA         NA         NA
## 
## $rand
##      [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]   NA 0.4347826 0.4977376 0.4839216 0.4367015
## [2,]   NA        NA 0.5776488 0.6010558 0.6111111
## [3,]   NA        NA        NA 0.9402299 0.4746032
## [4,]   NA        NA        NA        NA 0.3846154
## [5,]   NA        NA        NA        NA        NA
## 
## $crand
##      [,1]       [,2]       [,3]      [,4]        [,5]
## [1,]   NA -0.0127354 0.08084006 0.0707203 -0.02605023
## [2,]   NA         NA 0.11487677 0.1617804  0.17221740
## [3,]   NA         NA         NA 0.8733483  0.10836604
## [4,]   NA         NA         NA        NA  0.02547242
## [5,]   NA         NA         NA        NA          NA

The first five lines all use the cutree() function to return the cluster membership of each sample when the dendrogram is cut into the specified k classes. In this case we have chosen to cut each dendrogram into k = 4 classes, so cutree() returns a vector of class memberships (1–4 in our case). Cohen’s \(\kappa\) is the percentage of data points (samples) in the same cluster, adjusted for chance agreement. The Rand index is presented both unadjusted and adjusted for chance.
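To make the unadjusted Rand index concrete, here is a small hand-rolled version (an illustration only; use compareMatchedClasses() in practice). It is the proportion of sample pairs on which two clusterings agree, i.e. pairs placed together in both clusterings or apart in both:

```r
# Toy Rand index: proportion of sample pairs treated the same way by both
# clusterings (together in both, or separated in both)
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]
  mean(same_a == same_b)
}
rand_index(c(1, 1, 2, 2), c(1, 1, 2, 3))  # 5/6: only the pair (3, 4) disagrees
```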

Do any of the classifications show any consensus as to the resulting clusterings of the water chemistry data?

k-means cluster analysis

Agglomerative cluster analysis methods impose a hierarchy on the between-sample distances; a hierarchy that is unlikely to exist in reality, but one which helps reduce the computational effort required to produce a clustering of the samples. Alternative approaches are available that do not impose a hierarchy on the data set being clustered. The k-means clustering algorithm partitions a data set into an a priori defined number of clusters by minimising the total error sum of squares, \(E_k^2\) (TESS); that is, the distances between cluster members and the cluster centroid are minimised. The algorithm is, to a degree, sensitive to the initial positions of the cluster centroids and can converge to a local minimum. Because the initial starting positions for the centroids are specified by the user, care must be taken to ensure that the algorithm does not converge towards a local minimum.

Several solutions can be used to help avoid convergence towards local minima:

  1. Provide an initial configuration based on an ecological hypothesis;
  2. Provide an initial configuration corresponding to the results of a hierarchical cluster analysis;
  3. Provide an initial configuration by choosing at random k samples from the total data-set to act as the k group centroids. It is important to remember that this solution may still lead to a local minimum. Therefore, this procedure should be repeated many (say 100) times each with a different set of random samples as the starting points. The solution that minimises \(E_k^2\) should then be retained.
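Solution 3 is built into base R's kmeans(): the nstart argument repeats the algorithm from that many random initial configurations and retains the solution with the smallest total within-cluster sum of squares. A minimal sketch on made-up data:

```r
# Repeated random starts guard against local minima: nstart = 100 keeps the
# best (lowest total within-cluster sum of squares) of 100 random starts
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 100)
km$tot.withinss  # the retained (minimised) total within-cluster sum of squares
```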

Using the kmeans() function in R it is easy to implement solutions 2 or 3 from the list above; for the purposes of this practical, you will provide the k-means algorithm with an initial configuration based on one of the hierarchical cluster analyses performed earlier. We proceed by loading the MASS package and creating a temporary matrix of the water chemistry data to assist in calculating the initial starting configuration, which is stored in initial.

require(MASS) # returns TRUE if the package is loaded or can be loaded, FALSE otherwise
## Loading required package: MASS
pondsenv.mat <- as.matrix(pondsenv)
# initial <- tapply(pondsenv.mat, list(rep(cutree(pond.avg, k = 4),
#     ncol(pondsenv.mat)), col(pondsenv.mat)),
# mean)
# dimnames(initial) <- list(NULL, dimnames(pondsenv.mat)[[2]])
# initial

initial <- aggregate(pondsenv ~ cutree(pond.avg, k = 4), FUN = mean)[, -1]
initial
##          pH Conductivity Alkalinity        TP      SiO2        NO3        Na
## 1 0.9906810    0.9881786  0.7738583 0.9732949 0.8987840  0.2763468 0.9882921
## 2 0.9686033    0.9498077  0.8444599 0.9718913 1.2580812  2.1319248 0.8700127
## 3 0.8783098    0.9307489 -0.5210372 0.9059333 1.0425081  1.5048369 1.0240844
## 4 1.0084576    0.9965745  0.6618439 0.9435906 0.5747634 -0.4682216 0.9526573
##           K        Mg        Ca        Cl       SO4      Chla    Secchi
## 1 0.9939912 0.9867959 0.9873855 0.9878328 0.9893787 0.9851775 0.8660565
## 2 0.8245920 0.8838306 0.9845333 0.8653552 0.8892995 0.7244555 1.2139745
## 3 0.9026149 0.9959402 0.8782816 1.0343345 0.9224448 0.9179478 0.6075038
## 4 1.0473468 0.9743230 1.0291234 0.9851449 1.0873895 0.4719017 1.6520901
##    Maxdepth
## 1 0.8930688
## 2 0.6140988
## 3 0.9644373
## 4 2.3653447

The code producing initial is compact but easy to unpack. The formula pondsenv ~ cutree(pond.avg, k = 4) instructs aggregate() to split the rows of pondsenv according to the cluster membership vector obtained by cutting the average link dendrogram into 4 clusters, and FUN = mean then computes the mean of each water chemistry variable within each cluster. The result is a data frame whose first column holds the cluster labels; the trailing [, -1] drops that column, leaving one row of mean water chemistry per cluster to act as the initial cluster centroids.

For your own data, all you need to do is substitute your own data matrix, your own dendrogram and your chosen value of k. (The commented-out tapply() code above is an older, more opaque way of achieving the same result.)

Now that we have calculated the initial starting configuration, we can proceed with the k-means clustering:

ponds.km <- kmeans(pondsenv, centers = initial)
ponds.km
## K-means clustering with 4 clusters of sizes 10, 7, 6, 7
## 
## Cluster means:
##          pH Conductivity Alkalinity        TP      SiO2          NO3        Na
## 1 1.0147997    1.0174257  1.1395204 0.9909872 0.9547134 -0.092901819 1.0024274
## 2 1.0070371    1.0033378  1.0986669 1.1009471 1.1024407  0.902807066 0.9912853
## 3 0.9738466    0.9511577  0.5986458 0.8716487 0.9580308  1.708050113 0.9019637
## 4 0.9152708    0.9313164 -0.2788570 0.8834023 0.7132049 -0.009911867 0.9935459
##           K        Mg        Ca        Cl       SO4      Chla    Secchi
## 1 1.0415040 0.9974897 1.0099982 1.0017756 1.0059339 0.9614444 1.1044150
## 2 1.0289529 0.9791206 1.0101287 0.9802753 1.0180834 1.2774124 0.5221512
## 3 0.8158069 0.9388798 0.9745891 0.9090489 0.9306542 0.7267176 1.2101862
## 4 0.9527984 0.9769680 0.9168744 1.0034129 0.9393455 0.7441124 0.7620074
##    Maxdepth
## 1 1.1229066
## 2 0.7948381
## 3 0.7679073
## 4 0.8813988
## 
## Clustering vector:
##   4   7  31  34  37  42  50  53  57  58  65  69  73  74  76  79  82  83  85  86 
##   1   2   1   2   3   1   3   4   3   4   2   1   4   4   2   1   1   4   4   1 
##  98 100 101 105 107 108 112 113 114 120 
##   2   2   4   3   2   3   1   1   3   1 
## 
## Within cluster sum of squares by cluster:
## [1] 7.632418 3.236025 6.381747 4.382761
##  (between_SS / total_SS =  58.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The cluster component of ponds.km contains the cluster membership for each of the 30 ponds. The centers component contains the water chemistry values for the cluster centres, whilst the withinss and size components contain, respectively, the within-cluster sum of squares (the sum of the squared distances of each cluster member from the cluster centre; the error) and the number of objects within each of the clusters.

As the k-means method does not impose a hierarchy on the data, it is not possible to view the results of a k-means clustering in dendrogram form. An alternative is to plot the data and the associated clustering on an ordination of the original data, which could be produced using principal co-ordinates analysis (PCoA), non-metric multi-dimensional scaling (NMDS) or, as we shall now show, principal components analysis (PCA). Enter the following code chunk to display the results of the k-means clustering (once identify() is called you will need to click on the plotted points to identify them; end with a right click):

ponds.pca <- prcomp(pondsenv)
screeplot(ponds.pca)

ponds.pred <- predict(ponds.pca, newdata = pondsenv)
dimnames(ponds.km$centers)[[2]] <- dimnames(pondsenv)[[2]]
pond.kmcenters <- predict(ponds.pca, ponds.km$centers)
eqscplot(ponds.pred[, 1:2], type = "n", xlab = "First principal component",
      ylab = "Second principal component", main = "K-means clustering")
text(ponds.pred[, 1:2], labels = ponds.km$cluster, cex = 0.9,
      col = ponds.km$cluster)
points(pond.kmcenters[, 1:2], pch = 3, cex = 3)
identify(ponds.pred[, 1:2], cex = 0.5)

## integer(0)

The first line uses prcomp() to calculate a principal components analysis of the ponds hydrochemical data, and the next line displays a scree plot of the eigenvalues (more on scree plots in John’s lecture on indirect ordination). The third line predicts where in the PCA space all the samples should lie, and the fourth line copies the variable names from the pondsenv object onto the cluster centres. The eqscplot() function from the MASS package is used to plot the PCA results, but with plotting of the data suppressed (type = "n"); this just sets up the axes and plotting region for us to plot on to. The final three lines do the actual plotting. First, we plot each pond using as a symbol the number of the cluster that sample belongs to, colouring the points by cluster to help differentiate the clusters (col = ponds.km$cluster). The penultimate line adds the cluster centres as large crosses. Finally, identify() is used to identify the samples: click the left mouse button on each of the samples in the plot to display its sample ID, and press the right mouse button to finish.

Canine data

canine <- read.csv(url("https://bit.ly/canine-data"))
head(canine)
##           Group MandibleB MandibleH MolarLen MolarB Molar13 Premolar14
## 1    Modern dog       9.7      21.0     19.4    7.7    32.0       36.5
## 2 Golden jackal       8.1      16.7     18.3    7.0    30.3       32.9
## 3  Chinese wolf      13.5      27.3     26.8   10.6    41.9       49.1
## 4   Indian wolf      11.5      24.3     24.5    9.3    40.0       44.6
## 5          Cuon      10.7      23.5     21.4    8.5    28.8       37.6
## 6         Dingo       9.6      22.6     21.1    8.3    34.4       43.1
rownames(canine) <- canine[, 1]
canine <- canine[, -1]
canine
##                 MandibleB MandibleH MolarLen MolarB Molar13 Premolar14
## Modern dog            9.7      21.0     19.4    7.7    32.0       36.5
## Golden jackal         8.1      16.7     18.3    7.0    30.3       32.9
## Chinese wolf         13.5      27.3     26.8   10.6    41.9       49.1
## Indian wolf          11.5      24.3     24.5    9.3    40.0       44.6
## Cuon                 10.7      23.5     21.4    8.5    28.8       37.6
## Dingo                 9.6      22.6     21.1    8.3    34.4       43.1
## Prehistoric dog      10.3      22.1     19.1    8.1    32.2       35.0

Vltava river valley forest vegetation

Vegetation plots were located at even distances along transects following the steep slopes of the Vltava river valley and were collected during 2001–2003. Each transect starts at the valley bottom and ends at the upper part of the valley slope. Altogether 97 plots, located in 27 transects, were collected, each of size 10 × 15 m. In each plot, all tree, shrub and herb species were recorded and their abundances estimated on the 9-degree ordinal Braun-Blanquet scale (these values were subsequently transformed to a percentage scale).

v_url <- "https://bit.ly/vltava-spp"
vltava <- read.delim(url(v_url), row.names = 1)
head(vltava)
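For task 3, one possible approach (a sketch under assumptions, shown on a toy community matrix so that it runs stand-alone; the same commands apply to vltava): percentage-cover species data suit an ecologically motivated distance such as the Hellinger distance, i.e. the Euclidean distance between square-rooted row proportions, which needs only base R.

```r
# Sketch: Hellinger distance plus Ward clustering on a toy community matrix
# standing in for the vltava species data
set.seed(42)
spp <- matrix(rpois(60, lambda = 3), nrow = 10,
              dimnames = list(paste0("plot", 1:10), paste0("sp", 1:6)))
hellinger <- function(x) dist(sqrt(x / rowSums(x)))  # rows must sum > 0
spp.cl <- hclust(hellinger(spp), method = "ward.D2")
plot(spp.cl, main = "Toy Community Data", sub = "Ward (Hellinger distance)",
     ylab = "Dissimilarity", xlab = "")
```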

  1. The distances are stored in coph.slink for later use. This illustrates a useful feature of the R language: we do not need to create an object with a separate command before using it. Here, the coph.slink object is both created and used as an argument to cor() in a single line of code.