This practical uses diatom and hydro-chemistry data for a series of 30 ponds and pools from south-east England to illustrate hierarchical, k-means and fuzzy clustering techniques. The ponds data set consists of mean annual chemistry for 30 lowland ponds and pools spanning a range of water chemistry along a total phosphorus (TP) gradient. In addition, diatom data from the surface sediments of these ponds and pools are available. These chemistry and diatom data were collected as part of a surface-sediment diatom–water chemistry data set used to generate a transfer function for inferring past TP values from fossil diatom assemblages preserved in lake sediment cores.
Hierarchical agglomerative cluster analysis methods start with all samples in separate clusters and sequentially fuse the two most similar clusters until all samples are in a single cluster. A variety of dissimilarity or distance coefficients can be employed, as can a range of different ways in which the clusters are fused (e.g. single link, group average link, minimum variance).
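As a minimal sketch on made-up data (nothing below comes from the ponds data set), the fusion rules can be compared by clustering the same distance matrix with two different linkage methods:

```r
## A minimal sketch on made-up data: the same distance matrix fused
## under two different linkage rules
set.seed(42)
xy <- matrix(rnorm(20), ncol = 2)            # 10 samples, 2 variables
d <- dist(xy)                                # Euclidean distances by default
hc.single <- hclust(d, method = "single")    # nearest neighbour fusion
hc.average <- hclust(d, method = "average")  # group average fusion
## each object stores the 9 fusion heights; the rules give different trees
range(hc.single$height)
range(hc.average$height)
```

The single link heights record the smallest between-cluster distance at each fusion, so they tend to be lower than the corresponding group average heights for the same data.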
The functions hclust()
, and agnes()
(in package cluster
) are the two standard functions for performing hierarchical agglomerative cluster analysis in R. The dist()
function is used to create distance matrices using a variety of distance (or dissimilarity) coefficients. First, read in the ponds data set:
ponds_url <- "https://bit.ly/pondsfull"
ponds <- read.csv(url(ponds_url))
names(ponds)
## [1] "AC001A" "AC013A" "AC013E" "AM011A" "AM012A"
## [6] "AS001A" "AU002A" "AU003B" "CC001A" "CC002A"
## [11] "CC9997" "CM004A" "CO001A" "CY002A" "CY003A"
## [16] "CY009A" "CY011A" "FR001A" "FR002A" "FR002C"
## [21] "FR006A" "FR006E" "FR009B" "FR018A" "FR019A"
## [26] "GO013A" "NA004A" "NA007A" "NA022A" "NA042A"
## [31] "NA114A" "NI009A" "NI014A" "NI015A" "NI083A"
## [36] "NI196A" "NI9969" "NI9971" "OP001A" "ST001A"
## [41] "ST002A" "ST010A" "SU016A" "SY002A" "SY003A"
## [46] "SY003C" "SY010A" "UN9992" "pH" "Conductivity"
## [51] "Alkalinity" "TP" "SiO2" "NO3" "Na"
## [56] "K" "Mg" "Ca" "Cl" "SO4"
## [61] "Chla" "Secchi" "Maxdepth"
rownames(ponds)
## [1] "4" "7" "31" "34" "37" "42" "50" "53" "57" "58" "65" "69"
## [13] "73" "74" "76" "79" "82" "83" "85" "86" "98" "100" "101" "105"
## [25] "107" "108" "112" "113" "114" "120"
str(ponds)
## 'data.frame': 30 obs. of 63 variables:
## $ AC001A : num 0 0.36 0.9 0.17 0 ...
## $ AC013A : num 0.55 3.4 1.08 0.52 6.84 ...
## $ AC013E : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AM011A : num 0.74 1.07 0.9 0.69 2.54 0 0 0.94 0.34 0.34 ...
## $ AM012A : num 0.92 8.05 5.39 0.35 2.34 0.73 0 0 0.34 0 ...
## $ AS001A : num 1.66 0.36 0 0 0.19 0.36 2.4 2.83 0 1.18 ...
## $ AU002A : num 4.6 0 0 0 0 ...
## $ AU003B : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CC001A : num 0 2.15 0 9.69 0 0 0.2 0 0 0 ...
## $ CC002A : num 0 3.4 0.18 7.96 0 0 0.2 2.26 0 0 ...
## $ CC9997 : num 0 0 0.18 15.23 0 ...
## $ CM004A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CO001A : num 0.18 3.94 0.72 0.52 2.15 0.91 2.4 0 0 4.06 ...
## $ CY002A : num 1.11 1.97 0 3.46 0.59 ...
## $ CY003A : num 0 3.04 0 2.77 0.19 0 1.2 0.94 0.17 1.35 ...
## $ CY009A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CY011A : num 0 0.72 0 9.17 0 0 0.2 0.75 0 0 ...
## $ FR001A : num 3.13 3.22 2.15 2.77 1.17 ...
## $ FR002A : num 5.53 0 7.54 0 0.98 18.4 0 0 0 0 ...
## $ FR002C : num 7 0.18 37.52 1.38 1.95 ...
## $ FR006A : num 53.22 1.07 18.13 3.11 8.4 ...
## $ FR006E : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FR009B : num 0 2.68 0 0 33.59 ...
## $ FR018A : num 0 1.25 0.72 4.67 1.37 7.47 0.2 3.95 0.34 9.65 ...
## $ FR019A : num 0 0.18 1.44 0.52 0 0 6.01 1.88 1.21 0.51 ...
## $ GO013A : num 0 1.07 0.54 0 2.34 0 3.01 0.75 0 0.85 ...
## $ NA004A : num 0 0.36 0 0.17 0 0 1.4 0.38 0.52 0 ...
## $ NA007A : num 0 0.89 0 0.35 0.98 0.36 6.41 2.26 0.52 0.34 ...
## $ NA022A : num 0 0 0 0 0 0 0 1.7 0 0.17 ...
## $ NA042A : num 0 0 0 0.17 0 0 1 0.19 0.86 0.85 ...
## $ NA114A : num 4.97 0 1.44 0 0 0 3.61 0.38 8.62 4.57 ...
## $ NI009A : num 0.37 1.97 0 0.17 1.95 0 0.4 1.32 1.55 0.85 ...
## $ NI014A : num 0 0.72 1.98 0 1.56 0.18 0 0 0 0 ...
## $ NI015A : num 0.18 0.72 0 2.08 0.39 0 0.2 0.38 0.17 0 ...
## $ NI083A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NI196A : num 0 0 0 0 0.59 0 0.4 1.13 0.52 0 ...
## $ NI9969 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NI9971 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OP001A : num 0 0 0 0 0 0 0 0 7.76 0 ...
## $ ST001A : num 0.18 2.5 3.59 6.92 0.98 0 0 1.13 0 0 ...
## $ ST002A : num 0 3.22 0 0.52 0.39 0 0 2.45 0 0 ...
## $ ST010A : num 0.55 7.87 1.79 10.03 1.56 ...
## $ SU016A : num 0 0 0 0.35 0 0 0 0.19 0 0 ...
## $ SY002A : num 0 0 0 0 0 0 6.21 0 0 2.2 ...
## $ SY003A : num 3.13 3.94 0.54 0 0 0 1.4 1.7 0 1.86 ...
## $ SY003C : num 0.92 0 0.18 0 0 0 0 0 0 0 ...
## $ SY010A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UN9992 : num 0 0 0 0 0 0 0 0 0 4.57 ...
## $ pH : num 8.01 8 8.16 7.83 7.91 ...
## $ Conductivity: num 2.68 2.76 2.7 2.65 2.57 ...
## $ Alkalinity : num 0.539 0.546 0.602 0.35 0.376 ...
## $ TP : num 1.89 2.22 2.8 2.66 2.04 ...
## $ SiO2 : num 1.27 1.61 1.63 1.57 1.51 ...
## $ NO3 : num -0.0333 0.1957 -0.0301 0.1511 0.7469 ...
## $ Na : num 2.59 3.09 2.79 3.09 2.64 ...
## $ K : num 1.9 2.39 2.38 2.16 1.71 ...
## $ Mg : num 2.66 2.78 2.59 2.48 2.21 ...
## $ Ca : num 3.55 3.73 3.65 3.45 3.54 ...
## $ Cl : num 2.79 3.22 2.8 3.05 2.74 ...
## $ SO4 : num 3.21 3.29 3.08 2.99 2.77 ...
## $ Chla : num 1.52 2.3 2.43 2 1.19 ...
## $ Secchi : num 0.837 0.775 1.342 0.548 1.483 ...
## $ Maxdepth : num 1.73 1.1 1.1 1 1.41 ...
The ponds
object you have just created contains observations on 30 ponds. Columns 1 through 48 of the ponds
object contain the diatom counts of surface samples taken from each of the 30 ponds. The remaining columns contain the mean water chemistry data for each pond taken over the period represented by the diatom surface samples.
For the sake of simplicity during the remainder of the practical, we will subset ponds
into a new object containing just the standardised water chemistry data. As its name suggests, the subset()
function can be used to return a subset of a data frame. The scale()
function can be used to center (using the center
argument) or standardise (using the scale
argument) the columns of a data frame or matrix.
pondsenv <- subset(ponds, select = pH:Maxdepth)
pondsenv <- scale(pondsenv, scale = TRUE, center = FALSE)
Using subset()
we choose which columns to retain using the select
argument and by stating the names of the columns. Note that we have standardised the columns of the pondsenv
object (dividing the values in each column by the standard deviation of the values in that column) but we have not centered these data (i.e. subtracted the column mean from each value in that column). We now proceed to perform a nearest neighbour (single link) clustering of the pondsenv
data:
env.ed <- dist(pondsenv)
pond.slink <- hclust(env.ed, method = "single")
pond.slink
##
## Call:
## hclust(d = env.ed, method = "single")
##
## Cluster method : single
## Distance : euclidean
## Number of objects: 30
plot(pond.slink, hang = -0.01, main = "Ponds Hydrochemistry Data",
sub = "Single Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.slink, k = 4, border = "red")
cor(coph.slink <- cophenetic(pond.slink), env.ed)
## [1] 0.5318913
In the first line, dist()
is used to create a Euclidean distance matrix and store it in env.ed
for use later in the practical when different clustering techniques are compared. hclust()
performs the actual cluster analysis and requires a dissimilarity matrix from dist()
and the clustering method to use to be given as arguments. The resulting pond.slink
object contains the results needed to produce the dendrogram of the cluster analysis, but in keeping with R’s philosophy, only a brief summary of the object is presented when the object’s name is entered at the prompt.
The dendrogram of the cluster analysis is produced using plot()
, with the hang
argument instructing the plot method to draw the labels of the nodes of the dendrogram at the baseline of the plot. rect.hclust()
is used to cut the dendrogram into k clusters (in this case 4) and then draw rectangles around the resulting k clusters on the dendrogram. Finally, the cophenetic correlation is calculated, using the cophenetic()
function to first generate the cophenetic distances between ponds from the dendrogram¹, and then the cor()
function to calculate the Pearson product moment correlation between the original Euclidean distances and the cophenetic distances. The cophenetic correlation is one measure of how well the dendrogram represents the original distance matrix and therefore, how well it represents the real distance between samples.
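To see what cophenetic() actually returns, consider a tiny made-up example (not part of the ponds data): the cophenetic distance between two samples is simply the dendrogram height at which they are first fused into the same cluster.

```r
## Toy illustration of cophenetic distances: three points on a line
x <- c(0, 1, 10)
d <- dist(x)                       # pairwise distances: 1, 10, 9
hc <- hclust(d, method = "single")
cophenetic(hc)
## points 1 and 2 fuse at height 1; point 3 joins that cluster at
## height 9, so its cophenetic distance to both 1 and 2 is 9
```

A dendrogram with a high cophenetic correlation is one whose fusion heights reproduce the original pairwise distances closely.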
An alternative way of visualising how faithfully a dendrogram represents the original dissimilarity matrix is via a Shepard-like plot. In a Shepard-like plot, the original distances (or dissimilarities) are plotted on the x-axis against the corresponding cophenetic distances on the y-axis. A one-to-one line is drawn through the points to aid interpretation; deviations from this line indicate how faithfully the original distances are represented by the dendrogram. If the dendrogram fully represented the original distances, all the points would fall on the one-to-one line.
plot(env.ed, coph.slink)
abline(a = 0, b = 1)
What does the Shepard-like plot indicate about the faithfulness of the nearest neighbour dendrogram in representing the original Euclidean distances?
Repeat the above commands to perform the remaining clustering methods on the pondsenv
data set. The commands needed are reproduced below. Try to come to a consensus as to how many clusters there are in the water chemistry data.
# complete link
pond.clink <- hclust(env.ed, method = "complete")
plot(pond.clink, hang = -0.01, main = "Ponds Hydrochemistry Data",
sub = "Complete Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.clink, k = 4, border = "red")
cor(coph.clink <- cophenetic(pond.clink), env.ed)
## [1] 0.6543283
plot(env.ed, coph.clink)
abline(a = 0, b = 1)
# average link
pond.avg <- hclust(env.ed, method = "average")
plot(pond.avg, hang = -0.01, main = "Ponds Hydrochemistry Data",
sub = "Average Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.avg, k = 4, border = "red")
cor(coph.avg <- cophenetic(pond.avg), env.ed)
## [1] 0.7093294
plot(env.ed, coph.avg)
abline(a = 0, b = 1)
# Ward's minimum variance clustering
pond.min <- hclust(env.ed, method = "ward.D")
plot(pond.min, hang = -0.01, main = "Ponds Hydrochemistry Data",
sub = "Minimum Variance", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.min, k = 4, border = "red")
cor(coph.min <- cophenetic(pond.min), env.ed)
## [1] 0.5672972
plot(env.ed, coph.min)
abline(a = 0, b = 1)
# weighted average link
require("cluster")
## Loading required package: cluster
pond.wavg <- agnes(env.ed, method = "weighted")
pond.wavg <- as.hclust(pond.wavg)
plot(pond.wavg, hang = -0.01, main = "Ponds Hydrochemistry Data",
sub = "Weighted Average Link", ylab = "Dissimilarity", xlab = "")
rect.hclust(pond.wavg, k = 4, border = "red")
cor(coph.wavg <- cophenetic(pond.wavg), env.ed)
## [1] 0.7481309
plot(env.ed, coph.wavg)
abline(a = 0, b = 1)
The final method applied, weighted group average clustering, is not implemented in base R. Instead, the agnes()
function from the cluster
package is used. The as.hclust()
function is used to convert the output from agnes()
into a hclust()
object.
All cluster methods are designed to find clusters in any given data set, whether real structure exists or not. As such, it is important to produce a number of clusterings of a data set using a range of cluster techniques to try to find some consensus or agreement between the results of the various techniques employed. There are a number of ways in which this can be done. One simple way is to calculate the correlations between all available cophenetic distances (and the original distances) for a given data set using the cor()
function.
cor(cbind(coph.slink, coph.clink, coph.avg, coph.wavg, coph.min, env.ed))
## coph.slink coph.clink coph.avg coph.wavg coph.min env.ed
## coph.slink 1.0000000 0.3545052 0.6246113 0.5416554 0.1905182 0.5318913
## coph.clink 0.3545052 1.0000000 0.7001381 0.8023645 0.4645943 0.6543283
## coph.avg 0.6246113 0.7001381 1.0000000 0.8110084 0.5285280 0.7093294
## coph.wavg 0.5416554 0.8023645 0.8110084 1.0000000 0.4231211 0.7481309
## coph.min 0.1905182 0.4645943 0.5285280 0.4231211 1.0000000 0.5672972
## env.ed 0.5318913 0.6543283 0.7093294 0.7481309 0.5672972 1.0000000
cbind()
was used to bind each of the vectors of distances into a matrix, because cor()
expects to be passed a matrix-like object as its argument.
Are any of the clusterings closely correlated with each other and with the original Euclidean distances?
An alternative approach is to calculate the percentage of samples assigned to the same cluster and the Rand coefficient. The compareMatchedClasses()
function in package e1071
can be used to calculate these statistics. The function accepts either two vectors of cluster memberships as arguments, or a single matrix whose columns each contain a classification, in which case every pair of columns is compared and the results are returned in matrix form.
# install.packages("e1071")
require(e1071)
## Loading required package: e1071
mem.slink <- cutree(pond.slink, k = 4)
mem.clink <- cutree(pond.clink, k = 4)
mem.avg <- cutree(pond.avg, k = 4)
mem.wavg <- cutree(pond.wavg, k = 4)
mem.min <- cutree(pond.min, k = 4)
members <- cbind(mem.slink, mem.clink, mem.avg, mem.wavg, mem.min)
compareMatchedClasses(members)
## $diag
## [,1] [,2] [,3] [,4] [,5]
## [1,] NA 0.326087 0.5000000 0.4901961 0.3095238
## [2,] NA NA 0.5283019 0.5576923 0.4722222
## [3,] NA NA NA 0.9666667 0.4166667
## [4,] NA NA NA NA 0.3846154
## [5,] NA NA NA NA NA
##
## $kappa
## [,1] [,2] [,3] [,4] [,5]
## [1,] NA 0.03778677 0.09866667 0.09734513 0.03791469
## [2,] NA NA 0.21411625 0.26445264 0.27848101
## [3,] NA NA NA 0.90936556 0.14864865
## [4,] NA NA NA NA 0.12359551
## [5,] NA NA NA NA NA
##
## $rand
## [,1] [,2] [,3] [,4] [,5]
## [1,] NA 0.4347826 0.4977376 0.4839216 0.4367015
## [2,] NA NA 0.5776488 0.6010558 0.6111111
## [3,] NA NA NA 0.9402299 0.4746032
## [4,] NA NA NA NA 0.3846154
## [5,] NA NA NA NA NA
##
## $crand
## [,1] [,2] [,3] [,4] [,5]
## [1,] NA -0.0127354 0.08084006 0.0707203 -0.02605023
## [2,] NA NA 0.11487677 0.1617804 0.17221740
## [3,] NA NA NA 0.8733483 0.10836604
## [4,] NA NA NA NA 0.02547242
## [5,] NA NA NA NA NA
The first five lines all utilise the cutree()
function to return the cluster membership of each sample when the dendrogram is cut into the specified k classes. In this case we have chosen to cut each dendrogram into k = 4 classes, so cutree()
returns a vector of class memberships (1–4 in our case). In the output, $diag gives the proportion of samples assigned to the same cluster by each pair of classifications, Cohen's \(\kappa\) adjusts that proportion for chance, and the Rand index is presented both unadjusted ($rand) and adjusted ($crand) for chance.
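As a small example of the two-vector form (made-up memberships, assuming package e1071 is installed), compare two classifications of eight samples that differ in two samples:

```r
## Toy illustration: agreement statistics for two classifications
library(e1071)
a <- c(1, 1, 1, 2, 2, 2, 3, 3)
b <- c(1, 1, 2, 2, 2, 3, 3, 3)        # two samples reassigned
compareMatchedClasses(a, b)$diag      # proportion of matched samples
classAgreement(table(a, b))$crand     # Rand index adjusted for chance
```

Six of the eight samples fall in matching clusters, so the matched proportion is 0.75, while the chance-corrected statistics are noticeably lower.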
Do any of the classifications show any consensus as to the resulting clusterings of the water chemistry data?
Agglomerative cluster analysis methods impose a hierarchy of between-sample distances; a hierarchy that is unlikely to exist in reality, but one which helps reduce the computational effort required to produce a clustering of the samples. Alternative solutions are available that do not impose a hierarchy on the data set being clustered. The k-means clustering algorithm partitions a data set into an a priori defined number of clusters by minimising the total error sum of squares, \(E_k^2\) (TESS), which is to say that the distances between cluster members and the cluster centroid are minimised. The algorithm is, to a certain degree, sensitive to the initial positions of the cluster centroids, which can result in the algorithm converging to a local minimum. The initial starting positions for the centroids can be specified by the user, so care must be taken to ensure that the algorithm does not converge towards a local minimum.
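The sensitivity to starting positions can be sketched on made-up data (nothing below comes from the ponds data set): kmeans() reports the total error sum of squares as tot.withinss, and its nstart argument runs the algorithm from several random starting configurations, keeping the best solution found.

```r
## Toy illustration of k-means and random starts
set.seed(1)
z <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))  # two well-separated groups
km1  <- kmeans(z, centers = 4, nstart = 1)         # a single random start
km25 <- kmeans(z, centers = 4, nstart = 25)        # best of 25 random starts
km1$tot.withinss                                   # TESS for each solution
km25$tot.withinss
```

With many random starts the reported TESS is the smallest found, which guards against (though does not guarantee avoidance of) local minima.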
Several approaches can help avoid convergence towards local minima: running the algorithm from many different random starting configurations (via the nstart argument), or supplying a sensible initial configuration directly, for example one derived from a hierarchical cluster analysis. Both are easy to implement using the kmeans()
function in R, and for the purposes of this practical you will provide the k-means algorithm with an initial configuration based on one of the hierarchical cluster analyses performed earlier. We proceed by loading the MASS
package, and creating a temporary matrix containing the water chemistry data to assist in calculating the initial starting configuration, which is stored in initial
.
require(MASS) # returns TRUE if the package was loaded (or already is), FALSE if not
## Loading required package: MASS
pondsenv.mat <- as.matrix(pondsenv)
# initial <- tapply(pondsenv.mat, list(rep(cutree(pond.avg, k = 4),
# ncol(pondsenv.mat)), col(pondsenv.mat)),
# mean)
# dimnames(initial) <- list(NULL, dimnames(pondsenv.mat)[[2]])
# initial
initial <- aggregate(pondsenv ~ cutree(pond.avg, k = 4), FUN = mean)[, -1]
initial
## pH Conductivity Alkalinity TP SiO2 NO3 Na
## 1 0.9906810 0.9881786 0.7738583 0.9732949 0.8987840 0.2763468 0.9882921
## 2 0.9686033 0.9498077 0.8444599 0.9718913 1.2580812 2.1319248 0.8700127
## 3 0.8783098 0.9307489 -0.5210372 0.9059333 1.0425081 1.5048369 1.0240844
## 4 1.0084576 0.9965745 0.6618439 0.9435906 0.5747634 -0.4682216 0.9526573
## K Mg Ca Cl SO4 Chla Secchi
## 1 0.9939912 0.9867959 0.9873855 0.9878328 0.9893787 0.9851775 0.8660565
## 2 0.8245920 0.8838306 0.9845333 0.8653552 0.8892995 0.7244555 1.2139745
## 3 0.9026149 0.9959402 0.8782816 1.0343345 0.9224448 0.9179478 0.6075038
## 4 1.0473468 0.9743230 1.0291234 0.9851449 1.0873895 0.4719017 1.6520901
## Maxdepth
## 1 0.8930688
## 2 0.6140988
## 3 0.9644373
## 4 2.3653447
The code producing initial
looks a little daunting at first, but is easier to understand if it is broken down into its constituent parts. The aggregate()
call splits the rows of pondsenv
according to the cluster membership vector returned by cutree()
(the average link dendrogram cut into k = 4 clusters) and calculates the mean of each water chemistry variable within each cluster. The [, -1]
at the end drops the first column of the result, which contains the cluster labels themselves, leaving one row of mean chemistry per cluster.
The commented-out lines show an equivalent, more involved, way of computing the same table using tapply()
. There, the first argument is the temporary matrix of water chemistry data we just created (it needs to be a matrix, because col()
and ncol()
only work with a matrix object), and the last argument, mean
, tells R to calculate the mean value of some combination of the values in pondsenv.mat
. That combination is defined by the second argument: a list containing the cluster membership vector repeated once for each column of pondsenv.mat
(via ncol()
) and a matrix of integers giving the column number of each value (from col()
). Together these cross-classify every value by cluster and by variable, and tapply()
calculates the mean within each cell of that cross-classification. The dimnames()
line then copies the variable names from the original data frame across to the result.
If you don't understand this, don't worry about it; it is sufficient to know that for your own data all you need to do is substitute your own data, your own dendrogram, and your chosen value of k.
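The equivalence of the two routes can be checked on a small made-up matrix (not the ponds data):

```r
## Toy illustration: group means per column via tapply() and aggregate()
m <- matrix(1:12, ncol = 3)     # 4 samples (rows), 3 variables (columns)
grp <- c(1, 1, 2, 2)            # a cluster membership vector
## cross-classify every value by group and by column, then average
means.tap <- tapply(m, list(rep(grp, ncol(m)), col(m)), mean)
means.tap
## the same table via the formula interface to aggregate()
means.agg <- aggregate(m ~ grp, FUN = mean)[, -1]
means.agg
```

Both give a 2 × 3 table of per-cluster, per-variable means; aggregate() is simply the more compact spelling.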
Now that we have calculated the initial starting configuration, we can proceed with the k-means clustering:
ponds.km <- kmeans(pondsenv, centers = initial)
ponds.km
## K-means clustering with 4 clusters of sizes 10, 7, 6, 7
##
## Cluster means:
## pH Conductivity Alkalinity TP SiO2 NO3 Na
## 1 1.0147997 1.0174257 1.1395204 0.9909872 0.9547134 -0.092901819 1.0024274
## 2 1.0070371 1.0033378 1.0986669 1.1009471 1.1024407 0.902807066 0.9912853
## 3 0.9738466 0.9511577 0.5986458 0.8716487 0.9580308 1.708050113 0.9019637
## 4 0.9152708 0.9313164 -0.2788570 0.8834023 0.7132049 -0.009911867 0.9935459
## K Mg Ca Cl SO4 Chla Secchi
## 1 1.0415040 0.9974897 1.0099982 1.0017756 1.0059339 0.9614444 1.1044150
## 2 1.0289529 0.9791206 1.0101287 0.9802753 1.0180834 1.2774124 0.5221512
## 3 0.8158069 0.9388798 0.9745891 0.9090489 0.9306542 0.7267176 1.2101862
## 4 0.9527984 0.9769680 0.9168744 1.0034129 0.9393455 0.7441124 0.7620074
## Maxdepth
## 1 1.1229066
## 2 0.7948381
## 3 0.7679073
## 4 0.8813988
##
## Clustering vector:
## 4 7 31 34 37 42 50 53 57 58 65 69 73 74 76 79 82 83 85 86
## 1 2 1 2 3 1 3 4 3 4 2 1 4 4 2 1 1 4 4 1
## 98 100 101 105 107 108 112 113 114 120
## 2 2 4 3 2 3 1 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 7.632418 3.236025 6.381747 4.382761
## (between_SS / total_SS = 58.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The cluster
component of ponds.km
contains the cluster membership for each of the 30 ponds. The centers
component contains the water chemistry values for the cluster centers, whilst the withinss
and size
components contain, respectively, the within-cluster sum of squares (the sum of the squared distances of each cluster member to the cluster center; the error) and the number of objects within each cluster.
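The relationship between withinss and the distances to the cluster centres can be verified on made-up data (a sketch, not part of the practical's data set):

```r
## Toy check: withinss is the sum of squared distances to the cluster centre
set.seed(7)
z <- matrix(rnorm(30), ncol = 2)          # 15 samples, 2 variables
km <- kmeans(z, centers = 2, nstart = 10)
wss <- sapply(1:2, function(i) {
  members <- z[km$cluster == i, , drop = FALSE]
  ## subtract the cluster centre from each member, then square and sum
  sum(scale(members, center = km$centers[i, ], scale = FALSE)^2)
})
all.equal(unname(wss), unname(km$withinss))
```

Because the centres returned by kmeans() are the means of their cluster members, the hand-computed sums of squares match the withinss component exactly.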
As the k-means method does not utilise a hierarchy in the analysis it is not possible to view the results of a k-means clustering in dendrogram form. An alternative is to plot the data and associated clustering on a biplot of the original data. The biplot could be produced using principal co-ordinates analysis (PCoA), non-metric multi-dimensional scaling (NMDS) or, as we shall now show, principal components analysis (PCA). Enter the following code chunk to display the results of the k-means clustering (after identify()
you will need to click on the plotted points to identify them, end with a right click):
ponds.pca <- prcomp(pondsenv)
screeplot(ponds.pca)
ponds.pred <- predict(ponds.pca, newdata = pondsenv)
dimnames(ponds.km$centers)[[2]] <- dimnames(pondsenv)[[2]]
pond.kmcenters <- predict(ponds.pca, ponds.km$centers)
eqscplot(ponds.pred[, 1:2], type = "n", xlab = "First principal component",
ylab = "Second principal component", main = "K-means clustering")
text(ponds.pred[, 1:2], labels = ponds.km$cluster, cex = 0.9,
col = ponds.km$cluster)
points(pond.kmcenters[, 1:2], pch = 3, cex = 3)
identify(ponds.pred[, 1:2], cex = 0.5)
## integer(0)
The first line uses prcomp()
to calculate a principal components analysis of the ponds hydrochemical data. The next line displays a scree plot of the eigenvalues (more on scree plots in John's lecture on Indirect Ordination). The third line predicts where in the PCA space all the samples should lie, and the fourth line just adds the variable names from the pondsenv
object to the names of the cluster centers. The eqscplot()
function from the MASS package is used to plot the PCA results, but we suppress plotting of any of the data (type = "n"
); this just sets up the axes and plotting region for us to plot on to. The final three lines do the actual plotting. Firstly, we plot each pond on the biplot using as a symbol a number representing which cluster that sample is in, and to help differentiate the clusters we plot them in different colours (col = ponds.km$cluster
). The penultimate line adds the cluster centers as large crosses. Finally, identify()
is used to identify the samples. Click the left mouse button on each of the samples in the plot to display their sample ID; press the right mouse button to finish.
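The use of predict() to place points in an existing PCA space can be sketched on made-up data (any matrix with the same variables can be projected this way):

```r
## Toy illustration: projecting samples and a centroid into PCA space
set.seed(3)
z <- matrix(rnorm(50), ncol = 5)     # 10 samples, 5 variables
pc <- prcomp(z)
scores <- predict(pc, newdata = z)   # training samples give the PCA scores
all.equal(scores, pc$x)
## any point with the same variables can be projected, e.g. a centroid;
## prcomp() centres the data, so the grand centroid maps to the origin
centroid <- matrix(colMeans(z), nrow = 1)
predict(pc, newdata = centroid)
```

This is exactly how the k-means cluster centres were placed on the biplot above: the centres are just new points in the same variable space.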
canine <- read.csv(url("https://bit.ly/canine-data"))
head(canine)
## Group MandibleB MandibleH MolarLen MolarB Molar13 Premolar14
## 1 Modern dog 9.7 21.0 19.4 7.7 32.0 36.5
## 2 Golden jackal 8.1 16.7 18.3 7.0 30.3 32.9
## 3 Chinese wolf 13.5 27.3 26.8 10.6 41.9 49.1
## 4 Indian wolf 11.5 24.3 24.5 9.3 40.0 44.6
## 5 Cuon 10.7 23.5 21.4 8.5 28.8 37.6
## 6 Dingo 9.6 22.6 21.1 8.3 34.4 43.1
rownames(canine) <- canine[, 1]
canine <- canine[, -1]
canine
## MandibleB MandibleH MolarLen MolarB Molar13 Premolar14
## Modern dog 9.7 21.0 19.4 7.7 32.0 36.5
## Golden jackal 8.1 16.7 18.3 7.0 30.3 32.9
## Chinese wolf 13.5 27.3 26.8 10.6 41.9 49.1
## Indian wolf 11.5 24.3 24.5 9.3 40.0 44.6
## Cuon 10.7 23.5 21.4 8.5 28.8 37.6
## Dingo 9.6 22.6 21.1 8.3 34.4 43.1
## Prehistoric dog 10.3 22.1 19.1 8.1 32.2 35.0
Vegetation plots, located at even distances along transects following the steep slopes of the Vltava river valley, were collected during 2001–2003. Each transect starts at the valley bottom and ends at the upper part of the valley slope. Altogether 97 plots located in 27 transects were collected, each of size 10 × 15 m. In each plot, all tree, shrub and herb species were recorded and their abundances estimated using the 9-degree ordinal Braun-Blanquet scale (these values were subsequently transformed to a percentage scale).
v_url <- "https://bit.ly/vltava-spp"
vltava <- read.delim(url(v_url), row.names = 1)
head(vltava)
The distances are stored in coph.slink
for later use. This illustrates a useful feature of the R language: we do not need to create an object in a separate command before using it. Here, the coph.slink
object is both created and used as an argument to cor()
in a single line of code↩