Hierarchical clustering, also known as hierarchical cluster analysis, is a method that groups similar data points into clusters and arranges those clusters in a hierarchy. The result is a tree of nested clusters in which the objects within each cluster are similar to each other. There are two main strategies:
Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
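To make the bottom-up idea concrete, here is a minimal sketch on five made-up points on a line (the values and the names pts and hc are illustrative assumptions, not part of the original example). Each row of hc$merge records one merge step, and hc$height gives the distance at which that merge happened.

```r
# Five hypothetical one-dimensional points; close points should merge first.
pts <- matrix(c(0, 1, 5, 7, 20), ncol = 1)
rownames(pts) <- c("a", "b", "c", "d", "e")

# Agglomerative clustering with complete linkage (the hclust default).
hc <- hclust(dist(pts), method = "complete")

# One row per merge: negative numbers are single observations,
# positive numbers refer to clusters formed in earlier rows.
hc$merge

# Merge heights increase as we move up the hierarchy.
hc$height
```

With n observations there are always n - 1 merges, so the five points here produce four rows in hc$merge.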
Hierarchical clustering dendrogram of the Iris dataset (https://en.wikipedia.org/wiki/Hierarchical_clustering)
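We can perform hierarchical clustering in R using the hclust function. Note that hclust does not take the data matrix directly: it expects a dissimilarity structure, such as the one produced by dist.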
Usage
hclust(d, method = "complete", members = NULL)
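The method argument selects the linkage criterion, that is, how the distance between two clusters is measured. As a rough sketch, the following fits the same distances with three of the linkage methods that hclust supports; the names d, methods and fits are illustrative assumptions.

```r
# Distances between the four numeric iris columns.
d <- dist(iris[, 1:4])

# Fit one tree per linkage method.
methods <- c("single", "complete", "average")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# Height of the final merge under each linkage method.
sapply(fits, function(h) max(h$height))
```

Different linkage methods can produce noticeably different trees from the same distances, so it is usually worth comparing a few.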
Data set for clustering
# We can generate random data: two groups of 50 points each
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
summary(x)
##        x                  y           
##  Min.   :-0.62946   Min.   :-0.82754  
##  1st Qu.: 0.04838   1st Qu.:-0.01679  
##  Median : 0.57912   Median : 0.43242  
##  Mean   : 0.53057   Mean   : 0.47489  
##  3rd Qu.: 0.98021   3rd Qu.: 1.01691  
##  Max.   : 1.73683   Max.   : 1.93000
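The simulated data contain two groups by construction, so they make a handy sanity check. The sketch below is an assumption-laden illustration: the seed value, the truth vector and the choice of k = 2 are not from the original example. It clusters the points and cross-tabulates the result against the known groups.

```r
# Fix the seed so the simulated data are reproducible (seed chosen arbitrarily).
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Rows 1-50 come from the first group, rows 51-100 from the second.
truth <- rep(c("group1", "group2"), each = 50)

# Cluster and cut the tree into two groups.
hc_x <- hclust(dist(x))
found <- cutree(hc_x, k = 2)

# Compare the recovered clusters with the groups used to simulate the data.
table(found, truth)
```

How cleanly the two groups separate depends on the random draw and on the linkage method.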
# We can also use the famous Iris flower data set.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50
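Now apply the hclust function to the data set, using all four numeric columns (sepal length, sepal width, petal length, and petal width). We first compute the pairwise distances with dist and then pass them to hclust.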
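# Euclidean distances between observations, based on the four numeric columns
data <- dist(iris[, 1:4])
# Hierarchical clustering with the default complete linkage
hcluster <- hclust(data)
hcluster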
## 
## Call:
## hclust(d = data)
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 150
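The hclust object stores the whole tree rather than a single partition. To get cluster labels we can cut the tree with cutree. The sketch below assumes k = 3 because the data contain three species; the names groups and tab are illustrative.

```r
# Rebuild the tree so this snippet runs on its own.
data <- dist(iris[, 1:4])
hcluster <- hclust(data)

# Cut the tree into three clusters (k = 3 is an assumption based on the
# three species in the data).
groups <- cutree(hcluster, k = 3)

# Cross-tabulate cluster membership against the known species.
tab <- table(groups, iris$Species)
tab
```

The species labels are used only for comparison here; the clustering itself never sees them.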
Let’s plot the cluster dendrogram.
# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hcluster)
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(20, 19),
                cex = 0.7, col = c("green", "yellow"))
plot(hcd, xlab = "Height", nodePar = nodePar,
     main = "Cluster dendrogram",
     edgePar = list(col = c("red", "blue"), lwd = 2:1),
     horiz = TRUE)

nodePar:
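It is a list of plotting parameters to use for the nodes, such as pch, cex and col. Each of these can have length two, in which case the first value applies to inner nodes and the second to leaves. lab.cex sets the size of the leaf labels.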
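As a small illustration of the two-value convention, the sketch below draws symbols at the leaves only by setting the inner-node pch to NA. It uses a random sample of 20 iris rows so the labels stay readable; the seed, the sample size and the names idx, small and leaf_only are assumptions made for this example.

```r
# Take a small random subset so the dendrogram is easy to read.
set.seed(1)
idx <- sample(nrow(iris), 20)
small <- as.dendrogram(hclust(dist(iris[idx, 1:4])))

# pch = c(NA, 19): no symbol at inner nodes, filled circle at leaves.
leaf_only <- list(lab.cex = 0.7, pch = c(NA, 19), cex = 0.8, col = "blue")

plot(small, nodePar = leaf_only, main = "Leaves only")
```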