Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar data points into clusters. The result is a hierarchy of clusters in which the objects within each cluster are similar to each other. Strategies for hierarchical clustering generally fall into two types:
Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
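To make the bottom-up merging concrete, here is a minimal sketch in R on five made-up one-dimensional points. The values and the names pts and hc are illustrative and not part of the original post. It relies on hclust from base R, which performs agglomerative clustering, with complete linkage.

```r
# Five illustrative 1-D observations (hypothetical values, chosen to avoid ties)
pts <- c(1, 2, 6, 8, 20)

# Agglomerative clustering with complete linkage on Euclidean distances
hc <- hclust(dist(pts), method = "complete")

# Each row of hc$merge is one merge step: negative entries are single
# observations, positive entries refer to clusters formed at earlier steps
print(hc$merge)

# Distance at which each merge happened, in merge order: 1, 2, 7, 19
print(hc$height)
```

With n observations there are always n - 1 merges, so hc$merge has four rows here. The first two merges join the closest pairs (1 with 2, then 6 with 8), and the isolated point 20 is absorbed last.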
Hierarchical clustering dendrogram of the Iris dataset (https://en.wikipedia.org/wiki/Hierarchical_clustering)
We can perform hierarchical clustering in R using the function hclust. It takes a dissimilarity structure, as produced by dist, rather than the raw data matrix.
Usage
hclust(d, method = "complete", members = NULL)
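The method argument selects the linkage rule, that is, how the distance between two clusters is computed when deciding which pair to merge. The sketch below is an illustration rather than part of the original analysis: it runs three of the linkages accepted by hclust on the built-in iris measurements and cuts each tree into three groups so the cluster sizes can be compared. The names d, methods, and sizes are placeholders.

```r
# Euclidean distances between the four numeric iris columns
d <- dist(iris[, 1:4])

# Three of the linkage rules that hclust() accepts
methods <- c("complete", "single", "average")

# For each linkage, build the tree, cut it into 3 clusters, and count members
sizes <- lapply(methods, function(m) {
  hc <- hclust(d, method = m)
  table(cutree(hc, k = 3))
})
names(sizes) <- methods

print(sizes)
```

Different linkages can produce quite different group sizes from the same distances, so it is worth checking more than one before settling on the default.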
Data set for clustering
# We can generate random data:
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
summary(x)
##        x                  y
##  Min.   :-0.62946   Min.   :-0.82754
##  1st Qu.: 0.04838   1st Qu.:-0.01679
##  Median : 0.57912   Median : 0.43242
##  Mean   : 0.53057   Mean   : 0.47489
##  3rd Qu.: 0.98021   3rd Qu.: 1.01691
##  Max.   : 1.73683   Max.   : 1.93000
# We can also use the famous Iris flower data set.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
Now compute the pairwise distances with dist and apply the hclust function to the data set, using all four numeric columns (sepal length, sepal width, petal length, and petal width).
data <- dist(iris[, 1:4])
hcluster <- hclust(data)
hcluster
##
## Call:
## hclust(d = data)
##
## Cluster method   : complete
## Distance         : euclidean
## Number of objects: 150
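A common next step, not shown in the original post, is to cut the tree into a fixed number of groups with cutree and compare them with the known species labels. The choice of k = 3 below is an assumption based on iris having three species, and the names groups and comparison are illustrative.

```r
# Rebuild the clustering from the post so this block runs on its own
data <- dist(iris[, 1:4])
hcluster <- hclust(data)

# Cut the tree into 3 groups (assumed here because iris has 3 species)
groups <- cutree(hcluster, k = 3)

# Cross-tabulate cluster membership against the true species
comparison <- table(cluster = groups, species = iris$Species)
print(comparison)
```

Rows of the table are clusters and columns are species, so a clean clustering would put each species almost entirely in a single row.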
Let’s plot the cluster dendrogram.
# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hcluster)
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(20, 19), cex = 0.7, col = c("green", "yellow"))
plot(hcd, xlab = "Height", nodePar = nodePar, main = "Cluster dendrogram",
     edgePar = list(col = c("red", "blue"), lwd = 2:1), horiz = TRUE)

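nodePar:
It is a list of plotting parameters to use for the nodes, such as pch, cex, col, and lab.cex. A component given as a length-two vector, as with pch and col above, sets the value for inner nodes first and for leaves second.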
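As a small sketch of how nodePar behaves, the example below plots a 15-flower subset of iris and draws symbols on the leaves only, by setting the first pch entry to NA. The subset, the object names, and the colours are arbitrary choices for illustration.

```r
# Take 5 flowers from each species so the leaf labels stay readable
rows <- c(1:5, 51:55, 101:105)
small_hc <- hclust(dist(iris[rows, 1:4]))
small_hcd <- as.dendrogram(small_hc)

# Length-two entries: first value for inner nodes, second for leaves.
# pch = NA suppresses the symbol on inner nodes.
leaf_only <- list(pch = c(NA, 19), cex = 0.7, col = c("grey40", "blue"),
                  lab.cex = 0.6)

plot(small_hcd, nodePar = leaf_only, horiz = TRUE,
     main = "Leaf symbols only")
```

The leaf labels are the original row numbers of iris, which makes it easy to see whether flowers from the same species end up on the same branch.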
I am Nilesh Kumar, a graduate student at the Department of Biology, UAB, under the mentorship of Dr. Shahid Mukhtar. I joined UAB in Spring 2018 and am working on Network Biology. My research interests are network modeling, mathematical modeling, game theory, artificial intelligence, and their application in systems biology.
I graduated with a master’s degree, “Master of Technology, Information Technology (Specialization in Bioinformatics)”, in 2015 from the Indian Institute of Information Technology Allahabad, India, with a GATE scholarship. My master’s thesis was entitled “Mirtron Prediction through machine learning approach”. I worked as a research fellow at the International Centre for Genetic Engineering and Biotechnology, New Delhi, for two years.