Hierarchical clustering using R

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar data points into clusters. The result is a hierarchy of clusters in which the objects within each cluster are similar to one another. There are two general strategies:

Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
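
In R, hclust implements the agglomerative approach; for the divisive approach a common choice is diana() from the cluster package. The following is a minimal sketch (assuming the cluster package is installed, and using a small simulated matrix purely for illustration) showing the two side by side:

# Agglomerative (bottom-up) clustering with hclust
m <- matrix(rnorm(40), ncol = 2)   # small example matrix
agg <- hclust(dist(m))

# Divisive (top-down) clustering with diana() from the 'cluster' package
library(cluster)
div <- diana(dist(m))

# Plot the two dendrograms side by side
par(mfrow = c(1, 2))
plot(agg, main = "Agglomerative (hclust)")
pltree(div, main = "Divisive (diana)")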

Hierarchical clustering dendrogram of the Iris dataset (https://en.wikipedia.org/wiki/Hierarchical_clustering)

We can perform hierarchical clustering on a data matrix in R with the hclust function.

Usage

hclust(d, method = "complete", members = NULL)
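
Here d is a dissimilarity structure as produced by dist(), and method selects the agglomeration (linkage) rule: besides the default "complete", hclust also accepts "single", "average", "ward.D", "ward.D2", "mcquitty", "median" and "centroid". As a small sketch of the method argument (the built-in USArrests data set is used only as a convenient example):

d <- dist(USArrests)            # Euclidean distances between the rows
hclust(d)                       # default: complete linkage
hclust(d, method = "average")   # average (UPGMA) linkage
hclust(d, method = "ward.D2")   # Ward's method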

Data set for clustering

# We can generate random data:
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
summary(x)
##        x                  y           
##  Min.   :-0.62946   Min.   :-0.82754  
##  1st Qu.: 0.04838   1st Qu.:-0.01679  
##  Median : 0.57912   Median : 0.43242  
##  Mean   : 0.53057   Mean   : 0.47489  
##  3rd Qu.: 0.98021   3rd Qu.: 1.01691  
##  Max.   : 1.73683   Max.   : 1.93000
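
Since x was simulated as two groups (means 0 and 1), a quick sketch of clustering it and recovering those two groups might look like this:

# Cluster the simulated data and cut the tree into two groups
hx <- hclust(dist(x))
groups <- cutree(hx, k = 2)
table(groups)
plot(x, col = groups, pch = 19, main = "Two clusters recovered by hclust")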
# We can also use the famous Iris flower data set.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Now we use the hclust function on the data set, with all four numeric columns (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width).

data <- dist(iris[, 1:4])   # Euclidean distance matrix (the default) on the four numeric columns
hcluster <- hclust(data)    # complete-linkage clustering (the default method)
hcluster
## 
## Call:
## hclust(d = data)
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 150
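
Because the iris data carry known species labels, one way to sanity-check the tree is to cut it into three clusters with cutree() and cross-tabulate the result against Species (the exact counts depend on the linkage method, complete linkage here):

# Cut the tree into 3 clusters and compare with the known species
clusters <- cutree(hcluster, k = 3)
table(clusters, iris$Species)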

Let’s plot the cluster dendrogram.

# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hcluster)
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(20, 19),
                cex = 0.7, col = c("green","yellow"))
plot(hcd,  xlab = "Height", nodePar = nodePar, main = "Cluster dendrogram",
     edgePar = list(col = c("red","blue"), lwd = 2:1), horiz = TRUE)

nodePar:
A list of plotting parameters (such as pch, cex, col and lab.cex) applied to the nodes of the dendrogram; each component may have length two, so that inner nodes and leaves can be styled separately.
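
edgePar, used above, is the equivalent list of parameters for the edges. Another common step is to cut the tree and highlight the resulting groups; a minimal sketch using base R's rect.hclust() on the hclust object from above:

# Plot the hclust object directly and outline k = 3 clusters
plot(hcluster, cex = 0.5)
rect.hclust(hcluster, k = 3, border = c("red", "green", "blue"))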

I am Nilesh Kumar, a graduate student in the Department of Biology at UAB under the mentorship of Dr. Shahid Mukhtar. I joined UAB in Spring 2018 and am working on Network Biology. My research interests are network modeling, mathematical modeling, game theory, artificial intelligence and their applications in systems biology.

I graduated with a master’s degree, “Master of Technology, Information Technology (Specialization in Bioinformatics)”, in 2015 from the Indian Institute of Information Technology Allahabad, India, with a GATE scholarship. My Master’s thesis was entitled “Mirtron Prediction through machine learning approach”. I worked as a research fellow at The International Centre for Genetic Engineering and Biotechnology, New Delhi, for two years.