Principal component analysis (PCA) using R


Principal component analysis (PCA) is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. It has a wide range of applications in machine learning: it can be used to find structure in features and to pre-process data before fitting a model.
Overall, PCA is an ideal candidate for visualizing data while reducing the number of dimensions.

Data preparation

# Keep only the four numeric columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
data = iris[,c(1,2,3,4)]
class(data)
## [1] "data.frame"

1. Scale data

data.scaled = scale(data, center = TRUE, scale = TRUE)
head(data.scaled,5)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052

2. The correlation matrix

res.cor <- cor(data.scaled)
res.cor
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
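Because the variables were standardized to unit variance in step 1, the correlation matrix is identical to the covariance matrix of the scaled data, so either one could feed the eigen decomposition below. A quick sketch to confirm this:

```r
# Standardize the iris measurements, as in step 1
data.scaled <- scale(iris[, 1:4], center = TRUE, scale = TRUE)

# For variables standardized to unit variance, covariance == correlation
all.equal(cov(data.scaled), cor(data.scaled))
## [1] TRUE
```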

3. The eigenvectors of the correlation matrix

res.eig <- eigen(res.cor)
res.eig
## eigen() decomposition
## $values
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
## 
## $vectors
##            [,1]        [,2]       [,3]       [,4]
## [1,]  0.5210659 -0.37741762  0.7195664  0.2612863
## [2,] -0.2693474 -0.92329566 -0.2443818 -0.1235096
## [3,]  0.5804131 -0.02449161 -0.1421264 -0.8014492
## [4,]  0.5648565 -0.06694199 -0.6342727  0.5235971
plot(res.eig$values, col=c("red","orange","green","blue"),type="h",main="Eigen values")

The first eigenvalue (2.91849782) is the largest, so its eigenvector (the first column of $vectors) defines the first principal component.
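The eigenvalues also tell us how much of the total variance each component captures: divide each eigenvalue by their sum (which equals 4, the number of standardized variables). A short sketch, recomputing the decomposition so it stands alone:

```r
# Eigen decomposition of the correlation matrix, as in step 3
res.eig <- eigen(cor(iris[, 1:4]))

# Each eigenvalue's share of the total variance
# (the eigenvalues sum to 4, the number of standardized variables)
prop.var <- res.eig$values / sum(res.eig$values)
round(prop.var, 4)
## [1] 0.7296 0.2285 0.0367 0.0052
```

So the first component alone accounts for about 73% of the variance in the standardized data.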


4. Let’s compute the components by multiplying the transposed eigenvector matrix by the transposed scaled data matrix.

# Transpose the eigenvectors
eigenvectors.t <- t(res.eig$vectors)
# Transpose the adjusted data
data.scaled.t <- t(data.scaled)
# The new dataset
data.new <- eigenvectors.t %*% data.scaled.t
# Transpose the new data and rename columns
data.new <- t(data.new)
colnames(data.new) <- c("PC1", "PC2", "PC3", "PC4")
head(data.new)
##            PC1        PC2         PC3          PC4
## [1,] -2.257141 -0.4784238  0.12727962  0.024087508
## [2,] -2.074013  0.6718827  0.23382552  0.102662845
## [3,] -2.356335  0.3407664 -0.04405390  0.028282305
## [4,] -2.291707  0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825  0.006586116
barplot(data.new, col = c("red","orange","green","blue"))
plot(data.new, col = c("blue"), main="PC1 vs PC2")
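As a sanity check, these hand-computed scores should match what prcomp() returns on the same standardized data, up to the sign of each column (eigenvector signs are arbitrary). A sketch, rebuilding the pieces so it runs on its own:

```r
# Rebuild the manual scores and compare with prcomp() on the same standardized data
data.scaled <- scale(iris[, 1:4])
res.eig <- eigen(cor(data.scaled))
scores.manual <- data.scaled %*% res.eig$vectors  # equivalent to t(eigenvectors.t %*% data.scaled.t)
scores.prcomp <- unname(prcomp(data.scaled)$x)

# Eigenvector signs are arbitrary, so flip each column to make its first entry positive
align <- function(m) sweep(m, 2, sign(m[1, ]), "*")
all.equal(align(unname(scores.manual)), align(scores.prcomp))
## [1] TRUE
```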

PCA using prcomp function

pca <- prcomp(iris[, -5])
summary(pca)
## Importance of components:
##                           PC1     PC2    PC3     PC4
## Standard deviation     2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
biplot(pca, col = c("blue","red"),main = "PCA using prcomp")
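Note that prcomp() centers the data but does not scale it by default, which is why the proportions of variance above (0.9246 for PC1) differ from the eigenvalue-based shares computed on the standardized data. Passing scale. = TRUE reproduces the correlation-based analysis from the manual steps:

```r
# prcomp() centers but does not scale by default; scale. = TRUE standardizes
# the variables, so sdev^2 reproduces the eigenvalues of the correlation matrix
pca.scaled <- prcomp(iris[, -5], scale. = TRUE)
round(pca.scaled$sdev^2, 4)
## [1] 2.9185 0.9140 0.1468 0.0207
```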

Reference: https://www.rdocumentation.org/packages/stats/versions/3.5.3/topics/prcomp


Note: This is a guest post, and opinion in this article is of the guest writer. If you have any issues with any of the articles posted at www.marktechpost.com please contact at asif@marktechpost.com 

Nilesh Kumar
I am Nilesh Kumar, a graduate student at the Department of Biology, UAB, under the mentorship of Dr. Shahid Mukhtar. I joined UAB in Spring 2018 and am working on Network Biology. My research interests are network modeling, mathematical modeling, game theory, artificial intelligence, and their applications in systems biology. I graduated with a Master of Technology in Information Technology (Specialization in Bioinformatics) in 2015 from the Indian Institute of Information Technology Allahabad, India, with a GATE scholarship. My Master’s thesis was entitled “Mirtron Prediction through machine learning approach”. I worked as a research fellow at The International Centre for Genetic Engineering and Biotechnology, New Delhi, for two years.
