How to perform PCA trough singular value decomposition using R.

# What is singular value decomposition?

Singular value decomposition (SVD) is a factorization of a real or complex matrix which generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any m x n matrix:

Where M is m x n, U is m x m, S is m x n, and V is n x n.

The diagonal entries si of S are know as the singular values of M. The number of non-zero singular values is equal to the rank of M. The columns of U and the columns of V are called the left-singular vectors and right singular vectors of M, respectively.

In R, you can perform SVD through `svd()` which by default uses compact SVD, a similar decomposition M = USV´ in which S is square diagonal of size r x r, where r ≤ min{m, n} is the rank of M, and has only the non-zero singular values. In this variant, V is an m x r semi-unitary matrix and V is an n x r semi-unitary matrix, such that UU = VV = I.

# How can we use SVD to perform principal component analysis?

Among other applications, SVD can be used to perform principal component analysis (PCA) since there is a close relationship between both procedures. Check out the post “Relationship between SVD and PCA. How to use SVD to perform PCA?” to see a more detailed explanation.
Let’s say you have a data matrix M of m x n size, where m is the number of samples (treatments, states, etc.) and n is the number of variables (response variable measures on each sample). Also let’s assume you have centered your data by subtracting the mean of each variable. Then you can calculate the covariance matrix C given by:

This symmetric matrix can be diagonalized as:

V is a matrix of eigenvectors (each column is an eigenvector) and L is a diagonal matrix with eigenvalues ʎi in the decreasing order on the diagonal. When performing PCA, the rest is just to obtain the projections of the original data through the matrix multiplication MV. The new points can be called principal components or PC scores. Besides, eigenvalues also tell us the variation covered by each principal component.

Now, if we perform SVD on M:

We can easily see that:

Meaning that right singular vectors V are principal directions (eigenvectors) and that singular values are related to the eigenvalues of covariance matrix via ʎi = si2/(n-1). Principal components are given by XV = US and loadings by columns of VS/(m-1)1/2.

Now, it’s time to see the above in action with some data and R code.

# Data

The data set I’ve used for this post comes from tissuesGeneExpression, which in turn is a subset of gene expression data from the following papers:

``````devtools::install_github("genomicsclass/tissuesGeneExpression")

library(tissuesGeneExpression)
data(tissuesGeneExpression)

data_gene <- t(e)
data_gene <- as.data.frame(data_gene)``````

I’ve used `t()` to transpose the data, so in rows tissue samples are marked while in columns expression for different genes:

``dim(data_gene)``
``## [1]   189 22215``

As you can see, there are 189 tissue samples where the expression levels of 22215 genes has been measured.

# PCA through svd()

## Center the data

To perform a PCA, first is necessary to center the data by subtracting the mean from each gene:

``````library(purrr)

# Calculate the mean of each gene
gene_means <- map_dbl(data_gene, mean)

# Subtract each gene mean on its corresponding row
data_gene_c <- map2(data_gene, gene_means, .f = function(x, mean) x - mean)
data_gene_c <- data.frame(data_gene_c)``````

## SVD with svd()

Now it’s easy to perform SVD:

``dg_svd <- svd(data_gene_c)``

`dg_svd` contains a list with a u matrix, a v matrix, and a numerical vector with singular values:

``````# U matrix
dg_u <- dg_svd\$u

# V matrix
dg_v <- dg_svd\$v

# Singular values
dg_d <- diag(dg_svd\$d)``````

## Scree plot

The we can calculate each eigenvalue and the variation covered by each principal component:

``````library(dplyr)

# Obtain each eigenvalue
ev_dg_svd <- dg_svd\$d^2 / (nrow(data_gene) - 1)

# Percentage of variation of each PC
pervar_dg_svd <- data.frame(ev_dg_svd) %>%
mutate(
per_var = ev_dg_svd * 100 / sum(ev_dg_svd),
pc = map_chr(1:nrow(dg_u), .f = function(x) paste0("PC", as.character(x))
)
)``````

And make a scree plot:

``````library(ggplot2)

# Scree plot with the first 15 principal components
ggplot(pervar_dg_svd[1:15,], aes(reorder(pc, -per_var), per_var)) +
geom_col() +
xlab("Principal component") +
ylab("Percentage of variation (%)") +
ggtitle("Scree plot - svd()") +
theme_classic()``````

Since there are 189 principal components, I’m just took the first 15.

## Score plot

First we calculate the projections of the original data:

``````# Scores
scores_dg_svd <- data.frame(dg_u %*% dg_d)

# Change the default names for PC names
colnames(scores_dg_svd) <- map_chr(
1:nrow(scores_dg_svd), .f = function(x) paste0("PC", as.character(x))
)``````

I’m adding a new column to the previous data, that indicates the name of the tissue which each sample corresponds:

``````scores_dg_svd <- scores_dg_svd %>%
mutate(Tissue = tissue) %>%
relocate(Tissue)``````

The character vector `tissue` is included in the data that I downloaded first:

``table(tissue)``
``````## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver
##          38          34          15          31          39          26
##    placenta
##           6``````

Finally we can easily make a score plot using `ggplot2`:

``````ggplot(scores_dg_svd, aes(PC1, PC2, color = Tissue)) +
geom_point(size = 2) +
xlab("PC1 (33%)") +
ylab("PC2 (14%)") +
ggtitle("PCA on gene expression data using svd()") +
theme_classic()``````

``````ls_dg_svd <- (dg_v %*% dg_d) / sqrt(nrow(data_gene) - 1)

dim(ls_dg_svd)``````
``## [1] 22215   189``

Each row on this matrix corresponds to a gene and each column to a principal component:

``ls_dg_svd[1:5, 1:5]``
``````##             [,1]        [,2]        [,3]        [,4]        [,5]
## [1,]  0.28109040 -0.51722727  0.06503399  0.44542686  0.01671733
## [2,] -0.12941227 -0.08617018  0.04809022 -0.08718245 -0.08208977
## [3,] -0.18520759  0.22871016  0.05254447  0.10137023  0.04573839
## [4,] -0.04402067 -0.07799667 -0.08576623  0.04628856  0.46190597
## [5,]  0.06064367  0.04732136 -0.04864953  0.08728058 -0.02975143``````

Unfortunately, I have not been able to found another source that supports the loadigns definition on the previous section. Usually, in papers eigenvectors are reported and discussed as loadings. So if you want to see or use this values you have to take the v matrix:

``dg_v[1:5, 1:5]``
``````##               [,1]         [,2]         [,3]         [,4]          [,5]
## [1,]  0.0046965883 -0.013275124  0.002087159  0.017092900  0.0006956308
## [2,] -0.0021622800 -0.002211639  0.001543377 -0.003345557 -0.0034158660
## [3,] -0.0030945339  0.005870061  0.001686328  0.003890002  0.0019032363
## [4,] -0.0007355177 -0.002001858 -0.002752526  0.001776287  0.0192205309
## [5,]  0.0010132625  0.001214547 -0.001561327  0.003349323 -0.0012379970``````

Remember this vectors are normalized, so their norm is equal to 1, and they are orthogonal, so their dot product will be 0.

# PCA trhough prcomp()

A more direct way to do the above, is through the function `prcomp()`, which uses SVD to perform PCA. Even you do not need to center your data since this function do it by default, among other options.

``dg_pca <- prcomp(data_gene)``

`dg_pca` is basically a list that contains the PC scores, loadings and singular values.

## Scree plot

To make a scree plot, with percentage of variation covered by each principal component, use the values on `sdev` whithin `dg_pca`:

``````# Eigenvalues
ev_dg_prcomp <- dg_pca\$sdev^2 / (nrow(data_gene) - 1)

# Percentage of variation for each PC
per_var_prcomp <- data.frame(ev_dg_prcomp) %>%
mutate(
per_var = ev_dg_prcomp * 100 / sum(ev_dg_prcomp),
pc = map_chr(
1:nrow(dg_pca\$x), .f = function(x) paste0("PC", as.character(x))
)
)

# Scree plot with the first 15 principal components
ggplot(per_var_prcomp[1:15,], aes(reorder(pc, -per_var), per_var)) +
geom_col() +
xlab("Principal component") +
ylab("Percentage of variation (%)") +
ggtitle("Scree plot - prcomp()") +
theme_classic()``````

If you compare this graph with that obtained using `svd()` you will find it is the same.

## Score plot

PC scores are in the object `x` within `dg_pca`:

``````# Scores
scores_dg_prcomp <- dg_pca\$x %>%
as.data.frame() %>%
mutate(Tissue = tissue) %>%
relocate(Tissue)

# Score plot
ggplot(scores_dg_prcomp, aes(PC1, PC2, color = Tissue)) +
geom_point(size = 2) +
xlab("PC1 (33%)") +
ylab("PC2 (14%)") +
ggtitle("PCA on expression data using prcomp()") +
theme_classic()``````

Again, exactly the same result as with `svd()`.

You can see the eigenvectors in the object `rotation` within `dg_pca`. To obtain the loadings you only need to multiply by the singular values:

``````ls_dg_prcomp <- dg_pca\$rotation %*% diag(dg_pca\$sdev)

dim(ls_dg_prcomp)``````
``## [1] 22215   189``

In the same way as `svd()`, on this matrix each row corresponds to a gene and each column to a principal component:

``ls_dg_prcomp[1:5, 1:5]``
``````##                  [,1]        [,2]        [,3]        [,4]        [,5]
## 1007_s_at  0.28109040 -0.51722727  0.06503399  0.44542686  0.01671733
## 1053_at   -0.12941227 -0.08617018  0.04809022 -0.08718245 -0.08208977
## 117_at    -0.18520759  0.22871016  0.05254447  0.10137023  0.04573839
## 121_at    -0.04402067 -0.07799667 -0.08576623  0.04628856  0.46190597
## 1255_g_at  0.06064367  0.04732136 -0.04864953  0.08728058 -0.02975143``````

You can use or see the eigenvectors, usually reported as loadings as I previously wrote, in the component `rotation`:

``dg_pca\$rotation[1:5, 1:5]``
``````##                     PC1          PC2          PC3          PC4           PC5
## 1007_s_at  0.0046965883 -0.013275124  0.002087159  0.017092900  0.0006956308
## 1053_at   -0.0021622800 -0.002211639  0.001543377 -0.003345557 -0.0034158660
## 117_at    -0.0030945339  0.005870061  0.001686328  0.003890002  0.0019032363
## 121_at    -0.0007355177 -0.002001858 -0.002752526  0.001776287  0.0192205309
## 1255_g_at  0.0010132625  0.001214547 -0.001561327  0.003349323 -0.0012379970``````

You can find all the code on this post on the next repository:

Principal-Component-Analysis-through-Singular-Value-Decomposition

Thank you so much for your time and visit this site.

Juan Pablo Carreón Hidalgo