How to perform PCA step by step using R and basic linear algebra functions and operations.

# What is PCA?

PCA is an exploratory data analysis based in dimensions reduction. The general idea is to reduce the dataset to have fewer dimensions and at the same time preserve as much information as possible.

PCA allows us to make visual representations in two dimensions and check for groups or differences in the data related to different states, treatments, etc. Besides, we can get some clue about which variables in the data are responsible for the visual differences.

It is important to highlight that PCA is not used exclusively for the above, and since is an exploratory analysis, the data similarities or differences must be considered in the context where the data come from.

# A simple case with data in two dimentions

Let’s start simple. I have the data with just two variables to show you some core concepts behind PCA. Then, we will be able to generalize to data with more dimensions.

## Data

I have simulated the next data:

``````set.seed(1)

# Variable 1
var_1 <- rnorm(50, 50, sd = 3)

# Variable 2
var_2 <- .5*var_1 + rnorm(50, sd = sqrt(3))

# Both variables in a data frame
data_set_1 <- data.frame(var_1, var_2)

``````##      var_1    var_2
## 1 48.12064 24.74986
## 2 50.55093 24.21540
## 3 47.49311 24.33739
## 4 54.78584 25.43681
## 5 50.98852 27.97633
## 6 47.53859 27.19945``````

A scatter plot can show the spread and the possible relation between both variables:

``````library(ggplot2)

# A scatter plot with the two simulated variables
ggplot(data_set_1, aes(x = var_1, y = var_2)) +
geom_point(color = "blue", size = 2) +
xlab("Variable 1") +
ylab("Variable 2") +
theme_classic()``````

## Center each variable

The first step in PCA is center both variables:

``````library(dplyr)

# Subtract the mean from each variable
data_set_1 <- data_set_1 %>%
mutate(varc_1 = var_1 - mean(var_1), varc_2 = var_2 - mean(var_2))

``````##      var_1    var_2     varc_1      varc_2
## 1 48.12064 24.74986 -2.1807063 -0.60402890
## 2 50.55093 24.21540  0.2495851 -1.13848362
## 3 47.49311 24.33739 -2.8082307 -1.01649408
## 4 54.78584 25.43681  4.4844976  0.08291914
## 5 50.98852 27.97633  0.6871785  2.62244372
## 6 47.53859 27.19945 -2.7627500  1.84556287``````

Note that the above do not modify the relative position between each point, so the data looks the same:

``````# Scatter plot for the centered data
ggplot(data_set_1, aes(x = varc_1, y = varc_2)) +
geom_point(color = "blue", size = 2) +
geom_vline(xintercept = 0, size = .5) +
geom_hline(yintercept = 0, size = .5) +
theme_classic()``````

## Calculate the covariance matrix

You can calculate the covariance matrix for a given set of variables just by performing a matrix multiplication on the transformed data as follows:

``````# Select just the centered variables
data_set_2 <- data_set_1 %>%
select(varc_1, varc_2) %>%
as.matrix()

# Calculate the covariance matrix
cov_m <- (t(data_set_2) %*% data_set_2) / (nrow(data_set_2) - 1)

cov_m``````
``````##          varc_1   varc_2
## varc_1 6.220943 2.946877
## varc_2 2.946877 4.207523``````

At this case, the diagonal contains the variances for each variable while values out of the diagonal are the covariances between those (see the figure below).

The same result with the `cov` function:

``````# Covariance matrix using cov()
cov(data_set_2)``````
``````##          varc_1   varc_2
## varc_1 6.220943 2.946877
## varc_2 2.946877 4.207523``````

Or the `crossprod` function:

``````# Covariance matrix using crossprod()
crossprod(data_set_2) / (nrow(data_set_2) - 1)``````
``````##          varc_1   varc_2
## varc_1 6.220943 2.946877
## varc_2 2.946877 4.207523``````

## Obtain the eigenvalues and eigenvectors from the covariance matrix

Principal components represent the directions of the data that explain the maximal amount of variance. They are “lines” that capture most information of the data. These directions can be obtained through calculate the eigenvalues and eigenvectors from the covariance matrix:

``````# Use eigen() to obtain eigenvectors and eigenvalues
cov_e <- eigen(cov_m)

# Eigenvectors
e_vec <- cov_e\$vectors

# Eigenvalues
e_val <- cov_e\$values``````

The span of each eigenvector can be considered that “line” that capture most variation:

``````# First eigenvector
ev_1 <- e_vec[,1]

# First eigenvector slope
ev1_m <- ev_1[2] / ev_1[1]

# Second eigenvector
ev_2 <- e_vec[,2]

# Second eigenvector slope
ev2_m <- ev_2[2] / ev_2[1]

# Scatter plot showing the span of both eigenvectors
ggplot(data.frame(data_set_2), aes(x = varc_1, y = varc_2)) +
geom_point(color = "blue", size = 2) +
geom_vline(xintercept = 0, size = .5) +
geom_hline(yintercept = 0, size = .5) +
geom_abline(slope = ev1_m, color = "blue", size = 0.7) +
geom_abline(slope = ev2_m, color = "red", size = 0.7) +
theme_classic()``````

As you can see, there is an eigenvector for each variable in the data set, at this case two. Note that eigenvectors are perpendicular:

``````# Multiply both eigenvectors
ev_1 %*% ev_2``````
``````##      [,1]
## [1,]    0``````

Regarding to the eigenvalues, their numeric values are equal to the sum of squares of the distances of each projected data point in the corresponding principal component. This sum of squares is maximized in the first principal component.

## Make a Scree Plot

Dividing each eigenvalue by n – 1 (n is the number of rows in the original data) will provide an estimation of the variance accounted by each principal component. The sum of all the variances (the total variance) can be used to calculate the percentage of variation and visualize them with a Scree Plot:

``````# Calculate the estimated variance for each eigenvalue
e_var <- e_val / (nrow(data_set_2) - 1)

# Data frame with variance percentages
var_per <- data.frame(
PC  = c("PC1", "PC2"),
PER = c(e_var) * 100 / sum(e_var) # Calculate the percentage
)

# Scree plot
ggplot(var_per, aes(x = PC, y = PER)) +
geom_col(width = 0.5, color = "black") +
xlab("Principal component") +
ylab("Percentage of variation (%)") +
theme_classic()``````

The eigenvectors retrieved by the function `eigen` are normalized. This means that their longitude matches 1:

``````# Norm of the first eigenvector
norm(as.matrix(ev_1), "F")``````
``## [1] 1``
``````# Norm of the second eigenvector
norm(as.matrix(ev_2), "F")``````
``## [1] 1``

The elements of each eigenvector are also called loadings and can be interpreted as the contribution of each variable in the data set to the corresponding principal component, or, more strictly, as the coefficients of the linear combination of the original variables from which the principal components are constructed.

You can make a table with these values and see the contributions of each variable to each principal component:

``````# Data frame with both eigenvectors
VAR   = c("var_1", "var_2"),
PC1 = ev_1, # First eigenvecor
PC2 = ev_2  # Second eigenvectors
)

``````##     VAR        PC1        PC2
## 1 var_1 -0.8134113  0.5816890
## 2 var_2 -0.5816890 -0.8134113``````

The above can be useful in data with many dimensions to establish which variables are causing the clusters or differences in the PCA plot.

## Represent the data in lower dimensions

If we change the basis of the original data to that indicated by the eigenvectors, basically we are rotating the data:

``````# Inverse of eigenvectors matrix
inv_evec <- solve(e_vec)

# Change the basis of the original data
data_set_3 <- data_set_2 %*% inv_evec

# Scatter showing the rotation
ggplot(data.frame(data_set_3), aes(X1, X2)) +
geom_point(color = "blue", size = 2) +
geom_vline(xintercept = 0, size = .5) +
geom_hline(yintercept = 0, size = .5) +
xlab("PC1 (78.8%)") +
ylab("PC2 (21.2%)") +
theme_classic()``````

You can compare both graphs to have an idea of how data has been rotated once we change the basis:

``````library(ggpubr)

# Scatter plot with the centered data
plot_data <- ggplot(data.frame(data_set_2), aes(x = varc_1, y = varc_2)) +
geom_point(color = "blue", size = 2) +
geom_vline(xintercept = 0, size = .5) +
geom_hline(yintercept = 0, size = .5) +
ylim(c(-8, 8.5)) +
ggtitle("Original Data") +
theme_classic()

# Scatter plot with the rotated data
plot_rotation <- ggplot(data.frame(data_set_3), aes(X1, X2)) +
geom_point(color = "blue", size = 2) +
geom_vline(xintercept = 0, size = .5) +
geom_hline(yintercept = 0, size = .5) +
xlab("PC1 (78.8%)") +
ylab("PC2 (21.2%)") +
ylim(c(-8, 8.5)) +
ggtitle("Change of Basis to Eigenvectors") +
theme_classic()

# Both graphs side by side
ggarrange(plot_data, plot_rotation)``````

Since PC1 explain most of the variance in the data, we can omit PC2 and represent each point in just one dimension, here as red points:

``````# Data points just from PC 1
data_pc1 <- data.frame(v1 = data_set_3[,1], v2 = rep(0, nrow(data_set_3)))

# Scatter plot showing the projected points from PC1 (red points)
ggplot(data.frame(data_set_3), aes(X1, X2)) +
geom_point(color = "blue", size = 2) +
geom_point(data = data_pc1, aes(v1, v2), color = "red", size = 2) +
geom_vline(xintercept = 0, size = .5) +
geom_hline(yintercept = 0, size = .5) +
xlab("PC1 (78.8%)") +
ylab("PC2 (21.2%)") +
ylim(c(-8, 8.5)) +
theme_classic()``````

The ideas above can be used in data with many variables to reduce dimensions and represent the data with 2D plots.

# Generalization to more dimensions

Now let’s apply the same procedure in a data set with more variables, specifically in data from wine samples where the follow responses have been measured:

The wine samples come from Argentina, Chile, Australia, and South Africa:

The first six rows of the data set looks like this:

``````# Variable names
var_names <- names(var_names)

# Wine labels
wine_label <- unname(unlist(wine_label))

# Wine data set
row.names = wine_label, col.names = var_names
)

``````##          Ethanol TotalAcid VolatileA MalicAcid   pH LacticAc ReSugar CitricAcid
## ARG-BNS1   13.62      3.54      0.29      0.89 3.71     0.78    1.46       0.31
## ARG-DDA1   14.06      3.74      0.59      0.24 3.73     1.25    2.42       0.18
## ARG-FFL1   13.74      3.27      0.47     -0.07 3.87     1.13    1.52       0.39
## ARG-FLM1   13.95      3.66      0.47      0.09 3.79     1.00    4.17       0.41
## ARG-ICR1   14.47      3.66      0.38      0.61 3.70     0.81    1.25       0.14
## ARG-SAL1   14.61      3.45      0.52      0.16 3.92     1.76    1.40       0.10
##             CO2 Density FolinC Glycerol Methanol TartaricA
## ARG-BNS1  85.61    0.99  60.92     9.72     0.16      1.74
## ARG-DDA1 175.20    1.00  70.64    10.05     0.20      1.58
## ARG-FFL1 513.74    0.99  63.59    10.92     0.18      1.24
## ARG-FLM1 379.40    1.00  73.30     9.69     0.23      2.26
## ARG-ICR1 154.88    0.99  71.69    10.81     0.20      1.22
## ARG-SAL1 156.30    0.99  71.79    10.19     0.19      0.90``````

Note that wine samples are marked as the row names of the data frame, and each sample has associated the values for each response (data columns). Put attention to this when calculating covariance matrices or using functions to perform PCA, do you want to reduce de dimensions respect to the rows or respect the columns? Here we are interested on reduce the number of dimensions respect to the responses and try to spot if there are some similarities or differences between wine samples.

## Center each variable and divide by the standard deviation

Subtract the mean for each variable (columns), and divide it for its standard deviation:

``````library(purrr)

# Means for each variable
var_means <- unlist(map(data_set_wine, mean))

# Standard deviation for each variable
var_sd <- unlist(map(data_set_wine, sd))

# Center each variable
data_set_wine_2 <- map2(
data_set_wine, var_means, .f = function(x, mean) x - mean
)

# Divide by the standard deviation of each variable
data_set_wine_2 <- map2(
data_set_wine_2, var_sd, .f = function(x, sd) x / sd
)

# Make a matrix from the previous list
data_set_wine_2 <- as.matrix(data.frame(data_set_wine_2))``````

Each row in the transformed data corresponds to the same wine sample in the the original data.

The first six rows if the transformed data look like this:

``head(data_set_wine_2)``
``````##          Ethanol   TotalAcid  VolatileA  MalicAcid         pH    LacticAc
## [1,] -0.63678490 -0.40870999 -0.7811641  2.2627949 0.18823560 -0.88226788
## [2,]  0.29224039  0.01096889  1.3231961 -0.5694123 0.40060358  0.38388752
## [3,] -0.38341492 -0.97527630  0.4814521 -1.9201573 1.88717937  0.06061381
## [4,]  0.05998306 -0.15690246  0.4814521 -1.2229986 1.03770749 -0.28959935
## [5,]  1.15792166 -0.15690246 -0.1498560  1.0427673 0.08205162 -0.80144937
## [6,]  1.45351897 -0.59756526  0.8321787 -0.9179917 2.41810183  1.75780072
##         ReSugar CitricAcid        CO2    Density     FolinC     Glycerol
## [1,] -0.5311188  1.1594920 -1.6049333 -0.7844233 -0.1371321 -0.525885112
## [2,]  0.1996454  0.1084153 -0.9482843  1.2458487  1.2376582 -0.160707984
## [3,] -0.4854461  1.8063082  1.5330409 -0.7844233  0.2405111  0.802031811
## [4,]  1.5317676  1.9680124  0.5483973  1.2458487  1.6138874 -0.559083800
## [5,] -0.6909735 -0.2149929 -1.0972195 -0.7844233  1.3861700  0.680306454
## [6,] -0.5767916 -0.5384011 -1.0868116 -0.7844233  1.4003137 -0.005784994
##        Methanol   TartaricA
## [1,] -1.1081037  0.43877622
## [2,] -0.0870238 -0.03865555
## [3,] -0.5975635 -1.05319837
## [4,]  0.6787860  1.99042974
## [5,] -0.0870238 -1.11287729
## [6,] -0.3422939 -2.06774119``````

Dividing by the standard deviations is a way the give to each variable the same importance despite its range, magnitude, and/or measurement scale. Other transformations are possible besides dividing by the standard deviation, which can be applied relying on your data. See the resources at the end of this post if you are interested.

## Calculate the covariance matrix

Multiply the data (as a matrix) by its transpose:

``````# Calculate the covariance matrix
cov_wine <- (t(data_set_wine_2) %*% data_set_wine_2) /
(nrow(data_set_wine_2) - 1)

cov_wine[1:5, 1:5]``````
``````##              Ethanol  TotalAcid  VolatileA   MalicAcid         pH
## Ethanol   1.00000000  0.3262321  0.2028382  0.04166778  0.1697347
## TotalAcid 0.32623209  1.0000000  0.4660575 -0.26067574 -0.3518303
## VolatileA 0.20283816  0.4660575  1.0000000 -0.74628880  0.3011611
## MalicAcid 0.04166778 -0.2606757 -0.7462888  1.00000000 -0.2929542
## pH        0.16973470 -0.3518303  0.3011611 -0.29295421  1.0000000``````

Here I just showing you the first five rows and columns of the covariance matrix. Remember, if you want to see all the matrix you can use the function `View` as `View(cov_wine)`.

Each value on `cov_wine` matrix has the same interpretation as in the example with two variables, diagonal values are the variances of each variable, and values out of the diagonal are covariances between variables. As you can see all variances are equal to 1. This is precisely the effect of centering, by subtracting the mean, and dividing by the standard deviation.

## Obtain the eigenvalues and eigenvectors from the covariance matrix

Use the function `eigen` to obtain the eigenvectors and their egeinvalues:

``````# eigen() to obtain eigenvalues and eigenvectors
eg_wine <- eigen(cov_wine)

# Eigenvalues
eg_vals <- eg_wine\$values

# Eigenvectors
eg_vecs <- eg_wine\$vectors``````

The number of eigenvectors and eigenvalues is the same as the number of variables in the original data set:

``````# Number of eigenvalues
length(eg_vals)``````
``## [1] 14``
``````# Number of eigenvectors
ncol(eg_vecs)``````
``## [1] 14``

## Make a Scree Plot

Calculate the percentage of variation for each component and make a scree plot:

``````# Calculate variances from each eigenvalue
eg_vars <- eg_vals / (nrow(data_set_wine_2) - 1)

# Data frame with variance percenatges
vars_perc <- data.frame(
PC  = unlist(map(1:14, function(x) paste0("PC", x))),
PER = round((eg_vars * 100) / sum(eg_vars), 4)
)

# Scree plot
ggplot(
vars_perc,
aes(x = reorder(PC, order(PER, decreasing = TRUE)), y = PER)
) +
geom_col(width = 0.5, color = "black") +
xlab("Principal component") +
ylab("Percentage of variation (%)") +
theme_classic()``````

## Represent the data in lower dimensions

Ideally, if PC1 and PC2 were to gather most of the variation, let’s say more than 90%, it would possible to make a good representation of the data in a 2D scatter plot. Since real data is mostly never ideal, at this case PC1, PC2, PC3, and PC4 account for 73% of the variation, so I’m going to try to spot clusters making two scatter plots, the first with PC1 and PC2, and the second with PC3 and PC4.

First, change the basis of the transformed data into that stated by the eigenvectors:

``````# Change the basis of the data
data_set_wine_eb <- data_set_wine_2 %*% solve(eg_vecs)

# Transfrom to a data frame
data_set_wine_eb <- data.frame(data_set_wine_eb)
colnames(data_set_wine_eb) <- vars_perc\$PC

# Add a column with the origin of each wine sample
data_set_wine_eb <- data_set_wine_eb %>%
mutate(
WineSample = unlist(map(wine_label, function(x) substr(x, 1, 3)))
) %>%
relocate(WineSample)

``````##   WineSample        PC1        PC2        PC3        PC4         PC5
## 1        ARG -0.8340408  0.8560257  1.2477283 -0.2946354 -2.64962606
## 2        ARG  0.9182495 -0.7520145  0.8955902 -0.3967229  0.06460487
## 3        ARG  2.0109822  0.2126599 -0.4739304 -1.3446676  0.96328060
## 4        ARG  2.8454813 -1.9323804  0.5129401 -0.2805325 -0.56882676
## 5        ARG -0.4027013  0.7424389  0.3544802 -1.3925516 -0.86093254
## 6        ARG  0.4921766  0.6976268  0.9133225 -1.3858705  0.58195738
##           PC6        PC7         PC8         PC9        PC10        PC11
## 1 -0.09972511  0.6839127 -0.64470453  1.30296999  0.08192591  0.90864847
## 2 -1.15201976 -0.6156587  0.09652472 -0.66508130  0.22293709 -0.44745146
## 3 -0.97714541  0.7454403  1.13873701 -0.36070456 -1.80323468  1.57277920
## 4  0.04685162  0.9391345  1.05427365  0.06617118 -0.72767986 -1.40159736
## 5 -0.43249796 -1.2152728  0.19062688  1.02200106  0.80693595 -0.03070849
## 6 -1.12762870 -0.7447312  2.12377275 -1.45181400  0.27437060  1.29332051
##          PC12       PC13       PC14
## 1  0.09448821  0.2045349  0.2889872
## 2  0.60237715 -1.1576675 -0.2344966
## 3 -0.74179131  0.9287338 -0.2968222
## 4  0.97819193 -0.1854973  0.5890199
## 5 -0.89310589 -0.3806547 -1.1129547
## 6 -1.49226046 -1.9552497 -1.2846205``````

Now, let’s make both scatter plots just taking the values from PC1, PC2, PC3, and PC4.

``````# Scatter plot for PC1 and PC2
pc12 <- ggplot(
data_set_wine_eb,
aes(PC1, PC2, color = WineSample, shape = WineSample)
) +
geom_point(size = 3) +
ggtitle("PC1 and PC2") +
xlab("PC1 (24.4%)") +
ylab("PC2 (21.3%)") +
theme_classic() +
theme(legend.position = "none")

# Scatter plot for PC3 and PC4
pc34 <- ggplot(
data_set_wine_eb,
aes(PC3, PC4, color = WineSample, shape = WineSample)
) +
geom_point(size = 3) +
scale_color_discrete(
name = "Wine origin",
labels = c("Argentina", "Australia", "Chile", "South Africa")
) +
scale_shape_discrete(
name = "Wine origin",
labels = c("Argentina", "Australia", "Chile", "South Africa")
) +
ggtitle("PC3 and PC4") +
xlab("PC3 (17.5%)") +
ylab("PC4 (10.0%)") +
theme_classic()

# Both graphs side by side
ggarrange(pc12, pc34, widths = c(1.5, 2))``````

Elements of each eigenvector represent the loading for each variable on the corresponding principal component:

``````# Data frame with loading scores

``````##                  PC1        PC2         PC3         PC4          PC5
## Ethanol   -0.2050720 -0.3452884  0.28833198 -0.33833697  0.001102612
## TotalAcid -0.1457210 -0.4545803 -0.06374909  0.42139270  0.041508880
## VolatileA  0.2959952 -0.4418538  0.03338975  0.08430388 -0.010891943
## MalicAcid -0.3401714  0.3173079  0.17907611 -0.04002812 -0.252870228
## pH         0.2413635 -0.1030772 -0.04459655 -0.66472873 -0.170592676
## LacticAc   0.3462627 -0.3745378 -0.16738658  0.11850141 -0.108643313
##                   PC6          PC7         PC8        PC9       PC10
## Ethanol   -0.02992588  0.106645513  0.40565761  0.2220668 -0.0736428
## TotalAcid -0.10721872  0.109080109 -0.15801516  0.1647662 -0.1264076
## VolatileA  0.12676353 -0.117134956 -0.01066998 -0.2323371  0.1969852
## MalicAcid  0.09495473  0.247290148 -0.35870967  0.1792700 -0.2720317
## pH         0.01717167 -0.191605072 -0.30040129  0.3352702  0.2882746
## LacticAc   0.05496474  0.008009813 -0.10169448  0.1377493 -0.2860564
##                  PC11        PC12        PC13        PC14
## Ethanol    0.52247393 -0.22486081  0.29025391  0.04195694
## TotalAcid -0.09209975 -0.25488777 -0.01537897 -0.65083548
## VolatileA  0.19114512 -0.33800897 -0.59864003  0.27743084
## MalicAcid -0.07338526 -0.53559762 -0.18854505  0.23066288
## pH        -0.21200578 -0.09998927 -0.02838824 -0.28223548
## LacticAc  -0.32247769 -0.11583440  0.49548494  0.45696212``````

Making a scatter plot with these value can help to see patterns of variation, and/or explain why a certain grouping is observed in scatter plots for principal components (previous figure):

``````# Scatter plot with loadings of PC1 and PC2
ld_pc12 <- ggplot(loads_wine, aes(PC1, PC2)) +
geom_point(color = "blue", size = 2) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
geom_text(aes(label = rownames(loads_wine)), hjust = -.2) +
xlim(c(-.7, .7)) +
ylim(c(-.7, .7)) +
xlab("PC1 (24.4%)") +
ylab("PC2 (21.3%)") +
theme_classic()

ld_pc34 <- ggplot(loads_wine, aes(PC3, PC4)) +
geom_point(color = "blue", size = 2) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
geom_text(aes(label = rownames(loads_wine)), hjust = -.2) +
xlim(c(-.7, .7)) +
ylim(c(-.7, .7)) +
xlab("PC3 (17.5%)") +
ylab("PC4 (10.0%)") +
theme_classic()

# Both graphs side by side
ggarrange(ld_pc12, ld_pc34)  ``````

# Wonderful resources

• As a rookie, I’ve had to learn linear algebra to understand, and visualize, the core of PCA. The video series, on YouTube, by 3Blue1Brown are a terrific place to start with linear algebra:

Essence of linear algebra

• Intuitive explanations of PCA can be found in other YouTube channels:

Principal Component Analysis (PCA)

StatQuest: Principal Component Analysis (PCA), Step-by-Step

• If you want to go deeper, here there are two papers aimed to chemometrics and metabolomics:

Principal component analysis

Centering, scaling, and transformations: improving the biological information content of metabolomics data

# GitHub repository with all the code of this post

You can find all the code on this post in the next repository:

Principal Component Analysis from Scratch

Thanks a lot for visit this site 🤓