How to perform a principal component analysis on metabolomic data using R. This includes visual representations like a scree plot and a scatter plot for PC1 and PC2.

 

1 Problem

You have obtained the relative quantities of 43 metabolites in Arabidopsis thaliana under salt stress conditions at different times. Salt stress is an important factor limiting plant growth, and you want to study how the levels of primary metabolites change under these conditions. This kind of study falls within metabolomics, the study of the molecular products and intermediates of cell metabolism.

The next figure shows a diagram of how the experiment was carried out:

At the end of the experiment you gather the data for all compounds, times and relative quantities, so the first ten rows of your data table may look like this:

MET TIME SAMPLE QT
aspartic acid 0.5 1 0.728
aspartic acid 0.5 2 0.839
aspartic acid 0.5 3 0.707
aspartic acid 1.0 1 0.746
aspartic acid 1.0 2 0.894
aspartic acid 1.0 3 0.686
aspartic acid 2.0 1 0.772
aspartic acid 2.0 2 0.762
aspartic acid 2.0 3 0.783
aspartic acid 4.0 1 1.160

The SAMPLE column refers to the replicates, so you have three per compound at each time. The metabolite names are the following:

aspartic acid lactate ribonate
asparagine phosphorate shikimate
serine n-butylamine citrate
glycine ethanolamine D-fructose
alanine glycerol glucose
glutamic acid succinate lysine
glutamine glycerate tyrosine
threonine fumarate gluconate
proline cadaverine galactonate
valine malate inositol
tryptophan 4-aminobutyrate uric acid
isoleucine cysteine sucrose
leucine a-ketoglutarate trehalose
phenylalanine aconitate
pyruvate putrescine

For this tutorial, and of course for the problem statement, I used the paper Time-course metabolic profiling in Arabidopsis thaliana cell cultures after salt stress treatment. I simulated the metabolite quantity data starting from the means and standard deviations reported in that paper. To see how I simulated them, please check the R scripts “data_simulation.R” and “or_data_processing.R” in the GitHub repository for this tutorial (the link is at the end of this page).

NOTE: The methodology section of the paper mentions 47 compounds, but there are just 14 in Table 1 in the discussion section and 29 in Table S2 in the supplementary material, which makes a total of 43 available compounds, whose means and standard deviations are gathered in the file “complete_data.csv” (within the data folder in the GitHub repository). I also replaced the amino acid codes (in Table 1) with their respective names. Despite the missing compounds, I was able to reproduce the analysis and obtained results similar to those in the paper.
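
If you want to double-check the simulated data yourself, a quick count of metabolites and replicates could look like the following sketch (it assumes you have cloned the repository, opened the project at its root and loaded the tidyverse):

# Quick sanity check on the simulated data
library(tidyverse)

main_data <- read_csv("data/main_data.csv")

n_distinct(main_data$MET)    # expected: 43 metabolites
count(main_data, MET, TIME)  # expected: n = 3 replicates per metabolite and time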

Also, if you don’t know how PCA works or have some doubts about it, I recommend the following video with an excellent explanation of the topic: StatQuest: Principal Component Analysis (PCA), Step-by-Step.

2 Solution

Let’s perform a Principal Component Analysis (PCA). This exploratory data analysis will help us visualize patterns in the metabolite levels over time under salt stress conditions.

First, let me show you my folder organization:

  • The analysis folder contains all my R scripts, both for data simulation and for data analysis and plots.
  • The data folder contains all the CSV and TXT files produced by my R scripts.
  • The graphs folder contains all the plots produced by my R scripts.

These three folders are within a main folder with a representative name. I also recommend using RStudio and creating a new project from the File menu. This makes things easier.
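
If you prefer to create this structure from the R console instead of doing it by hand, a minimal sketch could be:

# Create the project folders if they don't exist yet
for (folder in c("analysis", "data", "graphs")) {
  if (!dir.exists(folder)) dir.create(folder)
}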

2.1 PCA Analysis

I ran the following R script to perform the PCA. You can find it in the analysis folder as “pca_analysis.R”.

# PCA Analysis on Main Data ------------------------------------------------

# Packages
if (!"tidyverse" %in% .packages()) library(tidyverse)

# 1 Import Data -----------------------------------------------------------
main_data <- read_csv("data/main_data.csv")

# 2 Formatting Data for PCA ---------------------------------------------------

# 2.1 Obtain wide data 
main_data_wd <- main_data %>% 
  mutate(SAMPLE_TIME = paste0(SAMPLE, "-", TIME)) %>% 
  pivot_wider(id_cols = MET, names_from = SAMPLE_TIME, values_from = QT)

# 2.2 Transposing just quantity values
qt_data <- t(main_data_wd[, -1])

# 3 PCA analysis ----------------------------------------------------------
qt_pca <- prcomp(qt_data, scale. = FALSE)

# 3.1 PCA summary
pca_summary <- summary(qt_pca)

# 3.2 PCA data
pca_data <- as_tibble(qt_pca$x) %>% 
  mutate(TIME = as_factor(substr(rownames(qt_pca$x), start = 3, stop = 5))) %>% 
  relocate(TIME)

# 3.3 Loadings Data

# Compound names
compound_names <- filter(main_data, SAMPLE == 1, TIME == 0.5) %>% 
  select(MET) %>% 
  unlist()          

loadings_data <- as_tibble(qt_pca$rotation) %>% 
  signif(3) %>% 
  mutate(MET = compound_names, INDEX = 1:length(MET)) %>% 
  relocate(MET, INDEX) 

# 3.4 Variation percentage per PC
var_pca <- qt_pca$sdev^2
per_var_pca <- round((var_pca / sum(var_pca))*100, 2)
per_var_pca <- tibble(
  PC = 1:length(per_var_pca),
  PER_VAR = per_var_pca
)

# 4 Save results ----------------------------------------------------------
capture.output(pca_summary, file = "data/pca_summary.txt")
write_csv(pca_data, "data/pca_data.csv")
write_csv(loadings_data, "data/loadings_data.csv")
write_csv(per_var_pca, "data/per_var_pca.csv")

As you can see, I have divided the script into different sections. In the first section I checked whether the packages I needed were already loaded; next, I imported the data file “main_data.csv” from the data folder.

# Packages
if (!"tidyverse" %in% .packages()) library(tidyverse)

# 1 Import Data -----------------------------------------------------------
main_data <- read_csv("data/main_data.csv")

Always double-check that you are writing the right path to your file when you use functions like read_csv() from the tidyverse or read.csv() from utils.
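
A defensive version of the import (just an illustration, not part of the original script) could stop with a clear message when the path is wrong:

# Fail early with an informative error if the file is not where we expect it
data_path <- "data/main_data.csv"
if (!file.exists(data_path)) {
  stop("Could not find ", data_path, "; check your working directory with getwd()")
}
main_data <- read_csv(data_path)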

In the second section of the script I reshaped the original data into a form suitable for the function that performs the PCA.

# 2 Formatting Data for PCA ---------------------------------------------------

# 2.1 Obtain wide data 
main_data_wd <- main_data %>% 
  mutate(SAMPLE_TIME = paste0(SAMPLE, "-", TIME)) %>% 
  pivot_wider(id_cols = MET, names_from = SAMPLE_TIME, values_from = QT)

# 2.2 Transposing just quantity values
qt_data <- t(main_data_wd[, -1])

qt_data has the following structure (I’m only showing the first 5 of the 24 rows and the first 7 of the 43 columns):

1-0.5 0.73 1.10 0.78 0.91 0.96 0.43 0.47
2-0.5 0.84 0.98 0.79 0.91 0.96 0.48 0.46
3-0.5 0.71 1.17 0.82 0.91 0.94 0.43 0.43
1-1 0.75 1.00 0.83 0.86 0.83 0.56 0.63
2-1 0.89 0.96 0.86 0.85 0.81 0.55 0.80

Each row contains the quantities for a replicate (the first number in, e.g., “1-0.5”) at a specific time (the second number in “1-0.5”); together, we can refer to the rows as samples. The PCA will help us visualize the complete metabolite profiles in a two-dimensional plot.
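
Before running prcomp() it is worth confirming that the matrix really has the samples in the rows and the metabolites in the columns; a quick check (assuming the code above has been run) is:

dim(qt_data)             # expected: 24 rows (samples) and 43 columns (metabolites)
head(rownames(qt_data))  # sample labels such as "1-0.5", "2-0.5", "3-0.5", ...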

In the third section I performed the PCA with the reshaped data:

# 3 PCA analysis ----------------------------------------------------------
qt_pca <- prcomp(qt_data, scale. = FALSE)

# 3.1 PCA summary
pca_summary <- summary(qt_pca)

# 3.2 PCA data
pca_data <- as_tibble(qt_pca$x) %>% 
  mutate(TIME = as_factor(substr(rownames(qt_pca$x), start = 3, stop = 5))) %>% 
  relocate(TIME)

# 3.3 Loadings Data

# Compound names
compound_names <- filter(main_data, SAMPLE == 1, TIME == 0.5) %>% 
  select(MET) %>% 
  unlist()          

loadings_data <- as_tibble(qt_pca$rotation) %>% 
  signif(3) %>% 
  mutate(MET = compound_names, INDEX = 1:length(MET)) %>% 
  relocate(MET, INDEX) 

# 3.4 Variation percentage per PC
var_pca <- qt_pca$sdev^2
per_var_pca <- round((var_pca / sum(var_pca))*100, 2)
per_var_pca <- tibble(
  PC = 1:length(per_var_pca),
  PER_VAR = per_var_pca
)

Make sure that the samples are always in the rows of your data when using prcomp(). Remember that each sample row could contain the data for a different treatment, phenotype, etc. The argument scale. must be set to TRUE when the variables are not on the same scale, for example when you mix measurements with different units. I set it to FALSE here because all the data are already on the same scale.
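
If your variables were on different scales (for example, metabolite intensities mixed with measurements in other units), a unit-variance PCA would look like the sketch below; it is not needed for this simulated data and is shown only as an illustration:

# Alternative: centre the data and scale each column to unit variance
qt_pca_scaled <- prcomp(qt_data, center = TRUE, scale. = TRUE)
summary(qt_pca_scaled)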

pca_summary displays the following information:

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     5.3118 2.0027 0.94721 0.85199 0.54284 0.47582 0.41538
## Proportion of Variance 0.8068 0.1147 0.02565 0.02076 0.00843 0.00647 0.00493
## Cumulative Proportion  0.8068 0.9215 0.94712 0.96787 0.97630 0.98277 0.98771
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.35457 0.24837 0.23917 0.21549 0.19492 0.19206 0.13877
## Proportion of Variance 0.00359 0.00176 0.00164 0.00133 0.00109 0.00105 0.00055
## Cumulative Proportion  0.99130 0.99306 0.99470 0.99603 0.99711 0.99817 0.99872
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.10145 0.09904 0.08583 0.06810 0.06295 0.05556 0.05293
## Proportion of Variance 0.00029 0.00028 0.00021 0.00013 0.00011 0.00009 0.00008
## Cumulative Proportion  0.99901 0.99929 0.99951 0.99964 0.99975 0.99984 0.99992
##                           PC22    PC23      PC24
## Standard deviation     0.03973 0.03507 1.951e-16
## Proportion of Variance 0.00005 0.00004 0.000e+00
## Cumulative Proportion  0.99996 1.00000 1.000e+00

You can see that PC1 and PC2 explain most of the variability in the data (check the row “Proportion of Variance”). Note that we also get one PC per sample (8 times × 3 replicates = 24 samples). This information is also available as percentages in the per_var_pca data frame. pca_data contains the PCA scores for each sample and loadings_data contains the loadings for each metabolite. In the fourth section I save these data in the data folder; I will use them later to make the plots.
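
As a quick side note before saving: the scores in qt_pca$x are simply the centered data projected onto the loadings, which you can verify with a small check (assuming the objects created by the script above):

# Scores = centered data multiplied by the rotation (loadings) matrix
scores_check <- scale(qt_data, center = TRUE, scale = FALSE) %*% qt_pca$rotation
all.equal(unname(scores_check), unname(qt_pca$x))  # should return TRUE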

# 4 Save results ----------------------------------------------------------
capture.output(pca_summary, file = "data/pca_summary.txt")
write_csv(pca_data, "data/pca_data.csv")
write_csv(loadings_data, "data/loadings_data.csv")
write_csv(per_var_pca, "data/per_var_pca.csv")

2.2 PCA Plots

To make the plots for the PCA scores and the metabolite loadings I used the following R script, named “pca_plots.R” in the analysis folder:

# Plots ----------------------------------------------------------------------

# Packages 
if (!"tidyverse" %in% .packages()) library(tidyverse)

# Set global theme
source("analysis/global_theme.R")

# 1 Import data --------------------------------------------------------------
per_var_pca <- read.csv("data/per_var_pca.csv")
pca_data <- read.csv("data/pca_data.csv")
loadings_data <- read.csv("data/loadings_data.csv")

# 2 Scree Plot ---------------------------------------------------------------
bar_pca <- per_var_pca[1:6,] %>% 
  ggplot(aes(x = as.factor(PC), y = PER_VAR)) +
  geom_col() + 
  xlab("PC") +
  ylab("% of total variance") 
  

# 3 Score plot PC1 and PC2 ----------------------------------------------------
pca_plot <- pca_data %>% 
  mutate(TIME = as_factor(TIME)) %>% 
  ggplot(aes(x = PC1, y = PC2, color = TIME)) +
  geom_point(size = 3) +
  xlab(paste0("PC1-", per_var_pca$PER_VAR[1], "%")) +
  ylab(paste0("PC2-", per_var_pca$PER_VAR[2], "%")) +
  scale_color_brewer(palette = "Dark2")

# 4 Loadings Plot -------------------------------------------------------------

# 4.1 Function to assign compound name if absolute value of loading is bigger
#     than set threshold
load_tr <- function(loadings, threshold, compound_name) {
  compound_name[!abs(loadings) > threshold] <- " "
  return(compound_name)
}

# 4.2 Line plot for PC1 loadings
load_line_pc1 <- loadings_data %>% 
  select(MET, INDEX, PC1) %>% 
  mutate(
    MET = load_tr(loadings = PC1, threshold = 0.5, compound_name = MET)
  ) %>% 
  ggplot(aes(x = INDEX, y = PC1, label = MET)) +
  geom_line() +
  geom_text(fontface = "bold", position=position_nudge(), size = 2.5) +
  xlab("Compound Index") +
  ylab("Loadings")


# 4.3 Line plot for PC2 loadings
load_line_pc2 <- loadings_data %>% 
  select(MET, INDEX, PC2) %>% 
  mutate(
    MET = load_tr(loadings = PC2, threshold = 0.19, compound_name = MET)
  ) %>% 
  ggplot(aes(x = INDEX, y = PC2, label = MET)) +
  geom_line() +
  geom_text(
    fontface = "bold", position = position_nudge(), angle = -40, size = 2.5
    ) +
  xlab("Compound Index") +
  ylab("Loadings")


# 5 Save plots ---------------------------------------------------------------
ggsave("graphs/scree_plot.jpeg", plot = bar_pca)
ggsave("graphs/pca_plot.jpeg", plot = pca_plot)
ggsave("graphs/load_line_pc1.jpeg", plot = load_line_pc1)
ggsave("graphs/load_line_pc2.jpeg", plot = load_line_pc2)

I also divided this script into different sections. In the first section, in addition to checking for the necessary packages and importing the data, I source an R script that sets a global theme for all my plots (“global_theme.R” in the analysis folder). This is always useful if you’re working with ggplot2.

# Plot Global Theme -------------------------------------------------------

theme_set(
  theme_classic() +
    theme(
      axis.text.x = element_text(color = "black", size = 13),
      axis.text.y = element_text(color = "black", size = 13),
      axis.title = element_text(color = "black", size = 15)
    ) 
)

2.2.1 Scree Plot

Next, in the second section I made a scree plot, which is just a bar chart where each column represents the percentage of variance explained by each principal component. Taking the pca_summary into consideration, I made this plot only for the first six PCs.

# 2 Scree Plot ---------------------------------------------------------------
bar_pca <- per_var_pca[1:6,] %>% 
  ggplot(aes(x = as.factor(PC), y = PER_VAR)) +
  geom_col() + 
  xlab("PC") +
  ylab("% of total variance") 

It’s clear that PC1 and PC2 explain most of the variability.

bar_pca

Don’t forget: if the variability is spread more evenly across the principal components, e.g., something like PC1: 50%, PC2: 25% and PC3: 23%, a single two-dimensional score plot would not be an accurate representation of the data. In such cases it would be necessary to make a two-dimensional plot for each relevant combination of PCs, for example a plot of PC1 vs PC2 and another of PC1 vs PC3 or PC2 vs PC3. This way we could recognize other behavior patterns.
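
Building such an extra plot only requires changing the aesthetics; for example, a hypothetical PC1 vs PC3 score plot (shown only as a sketch, it is not part of the original scripts) could be:

# Hypothetical extra score plot for PC1 vs PC3
pca_plot_13 <- pca_data %>% 
  mutate(TIME = as_factor(TIME)) %>% 
  ggplot(aes(x = PC1, y = PC3, color = TIME)) +
  geom_point(size = 3) +
  xlab(paste0("PC1-", per_var_pca$PER_VAR[1], "%")) +
  ylab(paste0("PC3-", per_var_pca$PER_VAR[3], "%")) +
  scale_color_brewer(palette = "Dark2")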

2.2.2 Score Plot for PC1 and PC2

In the third section I made a scatter plot for PC1 and PC2 scores.

# 3 Score plot PC1 and PC2 ----------------------------------------------------
pca_plot <- pca_data %>% 
  mutate(TIME = as_factor(TIME)) %>% 
  ggplot(aes(x = PC1, y = PC2, color = TIME)) +
  geom_point(size = 3) +
  xlab(paste0("PC1-", per_var_pca$PER_VAR[1], "%")) +
  ylab(paste0("PC2-", per_var_pca$PER_VAR[2], "%")) +
  scale_color_brewer(palette = "Dark2")

This plot clearly shows differences between times of stress treatment.

pca_plot

The samples at 72 h are separated along the PC1 axis, while the samples at 24 h are separated along the PC2 axis. Which metabolites contribute to these differences? A plot of the loadings can help us.

2.2.3 Loadings Line Plot

In the fourth section of the script I made line plots for the metabolite loadings:

# 4 Loadings Plot -------------------------------------------------------------

# 4.1 Function to assign compound name if absolute value of loading is bigger
#     than set threshold
load_tr <- function(loadings, threshold, compound_name) {
  compound_name[!abs(loadings) > threshold] <- " "
  return(compound_name)
}

# 4.2 Line plot for PC1 loadings
load_line_pc1 <- loadings_data %>% 
  select(MET, INDEX, PC1) %>% 
  mutate(
    MET = load_tr(loadings = PC1, threshold = 0.5, compound_name = MET)
  ) %>% 
  ggplot(aes(x = INDEX, y = PC1, label = MET)) +
  geom_line() +
  geom_text(fontface = "bold", position=position_nudge(), size = 2.5) +
  xlab("Compound Index") +
  ylab("Loadings")


# 4.3 Line plot for PC2 loadings
load_line_pc2 <- loadings_data %>% 
  select(MET, INDEX, PC2) %>% 
  mutate(
    MET = load_tr(loadings = PC2, threshold = 0.19, compound_name = MET)
  ) %>% 
  ggplot(aes(x = INDEX, y = PC2, label = MET)) +
  geom_line() +
  geom_text(
    fontface = "bold", position = position_nudge(), angle = -40, size = 2.5
    ) +
  xlab("Compound Index") +
  ylab("Loadings")

Note that first I defined a function:

# 4.1 Function to assign compound name if absolute value of loading is bigger
#     than set threshold
load_tr <- function(loadings, threshold, compound_name) {
  compound_name[!abs(loadings) > threshold] <- " "
  return(compound_name)
}

This function replaces the metabolite name with a blank space if the absolute value of its loading is not bigger than a set threshold, so geom_text() will only show the metabolites with loadings above the threshold. The thresholds I set are similar to those in the paper, and you must be cautious about which loadings can be considered large or small. An alternative approach is simply to take the loadings table and sort by the absolute values, which also lets us see which metabolites have the biggest loadings.
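
That alternative takes only a couple of lines with dplyr; for example, to list the five metabolites with the largest absolute loadings on PC1 (a sketch, assuming loadings_data from the script above):

# Top 5 metabolites by absolute loading on PC1
loadings_data %>% 
  select(MET, PC1) %>% 
  arrange(desc(abs(PC1))) %>% 
  head(5)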

The loadings for PC1:

load_line_pc1

This plot shows only the two metabolites with the biggest loadings: lactate and sucrose.

The loadings for PC2:

load_line_pc2

This graph shows several metabolites with loadings bigger than the threshold: tryptophan, phenylalanine, glycerol, inositol, lysine, uric acid, tyrosine and sucrose.

2.3 Metabolite Behavior Patterns

Now that we know which compounds contributed to the differences between times, we can focus our attention on these metabolites and make a plot to see their behavior patterns over the time of stress treatment.

I wrote the following script to accomplish this (“met_plots.R” in the analysis folder):

# Scatterplots  for metabolites with the biggest loadings ------------------

# Packages
library(tidyverse)

# Set global theme
source("analysis/global_theme.R")

# 1 Import Data -----------------------------------------------------------
main_data <- read.csv("data/main_data.csv")

# 2 Scatter plot for metabolites with the biggest loadings for PC1 --------
met_plot_pc1 <- main_data %>% 
  filter(MET == "lactate" | MET == "sucrose") %>% 
  group_by(MET, TIME) %>% 
  summarise(MEAN_QT = mean(QT)) %>% 
  ggplot(aes(x = TIME, y = MEAN_QT, color = MET)) +
  geom_line(size = 1) +
  geom_point() +
  xlab("Time (h)") +
  ylab("Relative quantity")


# 3 Scatter plot for metabolites with the biggest loadings for PC2 --------
met_plot_pc2 <- main_data %>% 
  filter(MET == "tryptophan" | MET == "phenylalanine" | MET == "glycerol" |
         MET == "tyrosine" | MET == "inositol" | MET == "lysine" |
         MET == "uric acid") %>% 
  group_by(MET, TIME) %>% 
  summarise(MEAN_QT = mean(QT)) %>% 
  ggplot(aes(x = TIME, y = MEAN_QT, color = MET)) +
  geom_line(size = 1) +
  geom_point() +
  xlab("Time (h)") +
  ylab("Relative quantity")

# 4 Save plots
ggsave(filename = "graphs/met_plot_pc1.jpg", plot = met_plot_pc1)
ggsave(filename = "graphs/met_plot_pc2.jpg", plot = met_plot_pc2)

In the second section of the script I take just the data for lactate and sucrose, the metabolites with loadings bigger than the threshold (0.5), and obtain the average relative quantity per time:

# 2 Scatter plot for metabolites with the biggest loadings for PC1 --------
met_plot_pc1 <- main_data %>% 
  filter(MET == "lactate" | MET == "sucrose") %>% 
  group_by(MET, TIME) %>% 
  summarise(MEAN_QT = mean(QT)) %>% 
  ggplot(aes(x = TIME, y = MEAN_QT, color = MET)) +
  geom_line(size = 1) +
  geom_point() +
  xlab("Time (h)") +
  ylab("Relative quantity")
met_plot_pc1

It’s very clear that at 72 h the levels of these compounds increase a lot. This is probably the reason for the differences seen in the score plot for PC1 and PC2 (section 2.2.2 of this post).
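
Since there are three replicates per time, you could also show the variability around each mean; a possible sketch (not part of the original script) adds standard deviation error bars:

# Same plot with standard deviation error bars
main_data %>% 
  filter(MET %in% c("lactate", "sucrose")) %>% 
  group_by(MET, TIME) %>% 
  summarise(MEAN_QT = mean(QT), SD_QT = sd(QT), .groups = "drop") %>% 
  ggplot(aes(x = TIME, y = MEAN_QT, color = MET)) +
  geom_line(size = 1) +
  geom_point() +
  geom_errorbar(aes(ymin = MEAN_QT - SD_QT, ymax = MEAN_QT + SD_QT), width = 1) +
  xlab("Time (h)") +
  ylab("Relative quantity")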

For the loadings on PC2 I did the same, taking the compounds with values bigger than the threshold I set (0.19):

# 3 Scatter plot for metabolites with the biggest loadings for PC2 --------
met_plot_pc2 <- main_data %>% 
  filter(MET == "tryptophan" | MET == "phenylalanine" | MET == "glycerol" |
         MET == "tyrosine" | MET == "inositol" | MET == "lysine" |
         MET == "uric acid") %>% 
  group_by(MET, TIME) %>% 
  summarise(MEAN_QT = mean(QT)) %>% 
  ggplot(aes(x = TIME, y = MEAN_QT, color = MET)) +
  geom_line(size = 1) +
  geom_point() +
  xlab("Time (h)") +
  ylab("Relative quantity")
met_plot_pc2

In this graph I skipped sucrose because we already saw its pattern. Although this plot is a bit less clear, we can still see that at 24 h the levels of metabolites like glycerol and inositol increase while the levels of the aromatic amino acids decrease.
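
If the seven overlapping lines feel crowded, one option is to give each metabolite its own panel with facet_wrap(); a quick sketch that builds on the plot above:

# One panel per metabolite makes the individual patterns easier to read
met_plot_pc2 +
  facet_wrap(~ MET) +
  theme(legend.position = "none")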

In the paper, a batch-learning self-organizing map (BL-SOM) analysis was performed in conjunction with PCA, and the authors concluded that “the methylation cycle for the supply of methyl groups, the phenylpropanoid pathway for lignin production, and glycinebetaine biosynthesis are synergetically induced as a short-term response against salt-stress treatment”, and identified “the co-induction of glycolysis and sucrose metabolism as well as co-reduction of the methylation cycle as long-term responses to salt stress”.

In another post I’ll cover the BL-SOM topic with the same data.

That’s it! If you wish, you are free to clone the repository with the code and results of this R tutorial, including the experimental data simulation:

Principal Component Analysis to Many Responses.

Try to reproduce the analysis step by step, modify or improve the code. It’s all yours!
 

CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.
