A brief R tutorial about how to perform a data analysis to compare four treatments data. This is achieved by fitting a analysis of variance model, summarizing and making some nice graphs for the data.

 

1 Problem

You want to investigate the effect of light on plant growth. For this you designed an experiment that includes four different light sources. For each source you placed six plants (you replicate each treatment six times) in chambers with a specific light bulb. You kept the plants in their chambers for a fixed time and then you quantified the chlorophyll of each plant.

At the end your experimental data may look like this:

SL CLQ
SL1 21.0
SL1 27.7
SL1 19.7
SL1 23.7
SL1 28.6
SL1 21.7
SL2 34.3
SL2 33.8
SL2 34.8
SL2 36.1

I’m only showing you the first ten data points, but remember that in total we have 24 chlorophyll measurements.

The next step is to make a summary table with means and standard deviations, some plots and of course perform and inferential statistic analysis to establish significant differences between light source treatments.

2 Solution

First, as always, I must show you my file organization:

  • The analysis folder contains all my R scripts.
  • The data folder contains the experimental data and all CSV and TXT files products of my analysis scripts.
  • The graph folder contains all the plot results of my analysis scripts.

These three folders are contained in a main folder with a representative name.

2.1 Data Summary

For make a summary table with means and standard deviations I used the following R script:

# Data summary ------------------------------------------------------------

if(!"dplyr" %in% .packages()) library(dplyr)

# 1 Import data -----------------------------------------------------------
main_data <- read.csv("data/main_data.csv")

# 2 Data summary ---------------------------------------------------------
sum_data <- main_data %>% 
  group_by(SL) %>% 
  summarise(
    MEAN = mean(CLQ),
    SD   = sd(CLQ),
    CV   = SD * 100 / MEAN
  )

# 3 Save summary ---------------------------------------------------------
write.csv(sum_data, file = "data/summary_data.csv", row.names = FALSE)

I followed the usual steps that I used in all my analysis: import data > make analysis > export results. The line of code if(!"dplyr" %in% .packages()) library(dplyr) checks if the necessary packages are already loaded, you can omit the if statement while working on your own.

As can you see in the second step, it is very easy to make this kind of tables using the dplyr package:

With group_by() I specified to perform the next operations, using summarise(), grouping the data by light source treatment, so this way I obtained four means, four standard deviations and four coefficients of variance.

You can see the results in the data folder or by typing sum_data:

sum_data
SL MEAN SD CV
SL1 23.73 3.67 15.46
SL2 35.28 2.23 6.33
SL3 21.38 2.31 10.79
SL4 18.88 1.81 9.60

2.2 One Way Analysis of Variance Analysis

Now it is time to establish if there are significant differences between treatments. For this I performed an one-way analysis of variance.

I used the following R script:

# Analysis of Variance ----------------------------------------------------

# 1 Import data -----------------------------------------------------------
main_data <- read.csv("data/main_data.csv")

# 2 AOV analysis ----------------------------------------------------------

# 2.1 Fit AOV model
data_aov <- aov(CLQ ~ SL, data = main_data)

# 2.2 ANOVA table 
anova_table <- anova(data_aov)

# 3 Save results ----------------------------------------------------------
write.csv(
  anova_table, file = "data/ANOVA_table.csv", 
  row.names = FALSE, na = ""
  )

The code aov(CLQ ~ SL, data = main_data) fits an analysis of variance model using the experimental data and the relation pointed by CLQ ~ SL.

To make the classic ANOVA table you need the function anova(). I made this with the code anova_table <- anova(data_aov) and finally exported it using the function write.csv().

You can see the result typing or looking in the data folder.

anova_table
Df Sum Sq Mean Sq F value Pr(>F)
SL 3 946.3 315.4 46.6 0
Residuals 20 135.3 6.8

At this case the p-value (in the Pr(>F) column) is so small that is rounded to zero (strictly, there is no p-values equal to zero).
So, now that you’ve established that there are significant differences, which treatment means are different? The ANOVA table doesn’t give any information about this question.

2.3 Multiple Mean Comparisons (Tukey HSD test)

To establish differences between treatments I performed a Tukey HSD test.

I used the next R script:

# Multiple comparisons: Tukey --------------------------------------------

if(!"agricolae" %in% .packages()) library(agricolae)

# 1 Import data -----------------------------------------------------------
main_data <- read.csv(file = "data/main_data.csv")

# 2 HSD test --------------------------------------------------------------

# 2.1 Fit AOV model
data_aov <- aov(CLQ ~ SL, data = main_data)

# 2.2 HSD test on aov model
hsd_test <- HSD.test(data_aov, trt = "SL")

# 3 Save results

# 3.1 Groups from hsd_test
hsd_groups <- data.frame(
  SL = row.names(hsd_test$groups), 
  groups = hsd_test$groups[,2]
  )

# 3.2 Save as a data frame
write.csv(hsd_groups, file = "data/hsd_groups.csv", row.names = FALSE)

The package agricolae contains the useful function HSD.test(). You can see the result of this test by typing hsd_test:

hsd_test
## $statistics
##    MSerror Df     Mean       CV     MSD
##   6.763917 20 24.82083 10.47811 4.20273
## 
## $parameters
##    test name.t ntr StudentizedRange alpha
##   Tukey     SL   4         3.958293  0.05
## 
## $means
##          CLQ      std r  Min  Max    Q25   Q50    Q75
## SL1 23.73333 3.668606 6 19.7 28.6 21.175 22.70 26.700
## SL2 35.28333 2.233756 6 33.3 39.4 33.925 34.55 35.775
## SL3 21.38333 2.306874 6 17.4 23.6 20.800 21.50 23.175
## SL4 18.88333 1.812641 6 16.1 21.3 17.950 19.30 19.675
## 
## $comparison
## NULL
## 
## $groups
##          CLQ groups
## SL2 35.28333      a
## SL1 23.73333      b
## SL3 21.38333     bc
## SL4 18.88333      c
## 
## attr(,"class")
## [1] "group"

The test is performed using an alpha = 0.05 (you can change this within the function). There are other results that maybe could be useful, but at this case I just save the information related to the groups:

# 3 Save results

# 3.1 Groups from hsd_test
hsd_groups <- data.frame(
  SL = row.names(hsd_test$groups), 
  groups = hsd_test$groups[,2]
  )

# 3.2 Save as a data frame
write.csv(hsd_groups, file = "data/hsd_groups.csv", row.names = FALSE)

The hsd_table contains information about the groups treatments. In this column a different letter indicates a significant difference between treatment means at the specified alpha.

hsd_groups
SL groups
SL2 a
SL1 b
SL3 bc
SL4 c

Following alphabetical order the letter “a” indicates that SL2 mean is greater than SL1, SL3 and SL4 mean treatments. The interpretation of the remain letters can be made the same way. The combination “bc” in the treatment SL3 indicates that its mean cannot be considered different from SL1 and SL4 means. That’s weird! Remember that we are talking about statistical differences, so you can expect this kind of results in many of your experiments.

2.4 Plots

Finally, plots time. For this kind of experiments it is quite common to represent the magnitude of each treatment mean as a bar with its standard deviation as intervals. Anyway, I present other alternatives to represent the data that could be useful.

2.4.1 Bar Plot

To make the bar chart I used the next R script:

# Means bar plot   --------------------------------------------------------

if(!"ggplot2" %in% .packages()) library(ggplot2)

# Global theme
source("analysis/theme.R")

# 1 Import data ----------------------------------------------------------

# 1.1 Summary data
sum_data <- read.csv("data/summary_data.csv")

# 1.2 HSD groups
groups <- read.csv("data/hsd_groups.csv")

# 1.3 Join summary and groups data frames
sum_data <- merge(sum_data, groups, by = "SL")

# 2 Bar chart
bar_plot <- ggplot(sum_data, aes(x = SL, y = MEAN, label = groups)) +
  geom_col(width = 0.5, color = "black", fill = "royalblue1") +
  geom_errorbar(aes(ymin = MEAN - SD, ymax = MEAN + SD), width = 0.1) +
  geom_text(nudge_y = 5, size = 5) +
  xlab("Light source type") +
  ylab("Chlorophyll (mg/g)")

# 3 Save plot
ggsave(filename = "graphs/bar_plot.jpeg", bar_plot)

I just used the summary table made it in the previous section. I joined the groups information to the summary using the function merge(): sum_data <- merge(sum_data, groups, by = "SL").

You can notice that I used the ggplot2 package and their functions. geom_col() is in charge of representing the means as bars, while geom_errorbar() represents the standard deviations as intervals. geom_text() uses the groups information and places it above each bar.

The code source("analysis/theme.R") runs the next script, that sets a global theme used in all my graphs:

# Set theme for all graphs ------------------------------------------------

theme_set(
  theme_classic() +
    theme(axis.text.x = element_text(color = "black", size = 13),
          axis.text.y = element_text(color = "black", size = 13),
          axis.title = element_text(color = "black", size = 15))
)

The bar chart can be seen typing bar_plot and is saved in the graphs folder.

bar_plot

2.4.2 Scatter Plot

With a scatter graph we can represent each measurement as a point.

# Scatter plot  -------------------------------------------------------------

if (!"ggplot2" %in% .packages()) library(ggplot2)

# Global theme
source("analysis/theme.R")

# 1 Import data -------------------------------------------------------------

# 1.1 Experimental data
main_data <- read.csv("data/main_data.csv")

# 1.2 Summary data
sum_data <- read.csv("data/summary_data.csv")

# 1.3 HSD groups
groups <- read.csv("data/hsd_groups.csv")

# 1.4 Join groups and summary
groups <- merge(groups, sum_data, by = "SL")

# 2 Point plot  -------------------------------------------------------------
scatter_plot <- ggplot(main_data, aes(x = SL, y = CLQ)) +
  geom_point() +
  stat_summary(fun = mean, geom = "crossbar", width = 0.2, col = "blue") +
  geom_text(data = groups, 
            aes(x = SL, y = MEAN, label = groups), nudge_x = 0.2, 
            size = 5) + 
  xlab("Light source type") +
  ylab("Chlorophyll (mg/g)")


# 3 Save plot -------------------------------------------------------------
ggsave(filename = "graphs/scatter_plot.jpeg", scatter_plot)

I made this plot in a similar way than the bar chart. I used the ggplot2 functions geom_point() to represent each measurement as a point, stat_summary() to add a blue bar representing each treatment mean, and geom_text() to put the letter group a little to the right of each mean bar. The plot looks like this:

scatter_plot

2.4.3 Bar and Scatter Plot

Using ggplot2 you can even combine the both previous plots:

# Bar with points plot ----------------------------------------------------

if (!"ggplot2" %in% .packages()) library(ggplot2)

# Global theme
source("analysis/theme.R")

# 1 Import data -------------------------------------------------------------

# 1.1 Experimental data
main_data <- read.csv("data/main_data.csv")

# 1.2 Summary data
sum_data <- read.csv("data/summary_data.csv")

# 1.3 HSD groups
groups <- read.csv("data/hsd_groups.csv")

# 1.4 Join groups and summary
sum_data <- merge(groups, sum_data, by = "SL")


# 2 Bar point plot --------------------------------------------------------
bar_scatter_plot <- ggplot(sum_data, aes(x = SL, y = MEAN)) +
  geom_col(
    width = 0.5, color = "black", fill = "royalblue1"
    ) +
  geom_errorbar(
    aes(ymin = MEAN - SD, ymax = MEAN + SD),  width = 0.1
    ) +
  geom_text(
    aes(label = groups), nudge_y = 6.5, size = 5
    ) +
  geom_point(
    data = main_data, aes(x = SL, y = CLQ), color = "red", size = 1.5
    ) +
  xlab("Light source type") +
  ylab("Chlorophyll (mg/g)")

# 3 Save plot ---------------------------------------------------------------
ggsave(filename = "graphs/bar_scatter_plot.jpeg", bar_scatter_plot)

This code produces the next plot:

bar_scatter_plot

Wow! That’s it! If you wish, you are free to clone the repository with the code and results of this R tutorial, including the experimental data simulation:

More than Two Treatments Comparison

Try to reproduce the analysis step by step, modify or improve the code. It’s all yours!
 

CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0