Some ideas related to the reproducibility of data analysis in the life science/chemistry laboratory, as well as a tutorial on how to approach reproducible analysis with R and RStudio.
Data Analysis Reproducibility
This post contains several of my opinions on the topic, and most of the content serves as self-study notes and reminders. These are things I wish I had known at the beginning of my master’s degree, and later in my PhD, and that I have been learning and applying, however imperfectly. Perhaps, with some luck, the information presented here will be useful to readers in their daily work.
In my opinion, when we talk about the reproducibility of a scientific study, we think more about the reproducibility of methods or results, omitting, most of the time, the reproducibility of the data analysis. That is, documentation is rarely, if ever, provided with the software information and the step-by-step procedure for generating the tables, figures and statistical analysis in the study, and access to the raw data (the unprocessed experimental data) is practically non-existent. This represents a problem, also in my opinion, since scientific work in any area should rely on full transparency of the process.
To summarize, when talking about the reproducibility of a study, article, thesis, etc., three basic aspects should be covered: methods, results, and data analysis.
This can be supported using cloud services, such as Dropbox or similar, and GitHub, which together would have the following advantages:
- Facilitate replication of the study.
- Allow the revision, correction and improvement of the methods used in the study.
- Enable the application of the results if they are of great practical interest.
- Contribute to the generation of further research related to our topic of study.
- Facilitate learning for current and future generations.
The above points could be expanded or reduced depending on the researcher’s context, which may involve ethical issues of access to data, private funding in research, accessibility to the Internet or to the necessary tools and, of course, the time one is willing to invest in learning or training in these tools.
A Basic Outline for Reproducibility of Data Analysis
For reproducibility of data analysis, it is certainly possible to use R and RStudio. If the analysis is complemented with the use of GitHub or a similar service, this enables full openness and transparency of our data analysis. Anyone around the world will have the possibility to download and execute the code we use for the generation of figures, tables and statistical or other related analysis. It is worth mentioning that the availability of our analysis should be evaluated considering the nature of the research or the policies of the publisher, university, etc., where we make our results public.
Below, as an example, I provide a small “template” for a reproducible data analysis using R and RStudio. This example can be modified, updated and, of course, improved, which will depend on the researcher’s needs. At the end of this post, I also mention a couple of resources that will help to expand the information and tools to ensure the reproducibility of our analyses.
A good organization of the working directory, where we will place our files, data, and code, is a good starting point to ensure reproducibility of the data analysis. I also believe that this aspect should be reviewed from time to time during the progress of the investigation and as the number of files in the directory increases.
In the main directory, files and other directories can be organized as follows:
The description of each folder:
- data. Here we place the raw experimental data and the relevant information about them, such as the description of treatments, responses, experimental units, etc.
- data_products. This is where you store anything related to the processing of the raw data that may be needed later. This includes ANOVA tables, multiple comparisons tests, data summaries, etc.
- figures. This folder contains the figures generated by their respective R scripts.
- R. The code needed to generate tables and figures.
- tables. The tables generated by their corresponding R scripts.
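The folder layout described above can also be created from the R console. This is a minimal sketch using base R only; the helper name `create_project_dirs` is hypothetical and is not part of the downloadable example.

```r
# Hypothetical helper: create the five folders described above
# inside a project root (defaults to the current working directory).
create_project_dirs <- function(root = ".") {
  folders <- c("data", "data_products", "figures", "R", "tables")
  paths <- file.path(root, folders)
  for (path in paths) {
    # recursive = TRUE also creates `root` if it does not exist yet;
    # showWarnings = FALSE keeps the call quiet when a folder already exists.
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

create_project_dirs()
```

Because existing folders are left untouched, the call can be repeated safely as the project grows.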
For this organization I considered what is generally published in a scientific paper (mainly the folders figures and tables), and it can be modified or improved depending on the researcher’s needs. In addition to the folders, three files are included in the main directory:
- main_script.R. This script allows executing the scripts in the R folder and generating the TXT file with the information of the R session at the moment of performing the whole analysis. Optionally, it also allows deleting the content of figures, tables, and data_products, which makes it possible to perform the analysis from scratch.
- directory_example.Rproj. The name of this file can be changed to something representative of the analysis. It contains information related to the project, such as the encoding of the scripts, and double-clicking it opens the directory directly in RStudio.
- Session_Info.txt. Information about our R session at the time of analysis, such as our operating system and the packages we used.
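A script along the lines of main_script.R could look like the sketch below. It is not the file from the repository: the variable names (`clean_start`, `output_dirs`) and the choice to run every `.R` file in the R folder in alphabetical order are assumptions made for illustration.

```r
# Sketch of a main script, meant to be run from the project root.

# Set to TRUE to delete previous outputs and rerun everything from scratch.
clean_start <- FALSE
output_dirs <- c("figures", "tables", "data_products")

if (clean_start) {
  old_files <- list.files(output_dirs, full.names = TRUE)
  file.remove(old_files)
}

# Run every script in the R folder, in alphabetical order.
scripts <- sort(list.files("R", pattern = "\\.R$", full.names = TRUE))
for (script in scripts) {
  source(script)
}

# Save the R version, operating system and loaded packages.
writeLines(capture.output(sessionInfo()), "Session_Info.txt")
```

If the scripts depend on one another, numbering the file names (for example `01_`, `02_`) is one simple way to control the order in which they run.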
This directory is available for download on GitHub at the following link: directory_example. Once downloaded, you can explore the contents of each folder. The directory contains what is needed to perform an analysis of simulated data (data_main.csv), which includes a graph and a table with means and standard deviations. Because it is not always easy to find real data for learning and/or teaching, I also include, for didactic purposes, the code used to simulate the data (data_simulation.R).
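To give an idea of what such a simulation and summary can look like, here is a small sketch. It is not the content of data_simulation.R: the column names (`treatment`, `response`), the group means and the file names `data_simulated.csv` and `summary_table.csv` are all assumptions chosen for illustration.

```r
set.seed(123)  # fixed seed so the simulated values are the same on every run

# Three hypothetical treatments with five experimental units each.
treatments <- c("control", "treatment_1", "treatment_2")
data_sim <- data.frame(
  treatment = rep(treatments, each = 5),
  response  = rnorm(15, mean = rep(c(10, 12, 15), each = 5), sd = 1)
)

# Means and standard deviations per treatment.
group_means <- tapply(data_sim$response, data_sim$treatment, mean)
group_sds   <- tapply(data_sim$response, data_sim$treatment, sd)
summary_table <- data.frame(
  treatment = names(group_means),
  mean      = as.numeric(group_means),
  sd        = as.numeric(group_sds)
)

# Save the simulated raw data and the summary in the folders described above.
dir.create("data", showWarnings = FALSE)
dir.create("tables", showWarnings = FALSE)
write.csv(data_sim, file.path("data", "data_simulated.csv"), row.names = FALSE)
write.csv(summary_table, file.path("tables", "summary_table.csv"), row.names = FALSE)
```

Setting the seed is what makes a simulation like this reproducible: anyone who runs the script gets the same data and therefore the same table.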
Create a project with RStudio from scratch
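In RStudio, start a new project (for example, from the File menu, by clicking on New Project). Once this is done, a new window will open. Double click on New Directory and then on New Project. Finally, select the folder where our new project will be located and write a representative name.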
Done, now we can create new folders and files in our new directory.
Connect RStudio with GitHub
The following steps are only necessary the first time we connect RStudio to GitHub:
- Create a free GitHub account.
- Install Git on your computer.
- Tell Git our user information. Windows users can simply open a Windows PowerShell window and type the following (Mac users type the same, but in a terminal opened with Terminal.app):
git config --global user.name 'Your user name'
git config --global user.email 'email@example.com'
git config --global --list
- Make sure to tell RStudio to use Git. Position the cursor on the Tools tab and click on Global Options:
- In this new window we go to Git/SVN. The path in the Git executable box must end with git.exe. Otherwise, click on Browse… and select the file. It is also highly recommended to check the Enable version control interface checkbox:
- To connect RStudio with GitHub, we will need the usethis and gitcreds packages. Once installed, type the following in the R console, which will direct us to the GitHub page where we will be asked to log in with our account and password:
- Once this is done, we will be able to generate a token. To register it in RStudio we use the following code:
It will ask us to enter the token; just copy and paste it. Press Enter and that’s it, now RStudio is connected to your GitHub account.
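The last two steps can be sketched with the following calls. `usethis::create_github_token()` and `gitcreds::gitcreds_set()` are existing functions in those packages, but treating them as the exact code used in this tutorial is an assumption. Both are interactive (one opens a browser, the other prompts for the token), so the sketch only runs them in an interactive session.

```r
# install.packages(c("usethis", "gitcreds"))  # run once if not installed

if (interactive()) {
  # Opens the GitHub page for creating a personal access token.
  usethis::create_github_token()

  # Prompts for the token and stores it in the Git credential store.
  gitcreds::gitcreds_set()
}
```

The token works like a password, so it should not be saved in scripts or committed to the repository.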
Create a new directory and upload it to GitHub
Okay, now let’s upload our new project to GitHub through RStudio. When creating a new project with RStudio, before clicking on Create Project we must check the box Create a git repository:
Then we follow the next steps:
- Once we have created new subdirectories and files, in RStudio we click on the Git tab and then on Commit:
- A new window will open. First, we must check the Stage checkboxes in the folders and files we want to upload to GitHub:
- In the Commit message box, we write something representative and click on Commit:
- A new window will open with information about the changes and additions in our first commit. Close this small window and type:
Which will upload all the content of our first commit to GitHub:
- Step four will only be necessary for our first commit. Subsequently, each time we make changes that we consider important in our directory, we will have to repeat steps one to three and then just click on Push:
Which will upload the entire contents of the corresponding commit to GitHub.
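For the first upload in step four, one option that fits the description is `usethis::use_github()`. This is an assumption: other routes, such as adding a remote with git in a terminal, lead to the same result. The call needs the token registered earlier, so the sketch only runs it in an interactive session.

```r
if (interactive()) {
  # Creates a repository on GitHub for the current project, sets it as the
  # remote "origin" and pushes the existing commits.
  usethis::use_github()
}
```

After this first upload, the Push button in the Git tab is enough for later commits.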
All the above will allow us to have a record in our project of the different changes, updates, corrections, etc., and will make the content available to anyone who decides to explore or download it.
Limitations in the Reproducibility of Data Analysis
Any data analysis we perform today may no longer work a couple of years from now. R and the packages that extend its functionality are updated from time to time, and these updates can break our code. In the best case, it will be enough to correct a couple of lines, but in the worst case, we may not be ready or have enough time to make more extensive changes.
A solution to this problem can be the renv package, which will help us to create reproducible projects by keeping a record of the packages and versions we use to perform the analysis. This will make us less dependent on future updates.
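A typical renv workflow uses three calls, sketched below. `renv::init()`, `renv::snapshot()` and `renv::restore()` are functions from the renv package; when exactly to call each one in a given project is a judgment call, and the comments only describe the usual pattern.

```r
# install.packages("renv")  # run once if renv is not installed

# These calls change the project, so they are meant to be typed in the
# console one at a time rather than sourced as a script.
if (interactive()) {
  # 1. Once per project: create a project library and the renv.lock file.
  renv::init()
}

# 2. After installing or updating packages, record the versions in use:
#    renv::snapshot()
# 3. Later, or on another computer, reinstall the recorded versions:
#    renv::restore()
```

Committing the renv.lock file to GitHub along with the code lets anyone who downloads the project restore the same package versions.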
Another approach is Code Ocean, which will give us the possibility to create packages called “capsules” containing code, data, environment, and analysis results. Using this platform may require an investment of extra time to become familiar with its use if we are beginners, in addition to the fact that a basic account has several limitations in terms of storage and computation time. Personally, I think the R + RStudio + GitHub + renv combo can help us cover the reproducibility of our analysis at a basic level. More complex projects may require more complete implementations such as those offered by Code Ocean.
Some Resources for Further Study
The reproducibility of data analysis can be supported by other tools such as R Markdown, but trying to address everything related to the topic would require more of a book than a post. Fortunately, there are resources that can help us expand our knowledge as well as the tools at our disposal. Here are some of these resources:
- Principles, Statistical and Computational Tools for Reproducible Data Science. An edX course that can be taken free of charge.
- Happy Git and GitHub for the useR. A complete guide on how to use R and RStudio in combination with git and GitHub.
- Reproducible Research with R and RStudio (Third Edition). This book also discusses the use of R Markdown for reproducible analysis and reporting.
- Reproducible Research and Data Analysis. Recommended reading to learn more about the reproducibility of research and data analysis.
Well done! Thanks a lot for visiting this site. I hope you find the content of this post useful. See you soon!
Juan Pablo Carreón Hidalgo 🤓
The text and code in this tutorial are under the Creative Commons Attribution 4.0 International License.