Statistical inference concepts explained using R.

 

Perfection is always impossible; it’s always an approximation

1 Introduction

Formally, statistical inference can be defined as the process through which inferences about a population are made based on certain statistics calculated from a sample of data drawn from that population. What does this mean?

When you consult the scientific literature, you may have read statements like “these results showed significant differences (P < 0.05)”. When someone says “there is a statistically significant difference”, does it really mean something?

I’m going to try to answer these and other questions with some help, of course, from R code.

2 Populations, Samples, Parameters and Statistics

According to the Cambridge Dictionary, an inference is a guess that you make or an opinion that you form based on the information that you have. Statistically, the objective of an inference is to draw conclusions about a population from a sample.

By population I mean the complete set of objects of your interest. This of course depends on your particular interest(s): the population can be the trees of a particular species in a forest, mice, cell cultures, and even the elements of a continuous process like chip cookies in a cookie factory (yeah, I like chip cookies). Most of the time you don’t have access to the entire population, simply because it is too big and it would be impractical to analyze each element, or because, as I said, the population can be considered continuous.

So what’s next? Don’t worry, this is where statistical inference can help. All you need is a sample (a portion of the population). For this you must be sure that your sampling process gives each element in the population the same chance of being selected. Once you have your sample you may be interested in some population property, so you proceed to measure this feature on each sample element. This property can be weight, height, glucose levels, gene expression, etc.

Finally, you calculate numbers that summarize the information in your sample, usually the mean, the variance, and the standard deviation. These numbers are referred to as statistics and, with some assumptions, you can use them to draw conclusions about the corresponding numbers in the population (the parameters). See the next figure, where I illustrate the whole process with cookies:

Note that when we talk about properties, these can also include things like color, flavor, status (sick/not sick), gender (male/female), etc. Depending on the measurement scale, your inference assumptions and methods may change. In this explanation I will work only with numbers on a ratio scale.

You can calculate the sample mean with the following equation, which is simply the sum of all the sample elements divided by the sample size:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Here n is the sample size and i is the index of each element in your sample.

For the sample variance you use the following equation:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

This is the sum of the squared differences between each sample element and the sample mean; each difference is squared so that negative and positive differences do not cancel each other out. Then you divide the sum by the sample size minus one. If you are wondering why you divide by n - 1, I can briefly tell you that this gives a better estimate of the population variance. You can also check out this nice page: Why divide by N - 1 when calculating the variance of a sample?. To obtain the standard deviation you simply take the square root of the variance; this way you have a measure of the sample dispersion in the original measurement units.
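As a quick illustration, these statistics are easy to compute in R. The five diameters below are the same sample used later in the null hypothesis section; the step-by-step sums just mirror the equations above:

# Sample statistics for five cookie diameters (cm)
samp_n5 <- c(4.9, 4.7, 4.8, 5.3, 5.1)

n <- length(samp_n5)                                  # sample size
samp_mean <- sum(samp_n5) / n                         # same result as mean(samp_n5)
samp_var  <- sum((samp_n5 - samp_mean)^2) / (n - 1)   # same result as var(samp_n5)
samp_sd   <- sqrt(samp_var)                           # same result as sd(samp_n5)

c(mean = samp_mean, variance = samp_var, sd = samp_sd)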

Again, when you obtain these numbers the objective is usually to estimate their population counterparts, which we call parameters. The population mean:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

And the population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

In both equations, N is the population size.

3 Distributions

A probability distribution is a function (a table, an equation, or a graph) that relates all the possible values in a population with their occurrence probabilities (or frequencies).

To be clear, let’s suppose that my cookie population has 1000 cookies and I’m interested in the diameter. The diameter frequency distribution may look like this:

On this plot you can see that most of the values are around 5 cm.
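The original cookies_diameter.csv file is not reproduced here, but a stand-in population with roughly that shape can be simulated; the mean of 5 cm and standard deviation of 0.4 cm are values I assume only for illustration:

# Simulated stand-in for the cookie population (values assumed for illustration)
library(ggplot2)
set.seed(1)

cookie_population <- data.frame(D = rnorm(1000, mean = 5, sd = 0.4))

ggplot(cookie_population, aes(x = D)) +
  geom_histogram(color = "black", fill = "white", binwidth = 0.1) +
  xlab("Cookie diameter (cm)") +
  ylab("Count") +
  theme_classic()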

Despite the fact that most of the time you won’t have this information, the concept of a distribution is important in statistical inference. Why? Because we can build a theoretical distribution and test whether our statistics belong to it.
Also be aware that we can build a distribution of means. What’s that? Imagine that from my cookie population I repeatedly take samples of size 5:

# Means distribution for n = 5 --------------------------------------------

# Packages
library(tidyverse)

# Data
cookies_diameter <- read.csv("data/cookies_diameter.csv")

# Take 1000 samples of size 5
n_samples <- 1000

mean_cookies <- vector("numeric", length = n_samples)
for (i in 1:n_samples) {
  mean_cookies[i] <- mean(sample(cookies_diameter$D, size = 5))
}

mean_cookies <- tibble(mean_n5 = mean_cookies)

# Histogram
freq_hist_mean <- mean_cookies %>% 
  ggplot(aes(x = mean_n5)) +
  geom_histogram(color = "black", fill = "white", binwidth = 0.1) +
  xlab("Cookie mean (n = 5) diameter (cm)") +
  ylab("Count") +
  theme_classic() 

# Save histogram
ggsave(filename = "graphs/diameter_mean_cookies_hist.jpg", 
       plot = freq_hist_mean)
freq_hist_mean

Instead of individual diameters, you now see the frequencies of diameter means from cookie samples of size 5; the same process can be repeated with other sample sizes. Remember this, it will be important in a later section (Central Limit Theorem).

3.1 The Normal Distribution

Since we almost never know the real distribution of the data (otherwise statistical inference wouldn’t be necessary), we need to resort to theoretical ones. There are a bunch of these theoretical distributions, and without a doubt the normal distribution is the most important. It is defined by the following density equation:

$$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$

This distribution applies to continuous data (length, weight, temperature, etc.), and a plot of the previous equation looks like this (with mean = 0 and sd = 1):
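If you want to draw this bell curve yourself, a minimal sketch with ggplot2 and the built-in dnorm function looks like this:

# Standard normal density curve (mean = 0, sd = 1)
library(ggplot2)

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  xlab("x") +
  ylab("Density") +
  theme_classic()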

3.2 Student’s t Distribution

When you have a small sample size and you don’t know the real standard deviation of the population (you probably won’t have this information), you can use the t distribution to make a good approximation. The shape of this distribution depends on the sample size, and for big sample sizes the t distribution tends to the normal:

The black line corresponds to a sample size of 2, the red line to a sample size of 4, the blue one to a size of 10, and the green one to a size of 50.
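A sketch that reproduces a plot like the one described, using the built-in dt function and assuming the degrees of freedom are the sample size minus one, could look like this:

# t densities for several sample sizes (df = n - 1); larger n looks more normal
library(ggplot2)

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dt, args = list(df = 1), color = "black") +  # n = 2
  stat_function(fun = dt, args = list(df = 3), color = "red") +    # n = 4
  stat_function(fun = dt, args = list(df = 9), color = "blue") +   # n = 10
  stat_function(fun = dt, args = list(df = 49), color = "green") + # n = 50
  xlab("t") +
  ylab("Density") +
  theme_classic()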

4 The Null Hypothesis

Now it is time to make some inferences: you have your sample and the statistics from it. The next step is to build a theoretical distribution and find out whether our data belong (or not) to this distribution. Imagine that you want to know if the population mean of the cookie diameters is greater than 4.5 cm. Just for illustrative purposes, let’s suppose that you know the population standard deviation (0.4) and you also know that the data distribution is normal.

First, you build the null distribution of sample means: simulate a population of diameters with mean 4.5 and standard deviation 0.4, then compute the means of many samples of size 5 drawn from it:

# Cookies diameter null distribution -------------------------------------------------------

set.seed(5)

# Packages
library(ggplot2)

# Diameter data (mean = 4.5 and sd = 0.4)
diameter <- rnorm(1000, mean = 4.5, sd = 0.4)

# Means of samples of size 5
n_samples <- 1000

means_n5 <- vector("numeric", length = n_samples)
for (i in 1:n_samples) {
  means_n5[i] <- mean(sample(diameter, 5))
}

# Histogram
hist_null <- ggplot(data = data.frame(mean = means_n5), aes(mean)) +
  geom_histogram(color = "black", fill = "white", binwidth = 0.1) +
  xlab("Cookie mean (n = 5) diameter (cm)") +
  ylab("Count") +
  theme_classic() 

# Save histogram
ggsave(filename = "graphs/hist_null_cookies.jpg", plot = hist_null)

Now you want to know the proportion of these means with a value greater than or equal to the sample mean:

samp_n5 <- c(4.9, 4.7, 4.8, 5.3, 5.1)
mean_sample <- mean(samp_n5)

p_value <- mean(means_n5 >= mean_sample)
p_value
## [1] 0.005

On the null distribution, the sample mean can be indicated with a vertical red line:

hist_null + geom_vline(xintercept = mean_sample, color = "red", size = 1)

In practice, you just need the t.test function to obtain the p-value:

ttest_sample <- t.test(samp_n5, mu = 4.5, alternative = "greater")
ttest_p_value <- ttest_sample$p.value
ttest_p_value
## [1] 0.006469959

4.1 Null Hypothesis for the difference between two means

Now imagine that you have performed an experiment with mice. You want to know if the consumption of a high-fat diet increases their weight. The data look like this:

diet weight
1 28.63
1 27.54
1 16.98
1 22.28
1 28.84
1 19.59
1 20.11
1 19.46
1 20.86
1 22.55
1 26.91
1 18.79
2 14.68
2 18.36
2 14.71
2 18.44
2 16.61
2 10.26
2 19.96
2 17.96
2 22.60
2 22.77
2 24.87
2 21.83

Diet 1 corresponds to the high-fat diet and diet 2 is a regular diet.

To establish whether there is a statistical difference, we first build a null distribution starting from the assumption that there is no difference between the two diet means.

I simulated data for the regular diet from a normal distribution, then took two samples of size 12, calculated their means, and computed the difference between the means. I repeated this process 10000 times with a for loop:

# Mice weight null distribution -------------------------------------------------------

set.seed(5)

# Packages
library(ggplot2)

#  Weight population for regular diet (N = 250, mean = 20 and sd = 4)
weight_regular_diet <- rnorm(250, mean = 20, sd = 4)

# Differences for null hypothesis
n_differences <- 10000

diffs <- vector("numeric", length = n_differences)
for (i in 1:n_differences) {
  regular_diet <- sample(weight_regular_diet, 12)
  high_fat_diet <- sample(weight_regular_diet, 12)
  diffs[i] <- mean(high_fat_diet) - mean(regular_diet)
}

# Histogram
hist_null_mice <- ggplot(data = data.frame(diffs = diffs), aes(diffs)) +
  geom_histogram(color = "black", fill = "white", binwidth = 0.5) +
  xlab("Weight differences (g)") +
  ylab("Count") +
  theme_classic() 

# Save histogram
ggsave(filename = "graphs/hist_null_mice.jpg", plot = hist_null_mice)

You can see the frequency distribution:

hist_null_mice

I calculated the real difference between the means, and put this value on the null distribution as a vertical red line:

hf_data <- c(28.63, 27.54, 16.98, 22.28, 28.84, 19.59,
             20.11, 19.46, 20.86, 22.55, 26.91, 18.79) # high fat diet
r_data  <- c(14.68, 18.36, 14.71, 18.44, 16.61, 10.26, 
             19.96, 17.96, 22.60, 22.77, 24.87, 21.83) # regular diet
real_diff <- mean(hf_data) - mean(r_data)

hist_null_mice + geom_vline(xintercept = real_diff, color = "red", size = 1)

To calculate the p-value:

p_value_diets <- mean(diffs >= real_diff)
p_value_diets
## [1] 0.0052

So only about 0.5% of the differences in the null distribution are greater than or equal to real_diff.

If we perform a t-test:

p_value_diets_t <- t.test(hf_data, r_data, alternative = "greater")$p.value
p_value_diets_t
## [1] 0.01200004

5 Confidence Intervals

Statistical significance doesn’t mean practical significance. With large enough sample sizes you can detect statistical significance even with small differences in your response variables. Is this relevant? It depends on your theory or your discipline’s context. Maybe you are working with an expensive active compound, where small improvements in its production can be relevant; by contrast, if you are interested in mice weight, a change of 1 mg probably wouldn’t be important.

Rather than just reporting p-values, you can consider including effect sizes, confidence intervals, or percentage changes with respect to the control. If you are using the t.test function, its results will include a confidence interval for the difference between means.

t_test_mice <- t.test(hf_data, r_data)
conf_int_95 <- t_test_mice$conf.int
conf_int_95
## [1] 0.5963027 7.6520306
## attr(,"conf.level")
## [1] 0.95

A 95% confidence interval is an interval built with a procedure that, over many repeated samples, will contain the parameter we are estimating (in this case the difference between means) 95% of the time. You can change the percentage with the conf.level argument. For example, if you want a 90% confidence interval:

conf_int_90 <- t.test(hf_data, r_data, conf.level = 0.9)$conf.int
conf_int_90
## [1] 1.203136 7.045197
## attr(,"conf.level")
## [1] 0.9

For a 95% confidence interval for the cookie diameter mean:

cookie_conf_int_95 <- t.test(samp_n5)$conf.int
cookie_conf_int_95
## [1] 4.660968 5.259032
## attr(,"conf.level")
## [1] 0.95
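The interval that t.test reports follows the usual formula: the sample mean plus and minus a t quantile times the standard error of the mean. As a sketch, the same 95% interval for the cookie sample can be computed by hand:

# 95% confidence interval for the cookie diameter mean, computed by hand
samp_n5 <- c(4.9, 4.7, 4.8, 5.3, 5.1)

n  <- length(samp_n5)
se <- sd(samp_n5) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # t quantile for a 95% interval

mean(samp_n5) + c(-1, 1) * t_crit * se  # should match t.test(samp_n5)$conf.int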

6 Central Limit Theorem

The central limit theorem states that even if the original data don’t follow a normal distribution, the distribution of averages calculated from that data will tend to follow a normal distribution when the sample size is big enough.

First, let’s simulate data from a non-normal distribution:

# Data from non normal distribution ---------------------------------------

set.seed(5)

# Packages
library(ggplot2)

# Simulate data from a uniform distribution
non_normal_data <- data.frame(x = runif(10000, max = 100)) # uniform distribution

# Save data
write.csv(non_normal_data, file = "data/data_non_normal.csv")

# Histogram
non_normal_hist <- ggplot(non_normal_data, aes(x)) +
  geom_histogram(color = "black", fill = "white") +
  ylab("Count") +
  theme_classic()

# Save histogram
ggsave(filename = "graphs/non_normal_hist.jpg", plot = non_normal_hist)

I made a frequency histogram:

non_normal_hist

If we take 10000 samples of size 5, calculate their averages, and make a histogram:

set.seed(5)

sample_average_n5 <- vector("numeric", length = 10000)
for (i in 1:10000) {
  sample_average_n5[i] <- mean(sample(non_normal_data$x, size = 5))
}

av_samps_n5_hist <- ggplot(data = data.frame(x=sample_average_n5), aes(x)) +
  geom_histogram(color = "black", fill = "white") +
  ylab("Count") +
  xlab("Averages of size 5") +
  theme_classic()

av_samps_n5_hist

The same process can be repeated for a sample size of 10:

set.seed(5)

sample_average_n10 <- vector("numeric", length = 10000)
for (i in 1:10000) {
  sample_average_n10[i] <- mean(sample(non_normal_data$x, size = 10))
}

av_samps_n10_hist <- ggplot(data = data.frame(x=sample_average_n10), aes(x)) +
  geom_histogram(color = "black", fill = "white") +
  ylab("Count") +
  xlab("Averages of size 10") +
  theme_classic()

av_samps_n10_hist

And for a sample size of 30:

set.seed(5)

sample_average_n30 <- vector("numeric", length = 10000)
for (i in 1:10000) {
  sample_average_n30[i] <- mean(sample(non_normal_data$x, size = 30))
}

av_samps_n30_hist <- ggplot(data = data.frame(x=sample_average_n30), aes(x)) +
  geom_histogram(color = "black", fill = "white") +
  ylab("Count") +
  xlab("Averages of size 30") +
  theme_classic()

av_samps_n30_hist

Note that the means are almost the same for these distributions:

distribution_means <- data.frame(
  Distribution = c("Original data", "Samples of size 5", 
                   "Samples of size 10", "Samples of size 30"),
  Average = c(mean(non_normal_data$x), mean(sample_average_n5),
              mean(sample_average_n10), mean(sample_average_n30))
  )
distribution_means
##         Distribution  Average
## 1      Original data 50.18471
## 2  Samples of size 5 49.90794
## 3 Samples of size 10 49.94093
## 4 Samples of size 30 50.18709

And also note that the standard deviations decrease:

distribution_sds <- data.frame(
  Distribution = c("Original data", "Samples of size 5", 
                   "Samples of size 10", "Samples of size 30"),
  Standard_Deviation = c(sd(non_normal_data$x), sd(sample_average_n5),
                         sd(sample_average_n10), sd(sample_average_n30))
  )
distribution_sds
##         Distribution Standard_Deviation
## 1      Original data          28.917529
## 2  Samples of size 5          12.894585
## 3 Samples of size 10           9.069493
## 4 Samples of size 30           5.252030

This reduction is equivalent to dividing the standard deviation of the original data by the square root of the sample size (this quantity is known as the standard error of the mean):

# Standard deviation for samples of size 5
sd(non_normal_data$x) / sqrt(5)       
## [1] 12.93231
# Standard deviation for samples of size 10
sd(non_normal_data$x) / sqrt(10)       
## [1] 9.144526
# Standard deviation for samples of size 30
sd(non_normal_data$x) / sqrt(30)       
## [1] 5.279594

This also works for the differences between two means calculated from data that don’t belong to a normal distribution.
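As a quick check of this claim (not part of the original code, just a sketch reusing the non_normal_data simulated above), you can repeatedly take two samples of size 30 from the uniform data and store the difference between their means; the histogram looks roughly normal and centered at zero:

# Differences between means of two samples from the non-normal data
set.seed(5)

diff_means <- vector("numeric", length = 10000)
for (i in 1:10000) {
  group_1 <- sample(non_normal_data$x, size = 30)
  group_2 <- sample(non_normal_data$x, size = 30)
  diff_means[i] <- mean(group_1) - mean(group_2)
}

ggplot(data.frame(x = diff_means), aes(x)) +
  geom_histogram(color = "black", fill = "white", bins = 30) +
  xlab("Difference between sample means (n = 30)") +
  ylab("Count") +
  theme_classic()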

The central limit theorem allows us to use statistical methods that work for normal distributions with data that come from other kinds of distributions.

7 Types of Error and Power

When your p-values are not small enough, it’s because you don’t have enough power.

7.1 Type I Error

This type of error is committed when you reject the null hypothesis when you should not, so you say that there is a real difference when there really isn’t one. This error is also called a false positive, and we set its occurrence probability when we establish an α value. If we establish α = 0.05, on average 1 in 20 experiments in which the null hypothesis is true will still show significant differences despite the fact that there is no real difference.
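You can see this behavior with a small simulation (my own sketch, not part of the original analysis): draw two groups from the same population, so the null hypothesis is true by construction, run a t-test, and repeat; roughly 5% of the tests come out with p < 0.05 anyway:

# False positive rate when the null hypothesis is true
set.seed(5)

n_experiments <- 10000
p_values <- vector("numeric", length = n_experiments)

for (i in 1:n_experiments) {
  group_a <- rnorm(12, mean = 20, sd = 4)  # both groups come from the same
  group_b <- rnorm(12, mean = 20, sd = 4)  # population: no real difference
  p_values[i] <- t.test(group_a, group_b)$p.value
}

mean(p_values < 0.05)  # proportion of false positives, close to 0.05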

7.2 Type II Error

Setting a very small α value, something like α = 0.00000001, to reduce the possibility of committing a false positive could be a bad idea, because we must also consider the probability of committing another kind of error: the type II error, or false negative. A false negative occurs when we do not reject the null hypothesis when we should (there is a real difference).

We denote this probability with the Greek letter β.

7.3 Power Calculations

Power is the probability of doing the right thing and saying that there is a real difference when there actually is one (a true positive). This probability is calculated as 1 - β (one minus the probability of committing a false negative).

If we are talking about a difference between means, the power depends on the real difference, the standard deviation of the populations, and the sample size. Generally, power increases when the sample size increases, and power decreases if you set α values that are too small.
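You can see the first part of that statement with base R’s power.t.test function, holding the difference and standard deviation fixed and letting the sample size grow (the difference of 5 g and standard deviation of 6.3 g below are values I picked just for illustration):

# Power of a two-sample t-test as the sample size per group grows
sapply(c(5, 10, 20, 50), function(n) {
  power.t.test(n = n, delta = 5, sd = 6.3, sig.level = 0.05,
               type = "two.sample", alternative = "one.sided")$power
})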

You can use the pwr package to calculate the sample size. Let’s say that we want to detect differences in mice weight as big as 20 g as a result of feeding the mice a high-fat diet. From previous measurements we know that the standard deviation for mice weight is 6.3 g, and we have no reason to expect a different deviation for the mice fed the high-fat diet. We also set power = 0.99 and α = 0.05 (the sig.level argument):

# Sample size calculation -------------------------------------------------

# Packages
library(pwr)
# Effect size
difference <- 20   # difference between weight means (grams)
st_dv <- 6.3       # standard deviation for both treatments

effect_size <- difference / st_dv

# Sample size calculation
samp_size <- pwr.t.test(
  d = effect_size, sig.level = 0.05, 
  power = 0.99,  type = "two.sample", alternative = "greater"
  )
samp_size
## 
##      Two-sample t test power calculation 
## 
##               n = 4.072102
##               d = 3.174603
##       sig.level = 0.05
##           power = 0.99
##     alternative = greater
## 
## NOTE: n is number in *each* group

Always round up the n value in the results, so you need at least 5 mice per group to establish a significant difference of 20 g with power = 0.99 and α = 0.05. Note that you cannot do this calculation from scratch, as you need some prior information, in this case the standard deviation of the weight of mice fed a regular diet. Also take into account that if you expect different standard deviations between groups, you need to calculate the pooled standard deviation and use this value to calculate the effect size.
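For that last point, a minimal sketch of the pooled standard deviation for two groups, assuming hypothetical standard deviations and equal group sizes, could be:

# Pooled standard deviation (the sd values here are hypothetical)
sd_regular  <- 6.3   # regular diet
sd_high_fat <- 8.0   # high-fat diet (assumed, for illustration)
n_regular   <- 12
n_high_fat  <- 12

sd_pooled <- sqrt(((n_regular - 1) * sd_regular^2 + (n_high_fat - 1) * sd_high_fat^2) /
                    (n_regular + n_high_fat - 2))

effect_size_pooled <- 20 / sd_pooled  # difference of 20 g over the pooled sd
effect_size_pooled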

That’s it! I really hope this material can help you. Most of the code, data and plots in this post are available in the GitHub repository: Statistical Inference with R.


This work is licensed under a Creative Commons Attribution 4.0 International License.
