Some basic concepts about fractional experimental designs, as well as their generation and analysis in R.

All other things being equal, the simplest explanation is usually the most likely.

All code and data on this post are available on GitHub: Fractional Designs with R.

# What are fractional designs?

These are designs in which a part or fraction of the treatments of a full factorial design are appropriately chosen, with the objective of determining which of the factors are significant using fewer experimental runs. By a full factorial design, I mean a k-factor design with two levels each (2k).

Fractional designs offer a way to evaluate a large number of factors and detect those that actually have an effect on our response variable(s). For example, if we were to consider 10 factors with levels each we would have a total of 210 (1024) experimental runs, which seems quite overwhelming. If we took a 29-1 design we would reduce the number of runs by half (512 experimental runs). We can go further with a 210-5 design with only 32 runs, which would allow us to evaluate all 10 factors by focusing our attention on the main effects (single-factor effects) and discarding most of the higher-order interactions between factors, e.g., interactions between three, four, and five factors.

The main disadvantage with this approach is what we call alias effects, which are two or more effects with different names that basically estimate the same effect. For example, in the results of a fractional design it may be difficult to differentiate between the effect of factor A and the effect of the interaction of factors C, D and E (the CDE product). This problem can be reduced if we carefully choose the design fraction and do a proper analysis of the data.

The three principles, more or less empirical, that are considered when analyzing the results of this type of design are hierarchy, scarcity and inheritance.

## Principle of hierarchy

Low-order effects, mainly those of individual factors and interactions between two factors, are more likely to be important or active than higher-order effects. Effects of the same order are equally likely to be active.

## Principle of scarcity

Of the total number of effects that can be studied in an experiment, only a minor part, between 20 and 30%, is expected to be active.

## Principle of inheritance

For an interaction effect to be active, at least one of the main effects of its constituent factors must be significant.

## Resolution in a fractional design

• Design of resolution III. The main effects are allied with two-factor interactions.
• Design of resolution IV. Some double interactions are allied with each other.
• Design of resolution V. Main effects and double interactions are allied with three-factor or higher-order interactions.

## Fractional design notation

In fractional designs, the notation -1, +1 is used to code the levels of each factor. For example, if we are considering temperature at levels 30 and 60 °C, 30 °C would be coded -1, while 60 °C would be coded +1.

This notation is also common in 2k and response surface designs.

## Fraction generator

Effect whose contrast is used to generate the factorial fraction. This effect cannot be estimated. For example, if we want to estimate the effect of four factors A, B, C and D with a fractional design we can use the generator I = ABCD, which will allow us to define D = ABC, that is, we will be able to establish the different levels of D by multiplying the columns A, B and C in our design matrix. This will make it impossible to estimate the ABCD interaction, which, taking into account the three principles, is not so serious.

It is also important to note that, depending on our design, there may be more than one generator.

## Fractional designs with minimal aberration

Depending on the generators we choose, there may be more than one fractional design with the same resolution. For example, there are two 27-2 fractions with the same resolution (resolution IV), but only one of these will allow us to estimate two-factor interactions without them being allied with other double or higher-order interactions. Thus, designs with minimal aberration are those that will allow us to estimate the important effects in a “clean” way.

## Table of fractional designs

The following table is a good guide for selecting the fractional design that best suits your objectives. As you can see this table indicates the number of factors, the number of experimental runs, the resolution, as well as the generators that will allow us to obtain a design with minimum aberration.

As I will show later, the FrF2 package will save us a lot of work in selecting the best design according to our specifications.

# Generation of fractional designs in R

To generate a fractional design, we can use the FrF2 package, which also provides several tools to facilitate the analysis of results.

To generate a 27-3 design with resolution IV, we use the following code:

``````# Load FrF2 package
library(FrF2)

# Make the fractional design
frac_design_1 <- FrF2(nfactors = 7, resolution = 4, randomize = FALSE)

# Number of runs
nrow(frac_design_1)``````
``##  16``

Our design can be exported in CSV format to our work folder:

``write.csv(frac_design_1, "data/frac_design_1.csv", row.names = FALSE)``

# Analysis of results of a 25-1 design with resolution V

Now let’s look at an example of the analysis of the results of a fractional design. In this case, an experiment was conducted to determine the significant factors on the yield of a chemical process.

The factors taken into account were as follows:

``````yield_factors <- read.csv("data/yield_factors.csv")
colnames(yield_factors) <- c("Factor", "-1", "+1")
yield_factors``````
``````##                  Factor  -1  +1
## 1     Feed rate (L/min)  10  15
## 2          Catalyst (%)   1   2
## 3 Agitation speed (rpm) 100 120
## 4   Temperature (\xb0C) 140 180
## 5     Concentration (%)   3   6``````

The experiment results can also be found in the data folder:

``````yield_data <- read.csv("data/yield_data.csv")
``````##   FeedRate Catalyst AgitationSpeed Temperature Concentration Yield
## 1       -1       -1             -1          -1             1    56
## 2        1       -1             -1          -1            -1    53
## 3       -1        1             -1          -1            -1    63
## 4        1        1             -1          -1             1    65
## 5       -1       -1              1          -1            -1    53
## 6        1       -1              1          -1             1    55``````

## Fitting a linear model

The first step is to fit a linear model with the yield data considering main effects and two-factor interactions:

``yield_lm <- lm(Yield ~ (.)^2, data = yield_data)``

Note how I have simplified the formula, instead of specifying the whole model just use `Yield ~ (.)^2` to include the effects I mentioned previously. To query the values of each effect we can use the function `summary()`:

``summary(yield_lm)``
``````##
## Call:
## lm.default(formula = Yield ~ (.)^2, data = yield_data)
##
## Residuals:
## ALL 16 residuals are 0: no residual degrees of freedom!
##
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   6.525e+01        NaN     NaN      NaN
## FeedRate                     -1.000e+00        NaN     NaN      NaN
## Catalyst                      1.025e+01        NaN     NaN      NaN
## AgitationSpeed                1.544e-16        NaN     NaN      NaN
## Temperature                   6.125e+00        NaN     NaN      NaN
## Concentration                -3.125e+00        NaN     NaN      NaN
## FeedRate:Catalyst             7.500e-01        NaN     NaN      NaN
## FeedRate:AgitationSpeed       2.500e-01        NaN     NaN      NaN
## FeedRate:Temperature         -3.750e-01        NaN     NaN      NaN
## FeedRate:Concentration        6.250e-01        NaN     NaN      NaN
## Catalyst:AgitationSpeed       7.500e-01        NaN     NaN      NaN
## Catalyst:Temperature          5.375e+00        NaN     NaN      NaN
## Catalyst:Concentration        6.250e-01        NaN     NaN      NaN
## AgitationSpeed:Temperature    1.250e-01        NaN     NaN      NaN
## AgitationSpeed:Concentration  1.125e+00        NaN     NaN      NaN
## Temperature:Concentration    -4.750e+00        NaN     NaN      NaN
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN
## F-statistic:   NaN on 15 and 0 DF,  p-value: NA``````

Note that with 16 experimental runs it is only possible to calculate 16 effects, including the intercept. Since we do not have additional information it is not possible to calculate standard error, t-values and p-values.

## Alias effects

The `FrF2` package includes a function that will show us the alias structure of the effects, only if this structure involves aliases between main effects or two-factor interactions. Since the design in this example has resolution V, there is no alias structure for us to worry about, and the `aliases()` function will not show any results:

``aliases(yield_lm)``
``##  no aliasing in the model``

With the above we can directly establish which effects are the most relevant ones.

## Graphical analysis (Daniel Plot)

How can we establish which effects are significant? A first approach can be made with a Daniel plot, which can be made with the `DanielPlot()` function included in the `FrF2` package:

``DanielPlot(yield_lm)`` That was easy. This plot shows us in a simple way which main effects and two-factor interactions are relevant under the assumption that non-significant effects follow a normal distribution.

## Graphical analysis (Pareto chart)

The above graph can be complemented with a Pareto chart. For this we can use the package `ggplot2`, taking the model effects:

``````library(ggplot2)

yield_coeffs <- coefficients(yield_lm)[-1] # We discard the intercept

yield_effects <- data.frame(
Effect = names(yield_coeffs),
Value  = unname(yield_coeffs),
AbsoluteValue = abs(unname(yield_coeffs)),
Sign   = as.character(unname(sign(yield_coeffs)))
)

ggplot(yield_effects, aes(AbsoluteValue, reorder(Effect, -AbsoluteValue, abs))) +
geom_col(aes(fill = Sign)) +
xlab("Magnitude of effect") +
ylab("Effect") +
theme_minimal()`````` From this we can conclude that the most relevant effects were catalyst percentage, temperature and reagent concentration, as well as the double interactions between catalyst and temperature and between temperature and concentration.

## Subsequent analysis by ANOVA

Now that we know which factors have an important influence on yield, we can perform the usual analysis of variance considering only these:

``````yield_lm2 <- lm(Yield ~ Catalyst*Temperature*Concentration, yield_data)
anova(yield_lm2)``````
``````## Analysis of Variance Table
##
## Response: Yield
##                                    Df  Sum Sq Mean Sq  F value    Pr(>F)
## Catalyst                            1 1681.00 1681.00 213.4603 4.725e-07 ***
## Temperature                         1  600.25  600.25  76.2222 2.316e-05 ***
## Concentration                       1  156.25  156.25  19.8413  0.002127 **
## Catalyst:Temperature                1  462.25  462.25  58.6984 5.953e-05 ***
## Catalyst:Concentration              1    6.25    6.25   0.7937  0.398998
## Temperature:Concentration           1  361.00  361.00  45.8413  0.000142 ***
## Catalyst:Temperature:Concentration  1    1.00    1.00   0.1270  0.730795
## Residuals                           8   63.00    7.88
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1``````

# Analysis of results of a 28-4 design with resolution IV

Now let’s look at an example with a larger number of factors. In this case our goal is to design a paint with excellent brightness, so we consulted with paint experts and determined eight factors that could influence on brightness. Since a full experiment with eight factors would consist of 256 experimental runs, we decided to run a fraction, in particular a 28-4 design of resolution IV with only 16 runs, great!

The first step is to import the data into R:

``````paint_data <- read.csv("data/paint_data.csv")
``````##    A  B  C  D  E  F  G  H Brightness
## 1 -1 -1 -1 -1 -1 -1 -1 -1         53
## 2  1 -1 -1 -1  1  1  1 -1         60
## 3 -1  1 -1 -1  1  1 -1  1         68
## 4  1  1 -1 -1 -1 -1  1  1         78
## 5 -1 -1  1 -1  1 -1  1  1         48
## 6  1 -1  1 -1 -1  1 -1  1         67``````

## Fitting a linear model

Next, we fit a linear model considering main effects and two-factor interactions:

``paint_lm <- lm(Brightness ~ (.)^2, data = paint_data)``

## Alias effects

Unlike the previous example, in this design the `aliases()` function will show us the alias structure, where we can see that the interactions between two factors are allied to each other:

``aliases(paint_lm)``
``````##
##  A:B = C:E = D:F = G:H
##  A:C = B:E = D:G = F:H
##  A:D = B:F = C:G = E:H
##  A:E = B:C = D:H = F:G
##  A:F = B:D = C:H = E:G
##  A:G = B:H = C:D = E:F
##  A:H = B:G = C:F = D:E``````

If several factors turn out to be significant, it would be somewhat complicated to differentiate between the different interactions. But taking into account the principles of hierarchy, sparsity and inheritance, we have some confidence that only a few effects will be truly significant.

## Graphical analysis (Daniel Plot)

As in the previous example, we made a Daniel plot using the linear model on the brightness data:

``DanielPlot(paint_lm)`` Wow, only two effects among the eight we considered turned out to be significant.

## Graphical analysis (Pareto chart)

Let’s complement the above with a Pareto chart:

``````paint_coeffs <- coefficients(paint_lm)[-1] # We discard the intercept
paint_coeffs <- paint_coeffs[!is.na(paint_coeffs)] # Discard NA effects

paint_effects <- data.frame(
Effect = names(paint_coeffs),
Value  = unname(paint_coeffs),
AbsoluteValue = abs(unname(paint_coeffs)),
Sign   = as.character(unname(sign(paint_coeffs)))
)

ggplot(paint_effects, aes(AbsoluteValue, reorder(Effect, -AbsoluteValue, abs))) +
geom_col(aes(fill = Sign)) +
xlab("Magnitude of effect") +
ylab("Effect") +
theme_minimal()`````` Taking into account this graph, it may be important to consider the factor G, since its negative effect on brightness, although small compared to A and B, could give us some problems.

## Subsequent analysis by ANOVA

Now let’s perform an analysis of variance considering only factors A, B and G:

``````paint_lm2 <- lm(Brightness ~ A*B*G, data = paint_data)
anova(paint_lm2)``````
``````## Analysis of Variance Table
##
## Response: Brightness
##           Df  Sum Sq Mean Sq  F value    Pr(>F)
## A          1 1105.56 1105.56 102.2486 7.812e-06 ***
## B          1  637.56  637.56  58.9653 5.857e-05 ***
## G          1   52.56   52.56   4.8613   0.05855 .
## A:B        1    3.06    3.06   0.2832   0.60905
## A:G        1   27.56   27.56   2.5491   0.14902
## B:G        1    0.56    0.56   0.0520   0.82530
## A:B:G      1   14.06   14.06   1.3006   0.28710
## Residuals  8   86.50   10.81
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1``````

As we thought, considering the G-factor was not a bad idea. Although its effect was not significant at the 0.05 cutoff, in a practical way its influence could be undesirable.

## Cube plot

Among all our treatments (combinations of A, B and G) we can choose the one that resulted in the best brightness. This can be done graphically with a cube plot. The `cubePlot()` function is also included in the `FrF2` package:

``cubePlot(paint_lm2, eff1 = "A", eff2 = "B", eff3 = "G")`` This graph clearly shows that the combination A positive with B positive will give the best brightness. Regarding the G factor, it will be sufficient to set its value to the one that is most convenient for us in a practical way, since its influence seems to be minimal at positive levels of A and B.

# What’s next?

Depending on your objective, once we determine the significant factors on the response variable, the next step might be to move the experimental region to levels that enhance the response and ideally run a response surface design to find the optimal combination of factors.

It may also be the case that the results obtained in the fractional design are not sufficient to differentiate the main effects or the interactions of two factors. In this case, it would be necessary to run a complementary fraction that would allow us to find out.

I will explore both ideas in future posts, thank you very much for visiting this site!

If you wish to go deeper into the subject, please refer to chapter 6 of the book Statistics for Experimenters - Design, Innovation and Discovery (Second Edition). From chapter 6 I also took the two examples that we analyzed with R. Also check out the FrF2 package site if you want to go deeper into using the functions I used in this post: FrF2.

Juan Pablo Carreón Hidalgo 🤓  