Some basic concepts about fractional experimental designs, as well as their generation and analysis in R.
All other things being equal, the simplest explanation is usually the most likely.
All code and data on this post are available on GitHub: Fractional Designs with R.
What are fractional designs?
These are designs in which a part or fraction of the treatments of a full factorial design is appropriately chosen, with the objective of determining which factors are significant using fewer experimental runs. By a full factorial design, I mean a design with k factors at two levels each (2^k).
Fractional designs offer a way to evaluate a large number of factors and detect those that actually have an effect on our response variable(s). For example, if we were to consider 10 factors with two levels each, we would have a total of 2^10 (1024) experimental runs, which seems quite overwhelming. If we took a 2^(10-1) design we would reduce the number of runs by half (512 experimental runs). We can go further with a 2^(10-5) design with only 32 runs, which would allow us to evaluate all 10 factors by focusing our attention on the main effects (single-factor effects) and discarding most of the higher-order interactions between factors, e.g., interactions between three, four, and five factors.
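As a quick sanity check, these run counts can be computed directly in R:

```r
# Runs in a full two-level factorial with 10 factors
2^10        # 1024

# Half fraction: one generator removes half the runs
2^(10 - 1)  # 512

# 1/32 fraction: five generators leave only 32 runs
2^(10 - 5)  # 32
```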
The main disadvantage of this approach is aliasing: two or more effects with different names that are estimated by the same contrast and therefore cannot be distinguished. For example, in the results of a fractional design it may be impossible to differentiate between the effect of factor A and the effect of the interaction of factors C, D and E (the CDE product). This problem can be reduced if we carefully choose the design fraction and do a proper analysis of the data.
The three principles, more or less empirical, that are considered when analyzing the results of this type of design are hierarchy, sparsity and heredity.
Principle of hierarchy
Low-order effects, mainly those of individual factors and interactions between two factors, are more likely to be important or active than higher-order effects. Effects of the same order are equally likely to be active.
Principle of sparsity
Of the total number of effects that can be studied in an experiment, only a small fraction, around 20 to 30%, is expected to be active.
Principle of heredity
For an interaction effect to be active, at least one of the main effects of its constituent factors must be significant.
Resolution in a fractional design
- Resolution III design. Main effects are aliased with two-factor interactions.
- Resolution IV design. Main effects are not aliased with two-factor interactions, but some two-factor interactions are aliased with each other.
- Resolution V design. Main effects and two-factor interactions are aliased only with three-factor or higher-order interactions.
Fractional design notation
In fractional designs, the notation -1, +1 is used to code the levels of each factor. For example, if we are considering temperature at levels 30 and 60 °C, 30 °C would be coded -1, while 60 °C would be coded +1.
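Coding and decoding levels is just a linear map between real units and -1/+1. A quick base-R sketch with the temperature example above:

```r
# Code real temperature levels (30 and 60 °C) into -1/+1 units
temp <- c(30, 60, 30, 60)
center <- (30 + 60) / 2      # midpoint of the two levels: 45
half_range <- (60 - 30) / 2  # half the distance between levels: 15

coded <- (temp - center) / half_range
coded                        # -1  1 -1  1

# The reverse transformation recovers the original units
coded * half_range + center  # 30 60 30 60
```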
This notation is also common in 2^k and response surface designs.
Generator of a design
An effect whose contrast is used to generate the factorial fraction; this effect cannot be estimated. For example, if we want to estimate the effect of four factors A, B, C and D with a fractional design, we can use the generator I = ABCD, which allows us to define D = ABC; that is, we can establish the levels of D by multiplying the columns A, B and C of our design matrix. This makes it impossible to estimate the ABCD interaction, which, taking the three principles into account, is not so serious.
It is also important to note that, depending on our design, there may be more than one generator.
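The generator mechanics can be sketched in base R: build the full 2^3 design in A, B and C, then derive D as the ABC product, as in the I = ABCD example above:

```r
# Full 2^3 design in factors A, B, C
base_design <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))

# Generator D = ABC defines the fourth factor's levels
base_design$D <- base_design$A * base_design$B * base_design$C
base_design

# Defining relation I = ABCD: the ABCD product is +1 in every run,
# so the four-factor interaction cannot be estimated
with(base_design, A * B * C * D)  # all 1
```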
Fractional designs with minimum aberration
Depending on the generators we choose, there may be more than one fractional design with the same resolution. For example, there are two 2^(7-2) fractions with resolution IV, but only one of them allows us to estimate two-factor interactions without them being aliased with other two-factor or higher-order interactions. Thus, designs with minimum aberration are those that allow us to estimate the important effects in a "clean" way.
Table of fractional designs
The following table is a good guide for selecting the fractional design that best suits your objectives.
As you can see, this table indicates the number of factors, the number of experimental runs, the resolution, and the generators that yield a design with minimum aberration.
As I will show later, the FrF2 package will save us a lot of work in selecting the best design according to our specifications.
Generation of fractional designs in R
To generate a fractional design, we can use the FrF2 package, which also provides several tools to facilitate the analysis of results.
To generate a 2^(7-3) design with resolution IV, we use the following code:
# Load the FrF2 package
library(FrF2)

# Make the fractional design
frac_design_1 <- FrF2(nfactors = 7, resolution = 4, randomize = FALSE)

# Number of runs
nrow(frac_design_1)
## [1] 16
Our design can be exported in CSV format to our work folder:
write.csv(frac_design_1, "data/frac_design_1.csv", row.names = FALSE)
Analysis of results of a 2^(5-1) design with resolution V
Now let’s look at an example of the analysis of the results of a fractional design. In this case, an experiment was conducted to determine the significant factors on the yield of a chemical process.
The factors taken into account were as follows:
yield_factors <- read.csv("data/yield_factors.csv")
colnames(yield_factors) <- c("Factor", "-1", "+1")
yield_factors
##                  Factor  -1  +1
## 1     Feed rate (L/min)  10  15
## 2          Catalyst (%)   1   2
## 3 Agitation speed (rpm) 100 120
## 4      Temperature (°C) 140 180
## 5     Concentration (%)   3   6
The experiment results can also be found in the data folder:
yield_data <- read.csv("data/yield_data.csv")
head(yield_data)
##   FeedRate Catalyst AgitationSpeed Temperature Concentration Yield
## 1       -1       -1             -1          -1             1    56
## 2        1       -1             -1          -1            -1    53
## 3       -1        1             -1          -1            -1    63
## 4        1        1             -1          -1             1    65
## 5       -1       -1              1          -1            -1    53
## 6        1       -1              1          -1             1    55
Fitting a linear model
The first step is to fit a linear model with the yield data considering main effects and two-factor interactions:
yield_lm <- lm(Yield ~ (.)^2, data = yield_data)
Note how I have simplified the formula: instead of specifying the whole model, Yield ~ (.)^2 includes the main effects and two-factor interactions I mentioned previously. To query the values of each effect we can use the summary() function:

summary(yield_lm)
## 
## Call:
## lm.default(formula = Yield ~ (.)^2, data = yield_data)
## 
## Residuals:
## ALL 16 residuals are 0: no residual degrees of freedom!
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   6.525e+01        NaN     NaN      NaN
## FeedRate                     -1.000e+00        NaN     NaN      NaN
## Catalyst                      1.025e+01        NaN     NaN      NaN
## AgitationSpeed                1.544e-16        NaN     NaN      NaN
## Temperature                   6.125e+00        NaN     NaN      NaN
## Concentration                -3.125e+00        NaN     NaN      NaN
## FeedRate:Catalyst             7.500e-01        NaN     NaN      NaN
## FeedRate:AgitationSpeed       2.500e-01        NaN     NaN      NaN
## FeedRate:Temperature         -3.750e-01        NaN     NaN      NaN
## FeedRate:Concentration        6.250e-01        NaN     NaN      NaN
## Catalyst:AgitationSpeed       7.500e-01        NaN     NaN      NaN
## Catalyst:Temperature          5.375e+00        NaN     NaN      NaN
## Catalyst:Concentration        6.250e-01        NaN     NaN      NaN
## AgitationSpeed:Temperature    1.250e-01        NaN     NaN      NaN
## AgitationSpeed:Concentration  1.125e+00        NaN     NaN      NaN
## Temperature:Concentration    -4.750e+00        NaN     NaN      NaN
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  NaN
## F-statistic:   NaN on 15 and 0 DF,  p-value: NA
Note that with 16 experimental runs it is only possible to estimate 16 effects, including the intercept. The model saturates the data, so there are no residual degrees of freedom left to compute standard errors, t-values or p-values.
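A side note that helps when reading these coefficients: in -1/+1 coding, each lm() coefficient is half the classical factorial effect (the difference between the mean response at +1 and at -1). A tiny synthetic illustration (the numbers here are made up for the demonstration):

```r
# One factor in -1/+1 coding with two runs per level (synthetic data)
x <- c(-1, -1, 1, 1)
y <- c(10, 12, 20, 22)
fit <- lm(y ~ x)

# Classical effect: mean response at +1 minus mean response at -1
effect <- mean(y[x == 1]) - mean(y[x == -1])
effect            # 10

coef(fit)[["x"]]  # 5, half the classical effect
```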
The FrF2 package includes a function that shows us the alias structure of the effects, but only if that structure involves main effects or two-factor interactions. Since the design in this example has resolution V, there is no alias structure to worry about, and the aliases() function shows no results:

aliases(yield_lm)

## no aliasing in the model
With the above we can directly establish which effects are the most relevant ones.
Graphical analysis (Daniel Plot)
How can we establish which effects are significant? A first approach is a Daniel plot, which can be made with the DanielPlot() function included in the FrF2 package:

DanielPlot(yield_lm)
That was easy. This plot shows us in a simple way which main effects and two-factor interactions are relevant under the assumption that non-significant effects follow a normal distribution.
Graphical analysis (Pareto chart)
The above graph can be complemented with a Pareto chart. For this we can use the ggplot2 package, taking the effects from the model:
library(ggplot2)

yield_coeffs <- coefficients(yield_lm)[-1] # We discard the intercept
yield_effects <- data.frame(
  Effect = names(yield_coeffs),
  Value = unname(yield_coeffs),
  AbsoluteValue = abs(unname(yield_coeffs)),
  Sign = as.character(unname(sign(yield_coeffs)))
)

ggplot(yield_effects, aes(AbsoluteValue, reorder(Effect, -AbsoluteValue, abs))) +
  geom_col(aes(fill = Sign)) +
  xlab("Magnitude of effect") +
  ylab("Effect") +
  theme_minimal()
From this we can conclude that the most relevant effects were catalyst percentage, temperature and reagent concentration, as well as the double interactions between catalyst and temperature and between temperature and concentration.
Subsequent analysis by ANOVA
Now that we know which factors have an important influence on yield, we can perform the usual analysis of variance considering only these:
yield_lm2 <- lm(Yield ~ Catalyst * Temperature * Concentration, data = yield_data)
anova(yield_lm2)
## Analysis of Variance Table
## 
## Response: Yield
##                                    Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Catalyst                            1 1681.00 1681.00 213.4603 4.725e-07 ***
## Temperature                         1  600.25  600.25  76.2222 2.316e-05 ***
## Concentration                       1  156.25  156.25  19.8413  0.002127 ** 
## Catalyst:Temperature                1  462.25  462.25  58.6984 5.953e-05 ***
## Catalyst:Concentration              1    6.25    6.25   0.7937  0.398998    
## Temperature:Concentration           1  361.00  361.00  45.8413  0.000142 ***
## Catalyst:Temperature:Concentration  1    1.00    1.00   0.1270  0.730795    
## Residuals                           8   63.00    7.88                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of results of a 2^(8-4) design with resolution IV
Now let’s look at an example with a larger number of factors. In this case our goal is to design a paint with excellent brightness, so we consulted paint experts and determined eight factors that could influence brightness. Since a full experiment with eight factors would consist of 256 experimental runs, we decided to run a fraction, in particular a 2^(8-4) design of resolution IV with only 16 runs. Great!
The first step is to import the data into R:
paint_data <- read.csv("data/paint_data.csv")
head(paint_data)
##    A  B  C  D  E  F  G  H Brightness
## 1 -1 -1 -1 -1 -1 -1 -1 -1         53
## 2  1 -1 -1 -1  1  1  1 -1         60
## 3 -1  1 -1 -1  1  1 -1  1         68
## 4  1  1 -1 -1 -1 -1  1  1         78
## 5 -1 -1  1 -1  1 -1  1  1         48
## 6  1 -1  1 -1 -1  1 -1  1         67
Fitting a linear model
Next, we fit a linear model considering main effects and two-factor interactions:
paint_lm <- lm(Brightness ~ (.)^2, data = paint_data)
Unlike the previous example, in this design the aliases() function shows us the alias structure, where we can see that the two-factor interactions are aliased with each other:

aliases(paint_lm)
## 
## A:B = C:E = D:F = G:H
## A:C = B:E = D:G = F:H
## A:D = B:F = C:G = E:H
## A:E = B:C = D:H = F:G
## A:F = B:D = C:H = E:G
## A:G = B:H = C:D = E:F
## A:H = B:G = C:F = D:E
If several interactions turned out to be significant, it would be somewhat complicated to differentiate between them. But taking into account the principles of hierarchy, sparsity and heredity, we have some confidence that only a few effects will be truly significant.
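We can verify one line of this alias structure by hand in base R. Assuming generators E = ABC, F = ABD, G = ACD and H = BCD (a set consistent with the aliases printed above), the A:B, C:E, D:F and G:H contrast columns are identical:

```r
# Full 2^4 design in the base factors A, B, C, D
d <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1), D = c(-1, 1))

# Generators consistent with the alias structure shown above
d$E <- d$A * d$B * d$C
d$F <- d$A * d$B * d$D
d$G <- d$A * d$C * d$D
d$H <- d$B * d$C * d$D

# These interactions produce exactly the same column of signs,
# so their effects cannot be told apart
all(d$A * d$B == d$C * d$E)  # TRUE
all(d$A * d$B == d$D * d$F)  # TRUE
all(d$A * d$B == d$G * d$H)  # TRUE
```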
Graphical analysis (Daniel Plot)
As in the previous example, we make a Daniel plot using the linear model on the brightness data:

DanielPlot(paint_lm)
Wow, only two effects among the eight we considered turned out to be significant.
Graphical analysis (Pareto chart)
Let’s complement the above with a Pareto chart:
paint_coeffs <- coefficients(paint_lm)[-1] # We discard the intercept
paint_coeffs <- paint_coeffs[!is.na(paint_coeffs)] # Discard NA effects
paint_effects <- data.frame(
  Effect = names(paint_coeffs),
  Value = unname(paint_coeffs),
  AbsoluteValue = abs(unname(paint_coeffs)),
  Sign = as.character(unname(sign(paint_coeffs)))
)

ggplot(paint_effects, aes(AbsoluteValue, reorder(Effect, -AbsoluteValue, abs))) +
  geom_col(aes(fill = Sign)) +
  xlab("Magnitude of effect") +
  ylab("Effect") +
  theme_minimal()
Taking this graph into account, it may be important to consider factor G: its negative effect on brightness, although small compared with A and B, could cause problems.
Subsequent analysis by ANOVA
Now let’s perform an analysis of variance considering only factors A, B and G:
paint_lm2 <- lm(Brightness ~ A * B * G, data = paint_data)
anova(paint_lm2)
## Analysis of Variance Table
## 
## Response: Brightness
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## A          1 1105.56 1105.56 102.2486 7.812e-06 ***
## B          1  637.56  637.56  58.9653 5.857e-05 ***
## G          1   52.56   52.56   4.8613   0.05855 .  
## A:B        1    3.06    3.06   0.2832   0.60905    
## A:G        1   27.56   27.56   2.5491   0.14902    
## B:G        1    0.56    0.56   0.0520   0.82530    
## A:B:G      1   14.06   14.06   1.3006   0.28710    
## Residuals  8   86.50   10.81                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we thought, considering factor G was not a bad idea. Although its effect was not significant at the 0.05 cutoff, in practice its influence could still be undesirable.
Among all our treatments (combinations of A, B and G) we can choose the one that resulted in the best brightness. This can be done graphically with a cube plot; the cubePlot() function is also included in the FrF2 package:
cubePlot(paint_lm2, eff1 = "A", eff2 = "B", eff3 = "G")
This graph clearly shows that the combination of A and B at their positive levels gives the best brightness. Regarding factor G, it will be sufficient to set it to whatever value is most convenient in practice, since its influence seems to be minimal at positive levels of A and B.
Depending on your objective, once the significant factors on the response variable are determined, the next step might be to move the experimental region to levels that enhance the response and, ideally, run a response surface design to find the optimal combination of factors.
It may also be the case that the results of the fractional design are not sufficient to distinguish the main effects or the two-factor interactions. In that case, it would be necessary to run a complementary fraction to resolve the ambiguity.
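The FrF2 package itself can help with this: its fold.design() function augments an existing design with its fold-over (mirror-image) fraction. A minimal sketch, assuming the paint design was generated with FrF2 (the response would then need to be measured for the new runs):

```r
library(FrF2)

# Regenerate a 2^(8-4) resolution IV design like the one in this example
paint_design <- FrF2(nfactors = 8, resolution = 4, randomize = FALSE)
nrow(paint_design)  # 16

# fold.design() appends the fold-over fraction, doubling the design.
# For a resolution IV design like this one, folding on a single factor
# (see the 'columns' argument in the package documentation) de-aliases
# the two-factor interactions involving that factor.
paint_folded <- fold.design(paint_design)
nrow(paint_folded)  # 32
```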
I will explore both ideas in future posts, thank you very much for visiting this site!
If you wish to go deeper into the subject, please refer to chapter 6 of Statistics for Experimenters: Design, Innovation, and Discovery (Second Edition), from which I also took the two examples we analyzed with R. Also check out the FrF2 package site if you want to go deeper into the functions I used in this post: FrF2.
Juan Pablo Carreón Hidalgo 🤓
This work is licensed under a Creative Commons Attribution 4.0 International License.