# Experiments 3D – Case study: a 4-factor system using computer software

Let’s look at a good case study now, involving
four factors, and two outcome variables. We’re stepping up the complexity
here a little bit. This is a good question, from the
textbook by Box, Hunter and Hunter. It’s a case study where we we are using
solar panels, with a storage tank. The outcome values were from
a computer simulation. Now just a quick piece of
advice when using simulations. Running simulations is often really easy. But there’s a temptation to
really do this inefficiently. I often see people just playing around with
the software, trying out different values, until they get an answer they like. You shouldn’t treat simulations
any differently from real life. Always use a systematic method. In this case we’re going to use a set
of experiments as our systematic method. There are two key advantages
though to using simulations. You can run the simulations in parallel
at very little cost and secondly, you don’t have to randomize
the order of experiments. And the reason for that is quite simple. When you repeat the simulation,
you get the same answer, so the need for randomization
isn’t there anymore, which was, minimize the impact of disturbances. Be careful though: certain computer experiments,
when repeated, don’t give identical results. So then you should randomize. In fact, I always recommend you randomize. The cost of doing so if very minimal, and
it guards against all sorts of problems. More on that in the next module though. Let’s go back to the solar panel system. There are four factors. A: the total amount of insulation or sunlight
received; B: the capacity of the storage tank; C: the water flow rate through the absorber;
and D: the intermittency of the sunlight. You can read more about these types
of systems, by following this link. The two outcome variables were
“y_1” the collection efficiency, and “y_2” the energy delivery efficiency. You should be able to quickly tell
how many experiments will be done, if each factor is operated at
the low level and the high level. You should have: two to the
power of four (2^4) which is 16. So 16 experiments were run, and I’ve put the
results and the R code here on the screen. They’re available on the course website. Copy and paste that code and follow
along with me for the rest of the video. So here we define the four factors:
A, B, C and D, and I’ve manually typed in the two outcome variables, “y_1” and “y_2”. This is what you would do in practice,
but to make things a bit simpler, and to avoid typing errors, you
can also use the PID package in R. In a prior video I showed how
you can download and install that package, to extend R’s capability. That package includes the numeric
results for this case study. And you can get that dataset by typing
the following command: data(solar). So since we ran 16 experiments, we
are able to estimate 16 parameters: there are four main effects
(one for A, B, C and D). There are 6 two-factor interactions,
there are 4 three-factor interactions, and then the single four-factor interaction. That’s a total of 15 parameters, and it
adds to 16 if you count the intercept. The software can create all of this for you, very compactly with the “lm(…)”
command, as shown here. The reason why this A*B*C*D concept works is
because of the principle of model hierarchy. Let’s take a simple example: if you
wrote just A*B, then R will expand that to include factor A
and factor B in the model. After all, you can’t have the two factor
interaction A*B if you don’t also have factor A and factor B. Similarly,
when R encounters A*B*C, it ensures that the AB interaction
is present, as well as factor C. But, we’ve already mentioned that the
AB will be expanded into factors A and B. So it will ensure the BC interaction
is present, and in a similar line of thinking, the AC interaction will also be present. So now you can understand why when
we write A*B*C*D here in the lm(…) command, R will recursively expand
this into all the main effects, all the two factor interactions,
all the 3 factor interactions as well as the 4 factor interaction. It is as if we had written it
all out by hand as shown here. But obviously that is tedious, and
error-prone, so let R do the work for you. Now let’s build those two separate linear
models: for the collection efficiency, “y1”, and for the energy delivery efficiency, “y2”. If you use the summary(…) command, as we’ve done before,
it might be fairly difficult to quickly locate what the important
factors are that influence y_1. Rather let’s use the Pareto plot to show
us what the important parameters are. Here it is: the grey bars represent
the terms with a negative sign. And black bars represent the
terms with a positive sign. The most important terms are the
B, the A, the AB interaction, and factor C. The other terms have
a diminishing effect on the outcome. The collection efficiency will
decrease when factor B is increased. In other words, as the storage tank capacity
is increased, the collection efficiency drops. This is the most influential
variable in the system. Next is the A factor, the amount of insolation,
has a positive on the collection efficiency. Now try answering this question
here on the screen: pause the video, and think about the AB interaction. The correct answer is the one that
use a high level for factor A, and a low level for factor B. We can see this
in the equation, and from the Pareto plot. In this case, setting factor B to a
negative sign, helps boost our objective, but it also makes the two factor
interaction work in our favour. So A, B and AB interaction are the three
most influential terms in the model. But you also notice that factor D
has little impact on the outcome. That’s a useful result as it
indicates we are relatively insensitive to the variation in the solar intermittency. If we were to run more experiments
in the future, we might leave factor D out of consideration. Similarly, when trying to optimize the
process for collection efficiency, y1, we can be confident that solar
intermittency won’t play a major role; at least according to this simulation system. Now let’s take a look at our second outcome
variable, y_2, the energy delivery efficiency. If you rebuild the model and look at the
Pareto plot we see extremely strong effects from factor A, and the two
factor factor interaction of AB. The other factors, C and D, are small. What you also notice here, and
this is a very common result, is that many of the higher level
interactions, such as the 3- and 4-factor interaction are small, or zero. I would like to point out an important
issue at this moment using this example. Take a look at factor B, it is small and
based on what we’ve done you might be tempted to conclude that factor B is not important. That’s not entirely correct. We cannot exclude factor B from consideration,
because AB interaction is very important. Remember what an interaction was defined as. In this example the AB interaction means that
the effect of factor A is dependant on the level of factor B. Alternatively, the effect
of factor B is dependant on the factor A. So because the AB interaction is strong,
we cannot ignore factor B. The level that factor B is set at is also important. And so we cannot remove factor
B from the model either. So let’s end off today’s class with
this question for you to think about. Can you maximize both y_1
and y_2 simultaneously? What would be the best combination of
settings of the factors to get that maximum? This is a question that we will