Statistics Toolbox
Example: N-Way ANOVA with Large Data Set
In the previous example we used anova2 to study a small data set measuring car mileage. Now we study a larger set of car data with mileage and other information on 406 cars made between 1970 and 1982. First we load the data set and look at the variable names.
    load carbig
    whos

      Name              Size       Bytes  Class
      Acceleration      406x1       3248  double array
      Cylinders         406x1       3248  double array
      Displacement      406x1       3248  double array
      Horsepower        406x1       3248  double array
      MPG               406x1       3248  double array
      Model             406x36     29232  char array
      Model_Year        406x1       3248  double array
      Origin            406x7       5684  char array
      Weight            406x1       3248  double array
      cyl4              406x5       4060  char array
      org               406x7       5684  char array
      when              406x5       4060  char array
We will focus our attention on four variables. MPG is the number of miles per gallon for each of 406 cars (though some have missing values coded as NaN). The other three variables are factors: cyl4 (four-cylinder car or not), org (car originated in Europe, Japan, or the USA), and when (car was built early in the period, in the middle of the period, or late in the period).
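Before fitting, it can help to check how many mileage values are missing and which levels each factor takes. The following is a minimal sketch, assuming carbig loads the variables listed above; the names nMissing, orgLevels, cylLevels, and whenLevels are illustrative and are not part of the data set.

```matlab
load carbig

% Count cars whose mileage is missing (coded as NaN).
nMissing = sum(isnan(MPG));

% The factors are stored as character arrays, one row per car.
% cellstr turns each row into a string so that unique can list the levels.
orgLevels  = unique(cellstr(org));
cylLevels  = unique(cellstr(cyl4));
whenLevels = unique(cellstr(when));

fprintf('%d of %d MPG values are missing\n', nMissing, numel(MPG));
disp(orgLevels')
disp(cylLevels')
disp(whenLevels')
```

If the data set matches the description above, org and when should each show three levels and cyl4 should show two.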
First we fit the full model, requesting up to three-way interactions and Type 3 sums-of-squares.
    varnames = {'Origin';'4Cyl';'MfgDate'};
    anovan(MPG,{org cyl4 when},3,3,varnames)

    ans =
        0.0000
           NaN
             0
        0.7032
        0.0001
        0.2072
        0.6990
Note that many terms are marked by a "#" symbol as not having full rank, and one of them has zero degrees of freedom and is missing a p-value. This can happen when there are missing factor combinations and the model has higher-order terms. In this case, the cross-tabulation below shows that there are no cars made in Europe during the early part of the period with other than four cylinders, as indicated by the 0 in table(2,1,1).
    [table,factorvals] = crosstab(org,when,cyl4)

    table(:,:,1) =
        82    75    25
         0     4     3
         3     3     4

    table(:,:,2) =
        12    22    38
        23    26    17
        12    25    32

    factorvals =
        'USA'       'Early'    'Other'
        'Europe'    'Mid'      'Four'
        'Japan'     'Late'          []
Consequently it is impossible to estimate the three-way interaction effects, and including the three-way interaction term in the model makes the fit singular.
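The empty cell can also be located programmatically rather than by reading the printed table. This is a minimal sketch, assuming the same crosstab call as above; the result is named counts here (an illustrative choice) instead of table, and r, c, and pg are likewise illustrative names.

```matlab
load carbig

% Cross-tabulate the three factors: rows = org, columns = when, pages = cyl4.
[counts, factorvals] = crosstab(org, when, cyl4);

% Find every factor combination that has no observations.
[r, c, pg] = ind2sub(size(counts), find(counts == 0));

% Column d of factorvals holds the level labels of the d-th factor.
for n = 1:numel(r)
    fprintf('No cars for %s / %s / %s\n', ...
        factorvals{r(n),1}, factorvals{c(n),2}, factorvals{pg(n),3});
end
```

Any combination reported here is one for which the corresponding three-way interaction effect cannot be estimated.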
Using even the limited information available in the ANOVA table, we can see that the three-way interaction has a p-value of 0.699, so it is not significant. We decide to request only two-way interactions this time.
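The call for this step does not appear above. The following is a sketch of what it might look like, assuming the name/value form of anovan found in later toolbox releases ('model', 'sstype', 'varnames', 'display') rather than the positional arguments used earlier. The fourth output is named terms here as an illustrative choice, and its representation may differ between releases from the termvec codes described in the text.

```matlab
load carbig
varnames = {'Origin';'4Cyl';'MfgDate'};

% Fit main effects plus all two-way interactions (model order 2)
% with Type 3 sums of squares. 'display','off' suppresses the table figure.
[p, tbl, stats, terms] = anovan(MPG, {org cyl4 when}, ...
    'model', 2, 'sstype', 3, 'varnames', varnames, 'display', 'off');

% p holds one p-value per term: three main effects, then three interactions.
disp(p)
```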
Now all terms are estimable. The p-values for interaction term 4 (Origin*4Cyl) and interaction term 6 (4Cyl*MfgDate) are much larger than a typical cutoff value of 0.05, indicating these terms are not significant. We could choose to omit these terms and pool their effects into the error term. The termvec output is a vector of codes, each of which is a bit pattern representing a term. We can omit terms from the model by deleting their entries from termvec and running anovan again, this time supplying the resulting vector as the model argument.
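To make the bit-pattern idea concrete, the sketch below decodes a vector of term codes and drops terms 4 and 6. It assumes that bit k of a code is set when factor k takes part in the term, and that the six terms are ordered as main effects followed by two-way interactions. The actual termvec values are not shown in this example, so the vector used here is a hypothetical stand-in, and termNames and inTerm are illustrative names.

```matlab
varnames = {'Origin';'4Cyl';'MfgDate'};
nFactors = numel(varnames);

% Hypothetical codes for a model with all main effects and two-way terms.
% Assumed convention: bit k set <=> factor k is part of the term.
termvec = [1 2 4 3 5 6];

% Decode each code into a readable term name.
termNames = cell(size(termvec));
for t = 1:numel(termvec)
    inTerm = logical(bitget(termvec(t), 1:nFactors));
    termNames{t} = strjoin(varnames(inTerm)', '*');
end
disp(termNames)

% Omit interaction terms 4 and 6 by deleting their entries.
termvec([4 6]) = [];
disp(termvec)
```

Under these assumptions the remaining codes describe the three main effects and the Origin*MfgDate interaction, and the shortened vector is what would be supplied to anovan in place of the model order.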
Now we have a more parsimonious model indicating that the mileage of these cars seems to be related to all three factors, and that the effect of the manufacturing date depends on where the car was made.