Gottfried Helms
5'2016
In this paper I show why the ANOVA procedures in R and in SPSS, when run with their defaults, give different results. When I first encountered this problem it was difficult for me to understand it from the software documentation: on one hand it was much too concise, and on the other hand I found discussions of the SS-type problem that involved a multitude of little-understood topics. So I decided to understand the matter by re-engineering the matrix mathematics of the ANOVA procedure myself. This is a third, extended version, motivated by some questions in the discussion forum "stats.stackexchange.com".
For the presentation here I have (re-)analyzed the outputs of both packages for some data and different models using my matrix software MatMate, found the key to the basic and technical understanding, and think I now have a concise, systematic scheme that is compatible with related concepts such as the matrix formulation and the variance decomposition of the linear model.
The methods available in MatMate made it possible to re-engineer the mathematical procedures and to find a reasonable conceptual background. The differences can nicely be seen in terms of triangular "loadings" matrices and their column rotations, similar to what is known from principal components analysis, where the analogue of the Cholesky factorization of covariance matrices occurs; even the computation of the regression coefficients B can be derived from these basic tools.
1. Data
From the book "Multivariate Analysemethoden" by K. Backhaus et al. [1990] I took the (metric) data example from the chapter on regression (pg. 6).
The setting is an analysis of the sales of some item in various branches, depending on the price, the practice of sending sales representatives to the shops, and the amount invested in marketing. The items are Absatzmenge (or Absatz) for the number of items sold per shop, Preis for the price, Vertreter for the number of representatives' visits, and VerkFoerd for the amount spent on sales promotion.
To have the option of putting the constant anywhere in the (hierarchical) ANOVA model I also added a variable const with the constant value "1"; in SPSS this then requires the "Unianova" procedure with "/intercept=exclude". In MatMate this can happily be configured arbitrarily; only in R is it not possible to move the constant from the first place in the (typically hierarchical) model to some other position in the list of items.
Of course the data are all metric, but the situation is in principle the same when a model with mixed scale types, or even with only nominally scaled factors and interactions, is tested. In the ANOVA procedures, nominally scaled factors can be coded by dummy variables, which again allows the metric procedures used here; so I think the reduction to the simple case of metric items is still meaningful for this small analysis of the basic structure of the procedure implementations. Interaction effects can easily be added: just compute the new interaction item as the product of the scores of the involved items; I have checked some examples based on these data using MatMate and SPSS.
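Such an interaction column can be sketched in a couple of lines; here in Python/numpy (the language choice and the variable names are mine, not from the original MatMate/SPSS sessions):

```python
import numpy as np

# scores of two items, taken from the first three rows of the data below
preis     = np.array([12.50, 10.00, 9.95])
vertreter = np.array([109.0, 107.0, 99.0])

# the interaction item is simply the elementwise product of the two score vectors
interaction = preis * vertreter

# append it to the design matrix as a new column
X = np.column_stack([preis, vertreter, interaction])
print(X[0])   # first row: 12.5, 109.0 and their product 1362.5
```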
The data used are as follows.
Readable for MatMate
labels = {"const", "preis","VerkFoerd","Vertreter","Absatz"}
data = { _
{ 1.00, 12.50, 2000.00, 109.00, 2298.00}, _
{ 1.00, 10.00, 550.00, 107.00, 1814.00}, _
{ 1.00, 9.95, 1000.00, 99.00, 1647.00}, _
{ 1.00, 11.50, 800.00, 70.00, 1496.00}, _
{ 1.00, 12.00, 0.00, 81.00, 969.00}, _
{ 1.00, 10.00, 1500.00, 102.00, 1918.00}, _
{ 1.00, 8.00, 800.00, 110.00, 1810.00}, _
{ 1.00, 9.00, 1200.00, 92.00, 1896.00}, _
{ 1.00, 9.50, 1100.00, 87.00, 1715.00}, _
{ 1.00, 12.50, 1300.00, 79.00, 1699.00} }
Readable as CSV-data
; "const", "preis", "VerkFoerd", "Vertreter", "Absatz"
1.00, 12.50, 2000.00, 109.00, 2298.00
1.00, 10.00, 550.00, 107.00, 1814.00
1.00, 9.95, 1000.00, 99.00, 1647.00
1.00, 11.50, 800.00, 70.00, 1496.00
1.00, 12.00, 0.00, 81.00, 969.00
1.00, 10.00, 1500.00, 102.00, 1918.00
1.00, 8.00, 800.00, 110.00, 1810.00
1.00, 9.00, 1200.00, 92.00, 1896.00
1.00, 9.50, 1100.00, 87.00, 1715.00
1.00, 12.50, 1300.00, 79.00, 1699.00
Because we want to analyse sums of squares we need not standardize the data, and because the data contain the constant item const we do not even need to center the items - we can use them with their original values.
2. SSqr- and CoProduct-matrix
First we compute the matrix CoProd, which is an analogue of the covariance matrix of the items and contains, for instance, the sum of squares of each item in its diagonal. It makes the following matrix formulae easier if the dependent item is at the end of the list, i.e. at the bottom of the CoProd matrix. The computation is simply done by the MatMate command "CoProd = data' * data" and gives the following matrix:
| CoProd    | const     | preis      | VerkFoerd    | Vertreter   | Absatz       |
|-----------|-----------|------------|--------------|-------------|--------------|
| const     | 10.000    | 104.950    | 10250.000    | 936.000     | 17262.000    |
| preis     | 104.950   | 1123.003   | 108550.000   | 9736.550    | 180338.650   |
| VerkFoerd | 10250.000 | 108550.000 | 13172500.000 | 981650.000  | 19132900.000 |
| Vertreter | 936.000   | 9736.550   | 981650.000   | 89370.000   | 1643436.000  |
| Absatz    | 17262.000 | 180338.650 | 19132900.000 | 1643436.000 | 30838452.000 |

(MatMate:)
CoProd = data' * data
The diagonal holds the SSq ("sum of squares") of each item; the off-diagonal entries are the sums of cross-products of the data. For Absatz (when dependent) SPSS documents, with Sum-of-Squares type SSType(1), the value "Gesamt: 30838452.000", which is exactly the number in the diagonal for Absatz in the above table. (Unfortunately the Unianova procedure does not print the sums of squares of the other items.)
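As a cross-check, CoProd = data' * data can be reproduced in a few lines of Python/numpy (a sketch of my own, not part of the original MatMate session):

```python
import numpy as np

# data from Backhaus et al. (1990): columns const, Preis, VerkFoerd, Vertreter, Absatz
data = np.array([
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
])

CoProd = data.T @ data      # matrix of sums of squares / cross-products
print(np.diag(CoProd))      # SSq of each item; last entry 30838452.0 for Absatz
```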
3.1. Partial Sums-of-Squares for model with item-order: Const, Preis, VerkFoerd, Vertreter
Interestingly, the mechanisms of Cholesky decomposition and of rotation as in PCA can be applied to find the explaining partial sums of squares. A Cholesky decomposition of the CoProd matrix gives:
"Loadings" for (re-) ordered Anova; Model: Absatz <- (const) Preis, Verkfoerd, Vertreter
| PL        | [const]  | [preis]  | [VerkFoerd] | [Vertreter] | [Absatz] |
|-----------|----------|----------|-------------|-------------|----------|
| const     | 3.162    | .        | .           | .           | .        |
| preis     | 33.188   | 4.642    | .           | .           | .        |
| VerkFoerd | 3241.335 | 210.288  | 1619.268    | .           | .        |
| Vertreter | 295.989  | -18.691  | 16.168      | 33.907      | .        |
| Absatz    | 5458.724 | -177.932 | 911.997     | 284.368     | 310.685  |

(MatMate:)
PL = cholesky(CoProd)
In PCA these "loadings" would be the coordinates in the generated euclidean (orthogonal) factor space. The brackets in the column headers indicate that the columns are "partialled" up to that specific item - they are not names of coordinate axes!
Now that we have the "loadings" in table PL, we need sums of squares. Just as in PCA the partial covariances are simply the squared loadings, the "partial sums of squares" are the squares of the "loadings":
Partial Sums-of-Squares

| PSSq      | [const]      | [preis]   | [VerkFoerd] | [Vertreter] | [Absatz]  |
|-----------|--------------|-----------|-------------|-------------|-----------|
| const     | 10.000       | .         | .           | .           | .         |
| preis     | 1101.450     | 21.552    | .           | .           | .         |
| VerkFoerd | 10506250.000 | 44221.094 | 2622028.906 | .           | .         |
| Vertreter | 87609.600    | 349.339   | 261.406     | 1149.655    | .         |
| Absatz    | 29797664.400 | 31659.900 | 831737.936  | 80864.894   | 96524.870 |

(In the Absatz row, the entries for [preis], [VerkFoerd] and [Vertreter] are the ones documented by R; the entry for [Vertreter] is also the one documented by SPSS with SS III.)
(MatMate:)
PSSq = PL ^# 2 // compute squares elementwise
The row for the item Absatz already contains the ANOVA SSq for the model with hierarchical list "absatz <- const preis VerkFoerd Vertreter".
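The same decomposition can be sketched in Python/numpy (a sketch of my own; numpy's cholesky, like MatMate's, returns the lower-triangular factor):

```python
import numpy as np

# data from Backhaus et al. (1990): columns const, Preis, VerkFoerd, Vertreter, Absatz
data = np.array([
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
])

CoProd = data.T @ data
PL   = np.linalg.cholesky(CoProd)   # lower-triangular "loadings" matrix
PSSq = PL ** 2                      # partial sums of squares, squared elementwise

# the Absatz row holds the hierarchical (SS type I) decomposition:
# explained SSq for [const], [preis], [VerkFoerd], [Vertreter] and the residual
print(PSSq[4])
```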
Software R:
In the output of the anova procedure of R we find the same entries as in the row for the item Absatz; they represent the so-to-say "partially explained sums of squares". Note that the R procedure does not display the entry for [const]! The commands were
RegModel.7 <- lm(absatzmenge~1+preis+verkfoerd+vertreter, data=Backhaus_Regression)
anova(RegModel.7)
getting
Response: absatzmenge

|           | Df | Sum Sq | Mean Sq | F value | Pr(>F)   |
|-----------|----|--------|---------|---------|----------|
| preis     | 1  | 31660  | 31660   | 1.9680  | 0.210232 |
| verkfoerd | 1  | 831738 | 831738  | 51.7010 | 0.000366 |
| vertreter | 1  | 80865  | 80865   | 5.0266  | 0.066165 |
| Residuals | 6  | 96525  | 16087   |         |          |
Unfortunately, the contribution of the constant is not displayed.
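That these entries really are the sequential (type I) sums of squares can also be verified by comparing them with the drop in residual sum of squares of nested least-squares fits; a sketch of my own in Python/numpy:

```python
import numpy as np

# data from Backhaus et al. (1990): columns const, Preis, VerkFoerd, Vertreter, Absatz
data = np.array([
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
])
X, y = data[:, :4], data[:, 4]

rss = [y @ y]                        # "residual" SSq of the empty model is y'y
for k in range(1, 5):                # add const, Preis, VerkFoerd, Vertreter in turn
    b, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
    resid = y - X[:, :k] @ b
    rss.append(resid @ resid)

seq_ss = -np.diff(rss)               # sequential SSq = drop in RSS at each step
print(seq_ss, rss[-1])               # matches the Absatz row of PSSq above
```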
Software SPSS:
Using SSType(1), the UniAnova procedure gives us the same values:
Tests of between-subjects effects ("Tests der Zwischensubjekteffekte"); dependent variable: Absatz

| Source (Quelle)             | Sum of Squares (Type I) | df | Mean Square  | F        | Sig. |
|-----------------------------|-------------------------|----|--------------|----------|------|
| Corrected model             | 944262.730              | 3  | 314754.243   | 19.565   | .002 |
| Intercept (Konstanter Term) | 29797664.400            | 1  | 29797664.400 | 1852.227 | .000 |
| Preis                       | 31659.900               | 1  | 31659.900    | 1.968    | .210 |
| VerkFoerd                   | 831737.936              | 1  | 831737.936   | 51.701   | .000 |
| Vertreter                   | 80864.894               | 1  | 80864.894    | 5.027    | .066 |
| Error (Fehler)              | 96524.870               | 6  | 16087.478    |          |      |
| Total (Gesamt)              | 30838452.000            | 10 |              |          |      |
| Corrected total             | 1040787.600             | 9  |              |          |      |
UNIANOVA Absatz WITH Preis VerkFoerd Vertreter
/METHOD=SSTYPE(1) /INTERCEPT=INCLUDE
/CRITERIA=ALPHA(0.05) /DESIGN=Preis VerkFoerd Vertreter .
On my first encounter with the ANOVA procedure it was quite irritating that SPSS with SSType(3) gave different output for the leading three items, as documented in the following table. It seemed as if the method of computation, i.e. the concept for the sums of squares selected by SSType(), were different - and not merely, as is actually the case, the selection and presentation of values from a larger set of possible partial coefficients; I'll show this below.
Tests of between-subjects effects ("Tests der Zwischensubjekteffekte"); dependent variable: Absatz

| Source (Quelle)             | Sum of Squares (Type III) | df | Mean Square | F      | Sig. |
|-----------------------------|---------------------------|----|-------------|--------|------|
| Corrected model             | 944262.730                | 3  | 314754.243  | 19.565 | .002 |
| Intercept (Konstanter Term) | 26178.826                 | 1  | 26178.826   | 1.627  | .249 |
| Preis                       | 10687.148                 | 1  | 10687.148   | .664   | .446 |
| VerkFoerd                   | 491123.992                | 1  | 491123.992  | 30.528 | .001 |
| Vertreter                   | 80864.894                 | 1  | 80864.894   | 5.027  | .066 |
| Error (Fehler)              | 96524.870                 | 6  | 16087.478   |        |      |
| Total (Gesamt)              | 30838452.000              | 10 |             |        |      |
| Corrected total             | 1040787.600               | 9  |             |        |      |

UNIANOVA Absatzmenge WITH Preis VerkFoerd Vertreter
/METHOD=SSTYPE(3) /INTERCEPT=INCLUDE
/CRITERIA=ALPHA(0.05) /DESIGN=Preis VerkFoerd Vertreter .
The key here is that this table can be understood as a collection of the relevant partial sums of squares of four different SS I procedures. Consider four models, in each of which a different item is at the end of the list:
"Unianova (...) /Design = const, Preis, VerkFoerd, Vertreter (...)",
"Unianova (...) /Design = Vertreter, const, Preis, VerkFoerd (...)",
"Unianova (...) /Design = VerkFoerd, Vertreter, const, Preis (...)",
"Unianova (...) /Design = Preis, VerkFoerd, Vertreter, const (...)",
For each of these analyses using SSType(1), the partial sums of squares documented in the above table occur as the partial sum of squares of the last item in the respective list: the SSType(1) procedure uses a hierarchical model for the decomposition, while the SSType(3) procedure documents, for each item, the value it gets when it is the last one in the list.
Remark: these coefficients are also analogous to the concept of "usefulness" in regression; see the remark in "3.5 Overview" for a couple of references.
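This "each item last" reading of SSType(3) can be verified directly: reorder the items so that the item of interest comes last among the predictors, take the Cholesky factor of the reordered CoProd, and square the corresponding entry of the Absatz row. A sketch of my own in Python/numpy:

```python
import numpy as np

# data from Backhaus et al. (1990): columns const, Preis, VerkFoerd, Vertreter, Absatz
data = np.array([
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
])
names = ["const", "Preis", "VerkFoerd", "Vertreter"]

ss3 = {}
for j, name in enumerate(names):
    # put predictor j last among the predictors; Absatz (column 4) stays at the end
    order = [i for i in range(4) if i != j] + [j, 4]
    PL = np.linalg.cholesky(data[:, order].T @ data[:, order])
    ss3[name] = PL[4, 3] ** 2    # partial SSq of the item when entered last
print(ss3)                       # reproduces the SSType(3) column of the table
```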
The entry Fehler with value 96524.870 is the "unexplained sum of squares" of the dependent item, which we also find - unsurprisingly - in the above table PSSq, in the column [Absatz], as the residual of the regression-like sums-of-squares decomposition.
In the following I document the differently ordered models in order to obtain all partial sums of squares of the various models (which are collected in a single output of the SPSS procedure for SSType(3)). The "loadings" matrices PL are merely rotated versions of each other and serve only as the source for the matrices PSSq of partial sums of squares; the latter simply contain the squares of the "loadings" and provide the sets of coefficients we are interested in.
3.2. Partial Sums-of-Squares for model with item-order: Vertreter, Const, Preis, VerkFoerd
First we rotate the PL matrix such that the previously last variable VerkFoerd gets an item-specific "loading" on the 4th axis. (In R this would mean redefining the formula for the ANOVA model.) The rotated "loadings" and the resulting partial sums of squares (their elementwise squares) are not reproduced here; the Absatz row of PSSq for this model appears in the overview in section 3.5.
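The Absatz row of PSSq for this ordering can be recomputed by a Cholesky decomposition of the reordered CoProd; a sketch of my own in Python/numpy:

```python
import numpy as np

# data from Backhaus et al. (1990): columns const, Preis, VerkFoerd, Vertreter, Absatz
data = np.array([
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
])

# order: Vertreter, const, Preis, VerkFoerd, Absatz (columns 3, 0, 1, 2, 4)
d2  = data[:, [3, 0, 1, 2, 4]]
PL2 = np.linalg.cholesky(d2.T @ d2)
print(PL2[4] ** 2)   # ~ 30221348.172, 12580.307, 16874.660, 491123.992, 96524.870
```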
3.3. Partial Sums-of-Squares for model with item-order: VerkFoerd, Vertreter, Const, Preis
Again we rotate the PL matrix, now such that the previously last variable Preis gets an item-specific "loading" on the 4th axis. The resulting tables are not reproduced here; the Absatz row of PSSq for this model appears in the overview in section 3.5.
3.4. Partial Sums-of-Squares for model with item-order: Preis, VerkFoerd, Vertreter, Const
The resulting tables are not reproduced here; the Absatz row of PSSq for this model appears in the overview in section 3.5.
3.5. Partial Sums-of-Squares -overview-
To collect the sets of coefficients of R and SPSS, we copy all rows of the PSSq matrices which contain the partial sums of squares for Absatz; the models were defined with hierarchies according to the item order, always with Absatz as dependent.
| Model 1                | [const]      | [preis]   | [VerkFoerd] | [Vertreter] | [Absatz]  |
|------------------------|--------------|-----------|-------------|-------------|-----------|
| Absatz (SPSS SS(1), R) | 29797664.400 | 31659.900 | 831737.936  | 80864.894   | 96524.870 |

| Model 2             | [Vertreter]  | [const]   | [preis]   | [VerkFoerd] | [Absatz]  |
|---------------------|--------------|-----------|-----------|-------------|-----------|
| Absatz (SPSS SS(1)) | 30221348.172 | 12580.307 | 16874.660 | 491123.992  | 96524.870 |

| Model 3             | [VerkFoerd]  | [Vertreter] | [const]   | [preis]   | [Absatz]  |
|---------------------|--------------|-------------|-----------|-----------|-----------|
| Absatz (SPSS SS(1)) | 27790310.299 | 2920182.204 | 20747.480 | 10687.148 | 96524.870 |

| Model 4             | [preis]      | [VerkFoerd] | [Vertreter] | [const]   | [Absatz]  |
|---------------------|--------------|-------------|-------------|-----------|-----------|
| Absatz (SPSS SS(1)) | 28959889.834 | 1079973.646 | 675884.824  | 26178.826 | 96524.870 |
(Remark: for models 2 to 4 I was unable to configure the R command accordingly; the results were always as if const were in the first place of the list, and it was also not displayed. I cross-checked the results with MatMate anyway and found that this was the only problem.)
| All models | [const]   | [preis]   | [VerkFoerd] | [Vertreter] | [Absatz]  |
|------------|-----------|-----------|-------------|-------------|-----------|
| SPSS SS(3) | 26178.826 | 10687.148 | 491123.992  | 80864.894   | 96524.870 |
(Remark: with option SSType(3) SPSS documents this list of coefficients as a collection of the hierarchical SSType(1) results: for each item, the value from the model in which that item comes last.)
The solution of R (model 1) is the first row of coefficients without [const]; that of SPSS SSType(1) is the whole first row; and SPSS SSType(3) collects the fourth column, i.e. the last-predictor entry of each model:
Sums of squares as documented by R (anova) and SPSS (unianova):

| Model 1   | R          | SPSS SS(1)   | SPSS SS(3) |
|-----------|------------|--------------|------------|
| const     | ?          | 29797664.400 | 26178.826  |
| preis     | 31659.900  | 31659.900    | 10687.148  |
| VerkFoerd | 831737.936 | 831737.936   | 491123.992 |
| Vertreter | 80864.894  | 80864.894    | 80864.894  |
| Residual  | 96524.870  | 96524.870    | 96524.870  |
Conclusion: the "anova" procedure in R and SSType(1) in SPSS give us the set of explained partial sums of squares (of the dependent item) organized in hierarchical order. That order is implicitly defined by the textual order of the items in the command for the procedure.
In SPSS the position of the constant in the hierarchy can be modified if a constant data item (called, for instance, "const") is included and, in the unianova command, the implicit computation of the coefficient for the constant is deactivated by the option /intercept=exclude. In R the model seems to always contain the constant in the first position of the hierarchy, and there seems to be no similar workaround.
The construct with the "loadings" matrices PL has the interesting aspect that the direction of the influence of an item on the dependent can be seen. In the first model in 3.1 the item Preis has a partial "loading" with negative value, which shows a negative relation of Absatz to Preis when const is partialled out. The corresponding partial sum of squares in PSSq is of course positive, and from that coefficient alone one would not see this information.
A further interesting aspect is that the SPSS SSType(3) default gives us the set of coefficients analogous to the "usefulness" coefficients known from the regression procedure (together with their F- and p-values). The usefulness coefficient seems to be rarely discussed - it is not even in Wikipedia; I found it mentioned, for instance, in the 1999 book "Statistik für Sozialwissenschaftler" by J. Bortz, pg. 442 ("Nützlichkeit"), referring to an idea of R. B. Darlington (1968). It is also described in a more recent lecture script by M. Persike (2008, pg. 6).
4. Regression
Finally we compute the regression coefficients B for the items, using the inverse of the upper-left submatrix of PL. This gives us columns in the metric of the predictors (see the 1.000-coordinates in their columns). The order of the items becomes irrelevant here, because each item gets its own axis attached, in which the dependent item can be measured. Note that the vector space then has non-orthogonal axes.
To check this using SPSS we compare the entries in the row "Absatz" with the coefficients ("Nicht standardisierte Koeffizienten", i.e. unstandardized coefficients) given by SPSS. Remark: const was given as a variable in order to have additional options for the output; of course, with a "constant" included, the option /ORIGIN must then be applied:
(MatMate:)
PLInv = inv(PL[1..4,1..4])
PLInv = insert(PLInv, {sqrt(N)})
B = PL * PLInv
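The equivalence of this construction with an ordinary least-squares fit can be checked with a triangular solve on the Cholesky factor; a sketch of my own in Python/numpy:

```python
import numpy as np

# data from Backhaus et al. (1990): columns const, Preis, VerkFoerd, Vertreter, Absatz
data = np.array([
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
])
X, y = data[:, :4], data[:, 4]

PL  = np.linalg.cholesky(data.T @ data)
L11 = PL[:4, :4]    # Cholesky factor of X'X (upper-left submatrix of PL)
l21 = PL[4, :4]     # Absatz row over the predictor columns

# X'X = L11 L11' and X'y = L11 l21, hence the coefficients solve L11' b = l21
b_chol = np.linalg.solve(L11.T, l21)

# cross-check with a direct least-squares fit
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_chol, b_ols))   # True
```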
REGRESSION /ORIGIN /DEPENDENT Absatzmenge /METHOD=ENTER const Preis VerkFoerd Vertreter .
5. References
[Backhaus] K. Backhaus, B. Erichson, W. Plinke, R. Weiber: "Multivariate Analysemethoden". Springer, Berlin; 1990, 6. Auflage.
[Bortz] J. Bortz: "Statistik für Sozialwissenschaftler". Springer, Berlin; 1999, 5. Auflage.
[Darlington] R. B. Darlington: "Multiple regression in psychological research and practice". Psychol. Bull. 69, 1968, pp. 161-182. (Referred to by J. Bortz, pg. 442.)
[Persike] M. Persike: "Forschungsstatistik I". Skript zur Vorlesung, 2008.
http://methodenlehre.sowi.uni-mainz.de/download/Lehre/SS2009/StatistikII/VL_2009_05_12.pdf
[Wikipedia] "Regression analysis" (multiple authors).
https://en.wikipedia.org/wiki/Regression_analysis
[SPSS] IBM SPSS, V. 21, German.
[R] The R project.
[MatMate] G. Helms: "MatMate - a matrix calculator for statistical education and self-study". 1996, last update 2016.
http://go.helms-net.de/sw/matmate/index.htm
[SSE_Q&A] Related questions and discussions on stats.stackexchange.com:
http://stats.stackexchange.com/questions/13241/the-order-of-variables-in-anova-matters-doesnt-it
http://stats.stackexchange.com/questions/11209/the-effect-of-the-number-of-replicates-in-different-cells-on-the-results-of-anova
http://stats.stackexchange.com/questions/20452/how-to-interpret-type-i-type-ii-and-type-iii-anova-and-manova
and from there various more links:
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Type1-3.pdf
http://www.uni-kiel.de/psychologie/dwoll/r/ssTypes.php
http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf
http://r.789695.n4.nabble.com/Type-I-v-s-Type-III-Sum-Of-Squares-in-ANOVA-td1573657.html
https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
(c) Gottfried Helms, Univ. Kassel, 5'2016, Version 3.5