Some insight into the differences between the default ANOVA procedures in R and SPSS,
and between the "SSType(1)" and "SSType(3)" options.
An example with purely metric data


Gottfried Helms
5'2016

In this paper I show why the ANOVA procedures in R and SPSS, used with their defaults, give different results. When I first encountered this problem it was difficult for me to understand it from the software documentation: on the one hand it was much too concise, and on the other hand the discussions I found about the SS-type problem involved a multitude of little-understood topics. So I decided to understand the matter by re-engineering the matrix mathematics of the ANOVA procedure myself. This is a third, extended version, motivated by some questions in the discussion forum "stats.stackexchange.com".

For the presentation here I have (re-)analyzed the outputs of both programs for some data and different models, using my matrix software MatMate. I found the key to the basic and technical understanding, and think I now have a concise systematic scheme compatible with other concepts such as the matrix model and the variance decomposition of the linear model.

The methods available in MatMate allowed me to re-engineer the mathematical procedures and to find a reasonable conceptual background. The differences can nicely be seen in terms of triangular "loadings" matrices and their column rotations, similar to what is known from principal components analysis, where the analogue of the Cholesky factorization of covariance matrices occurs; even the computation of the regression coefficients B can be derived from these basic tools.


1. Data

From the book "Multivariate Analysemethoden" by K. Backhaus [1990] I took the (metric) data example from the chapter on regression (pg. 6).

The setting is an analysis of the sales of some item in various branch shops, depending on the price, the practice of sending sales advisors to the shops, and the investment in marketing. The items are Absatzmenge (or Absatz) for the number of sold items per shop, Preis for the price, Vertreter for the number of advisors' visits, and VerkFoerd for the amount spent on marketing.

To have the option of putting the constant anywhere in the (hierarchical) ANOVA model, I also added a variable const with the constant value "1"; in SPSS this requires the "Unianova" procedure with "/intercept=exclude". In MatMate this can happily be configured arbitrarily; only in R is it impossible to move const from the first place in the (typically hierarchical) model to some other position in the list of items.

Of course the data are all metric, but I think the situation is in principle the same when the tested model mixes different scale types or even contains only nominally scaled factors and interactions. In the ANOVA procedures, nominally scaled factors can be coded by dummy variables, which again allow the metric procedures used here; so I think the reduction to the simple case of metric items should still be meaningful for this small analysis of the basic structure of the procedure implementations. Interaction effects can easily be added: just compute the new interaction item as the product of the scores of the involved items. I have checked some examples of this kind, based on these data, with MatMate and SPSS.
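For illustration only, the interaction step just described is a one-liner in any language; a small Python sketch (the values are the first three cases of Preis and Vertreter from the data below):

```python
# Build an interaction item as the elementwise product of two predictor
# columns, as described above (first three cases of Preis and Vertreter).
preis     = [12.50, 10.00, 9.95]
vertreter = [109.0, 107.0, 99.0]

# The new interaction item: product of the scores of the involved items.
preis_x_vertreter = [p * v for p, v in zip(preis, vertreter)]
```

The resulting column would then simply be appended to the data matrix as a further item.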

The data used are as follows.

Readable for MatMate

labels = {"const", "preis","VerkFoerd","Vertreter","Absatz"}

 data = {  _
     {    1.00,   12.50, 2000.00,  109.00, 2298.00}, _
     {    1.00,   10.00,  550.00,  107.00, 1814.00}, _
     {    1.00,    9.95, 1000.00,   99.00, 1647.00}, _
     {    1.00,   11.50,  800.00,   70.00, 1496.00}, _
     {    1.00,   12.00,    0.00,   81.00,  969.00}, _
     {    1.00,   10.00, 1500.00,  102.00, 1918.00}, _
     {    1.00,    8.00,  800.00,  110.00, 1810.00}, _
     {    1.00,    9.00, 1200.00,   92.00, 1896.00}, _
     {    1.00,    9.50, 1100.00,   87.00, 1715.00}, _
     {    1.00,   12.50, 1300.00,   79.00, 1699.00} }

 

Readable as CSV-data

;    "const",    "preis",  "VerkFoerd", "Vertreter",    "Absatz"
        1.00,       12.50,     2000.00,      109.00,     2298.00
        1.00,       10.00,      550.00,      107.00,     1814.00
        1.00,        9.95,     1000.00,       99.00,     1647.00
        1.00,       11.50,      800.00,       70.00,     1496.00
        1.00,       12.00,        0.00,       81.00,      969.00
        1.00,       10.00,     1500.00,      102.00,     1918.00
        1.00,        8.00,      800.00,      110.00,     1810.00
        1.00,        9.00,     1200.00,       92.00,     1896.00
        1.00,        9.50,     1100.00,       87.00,     1715.00
        1.00,       12.50,     1300.00,       79.00,     1699.00

 

Because we want to analyse sums of squares we need not standardize the data, and because the data contain the constant item const we do not even need to center the items: we can use them with their original values.

 


2. SSqr- and CoProduct-matrix

First we compute the matrix CoProd, an analogue of the covariance matrix of the items; its diagonal contains, for instance, the sum of squares of each item. The following matrix formulae become easier if the dependent item is at the end of the list, i.e. in the bottom row of the CoProd matrix. The matrix is obtained simply by the MatMate command "CoProd = data' * data" and looks as follows:

CoProd           const        preis     VerkFoerd     Vertreter        Absatz
const           10.000      104.950     10250.000       936.000     17262.000
preis          104.950     1123.003    108550.000      9736.550    180338.650
VerkFoerd    10250.000   108550.000  13172500.000    981650.000   19132900.000
Vertreter      936.000     9736.550    981650.000     89370.000    1643436.000
Absatz       17262.000   180338.650  19132900.000   1643436.000   30838452.000

(MatMate:)
CoProd = data' * data 

 

The diagonal contains the SSq ("sum of squares") of each item; the off-diagonal entries are the sums of the cross-products of the data. For Absatz (as dependent item) SPSS documents, with sum-of-squares type SSType(1), the value "Gesamt: 30838452.000" (total), which is exactly the number in the diagonal for Absatz in the above table. (Unfortunately the Unianova procedure does not print the sums of squares for the other items.)
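The computation of CoProd does not depend on MatMate; as a cross-check, a small pure-Python sketch (no libraries) of data' * data for the data of section 1:

```python
# Compute CoProd = data' * data for the Backhaus data listed in section 1.
# Columns: const, Preis, VerkFoerd, Vertreter, Absatz.
data = [
    [1.0, 12.50, 2000.0, 109.0, 2298.0],
    [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0],
    [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0],
    [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0],
    [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0],
    [1.0, 12.50, 1300.0,  79.0, 1699.0],
]

# CoProd[i][j] = sum over cases of item_i * item_j
# (sums of squares on the diagonal, cross-products off the diagonal)
coprod = [[sum(row[i] * row[j] for row in data) for j in range(5)] for i in range(5)]
```

The diagonal entry for Absatz then reproduces the "Gesamt" value 30838452.000 mentioned above.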


3.1. Partial Sums-of-Squares  for model with item-order:  Const, Preis, VerkFoerd, Vertreter   

Interestingly, the mechanism of Cholesky decomposition and rotations, as known from PCA, can be applied to find the explaining partial sums of squares. A Cholesky decomposition of the CoProd matrix gives:

"Loadings" for (re-)ordered ANOVA; model: Absatz <- (const) Preis, VerkFoerd, Vertreter

PL           [const]     [preis]  [VerkFoerd]  [Vertreter]   [Absatz]
const          3.162           .            .            .          .
preis         33.188       4.642            .            .          .
VerkFoerd   3241.335     210.288     1619.268            .          .
Vertreter    295.989     -18.691       16.168       33.907          .
Absatz      5458.724    -177.932      911.997      284.368    310.685

(MatMate:)
PL = cholesky(CoProd) 

 

In PCA, such "loadings" would be the coordinates in the generated euclidean (orthogonal) factor space. The brackets in the column headers indicate that the columns are "partialled" up to that specific item; they are not names of coordinate axes!

Having the "loadings" in table PL, we now need sums of squares. Just as in PCA the partial covariances are simply the squared component loadings, the "partial sums of squares" are here the squares of the "loadings":

Partial Sums-of-Squares

PSSq            [const]     [preis]   [VerkFoerd]  [Vertreter]   [Absatz]
const            10.000           .             .            .          .
preis          1101.450      21.552             .            .          .
VerkFoerd   1050625.000   44221.094   2622028.906            .          .
Vertreter     87609.600     349.339       261.406     1149.655          .
Absatz     29797664.400   31659.900    831737.936    80864.894  96524.870

(The entries of the Absatz row for preis, VerkFoerd, Vertreter and the residual, marked yellow and blue in the original, are those documented by R; the entry for the last predictor, Vertreter, marked blue, is also the one documented by SPSS with SS III.)

(MatMate:)
PSSq = PL  ^# 2          // compute squares elementwise

 

The row for the item Absatz already contains the ANOVA SSq for the model with the hierarchical list "Absatz <- const Preis VerkFoerd Vertreter".
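The whole two-step recipe can be sketched in plain Python with a hand-rolled Cholesky factorization (a sketch for verification, not the MatMate code); it reproduces the Absatz row of PSSq from the raw data:

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor L with L * L' = a (plain Python)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L

# The Backhaus data from section 1 (columns: const, Preis, VerkFoerd, Vertreter, Absatz)
data = [
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
]

# CoProd = data' * data, then PSSq = elementwise squares of the Cholesky "loadings"
coprod = [[sum(r[i] * r[j] for r in data) for j in range(5)] for i in range(5)]
PL = cholesky(coprod)
pssq_absatz = [x * x for x in PL[4]]   # the Absatz row: the SS Type I decomposition
```

The five values of pssq_absatz agree (up to rounding) with the Absatz row of the PSSq table above.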

Software R:

To compare this result with the anova procedure in R: the entries of the Absatz row for preis, VerkFoerd and Vertreter represent the partially explained sums of squares that R reports. Note that R does not display the entry for [const]! The commands were

RegModel.7 <- lm(absatzmenge~1+preis+verkfoerd+vertreter, data=Backhaus_Regression)

anova(RegModel.7)

getting

Response: absatzmenge

            Df  Sum Sq  Mean Sq  F value    Pr(>F)
preis        1   31660    31660   1.9680  0.210232
verkfoerd    1  831738   831738  51.7010  0.000366
vertreter    1   80865    80865   5.0266  0.066165
Residuals    6   96525    16087

 

 

Unfortunately, the contribution of the constant is not displayed.

 

Software SPSS:

When using SSType(1), the UniAnova procedure gives us the same values:

Tests of between-subjects effects; dependent variable: Absatz

Source             Type I Sum of Squares   df    Mean Square          F   Sig.
Corrected model            944262.73        3     314754.243     19.565   .002
Intercept                29797664.400       1   29797664.400   1852.227   .000
Preis                       31659.900       1      31659.900      1.968   .210
VerkFoerd                  831737.936       1     831737.936     51.701   .000
Vertreter                   80864.894       1      80864.894      5.027   .066
Error                       96524.870       6      16087.478
Total                    30838452.000      10
Corrected total           1040787.600       9

UNIANOVA Absatz WITH Preis VerkFoerd Vertreter
  /METHOD=SSTYPE(1) /INTERCEPT=INCLUDE /CRITERIA=ALPHA(0.05) /DESIGN=Preis VerkFoerd Vertreter .

 

At my first encounter with the ANOVA procedure it was quite irritating that SPSS with SSType(3) gave differing output for the leading three items (marked orange in the original). It seemed as if the method of computation, i.e. the very concept of the sums of squares selected by SSType(), were different, and not merely the choice of which coefficients to present out of a larger set of possible partial coefficients, as is actually the case; I show this below.

Tests of between-subjects effects; dependent variable: Absatz

Source             Type III Sum of Squares  df    Mean Square        F   Sig.
Corrected model            944262.73         3     314754.243   19.565   .002
Intercept                   26178.826        1      26178.826    1.627   .249
Preis                       10687.148        1      10687.148     .664   .446
VerkFoerd                  491123.992        1     491123.992   30.528   .001
Vertreter                   80864.894        1      80864.894    5.027   .066
Error                       96524.870        6      16087.478
Total                    30838452.000       10
Corrected total           1040787.600        9

UNIANOVA Absatzmenge WITH Preis VerkFoerd Vertreter
  /METHOD=SSTYPE(3) /INTERCEPT=INCLUDE /CRITERIA=ALPHA(0.05) /DESIGN=Preis VerkFoerd Vertreter .

The key here is that this table can be understood as a collection of the relevant partial sums of squares from four different SS I procedures. Consider four models, in each of which a different item stands at the end of the list:

"Unianova (...) /Design =  const, Preis, VerkFoerd, Vertreter  (...)",
"Unianova (...) /Design =  Vertreter, const, Preis, VerkFoerd  (...)",
"Unianova (...) /Design =  VerkFoerd, Vertreter, const, Preis  (...)" 
"Unianova (...) /Design =  Preis, VerkFoerd, Vertreter, const  (...)" 

If each of these analyses is run with SSType(1), the partial sums of squares documented in the above table always occur as the partial sum of squares of the last item in the list: the SSType(1) procedure uses a hierarchical model for the decomposition, and the SSType(3) procedure documents, for each item, the value obtained when that item is the last one in the list.
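This equivalence can be verified with a small Python sketch (again with a hand-rolled Cholesky; a verification sketch, not SPSS internals): run the SSType(1) decomposition for each of the four hierarchies and keep only the last predictor's entry; the collected values are exactly the SSType(3) column:

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor L with L * L' = a (plain Python)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L

# Backhaus data (columns: const, Preis, VerkFoerd, Vertreter, Absatz)
data = [
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
]

def ss1_of_last(order):
    """SS Type I entry of the LAST predictor in the given hierarchy (Absatz stays last)."""
    cols = list(order) + [4]                  # dependent item Absatz at the end
    x = [[row[c] for c in cols] for row in data]
    cp = [[sum(r[i] * r[j] for r in x) for j in range(5)] for i in range(5)]
    return cholesky(cp)[4][3] ** 2            # squared Absatz loading on the 4th axis

# Column indices: 0=const, 1=Preis, 2=VerkFoerd, 3=Vertreter.
# The four hierarchies from the text, each item once in the last position:
ss3 = {
    "Vertreter": ss1_of_last([0, 1, 2, 3]),
    "VerkFoerd": ss1_of_last([3, 0, 1, 2]),
    "Preis":     ss1_of_last([2, 3, 0, 1]),
    "const":     ss1_of_last([1, 2, 3, 0]),
}
```

The four collected values match the Type III column of the SPSS table above.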

Remark: these coefficients are also analogous to the concept of "usefulness" in regression; see the remark in "3.5 Overview" for a couple of references.

The entry Fehler (error) with value 96524.870 is the unexplained sum of squares of the dependent item, which we also find, unsurprisingly, in the above table PSSq in the column [Absatz] as the residual of the regression-like sum-of-squares decomposition.

In the following I document the differently reordered models, to obtain all the partial sums of squares of the various models (which are collected in a single output of the SPSS procedure with SSType(3)). The "loadings" matrices PL are merely rotated versions of one another and serve only as the source for the matrices PSSq of partial sums of squares; the latter simply contain the squares of the "loadings" and provide the sets of coefficients we are interested in. (In the original layout the "loadings" matrices were greyed out to put the focus on the partial sums-of-squares tables.)

 

3.2. Partial Sums-of-Squares for model with item-order: Vertreter, const, Preis, VerkFoerd

First we rotate the PL matrix such that the previously last variable VerkFoerd gets an item-specific "loading" on the 4th axis. (In R this would mean redefining the formula of the ANOVA model.) The sums of squares, which are simply the squares of the "loadings", come first; the "loadings" themselves follow below.

Partial Sums-of-Squares

PSSq        [Vertreter]     [const]     [preis]  [VerkFoerd]   [Absatz]
const             9.803       0.197           .            .          .
preis          1060.763      44.964      17.275            .          .
VerkFoerd  10782552.562    4919.035  248743.030  2136285.373          .
Vertreter     89370.000           .           .            .          .
Absatz     30221348.172   12580.307   16874.660   491123.992  96524.870

(The entry for VerkFoerd, the last predictor in this hierarchy, is the one documented by SPSS with SS III.)

 

"Loadings" for re-ordered ANOVA model

PL         [Vertreter]   [const]  [preis]  [VerkFoerd]  [Absatz]
const            3.131     0.444        .            .         .
preis           32.569     6.706    4.156            .         .
VerkFoerd     3283.680   -70.136  498.741     1461.604         .
Vertreter      298.948         .        .            .         .
Absatz        5497.395   112.162  129.903      700.802   310.685

(MatMate:)
PL = rot(PL,"drei",4´1´2´3,1..4)
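The same kind of cross-check works for this reordering: permute the rows and columns of CoProd into the order Vertreter, const, Preis, VerkFoerd, Absatz, factor again, and square the Absatz row. A plain-Python sketch (for the Absatz row this is equivalent to the column rotation of PL):

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor L with L * L' = a (plain Python)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L

# Backhaus data (columns: const, Preis, VerkFoerd, Vertreter, Absatz)
data = [
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
]

# CoProd in the original order, then permuted to Vertreter, const, Preis, VerkFoerd, Absatz
coprod = [[sum(r[i] * r[j] for r in data) for j in range(5)] for i in range(5)]
perm = [3, 0, 1, 2, 4]
cp = [[coprod[i][j] for j in perm] for i in perm]

# Absatz row of PSSq for the reordered model
pssq_absatz = [x * x for x in cholesky(cp)[4]]
```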

 

 

3.3. Partial Sums-of-Squares for model with item-order:  VerkFoerd, Vertreter, Const, Preis   

Again we rotate the PL matrix; now such that the previously last variable Preis gets an item-specific "loading" on the 4th axis.

Partial Sums-of-Squares

PSSq        [VerkFoerd]  [Vertreter]    [const]    [preis]   [Absatz]
const             7.976        1.828      0.197          .          .
preis           894.523      167.315     45.691     15.474          .
VerkFoerd  13172500.000            .          .          .          .
Vertreter     73155.189    16214.811          .          .          .
Absatz     27790310.299  2920182.204  20747.480  10687.148  96524.870

(The entry for Preis, the last predictor in this hierarchy, is the one documented by SPSS with SS III.)

 

"Loadings" for re-ordered ANOVA model

PL         [VerkFoerd]  [Vertreter]  [const]   [preis]  [Absatz]
const            2.824        1.352    0.443         .         .
preis           29.909       12.935    6.760     3.934         .
VerkFoerd     3629.394            .        .         .         .
Vertreter      270.472      127.337        .         .         .
Absatz        5271.652     1708.854  144.040  -103.379   310.685

(MatMate:)
PL = rot(PL,"drei",3´4´1´2,1..4)

 

 

3.4. Partial Sums-of-Squares for model with item-order:  Preis, VerkFoerd, Vertreter, Const   

Partial Sums-of-Squares

PSSq           [preis]  [VerkFoerd]  [Vertreter]    [const]   [Absatz]
const            9.808        0.004        0.138      0.050          .
preis         1123.003            .            .          .          .
VerkFoerd 10492498.904  2680001.096            .          .          .
Vertreter    84416.914      612.338     4340.748          .          .
Absatz    28959889.834  1079973.646   675884.824  26178.826  96524.870

(The entry for const, the last item in this hierarchy, is the one documented by SPSS with SS III.)

 

"Loadings" for re-ordered ANOVA model

PL          [preis]  [VerkFoerd]  [Vertreter]  [const]  [Absatz]
const         3.132        0.064        0.372    0.223         .
preis        33.511            .            .        .         .
VerkFoerd  3239.213     1637.071            .        .         .
Vertreter   290.546       24.745       65.884        .         .
Absatz     5381.439     1039.218      822.122  161.799   310.685

(MatMate:)
PL = rot(PL,"drei",2´3´4´1,1..4)

 

 

3.5. Partial Sums-of-Squares: overview

To collect the sets of coefficients of R and SPSS, we copy all the PSSq rows containing the Absatz partial sums of squares; the models were defined with hierarchies according to the item order, always with Absatz as the dependent item.

Model 1                      [const]    [preis]  [VerkFoerd]  [Vertreter]   [Absatz]
Absatz (SPSS SS(1), R)  29797664.400  31659.900   831737.936    80864.894  96524.870

Model 2                  [Vertreter]    [const]    [preis]  [VerkFoerd]   [Absatz]
Absatz (SPSS SS(1))     30221348.172  12580.307  16874.660   491123.992  96524.870

Model 3                  [VerkFoerd]  [Vertreter]    [const]    [preis]   [Absatz]
Absatz (SPSS SS(1))     27790310.299  2920182.204  20747.480  10687.148  96524.870

Model 4                      [preis]  [VerkFoerd]  [Vertreter]    [const]   [Absatz]
Absatz (SPSS SS(1))     28959889.834  1079973.646   675884.824  26178.826  96524.870

(Remark: for models 2 to 4 I was unable to configure the R command accordingly; the results always came out as if const were in the first place of the list, and const was also not displayed. I cross-checked the results with MatMate anyway and found that this was the only problem.)

All models        [const]    [preis]  [VerkFoerd]  [Vertreter]   [Absatz]
SPSS SS(3)      26178.826  10687.148   491123.992    80864.894  96524.870

(Remark: with option SSType(3), SPSS documents the list of coefficients as a collection of the hierarchical SSType(1) results, marked orange in the original.)

The solution of R (model 1) is the first row of coefficients without [const]; that of SPSS SSType(1) is the whole first row; and that of SPSS SSType(3) is the fourth column, i.e. the last predictor of each model:

Sums of squares as documented by R (anova) and SPSS (unianova), model 1:

                      R    SPSS SS(1)    SPSS SS(3)
const                 ?  29797664.400     26178.826
preis         31659.900     31659.900     10687.148
VerkFoerd    831737.936    831737.936    491123.992
Vertreter     80864.894     80864.894     80864.894
Residual      96524.870     96524.870     96524.870

 

Conclusion: the anova procedure in R and SSType(1) in SPSS give the set of explained partial sums of squares (of the dependent item) organized in hierarchical order; that order is implicitly defined by the textual order of the items in the command for the procedure.

In SPSS the position of the constant in the hierarchy can be modified if a constant data item (called, for instance, "const") is included and the implicit computation of the intercept is deactivated in the Unianova command with the option /intercept=exclude. In R the model seems to always contain the constant in the first position of the hierarchy, and there seems to be no similar workaround.

The construct with the "loadings" matrices PL has the interesting aspect that the direction of the influence of an item on the dependent item can be seen. In the first model in 3.1, for instance, the item Preis has a partial "loading" with negative value, showing a negative relation of Absatz with Preis when const is partialled out. The corresponding partial sum of squares in PSSq is of course positive, and from that coefficient alone this information would not be visible.

A further interesting aspect is that the SPSS SSType(3) default gives us the set of coefficients analogous to the "usefulness" coefficients known from the regression procedure (including their F- and p-values). The usefulness coefficient in regression seems to be rarely discussed; it is not even in Wikipedia. I found it mentioned, for instance, in the 1999 book "Statistik für Sozialwissenschaftler" by J. Bortz, pg. 442 ("Nützlichkeit"), referring to an idea of R. B. Darlington (1968). It is also described in a more recent lecture script by M. Persike, 2008, pg. 6.


4. Regression

Finally in this sequence we compute the regression coefficients B for the items, using the inverse of the upper-left submatrix of PL. This gives us the columns in the metric of the predictors (see the 1.000 coordinates in their columns). Here the order of the items becomes irrelevant, because each item gets its own axis attached, in which the dependent item can be measured.

Note that the vector space then has non-orthogonal axes (the items which provide the metric are correlated).

 

To check this with SPSS we compare the entries of the row "Absatz" with the coefficients given by SPSS. Remark: "const" was supplied as a variable to have additional options for the output; with such a "constant" included, the option /ORIGIN (regression through the origin) must then be applied:

 

B            const    preis  VerkFoerd  Vertreter   Absatz
const        1.000        .          .          .        .
preis            .    1.000          .          .        .
VerkFoerd        .        .      1.000          .        .
Vertreter        .        .          .      1.000        .
Absatz     725.548  -26.281      0.479      8.387  982.471

(Matmate:)

PLInv = inv(PL[1..4,1..4])

PLInv = insert(PLInv,{sqrt(N)})

B = PL * PLInv
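The B row can also be cross-checked without the PL-inverse construction: solving the normal equations (X'X) b = X'y directly gives the same coefficients. A plain-Python sketch with Gauss-Jordan elimination (an alternative route for verification, not the MatMate construction above):

```python
# Backhaus data; X = predictors (const, Preis, VerkFoerd, Vertreter), y = Absatz
data = [
    [1.0, 12.50, 2000.0, 109.0, 2298.0], [1.0, 10.00,  550.0, 107.0, 1814.0],
    [1.0,  9.95, 1000.0,  99.0, 1647.0], [1.0, 11.50,  800.0,  70.0, 1496.0],
    [1.0, 12.00,    0.0,  81.0,  969.0], [1.0, 10.00, 1500.0, 102.0, 1918.0],
    [1.0,  8.00,  800.0, 110.0, 1810.0], [1.0,  9.00, 1200.0,  92.0, 1896.0],
    [1.0,  9.50, 1100.0,  87.0, 1715.0], [1.0, 12.50, 1300.0,  79.0, 1699.0],
]
X = [row[:4] for row in data]
y = [row[4] for row in data]

# The blocks of CoProd needed for the normal equations
xtx = [[sum(r[i] * r[j] for r in X) for j in range(4)] for i in range(4)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(4)]

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # pivot row
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * v for x, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

b = solve(xtx, xty)   # regression coefficients for const, Preis, VerkFoerd, Vertreter
```

The result agrees with the Absatz row of B above and with the SPSS regression output below.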

 

 

Unstandardized coefficients (regression coefficient B):

const        725.548
Preis        -26.281
VerkFoerd      0.479
Vertreter      8.387

REGRESSION /ORIGIN  /DEPENDENT Absatzmenge

  /METHOD=ENTER const Preis VerkFoerd Vertreter .

 

 


5. References

Backhaus     Multivariate Analysemethoden
             K. Backhaus, B. Erichson, W. Plinke, R. Weiber
             Springer, Berlin; 1990, 6th ed.

Bortz        Statistik für Sozialwissenschaftler
             J. Bortz
             Springer, Berlin; 1999, 5th ed.

Darlington   Multiple regression in psychological research and practice
             R. B. Darlington
             Psychol. Bull. 69, 1968, pg. 161-182
             (referred to by J. Bortz, pg. 442)

Persike      Forschungsstatistik I
             M. Persike
             2008, lecture script
             http://methodenlehre.sowi.uni-mainz.de/download/Lehre/SS2009/StatistikII/VL_2009_05_12.pdf

Wikipedia    Regression analysis
             (multiple authors)
             https://en.wikipedia.org/wiki/Regression_analysis

SPSS         IBM SPSS, V. 21, German

R            The R project

MatMate      a matrix calculator for statistical education and self-study
             G. Helms, 1996, last update 2016
             http://go.helms-net.de/sw/matmate/index.htm

SSE_Q&A      related questions and discussions on the stats.stackexchange.com network
             http://stats.stackexchange.com/questions/13241/the-order-of-variables-in-anova-matters-doesnt-it
             http://stats.stackexchange.com/questions/11209/the-effect-of-the-number-of-replicates-in-different-cells-on-the-results-of-anova
             http://stats.stackexchange.com/questions/20452/how-to-interpret-type-i-type-ii-and-type-iii-anova-and-manova
             and from there various further links:
             http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Type1-3.pdf
             http://www.uni-kiel.de/psychologie/dwoll/r/ssTypes.php
             http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf
             http://r.789695.n4.nabble.com/Type-I-v-s-Type-III-Sum-Of-Squares-in-ANOVA-td1573657.html
             https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/

 

(c) Gottfried Helms, Univ. Kassel, 5'2016, Version 3.5