Q: How can one create simulated data (to be added to a given dataset) such that the combined variables jointly satisfy a given correlation matrix? Abstract: the problem is solved using the matrix program MatMate (© with the author). The listing of the output is documented to make the process transparent. Gottfried Helms, 16.5.2007

;===========================================

;MatMate listing of: 16.05.2007 14:09:28

;============================================

##### First: show the data

X holds the empirical data: 9 variables with n = 100 observations each; the observations of a variable run along a row.

X :

 x1   0.417  -0.088  -1.065   0.574   2.137   0.093  ...
 x2   0.427  -1.157  -1.409   0.857   1.29   -0.51   ...
 x3   0.826   0.54   -0.886   0.982   1.596   0.633  ...
 x4   0.579  -0.228  -1.5     0.205   1.219   0.331  ...
 x5   0.808   0.151  -1.647   0.409   1.23    0.197  ...
 x6   0.432   0.381  -1.995   1.92    0.877  -0.547  ...
 x7   0.054   0.052  -1.01    0.646   1.656   0.408  ...
 x8   0.546   1.089  -1.335   1.599   1.694   0.154  ...
 x9   0.611  -0.269  -1.813   0.752   1.238   0.607  ...

R is a 12x12 correlation matrix. The top-left 9x9 block holds the empirical correlations of X; the remaining entries are the required correlations for the simulated variables x10..x12.

R :

 1     0.77  0.76  0.73  0.75  0.59  0.86  0.83  0.7   0.8   0.66  0.67
 0.77  1     0.57  0.87  0.68  0.72  0.81  0.71  0.67  0.87  0.83  0.82
 0.76  0.57  1     0.54  0.81  0.63  0.82  0.86  0.85  0.77  0.57  0.56
 0.73  0.87  0.54  1     0.77  0.66  0.83  0.71  0.72  0.77  0.76  0.86
 0.75  0.68  0.81  0.77  1     0.64  0.88  0.81  0.88  0.87  0.77  0.75
 0.59  0.72  0.63  0.66  0.64  1     0.58  0.86  0.74  0.62  0.79  0.73
 0.86  0.81  0.82  0.83  0.88  0.58  1     0.82  0.83  0.91  0.7   0.82
 0.83  0.71  0.86  0.71  0.81  0.86  0.82  1     0.82  0.77  0.77  0.74
 0.7   0.67  0.85  0.72  0.88  0.74  0.83  0.82  1     0.78  0.74  0.8
 0.8   0.87  0.77  0.77  0.87  0.62  0.91  0.77  0.78  1     0.84  0.77
 0.66  0.83  0.57  0.76  0.77  0.79  0.7   0.77  0.74  0.84  1     0.86
 0.67  0.82  0.56  0.86  0.75  0.73  0.82  0.74  0.8   0.77  0.86  1

##### Second: Computations.

LAD is the 12x12 Cholesky loadings matrix computed from R. The top-left 9x9 block contains the loadings of the empirical data on the orthogonal components (whose scores will be computed later); the remaining entries are the required loadings for the simulated variables x10..x12.

LAD :

 x1   1     .     .     .     .     .     .     .     .     .     .     .
 x2   0.77  0.63  .     .     .     .     .     .     .     .     .     .
 x3   0.76 -0.03  0.65  .     .     .     .     .     .     .     .     .
 x4   0.73  0.48 -0.00  0.49  .     .     .     .     .     .     .     .
 x5   0.75  0.15  0.37  0.31  0.42  .     .     .     .     .     .     .
 x6   0.59  0.42  0.30  0.05  0.00  0.62  .     .     .     .     .     .
 x7   0.86  0.22  0.26  0.21  0.09 -0.18  0.25  .     .     .     .     .
 x8   0.83  0.10  0.35  0.13  0.00  0.35  0.10  0.17  .     .     .     .
 x9   0.7   0.20  0.50  0.23  0.15  0.12  0.08 -0.19  0.30  .     .     .
 x10  0.8   0.40  0.25 -0.00  0.27 -0.16  0.06  0.13  0.04  0.10  .     .
 x11  0.66  0.51  0.12  0.07  0.31  0.23 -0.03  0.22  0.23 -0.00  0.19  .
 x12  0.67  0.48  0.09  0.29  0.11  0.14  0.27 -0.04  0.20 -0.09  0.26  0.07
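This step can be sketched in numpy (a hedged illustration, not MatMate code: since the target R must be positive definite, the sketch builds R and stand-in data X from random 12-variate data rather than from the listing's values; all variable names here are assumptions):

```python
import numpy as np

# Synthetic stand-in for the listing's data: 12 standardized rows whose joint
# correlation matrix serves as the 12x12 target R, with the first 9 rows in
# the role of the empirical variables x1..x9.
rng = np.random.default_rng(7)
n = 100
Z = rng.standard_normal((12, n))
Z -= Z.mean(axis=1, keepdims=True)
Z /= np.sqrt((Z ** 2).mean(axis=1, keepdims=True))  # rows: mean 0, variance 1 (1/n convention)
R = Z @ Z.T / n          # 12x12 target correlation matrix, positive definite by construction
X = Z[:9]                # "empirical" data; X @ X.T / n equals R[:9, :9] exactly

LAD = np.linalg.cholesky(R)                  # lower-triangular loadings, R = LAD @ LAD.T
print(np.allclose(LAD @ LAD.T, R))           # True: the loadings reproduce R
print(np.allclose(np.triu(LAD, 1), 0.0))     # True: LAD is lower-triangular
```

Because the Cholesky factorization is nested, the top-left 9x9 block of LAD is itself the Cholesky factor of R[:9, :9], which is what makes the later score computation work.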

The factor/component scores of the 9 orthogonal components, which define the 9 empirical data rows x1..x9, are now computed by inverting the top-left 9x9 loadings matrix.

The first 6 columns of FAC, holding the factor scores for f1..f9:

FAC :

 fac1   0.417  -0.088  -1.065   0.574   2.137   0.093  ...
 fac2   0.165  -1.721  -0.923   0.653  -0.575  -0.92   ...
 fac3   0.789   0.864  -0.155   0.867  -0.063   0.827  ...
 fac4   0.406   1.358  -0.577  -1.066  -0.122   1.445  ...
 fac5   0.115  -0.652  -1.121  -0.262  -0.546  -1.188  ...
 fac6  -0.226   1.342  -1.45    1.79   -0.193  -0.862  ...
 fac7  -2.745   1.184   0.473   1.39   -0.004  -0.11   ...
 fac8   1.109   1.558   1.33    0.753   0.495   0.017  ...
 fac9   0.818  -1.543  -0.325  -0.374   0.417   0.974  ...
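In the synthetic numpy sketch (stand-in data, names assumed), the score computation amounts to applying the inverse of the 9x9 loadings block to X; a triangular solve avoids forming the explicit inverse:

```python
import numpy as np

# Synthetic stand-in data, standardized rows (1/n variance convention):
rng = np.random.default_rng(7)
n = 100
Z = rng.standard_normal((12, n))
Z -= Z.mean(axis=1, keepdims=True)
Z /= np.sqrt((Z ** 2).mean(axis=1, keepdims=True))
R = Z @ Z.T / n
X = Z[:9]

LAD = np.linalg.cholesky(R)
FAC = np.linalg.solve(LAD[:9, :9], X)   # scores = inverse(9x9 loadings block) @ X

# Since LAD[:9,:9] is the Cholesky factor of corr(X), the score rows come out
# uncorrelated with unit variance, and loadings * scores reproduce X:
print(np.allclose(FAC @ FAC.T / n, np.eye(9)))   # True
print(np.allclose(LAD[:9, :9] @ FAC, X))         # True
```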

To be able to define 3 new components uncorrelated with the existing ones, we now rotate the matrix of factor scores into diagonal position.

[8] // get the transformation-matrix, which would rotate

[9] // fac into triangular/diagonal position

[10] t = gettrans(fac,"drei")

[11] // actually perform the rotation; fac1 then has a diagonal in its top-left

[12] fac1 = fac*t

Since the 9 orthogonal components are uncorrelated, we get a diagonal matrix in the top-left 9x9 segment:

FAC1 :

 fac1 1   10   0   0   0   0   0   0   0   0   0   0   0  ...
 fac1 2    0  10   0   0   0   0   0   0   0   0   0   0  ...
 fac1 3    0   0  10   0   0   0   0   0   0   0   0   0  ...
 fac1 4    0   0   0  10   0   0   0   0   0   0   0   0  ...
 fac1 5    0   0   0   0  10   0   0   0   0   0   0   0  ...
 fac1 6    0   0   0   0   0  10   0   0   0   0   0   0  ...
 fac1 7    0   0   0   0   0   0  10   0   0   0   0   0  ...
 fac1 8    0   0   0   0   0   0   0  10   0   0   0   0  ...
 fac1 9    0   0   0   0   0   0   0   0  10   0   0   0  ...
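One way to obtain such a transformation outside MatMate is a complete QR decomposition of FAC' (an assumption about what gettrans(fac,"drei") computes internally): its orthogonal factor rotates FAC into triangular position, which here degenerates to a diagonal because the score rows are orthogonal with squared length n. Continuing the synthetic sketch:

```python
import numpy as np

# Synthetic stand-in data, as before (names are assumptions):
rng = np.random.default_rng(7)
n = 100
Z = rng.standard_normal((12, n))
Z -= Z.mean(axis=1, keepdims=True)
Z /= np.sqrt((Z ** 2).mean(axis=1, keepdims=True))
R = Z @ Z.T / n
X = Z[:9]
LAD = np.linalg.cholesky(R)
FAC = np.linalg.solve(LAD[:9, :9], X)

# Orthogonal n x n rotation T with FAC @ T triangular (role of t = gettrans(fac,"drei")):
T, _ = np.linalg.qr(FAC.T, mode="complete")   # FAC.T = T @ (upper triangular)
fac1 = FAC @ T                                # role of fac1 = fac * t

# The 9x9 block is diagonal with entries +-sqrt(n) = +-10 (numpy's QR does not
# fix the signs), and all remaining columns are zero:
print(np.allclose(np.abs(fac1[:, :9]), 10 * np.eye(9)))   # True
print(np.allclose(fac1[:, 9:], 0.0))                      # True
```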

Generating three more uncorrelated components then simply means adding three more rows and inserting the value sqrt(n) = 10 at the appropriate diagonal positions:

[16] fac1 = {fac1,null(3,n)}              // first simply 3 null-rows

[17] fac1[10..12,10..12]=fac1[1..3,1..3]  // insert diagonal values (simply copied from the top-left)

FAC1 :

 fac1 1    10   0   0   0   0   0   0   0   0   0   0   0  ...
 fac1 2     0  10   0   0   0   0   0   0   0   0   0   0  ...
 fac1 3     0   0  10   0   0   0   0   0   0   0   0   0  ...
 fac1 4     0   0   0  10   0   0   0   0   0   0   0   0  ...
 fac1 5     0   0   0   0  10   0   0   0   0   0   0   0  ...
 fac1 6     0   0   0   0   0  10   0   0   0   0   0   0  ...
 fac1 7     0   0   0   0   0   0  10   0   0   0   0   0  ...
 fac1 8     0   0   0   0   0   0   0  10   0   0   0   0  ...
 fac1 9     0   0   0   0   0   0   0   0  10   0   0   0  ...
 fac1 10    0   0   0   0   0   0   0   0   0  10   0   0  ...
 fac1 11    0   0   0   0   0   0   0   0   0   0  10   0  ...
 fac1 12    0   0   0   0   0   0   0   0   0   0   0  10  ...

Since we saved the rotation matrix T that rotated FAC into this diagonal position, we use its transpose to rotate FAC1, which now includes the three new uncorrelated components, back:

[24] // rotate fac+newdata back to original position (recreating fac)

[25] newfac = fac1 * t'

[26] // newfac 1..9 are identical to the original values

[27] // the newdata newfac10..12 are still uncorrelated to newfac1..9

NEWFAC :

 newfac1    0.417  -0.088  -1.065   0.574   2.137   0.093  ...
 newfac2    0.165  -1.721  -0.923   0.653  -0.575  -0.92   ...
 newfac3    0.789   0.864  -0.155   0.867  -0.063   0.827  ...
 newfac4    0.406   1.358  -0.577  -1.066  -0.122   1.445  ...
 newfac5    0.115  -0.652  -1.121  -0.262  -0.546  -1.188  ...
 newfac6   -0.226   1.342  -1.45    1.79   -0.193  -0.862  ...
 newfac7   -2.745   1.184   0.473   1.39   -0.004  -0.11   ...
 newfac8    1.109   1.558   1.33    0.753   0.495   0.017  ...
 newfac9    0.818  -1.543  -0.325  -0.374   0.417   0.974  ...
 newfac10   0.141  -0.097  -0.041  -0.106  -0.01    0.05   ...
 newfac11   0.421  -0.022   0.408   0.084   0.118   0.087  ...
 newfac12   0.284  -0.309   0.093  -0.367   0.117   0.177  ...

Note that the data of newfac1..9 are identical to the original fac1..9.
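In the synthetic numpy sketch, steps [16], [17] and [25] append three zero rows, put sqrt(n) on the new diagonal positions, and rotate back with the transpose of T. (The sketch writes sqrt(n) directly instead of copying from the top-left as the listing does; a possible sign flip from the QR convention does not affect the correlations.)

```python
import numpy as np

# Synthetic stand-in data and the steps so far (names are assumptions):
rng = np.random.default_rng(7)
n = 100
Z = rng.standard_normal((12, n))
Z -= Z.mean(axis=1, keepdims=True)
Z /= np.sqrt((Z ** 2).mean(axis=1, keepdims=True))
R = Z @ Z.T / n
X = Z[:9]
LAD = np.linalg.cholesky(R)
FAC = np.linalg.solve(LAD[:9, :9], X)
T, _ = np.linalg.qr(FAC.T, mode="complete")
fac1 = FAC @ T

fac1 = np.vstack([fac1, np.zeros((3, n))])   # [16]: three null rows
for i in range(9, 12):
    fac1[i, i] = np.sqrt(n)                  # [17]: diagonal values for the new components
newfac = fac1 @ T.T                          # [25]: rotate back to the original position

print(np.allclose(newfac[:9], FAC))                    # True: rows 1..9 unchanged
print(np.allclose(newfac @ newfac.T / n, np.eye(12)))  # True: 12 uncorrelated unit-variance rows
```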

Now we have 12 uncorrelated components and the 12x12 Cholesky loadings matrix, which defines how the variables are composed according to their loadings on the components. We (re)compute the dataset:

[29] // create correlated data, based on the cholesky matrix and new factors

[30] y = lad * newfac

Y :

 y1   0.417  -0.088  -1.065   0.574   2.137   0.093  ...
 y2   0.427  -1.157  -1.409   0.857   1.29   -0.51   ...
 y3   0.826   0.54   -0.886   0.982   1.596   0.633  ...
 y4   0.579  -0.228  -1.5     0.205   1.219   0.331  ...
 y5   0.808   0.151  -1.647   0.409   1.23    0.197  ...
 y6   0.432   0.381  -1.995   1.92    0.877  -0.547  ...
 y7   0.054   0.052  -1.01    0.646   1.656   0.408  ...
 y8   0.546   1.089  -1.335   1.599   1.694   0.154  ...
 y9   0.611  -0.269  -1.813   0.752   1.238   0.607  ...
 y10  0.703  -0.718  -1.134   0.738   1.443  -0.218  ...
 y11  1.067  -0.683  -1.646   1.123   1.117  -0.523  ...
 y12  0.023  -0.352  -1.531   0.971   1.141   0.064  ...

##### Last step:

Check the correlations of the new dataset against the required correlations in R; the differences should all be zero.

[32] chk = y *'/n  - r

corr(Y) - R :

 chkr1   0  0  0  0  0  0  ...
 chkr2   0  0  0  0  0  0  ...
 chkr3   0  0  0  0  0  0  ...
 chkr4   0  0  0  0  0  0  ...
 chkr5   0  0  0  0  0  0  ...
 chkr6   0  0  0  0  0  0  ...
 chkr7   0  0  0  0  0  0  ...
 chkr8   0  0  0  0  0  0  ...
 chkr9   0  0  0  0  0  0  ...
 chkr10  0  0  0  0  0  0  ...
 chkr11  0  0  0  0  0  0  ...
 chkr12  0  0  0  0  0  0  ...
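For reference, the whole procedure condensed into one runnable numpy sketch (synthetic stand-in data; a complete QR decomposition plays the role of gettrans(fac,"drei"), which is an assumption about that routine):

```python
import numpy as np

# Stand-in for the listing's data: 12 standardized random rows; their joint
# correlation matrix is the 12x12 target R, the first 9 rows are "empirical".
rng = np.random.default_rng(7)
n = 100
Z = rng.standard_normal((12, n))
Z -= Z.mean(axis=1, keepdims=True)
Z /= np.sqrt((Z ** 2).mean(axis=1, keepdims=True))
R = Z @ Z.T / n                                  # 12x12 target correlations
X = Z[:9]                                        # empirical data

LAD = np.linalg.cholesky(R)                      # loadings, R = LAD @ LAD.T
FAC = np.linalg.solve(LAD[:9, :9], X)            # scores of the 9 components
T, _ = np.linalg.qr(FAC.T, mode="complete")      # rotation into diagonal position
fac1 = np.vstack([FAC @ T, np.zeros((3, n))])    # diagonal scores plus 3 null rows
for i in range(9, 12):
    fac1[i, i] = np.sqrt(n)                      # three new uncorrelated components
Y = LAD @ (fac1 @ T.T)                           # recompose all 12 variables

chk = Y @ Y.T / n - R
print(np.max(np.abs(chk)) < 1e-8)        # True: correlations match R to rounding error
print(np.allclose(Y[:9], X))             # True: the empirical rows are reproduced
```

The final two checks mirror the listing's chk matrix: the simulated rows y10..y12 carry exactly the required correlations while the original nine variables come back unchanged.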

(MatMate (beta) is available free of charge for academic educational institutions;
see http://go.helms-net.de/sw/matmate)

Gottfried Helms, 16.5.2007