In statistics, there are a number of theoretical results that are usually presented at a fairly elementary level that cannot be proved at that level without resort to somewhat cumbersome arguments. However, they can be quickly and conveniently proved by using spectral decompositions of real symmetric matrices. That such a decomposition always exists is the content of the spectral theorem of linear algebra. Since the most elementary accounts of statistics do not presuppose any familiarity with linear algebra, the results are often stated without proof in elementary accounts.
Certain chi-square distributions
The chi-square distribution is the probability distribution of the sum of squares of several independent random variables, each of which is normally distributed with expected value 0 and variance 1. Thus, suppose

$$Z_1, \ldots, Z_n$$

are such independent normally distributed random variables with expected value 0 and variance 1. Then

$$Z_1^2 + \cdots + Z_n^2$$

has a chi-square distribution with n degrees of freedom. A corollary is that if

$$X_1, \ldots, X_n$$

are independent normally distributed random variables with expected value μ and variance σ², then

$$\sum_{i=1}^n \left( \frac{X_i - \mu}{\sigma} \right)^2 \qquad (1)$$

also has a chi-square distribution with n degrees of freedom. Now consider the "sample mean"

$$\overline{X} = \frac{X_1 + \cdots + X_n}{n}.$$

If one puts the sample mean in place of the "population mean" μ in (1) above, one gets

$$\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{\sigma} \right)^2 \qquad (2)$$
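As an empirical sanity check, here is a minimal Python sketch (the sample size, mean, variance, and seed are all illustrative choices, not part of the argument) that simulates statistic (2) many times and compares the results with a chi-square distribution with n − 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mu, sigma, reps = 5, 10.0, 2.0, 100_000

# Draw `reps` independent samples of size n from N(mu, sigma^2).
X = rng.normal(mu, sigma, size=(reps, n))

# Statistic (2): squared deviations from the sample mean, divided by sigma^2.
stat2 = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / sigma**2

# A chi-square distribution with k degrees of freedom has mean k.
print(stat2.mean())  # close to n - 1 = 4
print(stats.kstest(stat2, "chi2", args=(n - 1,)).pvalue)  # typically not small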
One finds it asserted in many elementary texts[citation needed] that the random variable (2) has a chi-square distribution with n − 1 degrees of freedom. Why that should be so may be something of a mystery when one considers that
- The random variables

$$\frac{X_i - \overline{X}}{\sigma}, \qquad i = 1, \ldots, n, \qquad (3)$$

although normally distributed, cannot be independent (since their sum must be zero);
- Those random variables do not have variance 1, but rather variance

$$\frac{n-1}{n}$$

(as will be explained below);
- There are not n − 1 of them, but rather n of them.
The fact that the variance of (3) is (n − 1)/n can be seen by writing it as

$$\frac{X_i - \overline{X}}{\sigma} = \frac{n-1}{n} \cdot \frac{X_i}{\sigma} - \frac{1}{n} \sum_{j \neq i} \frac{X_j}{\sigma}$$

and then using elementary properties of the variance.
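Carrying that computation out explicitly, using the independence of the Xj and the fact that each Xj/σ has variance 1:

$$\operatorname{Var}\left(\frac{X_i - \overline{X}}{\sigma}\right) = \left(\frac{n-1}{n}\right)^2 \cdot 1 + (n-1) \cdot \frac{1}{n^2} \cdot 1 = \frac{(n-1)^2 + (n-1)}{n^2} = \frac{n-1}{n}.$$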
To resolve the mystery, one begins by thinking about the operation of subtracting the sample mean from each observation:

$$\begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} \mapsto \begin{bmatrix} X_1 - \overline{X} \\ \vdots \\ X_n - \overline{X} \end{bmatrix}.$$

This is a linear transformation. In fact it is a projection, i.e. an idempotent linear transformation: if one subtracts the mean of the scalar components of this vector, getting a new vector, and then applies the same operation to the new vector, one gets that same new vector back. It projects the n-dimensional space onto the (n − 1)-dimensional subspace whose equation is $x_1 + \cdots + x_n = 0$.
The matrix of this transformation can be seen to be symmetric by observing that it is

$$P = I_n - \frac{1}{n} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix} = \begin{bmatrix} 1 - \frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \cdots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \cdots & 1 - \frac{1}{n} \end{bmatrix}.$$

Alternatively, one can see that the matrix is symmetric by observing that the vector $(1, 1, 1, \ldots, 1)^T$ that gets mapped to 0 is orthogonal to every vector in the image space $x_1 + \cdots + x_n = 0$; thus the mapping is an orthogonal projection. The matrices of orthogonal projections are precisely the symmetric idempotent matrices; hence this matrix is symmetric.
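These properties are easy to verify numerically. The following Python sketch (with an arbitrary illustrative choice of n) checks symmetry, idempotence, the rank, and the fact that the all-ones vector is mapped to 0:

```python
import numpy as np

n = 5
# Matrix of the "subtract the sample mean" transformation.
P = np.eye(n) - np.ones((n, n)) / n

assert np.allclose(P, P.T)                 # symmetric
assert np.allclose(P @ P, P)               # idempotent, i.e. a projection
assert np.linalg.matrix_rank(P) == n - 1   # rank n - 1
assert np.allclose(P @ np.ones(n), 0)      # (1, ..., 1)^T is mapped to 0
```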
Therefore
- P is an n × n orthogonal projection matrix of rank n − 1.
Now we apply the spectral theorem to conclude that there is an orthogonal matrix G that rotates the space so that

$$G P G^T = \begin{bmatrix} 1 & & & \\ & \ddots & & \\ & & 1 & \\ & & & 0 \end{bmatrix},$$

with n − 1 entries equal to 1 on the diagonal, a single 0 in the last diagonal position, and 0 in every off-diagonal position.
Now let

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}$$

and

$$U = P X = \begin{bmatrix} X_1 - \overline{X} \\ \vdots \\ X_n - \overline{X} \end{bmatrix}.$$

The probability distribution of X is a multivariate normal distribution with expected value

$$\begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix}$$

and variance

$$\sigma^2 I_n.$$

Consequently the probability distribution of U = PX is multivariate normal with expected value

$$P \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$$

and variance

$$P (\sigma^2 I_n) P^T = \sigma^2 P$$

(we have used the fact that P is symmetric and idempotent).
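The following Python sketch illustrates this conclusion numerically (the parameter values and the seed are illustrative): it builds an orthogonal G from the eigenvectors of P and checks that the sample covariance of GU is approximately σ² in n − 1 diagonal entries and 0 in the remaining one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma, reps = 5, 10.0, 2.0, 200_000

P = np.eye(n) - np.ones((n, n)) / n

# Rows of G: orthonormal eigenvectors of P (eigenvalues ascending: 0, 1, ..., 1).
eigvals, eigvecs = np.linalg.eigh(P)
G = eigvecs.T
assert np.allclose(G @ P @ G.T, np.diag(eigvals))

# Sample U = PX and rotate: the covariance of GU should be sigma^2 * diag(eigvals).
X = rng.normal(mu, sigma, size=(reps, n))
U = X @ P.T
print(np.round(np.cov(U @ G.T, rowvar=False), 2))
```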
Confidence intervals based on Student's t-distribution
One such elementary result is as follows. Suppose

$$X_1, \ldots, X_n$$
are the observations in a random sample from a normally distributed population with population mean μ and population standard deviation σ. It is desired to find a confidence interval for μ.
Let

$$\overline{X} = \frac{X_1 + \cdots + X_n}{n}$$

be the sample mean and let

$$S^2 = \frac{1}{n-1} \sum_{i=1}^n \left( X_i - \overline{X} \right)^2$$

be the sample variance. It is often asserted in elementary accounts that the random variable

$$\frac{\overline{X} - \mu}{S/\sqrt{n}}$$

has a Student's t-distribution with n − 1 degrees of freedom. Consequently the interval whose endpoints are

$$\overline{X} \pm A \frac{S}{\sqrt{n}},$$

where A is a suitable percentage point of Student's t-distribution with n − 1 degrees of freedom, is a confidence interval for μ.
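In practice the interval is computed as in the following Python sketch (the data and the 95% confidence level are illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([4.1, 5.3, 3.8, 4.9, 5.1, 4.4])  # illustrative observations
n = len(x)
xbar = x.mean()
s = x.std(ddof=1)                  # sample standard deviation S

# A = percentage point of Student's t with n - 1 degrees of freedom.
A = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% confidence level
print((xbar - A * s / np.sqrt(n), xbar + A * s / np.sqrt(n)))
```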
That is the practical result desired, but the proof, which uses the spectral theorem, is not given in accounts that do not assume the reader is familiar with linear algebra at that level.
Student's distribution and the chi-square distribution
Student's t-distribution (so called because its discoverer, William Sealy Gosset, wrote under the pseudonym "Student") with k degrees of freedom can be characterized as the probability distribution of the random variable

$$\frac{Z}{\sqrt{V/k}},$$
where
- Z has a normal distribution with expected value 0 and standard deviation 1;
- V has a chi-square distribution with k degrees of freedom; and
- Z and V are independent.
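A quick Python simulation of this characterization (the degrees of freedom, sample count, and seed are illustrative) confirms that Z/√(V/k) behaves like a t-distributed variable with k degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
k, reps = 7, 200_000

Z = rng.standard_normal(reps)     # N(0, 1)
V = rng.chisquare(k, size=reps)   # chi-square with k degrees of freedom
T = Z / np.sqrt(V / k)            # Z and V are drawn independently

print(stats.kstest(T, "t", args=(k,)).pvalue)  # typically not small: consistent with t(k)
```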
The chi-square distribution with k degrees of freedom is the distribution of the sum

$$Z_1^2 + \cdots + Z_k^2,$$

where Z1, ..., Zk are independent random variables, each normally distributed with expected value 0 and standard deviation 1.
The problem
Why should the random variable

$$\frac{\overline{X} - \mu}{S/\sqrt{n}} \qquad (1)$$

have the same distribution as

$$\frac{Z}{\sqrt{V/k}},$$
where k = n − 1?
We must overcome several apparent objections to the conclusion we hope to prove:
- Although the numerator in (1) is normally distributed with expected value 0, it does not have standard deviation 1.
- The random variable

$$\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{\sigma} \right)^2$$

appearing in the denominator of (1) (via the sample variance S²) is the sum of squares of random variables, each of which is normally distributed with expected value 0, but
- there are not n − 1 of them, but n; and
- they are not independent (notice in particular that

$$\sum_{i=1}^n \left( X_i - \overline{X} \right) = 0$$

regardless of the values of X1, ..., Xn, and that clearly precludes independence); and
- the standard deviation of each of them is not 1. If one divides each of them by σ, the standard deviation of the quotient is also not 1, but in fact less than 1. To see that, consider that the standard score

$$\frac{X_i - \mu}{\sigma}$$

has standard deviation 1, and substituting the sample mean $\overline{X}$ for μ makes the standard deviation smaller.
- It may be unclear why the numerator and denominator in (1) must be independent. After all, both are functions of the same list of n observations X1, ..., Xn.
The very last of these objections may be answered without resorting to the spectral theorem. But all of them, including the last, can be answered by means of the spectral theorem. The solution will amount to rewriting the vector

$$\begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}$$

in a different coordinate system.
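Before doing so, the claims can at least be checked by simulation. This Python sketch (parameter values and seed are illustrative) compares the distribution of (1) with Student's t-distribution with n − 1 degrees of freedom, and estimates the correlation between the sample mean and the sample variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, mu, sigma, reps = 8, 3.0, 1.5, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s = X.std(axis=1, ddof=1)
T = (xbar - mu) / (s / np.sqrt(n))                 # statistic (1)

print(stats.kstest(T, "t", args=(n - 1,)).pvalue)  # consistent with t(n - 1)
print(np.corrcoef(xbar, s**2)[0, 1])               # near 0, as independence suggests
```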
Spectral decompositions
The spectral theorem tells us that any real symmetric matrix can be diagonalized by an orthogonal matrix.
We will apply that to the n × n projection matrices P = Pn and Q = Qn defined by saying that every entry in P is 1/n and Q = I − P, i.e. the n × n identity matrix minus P. Notice that

$$P X = \begin{bmatrix} \overline{X} \\ \vdots \\ \overline{X} \end{bmatrix} \qquad \text{and} \qquad Q X = \begin{bmatrix} X_1 - \overline{X} \\ \vdots \\ X_n - \overline{X} \end{bmatrix}.$$

Also notice that P and Q are complementary orthogonal projection matrices, i.e.

$$P = P^T = P^2, \qquad Q = Q^T = Q^2, \qquad PQ = QP = 0, \qquad P + Q = I.$$
For any vector X, the vector PX is the orthogonal projection of X onto the space spanned by the column vector J in which every entry is 1, and QX is the projection onto the (n − 1)-dimensional orthogonal complement of that space.
Let G be an orthogonal matrix such that

$$G P G^T = \begin{bmatrix} 0 & & & \\ & \ddots & & \\ & & 0 & \\ & & & 1 \end{bmatrix} \qquad \text{and} \qquad G Q G^T = \begin{bmatrix} 1 & & & \\ & \ddots & & \\ & & 1 & \\ & & & 0 \end{bmatrix}.$$
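As a numerical illustration (not part of the proof), one such G can be obtained from the orthonormal eigenvectors of P; since PQ = 0, the same G diagonalizes both matrices:

```python
import numpy as np

n = 4
P = np.ones((n, n)) / n   # every entry is 1/n
Q = np.eye(n) - P

# Rows of G: orthonormal eigenvectors of P (eigenvalues ascending: 0, ..., 0, 1).
eigvals, eigvecs = np.linalg.eigh(P)
G = eigvecs.T

print(np.round(G @ P @ G.T, 10))  # diag(0, ..., 0, 1)
print(np.round(G @ Q @ G.T, 10))  # diag(1, ..., 1, 0)
```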