
In https://en.wikipedia.org/wiki/F-test#Regression_problems, an application of the F-statistic to comparing linear models is given:

Consider two models, 1 and 2, where model 1 is 'nested' within model 2. Model 1 is the restricted model, and model 2 is the unrestricted one. That is, model 1 has p1 parameters, and model 2 has p2 parameters, where p1 < p2, and for any choice of parameters in model 1, the same regression curve can be achieved by some choice of the parameters of model 2. [...] If there are n data points to estimate parameters of both models from, then one can calculate the F statistic, given by

$$\frac{RSS_1-RSS_2}{p_2-p_1} / \frac{RSS_2}{n - p_2}$$
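To make sure I'm reading the formula correctly, here is a small made-up example (NumPy; the nested polynomial models and all names are just mine for illustration) of how I understand the statistic is computed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 1, n)
y = 1 + 2 * x + rng.normal(scale=0.3, size=n)

# Model 1 (restricted): intercept + slope   -> p1 = 2 parameters
X1 = np.column_stack([np.ones(n), x])
# Model 2 (full): adds a quadratic term     -> p2 = 3 parameters
X2 = np.column_stack([np.ones(n), x, x**2])

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

p1, p2 = X1.shape[1], X2.shape[1]
RSS1, RSS2 = rss(X1, y), rss(X2, y)
F = ((RSS1 - RSS2) / (p2 - p1)) / (RSS2 / (n - p2))
print(F)   # to be compared against the F(p2 - p1, n - p2) distribution
```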

I understand why $RSS_2$ is chi-square distributed with degrees of freedom $n - p_2$: this is because in linear regression we project the data onto a $p_2$ dimensional space, leaving the residual error to reside in an $n-p_2$ dimensional space.

My question is why is $RSS_1 - RSS_2$ Chi-square distributed with $p_2-p_1$ degrees of freedom? It is true that $RSS_1$ and $RSS_2$ are chi-square distributed, but in general, I think the difference of two chi-squared distributions is not chi-square distributed. So this is a confusing claim to me.

I tried looking at the proof here: Proof that F-statistic follows F-distribution but it went above my head. I'm not looking for a completely rigorous proof, but I am looking for an explanation that I can grok that doesn't also wrongly (?) imply that the difference of any two chi-squared distributions is chi-squared, probably somehow using the fact that the models are nested.

aellab

2 Answers


Since there is some ambiguity as to how the 'difference' of the residuals of the restricted and unrestricted models leads to the $F$-statistic, it seems worthwhile to briefly sketch the development in a general setting, which provides better insight.


Theorem $[\rm I]:$ If $\mathbf x\sim \mathsf{MVN}(\boldsymbol\mu,\mathbf V),$

$$\mathbf x^\mathsf T\mathbf A\mathbf x\sim{\chi^2}^\prime\left[r(\mathbf A),\frac12\boldsymbol\mu^\mathsf T\mathbf A\boldsymbol \mu\right]\iff \mathbf{AV}~\text{idempotent}.^1$$
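As a quick sanity check of the theorem, here is a simulation sketch (made-up $\boldsymbol\mu$ and $\mathbf V$, and the simplest idempotent choice $\mathbf A = \mathbf V^{-1}$, so that $\mathbf{AV}=\mathbf I$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# x ~ MVN(mu, V) and A chosen so that AV is idempotent (here A = V^{-1}, AV = I),
# so x'Ax should be noncentral chi-square with r(A) = 3 degrees of freedom.
k = 3
L = rng.normal(size=(k, k))
V = L @ L.T + k * np.eye(k)            # a positive-definite covariance
A = np.linalg.inv(V)
mu = rng.normal(size=k)

x = rng.multivariate_normal(mu, V, size=100_000)
q = np.einsum('ij,jk,ik->i', x, A, x)  # x'Ax for every draw

# The theorem states the noncentrality as (1/2) mu'A mu (Searle's convention);
# SciPy's ncx2 parameterises it as nc = mu'A mu, i.e. twice that value.
nc = mu @ A @ mu
print(q.mean(), stats.ncx2(df=k, nc=nc).mean())   # should agree closely
```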


Consider the imposition of the constraint $$\mathbf K^\mathsf T\mathbf b = \mathbf m\tag 1$$ on the model $\mathbf y = \mathbf{Xb} +\mathbf e,$ where $\bf b$ is a vector of parameters of order $k,$ $\mathbf K^\mathsf T$ is a matrix of order $s\times k$ with $r\left(\mathbf K^\mathsf T\right) = s,$ and $\mathbf m$ is a constant vector.

What's the effect of $(1) $ on the model? How should the estimator and the associated sum of squares look?

In order to find the estimator of $\mathbf b,~\left(\tilde{\mathbf b}\right)$ subject to the constraint, one can employ a Lagrange multiplier $(2\boldsymbol\lambda)$ in the minimisation of $$\left(\mathbf y -\mathbf X\tilde{\mathbf b}\right)^\mathsf T\left(\mathbf y -\mathbf X\tilde{\mathbf b}\right)+ 2\boldsymbol\lambda^\mathsf T\left(\mathbf K^\mathsf T\tilde{\mathbf b }-\mathbf m\right)\tag 2$$ with respect to $\tilde{\mathbf b}$ and $\boldsymbol\lambda.$ Solving the equations

\begin{align} \mathbf X^\mathsf T\mathbf X\tilde{\mathbf b} + \mathbf K\boldsymbol\lambda &= \mathbf X^\mathsf T\mathbf y\\ \mathbf K^\mathsf T\tilde{\mathbf b } &=\mathbf m, \end{align}

yields

$$\tilde{\mathbf b }=\hat{\mathbf b }-(\mathbf X^\mathsf T\mathbf X)^{-1}\mathbf K\left[\mathbf K^\mathsf T(\mathbf X^\mathsf T\mathbf X)^{-1}\mathbf K\right]^{-1}\left(\mathbf K^\mathsf T\hat{\mathbf b }-\mathbf m\right).\tag 3$$

What would be the residual sum of squares? Computing $\left(\mathbf y -\mathbf X\tilde{\mathbf b}\right)^\mathsf T\left(\mathbf y -\mathbf X\tilde{\mathbf b}\right)$ using the fact that $\mathbf X^\mathsf T(\mathbf y -\mathbf X\hat{\mathbf b})= 0 $ would lead to

\begin{align}&= \left[\mathbf y -\mathbf X\hat{\mathbf b}+ \mathbf X\left(\hat{\mathbf b}-\tilde{\mathbf b}\right)\right]^\mathsf T\left[\mathbf y -\mathbf X\hat{\mathbf b}+ \mathbf X\left(\hat{\mathbf b}-\tilde{\mathbf b}\right)\right]\\&= \left( \mathbf y -\mathbf X\hat{\mathbf b}\right) ^\mathsf T \left( \mathbf y -\mathbf X\hat{\mathbf b}\right)+ \left(\hat{\mathbf b}-\tilde{\mathbf b}\right)^\mathsf T\mathbf X^\mathsf T\mathbf X\left(\hat{\mathbf b}-\tilde{\mathbf b}\right)\\ &\stackrel{(3)}{=} \textrm{SSE} +\underbrace{(\mathbf K^\mathsf T\hat{\mathbf b }-\mathbf m)^\mathsf T\left[\mathbf K^\mathsf T(\mathbf X^\mathsf T\mathbf X)^{-1}\mathbf K\right]^{-1}(\mathbf K^\mathsf T\hat{\mathbf b }-\mathbf m)}_{:= Q};\tag 4 \end{align}

so

$$\textrm{residual(reduced)} = \textrm{residual(full)} + Q \tag 5$$

(see note $[2]$ below).
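A quick numerical check of $(3)$ and $(5)$ (made-up design, data and constraint; all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

n, k, s = 40, 4, 2
X = rng.normal(size=(n, k))
b = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ b + rng.normal(scale=0.5, size=n)

K = rng.normal(size=(k, s))                  # K' has full row rank s (a.s.)
m = np.array([0.0, 1.0])

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y                    # unconstrained OLS estimator
middle = np.linalg.inv(K.T @ XtX_inv @ K)
b_tilde = b_hat - XtX_inv @ K @ middle @ (K.T @ b_hat - m)   # equation (3)

print(K.T @ b_tilde)                         # reproduces m = [0, 1]

rss_full    = np.sum((y - X @ b_hat) ** 2)
rss_reduced = np.sum((y - X @ b_tilde) ** 2)
d = K.T @ b_hat - m
Q = d @ middle @ d
print(rss_reduced, rss_full + Q)             # equal up to rounding, as in (5)
```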

What is the distribution of $ Q? $

Notice that $\hat{\mathbf b}~\sim\mathsf{MVN}\left(\mathbf b, (\mathbf X^\mathsf T\mathbf X) ^{-1}\sigma^2\right)$ as $\mathbf y \sim \mathsf{MVN}(\mathbf X\mathbf b,\sigma^2\mathbf I).$ Therefore $$\mathbf K^\mathsf T\hat{\mathbf b }-\mathbf m\sim\mathsf{MVN}\left(\mathbf K^\mathsf T\mathbf b-\mathbf m,\mathbf K^\mathsf T (\mathbf X^\mathsf T\mathbf X) ^{-1} \mathbf K \sigma^2\right).\tag 6$$ $Q$ is a quadratic form in $\mathbf K^\mathsf T\hat{\mathbf b}-\mathbf m,$ with $\left[\mathbf K^\mathsf T (\mathbf X^\mathsf T\mathbf X) ^{-1} \mathbf K\right]^{-1}$ as the matrix of the quadratic. Now comes the crucial part: apply Theorem $\rm[I]$ here (what are $\bf A, V$ here?) using $(6)$ to conclude that $$\frac Q{\sigma^2}\sim{\chi^2}^\prime\left[s,\frac{(\mathbf K^\mathsf T\mathbf b -\mathbf m)^\mathsf T\left[\mathbf K^\mathsf T(\mathbf X^\mathsf T\mathbf X)^{-1}\mathbf K\right]^{-1}(\mathbf K^\mathsf T\mathbf b -\mathbf m)}{2\sigma^2}\right].\tag 7$$ In particular, under the null hypothesis $\mathbf K^\mathsf T\mathbf b = \mathbf m$ the noncentrality parameter vanishes and $Q/\sigma^2\sim\chi^2_s,$ a central chi-square with $s$ degrees of freedom; in the question's setting the restricted model is the full model subject to $s = p_2 - p_1$ independent linear constraints, so $Q = RSS_1 - RSS_2$ has $p_2 - p_1$ degrees of freedom. This is the desired result in a general framework. For constructing the required $F$-statistic, it remains to be deduced that $Q$ and $\textrm{SSE}$ are independent.
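The following simulation sketch (made-up design; the constraint is chosen so that the null $\mathbf K^\mathsf T\mathbf b=\mathbf m$ holds) illustrates the resulting central $\chi^2_s$ behaviour of $Q/\sigma^2$ and the $F(s,\,n-k)$ statistic built from $Q$ and $\textrm{SSE}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n, k, s, sigma = 30, 4, 2, 0.7
X = rng.normal(size=(n, k))
K = rng.normal(size=(k, s))
b = rng.normal(size=k)
m = K.T @ b                          # pick m so that the constraint K'b = m holds

XtX_inv = np.linalg.inv(X.T @ X)
middle = np.linalg.inv(K.T @ XtX_inv @ K)
H = X @ XtX_inv @ X.T                # hat matrix of the unconstrained model

chi2_vals, F_vals = [], []
for _ in range(20_000):
    y = X @ b + rng.normal(scale=sigma, size=n)
    b_hat = XtX_inv @ X.T @ y
    d = K.T @ b_hat - m
    Q = d @ middle @ d
    SSE = y @ (np.eye(n) - H) @ y
    chi2_vals.append(Q / sigma**2)
    F_vals.append((Q / s) / (SSE / (n - k)))

print(np.mean(chi2_vals), stats.chi2(s).mean())     # both close to s = 2
print(np.mean(F_vals), stats.f(s, n - k).mean())    # both close to (n-k)/(n-k-2)
```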


Notes:

$[1]$ To prove this, compute the MGF of the quadratic form and use the fact that the eigenvalues of an idempotent matrix are $0$ or $1$ (a sketch of the central case is given after these notes).

$[2]$ It is tempting, perhaps, to interpret $Q,$ writing it in the form \begin{align}Q &= \mathbf y^\mathsf T\mathbf y -\text{SSE} -\left[ \mathbf y^\mathsf T\mathbf y - (\text{SSE}+ Q) \right]\\&=\text{SSR}- \left[ \mathbf y^\mathsf T\mathbf y - (\text{SSE}+ Q) \right]\\ &= \textrm{reduction(full)} - \left[ \mathbf y^\mathsf T\mathbf y - (\text{SSE}+ Q) \right],\tag{N1} \end{align} as the "reduction in sum of squares due to fitting of the reduced model" along the line of $(5);$ but this is not true, in general. In fact, the term in brackets in $\rm(N1)$ need not even be a sum of squares.
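Regarding note $[1]$, here is an outline for the central case ($\boldsymbol\mu=\mathbf 0$), following the standard argument (this sketch is mine, not taken from the reference): for $t$ in a neighbourhood of $0,$

$$\mathsf E\left[e^{t\,\mathbf x^\mathsf T\mathbf A\mathbf x}\right]=\left|\mathbf I-2t\,\mathbf{AV}\right|^{-1/2}=\prod_i\left(1-2t\lambda_i\right)^{-1/2},$$

where the $\lambda_i$ are the eigenvalues of $\mathbf{AV}.$ If $\mathbf{AV}$ is idempotent of rank $r,$ the $\lambda_i$ are $r$ ones and the rest zeros, so the MGF reduces to $(1-2t)^{-r/2},$ the MGF of $\chi^2_r;$ the noncentral case proceeds along the same lines.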


Reference:

S. R. Searle, *Linear Models*, John Wiley & Sons, 1971.

User1865345

Below we assume we have a model $y=Xb+e$ where the components of vector $e$ are distributed iid $N(0, \sigma^2)$ and that $H$ is the projection matrix onto the range of $X$. Use a subscript of $r$ to denote corresponding quantities of the restricted model and use $\hat{e} = (I-H)y$ to mean the estimated value of the unknown $e$.

We will make use of some facts about projections shown in the Appendix at the end.

Note that

$RSS = \hat{e}'\hat{e} = y'(I-H)y = e'(I-H)e$

where we have used the fact that

$(I-H)y = (I-H)(Xb+e) = (I-H)e$
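A quick numerical check of this identity on made-up data (a minimal sketch, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(4)

# Verify RSS = y'(I - H)y = e'(I - H)e for one simulated data set.
n, p = 25, 3
X = rng.normal(size=(n, p))
b = rng.normal(size=p)
e = rng.normal(size=n)
y = X @ b + e

H = X @ np.linalg.inv(X.T @ X) @ X.T        # projection onto the range of X
I = np.eye(n)

print(y @ (I - H) @ y, e @ (I - H) @ e)     # identical up to rounding
```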

Now applying that to both $RSS$ and $RSS_r$ (for $RSS_r$ this step additionally requires $(I-H_r)Xb = 0$, i.e. that $Xb$ also lies in the range of the restricted design; in other words, the restricted model is correct, which is the null hypothesis being tested) we have that

$RSS_r - RSS = e'(I-H_r)e - e'(I-H)e = e'(H-H_r)e$

$H$ and $H_r$ are projections and their difference is a projection too because the space that $H_r$ projects onto is nested within the space associated with $H$ by our assumptions. Thus $RSS_r - RSS$ is of the form $e'Pe$ where $P = H-H_r$ is an orthogonal projection.
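For instance (made-up nested designs, where the restricted design uses a subset of the columns of $X$), one can verify numerically that $H - H_r$ is symmetric and idempotent:

```python
import numpy as np

rng = np.random.default_rng(5)

n, p, p_r = 25, 4, 2
X = rng.normal(size=(n, p))
X_r = X[:, :p_r]                     # restricted design: a subset of X's columns

def proj(M):
    """Orthogonal projection onto the column space of M."""
    return M @ np.linalg.inv(M.T @ M) @ M.T

H, H_r = proj(X), proj(X_r)
P = H - H_r

print(np.allclose(P, P.T), np.allclose(P, P @ P))   # True True: P is a projection
print(np.trace(P))                                  # trace(P) = rank(P) = p - p_r = 2
```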

Now it is known that if the components of vector $x$ are iid $N(0, 1)$, which is the case for $e/\sigma$ by assumption, and $Q$ is any orthogonal projection, then $x'Qx$ is chi-squared with degrees of freedom equal to the dimension of the space onto which $Q$ projects (which is also equal to the rank of $Q$ and to $trace(Q)$). Thus $(RSS_r-RSS)/\sigma^2$ is chi-squared with $rank(H)-rank(H_r)$ degrees of freedom, i.e. $p_2-p_1$ in the question's notation.
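A simulation sketch of this conclusion (made-up nested designs; the data are generated from the restricted model so that the null hypothesis holds):

```python
import numpy as np

rng = np.random.default_rng(6)

# (RSS_r - RSS)/sigma^2 should behave like chi-square with p - p_r d.f.
n, p, p_r, sigma = 30, 5, 2, 1.5
X = rng.normal(size=(n, p))
X_r = X[:, :p_r]
b_r = rng.normal(size=p_r)           # the true model uses only the first p_r columns

def proj(M):
    return M @ np.linalg.inv(M.T @ M) @ M.T

H, H_r = proj(X), proj(X_r)
I = np.eye(n)

draws = []
for _ in range(20_000):
    y = X_r @ b_r + rng.normal(scale=sigma, size=n)
    rss   = y @ (I - H) @ y
    rss_r = y @ (I - H_r) @ y
    draws.append((rss_r - rss) / sigma**2)

print(np.mean(draws), np.var(draws))   # approx p - p_r = 3 and 2(p - p_r) = 6
```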

Appendix

An orthogonal projection matrix $Q$ is a matrix which satisfies $Q = Q' = Q^2$. That is, it is symmetric and idempotent. This implies that $||Qx||^2 = x'Q'Qx = x'Qx$, where $||\cdot||^2$ denotes squared length.

If $Q$ is an orthogonal projection then so is $I-Q$.

The range of a matrix is the set of values it maps to. An orthogonal projection is said to project onto its range.

If an orthogonal projection $Q$ projects onto the range of matrix $M$ then $QM=M$ and $(I-Q)M = 0$.

If $Q$ and $Q_0$ are orthogonal projections such that $Q_0$ projects onto a subspace of the set that $Q$ projects onto then $Q$ and $Q_0$ commute and $Q-Q_0$ is an orthogonal projection too. Also $rank(Q-Q_0) = rank(Q) - rank(Q_0)$.
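A brief numeric illustration of the last two facts with made-up matrices ($M_0$'s columns are a subset of $M$'s, so $Q_0$ projects onto a subspace of the range of $Q$):

```python
import numpy as np

rng = np.random.default_rng(7)

M = rng.normal(size=(20, 5))
M0 = M[:, :2]

def proj(A):
    """Orthogonal projection onto the column space of A."""
    return A @ np.linalg.inv(A.T @ A) @ A.T

Q, Q0 = proj(M), proj(M0)

print(np.allclose(Q @ M, M), np.allclose((np.eye(20) - Q) @ M, 0))   # QM = M, (I-Q)M = 0
print(np.allclose(Q @ Q0, Q0 @ Q))                                   # Q and Q0 commute
print(np.linalg.matrix_rank(Q - Q0),
      np.linalg.matrix_rank(Q) - np.linalg.matrix_rank(Q0))          # both equal 3
```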

  • When the errors are not iid $N(0,\sigma^2)$ (e.g., the case where different points have different error variances, which I think is more realistic), I speculate that the RSS result should naturally be extended by some $\chi^2$-type result? This case has a closer link to the F-distribution, which comes from the ratio of two $\chi^2$ distributions. I'm still frustrated that only the contrived iid/homogeneous case is addressed in Wikipedia. – luyuwuli Oct 22 '23 at 08:53