3

I was studying the properties of multiple linear regression and stumbled across the F statistic often used for these tests. We are testing the regression coefficients (say there are $p$ variables aside from the intercept term): $$H_0:\beta_1=\dots =\beta_p=0$$ versus its complement.

We all know that under the null hypothesis, $R^2/(1-R^2)$ follows a (scaled) F distribution. But when I searched for the distribution under the alternative, there were few resources available. I found a mention that for the $p=1$ case it is a scaled noncentral F variable, but the general case was absent.

Could anyone please give a few pointers on how to prove a similar result for general $p$, or provide some sources? Thanks!

  • One reason it is hard to find general results may be that, for a joint hypothesis like this, there is no such thing as "the" alternative distribution anymore, as there are many ways in which the hypothesis that all coefficients are zero can be false. – Christoph Hanck Mar 14 '24 at 21:08
  • I added a formal answer to the +1 verbal argument by @ScottThompson – Christoph Hanck Mar 15 '24 at 09:41
  • You might also want to consider changing the title of your question - for what is stated in the title, links like https://stats.stackexchange.com/questions/342949/is-r2-adjusted-both-unbiased-and-consistent-under-the-alternative-in-simple/343113#343113 might be more suitable – Christoph Hanck Mar 15 '24 at 10:06

3 Answers

2

The F statistic in question can be written as a (scaled) ratio of the sum of squares due to the regression (i.e. the sum of squared de-meaned predicted values) to the sum of squared residuals. Suppose that the errors are independent and identically distributed normal random variables. Suppose further that the errors have unit variance. (If they do not, then some of the statements below are only true after further rescaling.) Then the numerator has a chi-square distribution with $p$ degrees of freedom under the null, and a non-central chi-square distribution with $p$ degrees of freedom if one or more of the $\beta$'s are not zero.

The math in both cases is the same: In both cases the sum of squares due to the regression can be shown to be a sum of squared values of p independent, normally distributed random variables with unit variances. The only difference between the two cases is that under the null these variables have mean zero, and under the alternative at least one of them does not have a zero mean. The latter implies that the numerator has a non-central chi-square distribution, and so the F statistic has a non-central F distribution with the same non-centrality parameter.

If the regressors are ortho-normal (i.e. uncorrelated and have unit variances) then the means of the p random variables in the numerator are the same as the true values of the $\beta$'s. In this case the non-centrality parameter is just the sum of the squared $\beta$ values. If the regressors are correlated or do not have unit variances then the math gets more complicated and the non-centrality parameter will depend on the correlations and scaling of the regressors in addition to the values of the $\beta$'s.
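To illustrate the orthonormal case numerically, here is a minimal R sketch (the seed, sample size, and coefficient values are my own choices, and "orthonormal" is read as the demeaned regressor columns being mutually orthogonal with unit length, with unit error variance as assumed above). The simulated rejection rate should then line up with the power of a noncentral F whose noncentrality parameter is simply the sum of the squared slopes:

set.seed(1)
n <- 30; p <- 3
beta <- c(0.5, -0.3, 0.8)
Xd <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE) # demeaned columns
Q <- qr.Q(qr(Xd))                    # orthonormal, mean-zero regressors: t(Q) %*% Q = diag(p)
ncp <- sum(beta^2)                   # claimed noncentrality parameter
Fstat <- replicate(20000, {
  y <- drop(Q %*% beta) + rnorm(n)   # unit error variance
  summary(lm(y ~ Q))$fstatistic[["value"]]
})
crit <- qf(0.95, p, n - p - 1)
c(simulated = mean(Fstat > crit),
  theoretical = 1 - pf(crit, p, n - p - 1, ncp = ncp)) # should be close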

2

Consider the F-statistic for $q$ linear restrictions $R^\prime\beta=r$ (see Relationship between F (Fisher) and R^2 for a derivation of its equivalence to your expression in the specific case of testing zero slopes, to which we return below; the general form used here is more convenient for studying the distribution of the F-statistic): \begin{eqnarray} F=\frac{(R^\prime\hat{\beta}_{\text{ols}}-r)^\prime\left\{R^\prime(X^\prime X)^{-1}R\right\}^{-1}(R^\prime\hat{\beta}_{\text{ols}}-r)/q}{\hat{\sigma}^{2}}. \end{eqnarray} If the null $H_{0}:R^\prime\beta=r$ is true, we can substitute $r=R^\prime\beta$ and obtain the null F-distribution \begin{eqnarray*} \frac{(\hat{\beta}_{\text{ols}}-\beta)^\prime R\left\{R^\prime(X^\prime X)^{-1}R\right\}^{-1}R^\prime(\hat{\beta}_{\text{ols}}-\beta)/q}{\hat{\sigma}^{2}}\sim F_{q,n-k}. \end{eqnarray*} Now, if the null is actually false, add and subtract $R^\prime\beta$ (in both factors) in the test statistic to write \begin{eqnarray} F=\frac{[R^\prime(\hat{\beta}_{\text{ols}}-\beta)+(R'\beta-r)]^\prime\left\{R^\prime(X^\prime X)^{-1}R\right\}^{-1}[R^\prime(\hat{\beta}_{\text{ols}}-\beta)+(R'\beta-r)]/q}{\hat{\sigma}^{2}} \end{eqnarray} (At this stage this is not really a test statistic anymore, as it depends on the unknown $\beta$, but a theoretical device for studying the distribution under the alternative.)

Again, proceeding as in the link, write (with $B^{-1/2}$ a "matrix square root" of $B^{-1}=(R^\prime(X^\prime X)^{-1} R)^{-1}$, via, e.g., a Cholesky decomposition) the numerator of this expression as $n_A'n_A$, with \begin{eqnarray*} n_A:=\frac{B^{-1/2}}{\sigma}R^\prime(\hat{\beta}_{\text{ols}}-\beta)+\frac{B^{-1/2}}{\sigma}(R'\beta-r)\equiv n+\frac{B^{-1/2}}{\sigma}(R'\beta-r) \end{eqnarray*} Using that $n\sim N(0,I)$ (see the link once more) and defining the constant (we take regressors to be fixed here) $$ c:=\frac{B^{-1/2}}{\sigma}(R'\beta-r) $$ we have $$ n_A\sim N\left(c,I_{q}\right). $$ Hence, the scaled numerator of the F-statistic follows a non-central chi-square distribution with $q$ d.f. and noncentrality parameter \begin{eqnarray*} \lambda&=&c'c\\ &=&\left(\frac{B^{-1/2}}{\sigma}(R'\beta-r)\right)'\left(\frac{B^{-1/2}}{\sigma}(R'\beta-r)\right)\\ &=&(R'\beta-r)'\frac{B^{-1}}{\sigma^2}(R'\beta-r) \end{eqnarray*} Consequently, the F-statistic follows a noncentral F-distribution with numerator d.f. $q$ and noncentrality parameter $\lambda$ (and the standard $n-k$ denominator d.f.).
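As a small illustration of how these pieces fit together (the numbers below are my own toy choices, with regressors treated as fixed), the noncentrality parameter for an arbitrary restriction $R'\beta=r$ can be computed directly in R:

set.seed(2)
n <- 50
X <- cbind(1, matrix(rnorm(n * 2), n))   # constant plus two regressors
beta  <- c(1, 0.5, -0.2)
sigma <- 1.5
R <- matrix(c(0, 1, -1), ncol = 1)       # one restriction, beta_1 - beta_2 = 0, so q = 1
r <- 0
B <- t(R) %*% solve(crossprod(X)) %*% R                                # R'(X'X)^{-1}R
lambda <- t(t(R) %*% beta - r) %*% solve(B) %*% (t(R) %*% beta - r) / sigma^2
drop(lambda)                             # noncentrality parameter of the noncentral F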

Let us consider your case, in which we test for zero slope coefficients. We then have $q=p$, as we restrict $p$ slopes to be zero under the null. Calling these coefficients $\tilde\beta$, we have $\beta=(\beta_0,\tilde\beta')'$, so that $R'=(0\;I_p)$, $R'\beta=\tilde\beta$ and $r=0$. Also, with standard normal errors, we may simplify the expression a little by setting $\sigma^2=1$.

Let us next turn to $B=R^\prime(X^\prime X)^{-1} R$. Denote the regressors other than the constant by $\tilde X$. With $i$ a vector of ones for the constant, $X=(i\; \tilde X)$, so that \begin{align*} (X^\prime X)^{-1}&=\left( \begin{array}{cc} n & i'\tilde X \\ \tilde X'i&\tilde X'\tilde X\end{array} \right)^{-1}, \end{align*} and $B=R^\prime(X^\prime X)^{-1} R$ picks the southeast block of $(X'X)^{-1}$. Using partitioned inverses, that block is given by $$ B=[\tilde X'\tilde X-\tilde X'ii'\tilde X/n]^{-1}=[\tilde X'M_i\tilde X]^{-1}, $$ with $M_i=I-ii'/n$ the symmetric and idempotent residual maker matrix of a regression on a constant (aka demeaning). Putting things together, \begin{eqnarray*} \lambda&=&\tilde\beta'B^{-1}\tilde\beta\\&=&\tilde\beta'[\tilde X'M_i\tilde X]\tilde\beta. \end{eqnarray*} As one would expect, the noncentrality parameter grows when the null is "more false" ($\tilde\beta$ further away from zero) and/or when $n$ is larger (implicitly, through the sum-of-squares term in the middle). The mean of the regressors does not matter, since $\lambda$ depends on $\tilde X$ only through $M_i\tilde X$.

Letting $\tilde c=M_i\tilde X\tilde\beta$ and using symmetry and idempotency of $M_i$, one may also write $$ \lambda=\tilde c'\tilde c, $$ suggesting that it is really the interplay of (demeaned) regressors and coefficients in the sense of their distance from the origin that matters for power.

For example, a negative covariance between regressors should decrease power, in view of negative entries in $\tilde X'M_i\tilde X$, but that effect could be offset by negative slope coefficients. So there is scope to play around with the code below!

Building on the Monte-Carlo from the link once more, consider:

library(lmtest)
n <- 20
reps <- 50000

sloperegs <- 3 # number of slope regressors, q or k-1 (minus the constant) in the above notation
critical.value <- qf(p = .95, df1 = sloperegs, df2 = n-sloperegs-1) # for the null that none of the slope regressors matter

betatilde <- rep(1, sloperegs)
Fstat <- rep(NA, reps)

Xtilde <- matrix(rnorm(n*sloperegs, mean = 1), ncol = sloperegs) # "fixed regressor paradigm"
M.Xtilde <- sweep(Xtilde, 2, colMeans(Xtilde), FUN = "-")        # demeaned regressors
lambda <- t(betatilde) %*% crossprod(M.Xtilde) %*% betatilde     # noncentrality parameter

for (i in 1:reps){
  u <- rnorm(n)
  y <- drop(Xtilde %*% betatilde) + u
  reg <- lm(y ~ Xtilde)
  Fstat[i] <- waldtest(reg, test = "F")$F[2]
}

hist(Fstat, breaks = 80, col = "lightblue", freq = FALSE, xlim = c(0, 90))
x <- seq(0, 90, by = .1)
lines(x, df(x, df1 = sloperegs, df2 = n-sloperegs-1, ncp = lambda), lwd = 2, col = "purple")
abline(v = critical.value, lwd = 2, lty = 2)

mean(Fstat > critical.value) # very close to 1, as almost all test statistics exceed the critical value (vertical line)

We observe an excellent match between the simulated and theoretical distributions:

[Figure: histogram of the simulated F statistics with the noncentral F density (purple) and the 5% critical value (dashed vertical line) overlaid]
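As one further check on top of the simulation above (my own addition, reusing the objects defined in the code), the simulated rejection frequency mean(Fstat > critical.value) can be compared with the theoretical power implied by the noncentral F distribution:

1 - pf(critical.value, df1 = sloperegs, df2 = n-sloperegs-1, ncp = lambda) # theoretical power, likewise very close to 1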

2

To formalize the discussion, let me first clarify the notation and the problem statement.

Here we are studying the classical linear model $y = X\beta + \varepsilon$, where $y \in \mathbb{R}^n$ is the vector of response variables, $X \in \mathbb{R}^{n \times (p + 1)}$ is the design matrix, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top \in \mathbb{R}^{p + 1}$ is the vector of parameters (with $\beta_0$ the intercept), and $\varepsilon \sim N(0, I_{(n)})$ (without loss of generality, I assume $\sigma^2 = 1$ for convenience).

The problem is to find the distribution of the $F$-statistic \begin{align*} F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \tag{1}\label{1} \end{align*} under the alternative hypothesis $H_a$ for the testing problem: \begin{align*} H_0: \beta_1 = \cdots = \beta_p = 0 \text{ vs. } H_a: \text{not all of } \beta_1, \ldots, \beta_p \text{ are zero}. \end{align*}

If you understand how to derive the distribution of $F$ under $H_0$, deriving its distribution under $H_a$ is not that different. So let's go over this process, with the help of some linear algebra. To this end, let me introduce two more pieces of standard notation: the hat matrix $H = X(X^\top X)^{-1}X^\top$ (we assume $\operatorname{rank}(X) = p + 1$), and the vector of $n$ ones $e = (1, \ldots, 1)^\top$.

With the above setup, we can express $RSS$ and $TSS$ in $\eqref{1}$ in terms of quadratic forms as follows: \begin{align*} & TSS = y^\top(I - n^{-1}ee^\top)y, \tag{2.1}\label{2.1} \\ & RSS = y^\top(I - H)y. \tag{2.2}\label{2.2} \end{align*} Hence \begin{align*} TSS - RSS = y^\top(H - n^{-1}ee^\top)y. \tag{3}\label{3} \end{align*}
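A quick numerical sanity check of the quadratic-form expressions $\eqref{2.1}$-$\eqref{3}$ (a sketch with toy data of my own, comparing against lm() output):

set.seed(3)
n <- 25
X <- cbind(1, matrix(rnorm(n * 2), n))            # design including the intercept column
y <- drop(X %*% c(1, 0.5, -0.3)) + rnorm(n)
H <- X %*% solve(crossprod(X)) %*% t(X)           # hat matrix
J <- matrix(1, n, n) / n                          # n^{-1} e e'
TSS <- drop(t(y) %*% (diag(n) - J) %*% y)
RSS <- drop(t(y) %*% (diag(n) - H) %*% y)
fit <- lm(y ~ X - 1)                              # X already contains the constant
c(TSS, sum((y - mean(y))^2))                      # agree
c(RSS, sum(resid(fit)^2))                         # agree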

The distribution of $RSS$ is always $\chi_{n - p - 1}^2$, regardless of the specification of $\beta$. This is because $(I - H)X = X - HX = 0$, whence $(I - H)y = (I - H)(X\beta + \varepsilon) = (I - H)\varepsilon$, which then implies that \begin{align*} RSS = \varepsilon^\top(I - H)\varepsilon \sim \chi_{n - p - 1}^2. \tag{4}\label{4} \end{align*} The last assertion is of course a consequence of the fact that the matrix $I - H$ is idempotent with rank $n - p - 1$ and that $\varepsilon \sim N(0, I_{(n)})$. For details, see the general result at the end of this answer.

Moving to the distribution of $TSS - RSS$ in $\eqref{3}$: this is slightly more involved, because in general it does not reduce to a quadratic form in $\varepsilon$ alone. Under $H_0$, however, it does -- because $H_0$ entails $X\beta = \beta_0e$, whence \begin{align*} & (H - n^{-1}ee^\top)y \\ =& (H - n^{-1}ee^\top)(X\beta + \varepsilon) \\ =& \beta_0He - \beta_0n^{-1}ee^\top e + (H - n^{-1}ee^\top)\varepsilon \\ =& \beta_0e - \beta_0e + (H - n^{-1}ee^\top)\varepsilon \\ =& (H - n^{-1}ee^\top)\varepsilon.\tag{5}\label{5} \end{align*} It then follows by the idempotency of the matrix $H - n^{-1}ee^\top$ that \begin{align*} TSS - RSS = \varepsilon^\top(H - n^{-1}ee^\top)\varepsilon \sim \chi_p^2. \tag{6}\label{6} \end{align*} Combining $\eqref{4}$ and $\eqref{6}$ immediately gives $F \sim F_{p, n - p - 1}$ under $H_0$. (I will leave to you the check that $RSS$ and $TSS - RSS$ are independent, which is straightforward under the normality assumption: verify that $\operatorname{Cov}((I - H)\varepsilon, (H - n^{-1}ee^\top)\varepsilon) = 0$.)

After reviewing the classical proof above, it is clear that the only place where the distribution of $F$ changes in moving from $H_0$ to $H_a$ is the algebra in $\eqref{5}$, which needs to be updated to
\begin{align*} (H - n^{-1}ee^\top)y = P(\mu + \varepsilon), \tag{7}\label{7} \end{align*} where $P := H - n^{-1}ee^\top$ and $\mu := X\beta$. Since $\mu + \varepsilon \sim N(\mu, I_{(n)})$, it can be shown (see the short proof at the end of this answer) that \begin{align*} y^\top(H - n^{-1}ee^\top)y = (\mu + \varepsilon)^\top P (\mu + \varepsilon) \sim \chi^2_{p; \mu^\top P\mu} = \chi^2_{p; \|(I - n^{-1}ee^\top)X\beta\|^2} \tag{8}\label{8} \end{align*} $\eqref{4}$ and $\eqref{8}$, together with the definition of the noncentral F-distribution, then imply that under $H_a$, \begin{align*} F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p, n - p - 1; \|(I - n^{-1}ee^\top)X\beta\|^2}. \end{align*} That is, $F$ follows a noncentral $F$ distribution with degrees of freedom $p$ and $n - p - 1$ and noncentrality parameter $\|(I - n^{-1}ee^\top)X\beta\|^2 = \beta^\top X^\top(I - n^{-1}ee^\top)X\beta$.
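The noncentrality parameter has a simple reading: it is the squared norm of the demeaned mean response $X\beta$. A toy computation (numbers are my own) confirming that the quadratic form and the squared norm coincide:

set.seed(4)
n <- 25
X <- cbind(1, matrix(rnorm(n * 2), n))
beta <- c(1, 0.5, -0.3)
mu <- drop(X %*% beta)
c(drop(t(mu) %*% (diag(n) - matrix(1, n, n) / n) %*% mu),  # beta' X' (I - n^{-1} e e') X beta
  sum((mu - mean(mu))^2))                                  # the two values coincide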


Many conclusions in the above answer are based on the following classical result:

Suppose $\xi \sim N(\mu, I_{(n)})$, $A$ is an idempotent and symmetric matrix with $\operatorname{rank}(A) = r$, then the quadratic form $\xi^\top A \xi$ has a noncentral $\chi^2$ distribution with degrees of freedom $r$ and noncentrality parameter $\mu^\top A \mu$.

Here is a short proof based on the canonical form of $A$:

Since $A$ is symmetric and idempotent with $\operatorname{rank}(A) = r$, there exists an order-$n$ orthogonal matrix $O$ such that $A = O^\top\operatorname{diag}(I_{(r)}, 0)O$. Denote $O\xi$ by $\eta = (\eta_1, \ldots, \eta_n)^\top$; by the linear transformation property of the multivariate normal, $\eta \sim N(O\mu, I_{(n)})$. It then follows from the definition of the noncentral chi-squared distribution that \begin{align*} \xi^\top A \xi = \xi^\top O^\top \operatorname{diag}(I_{(r)}, 0) O\xi = \eta^\top \operatorname{diag}(I_{(r)}, 0) \eta = \eta_1^2 + \cdots + \eta_r^2 \sim \chi^2_{r; \mu^\top A \mu}, \end{align*} where we used $\eta_i \sim N(e_i^\top O\mu, 1)$ for $i = 1, \ldots, r$ (here $e_i$ is the $n$-long vector with $i$-th component $1$ and all remaining components $0$), whence \begin{align*} \sum_{i = 1}^r(e_i^\top O\mu)^2 = \sum_{i = 1}^r \mu^\top O^\top e_ie_i^\top O\mu = \mu^\top O^\top\operatorname{diag}(I_{(r)}, 0)O\mu = \mu^\top A \mu. \end{align*}
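A short simulation (my own setup, not part of the proof) illustrating the classical result: a quadratic form in a symmetric idempotent matrix of rank $r$ applied to a $N(\mu, I_{(n)})$ vector behaves like a noncentral $\chi^2_r$ with noncentrality $\mu^\top A\mu$, checked here through the mean $r + \mu^\top A\mu$:

set.seed(5)
n <- 10; r <- 4
O <- qr.Q(qr(matrix(rnorm(n * n), n)))                 # a random orthogonal matrix
A <- t(O) %*% diag(c(rep(1, r), rep(0, n - r))) %*% O  # symmetric, idempotent, rank r
mu <- rnorm(n)
qf.sim <- replicate(20000, { xi <- mu + rnorm(n); drop(t(xi) %*% A %*% xi) })
c(simulated_mean = mean(qf.sim),
  theoretical    = r + drop(t(mu) %*% A %*% mu))       # E[chi^2_{r; ncp}] = r + ncp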

Zhanxiong
  • +1 At first sight I suspected a discrepancy in our answers in that your ncp is $(I-ee'/n)X\beta$ whereas mine is (mixing your and my notation) $(I-ee'/n)\tilde X\tilde \beta$, i.e. mine involves the slope regressors and coefficients only. However, we of course have $(I-ee'/n)e=0$, so that the first column of $(I-ee'/n)X$ is zero, such that $(I-ee'/n)X\beta=(I-ee'/n)\tilde X\tilde \beta$. – Christoph Hanck Mar 15 '24 at 18:17
  • @ChristophHanck Yeah. In fact there is a quick geometrical way to determine the NCP, which is just the squared norm of the orthogonal projection of $X\beta$ onto the space $\operatorname{col}(X) - \operatorname{col}(e)$, i.e., $\|(H - n^{-1}ee^\top)X\beta\|^2$ (which equals $\|(I - n^{-1}ee^\top)X\beta\|^2$). – Zhanxiong Mar 15 '24 at 18:29