
When adding control variables to my regression, the F-statistic decreases. Furthermore, when I add an interaction term, the F-statistic is reduced further. How do I interpret these regression results?

  • What do you mean by "the F-test decreases"? Is it an F test statistic that decreases? If so, where does this value come from (e.g., from an F test of the whole model against an intercept-only model, or from some other comparison between two models, or from somewhere else)? Are you running a straightforward linear model, or something more complicated (GLM, ...)? – Stephan Kolassa Feb 02 '23 at 10:22

1 Answer


I assume you are referring to the value of the F test statistic decreasing.

From Relationship between F (Fisher) and R^2, we know that, in a linear regression with $p$ regressors (including the constant), the F statistic for testing that all $p-1$ slope coefficients (i.e., all coefficients except the one on the constant) are zero can be written as
$$ F_{short}=\frac{R^2}{1-R^2}\frac{n-p}{p-1} $$
When adding a $(p+1)$th regressor, we therefore obtain, with $\tilde R^2$ the R-squared of the regression with the extra regressor,
$$ F_{long}=\frac{\tilde R^2}{1-\tilde R^2}\frac{n-p-1}{p} $$
Since we know that $\tilde R^2\geq R^2$ (R-squared does not decrease when adding regressors), we have
$$ \frac{\tilde R^2}{1-\tilde R^2}\geq \frac{R^2}{1-R^2} $$
Denote the ratio of these two quantities by
$$ c:=\frac{\tilde R^2}{1-\tilde R^2}\Biggm/\frac{R^2}{1-R^2}\geq1 $$
We hence have
$$\begin{eqnarray*} F_{long}\leq F_{short}&\Longleftrightarrow& \frac{\tilde R^2}{1-\tilde R^2}\frac{n-p-1}{p}\leq \frac{R^2}{1-R^2}\frac{n-p}{p-1}\\ &\Longleftrightarrow&c\leq \frac{n-p}{p-1}\frac{p}{n-p-1}\\ &\Longleftrightarrow&\left(1-\frac{1}{n-p}\right)\left(1-\frac{1}{p}\right)\leq \frac{1}{c} \end{eqnarray*} $$
When the additional regressor adds nothing to the explanatory power, so that $c=1$, the condition is always satisfied, because the product on the left-hand side is less than one. When it does add explanatory power ($c>1$), "sufficiently much" needs to be subtracted in the brackets to bring the product below $1/c$.
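As a quick sanity check of the formula for $F_{short}$, here is a small sketch on simulated data (the data and variable names are purely illustrative), comparing the hand-computed expression with the F statistic that lm reports:

set.seed(123)
n <- 50
k <- 3                                 # k = p - 1 slope regressors
X <- matrix(rnorm(n*k), ncol=k)
y <- rnorm(n)
fit <- lm(y ~ X)
R2 <- summary(fit)$r.squared
p <- k + 1                             # regressors including the constant
R2/(1-R2) * (n-p)/(p-1)                # hand-computed F statistic
summary(fit)$fstatistic["value"]       # F statistic reported by summary.lm; the two coincide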

The first bracket subtracts a lot when $1/(n-p)$ is large, i.e., when $n-p$ is small, i.e., when there are many regressors relative to the sample size. The second bracket subtracts a lot when $p$ is small.

Hence, the situation is likely to arise either when you fit a model with almost as many regressors as you have data points, or when you have only few regressors, and when, in either case, the additional control variable(s) in your long regression contribute little or no extra explanatory power.
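To see the left-hand side numerically, here is a small sketch (the choice $n=20$ is arbitrary) tabulating the product $\left(1-\frac{1}{n-p}\right)\left(1-\frac{1}{p}\right)$ over the admissible values of $p$ (see the technical remark further down):

n <- 20
p <- 2:(n-2)                                  # admissible numbers of regressors, including the constant
threshold <- (1 - 1/(n-p)) * (1 - 1/p)
round(rbind(p, threshold), 3)                 # smallest near p = 2 and near p = n-2

The product dips furthest below one at the two ends, so the condition is easiest to satisfy with either very few or very many regressors.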

Numerical illustration:

library(lmtest)
set.seed(1) # try set.seed(2) for an example where the long regression has a larger F-stat

n <- 10
sloperegs <- 5 # number of slope regressors, p-1 (excluding the constant) in the above notation

y <- rnorm(n)
X <- matrix(rnorm(n*sloperegs), ncol=sloperegs)
short.reg <- lm(y~X)

X.additional <- rnorm(n) # another regressor (irrelevant, as it is unrelated to y)
long.reg <- lm(y~X+X.additional)

R2 <- summary(short.reg)$r.squared
R2.tilde <- summary(long.reg)$r.squared

c <- (R2.tilde/(1-R2.tilde)) / (R2/(1-R2)) # the ratio c defined above

R2/(1-R2)*(n-sloperegs-1)/(sloperegs) # check that this reproduces the test statistic as written above (equals Fstat.short below)

> (Fstat.short <- waldtest(short.reg, test="F")$F[2])
[1] 3.10882

> (Fstat.long <- waldtest(long.reg, test="F")$F[2])
[1] 2.477202

> (1-1/(n-sloperegs-1))*(1-1/(sloperegs+1))
[1] 0.625

> 1/c
[1] 0.7843575

Since $0.625\leq0.784$, the condition derived above is satisfied, consistent with the long regression's F statistic (2.48) being smaller than the short regression's (3.11).

[Technical remark: when we have as many regressors as observations, i.e., $n=p$, expressions such as $1/(n-p)$ are not defined; also, $R^2=1$ in this case, so that the formula for $F_{short}$ would divide by zero as well, and the regression with $p+1$ regressors would no longer have a unique solution. Likewise, $\tilde R^2=1$ when $n=p+1$, i.e., $p=n-1$. The results therefore apply to cases without such "overfitting".

Conversely, we require $p\geq2$, as otherwise the F statistic of the short model does not test an exclusion restriction.

Hence, we require $2\leq p\leq n-2$.]

When adding $d>1$ regressors instead, the same argument shows that the condition becomes $$ \left(1-\frac{d}{n-p}\right)\left(1-\frac{d}{p+d-1}\right)\leq \frac{1}{c} $$ For larger $d$, both brackets on the left-hand side shrink more than in the $d=1$ case, but adding $d>1$ regressors at once will also tend to produce a larger $c$, as the longer regression has more free parameters with which to fit the data.
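As a sketch of a numerical check (continuing from the objects defined in the illustration above; the names d, X.more, longer.reg, c2 and Fstat.longer are just illustrative), the condition and a direct comparison of the two F statistics should give the same answer:

d <- 2
X.more <- matrix(rnorm(n*d), ncol=d)             # d additional irrelevant regressors
longer.reg <- lm(y ~ X + X.more)
R2.tilde2 <- summary(longer.reg)$r.squared
c2 <- (R2.tilde2/(1-R2.tilde2)) / (R2/(1-R2))    # the ratio c for this comparison
p <- sloperegs + 1                               # regressors including the constant
Fstat.longer <- waldtest(longer.reg, test="F")$F[2]
(1 - d/(n-p)) * (1 - d/(p+d-1)) <= 1/c2          # the condition above
Fstat.longer <= Fstat.short                      # should return the same TRUE/FALSE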