
Let's start with the Davis dataset from the car package and fit a simple regression model:

library(car)
data(Davis)
mod.davis <- lm(weight ~ repwt, data = Davis)
summary(mod.davis)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.3363     3.0369   1.757   0.0806 .  
repwt         0.9278     0.0453  20.484   <2e-16 ***

Now I'd like to test the joint hypothesis \begin{align*} \beta_0 &= 5 \\ \beta_1 &= 1 \end{align*}

linearHypothesis(mod.davis, c("(Intercept) = 5", "repwt = 1"))

what I get is

  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    183 16549                                  
2    181 12828  2      3721 26.251 9.764e-11 ***
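That F statistic can be reproduced by hand from the restricted and unrestricted residual sums of squares, which is a useful sanity check on what linearHypothesis is doing (a sketch; note that lm drops the rows where repwt is missing):

```r
library(car)  # for the Davis data

data(Davis)
mod.davis <- lm(weight ~ repwt, data = Davis)

# Unrestricted RSS from the fitted model
rss1 <- sum(residuals(mod.davis)^2)

# Restricted RSS: under H0 the line is fixed at weight = 5 + 1 * repwt,
# so nothing is estimated and we just sum the squared deviations
ok   <- complete.cases(Davis$weight, Davis$repwt)  # lm drops NA rows
rss0 <- sum((Davis$weight[ok] - (5 + Davis$repwt[ok]))^2)

df1   <- 2                      # two restrictions
df2   <- mod.davis$df.residual  # 181
Fstat <- ((rss0 - rss1) / df1) / (rss1 / df2)
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)
c(F = Fstat, p = pval)  # F about 26.25, p about 1e-10, matching the table
```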

The low p-value is quite puzzling to me:

confint(mod.davis)
                 2.5 %    97.5 %
(Intercept) -0.6560394 11.328560
repwt        0.8384665  1.017219
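Part of the answer hides in the variance-covariance matrix of the estimates: the marginal intervals say nothing about how strongly the intercept and slope estimates are correlated. A quick check (a sketch, refitting the model from above):

```r
library(car)

data(Davis)
mod.davis <- lm(weight ~ repwt, data = Davis)

# Correlation matrix of the coefficient estimates: the off-diagonal
# entry is close to -1, so the two marginal CIs cannot simply be
# combined into a rectangle to judge a joint hypothesis
cov2cor(vcov(mod.davis))
```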

especially because, for \begin{align*} \beta_0 &= 0 \\ \beta_1 &= 1 \end{align*}

linearHypothesis(mod.davis, c("(Intercept) = 0", "repwt = 1"))

  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    183 13074                           
2    181 12828  2    245.97 1.7353 0.1793

the data suggest that $H_0$ is true. How can that be? $\hat\beta_0 = 5.33$ is the mean of the sampling distribution, so I'd sooner expect a p-value $< 0.05$ when testing $\beta_0 = 0$ than when testing $\beta_0 = 5$.

  • Your intercept is the predicted outcome at 0, which for a body weight measurement seems nonsensical & will be quite the extrapolation. Have you considered centering your data? – PBulls Dec 05 '23 at 20:15
  • A standard application where this issue comes up occurs when the explanatory variable is time. For instance, if you were to regress some time series against a numeric year C.E., the intercept would be the fitted value at the year zero. As far as your joint hypothesis goes, bear in mind the intercept estimate can be strongly correlated with the slope estimate. The only time this doesn't occur is with centered variables. – whuber Dec 05 '23 at 20:26
  • @whuber I see, but what's the reason for the discrepancy here? I don't quite get it. – Adam Bogdański Dec 05 '23 at 20:29
  • I don't see any discrepancy. Please plot your data and your hypothesis. This will help: with(Davis, plot(repwt, weight, xlim = c(0, 130), ylim = c(0, 5*130))); abline(c(1, 5)) How plausible is this hypothesis?? – whuber Dec 05 '23 at 20:42
  • @whuber I believe that should be abline(5, 1), and then it doesn't look particularly wrong. I accept that the CI for the intercept is quite wide, but it still isn't clear to me why testing for "the most probable" value of the intercept leads to such a small p-value. – Adam Bogdański Dec 05 '23 at 21:03
  • You are performing a joint F-test on both parameters, as already mentioned these will likely be highly correlated. Have a look at just the hypothesis test for $\beta_0=5$ (I'll tell you: P = 0.912). The test for $\beta_0=0$ under your model will return the same P-value as the one for having observed $\hat{\beta}_0=5.3363$ under the null. (I keep messing up the TeX in these comments...) – PBulls Dec 05 '23 at 21:31
  • I tried some things:

    linearHypothesis(mod.davis, "(Intercept) = 5")  # p = 0.912
    linearHypothesis(mod.davis, "repwt = 1")        # p = 0.113
    linearHypothesis(mod.davis, "repwt = 0.93")     # p = 0.96

    Considering the low SE for the parameter estimate of repwt, p = 0.113 for the repwt = 1 test makes sense.

    – Peter Flom Dec 05 '23 at 22:51
  • Sorry--I did mix them up. But plotting still resolves the question: you can see how the plotted line exceeds all but two or three of the 200 points in the scatterplot, convincingly demonstrating it's incorrect. – whuber Dec 05 '23 at 23:13

1 Answer


Let's look at the data and your hypothesis.

The first figure is a plot you should always make: a scatterplot of the data. On it I have drawn a 95% confidence band for the fit (in gray) and your hypothesis in red, the line with intercept 5 and slope 1:

[Figure: scatterplot of weight vs. repwt with a 95% confidence band for the fit (gray) and the hypothesized line with intercept 5 and slope 1 (red)]
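Something like the following sketch reproduces that figure (the plotting ranges and colors are assumptions):

```r
library(car)

data(Davis)
mod.davis <- lm(weight ~ repwt, data = Davis)

# Pointwise 95% confidence band for the fitted line
grid <- data.frame(repwt = seq(30, 125, length.out = 100))
ci   <- predict(mod.davis, grid, interval = "confidence")

with(Davis, plot(repwt, weight))
polygon(c(grid$repwt, rev(grid$repwt)), c(ci[, "lwr"], rev(ci[, "upr"])),
        col = gray(0.8, 0.5), border = NA)
abline(mod.davis)                      # the fitted line
abline(c(5, 1), col = "red", lwd = 2)  # the hypothesized line
```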

The red line is not a good fit: almost all of the nearly 200 points in the plot lie below it. But does it really deserve a p-value as low as $10^{-10}$?

One way to get some intuition is to simulate these data. This isn't something you would routinely do, but doing it at least once is a good part of any statistical education and a great way to check that your analysis is correct.

Here, I sample from the estimated parameters (slope and intercept) using their estimated variance-covariance matrix. The next plot shows 50 such samples, each drawn as a gray line.

[Figure: the same scatterplot with 50 simulated regression lines (gray) and the hypothesized line (red)]

It shows, among other things, that the hypothesized line isn't close to any of these 50 lines: all of those simulated lines tend to go near the "point of averages" near $(66,66),$ whereas the hypothesized line is visibly higher there.
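Based on the code shared in the comments below, the simulation can be sketched like this (graphical details are assumptions):

```r
library(car)
library(MASS)  # for mvrnorm

data(Davis)
mod.davis <- lm(weight ~ repwt, data = Davis)

# Draw (intercept, slope) pairs from the approximate sampling
# distribution N(b, vcov) of the coefficient estimates
b   <- coef(mod.davis)
S   <- vcov(mod.davis)
sim <- mvrnorm(50, b, S)

with(Davis, plot(repwt, weight))
for (i in seq_len(nrow(sim))) abline(sim[i, ], col = gray(0.5, 0.4))
abline(c(5, 1), col = "red", lwd = 2)  # the hypothesized line
```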

Perhaps a clearer way to see how unlikely the hypothesis is uses the more abstract method of plotting the slopes and intercepts as points. (This is the "parameter space" in which the possible values of (intercept, slope) lie.) This time I show 500 simulated values as faint points and their density as a colored contour plot.

[Figure: 500 simulated (intercept, slope) pairs in parameter space, shown as faint points with colored density contours; the hypothesized point (5, 1) in red]

The red point is the hypothesized line. Its extremely low p-value reflects two things:

  • The point is visibly far from all of the simulated data, especially in the northeast-southwest direction. That is, it might not be far from any of the intercepts (it isn't) and it might not be far from any of the slopes (it isn't), but it truly is a "geometric outlier" when examined in the full context. (Its Mahalanobis distance is large.)

  • The p-value is computed assuming these points follow a bivariate Normal density. Such densities decay extremely quickly away from their centers (their logarithms, by definition, decrease parabolically -- that is, their decrease accelerates with distance).
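The first point can be made quantitative: the squared Mahalanobis distance of the hypothesized point from the estimates, measured in the metric of the coefficient covariance matrix, equals twice the F statistic of the joint test with two restrictions (a sketch):

```r
library(car)

data(Davis)
mod.davis <- lm(weight ~ repwt, data = Davis)

# Squared Mahalanobis distance of the hypothesized (intercept, slope)
# from the estimates, using their estimated covariance matrix
d2 <- mahalanobis(c(5, 1), center = coef(mod.davis), cov = vcov(mod.davis))
d2 / 2  # equals the joint F statistic, about 26.25
```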

If you make different distributional assumptions about the errors in the data, you will obtain different p-values; but clearly they will all be small, because this hypothesis is a poor description of the data.

whuber
  • Thank you! Brilliant and very insightful answer, as always. I guess I kinda fell for the trap that I always try to avoid: I trusted the OLS parameters far too much, beyond the reasonable point. Thanks again! – Adam Bogdański Dec 06 '23 at 19:25
  • @whuber This answer is great [+1] I wonder if you might show the code that you used to generate these plots ? I would be obliged! – Cooper Owls Jan 10 '24 at 12:04
  • @Cooper The first two plots are routine. The last is a 2D density plot of the dataframe A created using the mvrnorm multivariate Normal RNG in the MASS library:
    library(car); library(MASS)
    mod.davis <- lm(weight ~ repwt, data = Davis)
    b <- coefficients(mod.davis)
    S <- vcov(mod.davis)
    A <- as.data.frame(mvrnorm(500, b, S))
    – whuber Jan 10 '24 at 15:43
  • @whuber thank you ever so much :) – Cooper Owls Feb 01 '24 at 18:55