
Suppose there is some data and all the assumptions for regression are met. I fit two regression models:

  • Case 1: Regression model where all coefficients have significant $p$ values but overall the model has high error.
  • Case 2: Regression model where all coefficients have non-significant $p$ values but overall the model has low error.

Is it possible that these models are still useful? If my goal is only prediction, is Case 2's model still good? If my goal is only to examine the association between the predictors and the outcome, is Case 1's model still good?

Do both indicators (significant $p$ values AND low error) need to be met for a good model? Or does it depend on the task (e.g., prediction vs. estimating association effects)?

stats_noob

6 Answers


It's hard to tell whether the model is good based on the information provided.

There's no single answer. It, generally speaking, depends on the business case and what you want out of it.

Sometimes a model can have a large error and still be counted as good enough. So yes, that model can be useful. I find a model with non-significant $p$-values for its coefficients a bit cumbersome; I'd suggest replacing or removing those features, but it doesn't necessarily render the model non-useful.


I agree with Alex that it's hard to tell with the info given.

Model 1 sounds like it has a lot of observations; that is usually how you get low $p$-values together with a lot of error. But whether error counts as "high" depends on the field and on the particular topic.

Model 2 seems likely to be useful, unless it is a case of overfitting.

If you post the output from the models, we may be able to give more insightful answers.

Peter Flom

Significant coefficients and a large error will happen if the variables have a very small but non-zero relationship to the outcome and you have a large sample size, which makes even tiny effects statistically significant.

Non-significant effects and small error will happen if the variables do have a strong relationship to the outcome but are collinear or nearly collinear; the collinearity inflates the standard errors of the individual coefficients, which makes their $p$-values non-significant.
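Here is a minimal R sketch of both scenarios (my own simulated data; the sample sizes and coefficients are arbitrary choices purely for illustration):

#### Case 1: tiny but real effect, huge n ####
set.seed(1)
n1 <- 1e5
x1 <- rnorm(n1)
y1 <- 0.02*x1 + rnorm(n1)           # true slope is tiny relative to the noise
summary(lm(y1 ~ x1))                # slope comes out "significant", R^2 is near zero

#### Case 2: strong but nearly collinear predictors ####
set.seed(2)
n2 <- 50
a <- rnorm(n2)
b <- a + rnorm(n2,sd=0.02)          # b is almost a copy of a
y2 <- a + b + rnorm(n2,sd=0.2)      # jointly the predictors explain y very well
summary(lm(y2 ~ a + b))             # high R^2, yet the individual p-values can be large

In the second fit it is the standard errors, not the coefficient estimates, that the collinearity blows up.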

rep_ho

There are a bunch of factors that could go into this. Some cases where a $p$ value can be non-informative...

  • Poorly captured functions. For example, if we fit nonlinear data with a linear regression and the predictors are deemed statistically significant, then the captured relationship means much less than the output suggests. This can sometimes be driven entirely by error (the function is very noisy), or be completely independent of it (the regression simply draws a line through the data in just the right way and erroneously suggests a relationship).
  • Sample size and effect size. If you have a billion people measured on the relationship between two variables but the effect is extremely small in magnitude, the relationship will be "statistically significant" but not "practically significant." Here the standard error will be very low, but the $p$ value really isn't that informative. The opposite can also happen: you have a low sample size and a statistically significant effect that is captured purely by chance. You can also have the case where the effect is actually present, but with a low sample size the $p$ value is still uninformative. So the context changes how this plays out, and error plays into all of those scenarios, but the standard error generally shrinks as the sample size grows.
  • Correlated errors. If your data have correlated errors but you don't model them, then your standard errors are typically biased downward and the resulting $p$-values are misleadingly small. For the opposite reason, if you model the random effects and the fixed effects become statistically non-significant, then you are capturing a more "true" model despite the $p$ values being high. Both of these scenarios are by nature caused by the noise structure in the model.
  • Unaccounted-for attenuation bias. You could have a situation where the $p$ value is high for some scale made of many items used as a predictor. However, the actual relationship between the latent construct and the outcome is quite strong and is only being distorted by noise from bad items, which inflates the error in the estimates. If the items were corrected so that they accurately measured the construct, then the high $p$ value could simply be attributed to attenuation bias (a small simulation of this appears at the end of this answer). The opposite can also happen: you have highly reliable items (little measurement error), but the predictor simply has little or no effect on the outcome.
  • The actual effect is by nature noisy. You may run one study where the functional relationship between the variables is properly accounted for, but because it is impossible to measure with certainty, it ends up with a high $p$ value. ROC curves, for example, grew out of WWII radar work, where the problem was distinguishing signal from noise when detecting enemy planes. Properly calibrating the data to the function may take multiple studies, and the error in any one of them can be uninformative about the true relationship.

I will state that, in any case, $p$ values are usually terrible metrics for how good or bad a model is, along with its regressors. There is a lot more information one should be concerned with, and $p$ values typically distract from the actual modeling process people should be engaging in. The standard error at least approximates the noise in our estimate, whereas a $p$ value can completely misrepresent the relationship between variables (and is often misinterpreted).

Below is a good example, where I have simulated the same model twice but tweaked the residual error to be lower in the second model. Both fits completely misrepresent the (nonlinear) data, yet both are "statistically significant."

#### Poorly Captured Function ####
set.seed(123)
par(mfrow=c(1,2))
x <- runif(100,0,8)
y <- cos(x) + rnorm(100,sd=.5)
fit <- lm(y ~ x)
summary(fit)
plot(x,y,main="Higher Error Model")
abline(fit,col="red")

set.seed(123)
x2 <- runif(100,0,8)
y2 <- cos(x2) + rnorm(100,sd=.1)
fit2 <- lm(y2 ~ x2)
summary(fit2)
plot(x2,y2,main="Lower Error Model")
abline(fit2,col="red")

[Figure: scatterplots "Higher Error Model" and "Lower Error Model", each with the fitted line in red]

Here is another simulated example, which shows a high sample size and low error with a nearly nonexistent effect, compared to a low sample size and high error with a large effect that isn't captured well.

#### Sample Size and Error ####
par(mfrow=c(1,2))
x3 <- rnorm(1e5)
y3 <- (x3*.01) + rnorm(1e5)
fit3 <- lm(y3 ~ x3)
summary(fit3)
plot(x3,y3,main="High Sample Size, Low Error")
abline(fit3,col="red")

x4 <- rnorm(5)
y4 <- (x4*20) + rnorm(5,sd=20)
fit4 <- lm(y4 ~ x4)
summary(fit4)
plot(x4,y4,main="Low Sample Size, High Error")
abline(fit4,col="red")

[Figure: scatterplots "High Sample Size, Low Error" and "Low Sample Size, High Error", each with the fitted line in red]

Another classic example is Simpson's paradox. Using the iris data in R, we can see two model fits: one with low error and a statistically significant effect, the other with high error and no statistically significant effect. Both models are still wrong because they ignore the species grouping, and the $p$ value alone doesn't tell the actual story.

#### Classic Simpson Paradox ####
fit.petal <- lm(Petal.Length ~ Petal.Width,iris)
fit.sepal <- lm(Sepal.Length ~ Sepal.Width,iris)
summary(fit.petal)
summary(fit.sepal)

par(mfrow=c(1,2))

plot(iris$Petal.Width, iris$Petal.Length, bg=iris$Species, pch=23, main="Petal Regression")
abline(fit.petal, col="red")

plot(iris$Sepal.Width, iris$Sepal.Length, bg=iris$Species, pch=23, main="Sepal Regression")
abline(fit.sepal, col="red")

[Figure: "Petal Regression" and "Sepal Regression" scatterplots, points coloured by species, fitted lines in red]
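Finally, as an addendum to the attenuation-bias point above, here is a minimal sketch (my own simulated example with arbitrary reliabilities) of how measurement error in a predictor shrinks the estimated slope and weakens the $p$ value even though the underlying relationship is strong:

#### Attenuation Bias from Measurement Error ####
set.seed(42)
n <- 60
true_x <- rnorm(n)                    # the latent construct
y <- 0.8*true_x + rnorm(n,sd=1)       # strong true relationship

x_noisy <- true_x + rnorm(n,sd=3)     # badly measured scale (low reliability)
x_clean <- true_x + rnorm(n,sd=0.2)   # well measured scale (high reliability)

summary(lm(y ~ x_noisy))              # slope attenuated toward zero, weak p-value
summary(lm(y ~ x_clean))              # slope near 0.8, small p-value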


Case 1: Regression model where all coefficients have significant p-values but overall model has high error

This sounds like a situation where the sample size is large. The standard error of a coefficient depends on both the residual variance ("error") and the sample size; for a simple regression, $\operatorname{SE}(\hat{\beta}_1) = \hat{\sigma}\big/\sqrt{\sum_i (x_i - \bar{x})^2}$. Winding up with a small p-value means a low standard error relative to the estimate, so if the residual variance is large, the sample size must be large enough to overcome it.

For example, the code below simulates a situation where the residual variance is so high that the $R^2$ and adjusted $R^2$ are both around $0.0035$. Nonetheless, the large sample size of $10000$ is able to detect that the slope parameter is different from zero, giving a tiny $p$-value on the order of $10^{-9}$.

set.seed(2023)
N <- 10000
x <- runif(N, 0, 1)
Ey <- 1 + 0.25*x
e <- rnorm(N, 0, 1)
y <- Ey + e
L <- lm(y ~ x)
summary(L)

Case 2: Regression model where all coefficients have non-significant p-values but overall model has low error

Two situations come to mind for this.

  1. The sample size is so small that even a large detected effect cannot be declared significant. In the simulation below, the $R^2$ and adjusted $R^2$ are rather high, yet the slope parameter lacks significance at any standard $\alpha$-level (I get $p = 0.155$).
set.seed(2023)
N <- 4
x <- runif(N, 0, 1)
Ey <- 1 + 4*x
e <- rnorm(N, 0, 0.25)
y <- Ey + e
L <- lm(y ~ x)
summary(L)
  2. There is multicollinearity that blows up the variance-inflation factors, killing your power to detect deviations. Think of this as the model being able to tell that the features are, jointly, important, yet not being able to attribute the importance to any individual feature, due to how related they are to each other. In the simulation below, the $R^2$ and adjusted $R^2$ are both quite high, yet the two features have a correlation exceeding $0.99$, leading to a variance-inflation factor exceeding $56$ for each coefficient.
library(MASS)
set.seed(2023)
N <- 100
X <- MASS::mvrnorm(N, c(0, 0), matrix(c(1, 0.99, 0.99, 1), 2, 2))
Ey <- X %*% rep(2, dim(X)[2])
e <- rnorm(N, 0, 2)
y <- Ey + e
L <- lm(y ~ X)
summary(L)

I suppose a final explanation could be that you are looking at the residuals without context. Sure, your molecule-measurement predictions might have errors that are small numbers, but if those are small numbers of kilometers, well, of course they're small: you're dealing with small numbers of kilometers that span a narrow range (low variance). That is why I look at the $R^2$ and adjusted $R^2$ values, which account for the overall variability and give some context to the residuals (not without their flaws).
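As a minimal sketch of this point (my own example, simply rescaling the same outcome), the raw size of the residuals depends entirely on the units, while $R^2$ does not:

#### Residual Size Depends on Units, R^2 Does Not ####
set.seed(7)
x <- runif(200,0,1)
y_km <- 0.001 + 0.0005*x + rnorm(200,sd=0.0002)   # outcome recorded in kilometers
y_m <- 1000*y_km                                  # the same outcome in meters

fit_km <- lm(y_km ~ x)
fit_m <- lm(y_m ~ x)

sqrt(mean(resid(fit_km)^2))    # RMSE looks "tiny" in kilometers
sqrt(mean(resid(fit_m)^2))     # RMSE is 1000 times larger in meters
summary(fit_km)$r.squared      # R^2 is identical for both fits
summary(fit_m)$r.squared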

Dave

For Case 2, I would say it depends on whether the model's predictions have low error on a test set and not just on the training set. There is also the question of the usefulness of individual predictors, which points us toward model selection algorithms.
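For illustration, here is a minimal sketch (my own simulated example) of checking whether a low training error survives on held-out data:

#### Training Error vs. Test Error ####
set.seed(99)
n <- 60
d <- data.frame(x=runif(n,0,1))
d$y <- sin(2*pi*d$x) + rnorm(n,sd=0.3)

train <- d[1:40,]
test <- d[41:60,]

fit_flex <- lm(y ~ poly(x,12), data=train)   # very flexible fit
fit_mod <- lm(y ~ poly(x,3), data=train)     # more modest fit

rmse <- function(obs,pred) sqrt(mean((obs-pred)^2))
rmse(train$y, predict(fit_flex))                 # small training error
rmse(test$y, predict(fit_flex, newdata=test))    # often much larger on new data
rmse(test$y, predict(fit_mod, newdata=test))     # typically closer to its training error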

GAMer