Given an existing regression curve, how do I properly account for the known variance I have in some new value of X? If I had an observation $x_{new} = 700$ with a variance $\sigma_x^2 = 150$ then how should I incorporate that information into my final answer for $y_{new}$?
I can build a (very simple) regression model with two vectors. Using R notation:
x <- c(8, 10, 50, 200, 350, 500, 1000, 2000)
y <- c(0.012, 0.016, 0.078, 0.333, 0.583, 0.799, 1.643, 3.002)
simple.lm <- lm(y ~ x, data = data.frame(x, y))
summary(simple.lm)
which returns
Call:
lm(formula = y ~ x, data = data.frame(x, y))
Residuals:
Min 1Q Median 3Q Max
-0.05657 -0.02774 -0.01223 0.01591 0.09954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.835e-02 2.349e-02 1.207 0.273
x 1.515e-03 2.856e-05 53.056 3.01e-09 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05184 on 6 degrees of freedom
Multiple R-squared: 0.9979, Adjusted R-squared: 0.9975
F-statistic: 2815 on 1 and 6 DF, p-value: 3.009e-09
I can use predict() to find any new value of Y given a value of X, and can find the prediction interval for that value of Y. Easy enough.
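For example, for the new observation at $x_{new} = 700$:

x.new <- data.frame(x = 700)
predict(simple.lm, newdata = x.new, interval = "prediction", level = 0.95)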
But that assumes I either don't know, or don't care to include, the variability attached to my observation of X. When I have $\sigma_x^2$, I would like to carry it forward into the reported variance of Y, from which I can calculate the prediction interval for Y. I have three thoughts, but not enough statistical background to know which (if any) are appropriate.
First, based on this case and this case, I could explicitly calculate the variance of Y using the delta method, scaling the variance-covariance matrix by the variance of my new observation, $$ \widehat{V(\hat{\beta})} = V(\hat{\beta}) \cdot \frac{1}{\sigma_x^2}, $$ and then use the result as my $\sigma_y^2$.
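To make that concrete, here is a minimal sketch of the plain first-order (delta method) propagation for the linear fit; I'm not certain this is exactly what those answers intended, and var.x simply restates $\sigma_x^2 = 150$:

# Delta-method sketch: for a linear fit, dy/dx is the slope, so the
# variance y inherits from the uncertain x is slope^2 * sigma_x^2
var.x <- 150
slope <- coef(simple.lm)["x"]
var.y.from.x <- slope^2 * var.x   # first-order propagation of sigma_x^2
var.y.from.x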
Second, that suggestion appears as one possible answer for this case, but buried in the comments was a caution that the error variance for my new Y might simply be the sum of the "variance from the regression" and the "variance from the observation", $$ \sigma_y^2 = \sigma_{regression}^2 + \sigma_x^2, $$ and then, for either of these two methods, $$ \text{PI at 95\%} = 1.96 \cdot \sqrt{\sigma_y^2 + \sigma_{residuals}^2}. $$
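Taken literally, I imagine that combination looking something like the sketch below, where I've used var.y.from.x from the previous sketch as the "variance from the observation" (my interpretation, not something stated in the comment):

# Combine the variance of the fitted line, the propagated x variance,
# and the residual variance into an approximate 95% prediction interval
pred <- predict(simple.lm, newdata = data.frame(x = 700), se.fit = TRUE)
var.fit <- pred$se.fit^2             # variance from the regression line at x = 700
var.resid <- pred$residual.scale^2   # residual variance
pred$fit + c(-1.96, 1.96) * sqrt(var.y.from.x + var.fit + var.resid)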
Third, I could treat the system as an errors-in-variables problem and follow the suggestions on Wikipedia. I'm not a fan of this one: I'm less worried about error in the regression curve (which is based on a series of known X values) and more focused on how the variance of my new X is transformed by the regression into an appropriate prediction interval for the dependent Y.
EDIT: Finally, I'd like to be able to extend this to an "arbitrary" formula. Recognizing that I might have to move from lm() to nls(), can any of the ideas above also be used when the equation is quadratic ($y = a + bx + cx^2$) or exponential ($y = a + e^{-bx}$)?
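One fallback I can imagine for the general case is brute-force simulation, since predict() treats the fitted curve as a black box and works the same for lm() and nls() objects; the normal distribution for $x_{new}$ is an assumption on my part:

# Monte Carlo sketch: push draws of x_new through the fitted curve,
# add residual scatter, and read the prediction interval off quantiles
set.seed(1)
n.sim <- 10000
x.draw <- rnorm(n.sim, mean = 700, sd = sqrt(150))   # assumes x_new ~ Normal
y.sim <- predict(simple.lm, newdata = data.frame(x = x.draw))
y.sim <- y.sim + rnorm(n.sim, sd = summary(simple.lm)$sigma)   # residual noise
quantile(y.sim, c(0.025, 0.975))   # approximate 95% prediction interval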
lm implies you wish to predict $y$ based on the observed value of $x$, but then you turn around and ask how to predict $y$ based on the actual (unobserved) value of $x$. If that's the case, then lm is the wrong tool: you need to adopt an errors-in-variables model from the very outset. nls won't do that. Or, is it possible that your regression is based on $x$ values that are known and somehow the new value is a random variable? That would need some more explanation to understand how that arises. – whuber Jul 07 '23 at 18:14
I tried to simplify all that for the sake of the post. More correctly, my initial question would start from $y_{new} = 1.2$
– azabell Jul 07 '23 at 19:13