Given an existing regression curve, how do I properly account for the known variance I have in some new value of X? If I had an observation $x_{new} = 700$ with a variance $\sigma_x^2 = 150$ then how should I incorporate that information into my final answer for $y_{new}$?
I can build a (very simple) regression model with two vectors. Using R notation:
x <- c(8, 10, 50, 200, 350, 500, 1000, 2000)
y <- c(0.012, 0.016, 0.078, 0.333, 0.583, 0.799, 1.643, 3.002)
simple.lm <- lm(y ~ x, data = data.frame(x, y))
summary(simple.lm)
which returns
Call:
lm(formula = y ~ x, data = data.frame(x, y))
Residuals:
Min 1Q Median 3Q Max
-0.05657 -0.02774 -0.01223 0.01591 0.09954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.835e-02 2.349e-02 1.207 0.273
x 1.515e-03 2.856e-05 53.056 3.01e-09 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05184 on 6 degrees of freedom
Multiple R-squared: 0.9979, Adjusted R-squared: 0.9975
F-statistic: 2815 on 1 and 6 DF, p-value: 3.009e-09
I can use predict() to find any new value of Y given a value of X, and can find the prediction interval for that value of Y. Easy enough.
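For example, for the new observation at $x_{new} = 700$:

x.new <- data.frame(x = 700)
predict(simple.lm, newdata = x.new, interval = "prediction", level = 0.95)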
But that assumes I either don't know, or don't care to include, the variability attached to my observation of X. When I have $\sigma_x^2$, I would like to carry it forward into the reported variance of Y, from which I can calculate the prediction interval for Y. I have three thoughts, but not enough statistical background to know which (if any) are appropriate.
First, based on this case and this case, I could explicitly calculate the variance of Y using the delta method, scaling the variance-covariance matrix by the variance of my new observation, $$ \widehat{V(\hat{\beta})} = V(\hat{\beta}) \cdot \frac{1}{\sigma_x^2}, $$ and then use the result as my $\sigma_y^2$.
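To make that concrete, here is a minimal sketch of the plain first-order (delta method) propagation for the linear fit; I'm not certain this is exactly what those answers intended, and var.x simply restates $\sigma_x^2 = 150$:

# Delta-method sketch: for a linear fit, dy/dx is the slope, so the
# variance y inherits from the uncertain x is slope^2 * sigma_x^2
var.x <- 150
slope <- coef(simple.lm)["x"]
var.y.from.x <- slope^2 * var.x   # first-order propagation of sigma_x^2
var.y.from.x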
Second, that suggestion appears as one possible answer for this case, but buried in the comments was a caution that the error variance for my new Y might simply be the sum of the "variance from the regression" and the "variance from the observation", $$ \sigma_y^2 = \sigma_{regression}^2 + \sigma_x^2, $$ and then, for either of these two methods, $$ \text{PI at 95\%} = 1.96 \cdot \sqrt{\sigma_y^2 + \sigma_{residuals}^2}. $$
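Taken literally, I imagine that combination looking something like the sketch below, where I've used var.y.from.x from the previous sketch as the "variance from the observation" (my interpretation, not something stated in the comment):

# Combine the variance of the fitted line, the propagated x variance,
# and the residual variance into an approximate 95% prediction interval
pred <- predict(simple.lm, newdata = data.frame(x = 700), se.fit = TRUE)
var.fit <- pred$se.fit^2             # variance from the regression line at x = 700
var.resid <- pred$residual.scale^2   # residual variance
pred$fit + c(-1.96, 1.96) * sqrt(var.y.from.x + var.fit + var.resid)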
Third, I could treat the system as an errors-in-variables problem and follow the suggestions on Wikipedia. I'm not a fan of this one: I'm less worried about error in the regression curve (which is based on a series of known X values) and more focused on how the variance of my new X is transformed by the regression into an appropriate prediction interval for the dependent Y.
EDIT: Finally, I'd like to be able to extend this to an "arbitrary" formula. Recognizing that I might have to move from lm() to nls(), can any of the ideas above also be used when the equation is quadratic ($y = a + bx + cx^2$) or exponential ($y = a + e^{-bx}$)?
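One fallback I can imagine for the general case is brute-force simulation, since predict() treats the fitted curve as a black box and works the same for lm() and nls() objects; the normal distribution for $x_{new}$ is an assumption on my part:

# Monte Carlo sketch: push draws of x_new through the fitted curve,
# add residual scatter, and read the prediction interval off quantiles
set.seed(1)
n.sim <- 10000
x.draw <- rnorm(n.sim, mean = 700, sd = sqrt(150))   # assumes x_new ~ Normal
y.sim <- predict(simple.lm, newdata = data.frame(x = x.draw))
y.sim <- y.sim + rnorm(n.sim, sd = summary(simple.lm)$sigma)   # residual noise
quantile(y.sim, c(0.025, 0.975))   # approximate 95% prediction interval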
lm implies you wish to predict $y$ based on the observed value of $x$, but then you turn around and ask how to predict $y$ based on the actual (unobserved) value of $x$. If that's the case, then lm is the wrong tool: you need to adopt an errors-in-variables model from the very outset. nls won't do that. Or, is it possible that your regression is based on $x$ values that are known and somehow the new value is a random variable? That would need some more explanation to understand how that arises. – whuber Jul 07 '23 at 18:14
I tried to simplify all that for the sake of the post. More correctly, my initial question would start from $y_{new} = 1.2$
– azabell Jul 07 '23 at 19:13