I am working on a dataset for linear regression in R. After building the model (with the lm() function), I want to test my model on a new data point using the predict() function with a certain confidence interval. Is there a mechanism to verify whether predicting at this new data point is valid or not?

Nanda
  • Welcome to Cross Validated! What would it mean for the new data point to be valid or invalid? – Dave Dec 05 '23 at 19:08
  • Confidence intervals don't answer your question: you want a prediction interval. The new data point is consistent with the original model when it lies within that prediction interval. See https://stats.stackexchange.com/questions/16493. – whuber Dec 05 '23 at 20:44
  • You must do relatively stupid things to invalidate predict.lm(). – Michael M Dec 05 '23 at 20:59

1 Answer

I am working on a dataset for linear regression in R. After building the model (with the lm() function), I want to test my model on a new data point using the predict() function with a certain confidence interval. Is there a mechanism to verify whether predicting at this new data point is valid or not?

It is not clear what you mean by

using the predict() function with a certain confidence interval.

The predict function does not use a confidence interval. It uses the estimates from the fitted model to compute the value(s) of the response variable when the explanatory variable(s) take the values provided (via the newdata argument).

set.seed(15)
N <- 10

X1 <- rnorm(N,0,1)
X2 <- rnorm(N,0,2)
Y <- 10 + 2*X1 + 3*X2 + rnorm(N,0,1)

dt <- data.frame(Y, X1, X2)

m0 <- lm(Y ~ X1 + X2, data = dt)

summary(m0)

Here we fit the model and obtain these results:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.4936     0.2589  40.524 1.45e-09 ***
X1            2.0075     0.2792   7.191 0.000179 ***
X2            3.1711     0.1614  19.653 2.21e-07 ***

The equation for the fitted model is therefore:

10.4936 + 2.0075*X1 + 3.1711*X2

So if we were to set X1=2 and X2=3 we would have

10.4936 + 2.0075*2 + 3.1711*3

which equals 24.0219. Now if we want to

verify whether predicting at this new data point is valid or not?

We can do so like this:

predict(m0, newdata = data.frame(X1 = 2, X2 = 3))

This returns 24.02195, which matches the manual calculation up to rounding of the displayed coefficients, confirming that the predict function worked correctly.
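The same check can be done without rounding by taking the coefficients directly from the fitted object. The following is a minimal sketch that regenerates the simulated data from this answer; the names new_point, manual and pred are illustrative and not part of the original code.

```r
# Reproduce the simulated data and the fit from the answer
set.seed(15)
N <- 10
X1 <- rnorm(N, 0, 1)
X2 <- rnorm(N, 0, 2)
Y <- 10 + 2 * X1 + 3 * X2 + rnorm(N, 0, 1)
dt <- data.frame(Y, X1, X2)
m0 <- lm(Y ~ X1 + X2, data = dt)

# New data point (new_point is an illustrative name)
new_point <- data.frame(X1 = 2, X2 = 3)

# Manual prediction from the unrounded coefficients:
# intercept + b1 * X1 + b2 * X2
b <- coef(m0)
manual <- unname(b["(Intercept)"] + b["X1"] * new_point$X1 + b["X2"] * new_point$X2)

# Prediction from predict()
pred <- unname(predict(m0, newdata = new_point))

# TRUE when the two agree up to floating-point tolerance
isTRUE(all.equal(manual, pred))
```

Using coef(m0) rather than the printed table avoids the small discrepancy introduced by the four-decimal rounding in the summary output.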


Edit: After clarification in the comments to this answer:

I meant to ask of performance measures to conclude if 'predict()' is doing the correct thing. Like how the linear regression function 'lm()' has performance measures to quantify its behavior. Does my question make sense?

Unfortunately I still don't understand what you are asking. You say you want:

performance measures to conclude if 'predict()' is doing the correct thing.

Well, predict has only one job: to output a predicted value for the response, optionally with a confidence or prediction interval. I have demonstrated above that it is "doing the correct thing."
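For reference, here is a hedged sketch of how the two kinds of interval can be requested from predict.lm through its interval and level arguments. It regenerates the simulated data from this answer; the names new_point, conf_int and pred_int are illustrative.

```r
# Reproduce the simulated data and the fit from the answer
set.seed(15)
N <- 10
X1 <- rnorm(N, 0, 1)
X2 <- rnorm(N, 0, 2)
Y <- 10 + 2 * X1 + 3 * X2 + rnorm(N, 0, 1)
dt <- data.frame(Y, X1, X2)
m0 <- lm(Y ~ X1 + X2, data = dt)

new_point <- data.frame(X1 = 2, X2 = 3)

# Confidence interval for the mean response at the new point
conf_int <- predict(m0, newdata = new_point, interval = "confidence", level = 0.95)

# Prediction interval for a single new observation at the new point
pred_int <- predict(m0, newdata = new_point, interval = "prediction", level = 0.95)

# Each result is a one-row matrix with columns fit, lwr and upr
print(conf_int)
print(pred_int)
```

Both calls return the same fitted value. The prediction interval is wider because it also accounts for the residual variance of an individual observation, which is why it is the appropriate interval for judging whether a newly observed response is consistent with the model.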

Then you ask about whether

'lm()' has performance measures to quantify its behavior

Unfortunately, I don't know what you mean by this. If you are asking about the computational performance of lm itself, most of the computation is done in compiled code written in C or Fortran for performance reasons. Further details are here:

Least Squares Regression Step-By-Step Linear Algebra Computation
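As a rough illustration of the linear algebra involved, the sketch below recomputes the coefficients of the example model in two ways and compares them with lm. It is not a reproduction of lm's internals: the normal-equations route is shown only for comparison, and the names X, beta_ne and beta_qr are illustrative.

```r
# Reproduce the simulated data and the fit from the answer
set.seed(15)
N <- 10
X1 <- rnorm(N, 0, 1)
X2 <- rnorm(N, 0, 2)
Y <- 10 + 2 * X1 + 3 * X2 + rnorm(N, 0, 1)
dt <- data.frame(Y, X1, X2)
m0 <- lm(Y ~ X1 + X2, data = dt)

# Design matrix with an intercept column
X <- model.matrix(Y ~ X1 + X2, data = dt)

# Normal equations: solve (X'X) beta = X'Y
beta_ne <- drop(solve(crossprod(X), crossprod(X, Y)))

# QR decomposition of X, the numerically more stable route
beta_qr <- qr.coef(qr(X), Y)

# Both should agree with coef(m0) up to floating-point tolerance
all.equal(unname(beta_ne), unname(coef(m0)))
all.equal(unname(beta_qr), unname(coef(m0)))
```

On a small, well-conditioned problem like this one the two routes give the same answer; the QR approach is generally preferred because forming X'X can amplify rounding error when the predictors are nearly collinear.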

If you are asking about the performance of predict or something else, please clarify your question.

Robert Long
  • I don't get the feeling that the function doing the right calculation is in question. – Dave Dec 05 '23 at 20:19
  • @Dave I hear you. Hopefully my answer will prompt the OP to tell us what they mean by "test the validity" – Robert Long Dec 05 '23 at 20:23
  • predict.lm() has an option to also return interval estimates. – Michael M Dec 05 '23 at 20:58
  • @MichaelM I was being deliberately obtuse, hoping to elicit more details from the OP about what they are asking :) – Robert Long Dec 05 '23 at 21:58
  • I also haven't understood it :-) – Michael M Dec 05 '23 at 22:15
  • I meant to ask of performance measures to conclude if 'predict()' is doing the correct thing. Like how the linear regression function 'lm()' has performance measures to quantify its behavior. Does my question make sense? – Nanda Dec 06 '23 at 00:09
  • @Nanda Please clarify that as an edit to your original question. If you could expand on what you see as measures of performance in the lm function, that would be helpful. – Dave Dec 06 '23 at 11:33