
As a simplified example, assume I have a model with two linear terms and an interaction between them:

y ~ b0 + (b1 * x1) + (b2 * x2) + (b3 * x1*x2)

x1 and x2 have very different ranges and variances, so I scale and centre them before fitting the model. This has the benefit that, when generating predictions for one term, the other terms can be held at their means, which are 0. So, for example, to examine the effect of x1 in isolation I use:

y ~ b0 + (b1 * x1_new_data) + (b2 * 0) + (b3 * 0)

Here x1_new_data is a vector of 500 values between the minimum and maximum of the original x1, scaled and centred using the same mean and standard deviation as were used for x1.
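As a minimal sketch of the calculation just described, the following builds the 500-value grid and evaluates the prediction with x2 held at 0. The training values and the coefficients b0 to b3 are hypothetical numbers chosen only for illustration, not output from a fitted model.

```python
import statistics

# Hypothetical training data and coefficients, for illustration only.
x1_train = [12.0, 15.5, 9.0, 20.0, 17.5, 11.0]
b0, b1, b2, b3 = 2.0, 0.8, 0.5, -0.3

# Scaling constants come from the data used to fit the model.
x1_mean = statistics.mean(x1_train)
x1_sd = statistics.stdev(x1_train)

# 500 evenly spaced values across the original range of x1.
n = 500
lo, hi = min(x1_train), max(x1_train)
x1_raw = [lo + (hi - lo) * i / (n - 1) for i in range(n)]

# Scale and centre the new values with the training mean and sd.
x1_new_data = [(v - x1_mean) / x1_sd for v in x1_raw]

# x2 is held at its scaled mean of 0, so the b2 and b3 terms vanish.
x2_fixed = 0.0
y_pred = [b0 + b1 * x1 + b2 * x2_fixed + b3 * x1 * x2_fixed
          for x1 in x1_new_data]
```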

My question concerns the case when predicting the interaction term x1*x2: should I also provide new data for x1 and x2 rather than cancelling them out with 0? I can think of four options here:

  1. y ~ b0 + (b1 * 0) + (b2 * 0) + (b3 * x1*x2_new_data)
  2. y ~ b0 + (b1 * x1_new_data) + (b2 * x2_new_data) + (b3)
  3. y ~ b0 + (b1 * x1_new_data) + (b2 * x2_new_data) + (b3 * x1*x2_new_data)
  4. y ~ b0 + (b1 * x1_new_data) + (b2 * x2_new_data) + (b3 * (x1_new_data * x2_new_data))

For option 1, I am only predicting the interaction effect when the constituent linear terms are held at their mean values, which doesn't seem logical. I think option 2 is probably junk. Option 3 is complicated by the generation of the new data as shown in @Dave's answer. Option 4 is probably the most straightforward, provided it is correct to include the linear terms as well as the interaction when generating predictions.
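To make the options concrete, here is a minimal sketch of evaluating the model equation over a small grid of scaled x1 and x2 values. It assumes the interaction column in the fitted model was the product of the scaled predictors; the coefficients and the `predict` helper are hypothetical. Under that assumption the product is determined by x1 and x2, so it cannot be varied while both are held at 0.

```python
# Hypothetical coefficients, for illustration only.
b0, b1, b2, b3 = 2.0, 0.8, 0.5, -0.3


def predict(x1, x2):
    """Evaluate the model equation at scaled values x1 and x2.

    The interaction column is assumed to be the product of the scaled
    predictors, as it would be if the model were fitted on x1 * x2.
    """
    return b0 + b1 * x1 + b2 * x2 + b3 * (x1 * x2)


# Scaled values at which to predict: -1, 0 and +1 sd from the mean.
levels = [-1.0, 0.0, 1.0]

# Option 4 style: vary x1 and x2 together and let the product follow.
grid = {(x1, x2): predict(x1, x2) for x1 in levels for x2 in levels}

# Holding both linear terms at 0 also fixes the product at 0,
# so only the intercept remains.
at_means = predict(0.0, 0.0)
```

Plotting `grid` as one line per level of x2 is one way to view the interaction, since each line then shows how the slope in x1 changes with x2.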

I'm prompted to ask this because I have a case where y has a positive linear relationship with both x1 and x2, but the x1*x2 interaction has a strong negative curve over some range of the values. This gives seemingly unrealistic predictions: y has a positive response to x1 and x2, both in isolation and together, but a negative response to x1*x2.

EcologyTom

1 Answer


You apply the same transformations to your out-of-sample data as you did to your in-sample data. This means that if you demean your variables, you subtract out the in-sample mean, not the out-of-sample mean. After all, you told the model to find the best regression parameter for an interaction between $x_1-\bar x_{1,in}$ and $x_2-\bar x_{2,in}$, not for an interaction between $x_1-\bar x_{1,out}$ and $x_2-\bar x_{2,out}$.

Ditto when it comes to dividing by the standard deviations.
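A minimal sketch of that rule, using made-up numbers and a hypothetical `scale` helper: the mean and standard deviation are estimated once from the in-sample data and then reused for the out-of-sample data. The last line shows, for contrast only, what happens if the out-of-sample statistics are used instead.

```python
import statistics

# Hypothetical in-sample and out-of-sample values of one predictor.
x_in = [2.0, 4.0, 6.0, 8.0]
x_out = [10.0, 12.0]

# Centre and scale constants are estimated from the in-sample data only.
mean_in = statistics.mean(x_in)
sd_in = statistics.stdev(x_in)


def scale(values, mean, sd):
    """Subtract the given mean and divide by the given sd."""
    return [(v - mean) / sd for v in values]


x_in_scaled = scale(x_in, mean_in, sd_in)
x_out_scaled = scale(x_out, mean_in, sd_in)  # reuse in-sample constants

# For contrast: scaling with the out-of-sample statistics puts the new
# points on a different scale from the one the coefficients refer to.
x_out_wrong = scale(x_out, statistics.mean(x_out), statistics.stdev(x_out))
```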

This question is somewhat more advanced, but the spirit is the same.

Dave
  • Thanks for this, I'm afraid my question was unclear. I know about the need to use the same scale/centre values for the out-of-sample data. I've edited my question, so hopefully it is clearer now? – EcologyTom Oct 24 '22 at 11:11
  • @EcologyTom You seem to be concerned about some kind of curvature, which is a different issue that warrants a separate question. – Dave Oct 24 '22 at 11:12
  • This is the question I want to ask. Which of the four options in the question is the correct way to generate predictions for the interaction term, please? Not "How do I generate the new data?", which your answer addresses, but "Which sets of new data should I include when predicting?" – EcologyTom Oct 24 '22 at 11:18