
This is something I have been thinking about for a while - suppose we fit a regression model and then plot the prediction interval for a given variable vs the response: https://strengejacke.github.io/ggeffects/articles/introduction_plotmethod.html


Whether this is done for a linear regression model, a logistic regression model, a survival regression model, etc., the prediction intervals always seem to be made for the average individual in a subgroup.

As an example:

  • fit a regression model on the whole dataset (m1)
  • then, isolate all rows of data from your dataset where gender = male (i.e. male subgroup)
  • take the average value of all predictor variables for this dataset of males, except one of the predictors (p1)
  • then, for p1 - have your model (m1) predict the response value for different values of this predictor
  • at each one of these predicted values, calculate the prediction interval
  • finally, make a plot of the predicted response vs the different values of the chosen predictor (and overlay the prediction interval)
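The steps above can be sketched in base R as follows. This is only an illustration on simulated data: the data frame `dat`, the predictors `p1` and `p2`, the coefficients, and the use of `lm` are all assumptions made for the example, not part of any particular package's plotting method.

```r
# Minimal sketch on simulated data; dat, p1, p2 and all coefficients are hypothetical.
set.seed(1)
n <- 200
dat <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  p1 = rnorm(n, mean = 5, sd = 2),
  p2 = rnorm(n, mean = 10, sd = 3)
)
dat$y <- 1 + 0.5 * dat$p1 - 0.3 * dat$p2 + 2 * (dat$gender == "male") + rnorm(n)

# Step 1: fit the model on the whole dataset
m1 <- lm(y ~ p1 + p2 + gender, data = dat)

# Steps 2-3: isolate the male subgroup and average the remaining predictor
males <- dat[dat$gender == "male", ]
p2_male_mean <- mean(males$p2)

# Step 4: vary p1 over a grid while holding p2 at the male average
grid <- data.frame(
  p1 = seq(min(dat$p1), max(dat$p1), length.out = 50),
  p2 = p2_male_mean,
  gender = factor("male", levels = levels(dat$gender))
)

# Step 5: 95% prediction intervals for a single new observation at each grid point
pred <- predict(m1, newdata = grid, interval = "prediction", level = 0.95)
grid <- cbind(grid, pred)  # adds columns fit, lwr, upr

# Step 6: predicted response against p1, with the interval overlaid
plot(grid$p1, grid$fit, type = "l", ylim = range(grid$lwr, grid$upr),
     xlab = "p1", ylab = "predicted y")
lines(grid$p1, grid$lwr, lty = 2)
lines(grid$p1, grid$upr, lty = 2)
```

In `predict.lm`, `interval = "prediction"` targets one new response at the given covariate values, whereas `interval = "confidence"` would give the (narrower) band for the mean response at those values.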

I am trying to understand why this is: why is the prediction interval based on the average individual in a given subgroup?

Informally, I always thought that this was for common-sense reasons. A prediction interval for an individual observation did not seem logical to me: it is like using the behavior of an individual to extrapolate the behavior of the population (ecological fallacy). So why not take the average person in a given subgroup and calculate the prediction intervals for this average person? This way, you are no longer caught in the ecological fallacy, and it also becomes easier to justify your results.

However, now I think there might be another reason for this. Perhaps prediction intervals made for the average person in a subgroup are narrower (narrower being more desirable) than the prediction interval for any single individual, or any set of individuals, in the subgroup?

In this link (https://online.stat.psu.edu/stat501/lesson/3/3.3), the formula of the prediction interval is given by:

$$\hat{y}_h \pm t_{(1-\alpha/2, n-2)} \times \sqrt{MSE \times \left(1+ \frac{1}{n} + \dfrac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}$$

Here, we can see that the prediction interval at $x_h = \bar{x}$ is the narrowest, because the term $\dfrac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2}$ becomes 0 when $x_h = \bar{x}$. Conversely, as the absolute distance between $x_h$ and $\bar{x}$ grows, the prediction interval widens. Thus, it seems desirable to create prediction intervals for the average observation rather than for individuals further away from the average.
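As a worked check (a sketch that only rearranges the formula quoted above), setting $x_h = \bar{x}$ gives the smallest half-width, which is still strictly positive:

$$t_{(1-\alpha/2, n-2)} \times \sqrt{MSE \times \left(1+ \frac{1}{n}\right)}$$

For any other $x_h$, the squared half-width exceeds this minimum by

$$t_{(1-\alpha/2, n-2)}^2 \times MSE \times \dfrac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2},$$

so the interval widens symmetrically on either side of $\bar{x}$, and the contribution of the leading $1$ (the new observation's own noise) never vanishes, however large $n$ is.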

Here is an R simulation I made to illustrate this idea:

library(ggplot2)

# Illustrative constants, not estimated from data
n <- 100
mse <- 0.05
x_bar <- 0
sum_xi_xbar <- 50  # stands in for sum((x_i - x_bar)^2)
t_alpha <- 1.96    # normal approximation to the t quantile

x_h <- seq(from = x_bar - 10, to = x_bar + 10, by = 0.1)

# Half-width of the interval around the prediction (plotted relative to y_hat = 0)
half_width <- t_alpha * sqrt(mse * (1 + 1/n + ((x_h - x_bar)^2) / sum_xi_xbar))

df <- data.frame(x_h = x_h, upper = half_width, lower = -half_width)

ggplot(df, aes(x = x_h)) +
  geom_point(aes(y = 0, color = "Predicted"), size = 1) +
  geom_line(aes(y = upper, color = "Upper")) +
  geom_line(aes(y = lower, color = "Lower")) +
  geom_line(aes(y = 0, color = "Predicted"), linetype = "dashed") +
  geom_vline(xintercept = x_bar, linetype = "dotted") +
  labs(title = paste("Prediction Intervals for y = b0 + b1*x_h (where: n =", n, ", mse =", mse, ", x_bar =", x_bar, ", sum((xi-xbar)^2) =", sum_xi_xbar, ")"), x = "x_h", y = "interval half-width") +
  theme_minimal() +
  scale_color_manual(values = c("Upper" = "blue", "Lower" = "red", "Predicted" = "green")) +
  theme(legend.title = element_blank())


The simulation confirms the idea - the size of the prediction interval is smallest around the average, and the size of the prediction interval increases as you move away from the average.

I wonder - can this logic be extended to show that the average person in a subgroup will have the narrowest prediction interval, compared with the prediction interval of any individual observation in the subgroup, or with that of the average of some random group of individuals in that subgroup?

Using this link (Obtaining a formula for prediction limits in a linear model (i.e.: prediction intervals)), this is the (multivariate) prediction interval for an individual observation, where $\mathbf{X}^*$ is a row vector of covariate values for the observation being predicted:

$$ \hat{y} \pm 1.96 \hat{\sigma} \sqrt{1 + \mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)'}. $$

Here $\mathbf{X}$ is the design matrix of the fitted model (one row per observation, $p$ columns, including the intercept column if there is one), so the quadratic form under the square root is:

$$\mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)' = \begin{bmatrix} x_{1}^* & x_{2}^* & \cdots & x_{p}^* \end{bmatrix} \left( \mathbf{X}'\mathbf{X} \right)^{-1} \begin{bmatrix} x_{1}^* \\ x_{2}^* \\ \vdots \\ x_{p}^* \end{bmatrix}$$

In the general case, can we mathematically prove that when $\begin{bmatrix} x_{1}^* & x_{2}^* & \cdots & x_{p}^* \end{bmatrix}$ is the average for a subgroup, the resulting prediction interval will be the narrowest (compared with any single individual in that subgroup, or with the average of any collection of individuals in that subgroup)?
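One way to probe this numerically is through a standard identity for models with an intercept: writing $\mathbf{X}^* = (1, z^*)$, the quadratic form equals $\frac{1}{n} + (z^* - \bar{z})' S^{-1} (z^* - \bar{z})$, where $\bar{z}$ is the mean covariate vector over the whole fitting sample and $S$ is the cross-product matrix of the centred covariates. The sketch below checks this on simulated data. The covariates `x1` and `x2`, the subgroup definition, and the helper names `lev_direct` and `lev_decomp` are hypothetical, and an intercept is assumed. Under these assumptions the term is smallest at the overall sample mean, so a subgroup mean would attain the minimum only if it happened to coincide with it.

```r
# Sketch on simulated data; x1, x2, sub, lev_direct and lev_decomp are hypothetical.
set.seed(42)
n  <- 150
x1 <- rnorm(n, mean = 2, sd = 1)
x2 <- rnorm(n, mean = -1, sd = 3)
X  <- cbind(1, x1, x2)           # design matrix with an intercept column
XtX_inv <- solve(crossprod(X))   # (X'X)^{-1}

# Direct evaluation of X* (X'X)^{-1} (X*)' for covariate values z = (x1*, x2*)
lev_direct <- function(z) {
  xs <- c(1, z)
  drop(t(xs) %*% XtX_inv %*% xs)
}

# Same quantity via the centred form: 1/n + (z - zbar)' S^{-1} (z - zbar)
Z     <- cbind(x1, x2)
zbar  <- colMeans(Z)
S_inv <- solve(crossprod(sweep(Z, 2, zbar)))
lev_decomp <- function(z) {
  d <- z - zbar
  1 / n + drop(t(d) %*% S_inv %*% d)
}

z_test <- c(3.5, 0.2)
c(direct = lev_direct(z_test), decomposed = lev_decomp(z_test))

# At the overall sample mean the term reduces to 1/n
c(at_overall_mean = lev_direct(zbar), one_over_n = 1 / n)

# At the mean of an arbitrary subgroup (here: rows with x1 > 2) it cannot be smaller
sub <- x1 > 2
lev_direct(colMeans(Z[sub, ]))
```

Because $S^{-1}$ is positive definite, the second term of the decomposition is non-negative and vanishes only at $z^* = \bar{z}$, which is what the last two comparisons illustrate.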

Note:

  • Actually, you can calculate prediction intervals for individuals, depending on what you mean by an "individual". Take a look at hierarchical forecasting, where "grouping" happens along one or multiple hierarchies, and you may be interested in PIs both on lower/disaggregate and higher/aggregate levels. As an example, you might be predicting demand for a specific SKU (for replenishment), but also in total demand for all SKUs (for workforce planning). What PI you want depends entirely on what you are predicting for. – Stephan Kolassa Jan 13 '24 at 19:55
  • Could you please explain what "average person in a subgroup" means? – whuber Jan 13 '24 at 20:49
  • e.g. suppose your model is y = b0 +b1x1 + b2x2 + gender*x3 .... if x3 = 'male' ... then the "average person in the male subgroup" is : (average x1 value of all males, average x2 value of all males) – Uk rain troll Jan 13 '24 at 20:51
  • I am puzzled, because this formulation doesn't seem to describe prediction intervals. Be aware that a prediction interval targets another random variable. In regression applications you might want an interval for one future value, or for the mean of $k$ future values, or the largest of $k$ future values, among other things. I haven't been able to determine what the target of your "prediction intervals" is intended to be. – whuber Jan 13 '24 at 20:57
  • so in this case, you would fit the model... then for the average male calculate the average value of x2 ... and for this, calculate the prediction intervals for different values of x1 vs the response – Uk rain troll Jan 13 '24 at 21:01
  • I'm sorry, but I can't make any sense of that. I looked at the page you reference and it doesn't even concern prediction intervals, which makes me wonder about the sense in which you use the term. However, you do quote correct formulas for OLS prediction intervals for a conditional response. It looks like you're asking whether $(x_h-\bar x)^2$ is minimized when $x_h=\bar x,$ but that interpretation would make the whole question trivial. – whuber Jan 13 '24 at 22:44
