This is something I have been thinking about for a while. Suppose we fit a regression model and then plot the predicted response, with its prediction interval, against a given predictor: https://strengejacke.github.io/ggeffects/articles/introduction_plotmethod.html
Whether this is done for a linear regression model, a logistic regression model, a survival regression model, etc., the prediction intervals always seem to be computed for the average individual in a subgroup.
As an example:
- fit a regression model on the whole dataset (m1)
- then, isolate all rows of data from your dataset where gender = male (i.e. male subgroup)
- take the average value of all predictor variables for this dataset of males, except one of the predictors (p1)
- then, for p1 - have your model (m1) predict the response value for different values of this predictor
- at each one of these predicted values, calculate the prediction interval
- finally, make a plot of the predicted response vs the different values of the chosen predictor (and overlay the prediction interval)
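The steps above can be sketched in R as follows. This is only a minimal illustration under stated assumptions: the data are simulated, the column names (`gender`, `p1`, `p2`, `y`) and coefficients are made up for the example, and it uses base R `lm()` and `predict()` rather than ggeffects.

```r
set.seed(1)

# Simulated data; all names and coefficients here are hypothetical
n   <- 200
dat <- data.frame(
  gender = factor(sample(c("male", "female"), n, replace = TRUE)),
  p1     = rnorm(n, mean = 50, sd = 10),
  p2     = rnorm(n, mean = 5,  sd = 2)
)
dat$y <- 1 + 0.3 * dat$p1 - 0.8 * dat$p2 + 2 * (dat$gender == "male") + rnorm(n)

# Step 1: fit the model on the whole dataset
m1 <- lm(y ~ p1 + p2 + gender, data = dat)

# Step 2: isolate the male subgroup
males <- dat[dat$gender == "male", ]

# Steps 3-4: hold the other predictor at its male average, vary p1 over a grid
grid <- data.frame(
  p1     = seq(min(males$p1), max(males$p1), length.out = 50),
  p2     = mean(males$p2),
  gender = factor("male", levels = levels(dat$gender))
)

# Step 5: predicted response and 95% prediction interval at each grid point
pred <- cbind(grid, predict(m1, newdata = grid, interval = "prediction", level = 0.95))

# Step 6: plot the prediction and its interval against p1
plot(pred$p1, pred$fit, type = "l", ylim = range(pred$lwr, pred$upr),
     xlab = "p1", ylab = "predicted y")
lines(pred$p1, pred$lwr, lty = 2)
lines(pred$p1, pred$upr, lty = 2)
```

Note that the model is fitted on everyone, while only the values plugged into `newdata` come from the male subgroup.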
I am trying to understand why this is: why is the prediction interval based on the average individual in a given subgroup?
Informally, I always thought that this was for common-sense reasons. A prediction interval built around one particular individual does not seem logical: it is like using the behavior of one individual to extrapolate the behavior of the population (strictly speaking this is the reverse of the ecological fallacy, which infers individual behavior from group-level data; it is sometimes called the atomistic fallacy). Thus, why not take the average person in a given subgroup and calculate the prediction intervals for this average person? This way, you avoid that fallacy and it also becomes easier to justify your results.
However, now I think there might be another reason for this. Perhaps prediction intervals made for the average person in a subgroup are narrower (and narrower is more desirable) than the prediction interval for any single individual in the subgroup, or for a set of individuals in the subgroup?
In this link (https://online.stat.psu.edu/stat501/lesson/3/3.3), the formula of the prediction interval for simple linear regression is given by:
$$\hat{y}_h \pm t_{(1-\alpha/2, n-2)} \times \sqrt{MSE \times \left(1+ \frac{1}{n} + \dfrac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}$$
Here, $\bar{x}$ is the mean of the predictor over the whole sample used to fit the model. We can see that the prediction interval at $x_h = \bar{x}$ will be the narrowest, since the term $\dfrac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2}$ becomes 0 there (the interval does not vanish, because the $1 + \frac{1}{n}$ part remains). As the absolute distance between $x_h$ and $\bar{x}$ grows, the prediction interval widens. Thus, it seems desirable to create prediction intervals for the average observation rather than for individuals further away from the average observation.
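To make the comparison explicit, write $W(x_h)$ for the half-width of the interval in the formula above (the symbol $W$ is introduced here only for illustration). Squaring removes the square root and leaves a single non-negative term:

$$W(x_h)^2 - W(\bar{x})^2 = t_{(1-\alpha/2, n-2)}^2 \times MSE \times \dfrac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2} \ge 0, \qquad W(\bar{x}) = t_{(1-\alpha/2, n-2)} \times \sqrt{MSE \times \left(1+ \frac{1}{n}\right)}$$

So $W(x_h) \ge W(\bar{x})$ for every $x_h$, with equality only at $x_h = \bar{x}$.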
Here is an R simulation I made to illustrate this idea:

    library(ggplot2)

    # Illustrative constants plugged into the formula above (not estimated from data)
    n <- 100
    mse <- 0.05
    x_bar <- 0
    sum_xi_xbar <- 50   # stands for sum((x_i - x_bar)^2)
    t_alpha <- 1.96     # normal approximation to the t quantile

    x_h <- seq(from = x_bar - 10, to = x_bar + 10, by = 0.1)
    # Half-width of the prediction interval at each x_h
    half_width <- t_alpha * sqrt(mse * (1 + 1/n + ((x_h - x_bar)^2) / sum_xi_xbar))
    df <- data.frame(x_h = x_h, upper = half_width, lower = -half_width)

    # The predicted value is centered at 0 so that only the interval width is shown
    ggplot(df, aes(x = x_h)) +
      geom_point(aes(y = 0, color = "Predicted"), size = 1) +
      geom_line(aes(y = upper, color = "Upper")) +
      geom_line(aes(y = lower, color = "Lower")) +
      geom_line(aes(y = 0, color = "Predicted"), linetype = "dashed") +
      geom_vline(xintercept = x_bar, linetype = "dotted") +
      labs(title = paste0("Prediction intervals for y = b0 + b1*x_h (n = ", n, ", mse = ", mse, ", x_bar = ", x_bar, ", sum((xi-xbar)^2) = ", sum_xi_xbar, ")"),
           x = "x_h", y = "interval half-width (centered at 0)") +
      theme_minimal() +
      scale_color_manual(values = c("Upper" = "blue", "Lower" = "red", "Predicted" = "green")) +
      theme(legend.title = element_blank())

The simulation confirms the idea: the prediction interval is narrowest at the average, and it widens as you move away from the average.
I wonder: can this logic be extended to show that the average person in a subgroup will have a narrower prediction interval than any individual observation in the subgroup, or than the average of some random group of individuals in that subgroup?
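As a cross-check on fitted output rather than plugged-in constants, the sketch below simulates a small dataset (the seed, sample size, and coefficients are arbitrary assumptions), fits `lm()`, and compares the half-width from the formula with the one returned by `predict(..., interval = "prediction")`. It uses the exact t quantile instead of 1.96.

```r
set.seed(42)

# Simulated simple linear regression data (arbitrary choices)
n <- 100
x <- rnorm(n, mean = 0, sd = 1)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.25)
fit <- lm(y ~ x)

# Ingredients of the formula, taken from the fitted model
mse    <- sum(residuals(fit)^2) / (n - 2)
x_bar  <- mean(x)
sxx    <- sum((x - x_bar)^2)
t_crit <- qt(0.975, df = n - 2)

# Evaluate at the sample mean (third entry) and at points further away
x_h <- x_bar + c(-3, -1, 0, 1, 3)
half_formula <- t_crit * sqrt(mse * (1 + 1/n + (x_h - x_bar)^2 / sxx))

# Same quantity from predict(): (upper - lower) / 2
pi_mat <- predict(fit, newdata = data.frame(x = x_h),
                  interval = "prediction", level = 0.95)
half_predict <- unname((pi_mat[, "upr"] - pi_mat[, "lwr"]) / 2)

print(cbind(x_h, half_formula, half_predict))
```

The two half-width columns should agree up to floating-point error, with the smallest value in the row where `x_h` equals `x_bar`.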
According to this link (Obtaining a formula for prediction limits in a linear model (i.e.: prediction intervals)), the prediction interval for an individual observation in the multiple-predictor case is as follows, where $\mathbf{X}^*$ is the row vector of covariate values for the observation being predicted:
$$ \hat{y} \pm 1.96 \hat{\sigma} \sqrt{1 + \mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)'}. $$
Writing the quadratic form out for $p$ predictors, assuming the model includes an intercept (hence the leading 1) and with $\mathbf{X}$ the $n \times (p+1)$ design matrix of the data used to fit the model:
$$\mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)' = \begin{bmatrix} 1 & x_{1}^* & \cdots & x_{p}^* \end{bmatrix} \left( \mathbf{X}'\mathbf{X} \right)^{-1} \begin{bmatrix} 1 \\ x_{1}^* \\ \vdots \\ x_{p}^* \end{bmatrix}$$
In the general case, can we mathematically prove that when $\begin{bmatrix} 1 & x_{1}^* & \cdots & x_{p}^* \end{bmatrix}$ holds the predictor averages for a subgroup, the resulting prediction interval will be the narrowest (compared to any single individual in that subgroup or any collection of individuals in that subgroup)?
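One way to probe this numerically before attempting a proof is to compute the quadratic form $\mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)'$ directly for a few candidate rows. The sketch below is an illustration on simulated data, not a proof: the predictors `x1` and `x2`, the 0/1 `male` indicator, and the helper `leverage_at()` are all hypothetical names, and an intercept is assumed. It evaluates the form at the overall sample mean, at the male subgroup mean, and at every individual male.

```r
set.seed(7)

# Simulated predictors; subgroup membership shifts x1 so the means differ
n    <- 300
male <- rbinom(n, 1, 0.5)
x1   <- rnorm(n, mean = 10 + 2 * male, sd = 3)
x2   <- rnorm(n, mean = 5, sd = 1)

# Design matrix with an intercept column, and (X'X)^{-1}
X       <- cbind(1, x1, x2, male)
XtX_inv <- solve(crossprod(X))

# Hypothetical helper: x* (X'X)^{-1} x*' for one row vector x*
leverage_at <- function(x_star) drop(t(x_star) %*% XtX_inv %*% x_star)

overall_mean <- colMeans(X)
male_mean    <- colMeans(X[male == 1, , drop = FALSE])
male_rows    <- apply(X[male == 1, , drop = FALSE], 1, leverage_at)

c(overall_mean    = leverage_at(overall_mean),   # equals 1/n with an intercept
  male_mean       = leverage_at(male_mean),
  min_male_indiv  = min(male_rows),
  mean_male_indiv = mean(male_rows))
```

With an intercept in the model, the value at the overall mean is $1/n$, and no other row can go below it, so a subgroup mean can only be the global minimizer if it coincides with the overall mean. The remaining entries let you inspect how the subgroup mean compares with the individuals inside that subgroup.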
Note:
- "The closer $X_h$ is to the average of the sample's predictor values, the narrower the interval" (https://online.stat.psu.edu/stat462/node/150/ - yet no proof is provided)


