
Motivation

Everyone knows that fitting high variance models requires more data. A "yes" answer to the question would suggest that more data is also needed to evaluate these models.

Assumptions

  1. $\forall i \neq j, Y_i \perp Y_j$. This is just a usual independence assumption.
  2. $\forall i, j, Y_i \perp \hat{Y}_j$. This should come from the fact that outcomes (in our evaluation or "test" dataset) are independent of data used to fit a model which supplies predictions.

Feel free to assume a bit more if necessary, e.g., we're fitting a Gaussian linear model. Keep in mind that because this problem is about prediction, it's unreasonable to assume $\text{E}(Y_i) = \text{E}(Y_j)$ or $\text{E}(\hat{Y}_i) = \text{E}(\hat{Y}_j)$ for $i \neq j$.

Hypothesis

Variance in predictions contributes to variance in error, as error is a function of predictions.

Progress

Math

Is there a way to decompose, for example, $\text{Var}\Big( \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \Big)$, into a sum including $\text{Var}(\hat{Y}_i)$? So far I've broken this down into a bunch of granular terms and am trying to consolidate them into something clearer. For notation's sake, let $\epsilon_i = Y_i - \hat{Y}_i$.

$$ \begin{equation} \text{Var}\Big( \frac{1}{n} \sum_{i=1}^{n} \epsilon_i^2 \Big) = \frac{1}{n^2} \Big( \sum_{i=1}^{n} \text{Var}(\epsilon_i^2) + 2\sum_{i < j} \text{Cov}(\epsilon_i^2, \epsilon_j^2) \Big). \end{equation} $$

Breaking down $\text{Var}(\epsilon_i^2)$:

$$ \begin{align} \text{Var}(\epsilon_i^2) &= \text{Var}(Y_i^2 - 2Y_i\hat{Y}_i + \hat{Y}_i^2) \\ &= \text{Var}(Y_i^2) + 4\text{Var}(Y_i\hat{Y}_i) + \text{Var}(\hat{Y}_i^2) - 4\text{Cov}(Y_i^2, Y_i\hat{Y}_i) - 4\text{Cov}(\hat{Y}_i^2, Y_i\hat{Y}_i) \\ &= \text{Var}(Y_i^2) + 4\big( \text{Var}(Y_i)\text{Var}(\hat{Y}_i) + \text{E}(\hat{Y}_i)^2 \text{Var}(Y_i) + \text{E}(Y_i)^2 \text{Var}(\hat{Y}_i) \big) + \text{Var}(\hat{Y}_i^2) \\ &\qquad - 4\text{E}(\hat{Y}_i)\text{Cov}(Y_i^2, Y_i) - 4\text{E}(Y_i)\text{Cov}(\hat{Y}_i^2, \hat{Y}_i). \end{align} $$

Note that the cross term $2\text{Cov}(Y_i^2, \hat{Y}_i^2)$ is omitted because it's zero under assumption 2. I used the formula for the variance of a product of independent random variables to break down $\text{Var}(Y_i\hat{Y}_i)$, and the independence assumptions to break down the two covariance terms; both facts are spelled out below.
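Spelled out, for independent random variables $A \perp B$:

$$ \begin{align} \text{Var}(AB) &= \text{Var}(A)\text{Var}(B) + \text{Var}(A)\text{E}(B)^2 + \text{Var}(B)\text{E}(A)^2, \\ \text{Cov}(A^2, AB) &= \text{E}(B)\,\text{Cov}(A^2, A). \end{align} $$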

Breaking down the covariances $\text{Cov}(\epsilon_i^2, \epsilon_j^2)$ where $i < j$, thanks to @gunes' answer (which I think used the above referenced fact about covariance):

$$ \begin{align} \text{Cov}(\epsilon_i^2, \epsilon_j^2) &= \text{Cov}(Y_i^2 - 2Y_i\hat{Y}_i + \hat{Y}_i^2, Y_j^2 - 2Y_j\hat{Y}_j + \hat{Y}_j^2) \\ &= \text{Cov}(Y_i^2, Y_j^2) - 2\text{Cov}(Y_i^2, Y_j\hat{Y}_j) + \text{Cov}(Y_i^2, \hat{Y}_j^2) \\ &\qquad - 2\text{Cov}(Y_i\hat{Y}_i, Y_j^2) + 4\text{Cov}(Y_i\hat{Y}_i, Y_j\hat{Y}_j) - 2\text{Cov}(Y_i\hat{Y}_i, \hat{Y}_j^2)\\ &\qquad + \text{Cov}(\hat{Y}_i^2, Y_j^2) - 2\text{Cov}(\hat{Y}_i^2, Y_j\hat{Y}_j) + \text{Cov}(\hat{Y}_i^2, \hat{Y}_j^2) \\ &= 0 -2(0) + 0 \\ &\qquad - 2(0) + 4\text{Cov}(Y_i\hat{Y}_i, Y_j\hat{Y}_j) - 2\text{Cov}(Y_i\hat{Y}_i, \hat{Y}_j^2) \\ &\qquad + 0 - 2\text{Cov}(\hat{Y}_i^2, Y_j\hat{Y}_j) + \text{Cov}(\hat{Y}_i^2, \hat{Y}_j^2) \\ &= 4\text{E}(Y_i)\text{E}(Y_j)\text{Cov}(\hat{Y}_i, \hat{Y}_j) - 2\text{E}(Y_i)\text{Cov}(\hat{Y}_i, \hat{Y}_j^2) \\ &\qquad - 2\text{E}(Y_j)\text{Cov}(\hat{Y}_i^2, \hat{Y}_j) + \text{Cov}(\hat{Y}_i^2, \hat{Y}_j^2). \end{align} $$
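So, under the assumptions, only terms involving covariances between predictions survive, which makes sense: predictions at different test points share the same fitted coefficients. As a quick sanity check of that last line, here's a small Monte Carlo sketch in R under an assumed toy setup (independent outcomes with unequal means; correlated, jointly Gaussian predictions independent of the outcomes). All names and numbers below are made up for illustration.

# Monte Carlo check of the final expression for Cov(eps_i^2, eps_j^2)
set.seed(123)
n_mc = 1e6
mu_y = c(1, 3)  # E(Y_i), E(Y_j), deliberately unequal
y_i = rnorm(n_mc, mean=mu_y[1], sd=1)
y_j = rnorm(n_mc, mean=mu_y[2], sd=1.5)
# correlated predictions, independent of (y_i, y_j)
z = rnorm(n_mc)
yhat_i = 0.8 + 0.7*z + 0.3*rnorm(n_mc)
yhat_j = 2.5 - 0.4*z + 0.3*rnorm(n_mc)

eps2_i = (y_i - yhat_i)^2
eps2_j = (y_j - yhat_j)^2

lhs = cov(eps2_i, eps2_j)
rhs = 4*mu_y[1]*mu_y[2]*cov(yhat_i, yhat_j) -
  2*mu_y[1]*cov(yhat_i, yhat_j^2) -
  2*mu_y[2]*cov(yhat_i^2, yhat_j) +
  cov(yhat_i^2, yhat_j^2)
c(lhs=lhs, rhs=rhs)  # should agree up to Monte Carlo error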

At this point, employing the assumption that we're fitting a Gaussian linear model may help.
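For example, if the predictions are jointly Gaussian (as they are in a Gaussian linear model, since each $\hat{Y}$ is a linear function of the training outcomes), Isserlis' theorem reduces the higher-order covariances in the last line to functions of $\text{Cov}(\hat{Y}_i, \hat{Y}_j)$ and the means:

$$ \begin{align} \text{Cov}(\hat{Y}_i, \hat{Y}_j^2) &= 2\text{E}(\hat{Y}_j)\text{Cov}(\hat{Y}_i, \hat{Y}_j), \\ \text{Cov}(\hat{Y}_i^2, \hat{Y}_j^2) &= 2\text{Cov}(\hat{Y}_i, \hat{Y}_j)^2 + 4\text{E}(\hat{Y}_i)\text{E}(\hat{Y}_j)\text{Cov}(\hat{Y}_i, \hat{Y}_j), \end{align} $$

so $\text{Cov}(\epsilon_i^2, \epsilon_j^2)$ would collapse to a polynomial in $\text{Cov}(\hat{Y}_i, \hat{Y}_j)$ and the means of $Y_i$, $Y_j$, $\hat{Y}_i$, $\hat{Y}_j$.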

Simulation - LASSO

See this Python notebook.

[Figure: variance of predictions vs. variance of error]

Problems with this simulation:

  1. I expected the variance in predictions to decrease as the $l_1$ penalization coefficient (alpha in scikit-learn) increases. This simulation does not show that property (see the sketch after this list for a more direct way to check it).
  2. The plot is quite sensitive to dataset parameters. I can explore this more. But overall, the simulation doesn't provide consistent evidence for or against the hypothesis that higher variance in predictions causes higher variance in error.
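To make "variance in predictions" concrete, here's a rough R sketch of how one could estimate it directly: simulate many training sets, predict at the same fixed test points, and sum the per-point variances across fits, for a few LASSO penalties. This is not the notebook's code; it uses glmnet (whose lambda plays the role of scikit-learn's alpha), and every name and number below is made up for illustration.

library(glmnet)

set.seed(1)
n_train = 100; n_test = 50; p = 10; n_reps = 200
lambdas = c(0.001, 0.01, 0.1, 1)
X_test = matrix(rnorm(n_test * p), n_test, p)
beta = c(rep(1, 3), rep(0, p - 3))  # assumed sparse true coefficients

# preds[r, , k] = predictions on X_test from the r-th training set at lambda k
preds = array(NA, dim=c(n_reps, n_test, length(lambdas)))
for (r in 1:n_reps) {
  X_tr = matrix(rnorm(n_train * p), n_train, p)
  y_tr = as.numeric(X_tr %*% beta + rnorm(n_train, sd=2))
  for (k in seq_along(lambdas)) {
    fit = glmnet(X_tr, y_tr, alpha=1, lambda=lambdas[k])
    preds[r, , k] = predict(fit, newx=X_test)
  }
}

# total prediction variance (summed over test points) for each penalty
total_var_preds = apply(preds, 3, function(m) sum(apply(m, 2, var)))
print(rbind(lambda=lambdas, total_var_preds=total_var_preds))

If the expectation in point 1 holds, total_var_preds should shrink as lambda grows.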

Simulation - polynomial regression

Maybe there's a weird problem with the way I estimated the total variance in predictions. To avoid estimating it, I decided to compute it exactly, which is straightforward for unregularized linear regression.

Note that a problem with this simulation is that it currently isn't capable of fitting models with really high variance. Fitting a polynomial regression past degree=6 breaks compute_var_preds, because it explicitly computes a matrix inverse, and the raw polynomial design matrix becomes badly ill-conditioned at higher degrees. Perhaps there are better numerical approaches to this quantity.
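One possibility (a sketch of an alternative, not what my code below does): compute $\text{diag}\big(X_*(X^TX)^{-1}X_*^T\big)$ from a QR decomposition of the training design matrix instead of forming the inverse explicitly. The hypothetical compute_var_preds_qr below assumes the design matrix has full column rank; it only partially mitigates the ill-conditioning of raw polynomial features, but it avoids solve() on $X^TX$.

# sketch: with X_tr = QR, X_te (X'X)^{-1} X_te' = (X_te R^{-1})(X_te R^{-1})',
# so the per-point prediction variances come from R alone
compute_var_preds_qr = function(X_tr, X_te, error_var=ERROR_VAR) {
  X_tr = as.matrix(X_tr)
  X_te = as.matrix(X_te)
  R = qr.R(qr(X_tr))                         # upper-triangular factor (full rank assumed)
  A = backsolve(R, t(X_te), transpose=TRUE)  # A = R^{-T} X_te', i.e. t(A) = X_te R^{-1}
  true_var_preds = error_var * colSums(A^2)  # diag of X_te (X'X)^{-1} X_te'
  return(sum(true_var_preds))
}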

Speaking of compute_var_preds, below is the math for it (the $*$ subscript denotes test data):

$$ \begin{align} \text{Var}(\hat{\mathbf{y}}_*) &= \text{Var}(X_* \hat{\mathbf{\beta}}) \\ &= \text{Var}(X_* (X^T X)^{-1} X^T \mathbf{y}) \\ &= X_* (X^T X)^{-1} X^T \text{Var}(\mathbf{y}) X (X^T X)^{-1} X_*^T \\ &= X_* (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} X_*^T \\ &= \sigma^2 X_* (X^T X)^{-1} X_*^T. \end{align} $$

The total variance of predictions is defined as the sum of the diagonal elements of $\text{Var}(\hat{\mathbf{y}}_*)$ (the individual prediction variances), which is equivalent to the following (only b/c it simplifies the code):

$$ \begin{align} \sum_{i=1}^{n_*} \big(\text{Var}(\hat{\mathbf{y}}_*)\big)_{ii} &= \sum_{i=1}^{n_*} \sigma^2 \mathbf{x}_{*i}^T (X^T X)^{-1} \mathbf{x}_{*i} \\ &= \text{trace}\big( \sigma^2 X_* (X^T X)^{-1} X_*^T \big). \end{align} $$
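For concreteness, the two equivalent forms might look like this in R (hypothetical X_tr, X_te, and error_var; the per-point form avoids building the full $n_* \times n_*$ matrix of prediction covariances):

XtX_inv = solve(t(X_tr) %*% X_tr)
total_var_trace = error_var * sum(diag(X_te %*% XtX_inv %*% t(X_te)))  # trace form
total_var_perpt = error_var * sum(rowSums((X_te %*% XtX_inv) * X_te))  # per-point form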

# input arguments to simulate simple linear regression
NUM_SAMPLES = 100
NUM_EXAMPLES = 500
TRAIN_FRAC = 0.5
ERROR_SD = 2
X_LB = 0; X_UB = 10
INTERCEPT = 0; SLOPE = 1
POLYNOMIAL_DEGREES = 1:6
# my compute_var_preds code can't handle degrees > 6
# b/c it directly computes an inverse
SEED = 42

# fixed data and parameters
ERROR_VAR = ERROR_SD^2
x = runif(NUM_EXAMPLES, min=X_LB, max=X_UB)
beta = c(INTERCEPT, SLOPE)
y_true_mean = cbind(1, x) %*% beta

compute_var_preds = function(X_tr, X_te, error_var=ERROR_VAR) {
  # Returns the sum of the true variances of test set predictions using
  # an analytical formula
  X_tr = as.matrix(X_tr)
  X_te = as.matrix(X_te)
  XtX_inv = solve(t(X_tr) %*% X_tr)
  true_var_preds = error_var * diag(X_te %*% XtX_inv %*% t(X_te))
  return(sum(true_var_preds))
}

# loop takes 5-10 seconds
mses = matrix(rep(NA, NUM_SAMPLES*length(POLYNOMIAL_DEGREES)),
              nrow=NUM_SAMPLES, ncol=length(POLYNOMIAL_DEGREES))
var_preds = matrix(rep(NA, NUM_SAMPLES*length(POLYNOMIAL_DEGREES)),
                   nrow=NUM_SAMPLES, ncol=length(POLYNOMIAL_DEGREES))
for (i in 1:NUM_SAMPLES) {
  j = 1  # indexes degree
  for (degree in POLYNOMIAL_DEGREES) {
    # x - polynomial transform, handle degree=0
    if (degree == 0) {
      X_poly = data.frame(x=rep(1, length(x)))
    } else {
      X_poly = data.frame(poly(x, degree=degree, raw=TRUE))
    }
    # y - randomly generate
    set.seed(SEED*i*j)
    error = rnorm(NUM_EXAMPLES, mean=0, sd=ERROR_SD)
    y = y_true_mean + error
    # split into train and test
    tr_inds = sample(NUM_EXAMPLES, floor(NUM_EXAMPLES*TRAIN_FRAC))
    y_tr = y[tr_inds]
    y_te = y[-tr_inds]
    X_tr = X_poly[tr_inds,,drop=FALSE]  # keep the df's row names
    X_te = X_poly[-tr_inds,,drop=FALSE]
    # fit, predict
    model = lm(y_tr ~ ., data=X_tr)
    # suppress warnings about predictions from rank-deficient fit.
    # should only occur for degree 0, but whatever...
    y_tr_pred = suppressWarnings(predict(model, X_tr, type="response"))
    y_te_pred = suppressWarnings(predict(model, X_te, type="response"))
    # store results
    var_preds[i,j] = compute_var_preds(X_tr, X_te)
    mses[i,j] = mean((y_te - y_te_pred)^2)
    j = j + 1
  }
}

var_preds_mean = apply(var_preds, MARGIN=2, FUN=mean)  # randomness in splits. aggregate w/ mean
var_mses_estimates = apply(mses, MARGIN=2, FUN=var)

# final plot of interest
poly_degree_col = 'red'
plot(var_preds_mean, var_mses_estimates,
     xlab='true total Var(predictions)',
     ylab='estimated Var(test MSE)',
     pch=16, cex=1)
text(var_preds_mean, var_mses_estimates, POLYNOMIAL_DEGREES,
     pos=2, col=poly_degree_col)
legend('bottomleft', legend='polynomial degree', text.col=poly_degree_col)

[Figure: true total Var(predictions) vs. estimated Var(test MSE), labeled by polynomial degree]

Note that y's range is around 0-10.

# misc plots
boxplot(mses,
        names=POLYNOMIAL_DEGREES,
        xlab='polynomial degree',
        ylab='test MSE')

# visual check for degree of overfitting for the last degree
x_tr = x[tr_inds]
plot(x_tr, y_tr,
     xlab='x (train)',
     ylab='y (train)',
     main=paste('polynomial degree:', degree))
x_tr_order = order(x_tr)  # to plot preds as a line, need x to be sorted
lines(x_tr[x_tr_order],
      y_tr_pred[x_tr_order])

[Figure: test MSE by polynomial degree]

IMO, test MSE isn't different enough to draw any solid conclusions about the question. I'm just sharing my progress so far.

chicxulub
  • If predictions and outcomes of different $i,j$ are assumed to be independent as well, this can be reduced to a simple distribution of variance over the summands. – gunes Mar 21 '22 at 20:21
  • Oh yeah sorry I'll add that assumption in. Basically, assume the model was fit on data that's independent of the evaluation data. Later today I'll add my progress on the calculation so that you can see where I'm stuck. I'd also like to run some simulations to get a quick and dirty answer – chicxulub Mar 21 '22 at 22:10
  • Perhaps I don't understand, but are you claiming that $z_i=10x_i + \epsilon_i$ has more test error than $y_i=x_i + \epsilon_i$? – seanv507 Oct 21 '23 at 10:06

1 Answer


For the cross term (which should be $i\neq j$, and multiplied by $2$):

Assume $i\neq j$: $$\begin{align}\operatorname{Cov}(\epsilon_i^2, \epsilon_j^2)&=\operatorname{Cov}(Y_i^2-2Y_i\hat Y_i+\hat Y_i^2, Y_j^2-2Y_j\hat Y_j+\hat Y_j^2)\\&=\operatorname{Cov}(-2Y_i\hat Y_i,-2Y_j\hat Y_j)+\operatorname{Cov}(-2Y_i\hat Y_i, \hat Y_j^2)\\&+ \operatorname{Cov}(\hat Y_i^2, -2Y_j\hat Y_j)+\operatorname{Cov}(\hat Y_i^2, \hat Y_j^2)\end{align}$$ (the terms involving $Y_i^2$ or $Y_j^2$ vanish by the independence assumptions).

First term: $$t_1=4\operatorname{Cov}(Y_i\hat Y_i, Y_j\hat Y_j)=\operatorname{E}[Y_i]\operatorname{E}[Y_j]\operatorname{Cov}(\hat Y_i, \hat Y_j)=\operatorname{E}[Y]^2\operatorname{Cov}(\hat Y_i, \hat Y_j)$$

Second term (and third term, as they are symmetric):

$$t_{2,3}=-2\operatorname{E}[Y]\operatorname{Cov}(\hat Y_i, \hat Y_j^2)$$

And the fourth term is self-explanatory. I've assumed the $Y_i$ come from the same distribution.

I wonder if something like this is what you're looking for.

gunes
  • I updated my question w/ your math. Two small corrections for $t_1$: (1) the $4$ coefficient was accidentally dropped (2) we can't assume $\text{E}(Y_i) = \text{E}(Y_j)$ in this prediction problem – chicxulub May 26 '22 at 03:08