38

I'm surprised this hasn't been asked before, but I cannot find the question on stats.stackexchange.

This is the formula for the sample variance (say, of a normally distributed sample):

$$\frac{\sum_i(X_i - \bar{X})^2}{n-1}$$

This is the formula to calculate the mean squared error of observations in a simple linear regression:

$$\frac{\sum_i(y_i - \hat{y}_i)^2}{n-2}$$

What's the difference between these two formulas? The only difference I can see is that the MSE uses $n-2$. So if that's the only difference, why not call both of them the variance, just with different degrees of freedom?
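To make the two denominators concrete, here is a toy sketch in plain Python (made-up numbers; the variable names are mine, not from any particular textbook):

```python
# Hypothetical toy data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(y)

# Sample variance of y: squared deviations from the sample mean, divided by n - 1.
y_bar = sum(y) / n
sample_var = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# Simple OLS fit via the closed-form slope and intercept.
x_bar = sum(x) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

# MSE: squared deviations from the fitted line, divided by n - 2
# (two estimated parameters: intercept and slope).
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in residuals) / (n - 2)
```

Both quantities are "sum of squared deviations divided by degrees of freedom"; they differ in what the deviations are measured from (the sample mean vs. the fitted line) and in how many parameters were estimated to get there.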

Alexis
  • 29,850
luciano
  • 14,269
  • What is it about the wikipedia page here that is not clear? – TrynnaDoStat Mar 05 '15 at 19:32
  • 5
    Variance is the average of squared deviation of the observations from the mean. The MSE in contrast is the average of squared deviations of the predictions from the true values. – random_guy Mar 05 '15 at 19:38
  • 3
    Both "variance" and "mean squared error" have multiple formulas and varying applications. To clarify your question, could you (a) describe what kind of data you are applying these concepts to and (b) give formulas for them? (It's likely that in so doing you will discover the answer to your question, too.) – whuber Mar 05 '15 at 19:41
  • 8
    There's a more general formula, which both are special cases of: $\frac{\sum_i(y_i-\hat{y}_i)^2}{n-p}$ where $p$ is the number of parameters estimated in obtaining $\hat{y}$ – Glen_b Mar 06 '15 at 03:05
  • @Glen_b could you please provide a reference for more information on this general formula? – trianta2 Nov 12 '19 at 04:08
  • Any decent reference that covers regression will have it. For example, John Fox's Applied Regression Analysis, 3rd Ed, ch6 p114-115. You could find a shelf full of suitable references at a university library... (Note that my $p$ is his $k+1$ because my $p$ includes the constant but his $k$ doesn't) – Glen_b Nov 12 '19 at 04:16
  • If MSE and variance are both based on the squared loss function, a more general question would be: how can the risk measure of a loss function be derived from a given loss function? I.e., the 'variance' of the Huber loss function, the 'variance' of the absolute loss function, the 'variance' of the Ledoit-Wolf covariance matrix estimator, etc. – develarist Nov 29 '19 at 15:49

2 Answers

39

The mean squared error as you have written it for OLS is hiding something:

$$\frac{\sum_{i}^{n}(y_i - \hat{y}_i) ^2}{n-2} = \frac{\sum_{i}^{n}\left[y_i - \left(\hat{\beta}_{0} + \hat{\beta}_{x}x_{i}\right)\right] ^2}{n-2}$$

Notice that the numerator sums over a function of both $y$ and $x$, so you lose a degree of freedom for each variable (or for each estimated parameter explaining one variable as a function of the other, if you prefer), hence $n-2$. In the formula for the sample variance, the numerator is a function of a single variable, so you lose just one degree of freedom in the denominator.

However, you are on the right track in noticing that these are conceptually similar quantities. The sample variance of $y$ measures the spread of the data around the sample mean of $y$ (in squared units), while the MSE measures the vertical spread of the data around the sample regression line (in squared vertical units).
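The general formula mentioned in the comments, $\mathrm{RSS}/(n-p)$ where $p$ is the number of estimated parameters, covers both cases. A minimal sketch (the helper name and the toy numbers are mine) showing that the sample variance is the special case $p=1$, where $\hat{y}$ is just the sample mean:

```python
def residual_variance(y, y_hat, p):
    """Sum of squared deviations of y from y_hat, divided by n - p,
    where p counts the parameters estimated to form y_hat."""
    n = len(y)
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / (n - p)

y = [2.0, 4.0, 5.0, 7.0]
y_bar = sum(y) / len(y)

# p = 1: y_hat is the sample mean (one estimated parameter),
# so this reduces to the ordinary sample variance with denominator n - 1.
var_y = residual_variance(y, [y_bar] * len(y), p=1)
```

With a fitted regression line as `y_hat` and `p=2` (intercept and slope), the same function gives the regression MSE.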

Alexis
  • 29,850
  • @amoeba Hey! Thanks for the attention. Is there an official CV style guide that prompted this edit? If so I wanna learn of it. If not, well, Glen_b once rightly admonished me for being colonizing with my personal style preferences and edits to others' Qs and As. What do you think? (And I ask this in a collegial tone: I think your edit does add something. Just wanna understand our editing values better.) – Alexis Mar 07 '15 at 15:10
  • 1
    I don't think there is any official CV style guide making this suggestion, but in LaTeX there are inline formulas (marked with one dollar sign) that are rendered directly in the block of text, and displayed formulas (marked with two dollar signs) that are rendered on a separate line. Displayed formulas use different layout. Your formula was originally on a separate line but marked with one dollar sign; I don't think this makes sense. However, you are right about personal preferences, so feel free to roll back with apologies. The reason I edited was that I was fixing a typo in the Q anyway. – amoeba Mar 07 '15 at 15:23
  • 1
    if there is no intercept term $\beta_0$ in the regression problem, then the degrees of freedom of MSE is equal to $n-1$ like in the variance formula instead of $n-2$ – develarist Nov 29 '19 at 15:46
  • What is the reason why we use MSE as an estimator for the variance, instead of the sample variance? Aren't both of them unbiased estimators of the variance? – woowz Aug 12 '22 at 21:51
  • @woowz The variance of what? The 'sample variance' is typically interpreted as being of a single variable, the MSE is a sample variance of $y$ about the regression line, so those are conceptually distinct. – Alexis Aug 14 '22 at 05:01
1

In the variance formula, the sample mean approximates the population mean. The sample mean is calculated from a sample of $n$ data points. Once the sample mean is known, only $n-1$ of the data points remain independent, as the $n$th data point is constrained by the sample mean; hence the $n-1$ degrees of freedom (DOF) in the denominator of the variance formula.

To get the estimated value of $y$ ($\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} x$) in the MSE formula, we need to estimate both $\beta_{0}$ (i.e. the intercept) and $\beta_{1}$ (i.e. the slope), so we lose 2 DOF; that is the reason for the $n-2$ in the denominator of the MSE formula.
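As a comment above notes, if there is no intercept term the count changes: only the slope is estimated, so the MSE divides by $n-1$. A hedged sketch of that case (toy numbers; through-the-origin OLS uses the closed-form slope $\sum x_i y_i / \sum x_i^2$):

```python
# Hypothetical data for a regression through the origin.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.2, 3.9, 6.1, 7.8]
n = len(x)

# No intercept: the only estimated parameter is the slope.
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

# With one estimated parameter, the residual variance divides by n - 1,
# matching the sample-variance denominator rather than n - 2.
residuals = [yi - slope * xi for xi, yi in zip(x, y)]
mse_no_intercept = sum(e ** 2 for e in residuals) / (n - 1)
```

So the denominator is always $n$ minus the number of parameters estimated in forming $\hat{y}$, whatever that number happens to be.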
