
Updated question:

Why do we use RMSE: $$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\hat{y}_i -y_i\Big)^2}}$$

Why is it not MRSE: $$MRSE = \frac{1}{n}\sqrt{\sum_{i=1}^{n}{\Big(\hat{y}_i -y_i\Big)^2}}$$

I understand that other methods (e.g., MAE and MAPE) can be used as a metric for error. My question is specifically about why we use RMSE over MRSE.

Original:

Why is the equation for RMSE: $$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\hat{y}_i -y_i\Big)^2}}$$

Why is it not: $$RMSE = \frac{1}{n}\sqrt{\sum_{i=1}^{n}{\Big(\hat{y}_i -y_i\Big)^2}}$$

What is the reason for taking the square root of 1/n?

Circadian
  • Because it is the Root of the Mean Squared Error (thus RMSE) and MSE is defined as the stuff under the square root. – Tylerr Jun 01 '22 at 20:51
  • You could do this. It would be a perfectly valid measure of residual size in any single instance. The problem is revealed when you consider what value to expect for its square. In the first case, you would be looking at the average squared residual. No matter what $n$ is, that expectation would be about the same. But in the second case you would be looking at $1/n$ times the average squared residual--and that gets really small as $n$ grows. Thus, it wouldn't be meaningful to compare the (modified) RMSEs of two datasets of different sizes. That wouldn't be terribly useful, would it? – whuber Jun 01 '22 at 21:52
  • @whuber, in the second example wouldn't multiplying the root by 1/n be considered taking the average of the root squared error? – Circadian Jun 01 '22 at 22:55
  • @whuber, I found an answer in your response to a similar question about the definition of standard deviation. For those interested: https://stats.stackexchange.com/questions/116342/why-is-the-standard-deviation-defined-as-sqrt-of-the-variance-and-not-as-the-sqr?noredirect=1&lq=1 – Circadian Jun 01 '22 at 23:51
  • The expression for "MRSE" seems off: the mean-root-square-error would be$$\text{MRSE} = \frac{1}{n} \sum_{i=1}^{n}{\sqrt{\left(\hat{y}_i-y_i\right)^{2}}}\,,$$with the thing being that the "mean" involves adding the stuff up and dividing through by the count. This would be least absolute deviations (LAD). – Nat Jun 02 '22 at 07:05
  • $\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\hat{y}_i -y_i\Big)^2}}$ is a measurement of how spread out the sample is and a reasonable estimator of how spread out the population is. If you took a sample with four times as many observations from the same population, you would typically get approximately the same result. $\frac{1}{n}\sqrt{\sum_{i=1}^{n}{\Big(\hat{y}_i -y_i\Big)^2}}$ multiplies the previous number by $\frac1{\sqrt n}$ and would tend to be smaller with a larger sample size, and is instead an estimator for the uncertainty in the sample mean. – Henry Jun 02 '22 at 22:20

4 Answers


Interesting question. Let's break this down into: Why squared error, why mean squared error, and then why root mean squared error. I think that should answer your question.

Why squared error (SE)

Squared error happens to be a proper scoring rule, which is a really desirable property for your loss function to have (feel free to read up on proper scoring rules by searching this site). However, the squared error can grow simply by adding more data. So if I have two data sets (maybe one from yesterday and one from today), and they are of different sizes, I could be fooled into thinking my model is doing poorly simply because I had more data today than yesterday. Which leads me to...

Why mean squared error (MSE)

Taking the mean of the squared errors eliminates this problem of differing data sizes. By taking the average loss, we retain the nice properties of the proper scoring rule, but now can compare the loss of a model on different data sets of possibly different sizes. But the interpretation of MSE is kind of hard. If $y$ is measured in dollars, what is a dollar squared? Which leads me to...

Why root mean squared error (RMSE)

MSE has weird units, but if we took the square root of MSE the result would be on the scale of $y$. This makes interpretation a little easier.

In summation:

  • SE is a proper scoring rule. We like that.
  • To prevent misleading inflation of the error due to sample sizes, we take the average of SE, or MSE
  • MSE is hard to interpret, so instead we take the square root of MSE to get RMSE and have the error units on the same scale as the outcome.
  • Demetri, I appreciate your thorough explanation. I understand that MSE is the mean of the squared residuals, and that taking the square root brings us back to a figure that is easier to interpret. When calculating MSE, the number of data samples is not squared. Wouldn't taking the square root of 1/n when calculating RMSE affect the mean? – Circadian Jun 01 '22 at 22:58
  • @Circadian no. Try it. Suppose the errors are [1, 2, 4]. The squared errors are [1, 4, 16]. The SSE is 21, the MSE is 21/3 = 7 and the RMSE is sqrt(7). Now for a trivial change let the errors be [1, 1, 2, 2, 4, 4]. I think you should agree that a good average error metric would be unchanged, and RMSE works: the SSE is 42, the MSE is 42/6 = 7, and the RMSE is sqrt(7). But with your sqrt(SSE)/n method, the first set would be sqrt(21)/3 = sqrt(7/3), while the second set would be sqrt(42)/6 = sqrt(7/6)... the metric gets smaller with more identically-distributed samples. – hobbs Jun 02 '22 at 14:32
  • @hobbs That seems like the answer to the question! – Joe Jun 02 '22 at 16:57
  • @hobbs Thanks for the straightforward example—it helped answer my question. My mistake was how I was calculating the mean, as Corvus et al point out. – Circadian Jun 03 '22 at 18:23
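hobbs' worked example can be reproduced directly. Below is a minimal Python sketch; the function names are mine, not a standard API:

```python
import math

def rmse(errors):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

def sqrt_sse_over_n(errors):
    """The proposed alternative: square root of the sum of squares, then divide by n."""
    return math.sqrt(sum(e ** 2 for e in errors)) / len(errors)

errors = [1, 2, 4]
doubled = [1, 1, 2, 2, 4, 4]  # same error distribution, twice the sample size

print(rmse(errors), rmse(doubled))                        # both are sqrt(7) ~ 2.646
print(sqrt_sse_over_n(errors), sqrt_sse_over_n(doubled))  # sqrt(21)/3 ~ 1.528 vs sqrt(42)/6 ~ 1.080
```

Doubling the sample leaves RMSE unchanged, while the sqrt-then-divide version shrinks, which is exactly the comparison problem described above.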

The goal is to have an unbiased estimator for the error your model makes on average. Let's call that $\bar \epsilon$. Now let's see how the two estimators you asked about relate to $\bar \epsilon$:

$\hat y_{i} - y_{i} = \epsilon_{i}$

$\frac{1}{n}\sum_{i=1}^{n}(\hat y_{i} - y_{i})^2 = \frac{1}{n}\sum_{i=1}^{n}(\epsilon_{i})^2 \approx \bar \epsilon^2$

thus

$ \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}} \approx \bar \epsilon$

which is what we aimed for. Now let's see what the other estimator will give you:

$\frac{1}{n} \sqrt{\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}} = $

$\frac{1}{n} \sqrt{n \times \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}} = $

$\frac{\sqrt{n}}{n} \times \bar \epsilon$

As you can see, the second estimator has a bias factor of $\frac{\sqrt{n}}{n}$ in estimating the average error you aimed for. For example, if for a data generating process of $f(x) = 0$ you always predict 2, then you would want the estimator to give you $\bar \epsilon = 2$, which is what the first estimator gives, while the second estimator (assuming $n = 10$) will give you $2 \times \frac{\sqrt{10}}{10} \approx 0.63$.
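This shrinkage is easy to check numerically; here is a small Python sketch of the always-predict-2 example (variable names are mine):

```python
import math

# True process f(x) = 0, prediction always 2, so every error is 2.
for n in (10, 100, 1000):
    errors = [2.0] * n
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)  # first estimator
    alt = math.sqrt(sum(e ** 2 for e in errors)) / n   # second estimator
    print(n, rmse, alt)  # rmse stays 2.0; alt = 2*sqrt(n)/n, shrinking toward 0
```

The first estimator returns 2.0 regardless of $n$, while the second returns $2\sqrt{n}/n$ (about 0.63 at $n = 10$) and keeps shrinking as $n$ grows.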

Amin Shn
  • Re the goal: I don't think that's accurate, in part because the (usual) RMSE is never unbiased. – whuber Jun 02 '22 at 13:06
  • That's why I used the $\approx$ sign and not the equal sign. – Amin Shn Jun 02 '22 at 14:20
  • But that does not help, because your entire exposition is motivated by the claim that "the goal is to have an unbiased estimator." That's just not so. – whuber Jun 02 '22 at 14:20
  • And the first estimator is closer to that goal, that's my point; I gave an example to make it more sensible. – Amin Shn Jun 02 '22 at 14:21
  • And my point is that unbiasedness is not the goal of an RMSE. – whuber Jun 02 '22 at 14:39
  • @whuber I agree unbiasedness is not the goal of RMSE. However, it's worth noting that the only source of bias in RMSE is often just the root and not the $n-1$ term as the RMSE is typically (always in ML) computed out of sample (e.g. in cross validation). – Luca Citi Jun 02 '22 at 22:12

While Demetri's answer gives a very good derivation of RMSE, it doesn't really explain why not to use the other method you suggest. I think you can get a little more insight by observing that MRSE is not a valid name for your suggested measure. Look closely and the steps are:

  1. Square the residuals
  2. Add them up
  3. Square root
  4. Divide by the number of samples

A "mean" requires the sum and the division to be consecutive steps. So the MRSE would actually be:

$$ MRSE = \frac{1}{n} \sum \sqrt{(\hat{y}_i - y_i)^2} = \frac{1}{n}\sum |\hat{y}_i - y_i| = MAE$$

So, the MSE is a mean - RMSE just transforms it (by square root) for convenience. The MAE is itself a mean. What you have created isn't a mean - you are not adding things up and dividing by the number there are; you are adding things up, then square rooting, then dividing by the number there are. In fact the construct before the $1/n$ is a Euclidean distance - the total distance that the sample is from the predicted y-vector. As pointed out in Amin's answer, this error naturally grows as the square root of the size of the y-vector, so by dividing by $n$ your error will systematically get smaller the larger the sample.
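The identity between a true mean of root-squared errors and the MAE, and the Euclidean-norm reading of the question's construct, can be verified with a short Python sketch (the residual values are made up for illustration):

```python
import math

y_hat = [3.0, -1.0, 2.5, 0.0]  # hypothetical predictions
y = [1.0, 1.0, 2.0, 4.0]       # hypothetical observations
resid = [a - b for a, b in zip(y_hat, y)]

# A true mean of root-squared errors: square, root each term, sum, divide. This is just the MAE.
mrse = sum(math.sqrt(r ** 2) for r in resid) / len(resid)
mae = sum(abs(r) for r in resid) / len(resid)

# The construct before the 1/n in the question is the Euclidean norm of the residual vector.
euclid = math.sqrt(sum(r ** 2 for r in resid))

print(mrse, mae)  # identical: 2.125 and 2.125
```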

Corvus

I think both RMSE and MRSE could potentially be used as metrics related to residuals per data point. The difference, and the reason RMSE is commonly used while MRSE is not, probably lies in the interpretation of the terms and their related metrics. If we roll RMSE back step by step, every intermediate quantity is commonly used and interpretable: squaring RMSE gives MSE (a variance-like quantity), and multiplying MSE by the sample size gives SSE, the sum of squared errors. However, rolling back MRSE does not give such nice terms: multiplying MRSE by the sample size gives RSE, the root of the sum of squared errors, which is not commonly used to evaluate anything. This may be the reason why one is used and the other isn't.
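The roll-back argument can be sketched in Python (the variable names for the intermediate quantities are mine):

```python
import math

resid = [1.0, -2.0, 4.0]
n = len(resid)

# Rolling RMSE back: every intermediate is a familiar quantity.
sse = sum(r ** 2 for r in resid)  # sum of squared errors
mse = sse / n                     # mean squared error (variance-like)
rmse = math.sqrt(mse)             # root mean squared error

# Rolling the proposed MRSE back instead: the intermediate is the rarely-used RSE.
mrse = math.sqrt(sse) / n
rse = mrse * n                    # = sqrt(SSE), the Euclidean norm of the residuals
```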