9

Hastie et al. "The Elements of Statistical Learning" (2009) consider a data generating process $$ Y = f(X) + \varepsilon $$ with $\mathbb{E}(\varepsilon)=0$ and $\text{Var}(\varepsilon)=\sigma^2_{\varepsilon}$.

They present the following bias-variance decomposition of the expected squared forecast error at point $x_0$ (p. 223, formula 7.9): $$ \begin{aligned} \text{Err}(x_0) &= \mathbb{E}\left( [ y - \hat f(x_0) ]^2 \mid X = x_0 \right) \\ &= \dots \\ &= \sigma^2_{\varepsilon} + \text{Bias}^2(\hat f(x_0)) + \text{Var}(\hat f(x_0)) \\ &= \text{Irreducible error} + \text{Bias}^2 + \text{Variance}. \end{aligned} $$ In my own work I do not specify $\hat f(\cdot)$ but take an arbitrary forecast $\hat y$ instead (in case this is relevant).
Question: I am looking for a term for $$ \text{Bias}^2 + \text{Variance} $$ or, more precisely, $$ \text{Err}(x_0) - \text{Irreducible error}. $$
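The decomposition can be checked numerically. Below is a minimal Monte Carlo sketch (my own illustration, not from ESL; the choice of $f$, the noise level, and the deliberately misspecified linear model are all assumptions) that estimates each term at a point $x_0$:

```python
import numpy as np

# Illustrative check of Err(x0) = sigma_eps^2 + Bias^2 + Variance.
# We simulate Y = f(X) + eps with f(x) = x^2 and fit a degree-1 polynomial
# (a deliberately biased model). All choices here are made up for the sketch.
rng = np.random.default_rng(0)
f = lambda x: x ** 2
sigma_eps = 0.5
x0 = 0.8
n, reps = 50, 5000

preds = np.empty(reps)
for r in range(reps):
    X = rng.uniform(-1, 1, n)
    Y = f(X) + rng.normal(0, sigma_eps, n)
    b1, b0 = np.polyfit(X, Y, 1)     # linear fit: hat f(x) = b0 + b1 * x
    preds[r] = b0 + b1 * x0          # hat f(x0) for this training sample

bias2 = (preds.mean() - f(x0)) ** 2  # squared bias of hat f(x0)
var = preds.var()                    # variance of hat f(x0)

# Direct estimate of Err(x0): a fresh y drawn at x0 in each replication
y_new = f(x0) + rng.normal(0, sigma_eps, reps)
err_x0 = np.mean((y_new - preds) ** 2)

print(err_x0, sigma_eps ** 2 + bias2 + var)  # the two should be close
```

The two printed numbers agree up to Monte Carlo error, mirroring the identity above term by term.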

Richard Hardy
  • 67,272
  • 3
    What is the question here? – Michael R. Chernick Apr 12 '17 at 13:19
  • @MichaelChernick, I am looking for a name for the object defined at the end of the post. – Richard Hardy Apr 12 '17 at 14:22
  • How about 'explained error'? – sntx Apr 12 '17 at 14:59
  • 1
    @sntx, thanks for the idea. But it somehow does not sound right. Maybe modelling error (i.e. error due to model misspecification and imprecise estimation of the model), but then it does not make sense if there is no forecast-generating model (e.g. expert forecasts). – Richard Hardy Apr 12 '17 at 15:36
  • reducible error seems quite the logical choice. It's the only part of $Err(x_0)$ which you can hope to reduce, by choosing a model with the optimal bias-variance tradeoff, among the class of models you are considering. – DeltaIV Apr 27 '17 at 09:40
  • 1
    @DeltaIV, that is rather good. However, I think the term is charged; it seems as if the forecast is poor and we could do better. But suppose we did our best for the given data. So we happen to have chosen the correct model (no "model bias") but the sample is just too small to perfectly estimate the coefficients. The estimation variance ("model variance") is thus really irreducible for the given sample size -- while the term "reducible error" suggests this is not the case. Not that I am sure we can come up with a better term, I still would like to strive for that. – Richard Hardy Apr 27 '17 at 11:47
  • well, you described exactly the classic example of bias-variance tradeoff. The unbiased model, in your setting, needs not be the model which gives the smallest forecast error. A biased model, which however is less flexible (less variance), may outperform your unbiased model, when used in a "small data" (small wrt flexibility of the model) setting. Example: OLS (unbiased) vs ridge regression (biased). There are indeed "small data" cases where ridge regression has smaller forecast error than OLS. – DeltaIV Apr 27 '17 at 12:31
  • @DeltaIV, seems I did not get my point through. What I meant is, even for the theoretically optimal balance between bias and variance, the "reducible error" is not reducible beyond a certain quantity defined by the given sample size. No matter how good we are at balancing bias against variance, there is a quantity beyond which the "reducible error" cannot be reduced, and that quantity is not zero! That explains why the term "reducible error" is misleading. That error is "reducible" in the sense of what could potentially happen in infinite samples. But we will never have an infinite sample... – Richard Hardy Apr 27 '17 at 13:36
  • Yes, I got your point. My point is that the reducible error is always reducible to 0 if you set $\hat{f}(x)=f(x)=\mathcal{E}[Y|X=x]$, i.e., if your model is the regression function. You could of course argue that you never know exactly $f(x)$ and you need to estimate it. This is correct, but my point is that the "reducible error" is in theory reducible. The irreducible error is not, because even if you use the best possible estimate (under squared loss) of $Y$ given $X$, the expectation of your squared error won't be 0 but $\sigma^2_{\epsilon}$. Anyway, if you're not buying into reducible error.. – DeltaIV Apr 27 '17 at 16:12
  • ...then what about the term actually used by Hastie et al.? Sec. 2.9, page 37: "The second and third terms"(i.e., bias squared & variance) "are under our control, and make up the mean squared error of $\hat{f}_k(x_0)$ in estimating $f(x_0)$ [..]". I'm not a big fan because it's quite convoluted, but it's indeed the term they use. – DeltaIV Apr 27 '17 at 16:16
  • 1
    @DeltaIV, OK, I now got the intuition in which sense it is reducible. Still the term might be misleading if used without further explanation (just as you had to explain to me). Your latter suggestion is precise, which is really nice, but just as you said, it is quite convoluted. – Richard Hardy Apr 27 '17 at 17:45
  • @DeltaIV I think the mean squared error contains modelling error as well, see below, so I do not buy that particular "convolution." – Carl Feb 24 '18 at 19:25
  • @DeltaIV, I think I could accept an answer "reducible error". Ideally it would also include a summary of our discussion above in the comments to show what caveats apply. If someone comes up with a better answer eventually, I could choose that one instead, but for now "reducible error" is perhaps the best we have. – Richard Hardy Feb 25 '18 at 20:23
  • @RichardHardy gee, thanks, basically you said that by accepting my answer you'd be settling for less ;-) anyway, I like the idea of answering a question nearly a year after it was written! I'll do it. – DeltaIV Feb 25 '18 at 20:41
  • 1
    @DeltaIV, I did not intend to sound like that. This is nothing personal; my (hopefully convincing) arguments are above in the comments. But thanks for having the discussion with me, it helps. – Richard Hardy Feb 25 '18 at 21:11
  • @RichardHardy don't worry, I just have a weird sense of humor ;) I'm writing the answer now. Feel free to comment it and let me know if you think I left something out. – DeltaIV Feb 25 '18 at 23:31
  • There are logical problems with including reducibility in an axiom structure for data types. See my latest comments on this below. (+1) on your question because I had to think it through, and despite the fact that I do not agree with you, I think you should admit that what I am saying has value. – Carl Feb 27 '18 at 00:52

2 Answers

5

I propose reducible error. This is also the terminology adopted in paragraph 2.1.1 of James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning, a book which is basically a simplification of ESL plus some very cool R code laboratories (except for the fact that they use `attach`, but, hey, nobody's perfect). I'll list below the pros and cons of this terminology.


First of all, we must recall that we assume not only that $\epsilon$ has mean 0, but also that it is independent of $X$ (see paragraph 2.6.1, formula 2.29 of ESL, 2nd edition, 12th printing). Then of course $\epsilon$ cannot be estimated from $X$, no matter which hypothesis class $\mathcal{H}$ (family of models) we choose, and no matter how large a sample we use to learn our hypothesis (estimate our model). This explains why $\sigma^2_{\epsilon}$ is called the irreducible error.

By analogy, it seems natural to call the remaining part of the error, $\text{Err}(x_0)-\sigma^2_{\epsilon}$, the reducible error. Now, this terminology may sound somewhat confusing: as a matter of fact, under the assumptions we made on the data generating process, we can prove that

$$ f(x)=\mathbb{E}[Y\vert X=x]$$

Thus, the reducible error can be reduced to zero if and only if $\mathbb{E}[Y\vert X=x]\in \mathcal{H}$ (assuming of course we have a consistent estimator). If $\mathbb{E}[Y\vert X=x]\notin \mathcal{H}$, we cannot drive the reducible error to 0, even in the limit of an infinite sample size. However, it's still the only part of our error which can be reduced, if not eliminated, by changing the sample size, introducing regularization (shrinkage) in our estimator, etc. In other words, by choosing another $\hat{f}(x)$ in our family of models.

Basically, reducible is meant not in the sense of zeroable (yuck!), but in the sense of that part of the error which can be reduced, even if not necessarily made arbitrarily small. Also, note that in principle this error can be reduced to 0 by enlarging $\mathcal{H}$ until it includes $\mathbb{E}[Y\vert X=x]$. In contrast, $\sigma^2_{\epsilon}$ cannot be reduced, no matter how large $\mathcal{H}$ is, because $\epsilon\perp X$.
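As a sketch of this point (my own illustration; the quadratic $f$ and the polynomial hypothesis classes are assumptions, not taken from ESL), one can check by simulation that the bias vanishes once $\mathcal{H}$ is enlarged to contain $\mathbb{E}[Y\vert X=x]$, while a misspecified class keeps a bias floor no matter the sample size:

```python
import numpy as np

# When H contains E[Y|X=x] the bias -- and hence, with a consistent
# estimator, the reducible error -- can be driven toward 0; when it does
# not, a bias floor remains. f, sigma_eps and the degrees are illustrative.
rng = np.random.default_rng(1)
f = lambda x: x ** 2                    # true regression function E[Y|X=x]
sigma_eps, x0, n, reps = 0.5, 0.8, 200, 2000

def mean_pred(degree):
    """Monte Carlo mean of hat f(x0) over fresh training samples."""
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1, 1, n)
        Y = f(X) + rng.normal(0, sigma_eps, n)
        preds[r] = np.polyval(np.polyfit(X, Y, degree), x0)
    return preds.mean()

bias2_lin = (mean_pred(1) - f(x0)) ** 2   # H = lines: truth not in H
bias2_quad = (mean_pred(2) - f(x0)) ** 2  # H = quadratics: truth in H
print(bias2_lin, bias2_quad)              # bias floor vs. (near) zero bias
```

Enlarging $\mathcal{H}$ from lines to quadratics is exactly the "enlarge $\mathcal{H}$ until it includes $\mathbb{E}[Y\vert X=x]$" move described above; $\sigma^2_{\epsilon}$ is unaffected by it.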

DeltaIV
  • 17,954
  • If noise is the irreducible error, it is not irreducible. You need to motivate this somehow, I cannot do that for myself. – Carl Feb 26 '18 at 01:55
  • In 2.1.1 the example is "assay of some drug in the blood." The first example I give below is exactly that. In that assay, the so-called irreducible error of measurement is nothing of the kind. It is composed of counting noise, which is usually reduced by counting 10000 or more events, pipetting error, which is almost exponentially distributed, and other technical errors. To further reduce these "irreducible" errors, I recommend using the median of three counting tubes for each time sample. The term irreducible is bad jargon, try again. – Carl Feb 26 '18 at 14:32
  • 2
    @Delta, thank you for the answer. A one liner "reducible error" might not have been very convincing, but given the context and the discussion it looks pretty good! – Richard Hardy Feb 26 '18 at 14:50
  • I do not think that the purpose of developing jargon is to confuse people. If you want to say error independent of $n$ versus error that is a function of $n$, say what you mean. – Carl Feb 26 '18 at 16:45
  • @DeltaIV I believe that reducibility is a dubious assumption, see below. – Carl Feb 26 '18 at 19:38
  • (+1) anyway, for effort, even if I do not agree. – Carl Feb 26 '18 at 20:32
-1

In a system for which all of the physical occurrences have been properly modeled, the leftover would be noise. However, there is generally more structure in the error of a model fit to data than just noise. For example, modelling bias and noise alone do not explain curvilinear residuals, i.e., unmodelled data structure. The total unexplained fraction is $1-R^2$, which can consist of misrepresentation of the physics as well as bias and noise of known structure. If by bias we mean only the error in estimating mean $y$, by "irreducible error" we mean noise, and by variance we mean the systemic physical error of the model, then the sum of bias (squared) and systemic physical error is not anything special; it is merely the error that is not noise. The term (squared) misregistration might be used for this in a specific context, see below. If you want to say error independent of $n$ versus error that is a function of $n$, say that. IMHO, neither error is irreducible, so the irreducibility property misleads to such an extent that it confuses more than it illuminates.
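The point about curvilinear residuals can be made with a toy sketch (my own illustration, not the renal data discussed below): fitting a straight line to curved data leaves residuals whose signs change only a few times, unlike white noise, which betrays unmodelled structure.

```python
import numpy as np

# Toy demonstration of "unmodelled data structure": a straight-line fit to
# curved data leaves residuals with a systematic curvilinear pattern,
# visible as long runs of same-sign residuals. All numbers are made up.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)
y = x ** 2 + rng.normal(0, 0.02, x.size)        # curved truth plus mild noise

resid = y - np.polyval(np.polyfit(x, y, 1), x)  # residuals of the linear fit
signs = np.sign(resid)
runs = 1 + np.count_nonzero(signs[1:] != signs[:-1])
print(runs)  # few sign changes; iid noise would average about 50 runs here
```

A runs test on the residuals would flag this fit, even though $R^2$ may look respectable.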

Why do I not like the term "reducibility"? It smacks of a self-referential tautology, as in the Axiom of reducibility. I agree with Russell (1919) that "I do not see any reason to believe that the axiom of reducibility is logically necessary, which is what would be meant by saying that it is true in all possible worlds. The admission of this axiom into a system of logic is therefore a defect ... a dubious assumption."

Below is an example of structured residuals due to incomplete physical modelling. It shows residuals from ordinary least squares fitting of a scaled gamma distribution, i.e., a gamma variate (GV), to blood plasma samples of radioactivity of a renal glomerular filtered radiopharmaceutical [1]. Note that the more early data is discarded ($n=36$ for each time-sample), the better the model becomes, so the supposedly reducible error shrinks as the sample range is restricted.

[Figure: residuals of gamma-variate fits to the plasma data as early time-samples are sequentially dropped]

It is notable that, as one drops the first sample at five minutes, the physics improves, and it continues to do so as one sequentially drops early samples out to 60 min. This shows that although the GV eventually forms a good model for plasma concentration of the drug, something else is going on during early times.

Indeed, if one convolves two gamma distributions, one for early time, circulatory delivery of the drug, and one for organ clearance, this type of error, physical modeling error, can be reduced to less than $1\%$ [2]. Next is an illustration of that convolution.

[Figure: fit obtained by convolving two gamma distributions, one for circulatory delivery and one for organ clearance]

From that latter example, for a square root of counts versus time graph, the $y$-axis deviations are standardized deviations in the sense of Poisson noise. Such a graph is an image for which errors of fit are image misregistration from distortion or warping. In that context, and only in that context, misregistration is bias plus modelling error, and total error is misregistration plus noise error.

Carl
  • 13,084
  • Indeed, this is what the above decomposition is about. But your answer would better serve as a comment as it does not address the actual question. Or does it? – Richard Hardy Feb 24 '18 at 06:03
  • Thanks, but the answer just got further away from the topic. I have a hard time finding any connection between the actual question (how do I call $\text{Bias}^2+\text{Variance}$) and all this... – Richard Hardy Feb 24 '18 at 18:56
  • Once again, you are answering a different question. A right answer to a wrong question is unfortunately a wrong answer (a note to self: coincidentally, I was explaining this to my undergraduate students yesterday). I am not asking how meaningful the expression is (it is meaningful for someone who has read the ESL textbook and/or worked in applied machine learning), I am asking for a proper term for it. The question is positive, not normative. And it is pretty simple and very concrete. – Richard Hardy Feb 24 '18 at 19:58
  • @RichardHardy Without the physics, the question was difficult for me to comprehend. Changed my answer, see misregistration above. – Carl Feb 24 '18 at 20:11
  • @RichardHardy For me, "concrete" means physically extant. Am I correct in this "If by bias we mean only the error in estimating mean y, by "irreducible error" we mean noise, and by variance we mean the systemic physical error of the model, then the sum of bias (squared) and systemic physical error is not any special anything, it is merely the error that is not noise. The term (squared) misregistration might be used for this in a specific context, see below."? – Carl Feb 25 '18 at 20:06
  • Hmm, not exactly. "Bias" is the difference between the functional form that generates the data and the one we use as a model. "Variance" is the difference between optimal and estimated parameters in the model we use. Hastie et al. probably have a better explanation. Shmueli "To Explain or to Predict" (2010) is another great source. But you are right that the sum of bias and variance is the error that is not noise, as should also be clear from the formula in my question. – Richard Hardy Feb 25 '18 at 20:13
  • I am now starting to think that "reducible error" is probably as good a name as it gets for that object (though I have pointed out the flaws of this name in the comments to the OP). It is mostly accurate if one keeps in mind that this error is reducible in principle but not necessarily in reality (while irreducible error is irreducible both in theory and in reality). The caveat is that one needs to keep this in mind, and surely too many people wouldn't without being explicitly told about the caveat... – Richard Hardy Feb 25 '18 at 20:20
  • @RichardHardy It is not exactly true that noise is irreducible; noise reduction methods are quite real. Next, if bias is more variable than a constant, e.g., a linear bias, then it is perhaps more properly included in the systemic physical error of modelling. I would suggest that there is an advantage to thinking physically about these models. – Carl Feb 25 '18 at 22:33
  • Perhaps you are taking the notion of noise in physics, otherwise it would not be reducible. Think of any process plus a coin throw. The uncertainty around the outcome of the coin throw is irreducible, regardless of how well you can model the process. Even if you can forecast the process perfectly, the coin throw part will cause some irreducible error in the forecast. – Richard Hardy Feb 26 '18 at 06:23
  • @RichardHardy The radioactivity in the examples above is binned. The actual events are Dirac deltas, which if examined without binning are entirely noise, i.e., no information. Without smoothing, information tends to be rather coarse. I take your coin flips and average them until I have optimally reduced noise. – Carl Feb 26 '18 at 13:19
  • 1
    You can do that for estimating the process, yes, and that is the reducible error part. But when you forecast a concrete event that includes the coin flip, there is no way you can reduce the error associated with mispredicting the outcome of the coin flip. This is what the irreducible error is about. Interesting: in a purely deterministic world there would be no irreducible errors by definition, so if your view of the world is completely deterministic, then I might understand what you mean. However, the world is stochastic in "The Elements of Statistical Learning" and in statistics in general. – Richard Hardy Feb 26 '18 at 14:52
  • Quanta are not deterministic. If your outcome is quantized into only two states, that outcome only really acquires meaning after binning. Without binning there is little or no information. I hope you do not trade stocks based on their moment-to-moment changes in value. Indeed, there is an investment book named "Fooled by Randomness" that is worth reading. – Carl Feb 26 '18 at 15:55
  • If you look at single flips of a coin as isolated events, it would not be meaningful to ask if the coin is biased. One has to do some type of noise reduction to answer that question. I do not see what you mean by irreducible error in that context. If, on the other hand, you are using a biased coin for prediction, then some of the prediction error is "irreducible." However, as a general categorization of the entirety of error to split off an "irreducible" portion of it is not really that meaningful, – Carl Jul 29 '22 at 00:46
  • con't... as often what is irreducible by statistical methods may be reducible by other means. For example, patient motion in the thyroid study above, which causes a "wiggle" of low amplitude in the curve, can be reduced by putting the patient's head and neck in a tight-fitting form. Alternatively, one can do image-series motion correction, which puts the irreducible error squarely back into the statistically admissible frame, but this time as a reducible error. Thus, the concept of irreducibility can merely be an assumption and not a firm result. – Carl Jul 29 '22 at 00:57