
Consider a data generating process $$Y=f(X)+\varepsilon$$ where $\varepsilon$ is independent of $X$ with $\mathbb E(\varepsilon)=0$ and $\text{Var}(\varepsilon)=\sigma^2_\varepsilon$. According to Hastie et al., "The Elements of Statistical Learning" (2nd edition, 2009), Section 7.3, p. 223, we can derive an expression for the expected prediction error of a regression fit $\hat g(X)$ at an input point $X=x_0$, using squared-error loss:

\begin{align}
\text{Err}(x_0) &= \mathbb E[(Y-\hat g(x_0))^2 \mid X=x_0]\\
&= \underbrace{(\mathbb E[\hat g(x_0)-f(x_0)])^2}_{\text{Bias}^2} + \underbrace{\mathbb E[(\hat g(x_0)-\mathbb E[\hat g(x_0)])^2]}_{\text{Variance}} + \underbrace{\sigma^2_\varepsilon}_{\text{Irreducible Error}}
\end{align}

(where I use the notation $\text{Bias}^2$ instead of $\text{Bias}$ and $\hat g$ instead of $\hat f$).
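To make the decomposition concrete, here is a minimal Monte Carlo sketch in Python (the linear $f$, the uniform design, the noise level, and the OLS line fit are all assumptions chosen purely for illustration): averaging the squared prediction error at $x_0$ over many independent training sets should match $\text{Bias}^2+\text{Variance}+\sigma^2_\varepsilon$ up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # true regression function (assumed for illustration)
    return 0.5 + 2 * x

sigma_eps = 1.0                # sd of the irreducible noise
n, n_sims, x0 = 50, 20_000, 0.3

preds = np.empty(n_sims)       # ghat(x0) across independent training sets
sq_err = np.empty(n_sims)      # (Y - ghat(x0))^2 for a fresh Y at x0
for i in range(n_sims):
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma_eps, n)
    beta = np.polyfit(x, y, deg=1)         # OLS fit of a line: ghat
    preds[i] = np.polyval(beta, x0)
    y0 = f(x0) + rng.normal(0, sigma_eps)  # a new test response at x0
    sq_err[i] = (y0 - preds[i]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("Err(x0)            ~", sq_err.mean())
print("Bias^2 + Var + s2  ~", bias2 + variance + sigma_eps**2)
```

The two printed numbers should agree up to Monte Carlo error.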

Recently, I caught myself using the same term, bias, in two situations:

  1. Just as above, $\text{Bias}=\mathbb E[\hat g(x_0)-f(x_0)]$ for a given sample size $n$, referring to a specific value $x_0$ of the regressors.
  2. Referring to the ability of a class of models $g$ to approximate $f$ across different values of $x_0$ given infinite data. In this sense, $g(x)=\beta_0+\beta_1 x+\beta_2 x^2$ is an "unbiased" class of models for $f(x)=0.5+2x$, because $0.5+2x$ is a special case of $\beta_0+\beta_1 x+\beta_2 x^2$, and given infinite data we can learn the values of $(\beta_0,\beta_1,\beta_2)$ to be $(0.5,2,0)$.
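A quick numerical check of case 2 (a sketch in Python; the uniform design, unit-variance noise, and OLS estimator are assumptions for illustration): fitting the quadratic class to data generated from $f(x)=0.5+2x$, the coefficient estimates approach $(0.5,2,0)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # the DGP from the example above
    return 0.5 + 2 * x

for n in (100, 10_000, 1_000_000):
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x, x**2])    # class b0 + b1 x + b2 x^2
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS
    print(f"n = {n:>9,}: beta_hat = {np.round(beta, 3)}")  # -> (0.5, 2, 0)
```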

Is there a better term than bias to refer to the second case? Perhaps asymptotic bias?

Richard Hardy

1 Answer


It is true that the term "bias" tends to be used as a blanket term for deviation from the true value in general. E.g., one finds expressions like "...then the estimator will be inconsistent and the estimates will be biased"; the word "bias" here is not used in its formal sense.

The specific case the OP describes is a case of "nested models": if we are lucky enough to specify a model that nests the true Data Generating Mechanism, then it becomes a matter of whether the estimator we use remains consistent even in the presence of redundant regressors (like the $x^2$ in the OP's example).
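A sketch of this point (the quadratic DGP, uniform design, and OLS estimator below are all assumptions for illustration): a cubic fit nests $f$ despite the redundant $x^3$ regressor, so its error at $x_0$ vanishes as $n$ grows, whereas a linear fit does not nest $f$ and a nonzero asymptotic bias remains.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # a curved DGP, assumed purely for illustration
    return 0.5 + 2 * x + x**2

x0 = 0.9
for n in (1_000, 30_000, 1_000_000):
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, 1, n)
    lin = np.polyfit(x, y, deg=1)    # does NOT nest f: bias persists
    cub = np.polyfit(x, y, deg=3)    # nests f, with a redundant x^3 term
    print(f"n = {n:>9,}: "
          f"linear err at x0 = {np.polyval(lin, x0) - f(x0):+.3f}, "
          f"cubic err at x0 = {np.polyval(cub, x0) - f(x0):+.3f}")
```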

As regards terminology around "asymptotic bias", I guess the following two posts could be useful:

https://stats.stackexchange.com/a/245074/28746

https://stats.stackexchange.com/a/239919/28746

Responding to the OP's comment about what, after all, would be proper terminology for case 2 in his post, I would suggest "asymptotically equivalent", if by using $g(x)$ we will end up with the correct coefficient vector.

So "model $g(x)$ is asymptotically equivalent to $f(x)$ because as data goes to infinity $g(x)$ converges to $f(x)$."

  • Good point about the need for a consistent estimator. (I thought about it at some point but did not include it in the post.) Given your two links, *asymptotic bias* seems to be taken. But just *bias* is "even more taken"... – Richard Hardy Jan 12 '24 at 20:51
  • Hmm. If $f(x)$ is nested in $g(x)$, the relationship is asymmetric. So if we use *asymptotically equivalent*, we would have both *$g(x)$ is asymptotically equivalent to $f(x)$* and *$f(x)$ is not asymptotically equivalent to $g(x)$* at once. I would be more comfortable using the term [asymptotically] equivalent for a symmetric relationship than an asymmetric one. – Richard Hardy Jan 14 '24 at 12:27
  • @RichardHardy It's up to you, but I personally find this unjustified. The two functions have implicitly distinct/asymmetric roles, since one represents the data generating mechanism and the other a model of it. If you want to treat both as models of some third function/DGP, then whether they are asymptotically equivalent or not depends on what this third generating mechanism is. So on my part I see no problem saying "model $g(x)$ is asymptotically equivalent to $f(x)$". – Alecos Papadopoulos Jan 14 '24 at 13:54
  • That is a helpful point, thank you. – Richard Hardy Jan 14 '24 at 13:58