
Suppose

$$ \mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e} \, , \\ \mathbf{e} \sim \mathcal{N}(0,\mathbf{I}_P) \, . $$

We know that $\mathbf{\hat{b}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ is the BLUE.

Is it also the UMVUE? I can only find a single source (page 6) that claims this, so I'm unsure.

In the case $\mathbf{X}=\mathbf{1}$ it is true ($\mathbf{\hat{b}}$ becomes the sample mean).
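
(As a quick sanity check, here is a minimal NumPy sketch with made-up data, showing the OLS formula and that it reduces to the sample mean when $\mathbf{X}=\mathbf{1}$.)

```python
import numpy as np

# Minimal sketch with made-up data: y = X b + e, X = 1 (intercept only)
rng = np.random.default_rng(0)
n = 1000
X = np.ones((n, 1))
y = 2.5 + rng.standard_normal(n)           # true b = 2.5, e ~ N(0, 1)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y
print(b_hat[0], y.mean())                  # identical: here OLS is the sample mean
```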

But other, related results like Stein's example make me cautious.

And if it's true, then why isn't it more famous?

Patrick
  • The second U in UMVUE stands, like the U in BLUE, for "unbiased". Stein's estimator gives up on unbiasedness through shrinkage in order to achieve an overall lower risk. – Christoph Hanck May 15 '18 at 11:26
  • Yup, I'm aware. I just brought it up as a related case. – Patrick May 15 '18 at 11:29
  • Another source: http://www.stat.wisc.edu/~doksum/STAT709/n709-36.pdf – amoeba May 15 '18 at 12:48
  • Thanks @amoeba. I saw that one, but it talks about $l^\tau b$ (what's $l$?) and presumes that it is estimable (?), and from the start of the proof the linearity seems imposed. Maybe I'm not understanding it well, but it does not seem to answer my question. – Patrick May 15 '18 at 12:59
  • You don't need to care about $l$, just look at the proof of (i) until the last line or so. They show that $\hat\beta$ is a function of a complete sufficient statistic, from which UMVUE follows by https://en.wikipedia.org/wiki/Lehmann%E2%80%93Scheff%C3%A9_theorem. Nothing is imposed here. Anyway, I just wanted to give this link. +1, good question. – amoeba May 15 '18 at 13:10
  • https://stats.stackexchange.com/questions/288674/are-there-unbiased-non-linear-estimators-with-lower-variance-than-the-ols-estim#comment551719_288674 – Cagdas Ozgenc May 15 '18 at 13:23
  • Thanks @CagdasOzgenc, that's brilliant! While I now understand amoeba's reference better, I think I prefer the more direct proof in your reference [http://www.econ.ohio-state.edu/dejong/note5.pdf, page 17]. Please write an answer if you wish. And then, can somebody tell me: why is the BLUE result so much more famous than the UMVUE result? – Patrick May 15 '18 at 14:00
  • No problem. Stein's example doesn't apply here because we consider only unbiased estimators. Stein's example and other shrinkage techniques introduce bias in exchange for lower variance. I think BLUE is more advertised because, in the basic regression setting, one only assumes spherical errors, not necessarily Gaussian errors (the Gaussian being just one type of spherical distribution) or some other pre-specified distribution under which a non-linear estimator would beat BLUE. https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem. Basically, if the errors are spherical but not Gaussian, OLS will be BLUE but not UMVUE. – Cagdas Ozgenc May 15 '18 at 15:08
  • Thanks @CagdasOzgenc, although I guess I was thinking more of the question "Why is imposing linearity favoured (in popularity) over assuming Gaussianity?" I'm guessing the reason is historical, but IMHO it's an omission by textbooks not to present the UMVUE result more equitably... – Patrick May 15 '18 at 15:23
  • This is a great question and I am baffled that it is still not answered. I am in the same boat. I have found bits and pieces that mention in passing "if the noise is Gaussian then it's also MVUE" but no more details. If anyone can give a concise answer on the relationship between OLS, BLUE and MVUE, that would be very helpful. – divB Oct 20 '20 at 06:44
  • BTW, the references to the Lehmann–Scheffé theorem are helpful, but a great answer would discuss why only the Gaussian case is MVUE. What is the MVUE when e is, say, Rayleigh? – divB Oct 20 '20 at 06:50

1 Answer


Under the assumptions $$ \begin{align} &\mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e}, \;\mathbf X \;\text{full column rank},\\ &\mathbf e \mid \mathbf{X} \sim \mathop{\mathcal{N}}\left(\mathbf 0,\sigma^2\mathbf{I}\right),\;\sigma^2 \in \mathbb{R}_{>0} \end{align} $$ the OLS estimator $\hat{\mathbf{b}}=\left(\mathbf X^{\mathsf T}\mathbf X\right)^{-1}\mathbf X^{\mathsf T}\mathbf y$ is the UMVUE of $\mathbf b$.
This is clear from the facts that $\hat{\mathbf b}$ is unbiased and that $\mathop{\mathbb{V}}\left(\hat{\mathbf b}\right) = \sigma^2 \left(\mathbf X^{\mathsf T}\mathbf X\right)^{-1}$ is the inverse expected Fisher information of $\mathbf b$, i.e., $\hat{\mathbf b}$ attains the Cramér–Rao lower bound.
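For completeness, here is a short sketch of the Fisher-information calculation behind that claim, treating $\sigma^2$ as known (with $\sigma^2$ unknown the full information matrix is block diagonal, so the bound for $\mathbf b$ is unchanged):
$$ \begin{align} \log f\left(\mathbf y \mid \mathbf X; \mathbf b\right) &= -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\left(\mathbf y - \mathbf X\mathbf b\right)^{\mathsf T}\left(\mathbf y - \mathbf X\mathbf b\right),\\ \frac{\partial \log f}{\partial \mathbf b} &= \frac{1}{\sigma^2}\,\mathbf X^{\mathsf T}\left(\mathbf y - \mathbf X\mathbf b\right),\\ \mathcal I\left(\mathbf b\right) &= -\mathop{\mathbb{E}}\left[\frac{\partial^2 \log f}{\partial \mathbf b\,\partial \mathbf b^{\mathsf T}}\right] = \frac{1}{\sigma^2}\,\mathbf X^{\mathsf T}\mathbf X,\\ \mathcal I\left(\mathbf b\right)^{-1} &= \sigma^2\left(\mathbf X^{\mathsf T}\mathbf X\right)^{-1} = \mathop{\mathbb{V}}\left(\hat{\mathbf b}\right). \end{align} $$
Since $\hat{\mathbf b}$ is unbiased and attains this bound, no unbiased estimator, linear or not, can have smaller variance.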
This result is in a sense more general than the Gauss–Markov theorem in that it is not restricted to linear estimators. On the other hand, it applies only to linear regression with i.i.d. normal errors.
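
To see why normality matters, here is a small Monte-Carlo sketch (a hypothetical intercept-only setup, not taken from any of the references): with uniform errors, the nonlinear midrange estimator is also unbiased but has far smaller variance than OLS (the sample mean), so OLS remains BLUE there yet is not the UMVUE.

```python
import numpy as np

# Monte-Carlo sketch (hypothetical setup): intercept-only model y_i = b + e_i
# with e_i ~ Uniform(-1, 1). OLS (= sample mean) is BLUE, but the nonlinear
# midrange estimator is also unbiased and has much smaller variance.
rng = np.random.default_rng(42)
b, n, reps = 2.5, 50, 20_000

y = b + rng.uniform(-1, 1, size=(reps, n))
ols = y.mean(axis=1)                            # linear estimator
midrange = (y.min(axis=1) + y.max(axis=1)) / 2  # nonlinear estimator

print("OLS:      bias %.4f, var %.6f" % (ols.mean() - b, ols.var()))
print("midrange: bias %.4f, var %.6f" % (midrange.mean() - b, midrange.var()))
# The midrange's variance is roughly an order of magnitude smaller here,
# so OLS cannot be UMVUE once the errors are not Gaussian.
```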

Interestingly, under the more general assumptions $$ \begin{align} &\mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e}, \;\mathbf X \;\text{full column rank},\\ &\mathbf e = \left(e_1,\ldots, e_n\right)^\mathsf{T},\\ &e_1,\ldots, e_n \mid \mathbf X \overset{\text{(c.)i.i.d.}}{\sim} \left(0, \sigma^2\right),\;\sigma^2 \in \mathbb{R}_{>0}, \end{align} $$ the OLS estimator $\hat{\mathbf{b}}=\left(\mathbf X^{\mathsf T}\mathbf X\right)^{-1}\mathbf X^{\mathsf T}\mathbf y$ is also the UMVUE of $\mathbf b$ if $\hat{\mathbf{b}}$ is unbiased for all regression models that satisfy $$ \begin{align} &\mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e}, \;\mathbf X \;\text{full column rank},\\ &\mathop{\mathbb{E}}\left(\mathbf e\mid \mathbf{X}\right)=\mathbf 0,\\ &\mathbf e = \left(e_1,\ldots, e_n\right)^\mathsf{T},\\ &e_1,\ldots, e_n \mid \mathbf X \;\text{(conditionally) independent},\\ &\mathop{\mathbb{V}}\left(\mathbf e \mid \mathbf{X}\right)=\mathop{\mathrm{diag}}\left(\sigma^2_1,\ldots,\sigma^2_n\right),\; \sigma^2_i \in \mathbb{R}_{>0},\\ \end{align} $$ i.e., for all linear regression models with independent and homo- or heteroscedastic errors (in particular, not only for the data-generating class of linear regression models with independent and homoscedastic errors).


References

statmerkur
  • Why was it not answered for so long? +1. – User1865345 Nov 20 '22 at 03:30
  • Do you have any more context about this result? It seems crazy that such a basic fact was not established until this year (the proof doesn't look crazy either). – John Madden Nov 20 '22 at 04:31
  • Yes. That was exactly my point @JohnMadden. But the outline looks neat and legit. So yeah. – User1865345 Nov 20 '22 at 07:16
  • @User1865345 Oh; from here it looked like you were asking why this question on SE took so long to be answered (which is also a reasonable question). – John Madden Nov 20 '22 at 14:15
  • @JohnMadden I think it was both. But we can be contented that, at the end of the day, the question received a good answer. – User1865345 Nov 20 '22 at 14:17
  • @JohnMadden After reading parts of the 2nd reference in my (edited) answer, in particular section 5 "The Results in Hansen (2022)", and double-checking with the Hansen paper, I would say that the second part of my answer is not really a generalization of the Gauss–Markov theorem, due to the independence assumption and the requirement of unbiasedness over a class of distributions that is bigger than the assumed (smallest) class containing the DGP. – statmerkur Nov 20 '22 at 17:14
  • @User1865345 my last comment might be of interest to you, too. – statmerkur Nov 20 '22 at 17:15
  • Very well. Thanks for pointing this out @statmerkur. – User1865345 Nov 21 '22 at 01:45