1

Consider any random variable $X$ and any random sample $(X_1,\dots, X_n)$ such that $X_i \sim X$.

As is well-known, $E(X)$ is the constant that minimizes the MSE of $X$, i.e., $E(X) = \arg\min_a E[(a-X)^2]$.
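
Concretely, assuming $X$ has finite variance, $$E[(a-X)^2] = E[(a - E(X) + E(X) - X)^2] = (a-E(X))^2 + \operatorname{Var}(X),$$ which is minimized exactly at $a = E(X)$.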

It seems that the minimizer does not change if we consider minimization over statistics from the random sample $m(X_1,\dots, X_n)$ instead, i.e., $\arg\min_{m(X_1,\dots, X_n)} E[(m(X_1,\dots, X_n)-X)^2]$ is the constant function $m(X_1,\dots, X_n) = E(X)$ for all $(X_1,\dots, X_n)$.

Informally, the random sample does not really matter here. If I ask "what is the best function of the random sample $(X_1,\dots, X_n)$ for predicting $X$?" (where "best" means lowest MSE), the answer remains the constant function that always equals $E(X)$: it ignores the sample entirely, but it "implicitly assumes" you know $E(X)$.

What I am trying to do, if that's possible at all, is to formalize the idea/intuition that, if you did not know anything about $X$ --- and in particular, if you knew nothing about $E(X)$ --- then the best prediction would be $m(X_1,\dots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i$.

My question is: Is there a standard statistical sense in which this is true?

I've tried to formalize this in a Bayesian sense, but I am struggling to

  1. Model the idea that "you don't know anything about the distribution" (to the point where I doubt whether that's feasible at all), and
  2. Make a general argument that's not dependent on a particular class of distribution (again, to the point where I doubt whether that's feasible at all).

Essentially, any assumption I make about the distribution of $X$ or about a prior for $E(X) = \theta$ is tantamount to assuming "something is known" about the distribution of $X$, and that assumption ends up creeping into the solution for the minimizer.

I am guessing I should use an uninformative prior for $\theta$ (https://www.statlect.com/fundamentals-of-statistics/uninformative-prior), perhaps by taking the limit of a "flatter and flatter" and "wider and wider" discrete uniform for $\theta$. But I still don't know how to make "as few assumptions as possible" about the distribution of $X$ itself and then argue that $m(X_1,\dots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i$ is the best predictor (if that's doable at all).
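
The one concrete case I can work out is, of course, parametric, which is exactly what I'd like to avoid: if I assume $X_i \mid \theta \sim N(\theta, \sigma^2)$ with $\sigma^2$ known and a flat (improper) prior on $\theta$, then $\theta \mid X_1,\dots,X_n \sim N(\bar{X}, \sigma^2/n)$, so the Bayes predictor of a new $X$ under squared error is the posterior predictive mean $$E[X \mid X_1,\dots,X_n] = E[\theta \mid X_1,\dots,X_n] = \bar{X}.$$ But this leans entirely on the normality assumption, so it doesn't really capture "knowing nothing about the distribution of $X$".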

Clarification following Dave's answer: I think I understand that the sample average $\bar{X}$ is the best MSE estimator of $E(X)$. I also understand that $E(X)$ is the best MSE predictor of $X$ itself. Part of what I am missing seems to be a connection between these two facts that would allow me to conclude that $\bar{X}$ is the best MSE predictor of $X$. What I am after is a sense in which $\bar{X} = \arg\min_{m(X_1,\dots, X_n)} E[(m(X_1,\dots, X_n)-X)^2]$. I understand that, as is, the latter is wrong since the minimizer is a constant function that always equals $E(X)$. But I am wondering whether it is true in some sense provided one does not have any information about the distribution of $X$ (and in particular about $E(X)$).

FZS
  • 395
  • 1
    In order to succeed with this, you need to assume $X$ has a finite variance. The result is then immediate. It's often phrased in terms of operators on the $L^2$ space of finite-variance random variables: expectation is the projection onto the subspace of constant variables. See https://math.stackexchange.com/questions/1586810 for instance, or https://math.stackexchange.com/questions/2149093, https://math.stackexchange.com/questions/2442767, etc. – whuber Dec 26 '22 at 15:36
  • 1
    Notice that $E[(m(\ldots)-X)^2]=E[(m(\ldots)-\theta)^2]+\operatorname{Var}(X)$ demonstrates that the problem of finding the best MSE predictor of $X$ is equivalent to finding the best MSE estimator of $\theta.$ – whuber Dec 26 '22 at 16:52

2 Answers

1

Ok, with the help of Whuber's comments and Dave's answer, I think I now have an answer to part of my question.

Let $\textbf{X} = (X_1, \dots, X_n)$. As Whuber suggested in the comments:

$$\begin{align} E[(m(\textbf{X})-X)^2] & = E[(m(\textbf{X}) - E(X) + E(X)- X)^2]\\ & = E[(m(\textbf{X}) - E(X))^2] + 2\,E\big[(m(\textbf{X}) - E(X))(E(X)- X)\big] + E[(E(X)- X)^2]\\ & = E[(m(\textbf{X}) - E(X))^2] + \underbrace{2\,E[m(\textbf{X}) - E(X)]\,E[E(X)- X]}_{=0} + E[(E(X)- X)^2] \\ & = E[(m(\textbf{X}) - E(X))^2] + \operatorname{Var}(X) \end{align}$$

(The cross term factors into a product of expectations because the new draw $X$ is independent of the sample $\textbf{X}$, and it vanishes because $E[E(X)- X] = 0$.)

So minimizing the prediction error $E[(m(\textbf{X})-X)^2]$ is equivalent to minimizing the estimation error $E[(m(\textbf{X}) - E(X))^2]$.

Of course, the minimizer thereof remains $m(\textbf{X}) = E(X)$. That is, if you knew $E(X)$, you could still use that value as your best prediction and disregard any information from the random sample $\textbf{X}$.

But if you don't know $E(X)$, "a good way" to estimate it, and thereby keep $E[(m(\textbf{X}) - E(X))^2]$ small, is to use $m(\textbf{X}) = \bar{\textbf{X}}$. I still don't quite understand in what sense (or rather, under what informational conditions on the distribution of $X$) $m(\textbf{X}) = \bar{\textbf{X}}$ actually minimizes $E[(m(\textbf{X}) - E(X))^2]$, but I might be able to live with that.
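
As a sanity check, here is a quick Monte Carlo sketch of the decomposition (the exponential distribution, $n = 10$, and the specific numbers are purely illustrative choices; any finite-variance distribution would do): the prediction MSE of $\bar{\textbf{X}}$ exceeds $\operatorname{Var}(X)$ by roughly the estimation error $E[(\bar{\textbf{X}} - E(X))^2] \approx \operatorname{Var}(X)/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000
mean, var = 2.0, 4.0  # Exponential with scale 2: E(X) = 2, Var(X) = 4 (illustrative choice)

# Each replication: a sample X_1..X_n and one independent new X to predict
samples = rng.exponential(scale=2.0, size=(reps, n))
x_new = rng.exponential(scale=2.0, size=reps)
xbar = samples.mean(axis=1)

mse_const = np.mean((mean - x_new) ** 2)  # predictor = E(X); theory: Var(X)
mse_xbar = np.mean((xbar - x_new) ** 2)   # predictor = sample mean; theory: Var(X) + Var(X)/n
est_err = np.mean((xbar - mean) ** 2)     # estimation error; theory: Var(X)/n

print(f"MSE of constant E(X): {mse_const:.3f} (theory {var:.3f})")
print(f"MSE of sample mean:   {mse_xbar:.3f} (theory {var + var / n:.3f})")
print(f"E[(Xbar - E(X))^2]:   {est_err:.3f} (theory {var / n:.3f})")
```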

FZS
  • 395
  • It appears that, however good it is as a rule of thumb, $m(\textbf{X}) = \bar{\textbf{X}}$ does not always minimize $E[(m(\textbf{X}) - E(X))^2]$, not even within the class of estimators that are unbiased w.r.t. $E(X)$. See https://stats.stackexchange.com/questions/375101/question-about-mse-mean-square-error. – FZS Feb 21 '23 at 15:38
0

It’s just a calculus problem.

$$\bar X= \frac{1}{n}\sum_{i=1}^n X_i =\underset{\theta\in\mathbb R}{\arg\min}\left\{ \sum_{i=1}^n ( X_i-\theta )^2 \right\} $$

When you differentiate the sum of squares in the $\arg\min$, you find that the derivative is zero when $\theta=\bar X$ for any numbers $X_i$, and further calculus shows that to be the minimum.
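
Explicitly, $$\frac{d}{d\theta}\sum_{i=1}^n (X_i-\theta)^2 = -2\sum_{i=1}^n (X_i-\theta) = 0 \iff \theta = \frac{1}{n}\sum_{i=1}^n X_i = \bar X,$$ and the second derivative is $2n > 0$, so this stationary point is indeed the minimum.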

Dave
  • 62,186
  • Thank you for your answer. I had a hard time trying to explain what my question was about. Sorry if I wasn't clear enough. I think I understand that the sample average $\bar{X}$ is the best MSE estimator of $E(X)$. I also understand that $E(X)$ is the best MSE predictor of $X$ itself. Part of what I am missing seems to be a connection between these two facts that would allow me to conclude that the sample average $\bar{X}$ is the best MSE predictor of $X$. – FZS Dec 26 '22 at 16:11
  • What I am after is a sense in which $\bar{X} = \arg\min_{m(X_1,\dots, X_n)} E[(m(X_1,\dots, X_n)-X)^2]$ rather than a sense in which $\bar X =\underset{\theta\in\mathbb R}{\arg\min}\left\{\sum_{i=1}^n(X_i-\theta)^2\right\}$. As I tried to explain in the beginning of my question, I understand that $\bar{X} = \arg\min_{m(X_1,\dots, X_n)} E[(m(X_1,\dots, X_n)-X)^2]$ is wrong since the minimizer is a constant function always equal to $E(X)$. But I am wondering whether it is true in some sense given that one does not have any information about the distribution of $X$ (and in particular about $E(X)$). – FZS Dec 26 '22 at 16:14