I am an MBA Student taking courses in Statistics.
We are learning about different ways to estimate the parameters (i.e. coefficients) of a Regression Model. Our professor indicated that there are two main ways of doing this:
(a) Ordinary Least Squares (OLS)
(b) Maximum Likelihood Estimation (MLE)
In OLS, basically, a "line of best fit" is fitted to the data, and the regression parameters/coefficients are those of this "line of best fit". As we can see, OLS (i.e. the "line of best fit") requires very few statistical assumptions about the data - for example, OLS DOES NOT require the residuals to be Normally Distributed.
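To make this concrete, here is a minimal sketch of OLS with made-up data (all numbers are hypothetical): the coefficients come straight from minimizing the sum of squared residuals, and no probability distribution is assumed anywhere.

```python
import numpy as np

# Hypothetical made-up data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 10.9, 11.2, 14.0, 17.3])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS: minimize the sum of squared residuals; closed form solves (X'X) b = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [intercept, slope]
```

Note that nothing in this computation refers to a distribution - it is pure geometry (the "line of best fit").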
In MLE, we try to find the "most likely" set of parameters. For example, suppose we have 100 giraffes - if we ASSUME that the true underlying distribution for the heights of giraffes is a normal distribution, then we want to find the values of the "mean" and the "variance" (i.e. the "parameters") of this normal distribution that make the observed heights most likely. In the case of regression, MLE tries to find the regression parameters that, out of all candidate regression models, make the observed data most probable.
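For the giraffe example, the normal-distribution MLE actually has a closed form, so a small sketch can check it numerically (the heights below are simulated, with an assumed true mean of 5.0 and standard deviation of 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=5.0, scale=0.5, size=100)  # 100 hypothetical giraffe heights

# For a normal distribution, maximizing the log-likelihood gives closed forms:
mu_hat = heights.mean()                      # MLE of the mean
var_hat = ((heights - mu_hat) ** 2).mean()   # MLE of the variance (divides by n, not n-1)

# Normal log-likelihood of the sample as a function of (mu, var)
def loglik(mu, var):
    return -0.5 * len(heights) * np.log(2 * np.pi * var) - ((heights - mu) ** 2).sum() / (2 * var)

# Sanity check: the log-likelihood at (mu_hat, var_hat) beats nearby values of mu
assert loglik(mu_hat, var_hat) >= loglik(mu_hat + 0.1, var_hat)
assert loglik(mu_hat, var_hat) >= loglik(mu_hat - 0.1, var_hat)
```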
Our prof told us that we can imagine a regression model as a normal distribution with mean $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$ and a variance equal to $\sigma^2$.
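One way to see your prof's point is that if $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$, then the $\beta$ that maximizes the likelihood is exactly the least-squares $\beta$. A sketch with one predictor and made-up data (true coefficients $\beta_0 = 1$, $\beta_1 = 2$ are assumed for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)  # assumed true model for the demo

# Least-squares fit
X = np.column_stack([np.ones_like(x), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Normal log-likelihood as a function of the slope (intercept and sigma held fixed);
# up to constants this is just minus the sum of squared residuals
def loglik(b0, b1):
    resid = y - (b0 + b1 * x)
    return -0.5 * np.sum(resid ** 2)

# Grid search for the likelihood-maximizing slope
slopes = np.linspace(1.5, 2.5, 1001)
best_slope = slopes[np.argmax([loglik(beta_ols[0], b1) for b1 in slopes])]

print(beta_ols[1], best_slope)  # the two slopes agree (up to grid resolution)
```

Maximizing the normal likelihood and minimizing squared residuals pick the same line, which is why OLS and MLE coincide for this model.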
Our prof then told us about the Method of Moments. He explained the idea of the "first moment", "second moment", ..., "$k$-th moment", and how these moments relate to expected values. For example, the "first moment" is the "mean", and the "variance" is a function of the first moment and the second moment.
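The variance-as-a-function-of-moments point can be checked directly: the variance is $E[X^2] - (E[X])^2$, i.e. the second moment minus the first moment squared. A tiny sketch with made-up numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

m1 = np.mean(x)        # sample first moment (estimates E[X])
m2 = np.mean(x ** 2)   # sample second moment (estimates E[X^2])

var_mom = m2 - m1 ** 2  # method-of-moments variance: E[X^2] - (E[X])^2
print(var_mom)          # → 4.0, which matches np.var(x)
```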
He then told us about the Generalized Method of Moments (GMM), and that the regression parameters can also be estimated using GMM. GMM requires you to solve a system of equations to get the parameter estimates, and supposedly this system of equations is easier to solve than the MLE optimization problem, especially back in the day when computers were not as powerful (I am not sure if this is true, or why it would be).
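For linear regression, the moment conditions are $E[x_j(y - x^\top\beta)] = 0$ for every regressor $x_j$. Replacing the expectations with sample averages and setting them to zero gives the "normal equations" $X^\top X\beta = X^\top y$ - a plain linear system, which would explain why it was cheap to solve before modern optimizers. A sketch with made-up data and assumed true coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.5 * x1 - 2.0 * x2 + rng.normal(size=n)  # assumed true coefficients for the demo

X = np.column_stack([np.ones(n), x1, x2])

# Sample moment conditions: (1/n) * X'(y - X b) = 0  =>  X'X b = X'y
beta_gmm = np.linalg.solve(X.T @ X, X.T @ y)

# Check: the sample moments evaluated at beta_gmm really are (numerically) zero
moments = X.T @ (y - X @ beta_gmm) / n
print(beta_gmm, moments)
```

Note that solving this system gives exactly the OLS estimate, which is why GMM and OLS are said to be closely related for linear regression.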
So far, everything makes sense to me, but here is where I get mixed up:
The Method of Moments requires you to set each "moment condition" equal to 0. I don't understand why this is necessary. In MLE, you take derivatives and set them equal to zero, because a zero derivative corresponds to the "maximum point" of the (log-)likelihood curve. But there appears to be no differentiation step in the Method of Moments. So why do the moment conditions have to be set to 0? My guess is that this is just for "mathematical convenience", to make the system of equations solvable (e.g. back in Linear Algebra class, we learned that a system of equations sometimes needs extra conditions to be solvable - think underdetermined vs. overdetermined systems).
Our prof mentioned that the Generalized Method of Moments is closely related to OLS. He mentioned that the Generalized Method of Moments is more "robust" than MLE, because the Method of Moments does not require you to ASSUME a probability distribution (as is done in MLE). But you calculate moments by taking the "Expected Value" - and the "Expected Value" depends on a probability distribution! For example, the estimate of the mean of a Normal distribution is $\sum_i x_i/n$ - but the estimate of the rate of an Exponential distribution is $n/\sum_i x_i$. Therefore, why is the Method of Moments said NOT to depend on a probability distribution, and considered more robust than MLE - when it clearly does depend on a probability distribution?
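One detail worth separating out here: the population expected value $E[X]$ does depend on the distribution, but the sample moment $\sum_i x_i/n$ used to estimate it does not - the same formula estimates $E[X]$ whatever the data-generating distribution is. A sketch, using simulated data from two different distributions that both have mean 2 (parameters assumed for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)

normal_data = rng.normal(loc=2.0, scale=1.0, size=100_000)  # E[X] = 2
expo_data = rng.exponential(scale=2.0, size=100_000)        # E[X] = 2 as well

# The same sample moment estimates E[X] in both cases, with no
# distributional assumption entering the calculation
print(normal_data.mean(), expo_data.mean())  # both close to 2
```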