I am an MBA Student taking courses in Statistics.
We are learning about different ways to estimate the parameters (i.e. coefficients) of a Regression Model. Our professor indicated that there are two main ways of doing this:
(a) Ordinary Least Squares (OLS)
(b) Maximum Likelihood Estimation (MLE)
In OLS, basically, a "line of best fit" is fitted to the data, and the regression parameters/coefficients are those of this "line of best fit". As we can see, OLS (i.e. the "line of best fit") requires very few statistical assumptions about the data - for example, OLS DOES NOT require the residuals to be Normally Distributed.
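To make this concrete, here is a minimal sketch of OLS with made-up data (all numbers are hypothetical): the coefficients come straight from minimizing the sum of squared residuals, and no probability distribution is assumed anywhere.

```python
import numpy as np

# Hypothetical made-up data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 10.9, 11.2, 14.0, 17.3])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS: minimize the sum of squared residuals; closed form solves (X'X) b = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [intercept, slope]
```

Note that nothing in this computation refers to a distribution - it is pure geometry (the "line of best fit").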
In MLE, we try to find the "most likely" set of parameters. For example, suppose we have 100 giraffes - if we ASSUME that the true underlying distribution for the heights of giraffes is a normal distribution, then we want to find the values of the "mean" and the "variance" (i.e. the "parameters") of this normal distribution that make the observed heights most likely. In the case of regression, MLE tries to find the regression parameters that, out of all candidate regression models, make the observed data most probable.
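For the giraffe example, the normal-distribution MLE actually has a closed form, so a small sketch can check it numerically (the heights below are simulated, with an assumed true mean of 5.0 and standard deviation of 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=5.0, scale=0.5, size=100)  # 100 hypothetical giraffe heights

# For a normal distribution, maximizing the log-likelihood gives closed forms:
mu_hat = heights.mean()                      # MLE of the mean
var_hat = ((heights - mu_hat) ** 2).mean()   # MLE of the variance (divides by n, not n-1)

# Normal log-likelihood of the sample as a function of (mu, var)
def loglik(mu, var):
    return -0.5 * len(heights) * np.log(2 * np.pi * var) - ((heights - mu) ** 2).sum() / (2 * var)

# Sanity check: the log-likelihood at (mu_hat, var_hat) beats nearby values of mu
assert loglik(mu_hat, var_hat) >= loglik(mu_hat + 0.1, var_hat)
assert loglik(mu_hat, var_hat) >= loglik(mu_hat - 0.1, var_hat)
```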
Our prof told us that we can imagine a regression model as a normal distribution with mean $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$ and a variance equal to $\sigma^2$.
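One way to see your prof's point is that if $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$, then the $\beta$ that maximizes the likelihood is exactly the least-squares $\beta$. A sketch with one predictor and made-up data (true coefficients $\beta_0 = 1$, $\beta_1 = 2$ are assumed for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)  # assumed true model for the demo

# Least-squares fit
X = np.column_stack([np.ones_like(x), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Normal log-likelihood as a function of the slope (intercept and sigma held fixed);
# up to constants this is just minus the sum of squared residuals
def loglik(b0, b1):
    resid = y - (b0 + b1 * x)
    return -0.5 * np.sum(resid ** 2)

# Grid search for the likelihood-maximizing slope
slopes = np.linspace(1.5, 2.5, 1001)
best_slope = slopes[np.argmax([loglik(beta_ols[0], b1) for b1 in slopes])]

print(beta_ols[1], best_slope)  # the two slopes agree (up to grid resolution)
```

Maximizing the normal likelihood and minimizing squared residuals pick the same line, which is why OLS and MLE coincide for this model.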
Our prof then told us about the Method of Moments. He explained the idea of the "first moment", "second moment", ..., "$k$-th moment", and how these moments relate to expected values. For example, the "first moment" is the "mean", and the "variance" is a function of the first moment and the second moment.
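The variance-as-a-function-of-moments point can be checked directly: the variance is $E[X^2] - (E[X])^2$, i.e. the second moment minus the first moment squared. A tiny sketch with made-up numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

m1 = np.mean(x)        # sample first moment (estimates E[X])
m2 = np.mean(x ** 2)   # sample second moment (estimates E[X^2])

var_mom = m2 - m1 ** 2  # method-of-moments variance: E[X^2] - (E[X])^2
print(var_mom)          # → 4.0, which matches np.var(x)
```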
He then told us about the Generalized Method of Moments (GMM), and that the regression parameters can also be estimated using GMM. GMM requires you to solve a system of equations to get the parameter estimates, and supposedly this system of equations is easier to solve than the MLE optimization problem, especially back in the day when computers were not as powerful (I am not sure if this is true, or why it would be).
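For linear regression, the moment conditions are $E[x_j(y - x^\top\beta)] = 0$ for every regressor $x_j$. Replacing the expectations with sample averages and setting them to zero gives the "normal equations" $X^\top X\beta = X^\top y$ - a plain linear system, which would explain why it was cheap to solve before modern optimizers. A sketch with made-up data and assumed true coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.5 * x1 - 2.0 * x2 + rng.normal(size=n)  # assumed true coefficients for the demo

X = np.column_stack([np.ones(n), x1, x2])

# Sample moment conditions: (1/n) * X'(y - X b) = 0  =>  X'X b = X'y
beta_gmm = np.linalg.solve(X.T @ X, X.T @ y)

# Check: the sample moments evaluated at beta_gmm really are (numerically) zero
moments = X.T @ (y - X @ beta_gmm) / n
print(beta_gmm, moments)
```

Note that solving this system gives exactly the OLS estimate, which is why GMM and OLS are said to be closely related for linear regression.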
So far, everything makes sense to me, but here is where I get mixed up:
The Method of Moments requires you to set each "moment condition" equal to 0. I don't understand why this is necessary. In MLE, you take derivatives and set them equal to zero, because a zero derivative corresponds to the "maximum point" of the (log-)likelihood curve. But there appears to be no differentiation step in the Method of Moments. So why do the moment conditions have to be set to 0? My guess is that this is just for "mathematical convenience", to make the system of equations solvable (e.g. back in Linear Algebra class, we learned that a system of equations sometimes needs extra conditions to be solvable - think underdetermined vs. overdetermined systems).
Our prof mentioned that the Generalized Method of Moments is closely related to OLS. He mentioned that the Generalized Method of Moments is more "robust" than MLE, because the Method of Moments does not require you to ASSUME a probability distribution (as is done in MLE). But you calculate moments by taking the "Expected Value" - and the "Expected Value" depends on a probability distribution! For example, the estimate of the mean of a Normal distribution is $\sum_i x_i/n$ - but the estimate of the rate of an Exponential distribution is $n/\sum_i x_i$. Therefore, why is the Method of Moments said NOT to depend on a probability distribution, and considered more robust than MLE - when it clearly does depend on a probability distribution?
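One detail worth separating out here: the population expected value $E[X]$ does depend on the distribution, but the sample moment $\sum_i x_i/n$ used to estimate it does not - the same formula estimates $E[X]$ whatever the data-generating distribution is. A sketch, using simulated data from two different distributions that both have mean 2 (parameters assumed for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)

normal_data = rng.normal(loc=2.0, scale=1.0, size=100_000)  # E[X] = 2
expo_data = rng.exponential(scale=2.0, size=100_000)        # E[X] = 2 as well

# The same sample moment estimates E[X] in both cases, with no
# distributional assumption entering the calculation
print(normal_data.mean(), expo_data.mean())  # both close to 2
```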