3

I am familiar with OLS, in which we minimize the sum of squared residuals to obtain the estimator. Why do we not use the same approach for GLM, but instead use maximum likelihood estimation?

Firebug
  • 19,076
Xtiaan
  • 171
    I don't understand the question. Are you asking "why we don't use the least-square objective function for GLMs", or "why we don't minimize a function to obtain an estimator in GLMs"? – Firebug Jan 18 '23 at 16:53
  • @Firebug I hadn't even thought of that interpretation of the question, and I have edited my answer to address it. – Dave Jan 18 '23 at 23:49
  • My question is "why we don't use the least-square objective function for GLMs" – Xtiaan Jan 19 '23 at 09:48

3 Answers

2

We sometimes do use least squares instead of maximum likelihood estimation. For instance, in a linear probability model, we assume a conditional binomial distribution, for which the loss function corresponding to maximum likelihood estimation is log loss.

$$ L(y,\hat y)=-\sum_{i=1}^N\left[ y_i\log\left(\hat y_i\right)+ \left( 1-y_i \right)\log\left( 1-\hat y_i \right) \right] $$

However, the common approach in a linear probability model is to minimize square loss, anyway.

There are legitimate criticisms of this, but they have not stopped people from doing their modeling this way, so it is not really true that we estimate GLMs with maximum likelihood instead of least squares.
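To make the two objective functions concrete, here is a minimal numpy sketch (made-up toy outcomes and fitted probabilities, purely illustrative) that evaluates both the log loss above and the square loss on the same predictions:

```python
import numpy as np

# Toy binary outcomes and fitted probabilities (hypothetical values).
y = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0.2, 0.7, 0.9, 0.4, 0.6])

def log_loss(y, y_hat):
    # Negative Bernoulli log-likelihood, as in the formula above.
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def square_loss(y, y_hat):
    # Sum of squared residuals, the objective minimized by least squares.
    return np.sum((y - y_hat) ** 2)

print(log_loss(y, y_hat))     # maximum likelihood objective
print(square_loss(y, y_hat))  # least-squares objective
```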

When we do use maximum likelihood instead of least squares, however, we do so because maximum likelihood estimators have some excellent properties. They are consistent. They are asymptotically efficient. If we know the likelihood, maximum likelihood estimation is tough to beat.

(MLE is tough to beat, not impossible to beat. For instance, James-Stein estimation beats maximum likelihood estimation, at least in some sense.)$^{\dagger}$

...and if you're wondering why we maximize something in GLMs when we minimize something in OLS, they're two sides of the same coin. The name "maximum likelihood estimation" refers to maximizing a likelihood function $L$, and the exact same parameter estimates arise from minimizing $-L$ (or, equivalently, the negative log-likelihood). In fact, the square loss in OLS is related to Gaussian maximum likelihood estimation in just this way: $\sum_{i=1}^N\left(y_i - \hat y_i\right)^2$ is the loss function we would minimize, and the Gaussian log-likelihood equals $-\sum_{i=1}^N\left(y_i - \hat y_i\right)^2$ up to additive and multiplicative constants that do not depend on the coefficients, so either approach gives the same parameter estimates. Viewing a classical linear model as a GLM with a Gaussian likelihood and then estimating the coefficients via maximum likelihood will result in the exact same coefficient estimates as the usual OLS.
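As a quick sanity check of that last claim, here is a minimal sketch (assuming statsmodels and using simulated data with arbitrary coefficients) that fits the same model by OLS and as a Gaussian-family GLM; the two sets of coefficient estimates should agree up to numerical precision.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))  # intercept + two predictors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

ols_fit = sm.OLS(y, X).fit()                                 # least squares
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # Gaussian maximum likelihood

print(ols_fit.params)
print(glm_fit.params)  # identical coefficients up to numerical precision
```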

$^{\dagger}$That sense is square loss. As has been pointed out in the comments, if an analyst values something other than square loss (whether it is unbiasedness or ease of explanation to a customer who is familiar with OLS but unfamiliar with and unlikely to trust James-Stein), OLS could be considered better.

Dave
  • 62,186
  • James-Stein estimation is optimal regarding the mean squared error, but surely not optimal with regards to other, sometimes more important, measures – Firebug Jan 18 '23 at 16:51
  • 2
    James-Stein proves that the usual sample mean is inadmissible for a multivariate mean with dimension higher than 2. But James-Stein in turn is inadmissible, and to my recollection the estimator that was first used to show that James-Stein is inadmissible is itself inadmissible. Maybe it's turtles all the way down. – Glen_b Jan 19 '23 at 02:50
1

Why do we not use the same approach for GLM, but instead use maximum likelihood estimation?

It is because, by definition, a GLM generalises the OLS approach and allows cost functions other than just the least squares method (least squares is still included as a special case).

More precisely, a GLM makes two generalisations, and only one of them steers away from the use of least squares.

  • Generalisation of the functions that describe the mean. Instead of a linear function $E[Y|X] = X\beta$ we use a linear function inside another function $E[Y|X] = f(X\beta)$.

    With this generalisation we can still use least squares. See for instance the fitting of an exponential function with Gaussian distributed noise (What is the objective function to optimize in glm with gaussian and poisson family?); the sketch after this list makes the comparison concrete.

  • Generalisation of the functions that describe the conditional distribution, $y \mid \mu_Y \sim f(y;\mu_Y)$, where $f$ can be any member of a one-parameter exponential dispersion family.

    It is here that the discrepancy occurs. When $f$ is a Gaussian distribution, the maximum likelihood estimate is equivalent to the least squares estimate; for other distributions it generally is not.
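As a rough sketch of both generalisations (assuming a recent statsmodels and made-up data), the code below fits the same exponential mean $E[Y|X] = \exp(X\beta)$ twice: once as a Gaussian-family GLM with a log link, whose maximum likelihood fit coincides with (nonlinear) least squares, and once as a Poisson-family GLM, where the likelihood is no longer least squares.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = sm.add_constant(rng.uniform(0, 2, size=n))  # intercept + one predictor
mu = np.exp(X @ np.array([0.5, 1.0]))           # exponential mean function
y = rng.poisson(mu)                             # count response

# Same mean function exp(X @ beta), two different likelihoods:
gauss_log = sm.GLM(y, X, family=sm.families.Gaussian(link=sm.families.links.Log())).fit()
poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()

print(gauss_log.params)  # equivalent to least squares on the exponential mean
print(poisson.params)    # Poisson maximum likelihood
```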


Possibly your question is not 'why is GLM different' (it is different by definition) but instead 'why do we use it'.

The reason we use a GLM with different distributions and the associated maximum likelihood estimation is that it improves the performance of the results (the error of the estimates will be smaller).

0

I think you are confusing levels of explanation. Maximum likelihood also minimises an objective function (the negative log-likelihood). In particular, for Gaussian errors and linear regression you get OLS; for binary responses you get the log-loss objective function.

Maximum likelihood identifies the objective function based on a particular model of your data, which means that you can draw further conclusions, e.g. obtain confidence intervals for your parameter estimates.
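For instance, here is a minimal sketch (assuming scipy and simulated Gaussian data, with hypothetical parameter values) that minimises the negative log-likelihood directly: the coefficient estimates match OLS, and the curvature of the objective gives approximate standard errors and confidence intervals.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def neg_log_lik(theta):
    # Gaussian negative log-likelihood; theta = (beta0, beta1, log_sigma).
    beta, log_sigma = theta[:2], theta[2]
    sigma = np.exp(log_sigma)
    resid = y - X @ beta
    return 0.5 * np.sum(resid**2) / sigma**2 + n * log_sigma + 0.5 * n * np.log(2 * np.pi)

res = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS")
beta_hat = res.x[:2]                     # matches the OLS coefficients
se = np.sqrt(np.diag(res.hess_inv))[:2]  # approximate standard errors from the inverse Hessian
ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])
print(beta_hat)
print(ci)  # approximate 95% confidence intervals from the likelihood curvature
```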

seanv507
  • 6,743