
I took a machine learning course that used the book "Learning from Data: A Short Course" by Hsuan-Tien Lin, Malik Magdon-Ismail, and Yaser Abu-Mostafa (LFD). There, linear regression is presented roughly as follows:

You are given a set of examples $\{x_n\}$ and a set of labels $\{y_n\}$. The linear regression model is the function $\widehat{y}(x) = w^T x$, where $w$ is a vector of parameters learned by minimizing the sum of squared errors between $y_n$ and $\widehat{y}(x_n)$. The end (LFD, page 84).
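For concreteness, this is how I understand that recipe; a minimal NumPy sketch, where the data, the sizes, and the variable names are arbitrary choices of mine and not from LFD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # design matrix, one row per example x_n
y = rng.normal(size=100)         # labels y_n -- no generative story assumed

# w minimizes the sum of squared errors  sum_n (y_n - w^T x_n)^2;
# lstsq gives the closed-form least-squares solution via the design matrix X
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w                    # fitted values  y_hat(x_n) = w^T x_n
```

Nothing in this calculation assumes anything about how the $y_n$ came to be.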

However, when I open up other textbooks (virtually any ML/statistics textbook), they all say something like:

We assume the labels $y$ were generated via $y = w^T x + \epsilon$ where $\epsilon$ is some iid Gaussian error.
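In code, I read that assumption as something like the following sketch (again NumPy; the "true" $w$ and the noise scale are numbers I made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])      # made-up "true" parameters
X = rng.normal(size=(1000, 3))
eps = rng.normal(scale=0.3, size=1000)   # iid Gaussian error
y = X @ w_true + eps                     # y = w^T x + eps

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                             # comes out close to w_true
```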

But why? This is very strange and unwieldy, no? In the way it is taught in LFD, there is no need to assume that the labels are generated according to additive iid Gaussian noise. This assumption can never be verified in practice.

Can someone make sense of what I am seeing? Is LFD taking a new approach to teaching linear regression?

After all, all I care about is drawing a linear relationship between the labels and my data. I don't care whether the labels were generated in some specific way ($\approx 0\%$ chance of being true in practice).

  • $ y = \hat{y}(x) + \epsilon$ – Sycorax Jul 10 '23 at 23:10
  • @Sycorax-OnStrike Why do we assume $\epsilon$ is additive? Why do we assume it is Gaussian? If I just give you two sets of randomly generated integers and ask you to find a line of best fit between them, that line takes the form $w^T x$. That line exists regardless of any $\epsilon$. That line always exists. There is no need to assume anything. – Curaçao Hajek Jul 10 '23 at 23:17
  • How do you calculate that $w$? – Dave Jul 10 '23 at 23:17
  • @Dave Closed form expression through the design matrix $X$ would be fine. – Curaçao Hajek Jul 10 '23 at 23:19
  • It's true that you can always calculate a minimizing $w$ (solution uniqueness issues aside). But this doesn't make it meaningful. To convince yourself that some assumptions on $\epsilon$ are necessary in order for this estimate to be meaningful, try generating some data from a linear regression model with Cauchy distributed error (a sketch of this experiment appears after the comments below). (From a pedagogical standpoint, authors will tend to assume normality or omit this assumption depending on whether they intend to develop confidence intervals/p-values, which rely on it). – John Madden Jul 10 '23 at 23:32
  • @JohnMadden Thanks. But there is still a problem: linear regression can be seen as a special case of (say, deep, trillion-parameter) neural networks, so how does the assumption on $\epsilon$ generalize from linear regression to neural networks? Will it still be Gaussian and additive? If not, doesn't that mean this assumption has some problems, given that deep neural networks can do anything linear regression does, but without needing to assume even the existence of such an $\epsilon$? – Curaçao Hajek Jul 11 '23 at 00:03
  • The issue is not with the form of our parametric predictor (neural nets vs linear regression), but rather with the MSE loss function itself, which is sensitive to outliers. Are you already aware of the relationship between MSE and Gaussian MLE, see e.g. https://stats.stackexchange.com/questions/143705/maximum-likelihood-method-vs-least-squares-method ? This indicates that when we're doing MSE, we are acting as though we believe the errors are normally distributed. – John Madden Jul 11 '23 at 00:26
  • @JohnMadden That assumes maximum likelihood estimation, which is a popular estimation approach, sure, but not the only way to estimate. – Dave Jul 11 '23 at 00:33
  • @Dave oh, maybe I should have spelled it out: the "MLE" in my comment indeed refers to maximum likelihood estimation. – John Madden Jul 11 '23 at 01:19
  • Unless you want to make an interpolating curve, you do need to assume the existence of some $\epsilon$; otherwise, how to explain a less-than-perfect fit? – jbowman Jul 11 '23 at 01:31
  • I’m really not sure how the proposed duplicate answers this question, is there a particular section people have in mind? The proposed duplicate seems to address Gaussian errors vs errors with other distributions, while this seems to be about why we consider an error term at all (and why it is additive), which seems different. Any thoughts, @jbowman ? (A big +1 to your comment above, by the way. I was hoping that would form the beginning of an answer.) – Dave Jul 11 '23 at 04:43
  • @Dave Exactly. Why do we need to consider an error term? What is the explicit form of the error term when doing imagenet classification using residual U-net? Is it possible to ever figure that out? If not, then isn't it more theoretically convenient to not bother with it at all? This is similar to: suppose I wish to predict weather, I know that a butterfly's wing flap could cause a tornado, but it is probably in my best interest to ignore dealing with that factor. – Curaçao Hajek Jul 11 '23 at 04:51
  • I am hopeful that someone can point out what about the proposed duplicate addresses this. Right now, I see a related question that might be worth reading, but I do not see a duplicate. I will vote to reopen (I think my vote would be binding and automatically result in a reopening) and might post an answer if particular sections of the proposed duplicate cannot be cited as answering the question asked here (it might be that the proposed duplicate absolutely does it, though). – Dave Jul 11 '23 at 04:59
  • Write down the Gaussian negative log-likelihood and the MSE objective falls out, giving an immediate and deep answer to the question. $$-\log \prod_i \exp( -( \hat y_i - y_i )^2 )=\sum_i (\hat y_i - y_i)^2 $$ – Sycorax Jul 11 '23 at 17:51
  • @Dave This site is so typical: if the power users cannot answer a question they will either get very angry and downvote a post to death, or close a question, or just pretend that the question is already answered. – Curaçao Hajek Jul 11 '23 at 23:20
  • I don’t find that to be the case on here at all. Ever since I first registered, I have found a community of people eager to help. – Dave Jul 11 '23 at 23:51
  • @CuraçaoHajek it's kind of strange to ignore half of the comments offering help and then complain that no one is helping, isn't it? – John Madden Jul 12 '23 at 13:10
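Following John Madden's suggestion in the comments, here is a minimal sketch of the Cauchy-error experiment; the sample size, the "true" $w$, and the noise scales are all arbitrary choices of mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([2.0, -1.0])               # made-up "true" parameters
X = rng.normal(size=(500, 2))

for noise in ("gaussian", "cauchy"):
    if noise == "gaussian":
        eps = rng.normal(size=500)
    else:
        eps = rng.standard_cauchy(size=500)  # heavy tails, no finite variance
    y = X @ w_true + eps
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(noise, w_hat)  # the Cauchy fit is typically far from w_true
```

Under Gaussian errors the least-squares estimate lands near `w_true`; under Cauchy errors it is routinely thrown far off by a handful of extreme residuals, which is the sense in which assumptions on $\epsilon$ make the MSE fit meaningful.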

0 Answers