
Let’s set up a supervised learning problem with $p$ predictors and $n$ observations. The response variable is univariate. The problem can be regression or classification, though I think a classification problem introduces additional complexity if there are more than two categories.

Euclidean distance is closely related to square loss: take the two vectors to be the $n$-dimensional vector of predictions and the $n$-dimensional vector of observed responses. Yet Euclidean distance is known to behave badly when the dimension is high (whatever “high” means).

Why is Euclidean distance not a good metric in high dimensions?

Nonetheless, square loss is popular, even when there are hundreds of observations and therefore dimensions (which should be enough to trigger some of the bizarre behavior of the Euclidean norm).
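To make the worry concrete, here is a small simulation sketch (plain numpy; the setup and names are only illustrative): for random points, the relative gap between the nearest and farthest Euclidean distances shrinks as the dimension grows, which is the usual sense in which the metric “loses contrast” in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# For random points in [0, 1]^d, the farthest neighbour is barely farther than
# the nearest one once d is large: Euclidean distances "concentrate".
for d in [2, 10, 100, 1000]:
    points = rng.uniform(size=(200, d))          # 200 reference points in R^d
    query = rng.uniform(size=d)                  # one query point
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
```

The relative contrast shrinks steadily as $d$ grows, even though nothing about the points changes except their dimension.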

Yes, if we use square loss for OLS regression under the Gauss-Markov conditions, we get the BLUE. Yes, we can do all sorts of inference on the parameters. However, I am thinking of a pure prediction problem, perhaps a complicated neural network. In that kind of prediction problem, the inference and interpretation are much less important than the predictive accuracy.

So why use square loss when there are many observations?

EDIT

I say that the observations $y$ and predictions $\hat y$ are vectors, even for a univariate response, because they must be for the following to make sense as linear transformations.

$$
\begin{align}
\hat\beta_{ols} &= (X^TX)^{-1}X^Ty \\
\hat y &= X(X^TX)^{-1}X^Ty
\end{align}
$$

The upper equation represents a linear transformation of a vector $y\in\mathbb R^n$ to $\hat\beta_{ols}\in\mathbb R^p$, and the lower one represents a linear transformation $\mathbb R^n\rightarrow\mathbb R^n$.
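As a small numerical sketch of these two maps (simulated data; the variable names are mine), the shapes make the point: $y$ goes in as an $n$-vector, $\hat\beta_{ols}$ comes out as a $p$-vector, and $\hat y$ comes out as another $n$-vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3                                        # n observations, p predictors

X = rng.normal(size=(n, p))                          # design matrix, n x p
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)    # y lives in R^n

# beta_hat = (X^T X)^{-1} X^T y : a linear map from R^n to R^p
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# y_hat = X (X^T X)^{-1} X^T y : the hat matrix, a linear map from R^n to R^n
y_hat = X @ beta_hat

print(beta_hat.shape)   # (3,)   -- a point in R^p
print(y_hat.shape)      # (500,) -- a point in R^n
```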

Dave
  • If your response variable is univariate, then you have $n$ points in a single dimension - not in $n$ dimensions. Are you thinking of multivariate prediction problems? – Stephan Kolassa Oct 20 '20 at 14:09
  • @StephanKolassa I'm considering $y\in\mathbb{R}^n$ and $\hat{y}\in\mathbb{R}^n$. Square loss is then $d_{L2}(y,\hat{y})$. I say this is reasonable, as we often write something like $\hat{\beta} = (X^TX)^{-1}X^Ty$, where we treat $y$ as being in $\mathbb{R}^n$. – Dave Oct 20 '20 at 14:48
  • OK. What do you mean then by your second sentence, "The response variable is univariate."? So what you actually have is a multivariate problem, where you are training on $N$ samples, each of which is an $n$-vector, and evaluating on $K$ samples, each of which is again an $n$ vector? (And I'm not even looking at the predictors yet.) Can you give an example of where this would go into high dimensions? Are you predicting entire EEG time courses, for instance? – Stephan Kolassa Oct 20 '20 at 14:55
  • @StephanKolassa I mean that the response variable is univariate, and there are $n$ observations of it. We then write $y = (y_1,\dots,y_n)\in\mathbb{R}^n$. – Dave Oct 20 '20 at 15:01
  • Well, but then my initial comment applies, and you have only a single dimension. I think I'm confused. Can you perhaps give a concrete example of what you are trying to do? – Stephan Kolassa Oct 20 '20 at 15:06
  • @StephanKolassa I observe ten million lions and ten million tigers, noting their top speeds. I regress speed on species using OLS. I say that $y\in\mathbb{R}^\text{20-million}$, as the parameter vector would be $\hat{\beta} = (X^TX)^{-1}X^Ty$. ($X$ is a column of 20-million $1$s and a column of 20-million species labels, say lions as $0$ and tigers as $1$.) – Dave Oct 20 '20 at 15:51
  • OK. So you have a single dimension and a lot of observations in this single dimension. The Euclidean metric has no problems in such a setting that I am aware of. Can you explain what problems you see? – Stephan Kolassa Oct 20 '20 at 15:58
  • $y\in \mathbb{R}^\text{20-million}$ is not a single dimension. – Dave Oct 20 '20 at 16:01
  • You have 20 million observations in a single dimension, speed. A multivariate analysis would predict, e.g., a triple (speed, height, weight), and then you would measure 20 million individuals on these three dimensions. But three dimensions is still not "high-dimensional". Can you explain what problems you see? – Stephan Kolassa Oct 20 '20 at 16:03
  • @StephanKolassa The $y$ vector in the $\hat\beta_{ols} = (X^TX)^{-1}X^Ty$ exists in a high-dimension space, even if we just have a single predictor (or no predictors at all in an intercept-only model). – Dave Dec 16 '21 at 20:21
  • I still don't see it. $y$ is one-dimensional. How does it "exist in a high dimension space"? I am not talking about the predictors at all, only about $y$. You have many observations in a single dimension. Yes, the Euclidean distance has problems in high dimensional settings, but this is a "big data" setting in a single dimension. Can you explain what problem you see with having many observations in a single dimension? Note that four other people apparently were as confused as I was... – Stephan Kolassa Dec 16 '21 at 20:28

2 Answers


I wasn’t thinking like a statistician when I posted this two years ago.

Under an assumption of a Gaussian error term, minimizing square loss is equivalent to maximum likelihood estimation.

When we find the parameter estimates that minimize square loss, there is a sense in which those are the most likely parameter values, given the data.

I find this answer somewhat unsatisfying, because I wanted to think purely in terms of prediction. Perhaps, though, the relationship to maximum likelihood estimation and the tradition of minimizing square loss are the best justification I can offer (and perhaps finding the “most likely” estimates really is a reasonable way to pursue predictive accuracy).
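As a rough numerical check of that equivalence (a sketch on simulated data, not any particular library's routine): minimizing the squared loss and maximizing the Gaussian log-likelihood over the coefficients land on the same estimates.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)        # Gaussian errors

# Estimate 1: coefficients that minimize the squared loss.
beta_sq = minimize(lambda b: np.sum((y - X @ b) ** 2), x0=np.zeros(2)).x

# Estimate 2: coefficients that maximize the Gaussian log-likelihood
# (error sd fixed at 1; it does not affect the argmax over beta).
beta_mle = minimize(lambda b: -norm.logpdf(y - X @ b, scale=1.0).sum(),
                    x0=np.zeros(2)).x

print(np.allclose(beta_sq, beta_mle, atol=1e-4))   # True, up to optimizer tolerance
```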

Dave

How much data you have is irrelevant to the choice of loss function. If you have more data, lucky you. The loss is part of your statistical model: like the likelihood function, it encodes your assumptions about the data. The loss also reflects characteristics of the data; for example, you may want it to be more robust to certain kinds of outliers.

As for Euclidean distance, the problem arises with high-dimensional data, i.e. many columns, not many rows (“big data”). Moreover, when minimizing a loss you usually look at the average loss across the samples, so it does not really matter how many samples you have.
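A minimal sketch of that last point (simulated data, fixed error standard deviation): the raw Euclidean distance $\lVert y-\hat y\rVert$ grows roughly like $\sqrt n$, while the average squared loss, which is what is actually minimized and reported, stays stable as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(7)

# With a fixed error sd, the total distance ||y - y_hat|| grows with n,
# but the *average* squared loss (MSE) stays roughly constant.
for n in [100, 10_000, 1_000_000]:
    y = rng.normal(size=n)
    y_hat = y + rng.normal(scale=0.5, size=n)   # predictions with fixed error sd
    euclid = np.linalg.norm(y - y_hat)
    mse = np.mean((y - y_hat) ** 2)
    print(f"n={n:9d}  ||y - y_hat|| = {euclid:10.2f}  MSE = {mse:.4f}")
```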

Why squared loss? It is one of the simplest losses, it is mathematically convenient (it corresponds to a Gaussian likelihood and has a simple derivative), and it is used partly for historical reasons. See Why is the squared difference so commonly used? and What makes mean square error so good? for more details.

Tim