I am working on a multiclass classification problem with $k$ classes by jointly performing $k$ linear regressions (I know this isn't the best way to tackle this kind of problem).
The result of this joint regression is an $N \times k$ matrix $\hat{Y}$, where each row $i$ contains $k$ floats giving the prediction for observation $i$ with respect to each of the classes. I also have an $N \times k$ matrix $Y$ of validation/test set labels; this matrix is one-hot, so each row contains a single '1' in the column of the true label and '0's in all other columns.
I need to compute the square loss of my predictions following regression, but I am unsure how to compute the square loss between two vectors.
Is the following correct? $$ \text{Suppose each $y_i$ is a one-hot $k$-vector (e.g., $[0,0,1,0,0,0]$ with $k=6$)} \\ \text{and each $\hat{y}_i$ is a $k$-vector of float values.} \\ \text{Square loss} = \frac{1}{N}\sum_{i=1}^{N} \left\| y_i - \hat{y}_i \right\|_2^2 \\ = \frac{1}{N}\sum_{i=1}^{N} \left( \sqrt{(y_{i,1} - \hat{y}_{i,1})^2 + (y_{i,2} - \hat{y}_{i,2})^2 + \dots + (y_{i,k} - \hat{y}_{i,k})^2} \right)^2 $$
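For concreteness, here is a small NumPy sketch of the computation I have in mind (the shapes, array names, and random data are just illustrative). Note that the square root and the outer square cancel, so the loss is simply the mean over rows of the sum of squared differences:

```python
import numpy as np

# Illustrative setup: N observations, k classes
N, k = 5, 6
rng = np.random.default_rng(0)

Y = np.eye(k)[rng.integers(0, k, size=N)]   # one-hot labels, shape (N, k)
Y_hat = rng.random((N, k)) * 0.1            # small float predictions, shape (N, k)

# Square loss as written above: mean over rows of the squared L2 norm
loss_norm = np.mean(np.linalg.norm(Y - Y_hat, axis=1) ** 2)

# Equivalent direct form: sqrt and outer square cancel
loss_direct = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

assert np.isclose(loss_norm, loss_direct)
print(loss_norm)
```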
...
As a separate but related question, I observe that all of the elements of my resulting $\hat{y}_i$ vectors are quite small, on the order of $10^{-4}$ to $10^{-1}$. Is this problematic? It seems the square loss will not be very meaningful when computed between one-hot 0/1 vectors and vectors of such small floats. However, when I assign each observation to the class with the largest element of $\hat{y}_i$, the 0-1 loss is relatively low (~86% accuracy) despite the small magnitudes of the $\hat{y}_i$'s.
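This is how I am computing the 0-1 loss, assuming the same illustrative $Y$ and $\hat{Y}$ as above. I suspect the small magnitudes don't hurt accuracy because argmax is invariant to the overall scale of the predictions, even though that scale inflates the square loss:

```python
import numpy as np

# Same illustrative setup as above
N, k = 5, 6
rng = np.random.default_rng(0)
Y = np.eye(k)[rng.integers(0, k, size=N)]   # one-hot labels
Y_hat = rng.random((N, k)) * 0.1            # small-magnitude predictions

# Assign each observation to the column with the largest prediction.
# argmax is scale-invariant, so small magnitudes don't affect accuracy.
pred_classes = np.argmax(Y_hat, axis=1)
true_classes = np.argmax(Y, axis=1)

accuracy = np.mean(pred_classes == true_classes)
zero_one_loss = 1.0 - accuracy
print(accuracy, zero_one_loss)
```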