I am working on a multiclass classification problem with $k$ classes by jointly performing $k$ linear regressions (I know this isn't the best way to tackle this kind of problem).
The result of this joint regression is an $N \times k$ matrix $\hat{Y}$, where each row $i$ contains $k$ floats giving the prediction for observation $i$ with respect to each of the classes. I also have an $N \times k$ matrix $Y$ of validation/test set labels; this matrix is one-hot, so each row contains a single '1' in the column of the true label and '0's in all other columns.
I need to compute the square loss of my predictions following regression, but I am unsure how to compute the square loss between two vectors.
Is the following correct? $$ \text{Suppose each $y_i$ is a one-hot $k$-vector (e.g., $[0,0,1,0,0,0]$ with $k=6$)} \\ \text{and each $\hat{y}_i$ is a $k$-vector of float values.} \\ \text{Square loss} = \frac{1}{N}\sum_{i=1}^{N} \left\| y_i - \hat{y}_i \right\|_2^2 \\ = \frac{1}{N}\sum_{i=1}^{N} \left( \sqrt{(y_{i,1} - \hat{y}_{i,1})^2 + (y_{i,2} - \hat{y}_{i,2})^2 + \dots + (y_{i,k} - \hat{y}_{i,k})^2} \right)^2 $$
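For concreteness, here is a small NumPy sketch of the computation I have in mind (the shapes, array names, and random data are just illustrative). Note that the square root and the outer square cancel, so the loss is simply the mean over rows of the sum of squared differences:

```python
import numpy as np

# Illustrative setup: N observations, k classes
N, k = 5, 6
rng = np.random.default_rng(0)

Y = np.eye(k)[rng.integers(0, k, size=N)]   # one-hot labels, shape (N, k)
Y_hat = rng.random((N, k)) * 0.1            # small float predictions, shape (N, k)

# Square loss as written above: mean over rows of the squared L2 norm
loss_norm = np.mean(np.linalg.norm(Y - Y_hat, axis=1) ** 2)

# Equivalent direct form: sqrt and outer square cancel
loss_direct = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

assert np.isclose(loss_norm, loss_direct)
print(loss_norm)
```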
...
As a separate but related question, I observe that all of the elements of my resulting $\hat{y}_i$ vectors are quite small, on the order of $10^{-4}$ to $10^{-1}$. Is this problematic? It seems the square loss will not be very meaningful when computed between one-hot 0/1 vectors and vectors of such small floats. However, when I assign each observation to the class with the largest element of $\hat{y}_i$, the 0-1 loss is relatively low (~86% accuracy) despite the small magnitudes of the $\hat{y}_i$'s.
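This is how I am computing the 0-1 loss, assuming the same illustrative $Y$ and $\hat{Y}$ as above. I suspect the small magnitudes don't hurt accuracy because argmax is invariant to the overall scale of the predictions, even though that scale inflates the square loss:

```python
import numpy as np

# Same illustrative setup as above
N, k = 5, 6
rng = np.random.default_rng(0)
Y = np.eye(k)[rng.integers(0, k, size=N)]   # one-hot labels
Y_hat = rng.random((N, k)) * 0.1            # small-magnitude predictions

# Assign each observation to the column with the largest prediction.
# argmax is scale-invariant, so small magnitudes don't affect accuracy.
pred_classes = np.argmax(Y_hat, axis=1)
true_classes = np.argmax(Y, axis=1)

accuracy = np.mean(pred_classes == true_classes)
zero_one_loss = 1.0 - accuracy
print(accuracy, zero_one_loss)
```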