I have a classification problem with $K$ labels. I represent the correct label of an observation $x$ as a one-hot vector $y \in R^K$, with entries $y_{k'} = \delta_{kk'}$ when $x$ belongs to class $k$.
Given an observation $x$, I predict its label with a vector $f(x) \in R^K$, where the components $f_k(x)$ satisfy $f_k(x) \in (0,1)$ and $\sum_k f_k(x) = 1$. A larger value of some $f_k(x)$ means that $x$ is more likely to belong to class $k$.
I want to learn the best function $f$ by choosing an appropriate loss function $\ell$. I know that a common choice is the cross-entropy: $\ell(x,y) = -\sum_k y_k \log f_k(x)$. Is the squared loss $\ell(x,y) = \frac{1}{2}\|y - f(x)\|_2^2$ ever used? If so, does it tend to produce classifiers with noticeably different performance profiles compared to cross-entropy?
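For concreteness, here is a small numpy sketch of both losses on a one-hot target (the function and variable names are my own). One difference in shape is already visible at this level: the cross-entropy is unbounded for confidently wrong predictions, while the squared loss on a probability vector is bounded.

```python
import numpy as np

def cross_entropy(y, f):
    # ell(x, y) = -sum_k y_k log f_k(x)
    return -np.sum(y * np.log(f))

def squared_loss(y, f):
    # ell(x, y) = (1/2) ||y - f(x)||_2^2
    return 0.5 * np.sum((y - f) ** 2)

# K = 3 classes; the true class is the first one, one-hot encoded.
y = np.array([1.0, 0.0, 0.0])

# A confident, correct prediction versus a confidently wrong one.
f_right = np.array([0.90, 0.05, 0.05])
f_wrong = np.array([0.001, 0.998, 0.001])

print(cross_entropy(y, f_right), squared_loss(y, f_right))
# Cross-entropy blows up (about 6.9 here); squared loss stays below 1.
print(cross_entropy(y, f_wrong), squared_loss(y, f_wrong))
```

Both losses agree on the ranking of these two predictions, but they clearly penalize the mistake on very different scales, which is part of what I'm asking about.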
A comment on a related question warns against using the squared loss.
