I would like to find a predictive density for the target variable via multi-class classification. Suppose we are given a set of features $\mathbf X$ and a continuous target $\mathbf y$. Replace each $y$ with the bin it belongs to and train a multi-class classification model.
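Concretely, the binning step I have in mind is something like the following minimal sketch (quantile bins and `K = 10` are arbitrary choices for the example):

```python
import numpy as np

y = np.random.randn(1000)          # continuous targets (toy data)
K = 10                             # number of bins, arbitrary for this example

# bin edges: a_k = edges[k], b_k = edges[k + 1]
edges = np.quantile(y, np.linspace(0.0, 1.0, K + 1))

# class label of each sample = index of the bin [a_k, b_k) it falls into
labels = np.clip(np.digitize(y, edges[1:-1]), 0, K - 1)
```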
Let's assume that our target distribution can be approximated by a mixture of uniform distributions with disjoint supports: $p(y) \approx \sum _k \phi _k \mathcal U(y|a_k, b_k)$, where $\mathcal U(y|a_k, b_k) = \mathbf 1_{[a_k, b_k)}(y) / (b_k - a_k)$. The goal is then to predict $\boldsymbol \phi = (\phi _k)_{k=1}^{K}$ for each input vector $\mathbf x$. Let $Y|\mathbf w \sim \sum _k w_k \mathcal U(y|a_k, b_k)$, where $\mathbf w$ is the model's prediction of $\boldsymbol \phi$; then
$$\mathcal L = \mathbb E [(Y - y)^2] = \mathbb E[Y^2] - 2y\,\mathbb E [Y] + y^2 = \sum _k w_k \frac{a_k^2 + a_k b_k + b_k^2}{3} - 2y\sum_k w_k\frac{a_k+b_k}{2} + y^2$$ Assuming $w_k = \operatorname{softmax}(\mathbf z)_k$ with $\mathbf z = \mathbf W\mathbf x + \mathbf b$, we can compute the gradient with respect to the model parameters:
$$\frac{\partial \mathcal L}{\partial z_i} = \sum _k \left(\frac{a_k^2 + a_k b_k + b_k^2}{3} - 2y\,\frac{a_k+b_k}{2}\right) \operatorname{softmax}(\mathbf z)_k \,\bigl(\delta _{ik} - \operatorname{softmax}(\mathbf z)_i\bigr)$$ and train the model using gradient descent.
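To make the derivation concrete, here is a minimal NumPy sketch of the loss and of one gradient-descent step for a single sample; `a` and `b` are the vectors of bin edges $(a_k)$ and $(b_k)$, and `W`, `bias`, `lr` are placeholder names for this example. The update for `W` uses the chain rule $\partial z_i / \partial W_{ij} = x_j$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, a, b, y):
    """E[(Y - y)^2] for Y ~ sum_k w_k * U(a_k, b_k), with w = softmax(z)."""
    w = softmax(z)
    second_moment = (a**2 + a * b + b**2) / 3.0   # E[Y^2] within bin k
    first_moment = (a + b) / 2.0                  # E[Y]   within bin k
    return w @ second_moment - 2.0 * y * (w @ first_moment) + y**2

def grad_z(z, a, b, y):
    """dL/dz_i = sum_k c_k * w_k * (delta_ik - w_i), c_k as in the formula above."""
    w = softmax(z)
    c = (a**2 + a * b + b**2) / 3.0 - y * (a + b)
    return w * c - w * (w @ c)

def sgd_step(W, bias, x, a, b, y, lr=1e-2):
    """One gradient-descent update for a single sample (x, y)."""
    z = W @ x + bias
    g = grad_z(z, a, b, y)
    W -= lr * np.outer(g, x)   # dL/dW_ij = dL/dz_i * x_j
    bias -= lr * g
    return W, bias
```

I believe this matches the formulas above, modulo whatever bug remains in my actual code.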
Is this a valid approach?
I've implemented it in Python, but I'm having trouble with convergence (probably due to a bug in the code, but I want to make sure I'm on the right track).
Update: actually, my predictions are the same for every input vector, and the gradients for all samples are equal up to a constant. Why is that the case, and how should I fix it?