
I would like to find a predictive density for the target variable via multi-class classification. Suppose we are given a set of features $\mathbf X$ and a continuous target $\mathbf y$. Replace each $y$ with the bin it belongs to and train a multi-class classification model.

Let's assume that our target distribution can be approximated by a mixture of uniform distributions with disjoint supports: $p(y) \approx \sum _k \phi _k \mathcal U(y|a_k, b_k)$, where $\mathcal U(y|a_k, b_k) = \mathbf 1_{[a_k, b_k)}(y) / (b_k - a_k)$. The goal is then to predict the weights $\boldsymbol \phi = (\phi _k)_{k=1..K}$ for each input vector $\mathbf x$. Let $Y|\mathbf w \sim \sum _k w_k \mathcal U(y|a_k, b_k)$, where $\mathbf w$ denotes the predicted weights; then
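For concreteness, here is a minimal sketch of the binning step I have in mind (quantile-based edges are just one possible choice, and the function names are only illustrative):

```python
import numpy as np

def make_bins(y_train, n_bins=10):
    """Bin edges such that bin k has support [edges[k], edges[k + 1])."""
    # Quantile-based edges give roughly equally populated bins.
    return np.quantile(y_train, np.linspace(0.0, 1.0, n_bins + 1))

def bin_index(y, edges):
    """Index of the bin each continuous target falls into."""
    # Out-of-range values are clipped into the first/last bin.
    return np.clip(np.searchsorted(edges, y, side="right") - 1, 0, len(edges) - 2)
```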

$$\mathcal L = \mathbb E [(Y - y)^2] = \mathbb E[Y^2] - 2y\mathbb E [Y] + y^2 = \sum _k w_k \frac{a_k^2 + a_kb_k +b_k^2}{3} - 2y\sum_kw_k\frac{a_k+b_k}{2} + y^2$$ Assuming $w_k = \operatorname{softmax}(\mathbf z)_k$ with $\mathbf z = \mathbf W\mathbf x + \mathbf b$, we can calculate the gradient w.r.t. our model parameters:

$$\frac{\partial \mathcal L}{\partial z_i} = \sum _k \left(\frac{a_k^2 + a_kb_k +b_k^2}{3} - 2y\,\frac{a_k+b_k}{2}\right) \operatorname{softmax}(\mathbf z)_k \left(\delta _{ik} - \operatorname{softmax}(\mathbf z)_i\right)$$ and train our model using gradient descent.
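For reference, a minimal sketch of how these formulas translate into code for a single sample (the variable names and the plain per-sample update are only illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_grads(W, b, x, y, lo, hi):
    """lo[k], hi[k] are the bin boundaries a_k, b_k; returns loss, dW, db."""
    z = W @ x + b
    w = softmax(z)
    m1 = (lo + hi) / 2.0                         # E[Y]   within bin k
    m2 = (lo**2 + lo * hi + hi**2) / 3.0         # E[Y^2] within bin k
    loss = w @ m2 - 2.0 * y * (w @ m1) + y**2
    c = m2 - 2.0 * y * m1                        # per-bin coefficient in dL/dz
    dz = w * (c - w @ c)                         # sum_k c_k w_k (delta_ik - w_i)
    dW = np.outer(dz, x)                         # chain rule: dz_i/dW_ij = x_j
    db = dz
    return loss, dW, db

# one plain gradient-descent step on a single sample (x, y):
# loss, dW, db = loss_and_grads(W, b, x, y, lo, hi)
# W -= lr * dW
# b -= lr * db
```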

Is this a valid approach?

I've implemented it in Python, but I'm having trouble with convergence (probably due to some bug in the code, but I want to make sure I'm on the right track).

Update: actually, my predictions are the same for every input vector, and the gradients for all samples are equal up to a constant. Why is that the case, and how should I fix it?

vladkkkkk
    Why perform the binning at all instead of modeling on a continuum? https://stats.stackexchange.com/a/68839/247274 – Dave Mar 23 '23 at 00:43
  • @Dave This is my workaround to get a predictive distribution – vladkkkkk Mar 23 '23 at 01:41
  • So you combine a bunch of disjoint uniform distributions according to the predicted probabilities of the multiclass classifier to make some kind of conditional histogram (histogram of the target, conditioned on feature values)? This is an interesting idea, but why do you have to jump through all kinds of hoops instead of just using a standard multiclass classifier? // What happens when you apply your model to a $y$-value that is outside of the range of training $y$-values? – Dave Mar 23 '23 at 01:52
  • @Dave In multi-class classification using e.g. cross-entropy loss, we treat all misclassifications equally. The described loss function tries to penalize predictions that are far from the actual values. – vladkkkkk Mar 23 '23 at 09:04
  • If the true $y$ value for an unseen $\mathbf x$ lies outside the range of training $y$-values, it will be clipped. This approach is used in decision tree regression, for example. Alternatively, one could take e.g. a truncated normal distribution for the edge bins; this way the model can predict any value, with density diminishing towards infinity. (A small sketch of evaluating the resulting density is below these comments.) – vladkkkkk Mar 23 '23 at 09:09
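For completeness, a minimal sketch of evaluating the resulting predictive density from the predicted weights (out-of-range values are clipped into the edge bins as described above; the truncated-normal variant is not shown, and the names are only illustrative):

```python
import numpy as np

def predictive_density(y, weights, edges):
    """Piecewise-constant density p(y | x) = sum_k w_k * 1[a_k <= y < b_k] / (b_k - a_k)."""
    k = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, len(edges) - 2)
    return weights[k] / (edges[k + 1] - edges[k])
```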

0 Answers