3

I am handling a binary classification problem on an imbalanced dataset.

The goal is to build a system that maps the returned score (the probability of belonging to the positive class) into one of ten bins, from 1 to 10, where 1 means a low probability of being in the positive class and 10 a high probability.

The main problem is that I only have the training dataset, so I don't know any of the test-set values. Moreover, predictions are made one at a time, so I cannot analyze the full distribution of test scores.

I have tried many models, but in particular I use tree-based models (such as XGBoost and random forests). In these cases, partly because of the class imbalance, the output scores fall in a very narrow range, much smaller than [0, 1]. The scores matter because I don't want to classify the instances directly into class 0 or 1; I want to analyze the scores themselves.

How should I build a method that finds the thresholds needed to create the 10 bins?

Noah
  • 33,180
  • 3
  • 47
  • 105
A1010
  • 213
  • It sounds like your targets are ordinal. You can use an ordinal loss function. For example: https://en.wikipedia.org/wiki/Ordered_logit Although one must wonder why the targets have been pre-processed into this 1-10 scale instead of working with the class labels directly. – Sycorax May 24 '20 at 22:00
  • the target is always binary (0 or 1). I don't want to just put a label on each record; I want to analyze the returned scores in order to better understand the confidence of the prediction. I want to create the 10 classes so that I can divide the output scores into them and analyze them easily – A1010 May 25 '20 at 06:40
  • In your question, you write "How should I build a method able to find the different thresholds in order to create the 10 classes?" But there's no single, correct answer to this question. One method would divide $[0,1]$ into ten equal-sized bins. Another would examine the model predictions and then choose bins based on the deciles of the predictions. Another might be uniform on the logarithmic scale. In other words, how will you know that you've chosen a good way to make bins? – Sycorax May 25 '20 at 15:15
  • You are right, there are many possible solutions. To be more precise, I want a method that distributes the test observations evenly across the 10 bins. A vanilla solution would obviously be to create the 10 bins as 0->[0, 0.1], 1->[0.1, 0.2] and so on. However, in this case it is possible that all the test observations end up in a single bin according to their scores. So, I am looking for a method that builds the 10 bins in such a way that the test scores fall into all of them, not just one. – A1010 May 25 '20 at 15:24
  • It sounds like you can use the deciles of some test set as the bins. Another set of data drawn from the same distribution will be more-or-less uniformly allocated to the bins (it won't be exact because of sampling variance). – Sycorax May 25 '20 at 15:50
  • Yes, that's a good idea. The problem is that in this way I have to put aside a percentage of my training data, and that could be a problem. I would try to train the model on the whole training set, and then predict the training set itself. Obviously this is not good practice in general, due to overfitting. However, in my case I am not interested in who is predicted as 1 and who as 0, but only in the score distribution. What do you think about it? – A1010 May 25 '20 at 15:56
  • I think you'll have to put aside a percentage of your training data. – Sycorax May 25 '20 at 15:57
  • What about $(0,0.1)$, $(0.1,0.2)$, etc? – Dave May 29 '21 at 13:06
  • no, it is not possible, since the scores are in a range between 0 and 1, but in most cases the range is much smaller (e.g. 0 - 0.05), so if I use an a-priori partition I will probably end up with all the records in the same bin. – A1010 Jun 01 '21 at 10:27
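The decile-based approach discussed in the comments can be sketched as follows: estimate bin edges from the scores on a held-out set, then map each new score to a bin one at a time. This is only a minimal sketch; the held-out scores here are simulated with a skewed distribution standing in for a real classifier's output on imbalanced data.

```python
import numpy as np

# Simulated held-out scores standing in for real classifier output
# (skewed toward 0, as is typical with imbalanced data).
rng = np.random.default_rng(0)
holdout_scores = rng.beta(1, 30, size=5000)

# Decile edges estimated on the held-out set: 9 interior cut points.
edges = np.quantile(holdout_scores, np.linspace(0.1, 0.9, 9))

def score_to_bin(score):
    """Map a single new score to a bin in 1..10 using the precomputed edges."""
    return int(np.searchsorted(edges, score, side="right")) + 1

# New scores can now be binned one at a time, without seeing the test set;
# scores drawn from the same distribution land roughly uniformly in 1..10.
print(score_to_bin(float(np.median(holdout_scores))))
```

Because the edges are fixed before any test score arrives, this satisfies the one-prediction-at-a-time constraint, at the cost of reserving some data to estimate the deciles.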

2 Answers

1

First, the fact that your outputs are in only the very low end of $[0,1]$ strikes me as a feature, not a bug, of the outputs of a probability predictor in an imbalanced setting. It might typically be the case that a minority-class event is unlikely, even if the probability is higher than the prior probability (class ratio), as I discuss here. If that is the case, then, yes, the predicted probabilities should be at the low end of $[0,1]$.

You want to then scale $[0,1]$ to $[1,10]$. That sounds like a job for $y=9x+1$: first multiply the value in $[0,1]$ by $9$, and then add $1$. If you are determined to have only the integers, you could consider rounding (though I do not see any statistical advantage to doing so). Yes, this approach will have a lot of values down towards one, but think about what it means if you manage to get a seven or a ten!

Dave
  • 62,186
0

Well, if your model outputs probabilities for the positive class and your decision boundary is 0.5, i.e. $p < 0.5 \Rightarrow$ class 0 and $p \ge 0.5 \Rightarrow$ class 1, then you can discretize the probability interval $[0.5, 1)$ into 10 bins and assign a class to each interval, e.g. $[0.5, 0.55) \Rightarrow$ class 1, $[0.55, 0.6) \Rightarrow$ class 2, ...

Now each class represents some degree of confidence in the positive prediction.
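A minimal sketch of this idea, with the boundary and bin count as parameters so it can be adapted if the decision boundary is not 0.5 (the function name and the 0-return for negative predictions are my own conventions):

```python
def confidence_class(p, boundary=0.5, n_bins=10):
    """Split [boundary, 1) into n_bins equal intervals.

    Returns 0 for a negative prediction (p < boundary),
    otherwise a confidence class in 1..n_bins.
    """
    if p < boundary:
        return 0
    width = (1.0 - boundary) / n_bins
    # min() keeps p == 1.0 inside the top bin.
    return min(int((p - boundary) // width) + 1, n_bins)

print(confidence_class(0.52))  # falls in the first interval [0.5, 0.55)
print(confidence_class(0.97))  # falls in the last interval
```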

Tinu
  • 828
  • Well, my decision boundary is not 0.5. Since I am working in an imbalanced scenario, the output probabilities for the positive class are usually much smaller than 0.5 (in particular for the tree-based classifiers), since they hover around the conversion rate of the target column. Looking at the problem from another point of view, it is as if I have to set my threshold. – A1010 May 25 '20 at 13:11
  • 1
    This is not well conceptualized. Keep probabilities as probabilities. And use proper nomenclature. This is not a classification problem. This is a probability modeling problem. – Frank Harrell Jan 31 '22 at 12:59