
I'm trying to train a text classification model that can predict labels $A$ and $B$ accurately. However, 95% of the text examples in my dataset, which is representative of the kind of data I want my model to predict on, are examples of neither $A$ nor $B$. I instead labeled them $O$ for "other."

I need more examples in my labeled dataset to actually train a good model, but to get them I would have to make my labeled dataset less representative of the data it's meant to be used on. So there's a catch-22 here. I can either keep labeling until I'm happy with the number of $A$ and $B$ labels I have, which will feed my model $100$ examples of $O$ for every $5$ or $6$ examples of $A$ or $B$ and probably make it favor predicting $O$ for everything (although this may not happen, since the proportions of $A$ and $B$ vs. $O$ shouldn't change as I continue to label), or I can give it a non-representative dataset with an artificially greater proportion of $A$ and $B$ examples.

How do I navigate this? I don't know which imperfect solution is better.

  • If you feed representative data to your model, it should not favor anything, but converge to well-calibrated probabilistic predictions. This presupposes you are not using inappropriate error measures like accuracy; instead, use proper scoring rules. Well-calibrated probabilistic predictions will simply tell you that the conditional probability of A or B is low, and that is exactly as it should be (see the sketch after these comments). More here: https://stats.meta.stackexchange.com/q/6349/1352 – Stephan Kolassa Nov 18 '22 at 16:57
  • It is not clear what the problem is. If you are unhappy about having to slog through a lot of "other" in order to get a reasonable number of instances of A and B, that is a separate issue from a trained model tending to predict "other" (which, as @StephanKolassa points out, is reasonable behavior). That is, there is no problem with showing the model many instances of "other" relative to instances of A or B; but if you have too little data on A and B overall, that is a separate, data-collection issue. – Dave Nov 18 '22 at 18:04
  • Indirectly, you may be asking how many samples are needed for proper training. There may not be a perfect answer to this. All you can do is create training and test data and see for yourself which combination makes sense and gives logically relevant output. For imbalanced learning problems, there are algorithms designed specifically for that, such as RUSBoost and other weighted classifiers. – prashanth Nov 19 '22 at 05:42
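
To make the accuracy-vs-proper-scoring-rule point in the first comment concrete, here is a minimal sketch with hypothetical data and a scikit-learn model of my choosing (nothing below is from the original thread): on a roughly 95%-$O$ dataset, accuracy gives a degenerate "always predict $O$" baseline and a real model similar-looking scores, while log loss, a proper scoring rule, separates them clearly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: ~95% class O (coded 0), ~5% "A or B" (coded 1).
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(0.0, 1.0, (n, 3)) + 1.5 * y[:, None]  # weak signal in the features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(X_tr, y_tr)  # in effect, "always predict O"
model = LogisticRegression().fit(X_tr, y_tr)

# Accuracy gives both a high, similar-looking score, because predicting O
# for everything is already right about 95% of the time...
print("accuracy:", accuracy_score(y_te, baseline.predict(X_te)),
      accuracy_score(y_te, model.predict(X_te)))

# ...while log loss, a proper scoring rule evaluated on the predicted
# probabilities, clearly separates the informative model from the baseline.
print("log loss:", log_loss(y_te, baseline.predict_proba(X_te)),
      log_loss(y_te, model.predict_proba(X_te)))
```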

1 Answer


This seems like a multi-class analogue of the problem addressed by King and Zeng (2001).

As mentioned in the comments, class imbalance is much less of a problem than many believe. After all, if an event is rare, the predictive model should be skeptical of such an event occurring.

However, when it comes to collecting data, it can become a nightmare to have to slog through so many cases to get to a point where you have enough observations of the minority classes to do quality work with reasonable estimates. That's where King and Zeng (2001) come in. Their paper addresses how to sample in situations where events are rare so that you do not waste your time on the majority cases; you then account for the artificial balancing.

King and Zeng (2001) address the binary case, but this philosophy makes a lot of sense and might lead somewhere useful.

(A crucial difference between King and Zeng (2001) and other artificial balancing techniques like downsampling is that King and Zeng operate at the data-collection phase rather than the modeling phase. They treat class imbalance as a question of experimental design, of how to be efficient in collecting data that may be time-consuming and/or expensive to acquire, rather than discarding already-collected data the way downsampling does.)
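
To make the correction concrete, here is a minimal sketch of King and Zeng's (2001) "prior correction" in the binary case, with hypothetical data and prevalences (the sampling design, $\tau$, and the model below are illustrative assumptions, not part of the original answer): keep every rare case, subsample the majority, fit an ordinary logistic regression, then subtract $\ln\!\left[\frac{1-\tau}{\tau}\cdot\frac{\bar{y}}{1-\bar{y}}\right]$ from the fitted intercept, where $\tau$ is the known population prevalence and $\bar{y}$ is the sample prevalence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical case-control sample: keep all rare cases, subsample the
# majority class, so the sample prevalence sits far above the population's.
tau = 0.05                      # assumed known population prevalence of the rare class
n_rare, n_majority = 500, 1000  # sample prevalence y_bar = 1/3
X = np.vstack([rng.normal(1.0, 1.0, (n_rare, 2)),
               rng.normal(0.0, 1.0, (n_majority, 2))])
y = np.concatenate([np.ones(n_rare), np.zeros(n_majority)])
y_bar = y.mean()

model = LogisticRegression().fit(X, y)

# King and Zeng's prior correction: case-control sampling on the outcome
# biases only the intercept, so shift it back using tau and y_bar.
model.intercept_ -= np.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))

# predict_proba now returns probabilities on the population scale, even
# though the model was fit on an artificially balanced sample.
p = model.predict_proba(X)[:, 1]
print(p.mean())  # pulled down toward tau rather than y_bar
```

Only the intercept needs adjusting because sampling on the outcome leaves the logistic regression slopes consistent; a multi-class analogue would need a per-class correction of the same flavor.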

REFERENCE

King, Gary, and Langche Zeng. "Logistic Regression in Rare Events Data." Political Analysis 9.2 (2001): 137–163.

Dave