
I'm working on a binary classification problem with a small dataset (n < 2000) consisting predominantly of text data. Say the model tends to misclassify observations where a certain categorical column has the value 'A' or 'B' but performs well on 'C' or 'D'.

If I have the means to get help with manual data labelling, is it important that I specifically collect more observations with column value 'A' or 'B'? Or does that not matter, or perhaps even add bias to the system?

harry
  • Is this a predictor or the outcome variable? – Stephan Kolassa Jan 20 '23 at 06:09
  • @StephanKolassa This is a predictor variable. Within that predictor, the column values 'A' through 'D' are balanced, and I believe the population it's sampling from is balanced too. I'm not sure what the best practice is for additional data collection, or whether there's any strategy to employ there. – harry Jan 20 '23 at 06:50
  • Do you have lots of unlabeled data available, such that you could label only a small portion of it because labeling is expensive? – dipetkov Jan 21 '23 at 14:41
  • @dipetkov Yes, unfortunately limited on that end. – harry Jan 22 '23 at 21:49
  • You already got the suggestion to look into active learning (and a recommendation for a tool to do it, which is great). This thread might be relevant as well. – dipetkov Jan 22 '23 at 21:56

2 Answers


You could also try active learning methods, which use your current classifier to automatically estimate which additional data would be most informative if labeled next. There are practical open-source implementations of this, e.g. some methods for Python that I helped develop.

You are currently relying on your own intuition in assuming that collecting more data from 'A' or 'B' will be best, but perhaps this is still sub-optimal. Active learning can help you figure this out; a sketch of the idea follows.
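As a concrete illustration (my own sketch, not the specific library alluded to above), pool-based uncertainty sampling with scikit-learn might look like this; `X_lab`, `y_lab`, and `X_pool` are hypothetical placeholders for a small labeled set and an unlabeled pool of texts:

```python
# Minimal sketch of pool-based uncertainty sampling.
# X_lab, y_lab: small labeled set; X_pool: unlabeled texts (placeholders).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def select_for_labeling(X_lab, y_lab, X_pool, batch_size=20):
    """Return pool indices the current model is least certain about."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_lab, y_lab)
    proba = model.predict_proba(X_pool)[:, 1]
    # For a binary problem, uncertainty peaks where p is closest to 0.5.
    uncertainty = -np.abs(proba - 0.5)
    return np.argsort(uncertainty)[-batch_size:]
```

The selected items are those whose predicted probability sits closest to 0.5; whether they concentrate in 'A'/'B' then becomes an empirical question rather than an assumption.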

  • Active learning is what I had in mind (+1). However, doesn't it require a large pool of unlabeled examples as a starting point? The OP hasn't actually said they have that at hand. – dipetkov Jan 22 '23 at 17:07
  • Also, if you are a contributor to the cleanlab project (and it seems you are....) please make this clear in your answer. – dipetkov Jan 22 '23 at 17:08

Depending on what other predictors there are in your dataset, how they co-occur with this particular predictor, and how they interact in association with the outcome of interest, "focused data collection" may improve matters, especially if your dataset is so unbalanced in terms of this predictor that some coefficients (or your model's analogue of coefficients) are imprecisely estimated (see Dikran's answer here). Unfortunately, you write that your predictor is balanced, so this increase in parameter precision will likely not make much of a difference; at the least, it would probably be better to increase the precision of all coefficient estimates through "balanced" data collection.
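To make the precision point concrete, here is a toy simulation (my own illustration with made-up base rates, not part of the original answer): the standard error of the coefficient for a predictor level shrinks as that level becomes better represented.

```python
# Toy illustration: coefficient precision for level 'A' vs. the baseline,
# as a function of how many 'A' observations we have. Numbers are made up.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def se_of_A_coefficient(n_A, n_other):
    x = np.r_[np.ones(n_A), np.zeros(n_other)]   # indicator for level 'A'
    p = np.where(x == 1, 0.6, 0.9)               # assumed class probabilities
    y = rng.binomial(1, p)
    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    return fit.bse[1]                            # standard error of 'A' term

print(se_of_A_coefficient(50, 950))   # few 'A' rows: large standard error
print(se_of_A_coefficient(500, 500))  # balanced: noticeably smaller
```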

Or it may not. For instance, assume that the target class is X with probability 0.6 if this predictor value is A or B and 0.9 if the predictor value is C or D, and that there are no other predictors involved. Then you will get the highest accuracy by always outputting a classification of X, completely regardless of the predictor... but the accuracy will only be 60% if the predictor is A or B, and 90% if it is C or D. And no amount of focused data collection will change that. (Related: Why is accuracy not the best measure for assessing classification models?)
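A quick numerical check of that argument, using the made-up probabilities above:

```python
# Always predicting the majority class X maximizes overall accuracy,
# yet per-group accuracy stays at each group's base rate.
import numpy as np

rng = np.random.default_rng(1)
groups = rng.choice(["A", "B", "C", "D"], size=100_000)
p_x = np.where(np.isin(groups, ["A", "B"]), 0.6, 0.9)
y = rng.binomial(1, p_x)                  # 1 = target class X

pred = np.ones_like(y)                    # classifier that always says "X"
ab = np.isin(groups, ["A", "B"])
print((pred == y).mean())                 # ~0.75 overall
print((pred == y)[ab].mean())             # ~0.60 on 'A'/'B'
print((pred == y)[~ab].mean())            # ~0.90 on 'C'/'D'
```

Since both conditional probabilities exceed 0.5, always predicting X is already the accuracy-optimal hard classification, and more 'A'/'B' data cannot move these numbers.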

If you do so much focused data collection that your training data does not reflect the ground population any more, you may even end up biasing your model. Whether that is good or bad for accuracy is not necessarily clear, because of the problems with accuracy as a KPI.

Bottom line: we can't tell. However, you could of course try to get a handle on this by simulation, as sketched below. Pretend that you don't have your full dataset but a much smaller one, obtained by balanced subsampling. Train a model and evaluate it. Then simulate this kind of focused data collection by adding back in data points where your predictor is A or B, retrain the model, and check whether accuracy (or whatever better KPI you use) improves. If it does, then focused collection may also help in extending your original dataset.
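A minimal sketch of that simulation, assuming a DataFrame `df` with hypothetical columns `text`, `cat`, and `label` (all names are placeholders, and the subsample sizes assume each category has enough rows):

```python
# Sketch: does adding back 'A'/'B' rows to a balanced subsample help?
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def evaluate(train_df, test_df):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_df["text"], train_df["label"])
    pred = model.predict(test_df["text"])
    return balanced_accuracy_score(test_df["label"], pred)

df_train, df_test = train_test_split(
    df, test_size=0.3, stratify=df["cat"], random_state=0
)

# Baseline: a small, balanced subsample of the training data.
base = df_train.groupby("cat").sample(n=100, random_state=0)

# "Focused collection": add back extra rows where cat is 'A' or 'B'.
pool = df_train[df_train["cat"].isin(["A", "B"])].drop(base.index, errors="ignore")
focused = pd.concat([base, pool.sample(n=100, random_state=0)])

print("balanced baseline:", evaluate(base, df_test))
print("focused addition: ", evaluate(focused, df_test))
```

Balanced accuracy is used here instead of plain accuracy, for the reasons discussed above; substitute whatever KPI actually matters for the project.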

Stephan Kolassa
  • Thank you Stephan for taking the time to write this out. I think I was preoccupied with accuracy given the project's focus on it. I'll try to collect data as close to the ground population as possible and run some simulations, as you recommended, to see how the model performs. – harry Jan 20 '23 at 07:38