
I am working on classifying a dataset that contains ambiguous and noisy data, which results in class overlap in the feature space.

There seem to be a few papers on this topic:

I am still unable to find a more established approach, ideally with a Python implementation. My question, therefore, is: how do people suggest handling overlapping classes in classification problems?

dendog
  • By "overlapping classes" do you mean that one observation can belong to more than one class, or do you mean that the within-class distributions are genuinely overlapping, meaning that in a given position x data may occur from two or more classes with essentially nonzero probability? The latter is in fact the standard situation in classification and most methods can handle it. (Although the more overlap, the more difficult the classification problem is, regardless of the method used.) – Christian Hennig May 12 '21 at 09:01
  • in a given position x data may occur from two or more classes with essentially nonzero probability

    Correct! But just to mention I did link to two papers.

    – dendog May 12 '21 at 11:47
  • These are rather specialist papers. I don't see anything wrong with them, but I wouldn't agree with the idea that you have to do something special about it. Your favourite classification method should normally be able to handle such a situation. You can compare several ones with cross-validation as in any classification problem. There's really nothing special about overlapping classes. – Christian Hennig May 12 '21 at 15:23
  • Sorry but I would disagree, there is a whole area of research around confident learning, noisy labels, robust classifiers etc – dendog May 13 '21 at 08:41
  • 2
    I think the machine learning culture is different from the statistics culture in this respect. In statistics methods are governed by model assumptions, and these regularly allow classes to overlap. "Robustness" in statistics refers to violation of model assumptions, outliers etc., which is a different cup of tea. ML focuses more on classes that are strongly separated but of complex shapes - then overlap seems like more of an issue. Noisy labels is again another issue, as are classifiers that give out "ambiguity regions". – Christian Hennig May 13 '21 at 10:24
  • So fair enough if you want this stuff, nothing wrong with that. Your question looked to me as if you felt you'd need to do something about overlap and standard approaches wouldn't work, which in general isn't true. – Christian Hennig May 13 '21 at 10:25
  • 1
    If your classes are highly overlapping then just fitting a standard model will not work well. – dendog May 13 '21 at 10:28
  • This depends on what "working well" means. You will get a large misclassification probability, but this can essentially not be avoided by any method due to overlap... ultimately whether you can do better depends on whether the overlap is compatible with the model assumptions or not. There's nothing in the definition of a "standard model" that makes it fail in case of large overlap. – Christian Hennig May 13 '21 at 10:35
  • You can try: https://github.com/cleanlab/cleanlab – hafiz031 Feb 22 '22 at 05:03
  • This question is basically asking "how to do classification" – Firebug Jan 22 '24 at 09:01

3 Answers


If overlapping classes means that a single data instance can be assigned multiple classes, you basically have two options:

  • Turn the problem into standard single-label classification by creating a separate class for every class combination that occurs in the training data (there might be too many of them, and some might not make sense because, as you said, the data is noisy).

  • Have an independent predictor for each of the classes and treat the problem as assigning independent tags to each data instance.

If you want to use neural nets, in the latter case it makes sense to use a shared representation and the same architecture as for standard classification. However, instead of the softmax you would use a sigmoid non-linearity (producing a prediction between 0 and 1 for each class) and a binary cross-entropy loss function. The target vector is then an indicator vector with ones for the active classes and zeros elsewhere.
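A minimal numpy sketch of that output layer and loss, independent of any particular deep-learning framework (the logits here are made-up numbers standing in for the output of the shared network body):

```python
import numpy as np

def sigmoid(z):
    # element-wise logistic function: maps logits to (0, 1) per class
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # mean BCE over all class outputs; y_true is a 0/1 indicator matrix
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# logits for 2 instances over 4 classes (placeholder values)
logits = np.array([[ 2.0, -1.0,  0.5, -3.0],
                   [-0.5,  1.5, -2.0,  0.0]])
probs = sigmoid(logits)                  # independent per-class probabilities

targets = np.array([[1, 0, 1, 0],        # instance 1 carries classes 0 and 2
                    [0, 1, 0, 0]], dtype=float)
loss = binary_cross_entropy(targets, probs)

# at prediction time, threshold each class independently
predicted = (probs >= 0.5).astype(int)
```

Unlike softmax, the per-class probabilities do not sum to 1, so an instance can be assigned zero, one, or several classes.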

Jindřich

The second paper you referred to describes methods for handling overlapping instances.

Handling samples of overlapping regions is as important as identifying such regions. Xiong et al. [16] proposed that the overlapping regions can be handled with three different schemes: discarding, merging and separating.

Discarding: Ignores the overlapping region and learns on rest of the data that belongs to the non-overlapping region.

Merging: Considers the overlapping region as a new class and uses a 2-tier classification model.

Separating: The data from overlapping and non-overlapping regions are treated separately to build the learning models.

So, as the authors of the paper describe, you may ignore the overlapping region, treat the overlapping region as a new class, or model the overlapping and non-overlapping regions separately.
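The paper does not prescribe a single way to detect the overlapping region, so here is a simple illustrative heuristic (not the authors' exact procedure): flag a sample as overlapping when most of its nearest neighbours carry a different label, then apply each of the three schemes. The `k` and `threshold` values are arbitrary choices for the sketch:

```python
import numpy as np

def overlap_mask(X, y, k=5, threshold=0.5):
    # flag a sample as "overlapping" if fewer than `threshold` of its
    # k nearest neighbours (excluding itself) share its label
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point from its own neighbourhood
    idx = np.argsort(d, axis=1)[:, :k]     # indices of the k nearest neighbours
    agree = (y[idx] == y[:, None]).mean(axis=1)
    return agree < threshold

# two noisy, partially overlapping Gaussian blobs as toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([0, 1], 100)
mask = overlap_mask(X, y)

# discarding: train only on the non-overlapping region
X_keep, y_keep = X[~mask], y[~mask]
# merging: relabel the overlapping region as a new class for a 2-tier model
y_merged = np.where(mask, 2, y)
# separating: build one model per region
X_overlap, y_overlap = X[mask], y[mask]
```

In practice you would replace the brute-force distance matrix with a proper nearest-neighbour index for anything beyond a few thousand samples.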

Reference : Handling Class Overlap and Imbalance to Detect Prompt Situations in Smart Homes by Barnan Das, Narayanan C. Krishnan, Diane J. Cook

DOT
  • Can you expand on your answer to articulate how the content in the link answers the question? It's hard to tell what, specifically, you have in mind – Sycorax Sep 07 '21 at 15:33

Let's look at an example where there are regions where one classification is obvious but also a region where there is overlap and ambiguity.

[Scatterplot of the two overlapping classes, generated by the R code below]

In this picture, the upper right and lower left are clearly dominated by red, while the upper left and lower right are clearly dominated by blue. Thus, if you have a point like $(2, 2)$, the prediction should be a high probability of the red category, while $(2, -2)$ should lead to a high probability of the blue category.

At a point like $(0, 0)$, the category to which the point belongs is not clear, and I would want my model to reflect this. Sure, it is desirable to get confident predictions, but the data in this case do not allow for such confidence. It really is the case that there is ambiguity, and to force a model to predict with confidence is to dismiss reality. Given how I generated this plot (code below), the probability that $(0,0)$ belongs to red is $1/2$, same as the probability that $(0,0)$ belongs to blue. If you force some other probability, you will be in a position to make mistakes.

I would say that, if you have overlapping classes like we do here, the way to proceed is to embrace the fact that there can be ambiguity. For instance, in this example, you really cannot accurately predict the category to which $(0,0)$ belongs.

library(MASS)
library(ggplot2)
set.seed(2023)
N <- 250
# Category 0: bivariate normal centred at the origin, strong positive correlation
X0 <- MASS::mvrnorm(N, c(0, 0), matrix(c(
  1, 0.9, 
  0.9, 1
), 2, 2))
# Category 1: same centre, but strong negative correlation
X1 <- MASS::mvrnorm(N, c(0, 0), matrix(c(
  1, -0.9, 
  -0.9, 1
), 2, 2))
d0 <- data.frame(
  x1 = X0[, 1],
  x2 = X0[, 2],
  y = "Category 0"
)
d1 <- data.frame(
  x1 = X1[, 1],
  x2 = X1[, 2],
  y = "Category 1"
)
d <- rbind(d0, d1)
# the two classes overlap heavily near the origin
ggplot(d, aes(x = x1, y = x2, col = y)) +
  geom_point() 
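The claim that $(0,0)$ belongs to red with probability exactly $1/2$ can be checked directly from the generating distributions: both classes are equally likely a priori ($N$ points each), so Bayes' rule reduces the posterior to a ratio of class-conditional densities. A small numpy sketch (mirroring the R simulation's means and covariances):

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    # bivariate normal density evaluated at point x
    diff = x - mean
    expo = -0.5 * diff @ np.linalg.inv(cov) @ diff
    return np.exp(expo) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

x = np.array([0.0, 0.0])
mean = np.zeros(2)
cov_red = np.array([[1.0, 0.9], [0.9, 1.0]])     # Category 0 in the plot
cov_blue = np.array([[1.0, -0.9], [-0.9, 1.0]])  # Category 1

f_red = gauss_pdf(x, mean, cov_red)
f_blue = gauss_pdf(x, mean, cov_blue)
# equal priors, so the posterior is just the normalised density ratio
p_red = f_red / (f_red + f_blue)
```

At the origin both exponents vanish and both covariance matrices have the same determinant, so the densities are equal and `p_red` comes out as exactly 0.5; no classifier can do better there.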
Dave