
A random forest classifier is reporting perfect classification accuracy when I pass it the data that it was trained on, even though it has only 1 predictor, whose values overlap between the classes.

Is this possible, or am I making an error? If it is possible, how?

The distributions of the values in the two classes are shown below.

I know that evaluating on the training data doesn't say anything meaningful about how good the classifier is, and that using only 1 feature in a random forest is unusual, but I am trying to assess whether the classifier is overfitting by sequentially adding features and comparing the accuracy on the training set and the test set.

[Figure: distributions of the predictor values for the two classes ("normal" and "preMCI"), showing overlap]
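A minimal sketch of the kind of feature-adding loop I mean (the data frames train_df/test_df and the feature names are hypothetical placeholders):

library(randomForest)

# train_df / test_df: data frames with a factor column `class` and the candidate predictors
features <- c("f1", "f2", "f3")          # candidate predictors, added one at a time
for (k in seq_along(features)) {
  fml <- reformulate(features[1:k], response = "class")
  fit <- randomForest(fml, data = train_df)
  acc <- function(d) mean(predict(fit, newdata = d) == d$class)
  cat(k, "features: train acc =", acc(train_df), " test acc =", acc(test_df), "\n")
}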

Drwhit

1 Answer


Oh yes, that can certainly happen.

One advantage of random forests is that they can model nonlinearities in your data. Thus, they can in principle classify perfectly even if your data are not linearly separable. It is enough if they are nonlinearly separable.

Here is an example in R:

set.seed(1)
xx <- runif(100, -1, 1)          # a single numeric predictor
yy <- as.factor(xx^2 > 0.3)      # class depends nonlinearly on xx

plot(as.numeric(yy), xx, xlab = "", xaxt = "n", pch = 19)
axis(1, 1:2, levels(yy))

library(randomForest)
model <- randomForest(yy ~ xx)
model$confusion                  # OOB-based confusion matrix: no misclassifications

      FALSE TRUE class.error
FALSE    59    0           0
TRUE      0   41           0

Note how the classes cannot be separated by a single linear threshold on the predictor, but they can be separated nonlinearly:

[Figure: plot of xx against the two classes FALSE/TRUE; TRUE occurs at both extremes of xx and FALSE in the middle, so no single linear threshold separates them]
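As a quick follow-up sketch building on the code above, querying the fitted forest on a grid of xx values shows that the predicted class is a deterministic, step-like function of the predictor:

# predicted class over a grid of xx values: a deterministic, nonlinear step function
grid <- data.frame(xx = seq(-1, 1, by = 0.01))
pred <- predict(model, newdata = grid)
plot(grid$xx, as.numeric(pred), xlab = "xx", ylab = "predicted class", yaxt = "n", pch = 19)
axis(2, 1:2, levels(pred))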

Stephan Kolassa
  • Stephan, that makes sense as long as the classes are truly separable. Whether linear or non-linear models are used, I still don't see how perfect accuracy is possible if the classes overlap. After a model is trained, shouldn't it predict the same output value every time a specific input value is provided? In the distributions that I shared above, both the "normal" and "preMCI" classes have several examples with a value of 1.5. How can the random forest predict one sample to be "preMCI" when the input value is 1.5, and a different sample to be "normal" when it also has a value of 1.5? – Drwhit Feb 16 '18 at 00:28
  • If you really have only a single predictor and your RF classifies instances with identical values of the predictor into different classes (whether this classification is correct or not), then something fundamentally strange is going on. – Stephan Kolassa Feb 16 '18 at 20:37
  • Yeah, it doesn't make any sense to me. I think my code must have a bug somewhere, but I haven't been able to find one. – Drwhit Feb 18 '18 at 07:50
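For what it's worth, a small sketch with made-up data illustrating the point from the comments: with a single predictor, predictions on supplied data are a deterministic function of that predictor, so two rows sharing the value 1.5 must receive the same predicted class, and accuracy on the training rows cannot be perfect when the classes overlap (the labels "normal"/"preMCI" simply mimic the ones above):

library(randomForest)
set.seed(1)
x <- c(rnorm(50, mean = 1), rnorm(50, mean = 2), 1.5, 1.5)    # overlapping values; 1.5 appears in both classes
y <- factor(c(rep("normal", 50), rep("preMCI", 50), "normal", "preMCI"))
fit <- randomForest(y ~ x)
pred <- predict(fit, newdata = data.frame(x = x))             # predict the training rows explicitly
mean(pred == y)                                               # strictly less than 1
table(pred[x == 1.5], y[x == 1.5])                            # both x = 1.5 rows get the same predicted class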