How to determine if the errors made by the classifiers are uncorrelated

Question

I am working on ensemble methods to improve the area under the ROC curve in an experiment. In "Ensemble Methods in Machine Learning", Dietterich says: "A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse". Later, he says that the classifiers are diverse if their errors are uncorrelated. To determine whether the errors are uncorrelated, I use the following procedure:

For the test set, I generate a vector with one entry per row (data point) of the test set and put "1" if the classifier gave the correct classification and "0" otherwise. I call this the correct classification vector (CC).
I calculate the linear correlation coefficient between the CCs of the classifiers.
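
The two steps above can be sketched roughly as follows. This is only an illustration, not the exact code from the experiment: the labels, the predictions, and the helper name `correctness_vectors` are made up, and NumPy's `np.corrcoef` stands in for whatever tool computes the linear (Pearson) correlation.

```python
import numpy as np

def correctness_vectors(y_true, predictions):
    """Return an (n_classifiers, n_points) 0/1 array.

    Entry [i, j] is 1 if classifier i labelled test point j correctly, else 0.
    """
    y_true = np.asarray(y_true)
    return np.array([(np.asarray(p) == y_true).astype(int) for p in predictions])

# Hypothetical true labels and predictions of three classifiers on eight test points.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
predictions = [
    [0, 1, 1, 0, 0, 0, 1, 1],  # classifier A
    [0, 1, 0, 0, 0, 0, 1, 1],  # classifier B
    [1, 1, 1, 0, 1, 0, 0, 0],  # classifier C
]

cc = correctness_vectors(y_true, predictions)

# np.corrcoef treats each row as one variable, so this yields the
# pairwise Pearson correlation between the CC vectors.
corr = np.corrcoef(cc)
print(np.round(corr, 2))
```

Note that the coefficient is undefined for a classifier that is right (or wrong) on every test point, because its CC vector is then constant.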

I found that the linear correlation coefficient varies from 0.32 up to 0.80. Does it mean that the errors are correlated and the ensemble will not give good results?

What can explain this correlation? Is it due to the classifiers I used (AdaBoost, random forests, nearest neighbor, SVM with a radial basis kernel), or can it be due to the dataset?

benbo · Answer 1 · 2015-08-26T14:54:45.133

If you simply look at the correlation coefficient, e.g. for a "test points by trees" matrix in a forest built the way you described your vectors, it won't tell you much about the errors, since the $1$ entries in your matrix may also correlate.

You could create vectors of the scores for the test points for each individual tree and then assess their correlation. However, while this would tell you how well the trees agree on the classification of your test points, it gives no indication of how they arrived at these scores. This is an important point to consider, since diversity also lies in the features, splits, etc. that the individual trees use to come up with a score.
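
As a rough sketch of the score-based check, assuming you have already collected the scores into a matrix with one row per test point and one column per tree (the numbers and the function name `score_correlation` below are hypothetical):

```python
import numpy as np

# Hypothetical scores: one row per test point, one column per tree.
# Each entry is that tree's score (e.g. predicted probability of the
# positive class) for that test point.
scores = np.array([
    [0.9, 0.8, 0.2],
    [0.1, 0.3, 0.6],
    [0.7, 0.6, 0.4],
    [0.4, 0.5, 0.9],
    [0.8, 0.9, 0.1],
])

def score_correlation(scores):
    """Pairwise Pearson correlation between the columns (trees) of `scores`."""
    return np.corrcoef(scores, rowvar=False)

corr = score_correlation(scores)

# Average correlation over all distinct pairs of trees (off-diagonal entries).
n_trees = corr.shape[0]
mean_off_diag = (corr.sum() - np.trace(corr)) / (n_trees * (n_trees - 1))

print(np.round(corr, 2))
print(round(float(mean_off_diag), 2))
```

With scikit-learn, such a matrix could presumably be filled from each tree in a fitted forest's `estimators_` via `predict_proba`, but that part is an assumption about your setup. As noted above, this only measures agreement on the outputs, not diversity in the features and splits behind them.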
