I am working on ensemble methods to improve the area under the ROC curve in an experiment. In "Ensemble Methods in Machine Learning", Dietterich says: "A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse". Later, he says that the classifiers are diverse if their errors are uncorrelated. To determine whether the errors are uncorrelated, I use the following procedure:
- For each classifier, I generate a vector with one entry per row (data point) of the test set, and put "1" if the classifier classified that point correctly and "0" otherwise. I call this the correct classification vector (CC).
- I calculate the linear correlation coefficient between the CCs of each pair of classifiers.
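The procedure above can be sketched in Python as follows. This is only a minimal illustration: the function names, the classifier names, and the toy labels and predictions are made up for the example and are not the data from the experiment. For 0/1 vectors, the Pearson (linear) correlation computed here is the same quantity as the phi coefficient.

```python
from itertools import combinations
from math import sqrt


def correct_classification_vector(y_true, y_pred):
    """Return 1 where the prediction matches the true label, 0 otherwise."""
    return [int(t == p) for t, p in zip(y_true, y_pred)]


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A classifier that is always right (or always wrong) has a constant
        # CC vector, so the correlation is undefined.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def pairwise_cc_correlation(y_true, predictions):
    """Correlate the CC vectors of every pair of classifiers."""
    cc = {name: correct_classification_vector(y_true, pred)
          for name, pred in predictions.items()}
    return {(a, b): pearson(cc[a], cc[b]) for a, b in combinations(cc, 2)}


# Hypothetical toy data: 8 test points and 3 classifiers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "clf_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "clf_b": [1, 0, 0, 0, 0, 1, 1, 1],
    "clf_c": [0, 0, 1, 1, 1, 0, 1, 0],
}

correlations = pairwise_cc_correlation(y_true, predictions)
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

Each printed value is the correlation between the CC vectors of one pair of classifiers, which is the quantity reported below.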
I found that the linear correlation coefficients range from 0.32 to 0.80. Does this mean that the errors are correlated and that the ensemble will not give good results?
What could explain this correlation: the classifiers I used (AdaBoost, random forests, nearest neighbor, SVM with a radial basis kernel), or could it be due to the dataset?