There are two common ways to deal with imbalanced data when using a random forest model: cost-sensitive learning and sampling. On extremely imbalanced data, a random forest tends to be biased towards the majority class.
The cost-sensitive approach assigns different weights to different classes: giving the minority class a higher weight, and thus a higher misclassification cost, helps reduce the model's bias towards the majority class. In scikit-learn you can use the `class_weight` parameter of `RandomForestClassifier` to assign a weight to each class, for example as sketched below.
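Here is a minimal sketch (the toy dataset and the 10x weight are purely illustrative, not tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced dataset (~95% majority, ~5% minority), for illustration only
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# 'balanced' reweights classes inversely proportional to their frequencies
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=42)
clf.fit(X, y)

# Alternatively, pass explicit weights; here the minority class (label 1)
# gets a 10x higher misclassification cost
clf = RandomForestClassifier(n_estimators=100, class_weight={0: 1, 1: 10},
                             random_state=42)
clf.fit(X, y)
```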
Second, there are various sampling methods, such as oversampling the minority class or undersampling the majority class. Although simple random sampling can improve overall model performance, it is usually preferable to use a more specialized method such as SMOTE to get a better model; see the sketch below.
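As a rough sketch using the imbalanced-learn library (the dataset is again a placeholder), SMOTE can resample the training data before fitting the forest:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same kind of toy imbalanced dataset as above
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # class counts before resampling

# SMOTE synthesizes new minority-class samples by interpolating between
# existing minority samples and their nearest neighbours
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)
```

Note that SMOTE should only be applied to the training split, never to the held-out test data, otherwise the evaluation metrics will be inflated.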
Most machine learning models suffer from the imbalanced-data problem, although there is some reason to believe that generative models tend to perform better on imbalanced datasets.
Specifically, take a look at the lower half of page 17. I should mention that the general consensus is that discriminative models usually outperform generative ones; still, given a normally sized imbalanced dataset and without using methods like SMOTE, a generative model can be a reasonable choice. As the dataset grows, a discriminative model will generally overtake a generative one at some point.
– Satwik Bhattamishra Apr 17 '18 at 16:50

precision: 1.000 recall: 1.000 F: 0.500 Area under the curve (AUC): 1.000
precision: 0.615 recall: 1.000 F: 0.381 Area under the curve (AUC): 1.000
precision: 1.000 recall: 1.000 F: 0.500 Area under the curve (AUC): 1.000
4. Using SMOTE: precision: 0.941 recall: 1.000 F: 0.485 Area under the curve (AUC): 1.000
Did I do something terribly wrong? What does the result mean?
– MSilvy Apr 26 '18 at 22:33