I need your help finding a flaw in my model, since its accuracy (95%) does not seem realistic.
I'm working on a classification problem using a Random Forest, with around 2,500 positive cases, 15,000 negative cases, and 75 independent variables. Here's the core of my code:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 900, criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
I've optimized the hyperparameters through grid search and performed k-fold cross validation, which reports a mean accuracy of 0.9444 (a sketch of that setup follows the confusion matrix below). Confusion matrix:
[[3390, 85],
[ 101, 516]]
which corresponds to about 95.5% accuracy on the test set (and 97.6% specificity on the negative class).
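For reference, the grid search and cross validation were set up roughly along these lines (the parameter grid below is only illustrative, not my exact one):
# Hyperparameter tuning via grid search (illustrative parameter grid)
from sklearn.model_selection import GridSearchCV, cross_val_score
param_grid = {'n_estimators': [300, 600, 900],
              'max_depth': [None, 10, 20],
              'min_samples_leaf': [1, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(criterion = 'gini', random_state = 0),
                    param_grid, cv = 10, scoring = 'accuracy', n_jobs = -1)
grid.fit(X_train, y_train)
# 10-fold cross validation with the tuned model
scores = cross_val_score(grid.best_estimator_, X, y, cv = 10, scoring = 'accuracy')
print(scores.mean())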
Did I miss something?
NOTE: the dataset consists of the financial reports of 2,500 Italian mafia firms (positive cases) and 15,000 lawful firms randomly sampled from the same regions (negative cases).
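Given that 2,500 / 15,000 imbalance, one sanity check (not part of my original pipeline, just a sketch) is the accuracy of a trivial majority-class baseline, which is already around 86%:
# Majority-class baseline: always predicts "lawful firm"
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy = 'most_frequent')
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # roughly 15000 / 17500 = 0.86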
Thank you guys!
EDIT: I have uploaded the metrics and confusion matrix. The model is actually performing well, but the metrics and confusion matrix show more realistic values for log loss and recall, so I assume it is fine.
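Those extra metrics were computed along these lines (a sketch, reusing y_test, y_pred and classifier from the code above):
# Per-class precision/recall/F1, log loss and ROC AUC on the test set
from sklearn.metrics import classification_report, log_loss, roc_auc_score
y_proba = classifier.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
print(classification_report(y_test, y_pred))
print(log_loss(y_test, y_proba))
print(roc_auc_score(y_test, y_proba))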

Look at sensitivity and specificity measures to understand the model's performance. Since you are using cross validation, the performance measures should be relatively stable on unseen data. That is why we perform CV after all, to understand the performance on unseen data. – Skiddles Nov 06 '18 at 19:15
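For what it's worth, a minimal sketch of those two measures, computed from the cm variable in the question:
# Sensitivity (true positive rate) and specificity (true negative rate) from the confusion matrix
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)   # 516 / (516 + 101) = 0.836
specificity = tn / (tn + fp)   # 3390 / (3390 + 85) = 0.976
print(sensitivity, specificity)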