
I have developed a multi-class random forest model and it's working great (well, almost too well). I am getting very high accuracy, sometimes even 1.0. But I am suspicious of this result, mainly because even if I train the model with only 1% of the data and test on the remaining 99%, I still get ~1.0 accuracy. As I understand it, this is very odd, so I am trying to figure out what's going on and what the possible reasons for such behavior could be.

My dataset has ~63k rows and ~80 columns, but I am using the top 30 columns (after feature selection) to train the model. There are also 13 different classes (labels).

confusion matrix:

[[ 119    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0   93    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0  158    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0  444    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0  301    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0 3425    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0 6702    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0  727    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0   96    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0  116    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0  119    1    0]
 [   0    0    0    0    0    0    0    0    0    0    0   97    0]
 [   0    0    0    0    0    0    0    0    0    0    0    0  260]]

My code:

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("merged_data_set.csv")
df = df[(df[['4', '5', '9']] > 0).all(axis=1)]
df = df.reset_index(drop=True)
print(df.head())          # head is a method; df.head would print the bound method
df = df.dropna()
print(df.head())

X = df.drop(df.columns[-1], axis=1)   # all columns except the last
Y = df[df.columns[-1]]                # the last column holds the labels

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

  • How many rows belong to each of the 13 classes? – Bryan Krause Jun 29 '20 at 21:56
  • The data is not evenly distributed; the lowest class has ~480 rows, and the highest ~33k – Masudul Hasan Jun 29 '20 at 21:57
  • That could contribute; what does the confusion matrix look like? Could it be that your model is doing quite well by just sorting a couple of dominant categories and basically ignoring all the others? – Bryan Krause Jun 29 '20 at 22:00
  • Also, what are you using for feature selection? You could be overfitting a model if you are using your test data. – Bryan Krause Jun 29 '20 at 22:00
  • I am using "SelectKBest" from scikit-learn for feature selection – Masudul Hasan Jun 29 '20 at 22:09
  • That doesn't really say anything, it depends on what your scoring/cost function is. And the other question? – Bryan Krause Jun 29 '20 at 22:29
  • from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn import preprocessing
    test = SelectKBest(score_func=chi2, k=30)
    fit = test.fit(X, Y)
    np.set_printoptions(precision=3)
    features = fit.transform(X)

    this is my code for feature selection. I also added the confusion matrix in the post

    – Masudul Hasan Jun 29 '20 at 23:48
  • Update your question with the feature selection code, don’t reply in the comments – astel Jun 29 '20 at 23:53
  • But yes, you’re probably leaking data when you do feature selection (among other places, most likely). Try running the forest on all 80 variables and see what accuracy you get. – astel Jun 29 '20 at 23:57
  • I updated the question with my code. The code is without feature selection and the result is the same. – Masudul Hasan Jun 30 '20 at 04:27

1 Answer


It is possible that you have a data leakage problem, i.e., there are some "cheating" variables among the features. For example, suppose we are predicting whether a user will buy some product, but "mistakenly" use the number of customer-service calls that person made as a feature.
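A common way such leakage creeps in with this setup is running SelectKBest on the full dataset before splitting, so the chosen features have already "seen" the test labels. A minimal sketch of the leakage-safe alternative, putting the selector inside a Pipeline so it is refit on each training fold (the data here is synthetic and the sizes are made up, not taken from the question):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 20))            # 20 non-negative features (chi2 requires >= 0)
y = rng.integers(0, 3, size=500)     # 3 classes with purely random labels

pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=10)),  # fit on training folds only
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# cross_val_score refits the whole pipeline per fold, so feature scores
# never see the held-out fold's labels
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy:", scores.mean())  # near chance level, since labels are noise
```

With random labels the cross-validated accuracy stays near chance, which is exactly what an honest evaluation should report.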

A related question and answer can be found here. You may quickly fit a decision tree to see if there are any very indicative features.

How can I quickly detect cheating variables in large data?
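As a quick illustration of that check (with made-up data, not the asker's), a very shallow tree will immediately surface a feature that encodes the label:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (X[:, 2] > 0.5).astype(int)      # feature 2 encodes the label directly

# A depth-2 tree is enough: if one feature alone separates the classes
# (near-1.0 importance and near-perfect accuracy), it is likely "cheating"
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
for i, imp in enumerate(clf.feature_importances_):
    if imp > 0.9:
        print(f"feature {i} dominates (importance {imp:.2f}) - inspect it")
```

If a single feature shows near-total importance on your real data, inspect how that column was constructed before trusting the accuracy number.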

Haitao Du