
I have developed a multi-class random forest model and it's working great (well, almost too well). I am getting very high accuracy, sometimes even 1.0. But I am suspicious of this result, mainly because even if I train the model with only 1% of the data and test on the remaining 99%, I still get ~1.0 accuracy. As I understand it, this is very odd, so I am trying to figure out what's going on and what the possible reasons for such behavior could be.

My dataset has ~63k rows and ~80 columns, but I am using the top 30 columns (after feature selection) to train the model. There are also 13 different classes (labels).

confusion matrix:

[[ 119    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0   93    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0  158    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0  444    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0  301    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0 3425    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0 6702    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0  727    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0   96    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0  116    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0  119    1    0]
 [   0    0    0    0    0    0    0    0    0    0    0   97    0]
 [   0    0    0    0    0    0    0    0    0    0    0    0  260]]

My code:

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("merged_data_set.csv")
df = df[(df[['4', '5', '9']] > 0).all(axis=1)]
df = df.reset_index(drop=True)
print(df.head())          # head is a method; df.head would print the bound method
df = df.dropna()
print(df.head())

X = df.drop(df.columns[-1], axis=1)   # all columns except the last
Y = df[df.columns[-1]]                # the last column holds the labels

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

  • How many rows belong to each of the 13 classes? – Bryan Krause Jun 29 '20 at 21:56
  • The data is not evenly distributed; the lowest class has ~480 rows, and the highest ~33k – Masudul Hasan Jun 29 '20 at 21:57
  • That could contribute; what does the confusion matrix look like? Could it be that your model is doing quite well by just sorting a couple of dominant categories and basically ignoring all the others? – Bryan Krause Jun 29 '20 at 22:00
  • Also, what are you using for feature selection? You could be overfitting a model if you are using your test data. – Bryan Krause Jun 29 '20 at 22:00
  • I am using "SelectKBest" from scikit-learn for feature selection – Masudul Hasan Jun 29 '20 at 22:09
  • That doesn't really say anything, it depends on what your scoring/cost function is. And the other question? – Bryan Krause Jun 29 '20 at 22:29
  • from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn import preprocessing
    test = SelectKBest(score_func=chi2, k=30)
    fit = test.fit(X, Y)
    np.set_printoptions(precision=3)
    features = fit.transform(X)

    this is my code for feature selection. I also added the confusion matrix in the post

    – Masudul Hasan Jun 29 '20 at 23:48
  • Update your question with the feature selection code, don’t reply in the comments – astel Jun 29 '20 at 23:53
  • But yes, you’re probably leaking data when you do feature selection (among other places, most likely). Try running the forest on all 80 variables and see what accuracy you get. – astel Jun 29 '20 at 23:57
  • I updated the question with my code. The code is without feature selection and the result is the same. – Masudul Hasan Jun 30 '20 at 04:27

1 Answer


It is possible that you have a data leakage problem, i.e., there are some "cheating" variables among the features. For example, suppose we are predicting whether a user will buy some product, but "mistakenly" use the number of customer-service calls that person made as a feature.
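A common way such leakage creeps in with this setup is running SelectKBest on the full dataset before splitting, so the chosen features have already "seen" the test labels. A minimal sketch of the leakage-safe alternative, putting the selector inside a Pipeline so it is refit on each training fold (the data here is synthetic and the sizes are made up, not taken from the question):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 20))            # 20 non-negative features (chi2 requires >= 0)
y = rng.integers(0, 3, size=500)     # 3 classes with purely random labels

pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=10)),  # fit on training folds only
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# cross_val_score refits the whole pipeline per fold, so feature scores
# never see the held-out fold's labels
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy:", scores.mean())  # near chance level, since labels are noise
```

With random labels the cross-validated accuracy stays near chance, which is exactly what an honest evaluation should report.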

A related question and answer can be found here. You may quickly fit a decision tree to see if there are any very indicative features.

How can I quickly detect cheating variables in large data?
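As a quick illustration of that check (with made-up data, not the asker's), a very shallow tree will immediately surface a feature that encodes the label:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (X[:, 2] > 0.5).astype(int)      # feature 2 encodes the label directly

# A depth-2 tree is enough: if one feature alone separates the classes
# (near-1.0 importance and near-perfect accuracy), it is likely "cheating"
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
for i, imp in enumerate(clf.feature_importances_):
    if imp > 0.9:
        print(f"feature {i} dominates (importance {imp:.2f}) - inspect it")
```

If a single feature shows near-total importance on your real data, inspect how that column was constructed before trusting the accuracy number.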

Haitao Du