I have developed a multi-class Random Forest model and it's working great (well, almost too well). I am getting very high accuracy, sometimes even 1.0. But I am suspicious of this result. The main reason is that even if I train the model with only 1% of the data and test on the remaining 99%, I still get ~1.0 accuracy. As far as I understand, this is very odd, so I am trying to figure out what's going on and what the possible reasons behind such behavior could be.
My dataset has ~63k rows and ~80 columns, but I am using only the top 30 columns (after feature selection) to train the model. There are also 13 different classes (labels).
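For context, one common way to keep the "top 30 columns" is to rank features by a Random Forest's own importances. This is only a hypothetical sketch of that step (the original feature-selection code is not shown here), using synthetic data as a stand-in for the real CSV:

```python
# Hypothetical sketch: keep the top-k features ranked by RF importance.
# Synthetic data stands in for the real dataset (an assumption).
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y = make_classification(n_samples=1000, n_features=80,
                               n_informative=30, n_classes=13,
                               n_clusters_per_class=1, random_state=42)
X = pd.DataFrame(X_arr, columns=[str(i) for i in range(80)])

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Sort columns by importance (descending) and keep the 30 strongest.
top30 = X.columns[np.argsort(rf.feature_importances_)[::-1][:30]]
X_reduced = X[top30]
print(X_reduced.shape)  # (1000, 30)
```

Note that if the feature selection was run on the full dataset before splitting, the test set has already influenced the chosen features, which can inflate accuracy.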
confusion matrix:
[[ 119 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 93 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 158 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 444 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 301 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 3425 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 6702 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 727 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 96 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 116 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 119 1 0]
[ 0 0 0 0 0 0 0 0 0 0 0 97 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 260]]
My code:
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("merged_data_set.csv")
# Keep only rows where columns '4', '5' and '9' are all positive.
df = df[(df[['4', '5', '9']] > 0).all(axis=1)]
df = df.reset_index(drop=True)
df = df.dropna()
print(df.head())

# The last column is the label; everything else is a feature.
X = df.drop(columns=df.columns[-1])
Y = df[df.columns[-1]]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
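The 1%-train / 99%-test experiment mentioned above can be expressed with `train_test_split` directly via `train_size`. This sketch uses synthetic data in place of `merged_data_set.csv` (an assumption); on leakage-free data of this size, accuracy would normally drop well below 1.0 with so little training data, so near-perfect accuracy in this setup is a strong hint that some feature leaks the label:

```python
# Sanity check: train on 1% of the rows, test on the remaining 99%.
# Synthetic data stands in for merged_data_set.csv (an assumption).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=63000, n_features=30,
                           n_informative=20, n_classes=13,
                           n_clusters_per_class=1, random_state=42)

# train_size=0.01 -> 630 training rows, 62370 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.01, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy with 1% training data: {acc:.3f}")
```

If accuracy stays at ~1.0 even here, inspecting `clf.feature_importances_` for one dominant feature is a quick way to spot a leaking column.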
This is my code for feature selection; I also added the confusion matrix to the post.
– Masudul Hasan Jun 29 '20 at 23:48