I am trying to build a neural net to predict a binary output (0 or 1). My dataset is fairly small: 600 samples, 200 of them with label 0 and 400 with label 1. I have 23 features, some of them categorical (I encode them using pd.get_dummies). My target is a column "Status" with values 0 for 'Terminated' and 1 for 'Active'. I also scale my data with StandardScaler. I have a feeling the way I built my net is wrong. I use Dropout and BatchNormalization to prevent overfitting, since my dataset is pretty small.
What can I change and improve in my classifier to make it more accurate?
# Imports assumed by this snippet (standalone Keras; adjust if you use tensorflow.keras)
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, LeakyReLU

def make_classifier(self, optimizer):
    classifier = Sequential()
    # Dropout on the inputs to prevent overfitting; the input shape is declared here
    classifier.add(Dropout(0.2, input_shape=(24,)))
    # First hidden block: Dense -> BatchNormalization -> LeakyReLU -> Dropout
    classifier.add(Dense(50, kernel_initializer="he_normal",
                         kernel_regularizer=keras.regularizers.l2(0.01)))
    classifier.add(BatchNormalization())
    classifier.add(LeakyReLU(alpha=0.2))
    classifier.add(Dropout(0.2))
    # Second hidden block, same pattern
    classifier.add(Dense(20, kernel_initializer="he_normal",
                         kernel_regularizer=keras.regularizers.l2(0.01)))
    classifier.add(BatchNormalization())
    classifier.add(LeakyReLU(alpha=0.2))
    classifier.add(Dropout(0.15))
    # Single sigmoid unit for the binary target
    classifier.add(Dense(1, kernel_initializer="he_normal", activation="sigmoid",
                         kernel_regularizer=keras.regularizers.l2(0.01)))
    classifier.compile(optimizer=optimizer, loss="binary_crossentropy",
                       metrics=["accuracy"])
    return classifier
This is how I prep my data:
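# Imports assumed by this snippet (KerasClassifier is taken to be the legacy Keras scikit-learn wrapper)
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

# Encode the target: 0 for 'Terminated', 1 for anything else
df['Status'] = np.where(df['Status'] == 'Terminated', 0, 1)
df = df.dropna()
# One-hot encode the categorical columns
features = ['Region', 'Function']
df_final = pd.get_dummies(df, columns=features, drop_first=True)
X = df_final.drop(['Status'], axis=1).values
y = df_final[['Status']].values
sc = StandardScaler()
X = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)
classifier = KerasClassifier(build_fn=self.make_classifier)
batch_max = len(X_train)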
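One detail in the preparation above that may be worth checking: the scaler is fitted on all of X before train_test_split, so the test rows contribute to the mean and standard deviation. A minimal sketch of the alternative, computing the statistics from the training rows only, is below. It uses plain NumPy on made-up numbers, and `fit_standardizer` and `apply_standardizer` are hypothetical helper names, not scikit-learn functions. With scikit-learn, the equivalent would be calling `fit_transform` on X_train and `transform` on X_test.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and std computed from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)            # population std (ddof=0)
    std = np.where(std == 0, 1.0, std)   # constant columns: avoid division by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any split with statistics that were learned on the training split."""
    return (X - mean) / std

# Toy data (made up for illustration): 8 rows, 2 numeric features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0],
              [5.0, 50.0], [6.0, 60.0], [7.0, 70.0], [8.0, 80.0]])
X_train, X_test = X[:6], X[6:]

mean, std = fit_standardizer(X_train)
X_train_scaled = apply_standardizer(X_train, mean, std)
X_test_scaled = apply_standardizer(X_test, mean, std)

print(X_train_scaled.mean(axis=0))  # approximately zero in every column
print(X_test_scaled)                # test rows scaled with the training statistics
```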
Later on I use grid search:
params = {
    'batch_size': [50, 150, 250, batch_max],   # samples per gradient update
    'epochs': [5, 11, 32, 50, 64, 100],        # number of passes over the training data
    'optimizer': ['adam', 'rmsprop', 'nadam']}
grid_search = GridSearchCV(estimator=classifier,
                           param_grid=params,
                           scoring="accuracy",
                           cv=2, error_score='raise')
grid_search = grid_search.fit(X_train, y_train)
best_param = grid_search.best_params_
best_accuracy = grid_search.best_score_
print('Grid best accuracy', best_accuracy)
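To get a feel for what this grid asks of the network, the arithmetic below counts the model fits and the weight updates per fit. It assumes that dropna leaves all 600 rows (the post does not say), so the 75/25 split gives 450 training rows and each of the 2 CV folds trains on 225 of them. With fewer rows the numbers shrink accordingly.

```python
import math
from itertools import product

n_samples = 600                                      # assumption: dropna removed no rows
n_train = n_samples - math.ceil(0.25 * n_samples)    # 450 rows left after the 25% test split
cv = 2
fold_train = n_train - n_train // cv                 # 225 rows in each CV training fold

batch_sizes = [50, 150, 250, n_train]
epochs_list = [5, 11, 32, 50, 64, 100]
optimizers = ["adam", "rmsprop", "nadam"]

n_configs = len(list(product(batch_sizes, epochs_list, optimizers)))
print("configurations:", n_configs, "-> cross-validation fits:", n_configs * cv)

# Weight updates per fit = epochs * batches per epoch. A batch size larger
# than the fold simply means one full-batch update per epoch.
for batch_size in batch_sizes:
    steps = math.ceil(fold_train / batch_size)
    updates = [epochs * steps for epochs in epochs_list]
    print(f"batch_size={batch_size}: {steps} step(s) per epoch, updates={updates}")
```

Under those assumptions the two largest batch sizes behave identically during cross-validation (one update per epoch), and a grid point such as batch_size=250 with epochs=5 trains the network for only 5 updates.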
The best accuracy is currently 85%.
I also use other models: random forest, MLPClassifier, XGBoost and CatBoost. I make sure they do not overfit, and their accuracy is in the 90-95% range. I am hoping to achieve similar accuracy with this one.
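For context when reading these accuracy figures: with the 200/400 class split described at the top, a model that always predicts label 1 already scores about 67%. The short calculation below shows that baseline, plus the "balanced" class weights the imbalance would imply, using the n_samples / (n_classes * class_count) heuristic. Whether class weighting would actually help on this data is untested.

```python
counts = {0: 200, 1: 400}            # class counts from the description above
n_samples = sum(counts.values())

# Accuracy of always predicting the most frequent class.
majority_baseline = max(counts.values()) / n_samples
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")

# "Balanced" weights: n_samples / (n_classes * class_count).
class_weight = {label: n_samples / (len(counts) * count)
                for label, count in counts.items()}
print("balanced class weights:", class_weight)
```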