
I am trying to build a neural net to predict a binary output (0 or 1). I have a pretty small dataset: 600 samples, 200 of them labeled 0 and 400 labeled 1. I have 23 features, some of them categorical (I encode them using pd.get_dummies). My target is a column "Status" with values 0 for 'Terminated' and 1 for 'Active'. I also scale my data with StandardScaler. I have a feeling the way I built my net is wrong. I use Dropout and BatchNormalization to prevent overfitting, since my dataset is pretty small.

What can I change and improve in my classifier to make it more accurate?

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, LeakyReLU

def make_classifier(self, optimizer):
    classifier = Sequential()
    # Dropout on the inputs to prevent overfitting; this layer also fixes the input shape
    classifier.add(Dropout(0.2, input_shape=(24,)))
    # No activation on the Dense layers: LeakyReLU is applied after BatchNormalization
    classifier.add(Dense(50, kernel_initializer="he_normal", kernel_regularizer=keras.regularizers.l2(0.01)))
    classifier.add(BatchNormalization())
    classifier.add(LeakyReLU(alpha=0.2))

    classifier.add(Dropout(0.2))
    classifier.add(Dense(20, kernel_initializer="he_normal", kernel_regularizer=keras.regularizers.l2(0.01)))
    classifier.add(BatchNormalization())
    classifier.add(LeakyReLU(alpha=0.2))

    classifier.add(Dropout(0.15))
    # Single sigmoid unit for the binary target
    classifier.add(Dense(1, kernel_initializer="he_normal", activation="sigmoid", kernel_regularizer=keras.regularizers.l2(0.01)))
    classifier.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
    return classifier

This is how I prep my data:

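import numpy as np
import pandas as pd

# Encode the target: 0 for 'Terminated', 1 for everything else (Active)
df['Status'] = np.where(df['Status'] == 'Terminated', 0, 1)
df = df.dropna()
# One-hot encode the categorical columns
features = ['Region', 'Function']
df_final = pd.get_dummies(df, columns=features, drop_first=True)
# The target column is 'Status', as encoded above
X = df_final.drop(['Status'], axis=1).values
y = df_final[['Status']].values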


sc = StandardScaler()
X = sc.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1, stratify=y)

classifier = KerasClassifier(build_fn=self.make_classifier)
batch_max = len(X_train)  # largest batch size tried: the whole training set

Later on I use grid search:

params = {
    'batch_size': [50, 150, 250, batch_max],  # number of samples per gradient update
    'epochs': [5, 11, 32, 50, 64, 100],  # number of passes of the training data through the network
    'optimizer': ['adam', 'rmsprop', 'nadam']}
grid_search = GridSearchCV(estimator=classifier,
                           param_grid=params,
                           scoring="accuracy",
                           cv=2, error_score='raise')

grid_search = grid_search.fit(X_train,y_train)

best_param = grid_search.best_params_
best_accuracy = grid_search.best_score_
print('Grid best accuracy', best_accuracy)

The best accuracy is currently 85%.
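
To put that number in context, here is a rough sanity check of the sizes involved. It is only a sketch: it assumes all 600 rows survive dropna() and that the 200/400 label counts quoted above hold, and the variable names are made up for illustration.

```python
import math

# Counts quoted in the question (assumed to still hold after dropna())
n_samples, n_zero, n_one = 600, 200, 400
test_size, cv_folds = 0.25, 2

# Accuracy of always predicting the majority class (label 1)
majority_baseline = max(n_zero, n_one) / n_samples

# train_test_split rounds the test set size up; the rest is training data
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# With cv=2, each fit inside the grid search trains on half of the training set
n_fold_train = n_train - n_train // cv_folds

# Batch sizes from the grid; batch_max equals the training set size
batch_sizes = [50, 150, 250, n_train]
full_batch = [b for b in batch_sizes if b >= n_fold_train]

# 4 batch sizes x 6 epoch settings x 3 optimizers, each fitted once per fold
n_fits = len(batch_sizes) * 6 * 3 * cv_folds

print(f"majority baseline: {majority_baseline:.3f}")
print(f"train rows: {n_train}, rows per CV fit: {n_fold_train}")
print(f"batch sizes >= rows per CV fit: {full_batch}")
print(f"model fits in the grid search: {n_fits}")
```

Under those assumptions, always predicting label 1 would already score about 67%, each cross-validation fit sees only 225 rows, and the two largest batch sizes amount to full-batch training within a fold.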

I also use other models: random forest, MLPClassifier, XGBoost, and CatBoost. I make sure they do not overfit, and their accuracy is in the 90-95% range. I am hoping to achieve similar accuracy with this one.
