I am trying to find the f1 score, precision, and recall on a highly imbalanced dataset, and I would like to use a k-fold cross-validation approach. I followed this procedure:

  1. create arrays to store the test labels and the predictions.
  2. split the data into training and testing sets
  3. do the over/under sampling on the training data only
  4. train the model and get the predictions on the test data
  5. append the test labels to array [A] and the predictions to array [B]
  6. go back to (2) for the next fold
  7. calculate the f1-score by comparing [A] and [B] (see the sketch right after this list)
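
To make the loop concrete, here is the skeleton I have in mind, written with sklearn's StratifiedKFold so the folds do not overlap. This is only a sketch of the procedure above, not the code I actually ran (my real code below draws a fresh random split on every iteration with train_test_split):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import RandomOverSampler

# toy data, same shape as in my code below
X, y = make_classification(n_samples = 1500, n_features = 5, n_redundant = 0, weights = [0.9])

test_labels, pooled_predictions = [], []           # steps 1 and 5: pooled across folds
skf = StratifiedKFold(n_splits = 5, shuffle = True)  # step 2: non-overlapping, stratified folds
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # step 3: resample the training fold only
    X_res, y_res = RandomOverSampler(sampling_strategy = 'minority').fit_resample(X_train, y_train)
    # step 4: train and predict
    clf = RandomForestClassifier()
    clf.fit(X_res, y_res)
    preds = clf.predict(X_test)
    # step 5: pool this fold's labels and predictions
    test_labels.extend(y_test.tolist())
    pooled_predictions.extend(preds.tolist())

# step 7: one report over the pooled folds
print(classification_report(test_labels, pooled_predictions))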

This is my code:

import pandas as pd
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# pip install imblearn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

creating a dataset

X, y = make_classification(n_samples = 1500, n_features = 5, n_redundant = 0, weights = [0.9])
df = pd.concat([pd.DataFrame(X), pd.DataFrame(y, columns = ['class'])], axis = 1)

df.head()

getting the input and output feature

input_features = df.drop(labels = 'class', axis = 1)
target_feature = df['class']

specifying the number of folds

n_folds = 5

arrays to store the test data and predictions

test_data_array, predictions_array = [], []

model

clf = RandomForestClassifier()

for fold in range(0, n_folds):
    # splitting the dataset
    X_train, X_test, y_train, y_test = train_test_split(input_features, target_feature, test_size = int(len(target_feature)/n_folds))
    # print(X_test.shape)
    # print(Counter(y_train))
    # do the oversampling / undersampling only in training data
    over_sample = RandomOverSampler(sampling_strategy = 'minority')
    X_train_sampled, y_train_sampled = over_sample.fit_resample(X = X_train, y = y_train)
    # print(Counter(y_train_sampled))
    # training the model
    clf.fit(X_train_sampled, y_train_sampled)
    # getting the predictions from the testing data
    predictions = clf.predict(X_test)
    # appending them into the arrays
    test_data_array = test_data_array + y_test.tolist()
    predictions_array = predictions_array + predictions.tolist()

calculating f1 score from the appended arrays

print(classification_report(y_true = test_data_array, y_pred = predictions_array))

I would like to get experts' advice on this procedure. I tried the built-in cross-validation in sklearn, but it only gives me the f1-score of each fold and their average. I am not sure which is the right approach: append the predictions across folds and calculate one f1 score as a whole, or calculate the f1-score of each fold and average them?
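
For reference, the built-in version I tried was along these lines (reconstructed as a sketch, not my exact code; wrapping the sampler and the classifier in an imblearn Pipeline is my assumption for keeping the oversampling inside each training fold):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler

# the pipeline applies the oversampling inside each training fold only,
# so the test fold is never resampled
pipe = Pipeline([
    ('over', RandomOverSampler(sampling_strategy = 'minority')),
    ('clf', RandomForestClassifier()),
])
# uses input_features / target_feature from my code above;
# cv = 5 defaults to stratified folds for a classifier
scores = cross_val_score(pipe, input_features, target_feature, cv = 5, scoring = 'f1')
print(scores)         # one f1 per fold
print(scores.mean())  # the single averaged number I was referring to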

Is this the right way to do it, or is there a better way to achieve this?
