In my problem I am dealing with a highly imbalanced data set: say, for every positive example there are 10000 negative ones. A common way to start training a model is to undersample the data. In this procedure it is very important to train the model on the undersampled data but evaluate it on a holdout set drawn from the original data, i.e. without undersampling.
Now here is the question. KFold cross-validation splits the undersampled training set into K folds and takes one of them as the test set (which is therefore also undersampled). I believe that for model evaluation we actually need to compute the metric of interest on a non-undersampled test set (is that right, or am I misunderstanding something?). If so, is it possible to perform cross-validation as follows?
- Split the data into K folds.
- Take the first fold as the test set and undersample the remaining folds (for example, fold 1 as test and folds 2,3,4,5 as train).
- Fit the model on the undersampled train data and calculate the metric of interest on the (non-undersampled) test set.
- Take the next fold as the test set (e.g., fold 2) and the rest as the train set (folds 1,3,4,5). Undersample the train set and repeat the previous step.
- Continue this procedure for the remaining folds.
Is this a correct way of doing cross-validation when we undersample the data? If so, is it possible to do it with standard libraries? A rough sketch of what I mean is given below.
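For concreteness, here is a rough sketch of the loop I have in mind, using plain scikit-learn. The undersample_majority helper and the small synthetic data set are only illustrative assumptions, not part of my actual code:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# small synthetic imbalanced data set, just for illustration
X, y = make_classification(n_samples=100000, weights=[0.99, 0.01],
                           random_state=42)

def undersample_majority(X, y, ratio=0.033, seed=42):
    # hypothetical helper (not a library function): keep all positives and
    # a random subset of negatives so that n_pos / n_neg is roughly `ratio`
    rng = np.random.RandomState(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    n_neg_keep = min(len(neg_idx), int(len(pos_idx) / ratio))
    keep = np.concatenate([pos_idx,
                           rng.choice(neg_idx, n_neg_keep, replace=False)])
    return X[keep], y[keep]

scores = []
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    # undersample only the training folds; the test fold keeps the
    # original class distribution
    X_tr, y_tr = undersample_majority(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    y_score = clf.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], y_score))
print('ROC_AUC = {} +/- {}'.format(np.mean(scores), np.std(scores)))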
Edit (Feb 21): Thanks to @Wes. I want to know whether the following piece of code is a correct implementation of KFold cross-validation on a highly imbalanced dataset.
import numpy as np
from statistics import mean, stdev
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline

# initial parameters
RANDOM_STATE = 42
RATIO = 0.033
N_SAMPLES = 1000000
K_FOLD = 5

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=1,
                           n_features=10, n_redundant=2,
                           weights=[0.9999, 0.0001], n_informative=5,
                           flip_y=0.0, n_samples=N_SAMPLES,
                           random_state=RANDOM_STATE)
print('Number of samples in each class %s' % Counter(y))

# chain the undersampler and the classifier in a pipeline, so that
# undersampling is applied only to the training folds inside the CV loop
rus = RandomUnderSampler(random_state=RANDOM_STATE, sampling_strategy=RATIO)
rfc = RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=100)
pipeline = make_pipeline(rus, rfc)

auc_roc = []
kf = KFold(n_splits=K_FOLD)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    y_pred = pipeline.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    auc_roc.append(roc_auc_score(y_test, y_pred))

print('ROC_AUC = {} +/- {}'.format(np.round(mean(auc_roc), 4),
                                   np.round(stdev(auc_roc), 4)))
With this I obtained the following result: ROC_AUC = 0.9374 +/- 0.037.
Comments:
- Wes (Feb 21 '19 at 17:38): […] cross_validate and passing your pipeline and data to that. Then you won't have to worry about manually splitting up the data on your own. Check this page out for more details. Also, if I have answered your question, please accept this answer.
- Amin Kiany (Feb 22 '19 at 15:03): […] cross_validate? scores = cross_validate(pipeline, X, y, scoring=['roc_auc'], cv=K_FOLD). If yes, I did not obtain the same results for the ROC_AUC score.
- Amin Kiany (Feb 22 '19 at 20:22): […] the cross_validate function with the part of the code where the splitting is done manually, I reached a different score for ROC_AUC. I expected a similar ROC_AUC result.
- Wes (Feb 23 '19 at 00:24): […] cv_results = cross_validate(pipeline, X, y, cv=kf, scoring='roc_auc')
- Amin Kiany (Feb 23 '19 at 09:06): […] print('ROC_AUC = {} +/- {}'.format(np.round(mean(cv_results['test_score']),4), np.round(stdev(cv_results['test_score']),4))) with cv=kf (instead of cv=K_FOLD). That's why I got a different score. I found the answer here: https://datascience.stackexchange.com/questions/14046/sklearn-cross-validation-cross-val-score-cv-parameter-question?rq=1 which explains that for binary classification cross_validate does StratifiedKFold, which is different from the plain KFold I used in the code. I rewrote the code with StratifiedKFold and also with cross_validate, and now I get the same results. :) I vote up your comment for using the cross_validate function. :)
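For reference, the cross_validate / StratifiedKFold variant described in the last comment would look roughly like this; a sketch that reuses pipeline, X, y, and K_FOLD from the code above:
from statistics import mean, stdev
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate

# stratified splitting matches what cross_validate does by default for a
# classifier when cv is given as an integer
skf = StratifiedKFold(n_splits=K_FOLD)
cv_results = cross_validate(pipeline, X, y, cv=skf, scoring='roc_auc')
print('ROC_AUC = {} +/- {}'.format(np.round(mean(cv_results['test_score']), 4),
                                   np.round(stdev(cv_results['test_score']), 4)))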