In my problem I am dealing with a highly imbalanced data set: say, for every positive example there are 10000 negative ones. A common way to start training a model is to undersample the data. In this procedure it is very important to train the model on the undersampled data but evaluate it on a holdout set drawn from the original data, i.e. without undersampling.
Now here is the question. KFold cross-validation splits the undersampled training set into K folds and takes one of them as the test set (which is therefore also undersampled). I believe that for model evaluation we actually need to compute the metric of interest on a non-undersampled test set (is that right, or am I misunderstanding something?). If so, is it possible to perform cross-validation as follows?
- Split the data into K folds.
- Take the first fold as the test set and undersample the remaining folds (for example, fold 1 as test and folds 2,3,4,5 as train).
- Fit the model on the undersampled train data and calculate the metric of interest on the (non-undersampled) test set.
- Take the next fold as the test set (e.g., fold 2) and the rest as the train set (folds 1,3,4,5). Undersample the train set and repeat the previous step.
- Continue this procedure for the remaining folds.
Is this a correct way of doing cross-validation when we undersample the data? If so, is it possible to do it with standard libraries? A rough sketch of what I mean is given below.
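For concreteness, here is a rough sketch of the loop I have in mind, using plain scikit-learn. The undersample_majority helper and the small synthetic data set are only illustrative assumptions, not part of my actual code:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# small synthetic imbalanced data set, just for illustration
X, y = make_classification(n_samples=100000, weights=[0.99, 0.01],
                           random_state=42)

def undersample_majority(X, y, ratio=0.033, seed=42):
    # hypothetical helper (not a library function): keep all positives and
    # a random subset of negatives so that n_pos / n_neg is roughly `ratio`
    rng = np.random.RandomState(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    n_neg_keep = min(len(neg_idx), int(len(pos_idx) / ratio))
    keep = np.concatenate([pos_idx,
                           rng.choice(neg_idx, n_neg_keep, replace=False)])
    return X[keep], y[keep]

scores = []
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    # undersample only the training folds; the test fold keeps the
    # original class distribution
    X_tr, y_tr = undersample_majority(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    y_score = clf.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], y_score))
print('ROC_AUC = {} +/- {}'.format(np.mean(scores), np.std(scores)))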
Edit (Feb 21): Thanks to @Wes. I want to know whether the following piece of code is a correct implementation of KFold cross-validation on a highly imbalanced dataset.
import numpy as np
from statistics import mean, stdev
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline

# initial parameters
RANDOM_STATE = 42
RATIO = 0.033
N_SAMPLES = 1000000
K_FOLD = 5

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=1,
                           n_features=10, n_redundant=2,
                           weights=[0.9999, 0.0001], n_informative=5,
                           flip_y=0.0, n_samples=N_SAMPLES,
                           random_state=RANDOM_STATE)
print('Number of samples in each class %s' % Counter(y))

# chain the undersampler and the classifier in a pipeline, so that
# undersampling is applied only to the training folds inside the CV loop
rus = RandomUnderSampler(random_state=RANDOM_STATE, sampling_strategy=RATIO)
rfc = RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=100)
pipeline = make_pipeline(rus, rfc)

auc_roc = []
kf = KFold(n_splits=K_FOLD)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    y_pred = pipeline.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    auc_roc.append(roc_auc_score(y_test, y_pred))

print('ROC_AUC = {} +/- {}'.format(np.round(mean(auc_roc), 4),
                                   np.round(stdev(auc_roc), 4)))
With this I obtained the following result: ROC_AUC = 0.9374 +/- 0.037.
Comments:
- Wes (Feb 21 '19 at 17:38): […] cross_validate and passing your pipeline and data to that. Then you won't have to worry about manually splitting up the data on your own. Check this page out for more details. Also, if I have answered your question, please accept this answer.
- Amin Kiany (Feb 22 '19 at 15:03): […] cross_validate? scores = cross_validate(pipeline, X, y, scoring=['roc_auc'], cv=K_FOLD). If yes, I did not obtain the same results for the ROC_AUC score.
- Amin Kiany (Feb 22 '19 at 20:22): […] the cross_validate function with the part of the code where the splitting is done manually, I reached a different score for ROC_AUC. I expected a similar ROC_AUC result.
- Wes (Feb 23 '19 at 00:24): […] cv_results = cross_validate(pipeline, X, y, cv=kf, scoring='roc_auc')
- Amin Kiany (Feb 23 '19 at 09:06): […] print('ROC_AUC = {} +/- {}'.format(np.round(mean(cv_results['test_score']),4), np.round(stdev(cv_results['test_score']),4))) with cv=kf (instead of cv=K_FOLD). That's why I got a different score. I found the answer here: https://datascience.stackexchange.com/questions/14046/sklearn-cross-validation-cross-val-score-cv-parameter-question?rq=1 which explains that for binary classification cross_validate does StratifiedKFold, which is different from the plain KFold I used in the code. I rewrote the code with StratifiedKFold and also with cross_validate, and now I get the same results. :) I vote up your comment for using the cross_validate function. :)
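For reference, the cross_validate / StratifiedKFold variant described in the last comment would look roughly like this; a sketch that reuses pipeline, X, y, and K_FOLD from the code above:
from statistics import mean, stdev
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate

# stratified splitting matches what cross_validate does by default for a
# classifier when cv is given as an integer
skf = StratifiedKFold(n_splits=K_FOLD)
cv_results = cross_validate(pipeline, X, y, cv=skf, scoring='roc_auc')
print('ROC_AUC = {} +/- {}'.format(np.round(mean(cv_results['test_score']), 4),
                                   np.round(stdev(cv_results['test_score']), 4)))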