Stratified balanced sampling from unbalanced data (Machine learning)

Question

I'm sorry about my carelessness. If you need a method for balanced subsampling, see the link below; it has various answers.

Scikit-learn balanced subsampling

How can I do stratified balanced sampling from imbalanced data?

I need to solve a classification problem with 40 classes. The data is collected in real time from 13 sensors and consists of 13 columns (one per sensor) by 368816 rows (roughly, time steps). I plan to feed the data into a recurrent neural network.

So I labeled it with classes 0 to 40. Data in class 0 represents the normal state of the process; the other classes represent abnormal states and identify the place that causes the problem.

The data consists of 13 columns by 368816 rows. Each row is one sample, and each of the 368816 samples belongs to one of the classes 0 to 40. But the data is imbalanced: class 0 has 103260 samples, about 28% of the whole data set.

The other classes, 1 to 40, have similar numbers of samples.

I want to draw a balanced sample from the imbalanced data. For example, if the smallest class has 7000 samples, I want to sample 7000*41 (the number of classes) rows.
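
A minimal sketch of this kind of balanced subsampling, assuming plain integer labels rather than one-hot vectors. `balanced_subsample_indices` is a hypothetical helper written for illustration, not a scikit-learn function, and the toy labels are made up:

```python
import random
from collections import defaultdict


def balanced_subsample_indices(labels, seed=0):
    """Return row indices holding the same number of rows for every class.

    Hypothetical helper (not part of scikit-learn): each class is cut down
    to the size of the rarest class by sampling without replacement.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)

    # The rarest class decides how many rows every class may keep.
    n_per_class = min(len(rows) for rows in rows_by_class.values())

    chosen = []
    for rows in rows_by_class.values():
        chosen.extend(rng.sample(rows, n_per_class))
    chosen.sort()  # keep the original row order
    return chosen


# Toy labels (made up): class 0 dominates, classes 1 and 2 are rare.
toy_labels = [0] * 10 + [1] * 3 + [2] * 4
toy_indices = balanced_subsample_indices(toy_labels)

toy_counts = defaultdict(int)
for i in toy_indices:
    toy_counts[toy_labels[i]] += 1
print(dict(toy_counts))  # {0: 3, 1: 3, 2: 3}
```

The returned indices could then be used as `x_data[idx]` and `dummy_y[idx]`, assuming both are NumPy arrays. With 41 classes and a smallest class of 7000 rows, this would keep 7000*41 = 287000 rows.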

I tried the StratifiedShuffleSplit class from the scikit-learn package. The script is below.

from sklearn.model_selection import StratifiedShuffleSplit

# dummy_y is the one-hot encoded y
data = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=99)
data.get_n_splits(x_data, dummy_y)  # only reports the number of splits (1)
for train_index, test_index in data.split(x_data, dummy_y):
    x_train, x_test = x_data[train_index], x_data[test_index]
    y_train, y_test = dummy_y[train_index], dummy_y[test_index]
print("nb of train data:", len(y_train), "nb of test data:", len(y_test))

If my sampling logic were correct, the sum of nb_train and nb_test should be smaller than 368816, because I did balanced sampling from imbalanced data.

But nb_train is 258171 and nb_test is 110645, which together make up all 368816 rows.

How can I do stratified balanced sampling from imbalanced data?

I also tried the stratified train/test split in scikit-learn, but it failed. The script I used is below.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, dummy_y, stratify=y, random_state=99, test_size=0.3)

I have data of 13 columns by 431116 rows. Each row is one sample, so it is a 2D matrix. Each sample is linked to a class label. — Hyunseung Kim, Sep 21 '17 at 09:56
What if you use [StratifiedShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) with `n_splits=1`? Will it work? — E.Z, Sep 21 '17 at 12:39
I tried the method that Vivek Kumar recommended, but it doesn't work in version 0.18. When I run `x_train,x_test,y_train,y_test=train_test_split(x_data,y_encoded,stratify=y,random_state=99,test_size=0.3)`, it prints an error message: "TypeError: object of type 'bool' has no len()". — Hyunseung Kim, Sep 22 '17 at 00:50

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

You need to use StratifiedShuffleSplit, as suggested in the comment, and you don't need cross-validation for it.

As suggested in this answer:

But if one class isn't much represented in the data set, which may be the case in your dataset since you plan to oversample the minority class, then stratified sampling may yield a different target class distribution in the train and test sets than what random sampling may yield.

That answer also describes some differences between stratified cross-validation and stratified sampling.

Hope this helps.
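
To illustrate what a stratified split does, here is a minimal sketch on toy data. The array names, sizes, and integer labels are assumptions for illustration, not the asker's data. It suggests that StratifiedShuffleSplit keeps the class proportions of the input and assigns every row to either the train or the test set, so on its own it does not balance the classes:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (an assumption for illustration): 100 rows, 13 features,
# with class 0 making up 60% of the labels.
rng = np.random.RandomState(99)
x_toy = rng.rand(100, 13)
y_toy = np.array([0] * 60 + [1] * 20 + [2] * 20)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=99)
train_index, test_index = next(splitter.split(x_toy, y_toy))

# Every row lands in exactly one of the two sets, so the sizes add up
# to the full data set; nothing is discarded.
print(len(train_index) + len(test_index))  # 100

# Per-class counts in each set: class 0 stays the majority in both.
print(np.bincount(y_toy[train_index]), np.bincount(y_toy[test_index]))
```

This matches the numbers in the question, where the train and test sizes sum to the full row count. To get equal class sizes, the rows would have to be subsampled per class first, and the stratified split applied afterwards.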
