I'm sorry about my carelessness. If you need a method for balanced subsampling, see the linked question below; it has several answers.
Scikit-learn balanced subsampling
How can I do stratified balanced sampling from imbalanced data?
I need to solve a classification problem with 41 classes (labelled 0 to 40). The data is collected in real time from 13 sensors and forms a matrix of 368816 rows (time steps) by 13 columns (one per sensor). I plan to feed the data into a recurrent neural network.
So I labelled each row with a class from 0 to 40. Class 0 means the process is in its normal state; classes 1 to 40 mean an abnormal state, with the class number identifying where the problem occurred.
Each of the 368816 rows is one sample and belongs to one of the classes 0 to 40. However, the classes are imbalanced: class 0 has 103260 samples, about 28% of the whole dataset.
The other classes, 1 to 40, have similar numbers of samples to one another.
I want to draw a balanced sample from the imbalanced data. For example, if the smallest class has 7000 samples, I want to sample 7000 rows from each class, i.e. 7000 * 41 (number of classes) = 287000 rows in total.
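To make the goal concrete, here is a minimal sketch of that kind of per-class downsampling using only NumPy. The function name `balanced_subsample`, its parameters, and the toy arrays are hypothetical and only for illustration; it assumes `y` holds integer class labels rather than the one-hot encoded version.

```python
import numpy as np

def balanced_subsample(x, y, n_per_class=None, seed=99):
    """Draw the same number of rows from every class, without replacement.

    x: array of shape (n_samples, n_features)
    y: 1-D array of integer class labels (not one-hot encoded)
    n_per_class: rows to keep per class; defaults to the smallest class size
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    if n_per_class is None:
        n_per_class = counts.min()
    picked = []
    for c in classes:
        idx = np.flatnonzero(y == c)  # row indices belonging to class c
        # Raises ValueError if n_per_class exceeds the size of this class
        picked.append(rng.choice(idx, size=n_per_class, replace=False))
    picked = np.sort(np.concatenate(picked))  # keep the original row order
    return x[picked], y[picked]

# Toy data standing in for the real sensor matrix: 3 imbalanced classes
toy_rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], [500, 70, 90])
x_toy = toy_rng.normal(size=(y_toy.size, 13))

x_bal, y_bal = balanced_subsample(x_toy, y_toy)
print(np.bincount(y_bal))  # [70 70 70]
```

With the real data, the result would have (size of the smallest class) * 41 rows, which is the 7000 * 41 case described above.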
I tried the StratifiedShuffleSplit class from scikit-learn. My script is below.
from sklearn.model_selection import StratifiedShuffleSplit

# One stratified 70/30 shuffle split with a fixed seed
data = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=99)
data.get_n_splits(x_data, dummy_y)  # dummy_y is the one-hot encoded y
for train_index, test_index in data.split(x_data, dummy_y):
    # Select the rows that belong to each side of the split
    x_train, x_test = x_data[train_index], x_data[test_index]
    y_train, y_test = dummy_y[train_index], dummy_y[test_index]
print("nb of train data:", len(y_train), "nb of test data:", len(y_test))
If my sampling logic were correct, the sum of nb_train and nb_test should be smaller than 368816, because I would have drawn a balanced sample from the imbalanced data.
However, nb_train is 258171 and nb_test is 110645, which add up to the full 368816 rows.
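A likely explanation for these numbers is that a stratified split does not equalize the classes: it divides all rows into a train part and a test part while keeping each class's share roughly the same in both. The small check below illustrates this on made-up toy labels (the arrays and variable names are hypothetical, not the sensor data).

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced labels: class 0 makes up 60% of the rows
y_toy = np.repeat([0, 1, 2], [600, 200, 200])
x_toy = np.zeros((y_toy.size, 13))  # placeholder features

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=99)
train_idx, test_idx = next(sss.split(x_toy, y_toy))

# Train and test together still cover every row
n_total = len(train_idx) + len(test_idx)
print(n_total == y_toy.size)

# Class shares in each part mirror the shares in the full data
train_share = np.bincount(y_toy[train_idx]) / len(train_idx)
test_share = np.bincount(y_toy[test_idx]) / len(test_idx)
print(train_share, test_share)
```

So stratification alone keeps the imbalance; a balanced sample needs a separate downsampling step before or after the split.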
How can I do stratified balanced sampling from imbalanced data?
I also tried a stratified train/test split with scikit-learn's train_test_split, but that did not work either. The script I used is below.
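from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, dummy_y, stratify=y, random_state=99, test_size=0.3)  # y is assumed to hold the integer class labels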