I have an imbalanced binary classification problem I am trying to solve with the LogisticRegression algorithm in sklearn.
As the data is highly imbalanced, I am looking at ways to address the imbalance. After reviewing the literature I see approaches like undersampling and oversampling, as well as specific techniques such as SMOTE and ROSE.
For this post I am only interested in the following two approaches:
- The F-beta score with a beta value >1
- The class_weight parameter in sklearn.
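
To make my question concrete, here is a minimal sketch of how I currently understand each option would be used (the synthetic dataset, class proportions, and beta=2 are just placeholders, not my real setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset standing in for my real data (95% / 5% split)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: train a plain model, but evaluate with F-beta
# (beta > 1 weights recall more heavily than precision)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(fbeta_score(y_test, clf.predict(X_test), beta=2))

# Option 2: reweight the loss during training via class_weight,
# then evaluate with the same metric for comparison
clf_w = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
print(fbeta_score(y_test, clf_w.predict(X_test), beta=2))
```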
I am having difficulty understanding the difference between how the F-beta score and class_weight work, and the pros and cons of each.
I understand that both penalize errors on the minority class more heavily, but I would greatly appreciate a detailed comparison. Specifically, does class_weight alter the samples themselves (the way resampling does), and what effect does each have on the generalization of the model? Also, is there a rule of thumb for choosing a beta value based on the degree of imbalance?
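
For reference, this is the F-beta definition I am working from (P = precision, R = recall):

$$F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$$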