
I have an imbalanced binary classification problem I am trying to solve with the LogisticRegression algorithm in sklearn.

As the data is highly imbalanced, I am looking at ways to address the imbalance. Reviewing the literature, I see resampling methods such as undersampling and oversampling, along with variants like SMOTE and ROSE.

For this post I am only interested in the following two approaches:

  1. The F-beta score with a beta value > 1
  2. The `class_weight` parameter in sklearn's LogisticRegression.
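For reference (not from the original post), the F-beta score is defined so that, for beta > 1, recall is weighted $\beta^2$ times as heavily as precision:

$$F_\beta = (1 + \beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\,\text{precision} + \text{recall}}$$

With $\beta = 1$ this reduces to the usual F1 score; $\beta = 2$ (the common "F2" score) emphasizes recall on the minority class.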

I am having difficulty understanding the difference between how the F-beta score and class_weight work, and the pros and cons of each.

I understand that both penalize errors on the minority class, but I would greatly appreciate a detailed comparison. Specifically: does class_weight affect the samples themselves, and what effect does each have on the model's generalization? Also, is there a rule of thumb for choosing a beta value based on the degree of imbalance?
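To make the contrast concrete, here is a minimal sketch (the dataset and parameter choices are illustrative, not from the question): `class_weight` changes the *training objective* by reweighting each class's contribution to the loss, while the F-beta score changes the *evaluation* of an already-fitted model.

```python
# Contrast class_weight (training-time) with F-beta (evaluation-time).
# Toy data and settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) class_weight="balanced" upweights minority-class errors in the loss;
#    the data themselves are untouched (no resampling happens).
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_weighted.fit(X_tr, y_tr)

# 2) F-beta with beta > 1 does not change fitting at all; it scores
#    predictions, weighting recall beta**2 times as heavily as precision.
clf_plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
f2_plain = fbeta_score(y_te, clf_plain.predict(X_te), beta=2)
f2_weighted = fbeta_score(y_te, clf_weighted.predict(X_te), beta=2)
```

In practice the two are complementary rather than competing: class_weight shapes which model gets fitted, while an F-beta score with beta > 1 can be used to select among fitted models (e.g. as the `scoring` argument in a grid search).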

  • Why do you think unbalanced is a problem? See https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression – kjetil b halvorsen Apr 22 '22 at 16:05
  • I'm with @kjetilbhalvorsen that class imbalance probably isn't the problem that you think it is. Frank Harrell, founder of the biostatistics department at Vanderbilt Medical School, has two good blog posts about how models like logistic regression can and should be evaluated by how well they predict probabilities (1) (2). With that in mind, balance is much less of an issue. – Dave Apr 22 '22 at 16:13
  • Thank you for these very helpful answers and I appreciate the articles. I would still like to know the difference between the two techniques in question for my general knowledge. – kdbaseball8 Apr 22 '22 at 16:47

0 Answers