
I have a multinomial regression problem that I'm fitting with glmnet. The training data are imbalanced (roughly 1:5:10). I have already tried over- and undersampling.

Would providing weights to glmnet() achieve the same thing? So, if I give weights of 5 and 10, respectively, to the rarer classes, would I no longer be forced to under/oversample?

  • Why do you want to over- and undersample in the first place? Would the output of the model, i.e. the conditional probabilities, be sufficient to solve your problem? If so, there is generally no need to over/undersample. If you're building a decision rule based on the model, or want to put your model into production together with an automated decision procedure, there is some justification, though other methods are often available. – Matthew Drury Jul 16 '18 at 18:10
  • Thank you for your comment. I want to build a model that will be used for decision making on new, unseen data. I noticed that when trained on imbalanced data, the model tends to decide for the largest class more frequently. Do you know if and how I can use the weights in the glmnet() function? – Christian Hinze Jul 22 '18 at 14:43
  • Multinomial regression does not choose classes, it returns probabilities. Are you saying that the choose-the-most-likely decision rule is giving you problems? – Matthew Drury Jul 22 '18 at 16:33
  • I'm with @MatthewDrury that the probability values should be the primary interest. In particular, so what if the model tends to classify as class #3? Link 1 Link 2 Link 3 Link 4 – Dave Apr 19 '22 at 17:32
  • As Dave mentions, if the classifier assigns everything to the majority class, it is because that is the optimal solution to the problem as posed, see https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance for a simple example. Class imbalance is most often a cost-sensitive learning problem in disguise. You need to work out the costs of different kinds of errors and factor those into the problem setting. For a probabilistic classifier, that can be done after training by altering the thresholds. – Dikran Marsupial Jan 18 '23 at 00:34
  • Resampling the data is asymptotically equivalent, but it is a hassle and raises other problems, and the right amount of resampling does not depend at all on the level of imbalance, just on the costs of the different types of errors. So unless you have a very good reason, just change the thresholds, as in the sketch after these comments (and take what is written on blogs with more than a pinch of salt: some are generally reliable, some not so much). – Dikran Marsupial Jan 18 '23 at 00:36
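To make the threshold point concrete, here is a minimal R sketch. The fitted binomial glmnet model fit, the new predictor matrix X_new, the lambda value s = 0.01, and the two costs are all illustrative assumptions, not details from the thread:

# Shift the decision threshold instead of resampling (per the comments above).
# Assumes fit is a fitted binomial glmnet model and X_new holds new predictors.
p_hat <- predict(fit, newx = as.matrix(X_new), s = 0.01, type = "response")

cost_fp <- 1    # assumed cost of a false positive
cost_fn <- 10   # assumed cost of a false negative

# Call an observation "positive" when the expected cost of "negative" is higher:
# p * cost_fn > (1 - p) * cost_fp  <=>  p > cost_fp / (cost_fp + cost_fn)
threshold <- cost_fp / (cost_fp + cost_fn)   # = 1/11 here, instead of the default 0.5
y_pred <- as.integer(p_hat > threshold)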

1 Answer


Yes, you should provide weights. I assign each observation the weight $1 - \frac{\text{\# of class members}}{\text{\# of observations}}$, so the rarer a class, the larger its weight. glmnet rescales the weights to sum to the total number of observations anyway.
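For the roughly 1:5:10 imbalance in the question, the class proportions would be $1/16$, $5/16$, and $10/16$, so this scheme gives per-observation weights of $15/16$, $11/16$, and $6/16$: the rarest class is weighted $2.5$ times as heavily as the most common one.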

Here's an example using a binomial classification model; Y is the label matrix (a single column of 0/1 labels, so nrow(Y) is the number of observations). The same idea applies to the multinomial case, as in the sketch after the model fit below.

library(glmnet)

# Y is an n x 1 matrix of 0/1 labels; X is the matrix of predictors.
# Per-class weight = 1 - class proportion, so the rarer class weighs more.
fraction_0 <- rep(1 - sum(Y == 0) / nrow(Y), sum(Y == 0))
fraction_1 <- rep(1 - sum(Y == 1) / nrow(Y), sum(Y == 1))
# assign those values to a "weights" vector
weights <- numeric(nrow(Y))
if (weighted) {  # 'weighted' is a logical flag chosen beforehand
  weights[Y == 0] <- fraction_0
  weights[Y == 1] <- fraction_1
} else {
  weights <- rep(1, nrow(Y))  # unweighted: every observation counts equally
}

Create an initial model:

lambda_model <- glmnet(as.matrix(X), as.factor(Y[, 1]), family = "binomial", weights = weights, nlambda = 100)
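A minimal sketch of the multinomial extension mentioned above, assuming the same X; the three-class factor Y3 and the simulated 1:5:10 imbalance are illustrative, not part of the original answer:

# Hypothetical multinomial version: per-class weight = 1 - class proportion.
Y3 <- factor(sample(c("a", "b", "c"), size = nrow(X), replace = TRUE,
                    prob = c(1, 5, 10) / 16))             # simulate ~1:5:10 imbalance
class_freq <- table(Y3) / length(Y3)                      # proportion of each class
weights3 <- 1 - as.numeric(class_freq[as.character(Y3)])  # rarer class -> larger weight

multi_model <- glmnet(as.matrix(X), Y3, family = "multinomial",
                      weights = weights3, nlambda = 100)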

– David