Is it right to build a logistic model for a population with 2% yes and 98% no, with 800k observations and 200 variables?

Question

I have a member-level dataset with about 800,000 observations and about 200 features, plus a binary response flag (1/0). Only 2% of the member population has a response of 1; the rest are 0.

My question is: is it appropriate to build a logistic model when the 1s make up such a small share of the population? Is there a minimum class proportion we need to consider before building a logistic model?

There is some discussion of this topic here: http://stats.stackexchange.com/questions/66753/do-i-need-a-balanced-sample-50-yes-50-no-to-run-logistic-regression — Jeff, Jul 30 '15 at 13:42

score 3 · Answer 1 · answered Jul 30 '15 at 13:23

This is a typical rate for loan defaults. For instance, AAA corporate default rates are 0.1% in a year. You have a sizeable data set, and you don't have to use all 200 features. If your data is good and you have a reasonable model, then estimation can be done. Logistic models are often fit to this kind of data.

On the surface I don't see an issue with the data set size and response rate. You may want to read a little about stratified sampling, just in case.
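As a rough sanity check of the claim that estimation is feasible at a response rate near 2%, the sketch below fits a logistic model to simulated data. Everything in it is an assumption made for illustration: the sample size, the single predictor `x`, and the true coefficients are invented and are not taken from the member data.

```r
set.seed(1)

# Simulated data (illustrative only): one predictor, roughly 2% positives.
n  <- 100000
x  <- rnorm(n)
b0 <- -4.4   # intercept chosen so the event rate is close to 2%
b1 <- 1      # true slope
y  <- rbinom(n, size = 1, prob = plogis(b0 + b1 * x))

mean(y)      # observed share of 1s, close to 0.02

# Ordinary logistic regression on the unbalanced data
fit <- glm(y ~ x, family = binomial)
coef(fit)    # estimates should land near b0 and b1
```

With a couple of thousand events, the estimates should come out close to the true values even though the classes are far from balanced.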

Thanks, I have the same concern, as I replied to the above answer. — user83685, Jul 31 '15 at 07:58

score 3 · Answer 2 · answered Jul 30 '15 at 13:49


From my understanding of logistic regression, you want to check that each outcome category (yes/no, or 1/0) has a count greater than 10*(p-1), where p is the number of covariates + 1 (for the intercept). In other words, each class needs more than 10 observations per covariate. If this holds, you should be fine.
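The rule of thumb above can be checked in a few lines. This is a minimal sketch using the figures stated in the question (800k observations, 200 features, 2% response); the variable names are made up for illustration.

```r
n_obs        <- 800000   # observations (from the question)
n_covariates <- 200      # features (from the question)
rate_yes     <- 0.02     # share of 1s (from the question)

p         <- n_covariates + 1        # covariates plus the intercept
threshold <- 10 * (p - 1)            # minimum count required per class

n_yes <- n_obs * rate_yes            # about 16,000 observations with response 1
n_no  <- n_obs - n_yes               # about 784,000 observations with response 0

c(threshold = threshold, yes = n_yes, no = n_no)
min(n_yes, n_no) > threshold         # TRUE: both classes clear the rule of thumb
```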

Answered by Jake.

Thanks, my concern was this: with such a large population of 0's relative to 1's, I think there will be at least some set of users among the 0's that exhibits similar properties to the 1's. Learning to separate 1 from 0 then becomes pointless, and in turn the model will not be a good fit. – user83685 Jul 31 '15 at 07:58
That is a possibility. Just looking at your numbers, with 800k obs and the above formula, you need > 10*(201-1), i.e. > 2000 obs for the yes's and > 2000 obs for the no's. 2% of 800k is 16k, and 98% of 800k is obviously greater still. But some problems do arise, and they are discussed here and in a few other questions on this website. – Jake Jul 31 '15 at 13:31

SoakingHummer · Answer 3 · 2015-08-07T06:21:43.373

A classification model built on data of this type may not observe enough of the rare class to be able to distinguish the characteristics of the two classes. In my view, an SVM will work better in such situations.

The svm function in the e1071 package has a parameter called class.weights, a named vector of weights for the different classes, intended for asymmetric class sizes. It might solve the problem you are facing.

Sample code:

```r
library(e1071)

# Build a two-class example from iris (the weights are not particularly sensible)
i2 <- iris
levels(i2$Species)[3] <- "versicolor"   # merge virginica into versicolor

# Recode the dependent variable to binary (0/1) levels
levels(i2$Species)[levels(i2$Species) == "versicolor"] <- "1"
levels(i2$Species)[levels(i2$Species) != "1"] <- "0"
summary(i2$Species)                     # class counts: 50 zeros, 100 ones

# Named weights, inversely proportional to class size
weights <- 100 / table(i2$Species)
weights                                 # 2 for class 0, 1 for class 1
model <- svm(Species ~ ., data = i2, class.weights = weights)
```

In your dataset, close to 2% of observations are class 1 and 98% are class 0, so you would pass a named vector of weights with 98 for class 1 and 2 for class 0 (assuming that you want to give equal importance to each class).
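For the 2%/98% split, the same inverse-frequency heuristic can be sketched as follows. The response vector here is a stand-in built only from the class counts implied by the question, and the data frame `d` in the commented-out call is hypothetical.

```r
# Class counts implied by the question: 800k rows, 2% in class 1
y <- factor(rep(c(0, 1), times = c(784000, 16000)))

counts <- table(y)
inv    <- 1 / counts                      # inverse-frequency weights
w      <- 100 * inv / sum(inv)            # rescale so the weights sum to 100
w      <- setNames(as.numeric(w), names(counts))
w   # about 2 for class "0" and 98 for class "1"

# Hypothetical usage with e1071, where d is a data frame holding y and the features:
# model <- svm(y ~ ., data = d, class.weights = w)
```

Only the ratio between the two weights matters for balancing the classes, so rescaling to a sum of 100 is just a convenience that reproduces the 98 and 2 quoted above.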

I have only 2 classes, 1 and 0. How do you assign the proportion of weights? — user83685, Jul 31 '15 at 07:52
Hey @user83685. Technically, we need to create a named vector which contains a weight for each class. Non-technically, in the above example one class has twice as many instances as the other, so the weights are assigned in the ratio of 2 for class 0 (which has 50 occurrences) to 1 for class 1 (which has 100 occurrences), so that each class is given equal importance. It is a heuristic. — SoakingHummer, Aug 07 '15 at 06:11
