How does upsampling rare events affect the interpretation of logistic regression?

Question

The answer to this question can be found here: "Does down-sampling change logistic regression coefficients?"

I have a dataset of 50k positives and nearly 1M negatives. Instead of taking a random sample to build a logistic regression, I take all 50k positives and randomly sample 50k negatives. How should I interpret the logistic regression coefficients of the upsampled version, since that sample is not actually representative of my original population?

Here is a toy example where the sampling generates two different sets of regression coefficients: the first fit uses the balanced subsample, the second uses the full data.

beta = rnorm(5)
X = matrix(rnorm(10000*5), nrow = 10000)
y = X%*%beta + rnorm(10000)
label = ifelse(y > 3, 1, 0)

X.pos = X[label == 1,]
X.neg = X[label == 0,][1:sum(label),]

summary(glm(c(label[label == 1], label[label == 0][1:sum(label)]) ~ .,
            data = as.data.frame(rbind(X.pos, X.neg)),
            family = binomial()))

Call:
glm(formula = c(label[label == 1], label[label == 0][1:sum(label)]) ~ 
    ., family = binomial(), data = as.data.frame(rbind(X.pos, 
    X.neg)))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-3.10121  -0.22644   0.00143   0.33291   3.01340  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.4787     0.2639 -13.184  < 2e-16 ***
V1            0.8904     0.1163   7.657 1.91e-14 ***
V2           -0.2319     0.1105  -2.100   0.0357 *  
V3           -2.9831     0.2064 -14.456  < 2e-16 ***
V4            1.3729     0.1333  10.299  < 2e-16 ***
V5           -1.6344     0.1435 -11.387  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1522.15  on 1097  degrees of freedom
Residual deviance:  544.67  on 1092  degrees of freedom
AIC: 556.67

Number of Fisher Scoring iterations: 7

summary(glm(label ~ .,
            data = as.data.frame(X),
            family = binomial()))

Call:
glm(formula = label ~ ., family = binomial(), data = as.data.frame(X))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5939  -0.1856  -0.0701  -0.0229   3.6145  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.60342    0.15260 -36.720  < 2e-16 ***
V1           0.67740    0.06001  11.287  < 2e-16 ***
V2          -0.27957    0.05780  -4.837 1.32e-06 ***
V3          -2.32995    0.08786 -26.520  < 2e-16 ***
V4           1.17757    0.06548  17.985  < 2e-16 ***
V5          -1.27162    0.06675 -19.052  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4254.0  on 9999  degrees of freedom
Residual deviance: 2187.1  on 9994  degrees of freedom
AIC: 2199.1

Number of Fisher Scoring iterations: 8

score 0 · Answer 1 · answered Mar 12 '15 at 10:07

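The odds ratios (coefficients) for the covariates are estimated correctly, but the intercept is affected by the sampling scheme. As a result, predicted probabilities from the fitted model are not valid at the population level unless the intercept is adjusted for the sampling rates.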

http://support.sas.com/kb/22/601.html
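As a rough illustration of the point above, here is a minimal R sketch. It assumes the outcome truly follows a logistic model and that negatives are subsampled independently of the covariates; the parameter values and the names `r0`, `r1`, and `intercept_corrected` are made up for this example. Under those assumptions the slope should be similar in both fits, and subtracting `log(r1 / r0)` from the subsample intercept should bring it close to the full-data intercept.

```r
set.seed(1)

# Simulate a population from a true logistic model with a rare outcome.
# beta0 and beta1 are hypothetical values chosen only for illustration.
n     <- 200000
beta0 <- -4
beta1 <- 1
x     <- rnorm(n)
y     <- rbinom(n, 1, plogis(beta0 + beta1 * x))
d     <- data.frame(x = x, y = y)

# Keep every positive and draw an equal number of negatives at random.
pos <- which(d$y == 1)
neg <- sample(which(d$y == 0), length(pos))
idx <- c(pos, neg)

fit_full <- glm(y ~ x, data = d,         family = binomial())
fit_sub  <- glm(y ~ x, data = d[idx, ],  family = binomial())

# Sampling fractions: positives kept with probability r1, negatives with r0.
r1 <- 1
r0 <- length(neg) / sum(d$y == 0)

# Shift the subsample intercept back to the population scale.
intercept_corrected <- unname(coef(fit_sub)[1] - log(r1 / r0))

# Compare intercept and slope across the three versions.
round(rbind(full      = coef(fit_full),
            subsample = coef(fit_sub),
            corrected = c(intercept_corrected, coef(fit_sub)[2])), 3)
```

Note that the toy example in the question generates labels by thresholding a noisy linear score rather than from a logistic model, so this sketch is not a reproduction of it.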

Analyst

1

It is confusing to see why you would do something that discards data and makes things more difficult by hurting the intercept. – Frank Harrell Mar 12 '15 at 10:49
1

@FrankHarrell perhaps JCWong thinks that it is frequency which is important and not the absolute numbers of observations with respect to covariates and such... – Analyst Mar 12 '15 at 12:30
4

To estimate model parameters you don't want to reduce the sample size. Logistic regression works fine with a highly imbalanced $Y$ distribution. – Frank Harrell Mar 12 '15 at 17:09
2

@FrankHarrell I'm sure you know about Gary King's work (among others). Care to elaborate on why rare events aren't an issue here? Or do you mean to use one of those methods instead of resampling? – shadowtalker May 29 '15 at 21:51
1

This has been discussed elsewhere on the site. Gary King's excellent work is not inconsistent with what I wrote. – Frank Harrell May 29 '15 at 22:02
2

@FrankHarrell Would you mind pointing us to some posts where that discussion has happened? I skimmed through Gary King's work and would appreciate an alternative perspective on the issue. Thank you. – Clark Chong Jul 31 '15 at 20:24
1

I wouldn't find that useful unless you can point to a place in his excellent paper where he said you must do it that way. When I read the paper I didn't see that. – Frank Harrell Jul 31 '15 at 20:53

score 0 · Answer 2 · answered Jul 31 '15 at 20:38

This is called importance sampling.

Here's my wild guess based on intuition and gut feeling. Your positives were sampled at a 20:1 rate relative to your negatives, so the population-level probability should be $$y=\frac{e^{X\beta}}{20+e^{X\beta}}$$ where $X\beta$ is the linear predictor from the logistic regression fitted on the balanced sample.

This is in contrast to a standard binary logit $$y=\frac{e^{X\beta}}{1+e^{X\beta}}$$
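A sketch of why the factor 20 shows up, assuming every positive is kept and each negative is kept with probability 1/20 independently of $X$ (the indicator $s=1$ for "selected into the sample" is notation introduced here, and $P(y=1\mid X)$ is what the formulas above call $y$). By Bayes' rule, the odds within the sample are the population odds multiplied by the ratio of the sampling rates:

$$\frac{P(y=1\mid X, s=1)}{P(y=0\mid X, s=1)}=\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}\cdot\frac{P(y=1\mid X)}{P(y=0\mid X)}=20\,\frac{P(y=1\mid X)}{P(y=0\mid X)}$$

If the model fitted on the balanced sample gives sample odds $e^{X\beta}$, the population odds are $e^{X\beta}/20$, and so

$$P(y=1\mid X)=\frac{e^{X\beta}/20}{1+e^{X\beta}/20}=\frac{e^{X\beta}}{20+e^{X\beta}}$$

This is the same as subtracting $\log 20$ from the fitted intercept and leaving the other coefficients unchanged.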
