The answer to this question can be found here: "Does down-sampling change logistic regression coefficients?"
I have a dataset of 50k positives and nearly 1M negatives. Instead of taking a random sample to fit a logistic regression, I take all 50k positives and randomly sample 50k negatives. How should I interpret the logistic regression coefficients from the downsampled version, given that the sample is no longer representative of my original population?
Here is a toy example where the downsampled fit and the full-data fit give two different sets of regression coefficients:
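# Simulate 10000 observations with 5 predictors and a rare positive class
beta = rnorm(5)
X = matrix(rnorm(10000*5), nrow = 10000)
y = X%*%beta + rnorm(10000)
label = ifelse(y > 3, 1, 0)
# Keep every positive and an equal number of negatives
# (the rows are i.i.d., so the first sum(label) negatives act as a random sample)
X.pos = X[label == 1,]
X.neg = X[label == 0,][1:sum(label),]
# Fit on the balanced (downsampled) data
summary(glm(c(label[label == 1], label[label == 0][1:sum(label)]) ~ .,
data = as.data.frame(rbind(X.pos, X.neg)),
family = binomial()))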
Call:
glm(formula = c(label[label == 1], label[label == 0][1:sum(label)]) ~
., family = binomial(), data = as.data.frame(rbind(X.pos,
X.neg)))
Deviance Residuals:
Min 1Q Median 3Q Max
-3.10121 -0.22644 0.00143 0.33291 3.01340
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4787 0.2639 -13.184 < 2e-16 ***
V1 0.8904 0.1163 7.657 1.91e-14 ***
V2 -0.2319 0.1105 -2.100 0.0357 *
V3 -2.9831 0.2064 -14.456 < 2e-16 ***
V4 1.3729 0.1333 10.299 < 2e-16 ***
V5 -1.6344 0.1435 -11.387 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1522.15 on 1097 degrees of freedom
Residual deviance: 544.67 on 1092 degrees of freedom
AIC: 556.67
Number of Fisher Scoring iterations: 7
summary(glm(label ~ .,
data = as.data.frame(X),
family = binomial()))
Call:
glm(formula = label ~ ., family = binomial(), data = as.data.frame(X))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5939 -0.1856 -0.0701 -0.0229 3.6145
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.60342 0.15260 -36.720 < 2e-16 ***
V1 0.67740 0.06001 11.287 < 2e-16 ***
V2 -0.27957 0.05780 -4.837 1.32e-06 ***
V3 -2.32995 0.08786 -26.520 < 2e-16 ***
V4 1.17757 0.06548 17.985 < 2e-16 ***
V5 -1.27162 0.06675 -19.052 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4254.0 on 9999 degrees of freedom
Residual deviance: 2187.1 on 9994 degrees of freedom
AIC: 2199.1
Number of Fisher Scoring iterations: 8
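For comparison, here is a minimal sketch of the usual intercept adjustment for this kind of sampling. It assumes the data really come from a logistic model (the toy example above thresholds a Gaussian latent variable, so it is not exactly logistic, and its slopes need not agree). Under that assumption, keeping all positives and a fraction r of the negatives should leave the slopes roughly unchanged and shift the intercept by -log(r), so adding log(r) back to the downsampled intercept should approximately recover the full-data one. The simulation settings (n, beta0, beta, the seed) are made up for illustration.

```r
# Hypothetical simulation from a true logistic model (all settings are assumptions)
set.seed(1)
n <- 200000
beta0 <- -4
beta <- c(1, -0.5)
X <- matrix(rnorm(n * 2), nrow = n)
y <- rbinom(n, 1, plogis(beta0 + as.vector(X %*% beta)))
d <- data.frame(y = y, X)

# Keep all positives and an equal-sized random sample of negatives
pos <- which(d$y == 1)
neg <- sample(which(d$y == 0), length(pos))
r <- length(neg) / sum(d$y == 0)   # fraction of negatives retained

fit.full <- glm(y ~ ., data = d, family = binomial())
fit.down <- glm(y ~ ., data = d[c(pos, neg), ], family = binomial())

# Add log(r) to the downsampled intercept; slopes are left as estimated
corrected <- coef(fit.down)
corrected[1] <- corrected[1] + log(r)

print(round(rbind(full = coef(fit.full),
                  downsampled = coef(fit.down),
                  corrected = corrected), 3))
```

If the model is misspecified, as in the thresholded example in the question, the slopes themselves can also move under downsampling, so this adjustment should be read as an approximation rather than an exact fix.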