
I've tried to build a predictive logistic regression model. However, there are only 500 observations with disease (+) and over 60,000 observations with disease (-).

Can I take a random sample (e.g., 10,000) of 60,000 observations with disease (-) to build the predictive model?

  • The proposed method is called "under-sampling." Here are some related questions & answers on this topic: https://stats.stackexchange.com/search?tab=votes&q=%22under-sampling%22%20answers%3a1 – Sycorax May 17 '22 at 14:54
  • Why do you want to do that? In general, this can be a bad idea. Imagine that there's no predictive information in any "predictors" you have. In that case, the best risk prediction for all members in the population (assuming your sample is representative) is 500/60500. If you subsampled, you'd get e.g. 500/10500. Of course, there are methods that can correct estimates for the sub-sampling that would account for this. – Björn May 17 '22 at 14:55
  • Never discard data that someone has already spent the money to collect. The exception is that when debugging your code you might take weighted samples, temporarily. – Frank Harrell May 17 '22 at 15:04
  • A random sample is OK as long as it is representative of the population. So if you happen to get 'unlucky' with your 10,000 and not get any of the adverse observations in your sample, that obviously won't work. How did you come up with the 10,000 sampling number? – Ralph Winters May 17 '22 at 15:19
  • Closely related in the context of logistic regression is the case-control design. https://stats.stackexchange.com/search?q=case-control+answers%3A1+logistic – Sycorax May 17 '22 at 16:56
  • @Björn possibly the sampling could require a lot of resources and the OP might rather take measurements with a smaller sample of the controls. Of course, when the data has already been gathered, then reducing the sample makes little sense. It only costs a tiny little bit of extra computation power. – Sextus Empiricus May 17 '22 at 21:42

1 Answer


Can I take a random sample (e.g., 10,000) of 60,000 observations with disease (-) to build the predictive model?

Let's try it out. The code below simulates data and fits the model twice: once without and once with the reduction of the data.

We see that the result is nearly the same. The reason that the reduction does not matter so much is that the estimate of the control group is already very accurate.

In the code the logistic model is computed in two different ways:

  • Using a glm model, which is the typical and straightforward approach.
  • Estimating the distributions of the two groups. This is not the typical approach (it assumes that the two groups, controls and cases, are normally distributed), but it gives a comparable result: you can see that the black line and the thick dotted red line coincide. The formula behind this is written out just below this list.
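
To make the second approach concrete (this derivation is only implicit in the answer's code, where $\mu_x$, $\mu_y$, $\sigma$ correspond to mu_x, mu_y, sig): if the controls follow $N(\mu_x,\sigma^2)$ and the cases follow $N(\mu_y,\sigma^2)$, Bayes' rule gives log odds that are linear in $z$, i.e. exactly a logistic model:

$$\log \frac{P(\text{case}\mid z)}{P(\text{control}\mid z)} \;=\; \underbrace{\log\frac{n_y}{n_x} + \frac{\mu_x^2-\mu_y^2}{2\sigma^2}}_{\text{intercept}} \;+\; \underbrace{\frac{\mu_y-\mu_x}{\sigma^2}}_{\text{slope}}\, z$$

Note that the intercept contains the case/control ratio $n_y/n_x$; this is the part that sub-sampling distorts and that has to be corrected afterwards.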

So what happens when you reduce the data is that your estimate of the control group becomes less accurate. However, the limiting factor is the estimate of the group with the cases. The cases group is smaller and its estimates are less accurate, and this dominates the overall estimation error. So reducing the control group has little influence.
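
To see the orders of magnitude (a small illustration based on the simulation settings below, where both groups have standard deviation 1): the standard error of a group mean scales as $1/\sqrt{n}$, so even the reduced control group is estimated far more precisely than the cases group.

# standard errors of the group means, with sd = 1 as in the simulation below
1/sqrt(60000)  # ~0.004: all controls
1/sqrt(10000)  # ~0.010: controls after the reduction
1/sqrt(500)    # ~0.045: cases -- the dominant source of error either way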

Of course, when you have all the data already, then there is no reason to reduce the groups. But when you still have to gather the data, it can make sense to save resources by not measuring the entire control group.

Note: you do need to keep in mind the frequency of cases in the population of interest, and not use the frequency of cases in the sub-sample, which is relatively higher. In the code below this is done by correcting the odds from the fit on the sub-sample to the frequency of the population: since the controls were thinned by the factor reduce, the fitted odds are too large by a factor 1/reduce, and the line odds2 = odds*reduce undoes this.
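
On the logit scale the same correction is simply a shift of the intercept. A minimal sketch (not part of the original code; mod_sub is a hypothetical name for a glm fit on the sub-sampled data):

# multiplying the odds by `reduce` equals adding log(reduce) on the logit scale
# ("prior correction" of the intercept; mod_sub is a glm fit on the sub-sample)
b = coef(mod_sub)
b[1] = b[1] + log(reduce)       # corrected intercept for the full population
p_pop = plogis(b[1] + b[2]*zs)  # corrected predicted probabilities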

example

layout(matrix(1:2, 2))   # two plot panels, one above the other

### generate two groups of data
set.seed(1)
nx = 6*10^4
ny = 500
y = rnorm(ny, 2, 1)              # cases
x = rnorm(nx, 0, 1)              # controls
z = c(x, y)
cat = c(rep(0, nx), rep(1, ny))

### plot
plot(z, cat)

### compute logistic model by approximating normal distributions
mu_x = mean(x)
mu_y = mean(y)
sig = ((sum((y - mu_y)^2) + sum((x - mu_x)^2))/(ny + nx - 2))^0.5   # pooled sd

zs = seq(-4, 6, 0.1)
p_caty = (dnorm(zs, mu_y, sig)*ny)/(dnorm(zs, mu_x, sig)*nx + dnorm(zs, mu_y, sig)*ny)
lines(zs, p_caty, lty = 1)

### compute logistic model with glm
mod = glm(cat ~ z, family = binomial)
lines(zs, predict(mod, newdata = list(z = zs), type = "response"), col = 2, lty = 2, lwd = 2)

title("estimate with all data 60000 vs 500 cases", cex.main = 1)

### generate two groups of data, now with only 1/6 of the controls
set.seed(1)
nx = 6*10^4
reduce = 1/6
ny = 500
y = rnorm(ny, 2, 1)
x = rnorm(nx*reduce, 0, 1)
z = c(x, y)
cat = c(rep(0, nx*reduce), rep(1, ny))

### plot
plot(z, cat)

### compute logistic model by approximating normal distributions
### (note: the population counts nx and ny are used here, not the sample counts)
mu_x = mean(x)
mu_y = mean(y)
sig = ((sum((y - mu_y)^2) + sum((x - mu_x)^2))/(ny + nx*reduce - 2))^0.5

zs = seq(-4, 6, 0.1)
p_caty = (dnorm(zs, mu_y, sig)*ny)/(dnorm(zs, mu_x, sig)*nx + dnorm(zs, mu_y, sig)*ny)
lines(zs, p_caty, lty = 1)

### compute logistic model with glm
mod = glm(cat ~ z, family = binomial)
out = predict(mod, newdata = list(z = zs), type = "response")
odds = out/(1 - out)
odds2 = odds*reduce              # rescale the odds to the population case/control ratio
out2 = odds2/(odds2 + 1)
lines(zs, out2, col = 2, lty = 2, lwd = 2)

title("estimate with reduced data 10000 vs 500 cases", cex.main = 1)
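
As a quick consistency check (an addition, not in the original answer), the normal-approximation route gives the logistic coefficients in closed form, following the formula shown earlier; these can be compared with the glm fit after the prior correction:

# implied coefficients from the estimated normal distributions
slope = (mu_y - mu_x)/sig^2
intercept = log(ny/nx) + (mu_x^2 - mu_y^2)/(2*sig^2)
c(intercept, slope)
# compare with the glm on the reduced data, after correcting its intercept:
c(coef(mod)[1] + log(reduce), coef(mod)[2])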

  • Do these two models estimate the same intercepts? – Sycorax May 17 '22 at 17:17
  • @Sycorax The model with fewer data points is of course less accurate and will be different. But when we are talking about a difference of 60000 vs 500, then reducing it to 10000 vs 500 has little effect. – Sextus Empiricus May 17 '22 at 17:19
  • Obviously the OP should not down-sample the data if he has already gathered it. But I imagine the situation where the OP has 60000 controls and 500 cases that he still has to measure/sample. In that case, sample all the cases and sample the controls as much as you can afford/need. – Sextus Empiricus May 17 '22 at 17:21
  • When the proportion of positives and negatives is fixed by the design, the intercept is likewise fixed. https://stats.stackexchange.com/questions/69561/case-control-study-and-logistic-regression/69564#69564 Appeals to sample size are misdirected. The intercept depends on the class ratio, and the ratio is fixed by the researcher in both of your demonstrations, instead of arising from SRS of the population. – Sycorax May 17 '22 at 17:55
  • The intercept has been screwed up royally. – Frank Harrell May 17 '22 at 19:01
  • @Sycorax I don't see how the intercept is fixed. The intercept depends on the means of the regressor variables in the two groups which can be variable. – Sextus Empiricus May 17 '22 at 19:34
  • @FrankHarrell how is the intercept screwed up? In both cases it is between 3 and 4. With the naked eye I cannot see a clear difference. – Sextus Empiricus May 17 '22 at 19:35
  • If simple random sampling is not used it will be messed up. It depends on the mean of the predictors and on the Y=0,1 relative frequencies. – Frank Harrell May 17 '22 at 20:24
  • @FrankHarrell in the example I have corrected for this by scaling the odds with the same factor as used to reduce the sample size. The curves in the two images are nearly the same and there is no screwed-up intercept. The estimation basically comes down to estimating the mean of the two groups and the variance. – Sextus Empiricus May 17 '22 at 21:01
  • The answer I linked to has a citation which explains in more detail. – Sycorax May 17 '22 at 21:05
  • @Sycorax the example that I have does not have a fixed intercept. If I repeat the simulation then I get a different intercept. I am not sure what I should look for in the cited book. – Sextus Empiricus May 17 '22 at 21:07
  • Re-read Frank Harrell's comments. – Sycorax May 17 '22 at 21:08
  • @Sycorax I don't get at all what point you are trying to make. This is like people explaining to me how poison and poisson should be pronounced by repeating it while not pointing out the culprit. – Sextus Empiricus May 17 '22 at 21:08
  • I think the comments highlight an important topic in case-control study designs. – Sycorax May 17 '22 at 21:42
  • I think the correction you did is approximately correct. The full correction needs an offset term on the logit scale. – Frank Harrell May 17 '22 at 21:43
  • @FrankHarrell the correction is equivalent to an offset on the logit scale, because I multiply the odds, which is the same as an addition to the log odds. – Sextus Empiricus May 17 '22 at 21:51
  • Correct me if I'm wrong but I think that putting an offset into the estimation process will result in changing the slopes a bit. – Frank Harrell May 18 '22 at 02:11
  • @FrankHarrell of course, for models with and models without offset, you get a different estimate of the slope when there is a correlation between the regressor and the offset term. But which is better? If there is an offset in the underlying model then one should include it also in the fit or otherwise we get a wrong estimate of the slope. – Sextus Empiricus May 18 '22 at 09:55
  • Yes, the offset method would be more of a "gold standard" and will affect slopes even when not correlated with predictors (related to non-collapsibility of odds ratios). – Frank Harrell May 18 '22 at 10:59
  • I've not run the code, but it looks all right to me. It's equivalent to saying that the estimate of the population intercept is the estimate of the intercept after down-sampling minus log(6), often called "prior correction". See https://stats.stackexchange.com/a/68726/17230. As in a case-control study, the estimates of population log odds ratios are the log odds ratio estimates after down-sampling. (Of course it may only be the relative odds that are of interest in any case.) – Scortchi - Reinstate Monica May 18 '22 at 12:40