Logistic regression with unbalanced sampling

Question

Let's say I have a data frame that looks like this:

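# Simulated example data: 1000 rows
groups <- floor(runif(1000, min=1, max=5))        # integer group labels 1 to 4
activity <- rep(c("A1", "A2", "A3", "A4"), times= 250)
endorsement <- floor(runif(1000, min=0, max=2))   # binary outcome, 0 or 1
value1 <- runif(1000, min=1, max=10)
# Unbalanced areas: A fills 4 of every 10 rows, B and E only 1 each
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)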

df <- data.frame(groups, activity, endorsement, value1, area)

printed:

> head(df)
  groups activity endorsement   value1 area
1      1       A1           0 7.443375    A
2      1       A2           0 4.342376    A
3      1       A3           0 4.810690    A
4      4       A4           0 3.494974    A
5      3       A1           1 6.442354    B
6      1       A2           0 9.794138    C

I want to run a logistic regression (predicting endorsement from groups), but if you look at the area variable, A is very well represented, whereas B and E are not.
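The imbalance can be checked directly by tabulating area. A minimal sketch that rebuilds only the area column from the example data (the names area_counts and area_props are introduced here for illustration):

```r
# Rebuild the area column exactly as in the example data.
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)

# Count rows per area, then convert the counts to proportions.
area_counts <- table(area)
area_props <- prop.table(area_counts)

print(area_counts)
print(round(area_props, 2))
```

In this toy data A accounts for 400 of the 1000 rows, while B and E have 100 each.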

I'm not interested in the area variable itself, but the results will be driven by the areas that have high representation in the dataset. I need to weight the data, but I'm not sure of the correct way to do it.

This is the model I'd like to run:

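library(lsmeans)
# Logistic regression of endorsement on group membership
model <- glm(endorsement ~ factor(groups), data=df, family=binomial(logit))
anova(model, test = "Chisq")        # overall test of the groups effect
lsmeans(model, pairwise ~ groups)   # group means and pairwise comparisons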

Without any adjustment, the "main effect" of groups and any pairwise differences will primarily be driven by whatever effects are found in the most represented area (in the actual dataset, area A has about 100x more subjects than any other area).

What's the correct way to adjust for the unbalanced area representation? I thought about upsampling the minority areas (or even downsampling the majority area), but I suspect this would have adverse or artificial effects on the power of the test.

I'm not too familiar with MLM - how would including area as a random effects variable help the model account for differences in area representation? — Simon, Mar 27 '17 at 06:29
Simply put, with area as a random effect you estimate a separate intercept for each area. You can also assume a random 'slope' for one of the independent/predictor variables (such as 'groups'), which boils down to a random effect for the levels of groups across the areas. Another, somewhat simpler option than MLM would be to add an interaction term between area and groups to the glm you proposed. — IWS, Mar 27 '17 at 09:07
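The two options in this comment might be sketched as follows. This is an illustration under assumptions, not a recommendation: the data are the simulated columns from the question (regenerated here with an arbitrary seed), the names model_int and model_mlm are made up, and the mixed model relies on the lme4 package, which the question does not use. With purely random data like this, the random-intercept variance may well be estimated at or near zero.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  groups = floor(runif(1000, min = 1, max = 5)),
  endorsement = floor(runif(1000, min = 0, max = 2)),
  area = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Option 1: fixed-effects interaction between groups and area.
model_int <- glm(endorsement ~ factor(groups) * area,
                 data = df, family = binomial(logit))
print(anova(model_int, test = "Chisq"))

# Option 2: random intercept per area (needs lme4, an assumption here).
if (requireNamespace("lme4", quietly = TRUE)) {
  model_mlm <- lme4::glmer(endorsement ~ factor(groups) + (1 | area),
                           data = df, family = binomial(logit))
  print(summary(model_mlm))
  # A random slope for groups would use (1 + factor(groups) | area) instead.
}
```

With only five areas, a random slope for groups is likely to be hard to estimate, which is one reason the interaction model may be the more practical of the two here.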
One possibility, it seems to me, is to simply include area as an additive term in the model you specified. Then it will estimate a separate intercept for each area, and the lsmeans step will average the predictions for each area together, giving them equal weight. Thus the areas will be equally represented when it comes to summarizing via the group means and comparisons thereof. — Russ Lenth, Mar 27 '17 at 20:50
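A rough base-R sketch of what this averaging does, assuming the simulated columns from the question (regenerated here with an arbitrary seed, with groups and area stored as factors). The names model_area, grid, and equal_weight are introduced for illustration. It fits the additive model, predicts on the logit scale for every groups-by-area combination, and averages over areas with equal weight.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  groups = factor(floor(runif(1000, min = 1, max = 5))),
  endorsement = floor(runif(1000, min = 0, max = 2)),
  area = factor(rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"),
                    times = 100))
)

# Additive model: groups effect plus a separate intercept shift per area.
model_area <- glm(endorsement ~ groups + area,
                  data = df, family = binomial(logit))

# One row for every combination of group and area.
grid <- expand.grid(groups = levels(df$groups), area = levels(df$area))
grid$logit <- predict(model_area, newdata = grid, type = "link")

# Equal-weight average over areas for each group, on the logit scale.
equal_weight <- tapply(grid$logit, grid$groups, mean)
print(equal_weight)
print(plogis(equal_weight))  # back-transformed to probabilities

# With the lsmeans package from the question, the corresponding call is:
# lsmeans(model_area, pairwise ~ groups)
```

Because each area contributes one prediction per group to the average, area A counts no more than B or E in the group summaries, even though it supplies four times as many rows to the fit.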
