I have 3 groups in my dependent variable. The sample sizes of the 3 groups are very different: about 10,000 in the first group, 35,000 in the second and 100,000 in the third.

My question is: should we sample down groups 2 and 3 before modeling? Paul Allison talks about this topic briefly in his book on logistic regression using SAS, where he says that if you have two groups with different sample sizes, you can sample down one and then adjust the intercept of the regression equation. I want to know how this works in a multinomial situation. If anyone can show me what the equations would look like for a multinomial logit model with the intercepts adjusted, that would be great. I believe there are two equations, and I wonder how their intercepts would get adjusted. Any help is appreciated.

Thank you

buruzaemon
user16789

1 Answer

The only reason to worry about outcome proportions in a logit model is that the model replicates the sample shares exactly whenever there is a constant in the utility specification. Imagine a binary logit model with only a constant, $$P_i = \Lambda(\alpha) = \dfrac{e^\alpha}{1+e^\alpha}$$

The log-likelihood of this model is $$\ln(\mathcal{L}) = \sum_i y_i \ln(\Lambda(\alpha)) + (1-y_i)\ln(1-\Lambda(\alpha))$$

Taking the derivative of $\ln(\mathcal{L})$ with respect to $\alpha$ and setting it equal to 0 gives $$ 0 = \sum_i y_i(1-P_i) + (1-y_i)(-P_i) = \sum_i \left( y_i - P_iy_i - P_i + P_iy_i \right)$$ $$\sum_i y_i = \sum_i P_i$$ $$N_i = N \dfrac{e^\alpha}{1+e^\alpha}$$ $$ {N_i\over N }= \dfrac{e^\alpha}{1+e^\alpha} $$ where $N_i$ is the number of observations with outcome $i$ (here, $y = 1$) and $N$ is the total sample size.

What this expression says is that at the maximum likelihood point, the sample share ($N_i/N$) is exactly equal to the estimated probability. The same holds on average if the utility is $\alpha + X_i \beta$ (the mean of the predicted probabilities equals the sample share), and it carries over to multinomial outcomes with a constant for each alternative (rather than the binary case I showed here).
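The replication property can be checked numerically. The following is a minimal sketch, assuming a made-up sample with 30 ones out of 100 observations; the function name `fit_intercept_only_logit` is hypothetical and the fit uses plain Newton-Raphson steps on the first-order condition derived above.

```python
import math

def fit_intercept_only_logit(y, iters=25):
    """Newton-Raphson for a binary logit that contains only a constant."""
    alpha = 0.0
    n = len(y)
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-alpha))
        score = sum(y) - n * p          # first-order condition: sum(y_i) - sum(P_i)
        hessian = -n * p * (1.0 - p)    # second derivative of the log-likelihood
        alpha -= score / hessian
    return alpha

# Made-up data (an assumption for illustration): 30 ones, 70 zeros.
y = [1] * 30 + [0] * 70
alpha_hat = fit_intercept_only_logit(y)
p_hat = 1.0 / (1.0 + math.exp(-alpha_hat))

print(alpha_hat)             # estimated constant
print(p_hat, sum(y) / len(y))  # fitted probability vs. sample share
```

The fitted probability coincides with the sample share, which is why changing the sample shares by down-sampling moves the constant and nothing else in this simple model.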

The consequence of this property for researchers is that if the sample shares are unrepresentative of the population, then the estimates of the intercepts are biased. The other parameters are unbiased. But even if your sample is not representative (meaning you observe many more of outcome $i$ than you would expect), you can correct it post-estimation by adding the term below to the intercept for each outcome $i$ (when one outcome is the base category with its intercept fixed at 0, subtract the base outcome's term from each of the others): $$\ln\left({\text{Population Share}_i\over \text{Sample Share}_i}\right)$$

Or during estimation, by using the WESML estimator of Manski and Lerman, assuming that you know the population shares. If you don't know the population shares, things get more interesting but are still not hopeless (see Manski and McFadden). In fact, recent work I am aware of (but which has not been published) has shown you may even be able to estimate parameters when your sample share is zero. But I digress...

Advice to you: if your sample shares are close to the population shares, don't worry about it. If they're not, just correct the intercepts.
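For the three-group case in the question, the post-estimation correction might look like the sketch below. It rests on two assumptions that are not in the original posts: that the original counts (10,000 / 35,000 / 100,000) reflect the population shares, and that groups 2 and 3 were each sampled down to 10,000. The helper `correct_intercepts` is a hypothetical name, and group 3 is arbitrarily taken as the base category, so the model has two equations and two intercepts.

```python
import math

def correct_intercepts(intercepts, sample_shares, population_shares, base):
    """Shift multinomial-logit intercepts estimated on a re-sampled data set.

    intercepts: {outcome: estimated intercept} for the non-base outcomes.
    base: the reference outcome, whose intercept is fixed at 0.
    """
    offset = {j: math.log(population_shares[j] / sample_shares[j])
              for j in sample_shares}
    # Each intercept is relative to the base, so subtract the base offset.
    return {j: a + offset[j] - offset[base] for j, a in intercepts.items()}

# Assumption: the original counts reflect the population shares.
counts = {1: 10_000, 2: 35_000, 3: 100_000}
total = sum(counts.values())
population_shares = {j: n / total for j, n in counts.items()}

# Assumption: groups 2 and 3 were sampled down to 10,000 each.
sample_shares = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}

# Intercept-only model on the balanced sample: both intercepts are 0.
estimated = {1: 0.0, 2: 0.0}
corrected = correct_intercepts(estimated, sample_shares, population_shares, base=3)

# Probabilities implied by the corrected intercepts (base utility is 0).
denom = 1.0 + sum(math.exp(a) for a in corrected.values())
implied = {j: math.exp(a) / denom for j, a in corrected.items()}
implied[3] = 1.0 / denom

print(corrected)
print(implied)
```

In this intercept-only illustration the corrected intercepts reproduce the assumed population shares. With covariates in the model, the same shift is applied to the two intercepts and the slope coefficients are left as estimated.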