3

I was tasked with developing a regression model of student enrollment in different programs. This is a nice, clean data set in which the enrollment counts follow a Poisson distribution well. I fit models in R (using both a Poisson GLM and a zero-inflated Poisson), and the resulting residuals seemed reasonable.

However, I was then instructed to change the count of students to a "rate," calculated as students / school_population (each school has its own population). This is no longer a count variable but a proportion between 0 and 1, considered the "proportion of enrollment" in a program.

This "rate" (students/population) is no longer Poisson, but it is certainly not normal either. So I'm a bit lost as to the appropriate distribution, and the corresponding model, to represent it.

A log-normal distribution seems to fit this rate well; however, I have many zero values, so it won't actually fit.

Any suggestions on the best form of distribution for this new parameter, and how to model it in R?

Noah
  • 627

3 Answers

4

Most software that supports Poisson regression will also support an offset, and the resulting estimates become log(rate), or more accurately in this case log(proportion), if the offset is constructed properly:

 # The R form for estimating proportions
 propfit <- glm( DV ~ IVs + offset(log(class_size)), data=dat, family="poisson")

exp(coef(propfit)) should give the baseline proportion, exp(Intercept), followed by the factors by which to multiply that baseline estimate to arrive at the class-specific proportions.
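A self-contained sketch of this approach on simulated data (all variable names and numbers here are illustrative, not taken from the question):

```r
set.seed(1)
n <- 200
dat <- data.frame(
  class_size = sample(200:2000, n, replace = TRUE),
  x          = rnorm(n)
)
# True enrollment proportion depends on x; baseline proportion is exp(-3)
dat$students <- rpois(n, lambda = exp(-3 + 0.5 * dat$x) * dat$class_size)

# Poisson regression with log(class_size) as the offset models the proportion
propfit <- glm(students ~ x + offset(log(class_size)),
               data = dat, family = poisson)
exp(coef(propfit))  # intercept -> baseline proportion; x -> multiplicative effect

# With an offset, predict(..., type = "response") returns expected counts
# (the offset is included); divide by class_size to recover proportions
pred_prop <- predict(propfit, type = "response") / dat$class_size
```

Multiplying `pred_prop` back by `class_size` recovers the fitted counts, which is a useful sanity check.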

DWin
  • 7,726
  • that is the solution I've seen elsewhere. For some reason, that regression doesn't seem to fit as well when I include the offset. In fact, the regression fails for many of the IVs. (I'm iterating over different IVs to find the best ones in a form of stepwise regression.) – Noah Apr 17 '13 at 21:36
  • One of the issues is how to use/interpret the predict.glm() function. Without an offset, I can just use predict.glm(fit, type="response") and get the predicted values. With an offset, how does the prediction differ? The help page for predict.glm() isn't clear on this. – Noah Apr 17 '13 at 21:37
  • It will be giving you the predicted proportions and you should check this by multiplying the N by the predicted values. – DWin Apr 17 '13 at 21:42
2

If the rate is a count in a given category divided by a known total, you could do a binomial GLM; there you're in effect explicitly modelling a proportion.

glm( formula, family=binomial)

The default link is logit rather than log as for the Poisson, but for small proportions the difference is negligible.
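A minimal sketch of the two-column-response form in R, assuming `students` and `school_population` columns as described in the question (simulated here):

```r
set.seed(1)
n <- 200
dat <- data.frame(
  school_population = sample(200:2000, n, replace = TRUE),
  x                 = rnorm(n)
)
# True proportion specified on the logit scale; zero counts pose no
# problem for the binomial likelihood
dat$students <- rbinom(n, size = dat$school_population,
                       prob = plogis(-3 + 0.5 * dat$x))

# Two-column response: cbind(successes, failures)
binfit <- glm(cbind(students, school_population - students) ~ x,
              data = dat, family = binomial)
coef(binfit)                              # logit-scale estimates
head(predict(binfit, type = "response"))  # fitted proportions in (0, 1)
```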

If you're looking at many categories, there are a number of extensions that might be appropriate, such as multinomial logit choice models.

Also see http://en.wikipedia.org/wiki/Generalized_linear_model#Multinomial_regression

Glen_b
  • 282,281
  • A rate is never a count. – DWin Apr 17 '13 at 21:42
  • 1
    @DWin I have replaced the phrase "out of" by "divided by" in order to make the original intent clearer. A 'count divided by' something can be a rate. The original 'out of' was intended to convey the same meaning while emphasizing the connection between rates, and proportions-in-the-binomial (but was, I admit, unclear to the point of being potentially misleading). – Glen_b Apr 17 '13 at 23:09
  • A rate is a count divided by an exposure, which is in turn a sum of time*counts; a proportion is a count divided by a (larger) count of individuals at risk. A logit is the log of (a count divided by a "non-count"). – DWin Oct 20 '14 at 22:57
0

I don't see why the zeros alone would prevent you from transforming that DV toward log-normality. You can just add a constant to the variable and then log-transform it. Usual choices are 1, 10, or the minimum value of your variable, giving log(1+DV), log(10+DV), log(min(DV)+DV), or, in more general terms, log(a+DV). In R you can use the log1p() function instead of log() to apply log(1+DV) automatically.
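For illustration (a minimal sketch; whether adding a constant is appropriate for your data is a judgment call):

```r
dv <- c(0, 0, 0.01, 0.05, 0.20)  # a rate variable with zeros, as in the question

log(dv)       # the zeros become -Inf, so a plain log-normal fit fails
log1p(dv)     # log(1 + dv): finite everywhere, maps 0 to 0
log(10 + dv)  # the same idea with a different constant a
```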

If that doesn't improve the distribution of the residuals in your models, you can also try other transformations, such as DV^a (where a represents any number, usually 2, 3 or 0.5) or even exp(DV).

Also, another interesting option is to consider robust regression, which can handle badly behaved distributions.

For additional help, posting a densityplot() image of your residuals would be great.

R.Astur
  • 1,137
  • 1
  • 9
  • 14