3

I was tasked with developing a regression model of student enrollment in different programs. This is a nice, clean data set in which the enrollment counts follow a Poisson distribution well. I fit models in R (using both a Poisson GLM and a zero-inflated Poisson), and the resulting residuals seemed reasonable.

However, I was then instructed to change the count of students to a "rate," calculated as students / school_population (each school has its own population). This is no longer a count variable but a proportion between 0 and 1, considered the "proportion of enrollment" in a program.

This "rate" (students/population) is no longer Poisson, but it is certainly not normal either. So I'm a bit lost as to the appropriate distribution, and the corresponding model, to represent it.

A log-normal distribution seems to fit this rate well; however, I have many zero values, so it won't actually fit.

Any suggestions on the best form of distribution for this new parameter, and how to model it in R?

Noah
  • 627

3 Answers

4

Most software that supports Poisson regression will also support an offset, and the resulting estimates become log(rate), or more accurately in this case log(proportion), if the offset is constructed properly:

 # The R form for estimating proportions
 propfit <- glm( DV ~ IVs + offset(log(class_size)), data=dat, family="poisson")

exp(coef(propfit)) should give the baseline proportion, exp(Intercept), followed by the factors by which to multiply that baseline estimate to arrive at the class-specific proportions.
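A self-contained sketch of this approach on simulated data (all variable names and numbers here are illustrative, not taken from the question):

```r
set.seed(1)
n <- 200
dat <- data.frame(
  class_size = sample(200:2000, n, replace = TRUE),
  x          = rnorm(n)
)
# True enrollment proportion depends on x; baseline proportion is exp(-3)
dat$students <- rpois(n, lambda = exp(-3 + 0.5 * dat$x) * dat$class_size)

# Poisson regression with log(class_size) as the offset models the proportion
propfit <- glm(students ~ x + offset(log(class_size)),
               data = dat, family = poisson)
exp(coef(propfit))  # intercept -> baseline proportion; x -> multiplicative effect

# With an offset, predict(..., type = "response") returns expected counts
# (the offset is included); divide by class_size to recover proportions
pred_prop <- predict(propfit, type = "response") / dat$class_size
```

Multiplying `pred_prop` back by `class_size` recovers the fitted counts, which is a useful sanity check.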

DWin
  • 7,726
  • that is the solution I've seen elsewhere. For some reason, that regression doesn't seem to fit as well when I include the offset. In fact, the regression fails for many of the IVs. (I'm iterating over different IVs to find the best ones in a form of stepwise regression.) – Noah Apr 17 '13 at 21:36
  • One of the issues is how to use/interpret the predict.glm() function. Without an offset, I can just use predict.glm(fit, type="response") and get the predicted values. With an offset, how does the prediction differ? The help page for predict.glm() isn't clear on this. – Noah Apr 17 '13 at 21:37
  • It will be giving you the predicted proportions and you should check this by multiplying the N by the predicted values. – DWin Apr 17 '13 at 21:42
2

If the rate is a count in a given category divided by a known total, you could do a binomial GLM; there you're in effect explicitly modelling a proportion.

glm( formula, family=binomial)

The default link is logit rather than log as for the Poisson, but for small proportions the difference is negligible.
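A minimal sketch of the two-column-response form in R, assuming `students` and `school_population` columns as described in the question (simulated here):

```r
set.seed(1)
n <- 200
dat <- data.frame(
  school_population = sample(200:2000, n, replace = TRUE),
  x                 = rnorm(n)
)
# True proportion specified on the logit scale; zero counts pose no
# problem for the binomial likelihood
dat$students <- rbinom(n, size = dat$school_population,
                       prob = plogis(-3 + 0.5 * dat$x))

# Two-column response: cbind(successes, failures)
binfit <- glm(cbind(students, school_population - students) ~ x,
              data = dat, family = binomial)
coef(binfit)                              # logit-scale estimates
head(predict(binfit, type = "response"))  # fitted proportions in (0, 1)
```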

If you're looking at many categories, there are a number of extensions that might be appropriate, such as multinomial logit choice models.

Also see http://en.wikipedia.org/wiki/Generalized_linear_model#Multinomial_regression

Glen_b
  • 282,281
  • A rate is never a count. – DWin Apr 17 '13 at 21:42
  • 1
    @DWin I have replaced the phrase "out of" by "divided by" in order to make the original intent clearer. A 'count divided by' something can be a rate. The original 'out of' was intended to convey the same meaning while emphasizing the connection between rates, and proportions-in-the-binomial (but was, I admit, unclear to the point of being potentially misleading). – Glen_b Apr 17 '13 at 23:09
  • A rate is a count divided by an exposure, which is in turn a sum of time*counts; a proportion is a count divided by a (larger) count of individuals at risk. A logit is the log of (a count divided by a "non-count"). – DWin Oct 20 '14 at 22:57
0

I don't see why the zeros alone would prevent you from transforming that DV toward log-normality. You can just add a constant to the variable and then log-transform it. Usual choices are 1, 10, or the minimum value of your variable, giving log(1+DV), log(10+DV), log(min(DV)+DV), or, in more general terms, log(a+DV). In R you can use the log1p() function instead of log() to apply log(1+DV) automatically.
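For illustration (a minimal sketch; whether adding a constant is appropriate for your data is a judgment call):

```r
dv <- c(0, 0, 0.01, 0.05, 0.20)  # a rate variable with zeros, as in the question

log(dv)       # the zeros become -Inf, so a plain log-normal fit fails
log1p(dv)     # log(1 + dv): finite everywhere, maps 0 to 0
log(10 + dv)  # the same idea with a different constant a
```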

If that doesn't improve the distribution of the residuals in your models, you can also try other transformations, such as DV^a (where a represents any number, usually 2, 3 or 0.5) or even exp(DV).

Also, another interesting option is to consider robust regression, which can handle badly behaved distributions.

For additional help, posting a densityplot() image of your residuals would be great.

R.Astur
  • 1,137
  • 1
  • 9
  • 14