OLS regression with count data

Question

I have the following linear model in R:

model <- lm(response ~ v1*v2 + v3*v2 + v4, data=df)

v2 is number of hours spent sleeping

v4 is ordinal (7 pt likert) measuring subject rating of sleep quality

v1 and v3 are 3 level factors measuring time spent on different activities (0-10mins, 11-20, 20+)

response is a count variable ranging from 1-6. This measures number of correct items on a 6 item quiz

I'm wondering what criteria are used to decide whether Poisson regression should be used instead. I've considered the following:

I've read that in a Poisson model the mean and variance of the response should be equal, which is not true in this case (mean = 5, variance = 1.1, mode = 6).
The distribution is negatively skewed, which works in favour of using a Poisson model. What types of transformations are possible if I wanted to use OLS?
The range of the variable is 1-6. I believe one reason to use a Poisson model is the bounding at 0; however, I don't have any 0 values and the majority of the values are at the top of the range (6).
Does Poisson regression require a larger sample size than OLS to gain sufficient power? My N is ~120.
I've tried running ncvTest() to check for heteroscedasticity, and the test results are in favour of using OLS (no assumption violation).

Many say that Poisson regression should be used for count data no matter what, but OLS doesn't seem unreasonable given some of the points above. What should my primary considerations be, and how should I weigh the points outlined above? Is there anything that could be used to argue against the use of a Poisson model in this case (maybe sample size)?

EDIT:

To address the duplicate post concern: I don't believe the other post is asking the same thing (or at least the answer provided there doesn't really help in this case):

The other post is dealing with extensive variables, but in this case we have intensive
Given intensive variables, the other post suggests a linear model is OK but doesn't explain why
The response variable in the other post is unbounded at the upper end (i.e. number of patents). In this case, the response measures the number of correct items on an exam. Given that there is a maximum value (i.e. the value can't be greater than the number of items on the exam), the response here is bounded at both ends, with no respondents touching the lower bound of 0

So my question here is really asking about how to correctly handle positive integer (discrete) response values that are bounded at both ends

Is the response really a count, or is it, for example, a rating? — The Laconic, Feb 24 '17 at 03:19
Note that with OLS, there is no guarantee whatsoever that the expected value (mean conditional on explanatory variables) will be non-negative. That's another point in favor of Poisson that you missed. And it's not that important that the (conditional) variance be equal to the (conditional) mean, as long as you use robust estimators of the standard errors. — The Laconic, Feb 24 '17 at 03:23
response is number of correct responses, which I think would be considered a count. I'm not using it as a predictive model so it doesn't really matter that negative (and non integer) values may be predicted. I just want to identify that a theoretical relationship holds between the variables — Simon, Feb 24 '17 at 03:30
ctd. ... 3. Meanwhile, ordinary regression will have (among other likely issues) the problem that it will predict counts outside the range of possible values; — Glen_b, Feb 24 '17 at 06:47
Could you tell us more about the practical context? What is the response count variable counting? Why the upper limit at 6? What are the predictor variables representing/measuring? We need that to be able to advance on the question! Please add new information as an edit to the post! — kjetil b halvorsen, Feb 25 '17 at 12:22
With this extra information, I would try a logistic regression. At least as a starting point. Maybe with overdispersion correction. — kjetil b halvorsen, Feb 25 '17 at 17:50
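
The point in the comments about out-of-range predictions can be illustrated with a small sketch. The data frame `toy` below is hypothetical (it is not the asker's data), and the single predictor `hours` stands in for the full model. It compares an OLS fit with a binomial (logistic) fit on a score out of 6:

```r
# Hypothetical toy data, for illustration only: hours slept and quiz score out of 6.
toy <- data.frame(hours = 4:10, correct = c(3, 4, 5, 5, 6, 6, 6))

# OLS treats the score as an unbounded continuous response.
ols <- lm(correct ~ hours, data = toy)

# Binomial GLM treats the score as `correct` successes out of 6 trials.
logit <- glm(cbind(correct, 6 - correct) ~ hours, family = binomial, data = toy)

new_hours <- data.frame(hours = c(10, 12))

# OLS predictions are not constrained to the 0-6 range.
ols_pred <- predict(ols, newdata = new_hours)

# Predicted probabilities lie in (0, 1), so the expected score stays within 0-6.
logit_pred <- 6 * predict(logit, newdata = new_hours, type = "response")

print(ols_pred)
print(logit_pred)
```

On this toy data the OLS line already predicts more than 6 correct items at 10 hours, while the binomial fit cannot. Whether that matters depends on whether predictions are of interest at all, as discussed above.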

score 3 · Accepted Answer · answered Feb 24 '17 at 05:13

As I understand it, your empirical probability of a 0 count is 0, the mean is 5, and the theoretical probability of a count being greater than 6 is 0. A Poisson distribution can never have such properties.
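
As a rough check of that claim, here is a minimal sketch that takes the reported marginal mean of 5 at face value and ignores the predictors (an assumption made purely for illustration). It computes what a Poisson distribution with that mean implies, and compares the variance with a binomial on 6 items that has the same mean:

```r
# Assumption: marginal Poisson with the reported mean of 5, predictors ignored.
lambda <- 5
p_zero   <- dpois(0, lambda)       # implied probability of a score of 0
p_above  <- 1 - ppois(6, lambda)   # implied probability of a score above 6
pois_var <- lambda                 # a Poisson variance equals its mean

# Binomial alternative: 6 items, per-item success probability giving a mean of 5.
n_items   <- 6
p_item    <- 5 / n_items
binom_var <- n_items * p_item * (1 - p_item)

round(c(p_zero = p_zero, p_above = p_above,
        pois_var = pois_var, binom_var = binom_var), 3)
```

Under these assumptions the Poisson puts roughly 24% of its mass above 6 and implies a variance of 5, whereas the binomial implies a variance of about 0.83, much closer to the reported 1.1.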

While the ncvTest has not rejected the assumption of homoscedasticity, from your description the assumptions of OLS are also not met: your residuals are all going to be in the range of -1 to 5 (or -5 to 1), and this is not what a normal distribution looks like. Also, your data is discrete, so normality was a priori impossible anyway.

What to do? Some options:

Use OLS anyway, with either a log dependent variable or no transformation of the dependent variable. As long as your hypotheses are strongly confirmed or rejected, you may be OK. If you have line-ball results, it is more problematic.
Use a binary logit model comparing <=4 vs >=5, as then you at least have no distributional assumptions to worry about.
Try an ordered logit model. This is going to have more power than the binary logit, but its diagnostics need to be more carefully met as it is making stronger distributional assumptions.
Do all of the above, and, if the conclusions don't change, feel good.
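
A minimal sketch of the "do all of the above and compare" idea, using simulated data rather than the asker's. The data frame `sim`, the single predictor `sleep`, and the simulation parameters are all assumptions for illustration. It fits OLS, OLS on the log scale, and the binary logit for <=4 vs >=5. The ordered logit is left out here; `MASS::polr` is one common way to fit it.

```r
set.seed(1)
n <- 120

# Simulated stand-in for the real data: hours of sleep and a score out of 6.
sim <- data.frame(sleep = runif(n, 4, 10))
p_correct <- plogis(-1 + 0.4 * sim$sleep)
# pmax(..., 1) mimics the absence of zero scores, so log(score) is defined.
sim$score <- pmax(rbinom(n, size = 6, prob = p_correct), 1)

# Option 1: OLS with and without a log-transformed response.
ols     <- lm(score ~ sleep, data = sim)
ols_log <- lm(log(score) ~ sleep, data = sim)

# Option 2: binary logit comparing <=4 vs >=5.
sim$high  <- as.integer(sim$score >= 5)
bin_logit <- glm(high ~ sleep, family = binomial, data = sim)

# Collect the estimate and p-value for `sleep` from each fit.
sleep_row <- function(m) {
  tab <- coef(summary(m))
  c(estimate = unname(tab["sleep", 1]), p_value = unname(tab["sleep", 4]))
}
results <- sapply(list(ols = ols, ols_log = ols_log, bin_logit = bin_logit),
                  sleep_row)
print(results)
```

The coefficients are on different scales, so only the direction and the strength of evidence are comparable across the three fits.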

Tim

Just a few comments. OLS makes no distributional assumptions. Poisson also works in a quasi-likelihood way: like OLS, it works if you specify the conditional function correctly; you simply need to compute robust errors. – Repmat Feb 24 '17 at 05:32
The conditional function of his model clearly isn't consistent with either Poisson or OLS, both of which permit predictions to be above 6, so robust standard errors aren't any help to him. And, as regards OLS, @Simon's question was about inference from OLS, not parameter estimation, and the inference does make distributional assumptions. – Tim Feb 24 '17 at 06:05
In that case none of the things mentioned in your answer would solve the problem. They all rely on that assumption. Also, inference in OLS follows asymptotically under Gauss Markov, and so there is no need to invoke normality (which, as you say, is a completely inappropriate assumption) – Repmat Feb 24 '17 at 06:26
@Repmat. Gauss Markov assumes homoskedasticity. This is unlikely. I wouldn't be relying on asymptotics with obviously discrete data with n = 120. But, by all means, if you have a reference to back up your claim, please add it to the thread. – Tim Feb 24 '17 at 06:38
All of the methods you mention, except OLS (GM + normality), rely on asymptotics... There is no way to escape this. I feel we got off track somehow. I don't disagree with what you wrote, I only meant to refine it. – Repmat Feb 24 '17 at 07:27
Given the response is a ratio of counts (number of correct / number of available items), could logistic regression be used as suggested here: http://stats.stackexchange.com/questions/29038/regression-for-an-outcome-ratio-or-fraction-between-0-and-1 ? – Simon Feb 24 '17 at 19:58
That's a really nice idea. – Tim Feb 26 '17 at 21:24
@Repmat. Sorry. I have re-read your comments and can see I was being a bit sensitive. – Tim Feb 28 '17 at 02:08
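
The logistic regression idea from the comments (number correct out of 6 items, possibly with an overdispersion correction) could be sketched as follows. The data are simulated and the names `dat`, `sleep` and `correct` are hypothetical. In R's `glm`, a two-column `cbind(successes, failures)` response is the standard way to specify a binomial count out of a known total:

```r
set.seed(2)
n <- 120

# Simulated stand-in for the real data: hours of sleep and items correct out of 6.
dat <- data.frame(sleep = runif(n, 4, 10))
dat$correct <- rbinom(n, size = 6, prob = plogis(-1 + 0.4 * dat$sleep))

# Binomial GLM: `correct` successes and `6 - correct` failures per respondent.
fit_bin <- glm(cbind(correct, 6 - correct) ~ sleep,
               family = binomial, data = dat)

# Quasibinomial: same coefficients, standard errors scaled by an
# estimated dispersion parameter instead of assuming it equals 1.
fit_quasi <- glm(cbind(correct, 6 - correct) ~ sleep,
                 family = quasibinomial, data = dat)

dispersion <- summary(fit_quasi)$dispersion
se_bin     <- coef(summary(fit_bin))[, 2]
se_quasi   <- coef(summary(fit_quasi))[, 2]

print(dispersion)
print(cbind(se_bin, se_quasi))
```

A dispersion estimate well above 1 would suggest the items are not behaving like independent trials with a common success probability, in which case the quasibinomial standard errors are the more cautious choice.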
