I have the following linear model in R:

model <- lm(response ~ v1*v2 + v3*v2 + v4, data=df)

v2 is number of hours spent sleeping

v4 is ordinal (7 pt likert) measuring subject rating of sleep quality

v1 and v3 are 3 level factors measuring time spent on different activities (0-10mins, 11-20, 20+)

response is a count variable ranging from 1-6. This measures number of correct items on a 6 item quiz

I'm wondering what criteria are used to decide whether Poisson regression should be used instead. I've considered the following:

  • I've read that in a Poisson model the mean and variance of the response should be equal, which is not true in this case (mean = 5, variance = 1.1, mode = 6).

  • The distribution is negatively skewed, which works in favour of using a Poisson model. What types of transformations are possible if I wanted to use OLS?

  • The range of the variable is 1-6. I believe one reason to use a Poisson model is the bounding at 0; however, I don't have any 0 values and the majority of the values are at the top of the range (6).

  • Does Poisson regression require a larger sample size than OLS to gain sufficient power? My N is ~120.

  • I've tried running ncvTest() to check for heteroscedasticity and the test results are in favour of using OLS (no assumption violation)

Many say that Poisson regression should be used for count data no matter what, but OLS doesn't seem unreasonable given some of the points above. What should my primary considerations be, and how should I weigh the points outlined above? Is there anything that could be used to argue against the use of a Poisson model in this case (maybe sample size)?
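To make the mean/variance point concrete, here is a minimal sketch of how the two candidate models could be fitted side by side and the Poisson dispersion checked. It assumes simulated stand-in data, since the real df is not shown; the simulated values, the numeric coding of v4, and the helper name `dispersion` are all hypothetical.

```r
# Hypothetical stand-in for df (the real data are not shown in the question).
set.seed(1)
n <- 120
levels3 <- c("0-10", "11-20", "20+")
df <- data.frame(
  v1 = factor(sample(levels3, n, replace = TRUE)),
  v2 = runif(n, 4, 10),               # hours of sleep (assumed range)
  v3 = factor(sample(levels3, n, replace = TRUE)),
  v4 = sample(1:7, n, replace = TRUE) # 7-point rating, treated as numeric here
)
# Quiz score out of 6, floored at 1 to mimic "no zeros observed"
df$response <- pmax(1, rbinom(n, size = 6, prob = 0.85))

# Same right-hand side for both candidate models
ols  <- lm(response ~ v1*v2 + v3*v2 + v4, data = df)
pois <- glm(response ~ v1*v2 + v3*v2 + v4, family = poisson, data = df)

# Pearson dispersion statistic: roughly 1 under a Poisson model,
# well below 1 when the variance is much smaller than the mean
dispersion <- sum(residuals(pois, type = "pearson")^2) / df.residual(pois)
print(c(mean = mean(df$response),
        variance = var(df$response),
        dispersion = dispersion))
```

On the real data, a dispersion statistic far below 1 would echo the mean = 5, variance = 1.1 mismatch noted above.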

EDIT:

To address the duplicate post concern: I don't believe the other post is asking the same thing (or at least the answer provided there doesn't really help in this case):

  1. The other post is dealing with extensive variables, but in this case we have intensive

  2. Given intensive variables, the other post suggests a linear model is OK but doesn't explain why

  3. The response variable in the other post is unbounded at the upper end (i.e. number of patents). In this case, the response measures the number of correct items on an exam. Given there is a maximum value to that (i.e. the value can't be greater than the number of items on the exam), the response here is bounded at both ends, with no respondents touching the lower bound of 0

So my question here is really asking about how to correctly handle positive integer (discrete) response values that are bounded at both ends

Simon
  • Is the response really a count, or is it, for example, a rating? – The Laconic Feb 24 '17 at 03:19
  • Note that with OLS, there is no guarantee whatsoever that the expected value (mean conditional on explanatory variables) will be non-negative. That's another point in favor of Poisson that you missed. And it's not that important that the (conditional) variance be equal to the (conditional) mean, as long as you use robust estimators of the standard errors. – The Laconic Feb 24 '17 at 03:23
  • response is number of correct responses, which I think would be considered a count. I'm not using it as a predictive model so it doesn't really matter that negative (and non integer) values may be predicted. I just want to identify that a theoretical relationship holds between the variables – Simon Feb 24 '17 at 03:30
  • 1. If every response is equally likely to be correct within a single count and they're independent, the number of correct responses would be Binomial, not Poisson. If the probability varies, it'd be some form of Poisson-binomial, not Poisson. 2. You say "distribution is negatively skewed, which works in favour of using a Poisson model": two things are wrong with that. a. A Poisson is right skew, not left skew. b. You're looking at the marginal distribution of the response but the model is for the conditional distribution (nonetheless we can almost certainly rule out the Poisson). ... ctd – Glen_b Feb 24 '17 at 06:42
  • ctd. ... 3. Meanwhile, ordinary regression will have (among other likely issues) the problem that it will predict counts outside the range of possible values; – Glen_b Feb 24 '17 at 06:47
  • Could you tell us more about the practical context? What is the response count variable counting? Why the upper limit at 6? What are the predictor variables representing/measuring? We need that to be able to advance on the question! Please add new information as an edit to the post! – kjetil b halvorsen Feb 25 '17 at 12:22
  • Added some more info about the variables to the question – Simon Feb 25 '17 at 17:44
  • With this extra information, I would try a logistic regression. At least as a starting point. Maybe with overdispersion correction. – kjetil b halvorsen Feb 25 '17 at 17:50
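As a rough sketch of what the logistic (binomial) suggestion in the comments might look like in R: the response is treated as the number of successes out of 6 trials, and `quasibinomial` is used as one possible overdispersion correction. The data below are simulated stand-ins (the real df is not shown), and the numeric coding of v4 is an assumption.

```r
# Hypothetical stand-in for df (the real data are not shown in the question).
set.seed(1)
n <- 120
levels3 <- c("0-10", "11-20", "20+")
df <- data.frame(
  v1 = factor(sample(levels3, n, replace = TRUE)),
  v2 = runif(n, 4, 10),               # hours of sleep (assumed range)
  v3 = factor(sample(levels3, n, replace = TRUE)),
  v4 = sample(1:7, n, replace = TRUE) # 7-point rating, treated as numeric here
)
df$response <- pmax(1, rbinom(n, size = 6, prob = 0.85))

# Binomial GLM: cbind(successes, failures) with 6 items per quiz.
# quasibinomial estimates a dispersion parameter instead of fixing it at 1.
fit <- glm(cbind(response, 6 - response) ~ v1*v2 + v3*v2 + v4,
           family = quasibinomial, data = df)

# Fitted values are probabilities of a correct item, so the implied
# expected scores (6 * probability) cannot leave the 0-6 range.
expected_score <- 6 * fitted(fit)
print(summary(fit)$dispersion)
print(range(expected_score))
```

Unlike OLS or a Poisson model, this formulation respects both bounds of the response by construction, which is the concern raised in the comments about predictions outside the possible range.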