What distribution should I use to model bounded count data (that also feels a bit like a proportion)?

Question

I am trying to determine the correct model for my data. I want to model the effect of two categorical independent variables (and their interaction) on my outcome variable. I am using SAS (proc genmod, probably).

The outcome variable ranges from 0 to 6 and is the count of items with a particular feature in a sample of 6. I do not think a Poisson model is appropriate because the count is bounded.

Consider, for example, an individual selecting a sample of 6 pieces of fruit from a basket of apples and oranges, with a goal to select a representative sample. I want to model the number of apples selected. I'm not sure if the goal is relevant for the model determination, but it seems like it should be relevant: there is a normative response of 2.4 apples (obviously, the sample is discrete, but this is what I'd expect on average), based on the proportion of apples in the basket (2/5).

What distribution should I use to model this outcome? I have ruled out normal (failed Shapiro-Wilk test for normality) and I think I should also rule out Poisson (because of the bound between 0 and 6).

score 0 · Answer 1 · answered Oct 03 '22 at 18:29


Sounds like a job for the binomial distribution, with 6 trials and a “success” probability that denotes the probability of selecting the feature of interest.
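As a rough illustration of what this model implies, the sketch below tabulates a Binomial(6, p) distribution using only the Python standard library. The value p = 2/5 is taken from the question's fruit-basket example, and the names `binomial_pmf`, `N_TRIALS` and `P_APPLE` are made up for this illustration. It assumes the six selections are independent with a common success probability, which is exactly the assumption the binomial model makes.

```python
from math import comb

N_TRIALS = 6      # sample size from the question
P_APPLE = 2 / 5   # proportion of apples in the basket (from the question's example)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of each possible outcome 0..6
pmf = [binomial_pmf(k, N_TRIALS, P_APPLE) for k in range(N_TRIALS + 1)]

# Mean and variance computed directly from the probability table
mean = sum(k * prob for k, prob in enumerate(pmf))
variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(pmf))

for k, prob in enumerate(pmf):
    print(f"P(apples = {k}) = {prob:.4f}")
print(f"mean = {mean:.2f}, variance = {variance:.2f}")
```

The mean works out to 6 × 0.4 = 2.4, matching the normative response in the question, and the variance to 6 × 0.4 × 0.6 = 1.44. In PROC GENMOD, a response of this kind is typically specified with the events/trials form of the MODEL statement together with DIST=BINOMIAL.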


fm361


utobi · Answer 2 · 2022-10-03T18:37:16.183


If your counts are bounded between 0 and 6, you may consider the Binomial(6, $\theta$) as a possible model. This, however, doesn't help if your ‘counts’ are not integers.

If you have many non-integer values, one option is to divide everything by 6, which, if I understand correctly, is the maximum observable value. This gives values between 0 and 1, so you could apply a beta model. Comment below if you need more details.
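A minimal sketch of the rescaling step, assuming the maximum observable value is 6. The `counts` list is hypothetical data invented for this illustration, not taken from the question. The sketch also flags values that land exactly on 0 or 1: the beta density is defined on the open interval (0, 1), so such values would need separate handling before a beta model could be fitted.

```python
MAX_COUNT = 6  # maximum observable value, as stated in the question

# Hypothetical outcomes, including non-integer values (illustration only)
counts = [0, 1.5, 2, 3, 4.5, 6]

# Rescale to the unit interval
proportions = [c / MAX_COUNT for c in counts]

# The beta distribution has support on the open interval (0, 1),
# so exact 0s and 1s need separate treatment before fitting.
on_boundary = [p for p in proportions if p in (0.0, 1.0)]

print("proportions:", proportions)
print("values on the boundary:", on_boundary)
```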

edited Oct 03 '22 at 18:37

answered Oct 03 '22 at 18:32

utobi


Thank you! They are all integer values. However, my understanding is that for the binomial model all trials are independent. I don't think this is the case, since the sample is selected with the goal of being representative. (If you pull two apples out of the basket, you may be less likely to pick another apple than if you first pulled two oranges.) I don't have information about the order in which the sample was selected, only the final sample (N apples out of 6). Could you help me understand whether the binomial distribution would still be appropriate given this clarifying information? – EmMa Oct 03 '22 at 18:45
What do you mean by a representative sample? Could you explain exactly how the sampling is done? Are you sampling apples from a basket of 6 apples, or something else? – utobi Oct 03 '22 at 18:56
There are humans selecting 6 fruits from the basket to later test for quality. They want to select a "representative sample" (includes both apples and oranges, representing the contents of the basket) so that both fruits get tested for quality, but of course they are humans so they don't always do so. I am interested in modeling the factors that bias their sample selection. – EmMa Oct 03 '22 at 19:03
How large is the basket? Does it contain only apples and oranges, or something else as well? Why is the selection biased, and biased towards what? – utobi Oct 03 '22 at 19:54
Also, are you trying to estimate nr. apples / nr. oranges in the population? – utobi Oct 03 '22 at 19:56
Thank you! No population estimate: I am interested in the (bias in the) selection of 6. In this study, I only have two different items in the basket. – EmMa Oct 05 '22 at 14:11
