I'm creating a simulation model that includes some stochastic factors. One of them is the number of containers arriving daily at a specific delivery location. A plot of this data is shown below.

When I fit distributions to this data, the best fit seems to be normal or lognormal. But since these are continuous distributions, I would need to round the sampled values to match the real situation (you cannot have 0.74 of a container arriving). So my question is: is it acceptable to fit continuous distributions to this data and round the results? (I cannot find anything on the internet about it.)

If not, which common discrete probability distribution would fit this data? All the examples I find are about the number of successes, and I cannot find anything about how to fit this kind of data to, for example, a binomial distribution. Or would you advise creating an empirical distribution based on this data?

Thanks in advance and sorry if this is a stupid question, but I'm new to probability distributions!

[Image: plot of the daily container arrival data]

Aron T.
  • 3
    re the title: ordinarily one fits a distribution model to the data. It's usually a bad idea to fit (i.e., modify) the data to make them conform to a distribution! As far as the rounding issue goes, that's a legitimate concern, but it's readily handled using interval techniques. See https://stats.stackexchange.com/questions/56015 for instance; and visit https://stats.stackexchange.com/questions/248476 for theoretical justification and more discussion. – whuber May 18 '22 at 13:48

1 Answer

I recommend against using a continuous distribution to approximate a discrete distribution. Often it will work well enough, especially in data like yours where the counts are far from zero, but the further into the tails you get the less accurate it will be, so you have to be vigilant to make sure that you are not pushing the approximation too far. The appropriate discrete distributions are not any harder to use, so it's better just to use them.

The Poisson Distribution

For count data where the rate is constant, the distribution you want is the Poisson Distribution. You can think of the Poisson distribution (and in fact derive its mathematical form) as a limiting case of the binomial distribution with $N \rightarrow \infty$ and $p \rightarrow 0$, such that the expected number of successes $Np$ is constant. The Poisson distribution has support for counts $\ge 0$, which is important because for low rates you will sometimes see zero counts, but of course you will never see a negative number of counts. A normal distribution cannot be restricted to be positive, while a lognormal distribution cannot produce zero counts, so neither one of these is a good choice for low average event rates.
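As a rough illustration of the Poisson approach for a simulation, the sketch below estimates the rate from data and then draws integer counts. The `daily_counts` values are made up for the example (they are not the asker's data), and `poisson_pmf` and `sample_poisson` are hypothetical helper names written with only the Python standard library.

```python
import math
import random

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson(lam) variable, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count by inverting the CDF.

    Adequate for moderate rates; exp(-lam) underflows for lam above ~700.
    """
    u = rng.random()
    k = 0
    p = math.exp(-lam)          # P(X = 0)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k            # P(X = k) = P(X = k - 1) * lam / k
        cdf += p
    return k

# Hypothetical daily container counts (illustrative only).
daily_counts = [139, 152, 147, 160, 131, 155, 149, 143, 158, 146]

# The maximum-likelihood estimate of the Poisson rate is the sample mean.
lam_hat = sum(daily_counts) / len(daily_counts)

rng = random.Random(42)         # seeded so the simulation is reproducible
simulated_week = [sample_poisson(lam_hat, rng) for _ in range(7)]
print(lam_hat, simulated_week)
```

Every simulated value is already a non-negative integer, so no rounding step is needed. In practice a statistics library's Poisson sampler would likely be a better choice than this hand-rolled inversion.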

The Gamma-Poisson (aka Negative Binomial) Distribution

Because the Poisson distribution has only one parameter, its variance and mean are not independently selectable. In fact, if the parameter in the distribution is denoted by $\lambda$, then the variance and mean are both equal to $\lambda$. This may not be appropriate if your data shows evidence of having a variance that is larger than the mean, a feature called "overdispersion". In this case, you want the Gamma-Poisson Distribution. The Gamma-Poisson distribution is a mixture of Poisson distributions, where the parameter $\lambda$, instead of being constant like in the regular Poisson distribution, has a gamma distribution. This has the effect of smearing out the distribution a bit, so as to increase its variance. The width of the gamma distribution controls the amount of smearing, allowing you to choose the amount of overdispersion that's appropriate for your data.

As it happens, the Gamma-Poisson distribution is equivalent to the Negative Binomial Distribution, and that's the name you will find it under in most stats software packages. The parameterization for the negative binomial is a little confusing for this application, but you can sidestep that by using the mean and variance to choose the parameters.
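Choosing the parameters from the mean and variance could look roughly like this. The sketch assumes the common $(r, p)$ convention, where the mean is $r(1-p)/p$ and the variance is $r(1-p)/p^2$ (your software may use a different one, so check its documentation). The function names and the example mean and variance are made up for illustration.

```python
import math
import random

def nbinom_params(mean, var):
    """Convert (mean, variance) to negative binomial (r, p); needs var > mean."""
    if var <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / var
    r = mean * mean / (var - mean)
    return r, p

def nbinom_pmf(k, r, p):
    """P(X = k) = Gamma(k + r) / (k! Gamma(r)) * p**r * (1 - p)**k."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count by inverting the CDF (moderate lam only)."""
    u = rng.random()
    k = 0
    prob = math.exp(-lam)
    cdf = prob
    while u > cdf and prob > 0.0:
        k += 1
        prob *= lam / k
        cdf += prob
    return k

def sample_gamma_poisson(mean, var, rng):
    """Draw a count by first drawing the Poisson rate from a gamma distribution."""
    r, p = nbinom_params(mean, var)
    lam = rng.gammavariate(r, (1.0 - p) / p)   # shape r, scale (1 - p) / p
    return sample_poisson(lam, rng)

# Hypothetical overdispersed target: mean 100, variance 400.
r, p = nbinom_params(100.0, 400.0)
rng = random.Random(7)
draws = [sample_gamma_poisson(100.0, 400.0, rng) for _ in range(4000)]
print(r, p, sum(draws) / len(draws))
```

Sampling through the gamma-distributed rate makes the mixture interpretation explicit: each draw is an ordinary Poisson count whose rate varies from draw to draw.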

The Conway-Maxwell-Poisson Distribution

The Conway-Maxwell-Poisson (CMP) Distribution generalizes the Poisson Distribution by adding a second parameter that accounts for either overdispersion or underdispersion. Compared to the Gamma-Poisson, this distribution has the advantage of being able to handle underdispersed as well as overdispersed data (the Gamma-Poisson can only handle overdispersed data). However, this flexibility comes at the cost of being a little more obscure and less well supported than the distributions above. The COMPoissonReg R package implements the CMP distribution, as well as CMP regressions using generalized linear models (GLMs).
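To see how the extra parameter changes the dispersion, here is a minimal sketch of the CMP probability mass function, $P(X = k) \propto \lambda^k / (k!)^\nu$, written in plain Python rather than with COMPoissonReg. The normalizing constant is approximated by truncating the sum at a hypothetical `k_max`, which is an assumption that only holds when `k_max` lies far out in the upper tail; the parameter values are arbitrary.

```python
import math

def cmp_pmf(lam, nu, k_max=1000):
    """Return [P(X = 0), ..., P(X = k_max)] for a CMP(lam, nu) distribution.

    P(X = k) is proportional to lam**k / (k!)**nu. The normalizing sum is
    truncated at k_max, so k_max must sit far in the upper tail.
    """
    log_w = [k * math.log(lam) - nu * math.lgamma(k + 1) for k in range(k_max + 1)]
    shift = max(log_w)                      # log-sum-exp trick for stability
    w = [math.exp(x - shift) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

def mean_and_variance(pmf):
    """Mean and variance of a distribution given as a list of probabilities."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var

print(mean_and_variance(cmp_pmf(10.0, 1.0)))    # nu = 1: ordinary Poisson
print(mean_and_variance(cmp_pmf(100.0, 2.0)))   # nu > 1: underdispersed
print(mean_and_variance(cmp_pmf(3.0, 0.5)))     # nu < 1: overdispersed
```

With $\nu = 1$ the mean and variance coincide, as for the Poisson; $\nu > 1$ gives a variance below the mean and $\nu < 1$ a variance above it.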

Another way of handling underdispersion

Often underdispersion results from processes that have a component that isn't random. For example, you might have a customer whose order comes in reliably every week for the same amount of product, and added to this you might have irregular orders that display more variability. In this case, you can model the system as a constant plus a Poisson-distributed component. The constant contributes to the mean but not to the variance, so the Poisson rate is set equal to the observed variance and the constant makes up the rest of the mean. For example, if you average 100 counts per week, but the variance is only 64, then you could model this as a flat 36 counts, plus a Poisson distribution with $\lambda = 64$.

As noted in the comments, however, the variance of a small sample is highly uncertain, so before introducing a correction for underdispersion, you should consider whether you have reason to expect this sort of behavior.
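The constant-plus-Poisson example above (mean 100, variance 64) can be sketched as follows. The helper names are hypothetical, and the sketch assumes the offset comes out as a whole number; for count data you would otherwise need to round it, which slightly changes the mean.

```python
import math
import random

def constant_plus_poisson_params(mean, var):
    """Split a target mean into a fixed offset and a Poisson rate.

    The constant adds to the mean but not to the variance, so the Poisson
    rate equals the variance and the offset is the remainder of the mean.
    """
    if var > mean:
        raise ValueError("only applies when variance <= mean")
    return mean - var, var      # (offset, lam)

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count by inverting the CDF (moderate lam only)."""
    u = rng.random()
    k = 0
    prob = math.exp(-lam)
    cdf = prob
    while u > cdf and prob > 0.0:
        k += 1
        prob *= lam / k
        cdf += prob
    return k

offset, lam = constant_plus_poisson_params(100, 64)   # flat 36 plus Poisson(64)
rng = random.Random(1)
draws = [offset + sample_poisson(lam, rng) for _ in range(10000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
print(offset, lam, sample_mean, sample_var)
```

One consequence of this model is that it can never produce a count below the offset, so it is worth checking that such a hard lower bound is plausible for the process being simulated.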

Nobody
  • 2
    When an underlying theory indicates there is a specific family of continuous distributions governing the values and the observations really are binned, rounded, or otherwise reported in intervals, then arguably one should use that continuous family. Indeed, in such cases often the parameters of that family are the quantities of interest, so replacing it by some discrete family doesn't accomplish the intended aims of the analysis. Moreover, what would you do if the quantities in the question were negative? – whuber May 18 '22 at 15:06
  • @whuber What you say is true, but not applicable to this question. The question clearly states that this is count data, which is intrinsically positive and discrete. – Nobody May 18 '22 at 15:24
  • Thank you -- I missed that point in reading the question. It perfectly justifies your approach (+1). – whuber May 18 '22 at 15:39
  • @NoBody, in my data the variance is smaller than the mean, so the negative binomial is not an option. Is there another parameterization for the binomial (with mean and variance)? Or which discrete distribution would you advise otherwise? – Aron T. May 18 '22 at 17:54
  • 2
    @AronT. Ah, you have data that is underdispersed, not overdispersed. I'll add a paragraph about that. – Nobody May 18 '22 at 17:55
  • 2
    You have a small dataset, which means randomness can explain some pretty large discrepancies. It doesn't look underdispersed to me. In a Poisson distribution the SD would be around $12$ and that's close to the SD one would estimate from the histogram. – whuber May 18 '22 at 18:10
  • 2
    I added some paragraphs on underdispersed models, but @whuber raises a good point. – Nobody May 18 '22 at 18:23
  • @Whuber & Nobody Thank you both. Everything is a lot clearer to me now. Is there any literature on a rate that is partly constant and partly random (Poisson), as explained in the last paragraph? I'm curious what this phenomenon is called. – Aron T. May 19 '22 at 07:46
  • 1
    @AronT. I'm not sure if there is a name for this particular model, but it is an example of a "mixture model". – Nobody May 19 '22 at 12:54