The data:
For the purposes of this question, assume the data look like rnbinom(1000, size=0.1, prob=0.01) in R, i.e. a random sample of 1,000 observations from a negative binomial distribution with size=0.1 and success probability prob=0.01. In this parametrization the random variable counts the number of failures before size successes occur. The tail is long, and 1,000 observations is not much data.
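To make "long tail" concrete, here is a minimal sketch that simulates such a sample and compares it with the theoretical moments of this parametrization. The seed is arbitrary and the simulated vector only stands in for the real data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
size <- 0.1
prob <- 0.01
x <- rnbinom(1000, size = size, prob = prob)

# Theoretical moments for the "failures before size successes" parametrization
theo_mean <- size * (1 - prob) / prob  # 9.9
theo_var  <- theo_mean / prob          # 990

c(sample_mean = mean(x), theo_mean = theo_mean)
c(sample_var  = var(x),  theo_var  = theo_var)

# Share of zeros: observed vs. theoretical P(X = 0) = prob^size
c(observed = mean(x == 0), theoretical = dnbinom(0, size, prob))

# Upper quantiles show how far the tail stretches relative to the median
quantile(x, c(0.5, 0.9, 0.99))
```

The variance is 100 times the mean here, so sample moments from 1,000 draws can vary a lot from one run to the next.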
The problem: I have been given some data (integers on {1,2,...}; see above; 1,500 data points) and asked to find the "best fit" distribution and estimates of any parameters. I know nothing else about the data. I'm aware this is not a very large sample for data with a long tail. More data is a possibility.
What I've done: I considered a likelihood ratio test based on fitting two different distributions to the data, but I don't think it applies (that is, I cannot determine appropriate critical values) unless the two distributions are nested.
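One comparison that does not rely on nesting is an information criterion such as AIC. The sketch below is a swapped-in technique, not something I have applied to the real data: it fits three candidate distributions (my own illustrative choice) to a simulated stand-in sample by maximum likelihood and ranks them by AIC. The helper name nb_nll is made up for this example.

```r
set.seed(1)
x <- rnbinom(1000, size = 0.1, prob = 0.01)  # stand-in for the real data

# Negative binomial: numerical MLE, optimising on the log scale so that
# size and mu stay positive
nb_nll <- function(par) {
  -sum(dnbinom(x, size = exp(par[1]), mu = exp(par[2]), log = TRUE))
}
m <- mean(x)
v <- var(x)
start  <- log(c(m^2 / (v - m), m))  # method-of-moments start (assumes v > m)
nb_fit <- optim(start, nb_nll)
nb_est <- exp(nb_fit$par)           # c(size, mu)

# Maximised log-likelihoods; geometric and Poisson have closed-form MLEs
ll <- c(
  negbin    = -nb_fit$value,
  geometric = sum(dgeom(x, prob = 1 / (1 + m), log = TRUE)),
  poisson   = sum(dpois(x, lambda = m, log = TRUE))
)
k   <- c(negbin = 2, geometric = 1, poisson = 1)  # number of fitted parameters
aic <- 2 * k - 2 * ll
sort(aic)  # smaller is better
```

AIC only ranks the candidates against each other; it does not say whether the best of them fits well in absolute terms.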
I then considered using a Kolmogorov-Smirnov test (adjusted for discrete data) but, in R anyway, it complained it could not compute a p-value for "data with ties".
What is the best way for me to go about testing/determining the fit of different distributions in this context? Here are some other things I have considered:
- Ask for (lots) more data. But will this help? Will I be able to use asymptotic results, for instance?
- Consider some bootstrap/resampling/Monte Carlo scheme? If so, is there a standard reference I can/should read to learn how to do this correctly? Thanks.
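To make the last option concrete, here is a minimal sketch of a parametric bootstrap goodness-of-fit check, assuming a negative binomial candidate and a KS-type distance as the test statistic. The function names fit_nb and ks_stat, the seed, and the number of replicates B are choices made for this illustration only. Because the null distribution of the statistic is simulated, ties in discrete data are not a problem.

```r
set.seed(1)
x <- rnbinom(1000, size = 0.1, prob = 0.01)  # stand-in for the real data

# Fit a negative binomial by maximum likelihood
# (log scale keeps both parameters positive); returns c(size, mu)
fit_nb <- function(y) {
  nll <- function(par) {
    -sum(dnbinom(y, size = exp(par[1]), mu = exp(par[2]), log = TRUE))
  }
  m <- mean(y)
  v <- var(y)
  start <- log(c(if (v > m) m^2 / (v - m) else 10, m))
  exp(optim(start, nll)$par)
}

# KS-type distance: largest gap between the empirical and fitted CDFs
# over the integers 0, 1, ..., max(y)
ks_stat <- function(y, par) {
  grid <- 0:max(y)
  max(abs(ecdf(y)(grid) - pnbinom(grid, size = par[1], mu = par[2])))
}

est   <- fit_nb(x)
d_obs <- ks_stat(x, est)

B <- 200  # number of bootstrap replicates; kept small so the sketch runs quickly
d_boot <- replicate(B, {
  y <- rnbinom(length(x), size = est[1], mu = est[2])  # simulate from the fitted model
  ks_stat(y, fit_nb(y))                                # refit, then recompute the distance
})

# Share of simulated distances at least as large as the observed one
p_value <- (1 + sum(d_boot >= d_obs)) / (B + 1)
p_value
```

Refitting the parameters inside every replicate matters: it accounts for the fact that the parameters were estimated from the same data being tested. The same loop works for any other candidate distribution by swapping the fit, CDF, and simulation functions.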