I am trying to figure out which distribution best fits my data, and I'm not sure whether what I am doing is statistically correct. My data consist of 20 samples per year over 10 years (200 samples in total). For each sample I have run a distribution-fitting routine (fitdistr() in R) to get the estimated parameters for each candidate distribution. I am testing the gamma, chi-squared, Weibull, and lognormal distributions.
My next step was to run a Kolmogorov–Smirnov test on the sample data, with the parameters set to the values estimated from that same data. I was going to find which distribution was the overall 'best' (highest average p-value across all 200 samples) and say that this was the distribution my data followed. However, I have read that using the KS test in this way is incorrect: when the parameters are estimated from the same data being tested, the resulting p-values are unreliable (biased upward).
I'm not sure whether I can use the KS test in this way, or whether I should just use maximum-likelihood estimation and compare the fits instead.
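For concreteness, the maximum-likelihood route (fit each candidate by MLE, then compare the fits by AIC rather than by KS p-values) can be sketched as follows. This is a Python equivalent of the fitdistr()-then-AIC workflow, using scipy.stats; the simulated data and the gamma/Weibull/lognormal candidate set are illustrative stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=200)  # stand-in for the 200 observations

# Candidate families to compare (chi-squared omitted here for brevity).
candidates = {
    "gamma": stats.gamma,
    "weibull": stats.weibull_min,
    "lognormal": stats.lognorm,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(data, floc=0)              # MLE fit, location fixed at 0
    loglik = np.sum(dist.logpdf(data, *params))  # log-likelihood at the MLE
    k = len(params) - 1                          # free parameters (loc was fixed)
    results[name] = 2 * k - 2 * loglik           # AIC = 2k - 2*logL

best = min(results, key=results.get)             # lowest AIC wins
print(best, results)
```

Note that with AIC the best model is the one with the *lowest* value, and because all candidates here have the same number of free parameters, this reduces to comparing log-likelihoods directly.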
fitdist() on the sample, then run an AIC on the output of the fitdistr(). So I'm not sure why I'd need to bootstrap the KS test if I am no longer using it? – D'Arcy Mulder Apr 10 '13 at 19:01
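For reference, the bootstrap correction alluded to in the comment (a parametric bootstrap of the KS statistic, refitting on each resample, in the spirit of a Lilliefors-type test) could be sketched as below. The gamma family and simulated data are illustrative assumptions, not the asker's actual data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(2.0, 3.0, size=200)  # illustrative stand-in data

# Fit once on the observed data and record the KS statistic
# against the fitted (not pre-specified) distribution.
shape, loc, scale = stats.gamma.fit(data, floc=0)
d_obs = stats.kstest(data, "gamma", args=(shape, loc, scale)).statistic

# Parametric bootstrap: simulate from the fitted model, refit on each
# resample, and recompute the KS statistic against the refitted model,
# so the null distribution accounts for the parameter estimation.
n_boot = 500
d_boot = np.empty(n_boot)
for i in range(n_boot):
    sim = stats.gamma.rvs(shape, loc=loc, scale=scale,
                          size=len(data), random_state=rng)
    s, l, sc = stats.gamma.fit(sim, floc=0)
    d_boot[i] = stats.kstest(sim, "gamma", args=(s, l, sc)).statistic

# Bootstrap p-value: fraction of simulated statistics at least as extreme.
p_boot = np.mean(d_boot >= d_obs)
print(p_boot)
```

The point of the refit inside the loop is that the naive KS p-value is invalid precisely because the parameters were estimated from the tested data; the bootstrap reproduces that estimation step under the null.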