5

I have data with the following empirical distribution. What would be a good model for it?

Edit: here is a closer view:

[Image: closer view of the histogram]

static_rtti
  • 1
    There is some good general advice at http://stats.stackexchange.com/questions/10517/identify-probability-distributions. – whuber Aug 01 '11 at 20:44
  • @whuber: what I really wanted here was to get names of possible matches to read up on them and see if any of them make sense in my application. – static_rtti Aug 01 '11 at 21:53
  • 2
    Usually one does the opposite: the application suggests likely distributions. Not only that, it also determines how to check whether a distribution is a reasonable fit. For example, in some cases it's important to get a good fit to the right tail (the largest values), whereas in others it's important to get a fit that doesn't deviate too much at any percentage point. Thus, some amplification on your part concerning the application would be helpful. – whuber Aug 01 '11 at 21:57
  • Any idea what the curve would look like if continued in the negative direction? (It obviously looks exponential to the right.) – Daniel R Hicks Aug 02 '11 at 12:07

1 Answer

2

You have strictly positive data that is clearly skewed, so you need a distribution that allows for skew. The Gamma distribution, which has density

$$ p(x) = x^{k-1} \frac{e^{-x/\theta}}{\theta^k \, \Gamma(k)}\text{ for } x \geq 0\text{ and }k, \theta > 0 $$

is probably the default choice in a situation like this. There are other choices for skewed data (e.g. skew-normal, log-normal, skew-logistic, Weibull), but the gamma is more commonly used and is directly related to some of the other choices (skew-logistic, Weibull).
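As a rough illustration of the density above, here is a minimal Python sketch that evaluates it, computes its mode, and estimates $k$ and $\theta$ by the method of moments. The function names (`gamma_pdf`, `gamma_mode`, `fit_gamma_moments`) are made up for this example, and the sample is simulated stand-in data, since the data set from the question is not available. Method of moments is only one possible fitting approach, chosen here because it needs nothing beyond the standard library.

```python
import math
import random
import statistics


def gamma_pdf(x, k, theta):
    """Gamma density with shape k and scale theta.

    Returns 0.0 for x <= 0 (the boundary x = 0 is ignored in this sketch).
    """
    if x <= 0:
        return 0.0
    log_p = (k - 1) * math.log(x) - x / theta - k * math.log(theta) - math.lgamma(k)
    return math.exp(log_p)


def gamma_mode(k, theta):
    """Mode of the gamma density: (k - 1) * theta when k >= 1, otherwise 0."""
    return (k - 1) * theta if k >= 1 else 0.0


def fit_gamma_moments(sample):
    """Method-of-moments estimates, using mean = k * theta and variance = k * theta**2."""
    mean = statistics.fmean(sample)
    var = statistics.variance(sample)
    theta_hat = var / mean
    k_hat = mean / theta_hat
    return k_hat, theta_hat


# Simulated stand-in data; the shape 2.0 and scale 30.0 are arbitrary choices.
rng = random.Random(0)
sample = [rng.gammavariate(2.0, 30.0) for _ in range(20000)]

k_hat, theta_hat = fit_gamma_moments(sample)
print("shape:", k_hat, "scale:", theta_hat, "mode:", gamma_mode(k_hat, theta_hat))
```

Note that the mode is nonzero only when $k > 1$, which is worth checking against the histogram before settling on a gamma model.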

Macro
  • 4
    It's awfully hard to eyeball a distribution, but one can usually say what it's not. In this case, the right tail is so heavy that the Gamma shape parameter $k$ has to be less than 1, but the nonzero mode shows it has to exceed 1; ergo, no Gamma is going to be a reasonable approximation for most purposes. – whuber Aug 01 '11 at 20:45
  • 2
    The tail length is also a function of $\theta$. You could have $k$ just barely larger than 1, say $1/.75$, and $\theta$ just barely larger than 0, say $1/25$, which would create the long tail and the nonzero mode. In fact, a histogram of x=rgamma(10000,1/.8,1/25) looks pretty close to the one seen above. – Macro Aug 01 '11 at 21:27
  • Typo: look at the histogram of x=rgamma(10000,1/.75,1/25) instead. – Macro Aug 01 '11 at 21:34
  • Not even close. The histogram in the OP decays roughly like $1/x$ (not exponentially). Its tail simply won't match any Gamma accurately. Histograms are notoriously bad tools for judging distributions. Constructing the q-q plot will show you what's wrong. – whuber Aug 08 '11 at 17:44
  • I didn't have the data set so I had nothing to compare to, which is why I used the histogram. – Macro Aug 08 '11 at 17:48
  • @whuber can you say a bit more about how the histogram above is obviously decaying like $1/x$? The height of the density corresponding to $x=100$ looks nowhere near twice as high as the $y$ corresponding to $x=200$. I could be making a rookie mistake here but I'd like to hear what you have to say – Macro Aug 08 '11 at 18:09
  • Here are the heights of the bars from the leftmost out to just beyond x=200 (accurate to about ±100): 90730 114284 99263 79374 70662 56121 51794 43202 40378 35631 33769 30704 28361 25176 22472 20610 18206 16330 14234 12943 11736 10233 9196 8349 7439 6486 5894 4984 4370 3989 3481 3205 2549 2232 1956 1618 1343 1300 1300 1067 1067 771 750 686. The bar width is approximately 4.75. There are many good ways to analyze this, so I'm interested in learning about your approach. – whuber Aug 08 '11 at 19:05
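One possible way to look at the bar heights quoted in the last comment is to compare two straight-line fits: log height against $x$ (a straight line means exponential decay, as in a gamma tail) and log height against $\log x$ (a straight line with slope near $-1$ means decay like $1/x$). The sketch below is only an illustration under stated assumptions: the leftmost bar is assumed to start at $x = 0$, so bar $i$ is centred at $(i + 0.5) \times 4.75$, and the helper name `fit_line` is made up for this example. Only the bars from the tallest one onward are used, so that the rising part of the histogram does not affect the comparison.

```python
import math

# Bar heights as quoted in the comment above (accurate to about +/-100).
heights = [
    90730, 114284, 99263, 79374, 70662, 56121, 51794, 43202, 40378, 35631,
    33769, 30704, 28361, 25176, 22472, 20610, 18206, 16330, 14234, 12943,
    11736, 10233, 9196, 8349, 7439, 6486, 5894, 4984, 4370, 3989,
    3481, 3205, 2549, 2232, 1956, 1618, 1343, 1300, 1300, 1067,
    1067, 771, 750, 686,
]
BAR_WIDTH = 4.75

# Assumption: the leftmost bar starts at x = 0.
centers = [(i + 0.5) * BAR_WIDTH for i in range(len(heights))]


def fit_line(xs, ys):
    """Least-squares slope and R^2 for the line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Keep only the bars from the tallest one onward (the decaying part).
start = heights.index(max(heights))
xs = centers[start:]
log_heights = [math.log(h) for h in heights[start:]]

# Exponential decay: log(height) is linear in x.
slope_exp, r2_exp = fit_line(xs, log_heights)
# Power-law decay: log(height) is linear in log(x); 1/x would give slope -1.
slope_pow, r2_pow = fit_line([math.log(x) for x in xs], log_heights)

print("log h vs x:     slope", slope_exp, "R^2", r2_exp)
print("log h vs log x: slope", slope_pow, "R^2", r2_pow)
```

Whichever fit has the higher $R^2$ is the closer description of how these bars decay. This is a crude check on binned counts, not a substitute for a q-q plot on the raw data.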