
I have data whose distribution resembles an exponential distribution, but the data has a heavier tail than the exponential distribution.

I would be grateful for any recommendation of an alternative to the exponential distribution for these data.

Nick Cox
oercim
  • A good basis for any such recommendation is a theory about the underlying process, because tail estimation can be highly uncertain without having a large amount of data. What additional information can you supply about your problem? – whuber Mar 16 '19 at 18:41
  • @whuber, the data are the durations (in months) between the day of a car accident and the day it was reported. I don't have much data, though. Most accidents seem to be reported quickly, but some are reported late. When I fit an exponential distribution to the data, its tail looks too light compared with the data. – oercim Mar 16 '19 at 19:20
  • One might suppose there is a statute of limitations or insurance limit, most likely a whole number of years. This wouldn't be modeled well by any standard heavy-tailed distribution. You ought to closely examine what data you do have. – whuber Mar 16 '19 at 19:28

1 Answer


Given your discussion with @whuber, I would suggest two approaches.

(1) A mixture model, perhaps a mixture of exponentials for simplicity. One can think of this as a hierarchical model, where observations can come from different sub-populations and each sub-population has its own distribution. From what you've described, this sounds like the situation you are looking at: most people who get in an accident report it almost immediately to an insurance provider, while those who wait any significant amount of time are likely following a very different pattern.

(2) Kaplan-Meier curves. This is a nonparametric approach that makes no assumptions about the shape of the underlying distribution. It is a very simple approach that tells you what the data say, not what a constrained model of the data says.
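
As a rough sketch of (2), the Kaplan-Meier (product-limit) estimate can be computed by hand with NumPy. The function name `kaplan_meier`, the `observed` flag, and the six delays below are made up for illustration; they are not the asker's data. If every report is observed (no censoring), the estimate is just the empirical survival function.

```python
import numpy as np

def kaplan_meier(durations, observed=None):
    """Return (event_times, survival) for the Kaplan-Meier estimator.

    durations: delay for each claim (e.g. months from accident to report).
    observed:  1 if the report was seen, 0 if the delay is right-censored;
               defaults to all delays being observed.
    """
    t = np.asarray(durations, dtype=float)
    d = np.ones_like(t) if observed is None else np.asarray(observed, dtype=float)
    event_times = np.unique(t[d == 1])      # sorted distinct event times
    surv = np.empty(len(event_times))
    s = 1.0
    for i, u in enumerate(event_times):
        at_risk = np.sum(t >= u)            # still unreported just before u
        events = np.sum((t == u) & (d == 1))
        s *= 1.0 - events / at_risk         # product-limit update
        surv[i] = s
    return event_times, surv

# Hypothetical delays in months; the fifth one is treated as censored.
times, surv = kaplan_meier([0.1, 0.2, 0.2, 0.5, 3.0, 14.0],
                           [1, 1, 1, 1, 0, 1])
for u, s in zip(times, surv):
    print(f"S({u:g}) = {s:.3f}")
```

Plotting `surv` against `times` as a step function gives the usual Kaplan-Meier curve.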

Whether you want to use (1) or (2) depends on your use case; are you interested in knowing the proportion of subjects who don't follow the trend of reporting very quickly? Then (1) answers this question a little more directly (assuming a good fit). Are you just interested in seeing the overall distribution of time to reporting without a model's interpretation of the data? Then use (2).
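
For (1), one possible way to fit a two-component mixture of exponentials is the EM algorithm; the estimated mixing weight is then a direct estimate of the proportion of late reporters. The sketch below is only illustrative: the function name `fit_exp_mixture`, the starting values, and the synthetic data are my own assumptions, and it assumes strictly positive, uncensored delays.

```python
import numpy as np

def fit_exp_mixture(x, n_iter=500, tol=1e-10):
    """EM for a two-component exponential mixture (no censoring).

    Returns (w, rate_fast, rate_slow), where w is the estimated weight
    of the slow (late-reporting) component.
    """
    x = np.asarray(x, dtype=float)
    # Crude starting values: split the sample at its median.
    med = np.median(x)
    rate_fast = 1.0 / x[x <= med].mean()
    rate_slow = 1.0 / x[x > med].mean()
    w = 0.5
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each delay is from the slow component.
        f_fast = (1 - w) * rate_fast * np.exp(-rate_fast * x)
        f_slow = w * rate_slow * np.exp(-rate_slow * x)
        total = f_fast + f_slow
        r = f_slow / total
        loglik = np.log(total).sum()
        # M-step: weighted maximum-likelihood updates.
        w = r.mean()
        rate_slow = r.sum() / (r * x).sum()
        rate_fast = (1 - r).sum() / ((1 - r) * x).sum()
        if loglik - prev < tol:   # stop once the log-likelihood stalls
            break
        prev = loglik
    return w, rate_fast, rate_slow

# Synthetic data (an assumption, not the real claims data):
# 80% quick reporters with mean 0.2 months, 20% late reporters with mean 6 months.
rng = np.random.default_rng(0)
n = 5000
is_slow = rng.random(n) < 0.2
delays = np.where(is_slow, rng.exponential(6.0, n), rng.exponential(0.2, n))
w_hat, rate_fast_hat, rate_slow_hat = fit_exp_mixture(delays)
print(w_hat, 1 / rate_fast_hat, 1 / rate_slow_hat)
```

With a small sample, as in the question, the estimates of the slow component in particular can be quite uncertain, so treat the fitted weight with caution.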

Also, my advice is that even if you decide to use (1), you should still compare the overall fit with (2) as a form of model checking.
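
One simple way to do that check, assuming no censoring (in which case Kaplan-Meier reduces to the empirical survival function), is to overlay the fitted mixture's survival curve on the empirical one and look at the largest gap. The helper names and the parameter values in the example are hypothetical.

```python
import numpy as np

def mixture_survival(t, w, rate_fast, rate_slow):
    """Survival function of a two-component exponential mixture;
    w is the weight of the slow component."""
    t = np.asarray(t, dtype=float)
    return (1 - w) * np.exp(-rate_fast * t) + w * np.exp(-rate_slow * t)

def empirical_survival(x):
    """Empirical survival at each distinct observed value
    (equal to Kaplan-Meier when nothing is censored)."""
    xs, counts = np.unique(np.asarray(x, dtype=float), return_counts=True)
    surv = 1.0 - np.cumsum(counts) / counts.sum()
    return xs, surv

def max_survival_gap(x, w, rate_fast, rate_slow):
    """Largest absolute difference between the empirical and model
    survival curves, evaluated at the observed values."""
    xs, surv = empirical_survival(x)
    return np.max(np.abs(surv - mixture_survival(xs, w, rate_fast, rate_slow)))

# Illustration with synthetic data and made-up parameter values.
rng = np.random.default_rng(2)
sample = rng.exponential(1.0, 2000)
gap_good = max_survival_gap(sample, w=0.0, rate_fast=1.0, rate_slow=1.0)
gap_bad = max_survival_gap(sample, w=0.0, rate_fast=5.0, rate_slow=1.0)
print(gap_good, gap_bad)
```

A plot of both curves, ideally with a log-scaled vertical axis so the tail is visible, is usually more informative than the single number.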

Cliff AB
  • Thanks a lot for the answer. My aim is to run some simulations of the duration: I want to generate random data for different parameter values of the underlying distribution. I guess (1) is better suited to that aim, but fitting such a mixture may be hard. I will try it. – oercim Mar 16 '19 at 19:36
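
For the simulation aim mentioned in this comment, drawing from a two-component exponential mixture is straightforward once parameter values are chosen: pick the component for each observation, then draw from that component. The function name `simulate_delays` and the parameter values are placeholders, not fitted values.

```python
import numpy as np

def simulate_delays(n, w, mean_fast, mean_slow, rng=None):
    """Draw n reporting delays from a two-component exponential mixture.

    w is the probability of the slow (late-reporting) component;
    mean_fast and mean_slow are the component means (e.g. in months).
    """
    rng = np.random.default_rng() if rng is None else rng
    slow = rng.random(n) < w                      # component label per draw
    return np.where(slow,
                    rng.exponential(mean_slow, n),
                    rng.exponential(mean_fast, n))

# Placeholder parameters: 20% late reporters, means of 0.2 and 6 months.
sim = simulate_delays(10_000, w=0.2, mean_fast=0.2, mean_slow=6.0,
                      rng=np.random.default_rng(1))
print(sim.mean(), np.quantile(sim, [0.5, 0.9, 0.99]))
```

Varying `w` and the two means then gives simulated data sets under different parameter settings.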