I have a number of series that would typically be described as right-skewed or Gamma distributed. For example, in a group of customers I may have calculated their spend over a fixed length of time. I then create a histogram of the distribution of spend and find an extremely long tail for the small group of high spenders.

Since I want to identify these high spenders, my question is: are there methods to empirically inspect a distribution of values and approximate the point at which the distribution becomes "long-tailed", so that I can create a cut in the data? I am not looking to inspect a histogram by eye to find the long tail, but for a consistent method to cut the data systematically.

mkt
Adam L
  • You might appreciate this recent answer showing ways to identify modes in a distribution. Also of interest is this thread on 1D clustering. Searching our site on "clustering" is also likely to be fruitful. – whuber May 06 '13 at 20:40
  • What do you mean by "...the point at which the distribution becomes 'long-tailed'..."? Are you asking if there is a way to quantify the thickness of the tail of a distribution? If so, extreme value theory may be of help. – rbatt May 07 '13 at 00:58
  • whuber - judging by the first 5 min of browsing the links you provided, I believe you have pointed me in the right direction. Thanks a bunch! – Adam L May 07 '13 at 20:30
  • If you found a solution, maybe you can post an answer telling us what you finally did! – kjetil b halvorsen Jan 27 '19 at 11:57
  • What needs to be done to make sense of the data depends on the data itself. For example, uni-modal or bimodal data would be treated differently. – Carl Aug 16 '19 at 00:12

1 Answer

This question can be restated as: where do I cut a continuum to split it into two meaningfully different sets? And the answer is of course that there is no non-arbitrary way to do this. This is essentially the question of whether discretising continuous variables is advisable, about which we have many relevant threads:

Why is it Bad to Discretize a Continuous Variable?

When should we discretize/bin continuous independent variables/features and when should not?

What is the benefit of breaking up a continuous predictor variable?

Why should binning be avoided at all costs?

What is the justification for unsupervised discretization of continuous variables?

In short: it's almost never advisable to do this as part of an analysis.

But sometimes a decision has to be made that requires some discretising. An example relevant to your situation would be: which customers should I spend money on by offering them incentives (such as discounts)? You could cut the distribution at some arbitrary threshold, as is done in the Wikipedia article for 'long tail', and offer incentives only to people above it:

https://en.wikipedia.org/wiki/Long_tail
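
For illustration, an arbitrary cut of this kind might look like the minimal Python sketch below. The 95% quantile, the simulated Gamma spend values and the helper name `quantile_cut` are all assumptions made for the example; nothing in the data itself singles out that particular quantile.

```python
import math
import random


def quantile_cut(spend, q=0.95):
    """Return (threshold, values strictly above it) for an arbitrary quantile q."""
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    if not spend:
        raise ValueError("spend must not be empty")
    ordered = sorted(spend)
    # Nearest-rank quantile: the smallest observed value such that at least
    # a fraction q of the observations are at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    threshold = ordered[rank - 1]
    high = [x for x in spend if x > threshold]
    return threshold, high


# Simulated spend per customer (hypothetical data, Gamma distributed).
random.seed(0)
spend = [random.gammavariate(2.0, 50.0) for _ in range(1000)]

threshold, high_spenders = quantile_cut(spend, q=0.95)
print(f"threshold = {threshold:.2f}, customers above it = {len(high_spenders)}")
```

Changing `q` moves the cut smoothly along the distribution, which is exactly why the choice is arbitrary unless something outside the data motivates it.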

But this approach is weak without some justification. The good news is that for a decision like this, there's often more information you can and should take into account. The costs and benefits of the incentives are most important here, and should be used in any choice of threshold.
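
As a rough sketch of how costs and benefits could set the threshold, suppose (purely as an assumption) that an incentive costs a fixed amount per customer and raises that customer's spend by a fixed fraction, of which a fixed margin is profit. Under that toy model the break-even spend is the cost divided by the product of uplift and margin. The names `incentive_cost`, `uplift_rate` and `margin`, and the numbers used, are hypothetical; a real analysis would need estimates of these quantities.

```python
def break_even_spend(incentive_cost, uplift_rate, margin):
    """Spend level at which the expected gain from an incentive equals its cost.

    Assumed toy model: the incentive raises a customer's spend by
    uplift_rate * spend, and margin is the profit fraction of that extra spend.
    """
    if incentive_cost < 0 or uplift_rate <= 0 or margin <= 0:
        raise ValueError("cost must be >= 0; uplift_rate and margin must be > 0")
    return incentive_cost / (uplift_rate * margin)


def target_customers(spend, incentive_cost, uplift_rate, margin):
    """Return (threshold, targeted spend values, expected net gain)."""
    threshold = break_even_spend(incentive_cost, uplift_rate, margin)
    # Only customers above the break-even point have a positive expected gain.
    targeted = [s for s in spend if s > threshold]
    net_gain = sum(margin * uplift_rate * s - incentive_cost for s in targeted)
    return threshold, targeted, net_gain


# Hypothetical inputs: a 10-unit incentive, 25% uplift, 50% margin.
threshold, targeted, net_gain = target_customers(
    [50, 80, 100, 200], incentive_cost=10, uplift_rate=0.25, margin=0.5
)
print(threshold, targeted, net_gain)
```

Here the threshold comes from the economics of the decision rather than from the shape of the histogram, so it changes whenever the assumed cost, uplift or margin changes.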

mkt