
One of the algorithms that we are working on is designed to report the time it takes for people to finish the onboarding process in an app.

We need an algorithm that eliminates the long tail of a distribution in a way that does not depend on how long the tail is. Essentially, we want to identify the main body of the distribution of how much time it takes users to finish the onboarding process.

We want to report "p% of users onboard within X time" where p should be determined by cutting the long tail of the distribution. The tail has variable length.

[Figure: histogram of onboarding times with a long right tail. A blue bar marks the cut-off point we want to find; a pink line shows a power-law trendline fitted to the raw data.]

The characteristic curve could be a log-normal distribution (if we consider the data from the maximum value onward) or a power-law distribution.

Considering this, how would you suggest finding the cut-off point of the long tail? We are excluding the Pareto principle or any other predetermined percentage (75%, 90%, etc.).

In the uploaded photo you can see the blue bar; suppose that is the point we want to find. The problem with the current algorithm is that when we select a different analysis period for the same client and the same report, that blue point shifts significantly to the right or left depending on the length of the tail.

One solution we have is to fit a power-law trendline to the raw data (the pink line), calculate the tangent slope at each point along it, and set a threshold: the value where the slope becomes significantly flat, after which we consider the long tail to start.

Across numerous tests, the method above seems to resolve the shifting problem. Would you consider this a valid approach? Do you see any potential problems with it?
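The slope-threshold method described in the question can be sketched roughly as follows. This is a minimal sketch assuming NumPy; the bin count and the relative slope threshold are made-up tuning knobs for illustration, not values from our tests:

```python
import numpy as np

def cutoff_by_flat_slope(times, n_bins=50, rel_slope=0.01):
    """Fit a power-law trendline y = a * x**b to the histogram of `times`
    and return the first bin center where the tangent slope |dy/dx| drops
    below `rel_slope` times the peak density (a relative, hence roughly
    scale-free, flatness criterion -- an assumed choice)."""
    counts, edges = np.histogram(times, bins=n_bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    mask = counts > 0                       # log of empty bins is undefined
    # Power-law fit via linear regression in log-log space.
    b, log_a = np.polyfit(np.log(centers[mask]), np.log(counts[mask]), 1)
    a = np.exp(log_a)
    slope = np.abs(a * b * centers ** (b - 1))   # |dy/dx| of the trendline
    flat = slope < rel_slope * counts.max()
    return centers[flat][0] if flat.any() else centers[-1]
```

Note that `rel_slope` is exactly the "flat enough" threshold debated in the comments below; this sketch only makes the choice explicit, it does not resolve it.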

  • Neither model you mention, power law or lognormal, implies that a cut-off makes sense at all. I don't find here a clear discussion of why you want or need this at all. – Nick Cox Jan 13 '20 at 09:47
  • @NickCox we are interested in analyzing the head of the distribution separately from the tail and we need to automate the process of identifying the data of the head. The further back we look in the data, the longer the tail (more outliers that take much longer to finish the process). The length of the tail should not impact how the head is selected. How would you recommend we address this? – Valentina Jan 14 '20 at 09:16
  • Sure, but if the head can't be distinguished systematically from the tail, the question can't be answered. I would look systematically at the survivor (survival, reverse or complementary distribution) function on a double logarithmic scale. If you don't see systematic, identifiable breaks in that, you're searching for something that doesn't exist. – Nick Cox Jan 14 '20 at 09:30
  • @NickCox, thank you! It is a useful start for research. – Valentina Jan 14 '20 at 10:30
  • I agree with @NickCox that this question has no real answer in a lot of cases. In the method you outline, you just push the problem back to figuring out what "significantly flat" means. – Peter Flom Jan 15 '20 at 13:21
  • @PeterFlom-ReinstateMonica, that's correct. To solve this, we need to decide the threshold of the slope at which we can consider it "flat enough". – Valentina Jan 16 '20 at 08:29
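The survivor-function check suggested in the comments can be sketched like this, assuming NumPy; spotting a break is then done by eye, e.g. by plotting the returned arrays:

```python
import numpy as np

def log_log_survivor(times):
    """Empirical survivor function S(t) = P(T > t) on a double-log scale.
    A power-law tail appears as a straight line, a lognormal bends
    downward, and a visible kink would mark a candidate head/tail break."""
    t = np.sort(np.asarray(times, dtype=float))
    n = len(t)
    # Step estimate P(T > t_i) = (n - i)/n; drop the last point, where
    # the survivor probability is zero and its log is undefined.
    s = 1.0 - np.arange(1, n + 1) / n
    return np.log(t[:-1]), np.log(s[:-1])
```

With matplotlib, `plt.plot(*log_log_survivor(times))` gives the plot Nick Cox describes.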

1 Answer


We want to report "p% of users onboard within X time" where p should be determined by cutting the long tail of the distribution. The tail has variable length.

[...]

Considering this, how would you suggest finding the cut-off point of the long tail? We are excluding the Pareto principle or any other predetermined percentage (75%, 90%, etc.).

Approach a) You may report a fixed value of $p$ so that you can compare data over time. To me, a value of $90\%$ or $95\%$ seems adequate. Note that, in this case, you are reporting a quantile of the time: if you choose $p=90\%$, you will be reporting the $90$th percentile of the time.

Approach b) Alternatively, you may choose a "reasonable" time, such as a "round" number, and report the percentage of users who onboard within that time.
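For illustration, approaches a) and b) are each a one-liner, assuming NumPy and a made-up sample of onboarding times in minutes:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.lognormal(1.0, 0.6, 10_000)  # made-up onboarding times, in minutes

# Approach a) fix p, report the time quantile:
x_90 = np.quantile(t, 0.90)          # "90% of users onboard within x_90 min"

# Approach b) fix a "round" time, report the percentage:
p_10 = (t <= 10).mean() * 100        # "p_10 % of users onboard within 10 min"
```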

Approach c) Nevertheless, if you really want to determine a cut-off in an "automatic" way, you may be interested in looking at kernel density estimation. A possible cut-off could be determined by the time $X$ such that the derivative of the probability density is lower than a small value for all times larger than $X$. However, you still have a problem in choosing that small value: will it be fixed or a fraction of the maximum density?
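A rough sketch of approach c), assuming SciPy's `gaussian_kde`; the relative threshold `rel_slope` is precisely the open choice just mentioned, set to an arbitrary value here:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_cutoff(times, rel_slope=0.01, grid_size=512):
    """Return the smallest grid time X such that the absolute derivative
    of the kernel density estimate stays below `rel_slope` * max|f'|
    for all times >= X.  `rel_slope` is an assumed tuning knob."""
    kde = gaussian_kde(times)
    grid = np.linspace(np.min(times), np.max(times), grid_size)
    density = kde(grid)
    deriv = np.abs(np.gradient(density, grid))
    small = deriv < rel_slope * deriv.max()
    # Suffix-AND: ok[i] is True iff `small` holds from i to the end.
    ok = np.flip(np.minimum.accumulate(np.flip(small)))
    return grid[np.argmax(ok)] if ok.any() else grid[-1]
```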

Examples of made up sentences:

Approach a) $95\%$ of users onboard within $9.74$ minutes.

Approach b) $94.7\%$ of users onboard within $10$ minutes.

Approach c) $94.8\%$ of users onboard within $9.84$ minutes.

In my opinion, the first two approaches give values that are easier for the general public to understand (and I slightly prefer the second).