One of the algorithms we are working on reports how long it takes users to complete the onboarding process in an app.
We need a method to cut off the long tail of the distribution in a way that is insensitive to how long the tail actually is. Essentially, we want to isolate the main body of the distribution of onboarding completion times.
We want to report "p% of users onboard within X time", where p is determined by cutting off the long tail of the distribution. The tail has variable length.
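To make the reporting step concrete, here is a minimal Python sketch of how p and X would follow once a cutoff is known; the data, the cutoff value, and the variable names are illustrative placeholders, not part of our actual pipeline:

```python
import numpy as np

# Synthetic stand-in for per-user onboarding durations, in seconds.
times = np.random.lognormal(mean=3.5, sigma=0.6, size=10_000)

cutoff = 60.0  # hypothetical tail cut-off, in seconds
p = 100 * np.mean(times <= cutoff)  # share of users within the cut-off
print(f"{p:.1f}% of users onboard within {cutoff:.0f} seconds")
```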
The characteristic curve could be a log-normal distribution (if we consider the data from its peak onward) or a power-law distribution.
Given this, how would you suggest finding the cut-off point of the long tail? We want to exclude the Pareto principle and other predetermined percentages (75%, 90%, etc.).
In the uploaded photo, the blue bar marks the kind of point we want to find. The problem with the current algorithm is that when we select a different analysis period for the same client and the same report, that blue point shifts significantly to the left or right depending on the length of the tail.
One solution we have is to fit a power-law trendline to the raw data (the pink line), calculate the slope of the tangent at each point along it, and set a threshold: the value at which the slope becomes significantly flat, after which we consider the long tail to begin (a rough sketch of this idea follows below).
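For concreteness, here is a minimal sketch of that idea in Python with NumPy. The function name, the bin count, and the slope threshold are our illustrative choices rather than values from the actual algorithm, and the power-law fit is done by linear regression in log-log space:

```python
import numpy as np

def find_tail_cutoff(times, slope_threshold=-0.05, bins=100):
    """Sketch: fit a power-law trendline to the histogram of onboarding
    times, then return the first bin where the trendline's tangent slope
    is flatter than `slope_threshold` (both parameters are illustrative)."""
    counts, edges = np.histogram(times, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2

    # Fit y = a * x^b by linear regression in log-log space,
    # skipping empty bins so the logs stay finite.
    mask = counts > 0
    b, log_a = np.polyfit(np.log(centers[mask]), np.log(counts[mask]), 1)
    a = np.exp(log_a)

    # Tangent slope of the fitted trendline at each bin center:
    # d/dx (a * x^b) = a * b * x^(b - 1)
    slopes = a * b * centers ** (b - 1)

    # For a decreasing trendline (b < 0) the slopes are negative and
    # approach zero along the tail; the cut-off is the first bin whose
    # slope is flatter (closer to zero) than the threshold.
    flat = np.where(slopes > slope_threshold)[0]
    return centers[flat[0]] if flat.size else centers[-1]
```

One thing the sketch makes visible: an absolute slope threshold depends on the scale of the histogram counts and of the time axis, which may matter when comparing analysis periods of different sizes.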
In numerous tests, the above method seems to resolve the shifting problem. Would you consider this a valid approach? Do you see any potential problems with it?
