My samples follow heavy-tailed distributions. I use the following process to detect and remove "extreme" samples (see the code sketch after the list):
- Measure the mean and standard deviation of the samples.
- Remove samples greater than the mean plus 4 standard deviations.
- Repeat from Step 1, for a total of 3 iterations.
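To pin down exactly what I mean, here is a minimal NumPy sketch of the procedure; the names `iterative_trim`, `n_sigmas`, and `n_iterations` are just mine:

```python
import numpy as np

def iterative_trim(samples, n_sigmas=4.0, n_iterations=3):
    """Repeatedly drop samples above mean + n_sigmas * std,
    recomputing mean/std on the surviving samples each pass."""
    kept = np.asarray(samples, dtype=float)
    for _ in range(n_iterations):
        mu, sigma = kept.mean(), kept.std()
        # Keep only samples at or below the current upper threshold.
        kept = kept[kept <= mu + n_sigmas * sigma]
    return kept
```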
If there are no extreme samples, the process removes nothing. If there are many, they inflate the mean and standard deviation on the first pass; but because each iteration recomputes both on the trimmed data, the threshold tightens, and later passes catch what the first one missed.
For my problem, the algorithm removes 0-5% of samples, and empirical tests suggest it works well and is fairly robust.
However, is this process sound? Does it have a formal name I can look up?
Note that I cannot simply tune an "optimal" number of standard deviations above which to reject samples, since the whole process must be automated across multiple datasets and run as part of a live system. The datasets are similar but not identical: some are small and contain no outliers (so there is no luxury of losing samples); some are large, and the number of extreme samples that must be removed varies and can be quite high.
The data represent user actions (one variable per action type), and we have multiple sets of users to compare (different segmentations). However, many "users" are actually bots, which typically (but not always) repeat the same actions far more times than a human would. It is fair to say we have a mixture of two distributions that we cannot really distinguish: human users and bots. We care about differences in the behaviour of human users after removing as many bots as possible. In the long term we will build a classifier for bots, but for now we are after a quick solution.
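For intuition only, this is roughly the kind of mixture I have in mind and how the trim is applied to it. All distribution choices and parameters below are invented for illustration (my real data are not necessarily lognormal), and `iterative_trim` is the sketch from above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: 95% "humans" with a moderately heavy tail,
# 5% "bots" repeating actions far more often. Parameters are made up.
humans = rng.lognormal(mean=1.0, sigma=0.6, size=9500)
bots = rng.lognormal(mean=3.5, sigma=0.8, size=500)
actions = np.concatenate([humans, bots])

trimmed = iterative_trim(actions, n_sigmas=4.0, n_iterations=3)
print(f"removed {1 - trimmed.size / actions.size:.1%} of samples")
```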