My samples follow heavy-tailed distributions. I use the following process to detect and remove "extreme" samples (see the code sketch after the list):
- Measure the mean and standard deviation of the samples.
- Remove samples greater than the mean plus 4 standard deviations.
- Repeat from Step 1, for a total of 3 iterations.
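To pin down exactly what I mean, here is a minimal NumPy sketch of the procedure; the names `iterative_trim`, `n_sigmas`, and `n_iterations` are just mine:

```python
import numpy as np

def iterative_trim(samples, n_sigmas=4.0, n_iterations=3):
    """Repeatedly drop samples above mean + n_sigmas * std,
    recomputing mean/std on the surviving samples each pass."""
    kept = np.asarray(samples, dtype=float)
    for _ in range(n_iterations):
        mu, sigma = kept.mean(), kept.std()
        # Keep only samples at or below the current upper threshold.
        kept = kept[kept <= mu + n_sigmas * sigma]
    return kept
```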
If there are no extreme samples, the process removes nothing. If there are many, they inflate the mean and standard deviation on the first pass; but because each iteration recomputes both on the trimmed data, the threshold tightens, and later passes catch what the first one missed.
For my problem, the algorithm removes 0-5% of samples, and empirical tests suggest it works well and is fairly robust.
However, is this process sound? Does it have a formal name I can look up?
Note that I cannot simply tune an "optimal" number of standard deviations above which to reject samples, since the whole process must be automated across multiple datasets and run as part of a live system. The datasets are similar but not identical: some are small and contain no outliers (so there is no luxury of losing samples); some are large, and the number of extreme samples that must be removed varies and can be quite high.
The data represent user actions (one variable per action type), and we have multiple sets of users to compare (different segmentations). However, many "users" are actually bots, which typically (but not always) repeat the same actions far more times than a human would. It is fair to say we have a mixture of two distributions that we cannot really distinguish: human users and bots. We care about differences in the behaviour of human users after removing as many bots as possible. In the long term we will build a classifier for bots, but for now we are after a quick solution.
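For intuition only, this is roughly the kind of mixture I have in mind and how the trim is applied to it. All distribution choices and parameters below are invented for illustration (my real data are not necessarily lognormal), and `iterative_trim` is the sketch from above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: 95% "humans" with a moderately heavy tail,
# 5% "bots" repeating actions far more often. Parameters are made up.
humans = rng.lognormal(mean=1.0, sigma=0.6, size=9500)
bots = rng.lognormal(mean=3.5, sigma=0.8, size=500)
actions = np.concatenate([humans, bots])

trimmed = iterative_trim(actions, n_sigmas=4.0, n_iterations=3)
print(f"removed {1 - trimmed.size / actions.size:.1%} of samples")
```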