
I'll preface this by saying that I haven't taken a stats class yet, so talk to me like a five year old. If it matters, the data I'm working with are execution times for a program and I'm trying to determine a meaningful average execution time.

I have a set of data. My professor says that to remove outliers, I should calculate the standard deviation and remove all values outside a range given by (mean - 3*std_dev):(mean + 3*std_dev), and then I should repeat that process on the new data (without those outliers) until no outliers are found, then use the new data set to determine an average execution time.

Why would I use this iterative approach rather than only applying the process once?
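For concreteness, here is roughly how I've coded the procedure so far (a Python sketch; numpy and the function name are my own choices, not something the professor specified):

    import numpy as np

    def iterative_trim(times, k=3.0):
        # Repeatedly drop values more than k standard deviations from the
        # mean until none remain, then return the trimmed data.
        data = np.asarray(times, dtype=float)
        while True:
            mean, std = data.mean(), data.std()
            keep = np.abs(data - mean) <= k * std
            if keep.all():        # no outliers left -> done
                return data
            data = data[keep]     # discard outliers and repeat

    # average execution time from the trimmed data:
    # avg = iterative_trim(measured_times).mean()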

Daniel B.
    You shouldn't use this approach for estimating an average. The duplicate has answers explaining why not. They don't give a full explanation, so if you have further questions please feel free to edit this post to request specifics that haven't been addressed there. – whuber Jul 08 '15 at 22:30
  • Sorry for not noticing a dup ... I barely know what I'm asking so recognizing a duplicate is hard; I'll take a peek at the other question. – Daniel B. Jul 08 '15 at 22:42
  • It's not a problem--this site has such a library of good questions and answers that it's hard even for experts to find duplicates sometimes: you have to know the right keywords (or have seen the question before). – whuber Jul 08 '15 at 22:54

1 Answer


You can consider this approach because your data may contain outliers at different scales.

The first pass will remove only the most extreme of them, while leaving the less outlying ones in place.

Applying it repeatedly, however, leaves just the real inliers, removing lots of the smaller peaks as well as the huge ones.

That said, whether it is reasonable to use at all, and how many iterations to run, depends on the data, and it is not considered good practice, as noted in the comments.

You may also consider the MAD (median absolute deviation) instead of mean ± standard deviation, as suggested in the answer to a similar question.
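For illustration, a rough sketch of the MAD-based filter (plain numpy; the cutoff of 3 scaled MADs is an arbitrary choice for the example, not something prescribed by that answer):

    import numpy as np

    def mad_filter(times, k=3.0):
        # Median absolute deviation, scaled by ~1.4826 so it is comparable
        # to a standard deviation for roughly normal data.
        data = np.asarray(times, dtype=float)
        med = np.median(data)
        mad = 1.4826 * np.median(np.abs(data - med))
        return data[np.abs(data - med) <= k * mad]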

  • This reasoning is pretty sound but the initial advice is not: this is not a good way to estimate an average. The approach should be abandoned. Much better (and simpler) procedures exist, such as a Winsorized mean. – whuber Jul 08 '15 at 22:33
  • @whuber I just tried to explain why one (the Professor in this case) may suggest that approach, since that was the question, not trying to convince that one should. – Nikita Pestrov Jul 08 '15 at 22:35
  • It may simply be that this was the easiest approach to explain to someone with no background in statistics? – Daniel B. Jul 08 '15 at 22:42
  • Your answer, Nikita, strongly suggests the opposite of what you write in your comment. The initial line is a summary: it is what most people will take away from reading it. It states, clearly and with no reservation, that "You should consider this approach..." (my emphasis, of course). To most English speakers that means "This is the right thing to do," not "I can see why someone might think that way, but..." If you are not making such an emphatic recommendation then you should edit your answer to make that clearer. – whuber Jul 08 '15 at 22:57
  • @Daniel This is an approach a lot of people have thought of. It seems to be natural. But it (a) is overly complicated and (b) simply bad. It makes a sequence of mutually inconsistent assumptions, so it has no theoretical justification, and it's easy to think of circumstances where it gives heavily biased answers. For people with no background in stats, Winsorization is a good recommendation: it's simple and easy to understand. – whuber Jul 08 '15 at 23:00
  • Thanks. I'll ask the prof about this and take a look at winsorization. I don't know about overly complicated; it takes me about 10 lines in python. I don't have matlab where I'm working right now, but I don't think it'd be much different. – Daniel B. Jul 09 '15 at 13:44
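For reference, a minimal sketch of the Winsorized mean recommended in the comments above (plain numpy; the 5% limits are an arbitrary choice for illustration):

    import numpy as np

    def winsorized_mean(times, pct=5.0):
        # Clip the lowest and highest pct% of values to the corresponding
        # percentiles, then average the clipped data.
        data = np.asarray(times, dtype=float)
        lo, hi = np.percentile(data, [pct, 100 - pct])
        return np.clip(data, lo, hi).mean()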