I'd appreciate guidance on how to tackle this problem. I have several inputs that are categorical variables and a single numeric output. Most of the time the output is small; however, under some conditions it is large.
Consider the following data:
ind1   ind2    ind3    dep
black  SMALL   OLD      42
blue   LARGE   NEW     204
blue   LARGE   OLD      52
blue   LARGE   VOLD     34
blue   LARGE   VVOLD    32
blue   LARGE   MOLDY    57
blue   MEDIUM  NEW     247
green  XSMALL  NEW     217
green  XSMALL  VVOLD    27
green  SMALL   NOVEL   203
green  SMALL   VVOLD    25
green  SMALL   MOLDY    47
red    MEDIUM  OLD      44
red    MEDIUM  VVOLD    25
red    MEDIUM  MOLDY    47
red    XSMALL  OLD      41
red    XSMALL  MOLDY    48
This is an artificial data set. In reality there are many more conditions and 80% of them have "low" outputs.
Sorting the data by 'ind3' reveals that the large outputs occur when ind3==NEW or ind3==NOVEL. But I'd like a test that would tell me something like: "Any dep > 70 is an outlier, and outliers are generated when ind3==NEW or ind3==NOVEL." I want the "70" to be determined from the data rather than supplied as a parameter, for instance.
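To make the kind of answer I'm hoping for concrete, here is a minimal sketch in Python (pandas is just my choice for illustration, not a requirement), using the common 1.5×IQR fence on dep and then looking at which ind3 values the flagged rows share:

```python
import pandas as pd

# The toy data from above (ind1, ind2, ind3, dep).
rows = [
    ("black", "SMALL",  "OLD",    42),
    ("blue",  "LARGE",  "NEW",   204),
    ("blue",  "LARGE",  "OLD",    52),
    ("blue",  "LARGE",  "VOLD",   34),
    ("blue",  "LARGE",  "VVOLD",  32),
    ("blue",  "LARGE",  "MOLDY",  57),
    ("blue",  "MEDIUM", "NEW",   247),
    ("green", "XSMALL", "NEW",   217),
    ("green", "XSMALL", "VVOLD",  27),
    ("green", "SMALL",  "NOVEL", 203),
    ("green", "SMALL",  "VVOLD",  25),
    ("green", "SMALL",  "MOLDY",  47),
    ("red",   "MEDIUM", "OLD",    44),
    ("red",   "MEDIUM", "VVOLD",  25),
    ("red",   "MEDIUM", "MOLDY",  47),
    ("red",   "XSMALL", "OLD",    41),
    ("red",   "XSMALL", "MOLDY",  48),
]
df = pd.DataFrame(rows, columns=["ind1", "ind2", "ind3", "dep"])

# Data-driven cutoff: the conventional fence 1.5*IQR above the third quartile.
q1, q3 = df["dep"].quantile([0.25, 0.75])
cutoff = q3 + 1.5 * (q3 - q1)

# Flag the rows above the cutoff and see which ind3 categories they share.
outliers = df[df["dep"] > cutoff]
print(f"cutoff = {cutoff}")             # ~91.5 for this toy data
print(outliers["ind3"].value_counts())  # only NEW and NOVEL rows are flagged
```

For the toy data this gives a cutoff of about 91.5 and flags exactly the NEW/NOVEL rows, which is the flavor of output I want. But the 1.5 multiplier is itself a parameter chosen by convention, so I'm hoping there is a more principled way to derive the cutoff from the data and to tie it back to the categorical inputs.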
I'm not sure I'm using the correct terminology: by "outlier" I don't mean "bad", I simply mean "outside of what is normal", which I realize is imprecise. The outliers would then receive more intensive analysis.
Thank you for your time.
[UPDATE]: My original question was identical in form but confusing. The category values for ind2 and ind3 were misleading, so I converted them to sizes and ages in the hope that they are clearer. Not important to the question, but the actual data comes from trying to detect runtime issues as a function of test case, computer host, and input parameter set.
