I'd appreciate guidance on how to tackle this problem. I have several inputs that are categorical variables and a single numeric output. Most of the time the output is small; however, under some conditions it is large.
Consider the following data:
ind1   ind2    ind3    dep
black  SMALL   OLD      42
blue   LARGE   NEW     204
blue   LARGE   OLD      52
blue   LARGE   VOLD     34
blue   LARGE   VVOLD    32
blue   LARGE   MOLDY    57
blue   MEDIUM  NEW     247
green  XSMALL  NEW     217
green  XSMALL  VVOLD    27
green  SMALL   NOVEL   203
green  SMALL   VVOLD    25
green  SMALL   MOLDY    47
red    MEDIUM  OLD      44
red    MEDIUM  VVOLD    25
red    MEDIUM  MOLDY    47
red    XSMALL  OLD      41
red    XSMALL  MOLDY    48
This is an artificial data set. In reality there are many more conditions and 80% of them have "low" outputs.
Sorting the data by 'ind3' reveals that the large outputs occur when ind3==NEW or ind3==NOVEL. But I'd like a test that would tell me something like: "Any dep > 70 is an outlier, and outliers are generated when ind3==NEW or ind3==NOVEL." I want the "70" to be determined from the data rather than supplied as a parameter, for instance.
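To make the kind of answer I'm hoping for concrete, here is a minimal sketch in Python (pandas is just my choice for illustration, not a requirement), using the common 1.5×IQR fence on dep and then looking at which ind3 values the flagged rows share:

```python
import pandas as pd

# The toy data from above (ind1, ind2, ind3, dep).
rows = [
    ("black", "SMALL",  "OLD",    42),
    ("blue",  "LARGE",  "NEW",   204),
    ("blue",  "LARGE",  "OLD",    52),
    ("blue",  "LARGE",  "VOLD",   34),
    ("blue",  "LARGE",  "VVOLD",  32),
    ("blue",  "LARGE",  "MOLDY",  57),
    ("blue",  "MEDIUM", "NEW",   247),
    ("green", "XSMALL", "NEW",   217),
    ("green", "XSMALL", "VVOLD",  27),
    ("green", "SMALL",  "NOVEL", 203),
    ("green", "SMALL",  "VVOLD",  25),
    ("green", "SMALL",  "MOLDY",  47),
    ("red",   "MEDIUM", "OLD",    44),
    ("red",   "MEDIUM", "VVOLD",  25),
    ("red",   "MEDIUM", "MOLDY",  47),
    ("red",   "XSMALL", "OLD",    41),
    ("red",   "XSMALL", "MOLDY",  48),
]
df = pd.DataFrame(rows, columns=["ind1", "ind2", "ind3", "dep"])

# Data-driven cutoff: the conventional fence 1.5*IQR above the third quartile.
q1, q3 = df["dep"].quantile([0.25, 0.75])
cutoff = q3 + 1.5 * (q3 - q1)

# Flag the rows above the cutoff and see which ind3 categories they share.
outliers = df[df["dep"] > cutoff]
print(f"cutoff = {cutoff}")             # ~91.5 for this toy data
print(outliers["ind3"].value_counts())  # only NEW and NOVEL rows are flagged
```

For the toy data this gives a cutoff of about 91.5 and flags exactly the NEW/NOVEL rows, which is the flavor of output I want. But the 1.5 multiplier is itself a parameter chosen by convention, so I'm hoping there is a more principled way to derive the cutoff from the data and to tie it back to the categorical inputs.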
I'm not sure I'm using the correct terminology: by "outlier" I don't mean "bad", I simply mean "outside of what is normal", which I realize is imprecise. The outliers would then receive more intensive analysis.
Thank you for your time.
[UPDATE]: My original question was identical in form but confusing. The category values for ind2 and ind3 were misleading, so I converted them to sizes and ages in the hope that they are clearer. Not important to the question, but the actual data comes from trying to detect runtime issues as a function of test case, computer host, and input parameter set.
