I'm trying to remove what might be considered "unreasonable" data by evaluating the square root of the variance as a percentage of the mean (the "% error" below). Here's the setup:
Let's say I have three bids on a contract. The contractors' total bids are all relatively close. But the itemized breakdown of the bids can have extremely high variances in them.
For example:
  #   Total Bid   Item 1   Item 2   Item 3   Item 4   Item 5
---   ---------   ------   ------   ------   ------   ------
  1     827,558    1,026     27.7      800    1,000    1,998
  2     667,118      950     25         80    3,000       23
  3     720,909    1,100     25         25    1,100     22.4
---   ---------   ------   ------   ------   ------   ------
err        9.03     5.97     4.91      117     54.1   136.78
The "err" is the percentage error between the mean and the square root of the variance of each group, calculated as:
((mean - var^(1/2)) / mean) * 100
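Here's a minimal NumPy sketch of that calculation (using the population standard deviation, ddof=0, which is what matches the numbers above):

```python
import numpy as np

# Each column from the table above is one group: the total bid plus the itemized bids.
bids = {
    "Total Bid": [827558, 667118, 720909],
    "Item 1":    [1026,   950,    1100],
    "Item 2":    [27.7,   25,     25],
    "Item 3":    [800,    80,     25],
    "Item 4":    [1000,   3000,   1100],
    "Item 5":    [1998,   23,     22.4],
}

for name, values in bids.items():
    values = np.asarray(values, dtype=float)
    # err = (sqrt(variance) / mean) * 100, with the population variance (ddof=0)
    err = np.std(values, ddof=0) / np.mean(values) * 100
    print(f"{name:10s} err = {err:6.2f}%")
```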
This metric does a great job representing the problem I think I need to address. For example, the % errors of Items 1 and 2 show that the bidders bid pretty consistently on those items. They also indicate that those item bids were more consistent than the overall bid totals (9.03% error).
By contrast, Items 3 - 5 show a higher degree of inconsistency, ranging from 54% to over 136%.
Here's what I know about the data a priori:
The high bids of Item 3 and Item 5 are garbage. By that I mean there's no real way to have anticipated those bids. It's just the bidder playing games with how they itemize their bids (really high on one item, really low on another) to mitigate extra costs if they get awarded the contract. In both Items 3 and 5, the lower bids are far closer to the value of the work.
Item 4 has a more ambiguous distribution. It could be that the lower bids represent the value of the work more accurately (and likely they do), but it may also be that the value is higher here than it seems. I'd be reluctant to throw out the high bid and might instead consider a weighted average as the real value of the work.
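For the weighted-average idea, this is roughly the kind of thing I mean (just a sketch; the inverse-distance-from-the-mean weighting is an arbitrary choice, not something I've settled on):

```python
import numpy as np

def weighted_item_value(bids, eps=1e-9):
    """Down-weight each bid in proportion to how far it sits from the
    group mean, so an outlying bid still contributes, just less.
    The inverse-absolute-deviation weighting is only one possible choice."""
    bids = np.asarray(bids, dtype=float)
    weights = 1.0 / (np.abs(bids - bids.mean()) + eps)
    return np.average(bids, weights=weights)

# Item 4 from the table: the 3,000 bid is discounted but not discarded.
print(weighted_item_value([1000, 3000, 1100]))  # ~1441
```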
I should also point out that I'm using this data with a neural network. Ideally, the model's prediction error would be 15% or less.
So, in order to treat this as conservatively as possible, keeping outliers that might reasonably contribute to the model while throwing out ones that are obviously useless, I've considered a couple of approaches:
1. Reject all bids for an item if the item's % error exceeds a set threshold.
2. Reject only the most variant bids when the % error exceeds the threshold.
It seems to me the best approach might be #1, using a threshold that scales with the desired error of the model...
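A rough sketch of approach #1 (the 50% threshold is just a placeholder; in practice it would be tied to the model's target error somehow):

```python
import numpy as np

def filter_items(items, threshold_pct=50.0):
    """Approach #1: drop an item entirely (all of its bids) whenever the
    group's err (std/mean * 100) exceeds the threshold.
    `items` maps item name -> list of bids; the threshold is a placeholder."""
    kept, rejected = {}, {}
    for name, values in items.items():
        values = np.asarray(values, dtype=float)
        err = np.std(values, ddof=0) / np.mean(values) * 100
        (kept if err <= threshold_pct else rejected)[name] = values
    return kept, rejected

items = {
    "Item 1": [1026, 950, 1100],
    "Item 2": [27.7, 25, 25],
    "Item 3": [800, 80, 25],
    "Item 4": [1000, 3000, 1100],
    "Item 5": [1998, 23, 22.4],
}
kept, rejected = filter_items(items)
print(sorted(rejected))  # ['Item 3', 'Item 4', 'Item 5'] at a 50% threshold
```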
I had the thought of perhaps applying this technique after selecting outliers by quantile. That way, I know I'm removing "relative extremes" by variance (which may not be extremes after all) from a list of "absolute extremes" selected by quantile, if that makes sense. Seems like combining the techniques might mitigate the data quality issues that each introduces...
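Roughly, something like this, for one item tracked across many contracts (the quantile cutoffs and threshold are placeholders):

```python
import numpy as np

def combined_filter(groups, q_low=0.05, q_high=0.95, threshold_pct=50.0):
    """Sketch of the combined idea for a single item across many contracts.
    `groups` is a list of per-contract bid lists for that item.
    Step 1: pool every bid for the item and find the 'absolute extreme'
            cutoffs (the q_low / q_high quantiles).
    Step 2: within each contract's group, drop bids only if the group's
            err exceeds the threshold AND the bid is an absolute extreme."""
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    lo, hi = np.quantile(pooled, [q_low, q_high])

    cleaned = []
    for g in groups:
        g = np.asarray(g, dtype=float)
        err = np.std(g, ddof=0) / np.mean(g) * 100
        if err > threshold_pct:
            g = g[(g >= lo) & (g <= hi)]  # remove only the absolute extremes
        cleaned.append(g)
    return cleaned
```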
– Joel Graff Apr 03 '15 at 16:00

Your second point deals with outliers I wasn't concerned about. I'm only going after these extremes because I know they demonstrably affect model performance. I just don't want to be overzealous in removing them. Yes, I know that univariate techniques aren't appropriate for multivariate problems, but this is a somewhat special case...
Anyway, I'll do a little more research on the links you provided.
– Joel Graff Apr 03 '15 at 16:09

To the point, though, I'm not using any of the variables that inform my model to select my outliers here. I'm essentially selecting outliers based on the magnitude of the variance of the dependent variables alone. Knowing that the magnitude of the "useless" data is generally very large (100% or more), I would leave in anything less.
In one case, this appears to select maybe 2 or 3 points out of 8,400.
– Joel Graff Apr 03 '15 at 18:36