I have a binary classifier trained on both numeric and categorical variables. In a given month, new data will come in, amounting to roughly 5% of the training sample's observation count. It takes a long time to observe the true binary outcome for these incoming observations, but I know all of the right-hand-side variables (features) of each observation right away; this is a classic out-of-sample classification problem.
I make predictions on these data using the binary classifier. I would like a means of quantifying how well these new observations are represented in the training dataset, taking into account both categorical and numeric features. Can anyone recommend a methodology that would yield a score for how well represented an observation is in the development data?
For example, say $X_1$ and $X_2$ are features. In the development data, these variables' magnitudes are usually inversely correlated. If new observations come in with an atypical association between the two, say positively correlated, that atypical association should contribute to the "anomaly score".
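To make the request concrete, here is a minimal sketch of the kind of score I am after. It assumes scikit-learn's `IsolationForest` fit on the development features with categoricals one-hot encoded; the column names `x1`, `x2`, and `seg` are made up for illustration, and I am not committed to this particular algorithm (its splits are axis-aligned, so it only flags the correlation flip insofar as those points land in sparse regions of the joint distribution).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Toy development data: x1 and x2 are inversely correlated, plus one categorical.
x1 = rng.normal(size=n)
x2 = -0.8 * x1 + 0.3 * rng.normal(size=n)
train = pd.DataFrame({"x1": x1, "x2": x2, "seg": rng.choice(["A", "B", "C"], size=n)})

# Preprocess mixed feature types: scale numerics, one-hot encode categoricals.
prep = ColumnTransformer([
    ("num", StandardScaler(), ["x1", "x2"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["seg"]),
])

# Fit the representativeness model on development features only (no outcome needed).
scorer = Pipeline([
    ("prep", prep),
    ("iforest", IsolationForest(n_estimators=500, random_state=0)),
])
scorer.fit(train)

# New monthly batch: x1 and x2 move together, unlike the development data.
new = pd.DataFrame({"x1": [1.8, -1.5], "x2": [1.7, -1.4], "seg": ["A", "B"]})

# score_samples is higher for "typical" points; negate so larger = more anomalous.
anomaly_score = -scorer.score_samples(new)
print(anomaly_score)
```

Something along these lines gives one number per incoming observation, but I would welcome alternatives that handle mixed numeric/categorical data and joint (not just marginal) atypicality more directly.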