Suppose I have a data set containing many rows and one column like this:
## value
## 1 4.315975
## 2 4.416492
## 3 4.305562
## 4 4.339515
## 5 4.313470
## 6 4.412710
I can summarize how anomalous each row is by adding another column of standard scores (z-scores), i.e. each value's distance from the column mean in units of standard deviations:
## value normalized_value
## 1 4.315975 -0.6813408
## 2 4.416492 1.2954239
## 3 4.305562 -0.8861225
## 4 4.339515 -0.2184037
## 5 4.313470 -0.7306040
## 6 4.412710 1.2210472
I can then consider a row to be anomalous if the normalized_value of that row is less than, say, -3.0 or greater than 3.0.
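For concreteness, here's a minimal sketch of that computation in R, assuming the data is in a data frame named df with a column value (both names hypothetical):

# Standard score: distance from the mean in units of standard deviations
df$normalized_value <- (df$value - mean(df$value)) / sd(df$value)
# Flag rows whose normalized value falls outside +/- 3.0
df$is_anomalous <- abs(df$normalized_value) > 3.0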
Now suppose I have two or more columns of values and also their normalized values like this:
## value_1 norm_value_1 value_2 norm_value_2
## 1 4.315975 -0.6813408 2.782256 -1.7706630
## 2 4.416492 1.2954239 2.900394 -0.1166862
## 3 4.305562 -0.8861225 2.886051 -0.3174936
## 4 4.339515 -0.2184037 2.940737 0.4481311
## 5 4.313470 -0.7306040 2.967396 0.8213673
## 6 4.412710 1.2210472 2.975537 0.9353443
Are there any standard techniques for expressing the 'anomalousness' of two or more columns as a single number for each row?
We can't make a combined column by simply averaging the two normalized columns: since we're looking for anomalies, a non-anomalous value in one column could cancel out an anomalous value in the other. If any value in a row is anomalous, we want to be certain that this shows up in the combined column. So we could set the combined value for a given row to the maximum of the absolute values of that row's normalized values.
abs_max_abs_func <- function(x, y) max(abs(x), abs(y))
# mapply pairs the two columns up element-wise; sapply with y = norm_value_2
# would pass the whole second column to every call
combined_value <- mapply(abs_max_abs_func, norm_value_1, norm_value_2)
## value_1 norm_value_1 value_2 norm_value_2 combined_value
## 1 4.315975 -0.6813408 2.782256 -1.7706630 1.7706630
## 2 4.416492 1.2954239 2.900394 -0.1166862 1.2954239
## 3 4.305562 -0.8861225 2.886051 -0.3174936 0.8861225
## 4 4.339515 -0.2184037 2.940737 0.4481311 0.4481311
## 5 4.313470 -0.7306040 2.967396 0.8213673 0.8213673
## 6 4.412710 1.2210472 2.975537 0.9353443 1.2210472
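As an aside, base R's pmax would make this combination a vectorized one-liner (assuming norm_value_1 and norm_value_2 are plain numeric vectors, as above):

# Element-wise maximum of the absolute normalized values
combined_value <- pmax(abs(norm_value_1), abs(norm_value_2))

It also extends to any number of columns via do.call(pmax, lapply(list_of_norm_cols, abs)), where list_of_norm_cols is a hypothetical list of the normalized columns.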
Note that another reasonable approach would be to set the combined value to whichever normalized value has the larger absolute value, keeping its sign (that is, we don't take the absolute value).
max_abs_func <- function(x, y) if (abs(x) >= abs(y)) x else y
# again pair the columns element-wise with mapply rather than sapply
combined_value <- mapply(max_abs_func, norm_value_1, norm_value_2)
## value_1 norm_value_1 value_2 norm_value_2 combined_value
## 1 4.315975 -0.6813408 2.782256 -1.7706630 -1.7706630
## 2 4.416492 1.2954239 2.900394 -0.1166862 1.2954239
## 3 4.305562 -0.8861225 2.886051 -0.3174936 -0.8861225
## 4 4.339515 -0.2184037 2.940737 0.4481311 0.4481311
## 5 4.313470 -0.7306040 2.967396 0.8213673 0.8213673
## 6 4.412710 1.2210472 2.975537 0.9353443 1.2210472
The problem with this approach (if indeed it is a problem) is that there are more discontinuities: the combined value can jump from a large negative value to a large positive one, or vice versa. But we do lose less information this way, since the sign of the dominant anomaly is preserved.
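For completeness, this signed variant can also be vectorized with ifelse (same assumptions about the vectors as above):

# Keep whichever normalized value is larger in magnitude, preserving its sign
combined_value <- ifelse(abs(norm_value_1) >= abs(norm_value_2),
                         norm_value_1, norm_value_2)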
Another possible approach is to work with the probability of each value. Since these are real values and probably all unique, the probability of any exact value is essentially zero. However, we could bin the values and estimate the probability of a value landing in the bin it is in (the number of values in that bin divided by the total number of values). Then, for each row, multiply the bin probabilities across columns; a small product would mark the row as anomalous. This assumes the columns are independent, but even if they aren't completely independent, it might still give reasonable values.
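Here is a rough sketch of that binning idea, assuming the data is in a data frame df with columns value_1 and value_2, and that ten bins are appropriate (the names, the helper bin_prob, and the bin count are all hypothetical):

bin_prob <- function(x, n_bins = 10) {
  bins <- cut(x, breaks = n_bins)                    # assign each value to a bin
  counts <- table(bins)                              # number of values per bin
  as.numeric(counts[as.integer(bins)]) / length(x)   # P(a value lands in its bin)
}
# Multiply the per-column bin probabilities for each row; smaller products
# suggest more anomalous rows.
combined_prob <- bin_prob(df$value_1) * bin_prob(df$value_2)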
Any suggestions? I'm sure someone must have tried to do this sort of thing.