Suppose I have a data set containing many rows and one column like this:
## value
## 1 4.315975
## 2 4.416492
## 3 4.305562
## 4 4.339515
## 5 4.313470
## 6 4.412710
I can summarize how anomalous each row is by adding another column of standard scores (z-scores), i.e. each value's distance from the column mean in units of standard deviations:
## value normalized_value
## 1 4.315975 -0.6813408
## 2 4.416492 1.2954239
## 3 4.305562 -0.8861225
## 4 4.339515 -0.2184037
## 5 4.313470 -0.7306040
## 6 4.412710 1.2210472
I can then consider a row to be anomalous if the normalized_value of that row is less than, say, -3.0 or greater than 3.0.
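For concreteness, here's a minimal sketch of that computation in R, assuming the data is in a data frame named df with a column value (both names hypothetical):

# Standard score: distance from the mean in units of standard deviations
df$normalized_value <- (df$value - mean(df$value)) / sd(df$value)
# Flag rows whose normalized value falls outside +/- 3.0
df$is_anomalous <- abs(df$normalized_value) > 3.0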
Now suppose I have two or more columns of values and also their normalized values like this:
## value_1 norm_value_1 value_2 norm_value_2
## 1 4.315975 -0.6813408 2.782256 -1.7706630
## 2 4.416492 1.2954239 2.900394 -0.1166862
## 3 4.305562 -0.8861225 2.886051 -0.3174936
## 4 4.339515 -0.2184037 2.940737 0.4481311
## 5 4.313470 -0.7306040 2.967396 0.8213673
## 6 4.412710 1.2210472 2.975537 0.9353443
Are there any standard techniques for expressing the 'anomalousness' of two or more columns as a single number for each row?
We can't make a combined column by simply averaging the two normalized columns: since we're looking for anomalies, a non-anomalous value in one column could cancel out an anomalous value in the other. If any value in a row is anomalous, we want to be certain that this shows up in the combined column. So we could set the combined value for a given row to the maximum of the absolute values of that row's normalized values.
abs_max_abs_func <- function(x, y) max(abs(x), abs(y))
# mapply pairs the two columns up element-wise; sapply with y = norm_value_2
# would pass the whole second column to every call
combined_value <- mapply(abs_max_abs_func, norm_value_1, norm_value_2)
## value_1 norm_value_1 value_2 norm_value_2 combined_value
## 1 4.315975 -0.6813408 2.782256 -1.7706630 1.7706630
## 2 4.416492 1.2954239 2.900394 -0.1166862 1.2954239
## 3 4.305562 -0.8861225 2.886051 -0.3174936 0.8861225
## 4 4.339515 -0.2184037 2.940737 0.4481311 0.4481311
## 5 4.313470 -0.7306040 2.967396 0.8213673 0.8213673
## 6 4.412710 1.2210472 2.975537 0.9353443 1.2210472
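As an aside, base R's pmax would make this combination a vectorized one-liner (assuming norm_value_1 and norm_value_2 are plain numeric vectors, as above):

# Element-wise maximum of the absolute normalized values
combined_value <- pmax(abs(norm_value_1), abs(norm_value_2))

It also extends to any number of columns via do.call(pmax, lapply(list_of_norm_cols, abs)), where list_of_norm_cols is a hypothetical list of the normalized columns.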
Note that another reasonable approach would be to set the combined value to whichever normalized value has the larger absolute value, keeping its sign (that is, we don't take the absolute value).
max_abs_func <- function(x, y) if (abs(x) >= abs(y)) x else y
# again pair the columns element-wise with mapply rather than sapply
combined_value <- mapply(max_abs_func, norm_value_1, norm_value_2)
## value_1 norm_value_1 value_2 norm_value_2 combined_value
## 1 4.315975 -0.6813408 2.782256 -1.7706630 -1.7706630
## 2 4.416492 1.2954239 2.900394 -0.1166862 1.2954239
## 3 4.305562 -0.8861225 2.886051 -0.3174936 -0.8861225
## 4 4.339515 -0.2184037 2.940737 0.4481311 0.4481311
## 5 4.313470 -0.7306040 2.967396 0.8213673 0.8213673
## 6 4.412710 1.2210472 2.975537 0.9353443 1.2210472
The problem with this approach (if indeed it is a problem) is that there are more discontinuities: the combined value can jump from a large negative value to a large positive one, or vice versa. But we do lose less information this way, since the sign of the dominant anomaly is preserved.
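For completeness, this signed variant can also be vectorized with ifelse (same assumptions about the vectors as above):

# Keep whichever normalized value is larger in magnitude, preserving its sign
combined_value <- ifelse(abs(norm_value_1) >= abs(norm_value_2),
                         norm_value_1, norm_value_2)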
Another possible approach is to work with the probability of each value. Since these are real values and probably all unique, the probability of any exact value is essentially zero. However, we could bin the values and estimate the probability of a value landing in the bin it is in (the number of values in that bin divided by the total number of values). Then, for each row, multiply the bin probabilities across columns; a small product would mark the row as anomalous. This assumes the columns are independent, but even if they aren't completely independent, it might still give reasonable values.
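Here is a rough sketch of that binning idea, assuming the data is in a data frame df with columns value_1 and value_2, and that ten bins are appropriate (the names, the helper bin_prob, and the bin count are all hypothetical):

bin_prob <- function(x, n_bins = 10) {
  bins <- cut(x, breaks = n_bins)                    # assign each value to a bin
  counts <- table(bins)                              # number of values per bin
  as.numeric(counts[as.integer(bins)]) / length(x)   # P(a value lands in its bin)
}
# Multiply the per-column bin probabilities for each row; smaller products
# suggest more anomalous rows.
combined_prob <- bin_prob(df$value_1) * bin_prob(df$value_2)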
Any suggestions? I'm sure someone must have tried to do this sort of thing.