I am trying to do anomaly detection across data sets using Python and sklearn (other package suggestions are definitely welcome!).
I have 10 data sets, each consisting of torque values collected from one tire (so 10 tires in total). Each data set contains just the tire_id, the timestamp, and the sig_value (the value of the sensor signal). Sample data for one set looks like this:
tire_id timestamp sig_value
tire_1 23:06.1 12.75
tire_1 23:07.5 0
tire_1 23:09.0 -10.5
Of these 10 tires, 2 behave strangely. I understand that this is an anomaly detection problem, but most of the articles I have read online detect anomalous points within a single data set (i.e. points where the torque values are abnormal for the same tire).
To detect which 2 tires are behaving abnormally, I tried a clustering method, namely k-means clustering (since it's unsupervised).
To prepare the data for k-means clustering, for each data set (i.e. for each tire), I calculated:
- the top 3 pairs of adjacent local maximum and local minimum with the highest amplitude (difference)
- mean of torque value
- std of torque value
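A minimal sketch of this kind of feature preparation, offered as an illustration rather than the exact code used here. It assumes a DataFrame with `tire_id` and `sig_value` columns, uses `scipy.signal.find_peaks` to locate local extrema, and the helper names `top_extrema_pairs` and `tire_features` are hypothetical. The demo values are synthetic, not real torque readings.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks


def top_extrema_pairs(values, n_pairs=3):
    """Return the n_pairs adjacent max/min pairs with the largest amplitude."""
    values = np.asarray(values, dtype=float)
    max_idx, _ = find_peaks(values)    # indices of local maxima
    min_idx, _ = find_peaks(-values)   # indices of local minima
    # Merge both kinds of extrema and order them by position in the series.
    extrema = sorted([(int(i), "max") for i in max_idx] +
                     [(int(i), "min") for i in min_idx])
    pairs = []
    for (i, kind_i), (j, kind_j) in zip(extrema, extrema[1:]):
        if kind_i == kind_j:
            continue  # only pair a maximum with a neighbouring minimum
        hi, lo = max(values[i], values[j]), min(values[i], values[j])
        pairs.append({"amplitude": hi - lo, "local_max": hi, "local_min": lo})
    pairs.sort(key=lambda p: p["amplitude"], reverse=True)
    return pairs[:n_pairs]


def tire_features(df, n_pairs=3):
    """Build one feature row per extrema pair, indexed by tire_id."""
    rows = []
    for tire_id, group in df.groupby("tire_id"):
        sig = group["sig_value"]
        for pair in top_extrema_pairs(sig.to_numpy(), n_pairs):
            rows.append({"tire_id": tire_id, **pair,
                         "sig_value_std": sig.std(),
                         "sig_value_average": sig.mean()})
    return pd.DataFrame(rows).set_index("tire_id")


# Synthetic demo data (not real torque readings).
demo = pd.DataFrame({
    "tire_id": ["tire_a"] * 7,
    "sig_value": [0, 5, -3, 4, -1, 2, 0],
})
features = tire_features(demo)
```

Note that this layout produces several rows per tire (one per extrema pair), with the mean and std repeated on each row.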
I also set the number of clusters to only 2, so each row ends up in either cluster 0 or cluster 1.
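A minimal sketch of such a clustering step with `sklearn.cluster.KMeans`, again only an illustration. The feature values and the tire names `tire_a` and `tire_b` are synthetic and hypothetical, and standardizing the columns with `StandardScaler` is an added assumption (the description above does not say whether the features were scaled).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic feature rows (hypothetical values): three rows per tire.
features = pd.DataFrame(
    {
        "amplitude": [558.5, 532.75, 526.25, 900.0, 910.0, 905.0],
        "sig_value_std": [77.5, 77.5, 77.5, 120.0, 120.0, 120.0],
        "sig_value_average": [12.8, 12.8, 12.8, 30.0, 30.0, 30.0],
    },
    index=["tire_a"] * 3 + ["tire_b"] * 3,
)

# Assumption: put all columns on a comparable scale before clustering,
# since k-means relies on Euclidean distances.
scaled = StandardScaler().fit_transform(features)

# Two clusters, as described above; random_state fixes the initialization.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
features["cluster"] = kmeans.fit_predict(scaled)
```

The cluster numbers themselves are arbitrary labels, so which group is called 0 and which is called 1 carries no meaning.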
So my end result (after assigning cluster numbers) looks like the following:
        amplitude  local maxima  local minima  sig_value_std  sig_value_average  cluster
tire_0     558.50        437.75       -120.75      77.538645          12.816990        0
tire_0     532.75        433.75        -99.00      77.538645          12.816990        0
tire_0     526.25        438.00        -88.25      77.538645          12.816990        0
tire_1     552.50       -116.50        436.00      71.125912          11.588038        1
tire_1     542.75        439.25       -103.50      71.125912          11.588038        0
Now my question is what to do with this result. Each tire has 3 rows of data, because I picked the 3 local max/min pairs with the largest amplitudes. That means each row gets its own cluster number, and sometimes the rows of a single tire are even assigned to different clusters. Also, the cluster sizes are normally larger than just 2.
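To make the split-assignment issue concrete, here is a small sketch that counts cluster labels per tire with `pandas.crosstab`, using only the tire/cluster values from the rows shown above. The variable names are hypothetical, and this only surfaces the problem rather than solving it.

```python
import pandas as pd

# Cluster labels copied from the rows shown above.
labels = pd.DataFrame({
    "tire_id": ["tire_0", "tire_0", "tire_0", "tire_1", "tire_1"],
    "cluster": [0, 0, 0, 1, 0],
})

# Rows = tires, columns = cluster numbers, cells = number of rows.
counts = pd.crosstab(labels["tire_id"], labels["cluster"])

# Tires whose rows landed in more than one cluster.
split_tires = counts.index[(counts > 0).sum(axis=1) > 1].tolist()
print(split_tires)  # ['tire_1']
```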
My questions are:
- How can I do anomaly detection on whole "sets of data", not just on individual data points?
- Is my approach reasonable/logical? If it is, how can I clean up my result to get what I want? And if not, what can I do to improve?
Any help/guidance/tip/pointer is greatly appreciated!
Note:
I did read this post on this forum: Comparison of time series sets
But I don't think it's quite the same as what I'm asking.