I am trying to do some anomaly detection between data sets using Python and sklearn (but other package suggestions definitely welcome!).

I have 10 sets of data. Each set consists of torque values collected from one tire (so 10 tires in total), and is pretty much just the tire_id, timestamp, and sig_value (the value from the signal, or the sensor). Sample data for one set looks like this:

tire_id        timestamp        sig_value
tire_1           23:06.1            12.75
tire_1           23:07.5                0
tire_1           23:09.0            -10.5

Now I have 10 of them, and 2 of them behave strangely. I understand that this is an anomaly detection problem, but most of the articles I read online detect anomalous points within a single data set (i.e. points where the torque values are not normal for that same tire).

To detect which 2 tires are behaving abnormally, I tried a clustering method, namely k-means clustering (since it's unsupervised).

To prepare the data for k-means clustering, I calculated the following for each data set (i.e. for each tire):

  1. the top 3 sets of adjacent local maximum and local minimum with highest amplitude (difference)
  2. mean of torque value
  3. std of torque value

I also set the number of clusters to 2, so each row ends up in either cluster 0 or cluster 1.
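For reference, the feature step described above can be sketched roughly as follows. This is a minimal illustration and not the code that produced the results below: the function name `tire_features` and the `top_k` parameter are hypothetical, it uses plain NumPy, and it treats a sign change in the first difference as a local extremum (so flat stretches are ignored). Unlike the three-rows-per-tire layout shown below, it packs the three amplitudes into a single vector, which gives exactly one row per tire.

```python
import numpy as np

def tire_features(sig, top_k=3):
    """Summarise one tire's torque signal as a single feature vector.

    Returns [amp_1, ..., amp_top_k, mean, std], where amp_i are the largest
    differences between adjacent local extrema. Hypothetical helper.
    """
    sig = np.asarray(sig, dtype=float)
    d = np.diff(sig)
    # A sign change in the first difference marks a turning point
    # (local maximum or minimum); plateaus are skipped.
    turning = np.where(d[:-1] * d[1:] < 0)[0] + 1
    extrema = sig[turning]
    # Differences between consecutive turning points are the swing amplitudes.
    amps = np.sort(np.abs(np.diff(extrema)))[::-1][:top_k]
    top = np.zeros(top_k)            # zero-filled if fewer than top_k swings exist
    top[:len(amps)] = amps
    return np.concatenate([top, [sig.mean(), sig.std()]])

# Toy signal, made up for illustration only.
features = tire_features([0, 5, -5, 3, -1, 2, 0])
```

Stacking one such vector per tire gives a 10-row matrix, so whatever clustering is applied afterwards assigns one label per tire rather than one label per extremum pair.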

So my end result (after assigning cluster numbers) looks like the following:

        amplitude  local maxima  local minima  sig_value_std  sig_value_average  cluster
tire_0     558.50        437.75       -120.75      77.538645          12.816990        0
tire_0     532.75        433.75        -99.00      77.538645          12.816990        0
tire_0     526.25        438.00        -88.25      77.538645          12.816990        0
tire_1     552.50       -116.50        436.00      71.125912          11.588038        1
tire_1     542.75        439.25       -103.50      71.125912          11.588038        0

Now I'm not sure what to do with this result. Each tire has 3 rows of data, because I picked the 3 local max/min pairs with the largest amplitudes. Each row is assigned to a cluster on its own, so the rows for a single tire sometimes end up in different clusters. Also the cluster size is normally larger than just 2.

My questions are:

  1. How can I do anomaly detection on a whole "set of data", not just on individual data points?
  2. Is my approach reasonable/logical? If it is, how can I clean up my result to get what I want? And if not, what can I do to improve it?

Any help/guidance/tip/pointer is greatly appreciated!

Note:

I did read into this article on this forum: Comparison of time series sets

But I don't think it's quite the same as what I'm asking.

Roger V.

1 Answer

I think I would try to use dynamic time warping as the distance metric for whatever clustering you decide to use. K-means with k=2 with DTW as your distance metric should work well for this problem.

Dynamic time warping finds an optimal alignment between two time series by non-linearly stretching or compressing the time axis, and returns the cost of that alignment as a dissimilarity measure between them. https://en.wikipedia.org/wiki/Dynamic_time_warping
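To make the idea concrete, here is a minimal sketch of DTW as a tire-to-tire distance, written in plain NumPy. The function names (`dtw_distance`, `rank_by_mean_dtw`) and the toy signals are made up for illustration. Note that instead of k-means it uses a simpler stand-in: it ranks each tire by its mean DTW distance to all the other tires, on the assumption that the abnormal tires look unlike the majority. That assumption can fail if the abnormal tires resemble each other more than they differ from the rest.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping cost between two 1-D series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def rank_by_mean_dtw(series):
    """Return (name, mean DTW distance to the others), most isolated first.

    `series` maps a tire name to its sequence of sig_value readings.
    """
    names = list(series)
    k = len(names)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(series[names[i]], series[names[j]])
    mean_dist = dist.sum(axis=1) / (k - 1)
    order = np.argsort(mean_dist)[::-1]
    return [(names[i], float(mean_dist[i])) for i in order]

# Toy data, made up for illustration: three similar signals and one odd one.
toy = {
    "tire_a": [0, 1, 0, 1],
    "tire_b": [0, 1, 0, 1],
    "tire_c": [0, 1, 1, 0, 1],   # same shape, slightly stretched in time
    "tire_odd": [5, 5, 5, 5],
}
ranking = rank_by_mean_dtw(toy)
```

With real data, the two tires at the top of the ranking would be the candidates to inspect. The double loop is slow for long signals, so downsampling first may be needed. For actual k-means with DTW, the tslearn package provides `TimeSeriesKMeans` with `metric="dtw"`, which may be worth trying instead of this hand-rolled version.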