Change and anomaly detection

Question

I recently wrote a program that graphed data points so that a user could scroll through them and find "interesting" parts of the data.

Now I am looking at ways to make it even simpler by making a table of values deemed to be interesting. These are usually either the value switching to a new value, or spiking and returning to the old one. The main problem is that I have to take into account a lot of noise.

Since I'm already saving hours of data trawling, erring on the side of flagging too much (i.e. being trigger-happy) is fine.

The algorithms I'm looking at are k-nearest neighbour and Local Outlier Factor.

I am considering running both and then merging their results: detections within +-100 samples of each other would count as a single event, and events detected by both would get a higher rating. Or something along those lines.
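The merging idea above can be sketched as follows. This is only an illustration under assumptions: `merge_detections`, its parameters, and the 1-or-2 rating are hypothetical names and choices, and the two input lists stand for sample indices reported by any two detectors.

```python
def merge_detections(a, b, window=100):
    """Merge detection indices from two detectors (hypothetical helper).

    Detections within `window` samples of the first index of an event are
    folded into that event. The rating is the number of detectors that
    reported the event: 2 if both did, otherwise 1.
    """
    tagged = sorted([(i, "a") for i in a] + [(i, "b") for i in b])
    events = []  # each entry: [first_index, set_of_sources]
    for idx, src in tagged:
        if events and idx - events[-1][0] <= window:
            events[-1][1].add(src)       # same event, remember who saw it
        else:
            events.append([idx, {src}])  # start a new event
    return [(start, len(sources)) for start, sources in events]


if __name__ == "__main__":
    for start, rating in merge_detections([1000, 5000], [1040, 9000]):
        print(start, rating)
```

Sorting the resulting table by rating would put the events that both detectors agree on at the top.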

What algorithms are there I should be looking at? And what would you recommend as the best algorithm?

Edit: More specific information:

The values are read at ~20ms intervals (treating the interval as constant is OK), and are between +-10 (as single precision floats). The amount of noise is unknown, but is pretty consistent (i.e. same magnitude of noise throughout), and usually jumps between two values.

A typical set of data is 120000+ records long (records are read 1 at a time, so the whole set, or even its size, is not known in advance). Changes can occur anywhere in that set, and at any frequency, though a change will usually be >2 (that fact is not reliable) and changes aren't usually within 1000 records of each other (again, not reliable). Typically the change or spike is <2 records long, and there are few gradual changes (though the possibility of them cannot be ruled out). Changes and spikes are always obvious to a human, and since this is to point a human in the right direction it only needs to be accurate to +- 100 records.

At the moment I am looking at something along the lines of:

1 - Compute the mean and standard deviation from 100 records.
2 - For each following value, test whether it is within 2 standard deviations of the mean.
3 - If not, print the change and start from 1 again (ignoring 20 records).

That would be simple to implement, and would show gradual changes as being changes. I could add more complex checks to find the type of exception (spike or change), and the 20 doesn't need to be a constant.
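A minimal sketch of the three steps described above, assuming the records arrive as a Python iterable of floats. The function name `detect_changes` and the parameter names are illustrative choices, not part of any library; the defaults (100, 2, 20) are the numbers from the description.

```python
import statistics


def detect_changes(samples, baseline_len=100, n_sigmas=2.0, skip=20):
    """Yield (index, value, mean, std) for each value outside the band.

    `samples` may be any iterable, so records can be read one at a time
    without knowing the size of the whole set.
    """
    baseline = []      # records collected for the current baseline
    mean = std = None  # set once the baseline is complete
    to_skip = 0        # records still to ignore after a detection
    for i, x in enumerate(samples):
        if to_skip > 0:
            to_skip -= 1
            continue
        if mean is None:
            # Step 1: gather baseline_len records, then compute statistics.
            baseline.append(x)
            if len(baseline) == baseline_len:
                mean = statistics.fmean(baseline)
                std = statistics.stdev(baseline)
            continue
        # Step 2: test the value against the baseline band.
        if abs(x - mean) > n_sigmas * std:
            # Step 3: report, ignore `skip` records, rebuild the baseline.
            yield i, x, mean, std
            baseline, mean, std = [], None, None
            to_skip = skip


if __name__ == "__main__":
    # Synthetic data: about 5 (+-0.1) for 300 records, then about 7.
    data = [5.0 + 0.1 * (-1) ** i for i in range(300)]
    data += [7.0 + 0.1 * (-1) ** i for i in range(300)]
    for index, value, mean, std in detect_changes(data):
        print(f"record {index}: {value:.2f} (baseline {mean:.2f} +- {std:.2f})")
```

If the noise were roughly Gaussian, a 2 standard deviation band would flag about 1 in 20 ordinary values, so `n_sigmas` is likely worth raising when the output is too trigger-happy. Any change that happens while the baseline is being rebuilt goes unreported in this sketch.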

@SureshVenkat, sorry, How specific must I be when defining an outlier? The problem is I don't know exactly what I am looking for. To a human it is obvious that a value fluctuating at about 5 (+-1) changing to 7 and fluctuating at +-1 again is an important thing, but how to recognise this I don't know. There may be more noise, and there may be less. — , Sep 26 '11 at 18:47
Well, that's my point. Such discussion is in the realm of modelling, and it's not a priori clear what TCS brings to the table at this stage in your investigation. By asking the folks at stats.SE, you might find statistical models that make sense. — Suresh Venkatasubramanian, Sep 26 '11 at 20:14
@Mat you might also want to try quant.SE since they probably deal with this sort of problem all the time and are familiar with algorithms. It might be possible to migrate your question if you desire (but check with a moderator like Suresh) — Artem Kaznatcheev, Sep 27 '11 at 05:03
I've flagged it as off topic... @SureshVenkat - can you please migrate it? — , Sep 27 '11 at 07:03
Finding a "best" algorithm, or even a good one, depends on knowing the nature of these "data points." Are they obtained at regular intervals or irregular ones? What is the statistical nature of the "noise" (its distribution, its serial correlation, etc)? Will the processes be similar to one another or could they differ radically? How precisely do you need to detect the onset or termination of the "interesting" values? Perhaps you could post a sample graph? — whuber, Sep 27 '11 at 15:06

score 2 · Accepted Answer · edited Apr 13 '17 at 12:44

You might want to look at "Best algorithm for classifying time series motor data" and review my answer and my comments on change point detection. Essentially, if you are after a violation in the mean of the errors, one uses Intervention Detection schemes documented by Ruey Tsay and others (including myself!). If you are after variance change detection, then this can arise in a number of ways, e.g. changes in parameters over time, emerging dependence between the expected value and the variability of the errors, actual points in time where the variance changes, and even the need for an ARIMA model to provide structure for the squared errors (GARCH).
