
Consider this dataset: [image: scatter plot of the dataset]

Where the x-axis is some sort of "time" measure, and the y-axis is some sort of "value" measure.

E.g. "time taken for website to respond" vs. "time of day", hypothetically speaking, of course ;)

It is very clear to the human eye that there was a period where the behaviour of the value measure was qualitatively and quantitatively different.

Now, it happens that this data is randomly generated Gaussian numbers, and that the middle section does indeed have a different mean and SD. And as a human I could eyeball the data, draw lines at 100 & 200, run the stats on the distinct samples, and show that.

But how could I (indeed, CAN I!?) achieve that outcome without graphing it? Or how could I do it if the pattern was less clear? Say the SD on both sets was a lot bigger, so that the sets merged more and couldn't be eyeballed? Can this distinction be identified objectively?

Brondahl
  • What you are looking for is a change-detection algorithm. A lot of research has been carried out around this topic, but in practice I've found almost all of these attempts to be very compute-intensive and to throw up a lot of false positives. The most reasonable approach is to develop the algorithm once you have the actual data that you want to run it on. Here's an example: http://www-users.cs.umn.edu/~sboriah/PDFs/ChenCSBCK2013.pdf – Arun Jose Oct 17 '16 at 08:12

3 Answers


Your problem is related to a research field in time series analysis (change detection) for which there are a number of sophisticated methods. But a more abstract problem formulation can use space, or any other ordinal variable, instead of time.

All the approaches depend on what kind of data you have: the more you know about how your data is generated and its context, the more suitable the methods you can find.

Let's break it down into a static and a dynamic part.

First we generate some data (I'm using the dplyr and ggplot2 packages, but you can handle the data by other means):

datasize <- 100  # points per segment; not defined in the original snippet

mydata <- dplyr::bind_rows(
    data.frame(time = sample(1:100, size = datasize, replace = TRUE),
               value = rnorm(datasize, mean = 100, sd = 10),
               group = 1),
    data.frame(time = sample(100:200, size = datasize, replace = TRUE),
               value = rnorm(datasize, mean = 120, sd = 5),
               group = 2),
    data.frame(time = sample(200:300, size = datasize, replace = TRUE),
               value = rnorm(datasize, mean = 100, sd = 8),
               group = 3))

library(ggplot2)  # ggplot() and friends are called unqualified below

ggplot(mydata, aes(x = time, y = value)) +
  geom_point() +
  ylim(c(0, 180))

[image: scatter plot of the generated data]


Static approach (and unsupervised)

So you "see" clusters of points that are slightly shifted. That is actually independent of "time" - just consider time and value "static" random variables. What you can do is simple clustering. In order to also get some parameter estimates we can consider the data value as a mixture of Gaussians and use the EM Algorithm (package mixtools required):

mixn <- mixtools::normalmixEM(mydata$value, k = 3, maxit = 10^4)
mixn[c("mu", "sigma")]

$mu
[1] 100.8723 104.5275 120.0161

$sigma
[1] 9.5199389 0.9243747 5.1829086

The means are captured more or less accurately, but the sigmas not so much. Anyhow, your mixture would look like this:

[image: fitted mixture density over the value histogram]

You would then, for every point, compute its likelihood of belonging to each component, and choose the class with the maximal likelihood.
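
A small sketch of that assignment step: normalmixEM returns a posterior matrix of membership probabilities, so the class choice is a row-wise which.max (comparing against the true group labels is only possible here because we generated the data ourselves):

mydata$class <- apply(mixn$posterior, 1, which.max)  # component with max posterior
table(mydata$class, mydata$group)                    # compare to the true groups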

Furthermore, we can compute more precise clusters using the value and time variables together, again from a static point of view.

mixmvn <- mixtools::mvnormalmixEM(as.matrix(mydata[c("time", "value")]), k = 3)

[image: clusters from the bivariate mixture of time and value]

Although the result looks nice, you need to deal with covariance matrices for the interpretation.
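
For the record, a small sketch of that inspection (the slot names are mixtools conventions; each component now has a 2-d mean vector and a 2x2 covariance matrix over time and value):

mixmvn$mu     # list of component mean vectors (time, value)
mixmvn$sigma  # list of 2x2 covariance matrices
mydata$class2 <- apply(mixmvn$posterior, 1, which.max)  # same assignment rule as before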


Dynamic approach (and supervised)

What you may be looking for are Hidden Markov Models. In their basic version they model discrete states with discrete observations. They can be extended to continuous observations. (Btw: continuous states with continuous observations lead to the topic of Kalman filters.)

To estimate the parameters we can use the package RHmm, which at the moment can only be installed from the archive (RHmm archive) via R CMD INSTALL RHmm_2.0.3.tar.gz.

# we need to order the values by time
orderedValues <- dplyr::arrange(mydata, time)$value

hmm <- RHmm::HMMFit(obs = orderedValues, nStates = 3, dis = "NORMAL",
                    list(iter = 1000, nInit = 100), asymptCov = FALSE)

The results are very nice:

> hmm$HMM$distribution

Model:
3 states HMM with univariate gaussian distributions

Distribution parameters:
             mean      var
State 1 120.28424 24.27617
State 2  97.56794 72.53311
State 3 106.79554 31.86263
> sqrt(hmm$HMM$distribution$var) # sigma
[1] 4.927085 8.516637 5.644699

The means as well as sigmas are estimated more or less correctly. We can use predict for new samples in order to assign a state if we want.
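
If I recall the RHmm API correctly, there is also a viterbi() function that decodes the most likely state sequence; treat this sketch as an assumption to check against the archived documentation:

# decode the most likely hidden state for each ordered observation
vit <- RHmm::viterbi(hmm, orderedValues)
head(vit$states)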

All these methods have a disadvantage, however: they all need an initial estimate of the number of states.

If you are only interested in "change" in the dynamics (and have a lot of data), Hebb's rule may be your friend. AFAIK a Hebbian learner gets accustomed to your data, and you will see a spike if the sequential input changes "dramatically".
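
I have not tried Hebb's rule here myself, but a minimal sketch of the underlying idea (an online estimator that adapts to the input and spikes on strong deviations) could look like this; alpha and k are hypothetical tuning knobs:

# online detector: an exponentially weighted running mean/variance that
# "gets accustomed" to the input; a large standardized deviation is a spike
detectChanges <- function(x, alpha = 0.05, k = 3) {
  m <- x[1]
  v <- stats::var(x)          # crude initial variance estimate
  flags <- logical(length(x))
  for (i in seq_along(x)) {
    z <- (x[i] - m) / sqrt(v)
    flags[i] <- abs(z) > k                        # flag a "dramatic" change
    m <- (1 - alpha) * m + alpha * x[i]           # adapt the mean
    v <- (1 - alpha) * v + alpha * (x[i] - m)^2   # adapt the variance
  }
  which(flags)
}

detectChanges(orderedValues)  # indices where the input deviates strongly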

Any thoughts?

Drey

First you can check whether the process is heteroscedastic: https://en.wikipedia.org/wiki/Heteroscedasticity
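
As one concrete illustration, a Breusch-Pagan test is a standard way to check this in R; the lmtest package and the reuse of mydata from the answer above are my assumptions, not part of this answer:

# regress value on time, then test whether the residual variance depends on time
fit <- lm(value ~ time, data = mydata)
lmtest::bptest(fit)  # a small p-value suggests heteroscedasticity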

Alternatively, you can check whether there is a moving-average component: https://en.wikipedia.org/wiki/Moving_average

If there is, then how to proceed depends on the case, as Arun Jose has said. When designing algorithms you might want to look into window functions or kernels; a sketch of the window-function idea follows below. Kernels are better and more rigorous, but mathematically harder to understand, and thus to implement.
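
For example, a simple window-function approach is a rolling mean plus a rolling SD; a regime change shows up as a shift in either series. A sketch in base R (the window width w is a hypothetical choice; zoo::rollapply would do the same more conveniently):

x <- mydata$value[order(mydata$time)]         # observations ordered by time
w <- 25                                       # window width (tuning knob)
rollMean <- stats::filter(x, rep(1 / w, w))   # centered moving average
rollSD <- sapply(seq_along(x), function(i) {
  idx <- max(1, i - w %/% 2):min(length(x), i + w %/% 2)
  sd(x[idx])                                  # SD within the local window
})
plot(rollMean, type = "l")  # level shift in the middle segment
plot(rollSD, type = "l")    # variance shift as well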

user3644640

Yes, you can do it!

What you're looking for is machine learning, a set of test data and a good enough algorithm.

Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. (source)

If you have some programming expertise, then I recommend you start here. These tutorials are very easy to follow and you'll get a basic idea of ML itself.

atefth
  • It is ML, but not the simple stuff and definitely not classification; it's more of a regression type of problem. Don't be fooled, ML is just a buzzword for applied statistics. You just optimize a set of estimated distributions and apply the "learned" model to new data. – user3644640 Oct 17 '16 at 09:04
  • The downvoter wasn't even me; your answer is really bad. It is just that classification finds different sets. If you applied it to a time series as (x, y) = (time, value) it would not make sense, because time series are long and ordered. Or did you mean some other estimators? ML is not some magic black box; it is simply optimizing distributions and choosing the most probable classification/value for a new data point. All ML methods are statistics applied to that specific domain. – user3644640 Oct 17 '16 at 09:57