I have a sequence of events that arrive at random times, say according to a Poisson process. Each event is annotated either 'good' or 'bad', and we can think of the annotations as coming from a Bernoulli($p$) process. I want to look for a change point in the probability $p$ that an event is 'good'.
In other words, here's my model for the underlying stochastic process. I expect that there are two probabilities $p_0,p_1$. Events arrive according to a Poisson process. Each event that arrives before time $t_0$ is annotated 'good' with probability $p_0$ and 'bad' with probability $1-p_0$ (independently of everything else); each event that arrives after time $t_0$ is annotated 'good' with probability $p_1$ and 'bad' with probability $1-p_1$. Thus, arrival times are Poisson and annotations are iid Bernoulli($p_0$) before time $t_0$ and iid Bernoulli($p_1$) afterwards. I get to observe the arrival time of each event and the annotation on it, but I don't know $t_0,p_0,p_1$. I'd like to infer the change point $t_0$.
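To make the model concrete, here is a minimal simulation sketch (the function and parameter names are my own):

```python
import numpy as np

def simulate_events(rate, t0, p0, p1, t_max, seed=None):
    """Simulate Poisson(rate) arrivals on [0, t_max] with the 'good'
    probability switching from p0 to p1 at time t0."""
    rng = np.random.default_rng(seed)
    # Conditional-uniform property: given N ~ Poisson(rate * t_max) arrivals,
    # the arrival times are iid Uniform(0, t_max), sorted.
    n = rng.poisson(rate * t_max)
    times = np.sort(rng.uniform(0.0, t_max, size=n))
    # Annotations: Bernoulli(p0) before t0, Bernoulli(p1) after.
    p = np.where(times < t0, p0, p1)
    good = rng.random(n) < p
    return times, good

times, good = simulate_events(rate=2.0, t0=50.0, p0=0.8, p1=0.4,
                              t_max=100.0, seed=0)
```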
Slightly more generally, I'd like a hypothesis test that compares two hypotheses:

$H_0$ (no change): arrival times are Poisson and all event annotations are iid Bernoulli($p_0$) throughout.

$H_1$ (single change): arrival times are Poisson and event annotations are iid Bernoulli($p_0$) before time $t_0$ and iid Bernoulli($p_1$) afterwards.
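In likelihood terms (a sketch I find helpful; writing $x_1,\dots,x_n \in \{0,1\}$ for the annotations in arrival order, and noting that a change at time $t_0$ amounts to a split after some event index $k$), the comparison is between

$$L_0 = \max_{p_0} \prod_{i=1}^{n} p_0^{x_i}(1-p_0)^{1-x_i} \quad\text{and}\quad L_1 = \max_{k}\,\max_{p_0,p_1} \prod_{i=1}^{k} p_0^{x_i}(1-p_0)^{1-x_i} \prod_{i=k+1}^{n} p_1^{x_i}(1-p_1)^{1-x_i},$$

i.e., a generalized likelihood ratio, where the maximization over $k$ is what distinguishes this from a single two-sample test.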
How would I go about this? Is there an existing statistical technique or test that would be appropriate for this?
Even better would be to find methods that are robust to departures from Poisson arrival times, or that don't assume anything about the arrival process for events.
Approaches I've considered:
I considered aggregating by some time period, say by month, and computing for each month the fraction of events that were 'good' in that month (the number of 'good' events in that month divided by the total number of events in that month). This gives a time series (that fraction as a function of time), and one could apply standard change-point detection to it. However, this seems to throw away information. It can also be problematic: if a month contains only one or two events, the fraction may be driven to 0 or 1 simply by the small number of observations, which seems likely to produce spurious change points. So it would be nice to have something that takes into account the number of events in each month, or that avoids the need for aggregation in the first place.
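For concreteness, a minimal sketch of the aggregation step (the bin width and names are my own; a real "month" binning would use calendar dates):

```python
import numpy as np

def binned_fractions(times, good, bin_width=30.0):
    """Bin events into fixed-width windows ('months') and return, per bin,
    the fraction of 'good' events and the event count."""
    idx = (np.asarray(times) // bin_width).astype(int)   # bin index per event
    n_bins = idx.max() + 1
    counts = np.bincount(idx, minlength=n_bins)
    goods = np.bincount(idx, weights=np.asarray(good, dtype=float),
                        minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        frac = goods / counts          # NaN for empty bins
    return frac, counts
```

Returning the per-bin counts alongside the fractions at least makes the small-sample problem visible to whatever change-point method is applied downstream.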
I considered trying all possible values of $t_0$, using a statistical hypothesis test to check whether the fraction of 'good' events before $t_0$ differs from the fraction after $t_0$, and keeping all candidates with a statistically significant difference. Because of the multiple comparisons, it seems like I'd need to apply some correction, e.g., a Bonferroni correction. However, I worry this will be overly conservative (it might fail to detect valid changes); I don't know if there's a smarter correction that can be applied. I also anticipate that this will find many spurious hits (values of $t_0$ slightly before or after the real one will also show a statistically significant difference), though perhaps this could be handled through some kind of post-processing.
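A sketch of this scan (the choice of Fisher's exact test and the names are mine; any two-proportion test could be substituted):

```python
import numpy as np
from scipy.stats import fisher_exact

def scan_splits(good, alpha=0.05):
    """Fisher's exact test at every split index k (events before vs. after),
    with a Bonferroni correction over the n - 1 candidate splits."""
    good = np.asarray(good, dtype=int)
    n = len(good)
    n_tests = n - 1
    hits = []
    for k in range(1, n):
        g_before = good[:k].sum()
        g_after = good[k:].sum()
        table = [[g_before, k - g_before],          # before: good, bad
                 [g_after, (n - k) - g_after]]      # after:  good, bad
        _, p = fisher_exact(table)
        if p < alpha / n_tests:                     # Bonferroni
            hits.append((k, p))
    return hits
```

In my experiments along these lines, this returns a cluster of significant $k$ values around the true change; one crude post-processing step is to keep only the $k$ with the smallest $p$-value.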
I don't really like the answers.

Maybe it's overkill, but I think of this as a hidden Markov model with state A transitioning with probability $r$ (at each step) to an absorbing state B. State A emits Bernoulli($p_0$) RVs, while state B emits Bernoulli($p_1$) RVs. Then you can assign "reasonable" priors and use Baum-Welch or whatever to estimate the parameters and find the changepoint.
– Timothy Teräväinen Mar 03 '16 at 22:36
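A minimal sketch of that idea (all names and priors here are my own illustrative choices): because state B is absorbing, the hidden path is fully determined by the single change step $k$, so rather than running Baum-Welch one can compute the posterior over $k$ exactly, by putting the geometric prior implied by $r$ on $k$, putting Beta priors on $p_0,p_1$, and integrating the $p$'s out in closed form.

```python
import numpy as np
from scipy.special import betaln

def changepoint_posterior(good, r=0.01, a=1.0, b=1.0):
    """Posterior over the change step k (first index emitted by state B),
    with a geometric(r) prior on k and Beta(a, b) priors on p0 and p1.
    k = n means 'no change observed in this window'."""
    good = np.asarray(good, dtype=int)
    n = len(good)
    cum = np.concatenate([[0], np.cumsum(good)])  # cum[i] = #good in good[:i]

    def seg_loglik(s, m):
        # Log marginal likelihood of s successes in m trials, Beta(a, b) prior.
        return betaln(a + s, b + m - s) - betaln(a, b)

    log_post = np.empty(n + 1)
    for k in range(n + 1):
        # Prior: (1-r)^k * r for k < n; (1-r)^n survival mass for k = n.
        prior = k * np.log1p(-r) + (np.log(r) if k < n else 0.0)
        before = seg_loglik(cum[k], k)               # state A emissions
        after = seg_loglik(cum[n] - cum[k], n - k)   # state B emissions
        log_post[k] = prior + before + after
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

post = changepoint_posterior(good)
k_hat = post.argmax()   # MAP estimate of the change step
```

The arg-max gives a point estimate of the change step, and the mass at $k = n$ roughly plays the role of the "no change" hypothesis, which speaks to the hypothesis-test version of the question as well.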