
I have a dataset $D$ of binary values, with length $|D|$. There is some unknown $d \in \{1, 2, \ldots, |D|\}$ (usually, $d$ will be somewhere in the middle) such that, for any $i<d$, $D_i$ comes from a Bernoulli distribution with success probability $Q_1$, and for $i \geq d$, $D_i$ comes from a Bernoulli distribution with success probability $Q_2$. For the purposes of this question, assume $0 < Q_1 < 1/2 < Q_2 < 1$.

If I use a sliding window of size $k \ll |D|$ and take the sum within the window, I'll find that when the window is only on values with index $i < d$, the sum will hover around $kQ_1$, and when the window is only on values with index $i \geq d$, the sum will hover around $kQ_2$. As the window slides through indices around $i \approx d$, the sum will jump from $kQ_1$ to $kQ_2$, marking a change point. (I'm definitely open to other methods!)
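For concreteness, here is a rough R sketch of the sliding-window idea (all the specific values below, $n$, $d$, $Q_1$, $Q_2$, $k$, are made up for illustration):

# Illustrative simulation: change point at d, window of size k
set.seed(1)
n <- 1000; d <- 400; Q_1 <- 0.2; Q_2 <- 0.8; k <- 50
D <- c(rbinom(d - 1, 1, Q_1), rbinom(n - d + 1, 1, Q_2))

# Sliding-window sums: windows[i] is the sum of D[i], ..., D[i + k - 1]
windows <- sapply(1:(n - k + 1), function(i) sum(D[i:(i + k - 1)]))

# First window whose sum crosses the midpoint k * (Q_1 + Q_2) / 2.
# The crossing typically happens when about half the window lies past d,
# so d_hat + k / 2 is a rough estimate of d.
d_hat <- min(which(windows >= k * (Q_1 + Q_2) / 2))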

I want to use an algorithm like step detection to detect $d$, but I don't want to appeal to heuristics. I want to find a way to get within some bound of $d$ with at least a certain probability. This problem is easier than the general change point detection problem because there is exactly one change point, and I know the value of the mean before and after $d$ (so I know the magnitude of the change). Is this a known problem? I'm thinking that we can bound the probability of being within some multiple of $k$ from $d$, and that the calculation would involve the probability of getting a sum near $kQ_1$ when the binomial process has $p = Q_2$, and vice versa.
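To make that concrete, here is the kind of tail-bound calculation I have in mind (a rough sketch using Hoeffding's inequality and a midpoint threshold, both of which are just my choices). For a window of size $k$ lying entirely on one side of $d$, the windowed sum $S$ satisfies

$$\Pr\left(\left|\frac{S}{k} - Q_j\right| \ge \frac{Q_2 - Q_1}{2}\right) \le 2\exp\left(-\frac{k(Q_2 - Q_1)^2}{2}\right), \qquad j \in \{1, 2\},$$

so if each such window is classified by comparing $S/k$ to the midpoint $(Q_1 + Q_2)/2$, a union bound over the at most $|D|$ windows bounds the probability of any misclassification by $2|D|\exp\left(-k(Q_2 - Q_1)^2/2\right)$; on the complementary event, the first window classified as "high" starts within $k$ of $d$.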

This problem is also similar to this question, though they don't seem to obtain any complexity results there either.

Germ
  • (1) Why insist on using a sliding window method when it's possible some other method will be superior? (A standard approach is to evaluate a cumulative sum of log likelihood ratios, for instance.) (2) You seem not to distinguish between means and sums. – whuber Mar 02 '24 at 22:14
  • (1) I'm definitely open to other methods! The sliding window was just what I had in mind, since it seemed simple. (2) Thank you for the catch, I will update. – Germ Mar 02 '24 at 22:22
  • How large would $|D|$ typically be? – jbowman Mar 02 '24 at 22:26
  • $|D|$ would be very large. I'd like to know what happens in the large $|D|$ limit. – Germ Mar 03 '24 at 00:55

1 Answer


Let us consider this problem from a Bayesian perspective. We start with a prior distribution on the changepoint, let us say, $d \sim U(\{1, 2, \dots, |D|\})$. Given $d$, the probability of seeing $x_1$ successes from the Bernoulli distribution for $i < d$ and $x_2$ successes from the Bernoulli distribution for $i \geq d$ is:

$$p(x_1,x_2 | Q_1, Q_2, d, |D|) = p(x_1|Q_1, d-1)p(x_2|Q_2,|D|-d+1)$$

where the component distributions on the right-hand side are Binomial distributions. This gives us the likelihood function; given our uniform prior distribution, the posterior distribution of $d$ will be proportional to the likelihood function.
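Spelled out, writing $L(d)$ for the likelihood above (with the success counts $x_1$ and $x_2$ recomputed for each candidate $d$), the uniform prior gives

$$p(d \mid D) = \frac{L(d)}{\sum_{d'=1}^{|D|} L(d')},$$

which is exactly the normalization carried out in the code below.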

An example in R follows. We set $Q_1 = 0.25$, $Q_2 = 0.75$, $|D| = 100$ and $d = 30$.

Q_1 <- 0.25
Q_2 <- 0.75

# Change point at d = 30: indices 1-29 come from Q_1, indices 30-100 from Q_2
x <- c(rbinom(29, 1, Q_1), rbinom(71, 1, Q_2))

# Likelihood of the data as a function of the candidate change point d
lf <- function(d) {
  if (d > 1) {
    x1 <- sum(x[1:(d - 1)])   # successes before the candidate change point
    x2 <- sum(x) - x1         # successes from the candidate change point on
    retval <- dbinom(x1, d - 1, Q_1) * dbinom(x2, 101 - d, Q_2)
  } else {
    retval <- dbinom(sum(x), 100, Q_2)  # d = 1: all observations come from Q_2
  }
  retval
}

# Posterior over d: likelihood times the (uniform) prior, normalized
posterior <- rep(0, 100)
for (d in seq_along(posterior)) {
  posterior[d] <- lf(d)
}
posterior <- posterior / sum(posterior)

A plot of the posterior distribution, with vertical lines at the cumulative 10th, 50th, and 90th percentiles:

[plot: posterior distribution of $d$, with vertical lines at the cumulative 10th, 50th, and 90th percentiles]
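The plotting code isn't part of the original answer; a minimal sketch of how such a figure could be produced from the posterior vector above, with one straightforward way of reading off the percentiles from the cumulative posterior, is:

# Cumulative posterior and the smallest d reaching each target probability
cdf <- cumsum(posterior)
qtls <- sapply(c(0.1, 0.5, 0.9), function(p) min(which(cdf >= p)))

plot(posterior, type = "h", xlab = "d", ylab = "posterior probability")
abline(v = qtls, lty = 2)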

Note, however, that this is a good result given so little data. If your probabilities are closer together and you don't have many observations, the randomness inherent in the Bernoulli distribution can result in posteriors that look like this (here $Q_1 = 0.35$ and $Q_2 = 0.65$):

[plot: posterior distribution of $d$ for $Q_1 = 0.35$, $Q_2 = 0.65$]

The posterior mean is 17.5, and the true value is within the posterior 10%-90% range. Still, the posterior has a mode at $d=1$, thanks to the fact that the sum of the observations is 61, which is quite compatible with the data being drawn from a single Binomial$(100, 0.65)$ distribution.
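As a quick check of that compatibility: a Binomial$(100, 0.65)$ variable has mean $65$ and standard deviation $\sqrt{100 \cdot 0.65 \cdot 0.35} \approx 4.8$, so an observed sum of $61$ is less than one standard deviation below the mean.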

Of course, we aren't constrained to use a uniform prior on $d$. We now model our belief that the changepoint is more likely to be in the middle of the sample than near the tails using a rescaled $Beta(2,2)$ distribution:

[plot: rescaled $Beta(2,2)$ prior over $d$]

which, applied to the same data that generated the previous posterior, results in:

[plot: posterior distribution of $d$ under the rescaled $Beta(2,2)$ prior, same data as above]
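A minimal sketch of how such a rescaled prior could be combined with the likelihood function lf above (evaluating the Beta density at $(d - 0.5)/100$ is just one plausible way to rescale it to $\{1, \dots, 100\}$; the answer's exact code isn't shown):

# Rescaled Beta(2, 2) prior on d = 1, ..., 100
prior <- dbeta(((1:100) - 0.5) / 100, 2, 2)
prior <- prior / sum(prior)

# Posterior proportional to prior times likelihood, normalized
posterior_beta <- prior * sapply(1:100, lf)
posterior_beta <- posterior_beta / sum(posterior_beta)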

jbowman
  • Thanks for this computational approach. For me, $Q_1$ and $Q_2$ are actually quite far apart. Also, I'm trying to see if I can reason about the change point and its uncertainty for any dataset $D$. I guess what I'm looking for is something equivalent to the tail bounds you see for estimating sums of Bernoulli variables. – Germ Mar 03 '24 at 01:03
  • As $|D| \to \infty$, the uncertainty about $d$ grows, so I'm not sure how useful that would be to you. – jbowman Mar 03 '24 at 03:22
  • 1
    Can you elaborate on how the uncertainty grows? If instead of using this Bayesian approach I consider a sliding window in which I calculate the sum within that window, it seems like I should be able to clearly tell by eye where the change point is. But the question is, what does the uncertainty look like? I was hoping to make this quantitative, and it seems like there should be some probability that I can identify the uncertainty within a fixed window. – Germ Mar 03 '24 at 13:53
  • Given $Q_1$, $Q_2$ and $|D|$, is there a way for me to bound the probability of my estimate being outside of some additive constant from the true value? I'm looking to do this analytically. – Germ Mar 07 '24 at 14:09