
I have what is probably a very naive question, but I've been unable to find a suitable explanation elsewhere. I am trying to calculate the likelihood of two errors occurring at the same position if they are randomly distributed amongst all positions available. The problem is difficult to explain in context, so I will simplify the scenario. Say I have two positions that can have a number of identities across a number of events. So I might represent this with two vectors:

X <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
Y <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

My assumption is that the 1s are errors and the 0s are not. What I want to calculate is the probability of the errors coinciding at corresponding positions in X and Y if the errors are distributed at random, and then how unlikely it is that I observe a certain number of corresponding errors (at the same position in X and Y) in observed data. To do this I have calculated the frequency of an error at each position (so in the example above, 3/11 in X and 2/11 in Y) and then calculated the expected chance of overlap by multiplying these together (3/11 * 2/11 = 6/121). I can then calculate the probability of seeing at least two overlapping 1s (from observed data) using the cumulative binomial, 1 - pbinom(1, 11, 6/121), which is the chance of two or more observed overlaps in 11 trials given an expectation of 6/121 per trial.
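
For reference, here is a minimal R sketch of that calculation (an editorial illustration of the approach just described, using the example vectors):

X <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
Y <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

n <- length(X)                     # 11 positions
p_overlap <- mean(X) * mean(Y)     # (3/11) * (2/11) = 6/121 per position
observed <- sum(X == 1 & Y == 1)   # 2 coincident errors in this example

# Chance of two or more coincidences in n positions under this binomial model
1 - pbinom(observed - 1, n, p_overlap)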

However, does this fully capture the probability of observing this particular scenario (two or more matching errors) if the errors are randomly distributed, or am I missing some information in the fact that the 0s also frequently match up? In the scenario I am working with, having zeros match up is also good evidence that the positions of the 0s and 1s are not random, but as the zeros are often far more frequent than the 1s, I don't know whether I should be considering these matches.

Sorry if this question is unclear, but any help would be much appreciated!

Peter Ellis
Alan

2 Answers


It sounds like your actual situation is more complex than your example in that you have two sources of randomness:

  1. The timing of events
  2. Whether or not an event is an "error"

In this case, yes, there is certainly information in whether the non-errors line up, and you have a very complex situation indeed. I would think that a simulation exercise is more likely to be useful than a theoretical one - you would need to simulate the timing of events, and whether they are errors, under the 'no relationship' scenario, and then see how much more aligned the events in the observed data are than in the simulations.
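
As an editorial illustration (not part of the original answer), a stripped-down version of such a check in R, which keeps the observed events fixed and only shuffles which of them are errors, might look like this; a fuller simulation along the lines described above would also randomize the timing of the events:

set.seed(1)
X <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
Y <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
observed <- sum(X == 1 & Y == 1)

# Shuffle the error labels independently within X and Y ('no relationship')
# and count how many coincidences arise by chance in each simulated data set.
n_sim <- 10000
sim <- replicate(n_sim, sum(sample(X) == 1 & sample(Y) == 1))

# Approximate p-value: proportion of simulations with at least as many coincidences
mean(sim >= observed)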

In your illustrative example, however, there isn't any information in the fact that the zeros are lined up, as this is a feature of the design and, by implication, the vectors are equally spaced (that is, we have no information to suggest it is even possible for the zeros not to be lined up).

Peter Ellis
  • Thanks for the reply Peter. In fact, my events have to occur at the same time at both positions (X and Y). If my question was slightly different, I wonder whether there would be a simple solution. Imagine that I am not talking about errors, but two different types of event that can happen - event A and event B. If there are two places that these events can happen, and they happen at the same time in each of these places (so vectors A and B are fixed), what is the likelihood of observing my particular pattern? e.g. – Alan Feb 14 '13 at 18:43

The chances are invariant under re-ordering the positions, so you may assume that $X = (1, \ldots, 1, 0, \ldots, 0)$ (having, say, $k$ ones) and that $Y$ has $l$ randomly placed ones. The question then asks for the distribution of those ones within the first $k$ places of $Y$. Well, letting there be $n$ places in total, there are $\binom{n}{l}$ combinations of $l$ ones and for each $i$, $0 \le i \le l$, there are $\binom{k}{i}$ equally probable ways to situate $i$ ones within the first $k$ places and $\binom{n-k}{l-i}$ equally probable ways to situate the $l-i$ remaining ones in the remaining $n-k$ places, whence the chance of $i$ coincident "errors" is

$$\binom{k}{i}\binom{n-k}{l-i} / \binom{n}{l},$$

$0 \le i \le \min(l,k).$

This differs from the solution that assumes the errors have binomial distributions; the difference is that this solution is conditioned on the values of $k$ and $l$, whereas in the binomial model $k$ and $l$ are themselves random.
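
This formula is the hypergeometric distribution, so (as an editorial aside, not part of the original answer) it can be evaluated directly in R with dhyper, using the same $n$, $k$, $l$ and $i$ as above:

# P(i coincident errors) = choose(k, i) * choose(n - k, l - i) / choose(n, l)
coincidence_prob <- function(i, n, k, l) {
  dhyper(i, m = k, n = n - k, k = l)
}

# Example values from the question: n = 11 positions, k = 3 ones in X, l = 2 ones in Y
coincidence_prob(0:2, n = 11, k = 3, l = 2)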

whuber
  • Thanks whuber, that makes a lot of sense. So in an imaginary scenario where k=4, l=4 and n=20, if I observe 3 coincident errors, to calculate the probability that this could have occurred by chance, I need to calculate the probability that I observe at least 3 coincident events and sum the above formula for i=3 and i=4? – Alan Feb 14 '13 at 21:04
  • Yes. The full distribution for $i=0,\ldots,4$ in this case is $\left\{\frac{364}{969},\frac{448}{969},\frac{48}{323},\frac{64}{4845},\frac{1}{4845}\right\}$, so your answer should be the sum of the last two, equal to $\frac{13}{969} \approx 0.0134$. As a rough check, each of the four errors in $Y$ had about a $4/20=1/5$ chance of coinciding with one in $X$, so the Binomial result is $\binom{4}{3}(1/5)^3(4/5) + \binom{4}{4}(1/5)^4 = 17/625 = 0.0272$. The correct answer is smaller because after the first coincidence the chance dropped to $3/19$, etc. – whuber Feb 14 '13 at 21:47
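
For readers who want to reproduce the numbers in that last comment in R (an editorial addition, not part of the exchange), the exact tail probability and the rough binomial check are:

n <- 20; k <- 4; l <- 4

# Exact: P(at least 3 coincidences) under the hypergeometric model
phyper(2, m = k, n = n - k, k = l, lower.tail = FALSE)   # 13/969, about 0.0134

# Rough binomial check from the comment
pbinom(2, size = l, prob = k / n, lower.tail = FALSE)    # 17/625 = 0.0272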