
I have a series of discrete, purportedly random whole numbers like this:

v1 v2 v3 v4 v5 v6
42  23  10  07  01  35  
05  02  26  25  49  18  
35  18  43  29  26  28  
36  59  26  15  34  35  

I want to identify whether there are any values that co-occur (in a row) more frequently than one would expect by chance. The co-occurrence within a row is of interest regardless of which columns the numbers happen to be in.
Any ideas for a technique? Even just keywords I could search for would be highly appreciated.

The answers so far are greatly appreciated and interesting; let me clarify my interest. Pretend there are more lines than given below, and note the numbers 10 and 07, which seem to co-occur within a row a lot (it will not be this obvious, and I have thousands of rows). What I am interested in is identifying two or more numbers that appear in the same row together more frequently than one would expect at random. It should be noted that the numbers have a finite range (e.g. 1:60) and are supposed to be randomly drawn without replacement within each row. To make the goal concrete, a rough pair-counting sketch follows the table below.

v1 v2 v3 v4 v5 v6
42  23  10  07  01  35
10  02  26  25  07  18 
07  10  43  29  26  28  
36  59  07  15  34  10 
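
Tallying how often each unordered pair of numbers lands in the same row might look something like this (R chosen just for illustration):

rows <- rbind(c(42, 23, 10,  7,  1, 35),
              c(10,  2, 26, 25,  7, 18),
              c( 7, 10, 43, 29, 26, 28),
              c(36, 59,  7, 15, 34, 10))
# pool all 15 unordered pairs from each row and tabulate them
pairs <- do.call(rbind, lapply(seq_len(nrow(rows)),
                               function(i) t(combn(sort(rows[i, ]), 2))))
head(sort(table(paste(pairs[, 1], pairs[, 2], sep = "-")), decreasing = TRUE))
# "7-10" tops the list with 4; my question is how to judge which counts are too high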
Patrick

2 Answers


If you suspect the numbers have been cooked, i.e. manually selected, one possibility is Benford's law, as used in forensic accounting; see: http://en.wikipedia.org/wiki/Benford%27s_law
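
A minimal sketch of that check in R, purely illustrative: Benford's law only makes sense for data spanning several orders of magnitude, so it may say little about bounded draws such as 1:60.

x <- c(42, 23, 10, 7, 1, 35, 5, 2, 26, 25, 49, 18)    # stand-in for the pooled numbers
lead <- as.integer(substr(as.character(x), 1, 1))     # leading digit of each value
benford <- log10(1 + 1 / (1:9))                       # Benford probabilities for digits 1-9
obs <- table(factor(lead, levels = 1:9))
chisq.test(obs, p = benford, simulate.p.value = TRUE) # simulated p-value, as expected counts are small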

Aside from that, it is probably easier just to test the rows as a whole for randomness; see: http://en.wikipedia.org/wiki/Diehard_tests
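
The Diehard suite operates on long bit streams, so it is not a drop-in tool here; as a crude stand-in, and assuming the values live in 1:60, one could first check that each value turns up about equally often overall:

set.seed(1)
dat <- t(replicate(1000, sample(1:60, 6)))      # stand-in for your rows-by-6 matrix of data
counts <- table(factor(dat, levels = 1:60))     # overall frequency of each possible value
chisq.test(counts)                              # default null: all 60 values equally likely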

If an overall test flags a problem, you could then zero in on its cause. Of course, if you were trying to fool a randomness test, you could easily do so by swapping the most frequently occurring number with instances of the least frequently occurring.

Lyle

I imagine some kind of simulation study is the way to go. Simulate data under a null hypothesis that the numbers really are random, count the number of repeated values in each row, and build up the null distribution of those counts; then compare this to the distribution of counts observed in your actual data. You could use a chi-squared test to compare the observed counts to those predicted under the null hypothesis.

For example, in the simple case where the numbers 1 to 100 are all equally likely to be in any position in your data, around 86% of rows should have no repeated numbers at all; about 14% should have some number appearing twice; and so on. See the simulation below with six columns of a million numbers each (input and output from R).

> numb <- matrix(sample(1:100, 6*10^6, replace=TRUE), ncol=6)  # a million simulated rows of six numbers
> cooccur <- apply(numb, 1, function(x){max(table(x))})        # most times any single value repeats within a row
> prop.table(table(cooccur))*100                               # percentage of rows at each repetition level
cooccur
      1       2       3       4 
85.8445 13.9621  0.1921  0.0013 
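
The comparison step might then look like this, with made-up observed counts standing in for your real data:

obs_counts <- c(850, 140, 9, 1)                           # hypothetical: rows whose most-repeated value occurs 1,2,3,4 times
null_probs <- c(85.8445, 13.9621, 0.1921, 0.0013) / 100   # proportions from the simulation above
chisq.test(obs_counts, p = null_probs, simulate.p.value = TRUE)  # simulate: expected counts for 3s and 4s are tiny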

A more complicated simulation would be needed if the underlying randomness in your null hypothesis is not as simple as this (e.g. if, while all the cells are still independent, some values are more likely than others and hence obviously more likely to turn up twice in a row). In that case, you might need to find a way to resample from your observed data; effectively, some form of bootstrap. How best to do that would depend on a plausible model for how the data were generated under your null hypothesis of randomness.
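
One possible version of that resampling, assuming a null that keeps the draws without replacement but weights each number by its observed overall frequency (dat standing in for the real rows-by-6 matrix):

freq <- prop.table(table(factor(dat, levels = 1:60)))   # observed marginal frequency of each value
sim <- t(replicate(nrow(dat),
                   sample(1:60, 6, replace = FALSE, prob = freq)))
# count pair co-occurrences in sim exactly as in the real data, repeat many
# times, and compare the real pair counts against that null distribution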

Peter Ellis