
Consider a two-way contingency table formed by variables $Y_{1}$ and $Y_{2}$ (say), each with three levels. How do I generate $N$ bootstrap samples from the given table?

  • Could you give us a (possibly made-up) example of your data so it is clearer what the problem is about here? – Tim Sep 19 '17 at 15:29
  • It's basically a $3\times 3$ table. I want to generate this table using bootstrap resampling from the given data. I need to check whether some inequalities involving the cell counts are satisfied for each sample. For that, I need to generate the table first. I know how to do bootstrapping in a $2\times 2$ table using uniform random numbers. Is it possible to extend this method to a $3\times 3$ table? – user143487 Sep 19 '17 at 15:37

1 Answer


There are at least three ways this can be achieved.

(1) A contingency table stores counts of occurrences of events. What you need to do is de-aggregate the table into the individual observations and then apply the standard bootstrap.

For example, let's say that your data look as follows:

tab <- matrix(c(3, 1, 5, 0, 2, 8, 1, 6, 2), 3, byrow = FALSE,
              dimnames = list(LETTERS[1:3], LETTERS[1:3]))
##   A B C
## A 3 0 1
## B 1 2 6
## C 5 8 2

then you need to convert it to the "long" format:

dat1 <- as.data.frame(as.table(tab))
##   Var1 Var2 Freq
## 1    A    A    3
## 2    B    A    1
## 3    C    A    5
## 4    A    B    0
## 5    B    B    2
## 6    C    B    8
## 7    A    C    1
## 8    B    C    6
## 9    C    C    2

and repeat each row according to the Freq column (counts):

dat2 <- dat1[rep(1:nrow(dat1), dat1$Freq), 1:2]
##     Var1 Var2
## 1      A    A
## 1.1    A    A
## 1.2    A    A
## 2      B    A
## 3      C    A
## [ ... ]

Notice that this is just another way of storing the same data. You can even verify that this yields identical data by checking all(tab == table(dat2)).

Finally, apply the bootstrap in a way that preserves the within-row dependence structure of the data, i.e. sample the rows with replacement.

boot_sample <- dat2[sample.int(nrow(dat2), nrow(dat2), replace = TRUE), ]

You need to repeat this sampling as many times as you need bootstrap samples.
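For instance, one way to collect $N$ bootstrap tables in a list (the value N <- 1000 is arbitrary, and your_check is just a placeholder for whatever function tests your inequalities):

N <- 1000  # number of bootstrap samples; pick whatever you need
boot_tabs <- replicate(N, {
  idx <- sample.int(nrow(dat2), nrow(dat2), replace = TRUE)
  table(dat2[idx, ])   # re-tabulate the resampled observations
}, simplify = FALSE)
# boot_tabs is a list of N resampled 3x3 tables; apply your inequality
# check to each element, e.g. sapply(boot_tabs, your_check)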

(2) Yet another implementation, more computationally efficient but less obvious at first sight, is to sample the $(i,j)$ pairs $n$ times with probabilities equal to $p(i,j) = c(i,j)/n$, where $n=\sum_{i,j} c(i,j)$ and $c(i,j)$ is the count in the $i$-th row and $j$-th column. This doesn't require you to de-aggregate the table, while yielding exactly the same result.
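A sketch of that idea, reusing dat1 from above (wrapping the sampled cell indices in factor just makes sure cells that were never drawn still appear with a count of zero):

n <- sum(dat1$Freq)
# draw n cell indices with probabilities proportional to the observed counts
cells <- sample.int(nrow(dat1), n, replace = TRUE, prob = dat1$Freq)
boot_counts <- table(factor(cells, levels = seq_len(nrow(dat1))))
cbind(dat1[, 1:2], Freq = as.vector(boot_counts))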

(3) An even simpler solution is to take a sample of size $n$ from the multinomial distribution parametrized by the $p(i,j)$ probabilities. In this case you directly sample the counts for the $(i,j)$ cells.

n <- sum(dat1$Freq)
p <- dat1$Freq/n
cbind(dat1[, 1:2], Freq = rmultinom(1, n, p))
##   Var1 Var2 Freq
## 1    A    A    4
## 2    B    A    0
## 3    C    A    2
## 4    A    B    0
## 5    B    B    2
## 6    C    B    9
## 7    A    C    1
## 8    B    C   10
## 9    C    C    0
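Since rmultinom can draw many samples at once, all $N$ tables come back as the columns of a single matrix. A sketch of how you might then check an inequality on the cell counts (the value 1000 and the particular inequality are only placeholders for your own setup):

boot_freqs <- rmultinom(1000, n, p)  # 9 x 1000 matrix, one bootstrap table per column
# rows follow the order of dat1, so row 1 is cell (A, A) and row 9 is cell (C, C);
# e.g. the proportion of bootstrap samples where count(A, A) > count(C, C):
mean(boot_freqs[1, ] > boot_freqs[9, ])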