Consider a two-way contingency table formed by variables $Y_{1}$ and $Y_{2}$ (say), each with three levels. How do I generate $N$ bootstrap samples from the given table?
-
Could you give us a (possibly made-up) example of your data so it is clearer what the problem is here? – Tim Sep 19 '17 at 15:29
-
It's basically a $3\times 3$ table. I want to generate this table using bootstrap resampling from the given data. I need to check whether some inequalities involving the cell counts are satisfied for each sample. For that, I need to generate the table first. I know how to do bootstrapping in a $2\times 2$ table using uniform random numbers. Is it possible to extend this method to a $3\times 3$ table? – user143487 Sep 19 '17 at 15:37
1 Answer
There are at least three ways to achieve this.
(1) A contingency table stores counts of occurrences of events. What you need to do is de-aggregate the table into the individual observations and then apply the standard bootstrap.
For example, let's say that your data looks as below:
tab <- matrix(c(3, 1, 5, 0, 2, 8, 1, 6, 2), 3, byrow = FALSE,
dimnames = list(LETTERS[1:3], LETTERS[1:3]))
## A B C
## A 3 0 1
## B 1 2 6
## C 5 8 2
then you need to convert it to the "long" format:
dat1 <- as.data.frame(as.table(tab))
## Var1 Var2 Freq
## 1 A A 3
## 2 B A 1
## 3 C A 5
## 4 A B 0
## 5 B B 2
## 6 C B 8
## 7 A C 1
## 8 B C 6
## 9 C C 2
and repeat each row according to the Freq column (counts):
dat2 <- dat1[rep(1:nrow(dat1), dat1$Freq), 1:2]
## Var1 Var2
## 1 A A
## 1.1 A A
## 1.2 A A
## 2 B A
## 3 C A
## [ ... ]
Notice that this is just another way of storing the same data. You can even verify that this yields identical data by checking all(tab == table(dat2)).
Finally, apply the bootstrap in a way that preserves the within-row dependence structure of the data, i.e. sample the rows with replacement.
boot_sample <- dat2[sample.int(nrow(dat2), nrow(dat2), replace = TRUE), ]
You need to repeat this sampling as many times as you need bootstrap samples.
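For instance, the repetition can be wrapped in replicate (the choice N = 1000 below is just an illustration, and the setup lines reconstruct dat2 from the earlier steps):

```r
# reconstruct the data from the earlier steps
tab <- matrix(c(3, 1, 5, 0, 2, 8, 1, 6, 2), 3,
              dimnames = list(LETTERS[1:3], LETTERS[1:3]))
dat1 <- as.data.frame(as.table(tab))
dat2 <- dat1[rep(1:nrow(dat1), dat1$Freq), 1:2]

set.seed(42)   # for reproducibility
N <- 1000      # illustrative number of bootstrap samples

# each element is a 3x3 bootstrap contingency table
boot_tables <- replicate(N, {
  idx <- sample.int(nrow(dat2), nrow(dat2), replace = TRUE)
  table(dat2[idx, ])
}, simplify = FALSE)
```

You can then check your cell-count inequalities on each element of boot_tables.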
(2) Yet another implementation, more computationally efficient but less obvious at first sight, is to sample $n$ times the $(i,j)$ pairs with probabilities equal to $p(i,j) = c(i,j)/n$, where $n=\sum_{i,j} c(i,j)$ and $c(i,j)$ is the count in the $i$-th row and $j$-th column. This doesn't require you to de-aggregate the table, while yielding exactly the same solution.
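A minimal sketch of this approach (the variable names are my own): sample cell indices with the $p(i,j)$ weights, then tabulate them back into a table.

```r
tab <- matrix(c(3, 1, 5, 0, 2, 8, 1, 6, 2), 3,
              dimnames = list(LETTERS[1:3], LETTERS[1:3]))
n <- sum(tab)
p <- tab / n   # cell probabilities p(i, j)

set.seed(42)
# draw n cell indices (1..9, column-major) with weights p(i, j)
idx <- sample.int(length(tab), n, replace = TRUE, prob = p)
# count how often each cell was drawn and reshape into a 3x3 table
boot_tab <- matrix(tabulate(idx, nbins = length(tab)),
                   nrow = nrow(tab), dimnames = dimnames(tab))
```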
(3) An even simpler solution is to take a sample of size $n$ from a multinomial distribution parametrized by the $p(i,j)$ probabilities. In this case you directly sample the counts for the $(i,j)$ cells.
n <- sum(dat1$Freq)
p <- dat1$Freq/n
cbind(dat1[, 1:2], Freq = rmultinom(1, n, p))
## Var1 Var2 Freq
## 1 A A 4
## 2 B A 0
## 3 C A 2
## 4 A B 0
## 5 B B 2
## 6 C B 9
## 7 A C 1
## 8 B C 10
## 9 C C 0
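To draw $N$ bootstrap tables at once, note that rmultinom takes the number of draws as its first argument; each column of the result then holds the cell counts of one resampled table (again, N = 1000 is only illustrative):

```r
tab <- matrix(c(3, 1, 5, 0, 2, 8, 1, 6, 2), 3,
              dimnames = list(LETTERS[1:3], LETTERS[1:3]))
dat1 <- as.data.frame(as.table(tab))
n <- sum(dat1$Freq)
p <- dat1$Freq / n

set.seed(42)
N <- 1000
boot_counts <- rmultinom(N, n, p)  # 9 x N matrix; column k = k-th table
# reshape a column back into a 3x3 table, e.g. the first one:
boot_tab1 <- matrix(boot_counts[, 1], 3, dimnames = dimnames(tab))
```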