
I am trying to simulate "correlated categorical data". For instance, consider the following example: suppose there are 10 players (p1, p2, ..., p10). Each day, a random combination of these players meets (1 = present, 0 = absent). The players who are present then try to solve a puzzle (1 = successful, 0 = unsuccessful).

I would like this dataset to illustrate the following concepts:

  • I would like to show that when certain players are present together (e.g. p1 and p5, or p1, p6 and p9), they tend to be more successful at solving the puzzle (i.e. the rows where they are all present have a higher percentage of 1s in the result column)

  • I would like to show that when certain other players are present together (e.g. p9 and p3), they tend to be less successful at solving the puzzle (i.e. the rows where they are both present have a lower percentage of 1s in the result column)

  • I would like to show that certain players only tend to be successful when other players are there, and less successful when those players are absent (e.g. player 2 is very successful when player 3 is present, but player 2 is not very successful when player 3 is absent)

  • And finally, some players are always successful no matter who they play with
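One way to make these four requirements concrete is to simulate attendance first and then draw the result from a logistic model whose interaction terms encode each effect. This is only a minimal sketch in base R: the attendance probability of 0.5, every coefficient, the choice of p10 as the "always successful" player, and the names n_days, d and eta are illustrative assumptions, not part of the original problem.

```r
set.seed(1)
n_days <- 10000

# Attendance: each player shows up independently with probability 0.5
# (an assumption; correlated attendance would need a separate model).
d <- as.data.frame(matrix(rbinom(n_days * 10, 1, 0.5), ncol = 10))
colnames(d) <- paste0("p", 1:10)

# Log-odds of success. All coefficients are made-up illustration values.
eta <- with(d,
  -1.0 +                        # baseline: success is fairly unlikely
   2.0 * p1 * p5 +              # p1 and p5 together help
   2.0 * p1 * p6 * p9 -         # p1, p6 and p9 together help
   3.0 * p3 * p9 -              # p3 and p9 together hurt
   1.5 * p2 + 4.0 * p2 * p3 +   # p2 hurts alone, helps when p3 is there
   5.0 * p10                    # p10 succeeds almost regardless
)

# Convert log-odds to probabilities and draw the 0/1 outcome.
d$result <- rbinom(n_days, 1, plogis(eta))

# Compare success rates with and without a given combination.
with(d, tapply(result, p1 == 1 & p5 == 1, mean))
with(d, tapply(result, p3 == 1 & p9 == 1, mean))
with(d[d$p2 == 1, ], tapply(result, p3, mean))
```

Because the effects are written directly on the log-odds scale, each bullet above maps to one term, and the tapply() calls give a quick check that the intended pattern is visible in the simulated rows.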

Using the R programming language, I tried simulating multivariate normal data and then thresholding it: elements below some cutoff are replaced with 0, and elements above it are replaced with 1:

    library(MASS)  # mvrnorm() lives in MASS, not mvtnorm

    n <- 11

    # random symmetric positive semi-definite matrix, used as the covariance
    A <- matrix(runif(n^2) * 2 - 1, ncol = n)
    s <- t(A) %*% A

    # 100 days; 10 player columns plus the result column
    my_data <- MASS::mvrnorm(100, mu = rnorm(n, 10, 1), Sigma = s)
    my_data <- data.frame(my_data)
    colnames(my_data) <- c(paste0("p", 1:10), "result")

    # threshold at 9: values below become 0, values above become 1
    my_data[my_data < 9] <- 0
    my_data[my_data > 9] <- 1

      p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 result
    1  1  1  1  0  1  1  1  0  0   0      0
    2  0  1  1  1  1  0  1  1  1   1      1
    3  1  1  1  0  1  0  1  1  1   1      1
    4  1  1  1  0  1  1  1  1  1   1      1
    5  0  1  1  1  1  0  1  1  0   0      0
    6  1  0  1  0  1  1  1  0  1   1      1

But in the end, I am not sure whether I actually managed to generate categorical data with any correlation pattern whatsoever.

Can someone please tell me if I have done this correctly - and if not, are there any standard methods to randomly simulate categorical correlated data?

Thanks!

stats_noob
  • https://stats.stackexchange.com/questions/284996 shows how to do this with two variables in a way that generalizes to any finite number of binary variables. – whuber Jun 09 '22 at 15:02
  • @whuber: thank you! What do you think of the approach I have described? Does it make sense? Thank you so much! – stats_noob Jun 09 '22 at 15:59
  • Yes, that's an approach that often is used to model correlated binary variables. What it needs to be complete is to work out the relationship between the correlation matrix of the multinormal variable and the correlation matrix of the resulting multivariate binary variable, so that given a specification of the latter you can figure out the former. The relationship is relatively simple. – whuber Jun 09 '22 at 16:23
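
For a single pair of variables, the relationship described in the last comment can be sketched as follows. Assume X1 and X2 are obtained by thresholding a standard bivariate normal pair with correlation rho, with the thresholds chosen so that P(X1 = 1) = p1 and P(X2 = 1) = p2. The correlation of the binary pair then follows from the joint probability P(X1 = 1, X2 = 1), and inverting that map gives the latent correlation needed for a target binary correlation. The helper names joint_prob, binary_cor and latent_cor are hypothetical, and the code uses only base R (numerical integration rather than a dedicated bivariate normal routine).

```r
# P(X1 = 1, X2 = 1) where X_i = 1 if Z_i > qnorm(1 - p_i) and (Z1, Z2)
# is standard bivariate normal with correlation rho. Uses the fact that
# Z2 given Z1 = z is normal with mean rho * z and variance 1 - rho^2.
joint_prob <- function(rho, p1, p2) {
  a <- qnorm(1 - p1)
  b <- qnorm(1 - p2)
  integrand <- function(z) {
    dnorm(z) * pnorm((b - rho * z) / sqrt(1 - rho^2), lower.tail = FALSE)
  }
  integrate(integrand, lower = a, upper = Inf)$value
}

# Pearson correlation of the two resulting binary variables.
binary_cor <- function(rho, p1, p2) {
  (joint_prob(rho, p1, p2) - p1 * p2) /
    sqrt(p1 * (1 - p1) * p2 * (1 - p2))
}

# Latent correlation that yields a target binary correlation.
# uniroot() stops with an error if the target is not attainable
# for the given p1 and p2.
latent_cor <- function(target, p1, p2) {
  uniroot(function(rho) binary_cor(rho, p1, p2) - target,
          interval = c(-0.999, 0.999), tol = 1e-8)$root
}

# With both thresholds at the median the result is approximately 1/3,
# matching the closed form (2 / pi) * asin(rho) for rho = 0.5.
binary_cor(0.5, p1 = 0.5, p2 = 0.5)

# Latent correlation needed for a binary correlation of 0.3.
latent_cor(0.3, p1 = 0.6, p2 = 0.4)
```

Applying latent_cor to every pair gives a candidate correlation matrix for the multinormal variable. That matrix is not guaranteed to be positive definite, so it should be checked (for example via its eigenvalues) before it is passed to a sampler.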
