Simulating logistic regression data for continuous by three-level categorical model

Question

I am trying to simulate data for a logistic regression model that tests the interaction of a continuous variable with a three-level categorical factor, but I am running into some difficulty. When I run the simulation repeatedly, I do not seem to recover the odds ratios I input. My approach is to create a dummy matrix of 1s and 0s and multiply it by the intercept and the categorical factor's coefficients. I then multiply beta1, the coefficient for the continuous variable, by the continuous data. For the interaction terms, a dummy-coded interaction with a continuous variable represents the difference in slope for that level of the factor compared to the reference group's slope (beta1). So I multiply beta4 (the coefficient for the interaction of level 2 of the factor with the continuous variable) by the level-2 dummy and the continuous variable, and do the same with beta5 for level 3. See below:

interaction <- function(N,b0,b1,b2,b3,b4,b5){
  # coefficients input in odds ratios
  x1 = rnorm(N)           # a continuous variable
  xcatdum <- t(rmultinom(N, size=1, prob=c(.33,.33,.33))) # a three level categorical dummy matrix
  xcatdumvect <- apply(xcatdum, 1, function(x) which(x==1)) # create a vector of 1, 2, 3 for dummy variable in df
  xcatdumcoef <- cbind(1, xcatdum[,-1]) %*% c(b0, b2,b3) # matrix multiplication of dummy coded coefficients
  # log odds
  lb0 <- log(b0)
  lb1 <- log(b1)
  lb2 <- log(b2)
  lb3 <- log(b3)
  lb4 <- log(b4)
  lb5 <- log(b5)
  # coefficients for reference group and categorical main effects is xcatdumcoef containing b0, b2, b3
  # lb1*x1 is slope for dummy reference group
  # lb4*x1*xcatdum[,2] is slope for second level
  # lb5*x1*xcatdum[,3] is slope for third level
  z = xcatdumcoef + lb1*x1 + lb4*x1*xcatdum[,2] + lb5*x1*xcatdum[,3]
  pr = 1/(1+exp(-z))      # inverse logit
  y = rbinom(N,1,pr)      # bernoulli response
  # now add it to dataframe:
  df = data.frame(y=y,x1=x1,x2=as.factor(xcatdumvect))
  return(df)
}

Should I be writing the simulated data for the interaction differently? For example, I wasn't sure if I should instead be writing something like this for the interaction between a level of the categorical factor and the continuous variable:

( lb4 * (lb1*x1*xcatdum[,2]) - (lb1*x1*xcatdum[,1]) )

I thought it might be written this way because the coefficient represents the slope of x1 for group 2 compared to group 1 (the reference group).
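For comparison, the usual dummy-coded form of this model can be written out on the log-odds scale. This is only a sketch of the standard parameterization: D_2 and D_3 are assumed names for the level-2 and level-3 indicators (xcatdum[,2] and xcatdum[,3] in the code), and each beta_k stands for the log of the corresponding input odds ratio.

```latex
\begin{aligned}
\operatorname{logit}\Pr(y = 1)
  &= \beta_0 + \beta_2 D_2 + \beta_3 D_3
   + \beta_1 x_1 + \beta_4 x_1 D_2 + \beta_5 x_1 D_3 \\[4pt]
\text{group 1 } (D_2 = 0,\ D_3 = 0):\quad
  &\beta_0 + \beta_1 x_1 \\
\text{group 2 } (D_2 = 1,\ D_3 = 0):\quad
  &(\beta_0 + \beta_2) + (\beta_1 + \beta_4)\, x_1 \\
\text{group 3 } (D_2 = 0,\ D_3 = 1):\quad
  &(\beta_0 + \beta_3) + (\beta_1 + \beta_5)\, x_1
\end{aligned}
```

Under this coding the slope of x1 is beta1 in group 1, beta1 + beta4 in group 2, and beta1 + beta5 in group 3, so beta4 and beta5 are already slope differences from the reference group and nothing has to be subtracted by hand.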

Fully written, here is the alternative formulation I was thinking of, where beta1 is the slope for reference group 1, beta0 is the reference group mean, beta2 is the group 2 mean, beta3 is the group 3 mean, beta4 is the difference in slope of group 2 from the reference group, and beta5 is the difference in slope of group 3 from the reference group:

  z = xcatdumcoef + lb1*x1*xcatdum[,1] +
    ( lb4 * (lb1*x1*xcatdum[,2]) - (lb1*x1*xcatdum[,1]) ) +
    ( lb5 * (lb1*x1*xcatdum[,3]) - (lb1*x1*xcatdum[,1]) )

It still appears to be generating coefficients that are off though...

Does anyone have any good ideas for how to write the simulation? Simulating a two-level factor's interaction with a continuous variable was intuitive for me, but extending it to more than two levels began to feel tricky, and I think I am doing something wrong.
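One thing that may explain the mismatch: in the first function, xcatdumcoef is built from b0, b2 and b3 on the odds-ratio scale, while the slope terms use the logged values. A minimal sketch of the same data-generating process with every coefficient logged before use is below. The function name simulate_interaction, the use of model.matrix() to build the dummy and interaction columns, and the example odds ratios are all assumptions made for illustration, not part of the original code.

```r
# Sketch: simulate y from a logistic model with a continuous-by-3-level interaction.
# All inputs are odds ratios (b0 is the baseline odds); they are logged once, up front.
simulate_interaction <- function(N, b0, b1, b2, b3, b4, b5) {
  lb <- log(c(b0, b1, b2, b3, b4, b5))
  # Names follow R's default treatment-coded column names for ~ x1 * x2
  names(lb) <- c("(Intercept)", "x1", "x22", "x23", "x1:x22", "x1:x23")

  x1 <- rnorm(N)                                              # continuous predictor
  x2 <- factor(sample(1:3, N, replace = TRUE), levels = 1:3)  # three-level factor

  X <- model.matrix(~ x1 * x2)       # intercept, main effects, interaction columns
  z <- drop(X %*% lb[colnames(X)])   # linear predictor on the log-odds scale
  y <- rbinom(N, 1, plogis(z))       # Bernoulli response via the inverse logit

  data.frame(y = y, x1 = x1, x2 = x2)
}

# Illustrative check with a large N: the fitted odds ratios should sit near the inputs.
set.seed(1)
df  <- simulate_interaction(1e5, b0 = 1, b1 = 1.5, b2 = 2, b3 = 0.5, b4 = 1.3, b5 = 0.7)
fit <- glm(y ~ x1 * x2, family = binomial, data = df)
round(exp(coef(fit)), 2)
```

Letting model.matrix() build the design matrix keeps the simulated model and the fitted model on the same coding, which removes one place where the two can drift apart.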

I'm not sure I follow this. What is the point of this simulation? Is this for a power analysis? Are you exploring the properties of logistic regression? Something else? It may help you to read some of our existing threads on simulating data (eg, 1, 2, 3). — gung - Reinstate Monica, Jun 21 '22 at 18:24
Hi. Yes, it's for a power analysis. I've read a number of StackExchange posts and found ones on a two-level categorical factor interacting with a continuous IV, or on a three-level categorical factor main effect, but not on a three-level categorical factor interacting with a continuous IV. I thought I could take what's shown in previous posts and extend it to a three-level factor interacting with a continuous IV, but I'm having difficulty. Is there a part that's particularly unclear that I could clarify? — JElder, Jun 21 '22 at 18:30
The abbreviated description is that I want to simulate a three-level categorical factor interacting with a continuous IV. The interaction of each level of the factor with the continuous IV is represented as the difference in slope relative to the reference group's slope, so I'm trying to figure out where my math or code is off.
Is that somewhat clearer? — JElder, Jun 21 '22 at 18:33
Do you know what the distribution of the levels of the categorical variable will be? Eg, for an experiment, you would typically assign equal numbers to each level? Do you have a theory of what the three slopes should be? — gung - Reinstate Monica, Jun 21 '22 at 18:33
An interaction between a 3-level categorical variable & a continuous variable is just 3 lines. What are the three lines you believe are real & want to differentiate from the null? — gung - Reinstate Monica, Jun 21 '22 at 18:36
So, I would like to have sufficient power to differentiate the slope for level 2 AND for level 3 of the categorical factor from level 1 (reference group) of the categorical factor. I figured out how to test for power to differentiate the slope of one level from another. I'm wondering how to get enough power to differentiate slope B AND slope C from slope A. Does that make sense? Would I just test for power for slope B vs. slope A, and then simply double it? — JElder, Jun 21 '22 at 18:57
It may help to read my answer to Simulation of logistic regression power analysis - designed experiments. You need a statistical analysis plan. What is yours? A simple one would be to test the model as a whole; if significant, test the interaction; if significant, test 2 vs 1, & test 3 vs 1. If that's all you want to know, you would be done--eg, no multiple comparisons corrections would be needed. Do you want power for both being significant, either, something else? What are the 3 theorized slopes? Will the 3 groups have equal n's? — gung - Reinstate Monica, Jun 21 '22 at 19:04
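The questions raised in the comments (power for both slopes, for either, equal group sizes) can be explored by simulation. The sketch below is one possible setup, not the analysis plan from the linked answer: it assumes equal group sizes, uses the Wald p-values of the two interaction terms at alpha = 0.05 rather than a staged sequence of tests, and the helper names sim_once and power_sim as well as the odds ratios are hypothetical.

```r
# Hypothetical helper: one simulated data set, coefficients given on the log-odds scale
# in the order (Intercept), x1, x22, x23, x1:x22, x1:x23.
sim_once <- function(N, log_beta) {
  x1 <- rnorm(N)
  x2 <- factor(rep_len(1:3, N))      # assumption: (near) equal group sizes
  X  <- model.matrix(~ x1 * x2)
  y  <- rbinom(N, 1, plogis(drop(X %*% log_beta)))
  data.frame(y = y, x1 = x1, x2 = x2)
}

# Estimate power for the two slope differences by repeated simulation and refitting.
power_sim <- function(N, log_beta, n_sims = 500, alpha = 0.05) {
  sig <- replicate(n_sims, {
    fit <- glm(y ~ x1 * x2, family = binomial, data = sim_once(N, log_beta))
    p   <- summary(fit)$coefficients[c("x1:x22", "x1:x23"), "Pr(>|z|)"]
    p < alpha                        # TRUE/FALSE for level 2 vs 1 and level 3 vs 1
  })
  c(level2 = mean(sig[1, ]),
    level3 = mean(sig[2, ]),
    both   = mean(sig[1, ] & sig[2, ]),
    either = mean(sig[1, ] | sig[2, ]))
}

set.seed(42)
# Assumed odds ratios, purely illustrative: b0, b1, b2, b3, b4, b5
log_beta <- log(c(1, 1.5, 2, 0.5, 1.8, 0.5))
pw <- power_sim(N = 300, log_beta = log_beta, n_sims = 200)
pw
```

Power for both slopes being significant can never exceed the smaller of the two single-term powers, so it has to be estimated jointly like this rather than derived by doubling the sample size needed for one comparison.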

0 Answers