I am trying to simulate data for a logistic regression model that tests the interaction between a continuous variable and a three-level categorical factor, but I am running into some difficulty. When I run the simulation repeatedly, the fitted models do not seem to return the odds ratios I put in.

My approach is to create a dummy matrix of 1s and 0s and multiply it by the intercept and the categorical factor's coefficients. I then multiply beta1, the coefficient for the continuous variable, by the continuous data. For the interaction, a dummy-coded interaction with a continuous variable represents the difference in slope between that level of the factor and the reference group (whose slope is beta1), so I multiply beta4 (the coefficient for the interaction of level 2 of the factor with the continuous variable) by the level-2 indicator and the continuous variable, and likewise beta5 for level 3. See below:
interaction <- function(N,b0,b1,b2,b3,b4,b5){
# coefficients input in odds ratios
x1 = rnorm(N) # a continuous variable
xcatdum <- t(rmultinom(N, size = 1, prob=c(.33,.33,.33))) # a three level categorical dummy matrix (one draw per row)
xcatdumvect <- apply(xcatdum, 1, function(x) which(x==1)) # create a vector of 1, 2, 3 for dummy variable in df
xcatdumcoef <- cbind(1, xcatdum[,-1]) %*% c(b0, b2,b3) # matrix multiplication of dummy coded coefficients
# log odds
lb0 <- log(b0)
lb1 <- log(b1)
lb2 <- log(b2)
lb3 <- log(b3)
lb4 <- log(b4)
lb5 <- log(b5)
# coefficients for reference group and categorical main effects are in xcatdumcoef, containing b0, b2, b3
# lb1*x1 is slope for dummy reference group
# lb4*x1*xcatdum[,2] is slope for second level
# lb5*x1*xcatdum[,3] is slope for third level
z = xcatdumcoef + lb1*x1 + lb4*x1*xcatdum[,2] + lb5*x1*xcatdum[,3]
pr = 1/(1+exp(-z)) # inverse logit
y = rbinom(N,1,pr) # bernoulli response
#now add it to dataframe:
df = data.frame(y=y,x1=x1,x2=as.factor(xcatdumvect))
return(df)
}
Should I be writing the simulated data for the interaction differently? For example, I wasn't sure if I should instead be writing something like this for the interaction between a level of the categorical factor and the continuous variable:
( lb4 * (lb1*x1*xcatdum[,2]) - (lb1*x1*xcatdum[,1]) )
My thinking was that this might be the right form, since the coefficient represents the slope of x1 for group 2 relative to group 1 (the reference group).
Written out in full, here is the alternative formulation I was considering, where beta0 is the reference group (group 1) mean, beta1 is the slope for the reference group, beta2 is the group 2 mean, beta3 is the group 3 mean, beta4 is the difference in slope between group 2 and the reference group, and beta5 is the difference in slope between group 3 and the reference group:
z = xcatdumcoef + lb1*x1*xcatdum[,1] +
( lb4 * (lb1*x1*xcatdum[,2]) - (lb1*x1*xcatdum[,1]) ) +
(lb5 *(lb1*x1*xcatdum[,3]) - (lb1*x1*xcatdum[,1]) )
It still appears to be generating coefficients that are off though...
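For reference, this is what dummy (treatment) coding with group 1 as the reference implies for the linear predictor. The notation is introduced here for illustration and is not from the code: D_2 and D_3 stand for the 0/1 indicator columns `xcatdum[,2]` and `xcatdum[,3]`, and each β is assumed to be the log of the corresponding input odds ratio (β_0 = log(b0), β_2 = log(b2), and so on).

```latex
\begin{aligned}
\operatorname{logit} P(y = 1)
  &= \beta_0 + \beta_1 x_1 + \beta_2 D_2 + \beta_3 D_3
     + \beta_4 x_1 D_2 + \beta_5 x_1 D_3 \\[4pt]
\text{slope of } x_1 \text{ in group 1} &= \beta_1 \\
\text{slope of } x_1 \text{ in group 2} &= \beta_1 + \beta_4 \\
\text{slope of } x_1 \text{ in group 3} &= \beta_1 + \beta_5
\end{aligned}
```

Read this way, the interaction terms already carry the slope differences, so nothing needs to be subtracted for the reference group. Every term, including the intercept and the categorical main effects, enters on the log-odds scale.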
Does anyone have any good ideas for how to write this simulation? Simulating a two-level factor's interaction with a continuous variable was intuitive to me, but extending it to more than two levels started to feel tricky, and I think I am doing something wrong.
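For comparison, here is a minimal sketch of one way the simulation might be set up, letting `model.matrix()` build the dummy and interaction columns so that the coefficient order matches what `glm()` estimates. The function name `simulate_interaction`, the argument `or`, and the example odds ratios are all hypothetical choices for illustration. It assumes treatment coding with level 1 as the reference and that every input is an odds ratio, which is logged before use.

```r
# Sketch only: simulate_interaction() and its arguments are hypothetical names,
# not part of the original code.
simulate_interaction <- function(N, or) {
  # or: odds ratios in the order b0, b1, b2, b3, b4, b5
  x1 <- rnorm(N)                                # continuous predictor
  x2 <- factor(sample(1:3, N, replace = TRUE))  # three-level factor, level 1 = reference
  # Design matrix columns: (Intercept), x1, x22, x23, x1:x22, x1:x23
  X <- model.matrix(~ x1 * x2)
  z <- drop(X %*% log(or))                      # linear predictor on the log-odds scale
  y <- rbinom(N, 1, plogis(z))                  # Bernoulli response via the inverse logit
  data.frame(y = y, x1 = x1, x2 = x2)
}

set.seed(1)
# Example odds ratios (assumed values, chosen only for illustration)
or_true <- c(b0 = 1.5, b1 = 2, b2 = 0.5, b3 = 1.2, b4 = 1.5, b5 = 0.7)
df  <- simulate_interaction(20000, or_true)
fit <- glm(y ~ x1 * x2, family = binomial, data = df)

# Compare the input odds ratios with the estimated ones
round(cbind(true = or_true, estimated = exp(coef(fit))), 2)
```

With a large N, the estimated column should sit close to the true column. Repeating the simulation over many seeds and averaging the log-odds estimates would give a more careful check.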
Is that somewhat clearer?
– JElder Jun 21 '22 at 18:33