I have a dataset where a predictor can have 2 states - if it's in state 1, then the value is always the same. If it's in state 2, then the value is continuous and changes. An example to this can be number of cigarettes smoked per day. If the person is a non-smoker, the value is always 0. If the person is a smoker, the value can be 0 or higher. In my model, I have 2 predictor variables - NoSmoke and Cig. Until recently, I thought this was the correct approach (based on another SO post that I can no longer find). However, I was recently told by a statistician that this approach is problematic - partly because the 2 predictors are independent, and partly because they're correlated.
When I run the model and plot the predictions, things seem to be working out. Is there an issue with this modeling approach? And if so - what's the correct way to analyze this, while preserving the continuous predictor values?
toy example with simulated data:
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
df1 <- data.frame(NoSmoke = "NoSmoke", Cig = rep(0, 100), Response = rnorm(100, 20, 5))
df2 <- data.frame(NoSmoke = "Smoke", Cig = rpois(100, 2)) %>%
mutate(Response = 5 + 2 * Cig + rnorm(100, 20, 5))
df <- rbind(df1, df2)
mod <- lm(Response ~ NoSmoke + Cig, data = df)
new <- data.frame(NoSmoke = c("NoSmoke", rep("Smoke", 7)), Cig = c(0, 0:6))
preds <- as.data.frame(predict(mod, newdata = new, interval = "confidence"))
new$Pred <- preds$fit
new$Lower <- preds$lwr
new$Upper <- preds$upr
ggplot(df) +
geom_point(aes(x = Cig, y = Response, colour = NoSmoke), position = position_dodge(width = 0.2)) +
geom_point(data = new, aes(x = Cig, y = Pred, fill = NoSmoke), size = 2, shape = 21, colour = "black",
position = position_dodge(width = 0.2)) +
geom_errorbar(data = new, aes(x = Cig, ymin = Lower, ymax = Upper, colour = NoSmoke),
position = position_dodge(width = 0.2))
EDIT
I left this out of the question for simplicity, but following some of the answers/comments - my actual continuous variable is actually continuous, as opposed to the number of cigarettes in the example (apologies, I was trying for simplicity). In terms of correlation between the NoSmoke and the Cig variables - in my analysis, I standardize Cig, so that the standardized value for the Smoke=="Smoke" cases is standardized as normal, and the standardized value for the Smoke=="NoSmoke" is zero. This allows the model to be Response = Intercept + beta_cig*Cig for the smokers and Response = Intercept + beta_nosmoke for the non-smokers. The correlation between the NoSmoke and the CigS is low - ~0.4 in the example, and <0.1 in my real data.
Here's the toy example with the standardization:
df <- df %>%
group_by(NoSmoke) %>%
mutate(CigS = (Cig - mean(Cig)) / sd(Cig)) %>%
ungroup() %>%
mutate(CigS = ifelse(NoSmoke == "NoSmoke", 0, CigS))
NoSmoke = factor(NoSmoke, levels = c("Smoke", "NoSmoke")))
cor(as.numeric(df$NoSmoke), df$CigS)
mod <- lm(Response ~ NoSmoke + CigS, data = df)
new <- data.frame(NoSmoke = c("NoSmoke", rep("Smoke", 7)), Cig = c(0, 0:6)) %>%
left_join(unique(select(df, NoSmoke, Cig, CigS)))
preds <- as.data.frame(predict(mod, newdata = new, interval = "confidence"))
new$Pred <- preds$fit
new$Lower <- preds$lwr
new$Upper <- preds$upr
ggplot(df) +
geom_point(aes(x = Cig, y = Response, colour = NoSmoke), position = position_dodge(width = 0.2)) +
geom_point(data = new, aes(x = Cig, y = Pred, fill = NoSmoke), size = 2, shape = 21, colour = "black",
position = position_dodge(width = 0.2)) +
geom_errorbar(data = new, aes(x = Cig, ymin = Lower, ymax = Upper, colour = NoSmoke),
position = position_dodge(width = 0.2))
NoSmokeindicator variable is opposite in direction to that in the post to which I linked, where the indicator was 0 when the continuous predictor was necessarily 0. This approach leads to a potential discontinuity atCigS=0between theNoSmokegroups. That's OK if it makes sense based on the subject matter. Interpretation might be easier, if allCigvalues are non-negative, to use them without centering/scaling. – EdM Feb 11 '23 at 21:06Cigvalues included the 0 values for theNoSmoke=="NoSmoke"group; is that what you intended? I don't think that standardization helps you here. – EdM Feb 11 '23 at 21:14NoSmoke. I'm surprised to hear the correlation between the two variables doesn't matter, wasn't expecting that. So are you suggesting testing the full model against a model that omits bothNoSmokeandCig, to get a P-value for the significance of the combined covariates? – user2602640 Feb 11 '23 at 21:31