My question is a follow-up from this post. I have a dataset where a predictor can have 2 states - if it's in state 1, then the value is always the same. If it's in state 2, then the value is continuous and changes. An example to this can be number of cigarettes smoked per day (my true predictor is actually continuous, while cigarettes are obviously not). If the person is a non-smoker, the value is always 0. If the person is a smoker, the value can be 0 or higher. In my model, I have 2 predictor variables - NoSmoke and Cig. The linear model in the linked post works just fine. However, I would like to use a spline, and it seems that I'm running into issues. The toy example below has 3 models - with poly(), spline() of the full dataset, and spline() of just the smokers. The poly() and the smokers-only spline() are fine, but the spline of the full dataset ends up rank-deficient (presumably due the non-continuity at zero). For my actual model, I would strongly prefer a spline, to reduce the tail-end artefacts that poly() produces. How do I deal with this?
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines)
set.seed(10)
df1 <- data.frame(NoSmoke = "NoSmoke", Cig = rep(0, 100), Response = rnorm(100, 20, 5))
df2 <- data.frame(NoSmoke = "Smoke", Cig = c(rpois(50, 2), rpois(50, 10))) %>%
mutate(Response = 5 - 2 * Cig + 0.1 * (Cig - 10)^2 + 0.1 * (Cig - 5)^3 + rnorm(100, 20, 10))
df <- rbind(df1, df2)
mod <- lm(Response ~ NoSmoke + poly(Cig, 3), data = df)
mod1 <- lm(Response ~ NoSmoke + ns(Cig, 3), data = df)
mod2 <- lm(Response ~ ns(Cig, 3), data = df %>% filter(NoSmoke == "Smoke"))
Predictions, showing the difference between spline and poly:
new <- data.frame(NoSmoke = "Smoke", Cig = c(0:max(df$Cig)))
preds <- as.data.frame(predict(mod, newdata = new, interval = "confidence"))
new$Pred.poly <- preds$fit
new$Lower.poly <- preds$lwr
new$Upper.poly <- preds$upr
preds <- as.data.frame(predict(mod1, newdata = new, interval = "confidence"))
new$Pred.ns <- preds$fit
new$Lower.ns <- preds$lwr
new$Upper.ns <- preds$upr
preds <- as.data.frame(predict(mod2, newdata = new, interval = "confidence"))
new$Pred.ns1 <- preds$fit
new$Lower.ns1 <- preds$lwr
new$Upper.ns1 <- preds$upr
p1 <- ggplot(df[df$NoSmoke == "Smoke",]) +
geom_point(aes(x = Cig, y = Response)) +
geom_line(data = new, aes(x = Cig, y = Pred.poly)) +
geom_ribbon(data = new, aes(x = Cig, ymin = Lower.poly, ymax = Upper.poly), alpha = 0.5)
p2 <- ggplot(df[df$NoSmoke == "Smoke",]) +
geom_point(aes(x = Cig, y = Response)) +
geom_line(data = new, aes(x = Cig, y = Pred.ns)) +
geom_ribbon(data = new, aes(x = Cig, ymin = Lower.ns, ymax = Upper.ns), alpha = 0.5)
p3 <- ggplot(df[df$NoSmoke == "Smoke",]) +
geom_point(aes(x = Cig, y = Response)) +
geom_line(data = new, aes(x = Cig, y = Pred.ns1)) +
geom_ribbon(data = new, aes(x = Cig, ymin = Lower.ns1, ymax = Upper.ns1), alpha = 0.5)
ggpubr::ggarrange(p1, p2, p3)