0

My question is a follow-up from this post. I have a dataset where a predictor can have 2 states - if it's in state 1, then the value is always the same. If it's in state 2, then the value is continuous and changes. An example to this can be number of cigarettes smoked per day (my true predictor is actually continuous, while cigarettes are obviously not). If the person is a non-smoker, the value is always 0. If the person is a smoker, the value can be 0 or higher. In my model, I have 2 predictor variables - NoSmoke and Cig. The linear model in the linked post works just fine. However, I would like to use a spline, and it seems that I'm running into issues. The toy example below has 3 models - with poly(), spline() of the full dataset, and spline() of just the smokers. The poly() and the smokers-only spline() are fine, but the spline of the full dataset ends up rank-deficient (presumably due the non-continuity at zero). For my actual model, I would strongly prefer a spline, to reduce the tail-end artefacts that poly() produces. How do I deal with this?

library(plyr)
library(dplyr)
library(tidyr)

library(ggplot2) library(splines)

set.seed(10) df1 <- data.frame(NoSmoke = "NoSmoke", Cig = rep(0, 100), Response = rnorm(100, 20, 5)) df2 <- data.frame(NoSmoke = "Smoke", Cig = c(rpois(50, 2), rpois(50, 10))) %>% mutate(Response = 5 - 2 * Cig + 0.1 * (Cig - 10)^2 + 0.1 * (Cig - 5)^3 + rnorm(100, 20, 10)) df <- rbind(df1, df2)

mod <- lm(Response ~ NoSmoke + poly(Cig, 3), data = df) mod1 <- lm(Response ~ NoSmoke + ns(Cig, 3), data = df) mod2 <- lm(Response ~ ns(Cig, 3), data = df %>% filter(NoSmoke == "Smoke"))

Predictions, showing the difference between spline and poly:

new <- data.frame(NoSmoke = "Smoke", Cig = c(0:max(df$Cig)))
preds <- as.data.frame(predict(mod, newdata = new, interval = "confidence"))

new$Pred.poly <- preds$fit new$Lower.poly <- preds$lwr new$Upper.poly <- preds$upr

preds <- as.data.frame(predict(mod1, newdata = new, interval = "confidence")) new$Pred.ns <- preds$fit new$Lower.ns <- preds$lwr new$Upper.ns <- preds$upr

preds <- as.data.frame(predict(mod2, newdata = new, interval = "confidence")) new$Pred.ns1 <- preds$fit new$Lower.ns1 <- preds$lwr new$Upper.ns1 <- preds$upr

p1 <- ggplot(df[df$NoSmoke == "Smoke",]) + geom_point(aes(x = Cig, y = Response)) + geom_line(data = new, aes(x = Cig, y = Pred.poly)) + geom_ribbon(data = new, aes(x = Cig, ymin = Lower.poly, ymax = Upper.poly), alpha = 0.5) p2 <- ggplot(df[df$NoSmoke == "Smoke",]) + geom_point(aes(x = Cig, y = Response)) + geom_line(data = new, aes(x = Cig, y = Pred.ns)) + geom_ribbon(data = new, aes(x = Cig, ymin = Lower.ns, ymax = Upper.ns), alpha = 0.5) p3 <- ggplot(df[df$NoSmoke == "Smoke",]) + geom_point(aes(x = Cig, y = Response)) + geom_line(data = new, aes(x = Cig, y = Pred.ns1)) + geom_ribbon(data = new, aes(x = Cig, ymin = Lower.ns1, ymax = Upper.ns1), alpha = 0.5) ggpubr::ggarrange(p1, p2, p3)

  • The problem likely lies in accepting the default knots for the spline. If so, a solution is to construct your own and plot it. – whuber Feb 13 '23 at 17:38
  • 1
    Huh, that seems to have done it. Any words on how to adjust inference for any knot placement exploration? Also, are you up for writing an answer? – user2602640 Feb 13 '23 at 18:09

0 Answers0