Rank deficiency in spline with a predictor that "becomes" categorical when larger than a cutoff

Question

My question is a follow-up from this post. I have a dataset where a predictor can have 2 states - if it's in state 1, then the value is always the same. If it's in state 2, then the value is continuous and changes. An example to this can be number of cigarettes smoked per day (my true predictor is actually continuous, while cigarettes are obviously not). If the person is a non-smoker, the value is always 0. If the person is a smoker, the value can be 0 or higher. In my model, I have 2 predictor variables - NoSmoke and Cig. The linear model in the linked post works just fine. However, I would like to use a spline, and it seems that I'm running into issues. The toy example below has 3 models - with poly(), spline() of the full dataset, and spline() of just the smokers. The poly() and the smokers-only spline() are fine, but the spline of the full dataset ends up rank-deficient (presumably due the non-continuity at zero). For my actual model, I would strongly prefer a spline, to reduce the tail-end artefacts that poly() produces. How do I deal with this?

library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(splines)
set.seed(10)
df1 <- data.frame(NoSmoke = "NoSmoke", Cig = rep(0, 100), Response = rnorm(100, 20, 5))
df2 <- data.frame(NoSmoke = "Smoke", Cig = c(rpois(50, 2), rpois(50, 10))) %>%
        mutate(Response = 5 - 2 * Cig + 0.1 * (Cig - 10)^2 + 0.1 * (Cig - 5)^3 + rnorm(100, 20, 10))
df <- rbind(df1, df2)
mod <- lm(Response ~ NoSmoke + poly(Cig, 3), data = df)
mod1 <- lm(Response ~ NoSmoke +  ns(Cig, 3), data = df)
mod2 <- lm(Response ~ ns(Cig, 3), data = df %>% filter(NoSmoke == "Smoke"))

Predictions, showing the difference between spline and poly:

new <- data.frame(NoSmoke = "Smoke", Cig = c(0:max(df$Cig)))
preds <- as.data.frame(predict(mod, newdata = new, interval = "confidence"))
new$Pred.poly <- preds$fit
new$Lower.poly <- preds$lwr
new$Upper.poly <- preds$upr
preds <- as.data.frame(predict(mod1, newdata = new, interval = "confidence"))
new$Pred.ns <- preds$fit
new$Lower.ns <- preds$lwr
new$Upper.ns <- preds$upr
preds <- as.data.frame(predict(mod2, newdata = new, interval = "confidence"))
new$Pred.ns1 <- preds$fit
new$Lower.ns1 <- preds$lwr
new$Upper.ns1 <- preds$upr
p1 <- ggplot(df[df$NoSmoke == "Smoke",]) +
 geom_point(aes(x = Cig, y = Response)) +
 geom_line(data = new, aes(x = Cig, y = Pred.poly)) +
 geom_ribbon(data = new, aes(x = Cig, ymin = Lower.poly, ymax = Upper.poly), alpha = 0.5) 
p2 <- ggplot(df[df$NoSmoke == "Smoke",]) +
    geom_point(aes(x = Cig, y = Response)) +
    geom_line(data = new, aes(x = Cig, y = Pred.ns)) +
    geom_ribbon(data = new, aes(x = Cig, ymin = Lower.ns, ymax = Upper.ns), alpha = 0.5) 
p3 <- ggplot(df[df$NoSmoke == "Smoke",]) +
    geom_point(aes(x = Cig, y = Response)) +
    geom_line(data = new, aes(x = Cig, y = Pred.ns1)) +
    geom_ribbon(data = new, aes(x = Cig, ymin = Lower.ns1, ymax = Upper.ns1), alpha = 0.5) 
ggpubr::ggarrange(p1, p2, p3)

The problem likely lies in accepting the default knots for the spline. If so, a solution is to construct your own and plot it. — whuber, Feb 13 '23 at 17:38
Huh, that seems to have done it. Any words on how to adjust inference for any knot placement exploration? Also, are you up for writing an answer? — user2602640, Feb 13 '23 at 18:09

Rank deficiency in spline with a predictor that "becomes" categorical when larger than a cutoff

0 Answers0