3

I am using the titanic_train data set in R to build a logistic regression model.

library(titanic)
library(splines)
library(broom)
library(dplyr)

#First we process the titanic_train dataset to make it

a little more logistic regression friendly

titanic <- titanic_train %>% select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>% mutate(Survived = factor(Survived), Pclass = factor(Pclass), Sex = factor(Sex))

I'll skip over the guts of the project and just say this: for the Age predictor I used a natural cubic spline via the ns() function and ended up with this model:

model_02 <- glm(Survived ~ SibSp + ns(Age, df = 3) + Pclass + Parch + Fare,
             data = titanic,
             family = binomial)
model_02 <- step(model_02, trace = FALSE)
summary(model_02)

When I run the summary, I look at the coefficients:

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       3.20072    0.55223    5.80  6.8e-09 ***
SibSp            -0.37579    0.11649   -3.23   0.0013 ** 
ns(Age, df = 3)1 -1.29087    0.47188   -2.74   0.0062 ** 
ns(Age, df = 3)2 -6.53137    1.08099   -6.04  1.5e-09 ***
ns(Age, df = 3)3 -4.10291    0.89387   -4.59  4.4e-06 ***
Pclass2          -0.75677    0.28842   -2.62   0.0087 ** 
Pclass3          -1.98663    0.30651   -6.48  9.1e-11 ***
Fare              0.00631    0.00291    2.17   0.0303 * 

Now I know that the coefficients represent the change in the log odds of the outcome (surviving the titanic crash in this case) but I don't know how to interpret the coefficients associated with the age spline.

My first thought is that they are the coefficients of the polynomial, but that does not seem accurate because they function: -4.1*Age^3 - 6.53*Age^2 - 1.29*Age does not even come close to tracking the log odds versus age.

I'm really lost on this one and it's really frustrating, can anyone help?

whuber
  • 322,774
  • 2
    As an aside, you can make that sex column binary like this as.integer(titanic$Sex == "male") –  Jan 07 '21 at 22:15
  • Splines are not polynomials, they are piecewise polynomial. Those coefficients are coefficients of a basis expansion (weights assigned to elements of a basis). – passerby51 Jan 08 '21 at 07:30

2 Answers2

4

The spline represents a nonlinear additive contribution to the response due to the Age variable. Because there are no interaction terms involving Age, this contribution will not vary with the values of any of the other variables. One of the best ways to evaluate it is to plot the spline over a reasonable range of ages. I usually pick the full range of ages in the data, but you can change this.

Plotting the spline can be done by reverse-engineering the code, but (regardless of your statistical computing platform) there's an easier way. Simply create an artificial dataset with reasonable values of all the variables but with the values of Age varying gradually from the smallest to the largest. The code below spaces N values equally across this range. Apply the prediction to this dataset. The resulting values show (by construction) how the contribution of the spline term varies:

Figure showing a graph of the spline contribution

In this case you can see it is approximately linear, which would suggest comparing your model to one without the spline. (The difference in AIC is only 6.3478, which is just on the threshold of justifying the use of the spline, suggesting this is worth more thought.)

If you really need a formula, see my answer at https://stats.stackexchange.com/a/101484/919 for a way to get it without doing any numerical programming.

As an aid to interpretation, recall that the coefficients of the non-spline terms are the first derivatives (slopes) of their contributions to the model. The spline causes these "effective coefficients" to vary with Age. You can easily plot this varying coefficient by computing the first derivatives numerically: divide the successive differences in predicted values by the successive differences in age.

Figure 2: Plot of the spline slope

For instance, at Age near 35 the effective slope is nearly zero, meaning small changes of Age in this range have no effect on the predicted response. Near ages of zero the effective slope is near $-0.15,$ indicating each additional year of age reduces the value of the link function by about $0.15.$ At the oldest ages the effective slopes are settling down to a value near $-0.09,$ indicating each additional year of age in this age group decreases the link function by $-0.09.$


Here is R code illustrating this approach (and demonstrating how simple it is to execute).

#
# Create a data frame for prediction: only `Age` will vary.
#
N <- 101
x <- titanic[which.max(complete.cases(titanic)), ]
df <- do.call(rbind, lapply(1:N, function(i) x))
df$Age <- with(titanic, seq(min(Age, na.rm=TRUE), max(Age, na.rm=TRUE), length.out=N))
#
# Predict and plot.
#
df$Survived.hat <- predict(model_02, newdata=df) # The predicted *link,* by default
with(df, plot(Age, Survived.hat, type="l", lwd=2, ylab="", main="Relative spline term"))
mtext("Spline contribution\nto the link function", side=2, line=2)
#
# Plot numerical derivatives.
#
dAge <- diff(df$Age[1:2])
delta <- diff(df$Survived.hat)/dAge
age <- (df$Age[-N] + df$Age[-1]) / 2
plot(age, delta, type="l", lwd=2, ylab="Change per year", xlab="Age",
     main="Spline Slope (Effective Coefficient)")
whuber
  • 322,774
0

From the help file ?ns::spline, we have

ns is based on the function splineDesign. It generates a basis matrix for representing the family of piecewise-cubic splines with the specified sequence of interior knots, and the natural boundary conditions.

You can see what these basis functions look like by plotting them.

x = seq(0, 1, length.out = 30)
B = splines::ns(x, df = 3)
plot(x, B[,1] , type = "b", col="red", ylim = range(B), pch=16, ylab = "Basis Function")
lines(x, B[,2] , type = "b", col="blue", pch=16)
lines(x, B[,3] , type = "b", col="skyblue2", pch=16)

Here is the result:

enter image description here

What you are getting in your regression are the coefficients multiplying these functions (not these exactly, but the basis functions based on Age data). See Chapter 5 of The Elements of Statistical Learning.

passerby51
  • 1,811