
Consider a predictor $X$ with a non-linear response $y$. A natural cubic spline with $K$ knots can be fit to model $y$ as a function of $X$.

Several different linear basis expansions in $X$ can be used to fit the natural cubic spline, such as the following $K$ basis functions from Hastie, Tibshirani & Friedman (2017):

\begin{align} h_1(X) &= 1,\\ h_2(X) &= X,\\ h_{k+2}(X) &= d_k(X) - d_{K-1}(X), \qquad k = 1, \dots, K-2, \end{align}

where

$$d_k(X) = \frac{(X-\xi_k)_+^3-(X-\xi_K)_+^3}{\xi_K-\xi_k},$$ $\xi_k \,$ is the $k$-th knot location, and $\, t_+ \,$ indicates that only the positive part is kept.
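For concreteness, these $K$ basis functions can be coded up directly. The sketch below is a NumPy transcription of the formulas above; the function and variable names are my own, and knot indexing is zero-based (so `knots[k-1]` corresponds to $\xi_k$):

```python
import numpy as np

def d(X, k, knots):
    """d_k(X) from the truncated power basis (k is zero-based here)."""
    K = len(knots)
    num = np.maximum(X - knots[k], 0) ** 3 - np.maximum(X - knots[K - 1], 0) ** 3
    return num / (knots[K - 1] - knots[k])

def natural_cubic_basis(X, knots):
    """Return the n-by-K matrix with columns h_1(X), ..., h_K(X).

    Assumes knots are sorted and distinct.
    """
    X = np.asarray(X, dtype=float)
    K = len(knots)
    cols = [np.ones_like(X), X]  # h_1 = 1, h_2 = X
    for k in range(K - 2):       # h_{k+2} = d_k - d_{K-1} (one-based indices)
        cols.append(d(X, k, knots) - d(X, K - 2, knots))
    return np.column_stack(cols)
```

One way to sanity-check this is the defining property of a natural spline: each basis function is linear beyond the boundary knots, so second differences of the columns vanish there.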

The natural cubic spline model can then be expressed as

$$f(X) = \sum_{k=1}^{K} \beta_{k} h_{k}(X),$$

and the $\beta$ coefficients can be estimated using least squares

$$\hat{\beta} = \left( \mathbf{H}^{\top} \mathbf{H} \right)^{-1} \mathbf{H}^{\top} y,$$ where $\mathbf{H}$ is the $n \times K$ basis matrix with entries $H_{ik} = h_k(x_i)$.
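A small sketch of this least-squares step, purely to show the algebra (it uses `np.linalg.lstsq` rather than the normal equations for numerical stability, and a plain truncated-power cubic basis instead of the natural-spline $h_k$; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=x.size)

# H[i, k] = k-th basis function evaluated at x[i]
knots = np.array([0.25, 0.5, 0.75])
H = np.column_stack(
    [np.ones_like(x), x, x**2, x**3]
    + [np.maximum(x - k, 0.0) ** 3 for k in knots]
)

# solves min_beta || H beta - y ||^2
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
y_fit = H @ beta_hat
```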

Question: How can one reconstruct the cubic spline in a single algebraic equation using only the fitted coefficients $\hat{\beta}$, the knot locations $\xi$, and the raw predictor $X$? I would appreciate a general expression for $N$ predictors.

Note: I am using the Python package statsmodels to do the natural cubic spline fitting (specifically the classes statsmodels.gam.smooth_basis.BSplines and statsmodels.gam.generalized_additive_model.GLMGam). I don't know what basis functions BSplines uses; I've included the functions above just as an example.

Here is a simple reproducible example where I try to fit a non-linear response:

import numpy as np
from statsmodels.gam.smooth_basis import BSplines
from statsmodels.gam.generalized_additive_model import GLMGam
import matplotlib.pyplot as plt

n_pts = 500

# scaling factor for noise in data
noise_scl = 0.2

X = np.arange(n_pts)

# create non-linear response
y = np.sin(np.arange(n_pts) / 150) * np.cos(np.arange(n_pts) / 75) + np.random.normal(
    loc=0, scale=noise_scl, size=n_pts
)

deg_free = 8
degree = 3

# get the basis functions
bs = BSplines(X, df=deg_free, degree=degree, include_intercept=True)

# get generalized additive model and apply basis functions
base_model = GLMGam(y, X, smoother=bs)

# fit spline regression
res = base_model.fit()

# this equation is wrong but is my best attempt
# (np.maximum takes the elementwise positive part)
y_eq = (
    res.params[0]
    + res.params[1] * X
    + res.params[2] * np.maximum(X - bs.smoothers[0].knots[2], 0) ** 3
    + res.params[3] * np.maximum(X - bs.smoothers[0].knots[3], 0) ** 3
    + res.params[4] * np.maximum(X - bs.smoothers[0].knots[4], 0) ** 3
    + res.params[5] * np.maximum(X - bs.smoothers[0].knots[5], 0) ** 3
    + res.params[6] * np.maximum(X - bs.smoothers[0].knots[6], 0) ** 3
    + res.params[7] * np.maximum(X - bs.smoothers[0].knots[7], 0) ** 3
    + res.params[8] * np.maximum(X - bs.smoothers[0].knots[8], 0) ** 3
)

fig, ax = plt.subplots()
ax.scatter(X, y, color="blue")
ax.plot(X, res.fittedvalues, color="red", label="fittedvalues")
ax.plot(X, y_eq, color="orange", label="equation")
fig.legend()
plt.show()

Florent H
  • Look at Section 2.4.5 of Frank Harrell's Regression Modeling Strategies. I think that uses the same basis functions and shows the general formula to express results in the form you want. The rcspline.restate() function in his Hmisc package implements that directly from a set of knot positions and coefficients, so you could double check against the code for that function. – EdM Jun 06 '23 at 20:44
  • Thank you @EdM. Is the restatement for $K = 4$ as simple as $f(X) = \beta_1 + \beta_2 X + \beta_3 (X - \xi_2)_+^3 + \beta_4 (X - \xi_3)_+^3$? Does it depend on the basis functions used? – Florent H Jun 06 '23 at 21:02
  • Perhaps https://stats.stackexchange.com/a/101484/919 answers your question? – whuber Jun 06 '23 at 21:04
  • I only know Python and have elementary knowledge of R, so I'm having a little trouble interpreting your code. But it does look like your answer over there is relevant to my question. – Florent H Jun 06 '23 at 21:10
  • Please look carefully at the page linked by @whuber, as his answer shows marvelously how to translate from the ns version of basis functions in the R splines package, and the page includes a brief answer from Frank Harrell about use of the truncated power basis that he prefers and is noted in my prior comment. If there's still something unclear, please edit the question to state the outstanding issues, as comments are easy to overlook and can be deleted. – EdM Jun 06 '23 at 21:16
  • I've read the page several times, but I'm still lost. I think it's necessary for me to indicate what software I am using to perform the fit as the basis functions differ from software to software. I've added these details in the question. – Florent H Jun 07 '23 at 04:03
  • The B-spline basis is different from truncated power splines. B-splines are usually defined recursively. The splines in statsmodels are based on patsy https://patsy.readthedocs.io/en/latest/spline-regression.html with additional boundary knots options, and correspond to the R mgcv package, which has better documentation than the corresponding parts in statsmodels/patsy. – Josef Jun 07 '23 at 15:48
  • Thank you for your input @Josef. I read the relevant sections in the mgcv documentation but did not find a mathematical formulation for the B-spline bases. Do you have a specific page number from the PDF or perhaps another source I can look at? Or do you know the basis formulation yourself? – Florent H Jun 08 '23 at 03:54
  • It looks like I found a good source that describes the B-Spline bases! Page 5 of the PDF in A review of spline function procedures in R. Now I just need to understand if and how I can implement a one-line equation for the recursive bases. – Florent H Jun 08 '23 at 04:15
  • After a lot of trial and error and checking various other sources, I found the right formulation after I noticed a glaring mistake in the article I shared above. When calculating the basis $B_{d,k}(x)$, the recursive formula should be an addition of the two terms, and NOT a subtraction! It's also important to define $0/0=0$ everywhere in your recursive function. Here's a better source, but the reader should beware that the degree $d$ is defined differently than in statsmodels ($d_{statsmodels} = d_{source} - 1$). – Florent H Jun 08 '23 at 07:41
  • A good resource for numerical algorithms of this sort is Numerical Recipes. – whuber Jun 08 '23 at 13:20
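For anyone following along, the Cox-de Boor recursion described in the comments (the two terms are added, with the $0/0 = 0$ convention) can be sketched in NumPy as below. This is a hand-rolled illustration, not the statsmodels/patsy implementation, and `d` follows the statsmodels convention where `d = 3` means cubic:

```python
import numpy as np

def bspline_basis(x, t, i, d):
    """Evaluate B_{i,d}(x) on knot vector t via the Cox-de Boor recursion."""
    if d == 0:
        # indicator of the half-open knot interval [t_i, t_{i+1})
        return np.where((t[i] <= x) & (x < t[i + 1]), 1.0, 0.0)

    def safe(num, den):
        # the 0/0 = 0 convention for repeated knots
        return num / den if den != 0.0 else np.zeros_like(num)

    left = safe((x - t[i]) * bspline_basis(x, t, i, d - 1),
                t[i + d] - t[i])
    right = safe((t[i + d + 1] - x) * bspline_basis(x, t, i + 1, d - 1),
                 t[i + d + 1] - t[i + 1])
    return left + right  # addition, not subtraction
```

On a clamped knot vector, the basis functions are nonnegative and sum to one on the interior, which is a quick way to check the recursion.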

0 Answers