Consider a predictor $X$ that predicts a non-linear response $y$. A natural cubic spline with $K$ knots can be fit to $y$.
Several different linear basis expansions in $X$ can be used to fit the natural cubic spline, such as the following $K$ basis functions—from Hastie, Tibshirani & Friedman (2017):
\begin{align} h_1(X) &= 1,\\ h_2(X) &= X,\\ h_{k+2}(X) &= d_k(X) - d_{K-1}(X), \quad k = 1, \dots, K-2, \end{align}
where
$$d_k(X) = \frac{(X-\xi_k)_+^3-(X-\xi_K)_+^3}{\xi_K-\xi_k},$$ $\xi_k \,$ is the $k$-th knot location, and $\, t_+ \,$ indicates that only the positive part is kept.
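For concreteness, these formulas can be transcribed directly into NumPy (a sketch based only on the equations above; the function name `natural_spline_basis` and the example knots are mine, not from any package):

```python
import numpy as np

def d(X, k, knots):
    """d_k(X) from the definition above; k is 1-based, knots sorted ascending."""
    K = len(knots)
    num = np.maximum(X - knots[k - 1], 0) ** 3 - np.maximum(X - knots[K - 1], 0) ** 3
    return num / (knots[K - 1] - knots[k - 1])

def natural_spline_basis(X, knots):
    """Columns h_1(X), ..., h_K(X) for a natural cubic spline with the given knots."""
    K = len(knots)
    cols = [np.ones_like(X, dtype=float), np.asarray(X, dtype=float)]
    for k in range(1, K - 1):  # h_{k+2} = d_k - d_{K-1}
        cols.append(d(X, k, knots) - d(X, K - 1, knots))
    return np.column_stack(cols)
```

With $K$ knots this gives exactly $K$ columns, and the difference $d_k - d_{K-1}$ cancels the quadratic and cubic terms beyond the last knot, so each basis function is linear outside the boundary knots.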
The natural cubic spline model can then be expressed as
$$f(X) = \sum_{k=1}^{K} \beta_{k} h_{k}(X),$$
and the $\beta$ coefficients can be estimated using least squares
$$\hat{\beta} = \left( h(X)^T \, h(X) \right)^{-1} h(X)^T \, y,$$ where $h(X)$ denotes the matrix whose columns are the basis functions evaluated at the observed $X$.
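With the truncated-power basis written out, the least-squares step is just an ordinary linear regression on the basis columns. Here is a minimal sketch (this is not statsmodels' B-spline basis; the knot values are made up and `np.linalg.lstsq` stands in for whatever solver a given package uses):

```python
import numpy as np

def natural_spline_design(X, knots):
    """Design matrix [h_1(X), ..., h_K(X)] for the truncated-power basis above."""
    X = np.asarray(X, dtype=float)
    K = len(knots)
    # d(k) is d_{k+1} in the 1-based notation of the question (0-based index here)
    d = lambda k: (np.maximum(X - knots[k], 0) ** 3
                   - np.maximum(X - knots[K - 1], 0) ** 3) / (knots[K - 1] - knots[k])
    cols = [np.ones_like(X), X] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 200)
knots = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = np.sin(X) + rng.normal(scale=0.1, size=X.size)

H = natural_spline_design(X, knots)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # beta-hat = (H^T H)^{-1} H^T y
f_hat = H @ beta                              # reconstructed spline f(X)
```

Once `beta` is known, the single algebraic equation for the spline is just `H @ beta`, i.e. the basis formulas with the fitted coefficients plugged in; the open issue in the question is that the coefficients statsmodels returns are for its own B-spline basis, not this one.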
Question: How can one reconstruct the cubic spline in a single algebraic equation using only the fitted coefficients $\hat{\beta}$, the knot locations $\xi$, and the raw predictor $X$? I would appreciate a general expression for $N$ predictors.
Note: I am using the Python package statsmodels to do the natural cubic spline fitting (specifically the classes statsmodels.gam.smooth_basis.BSplines and statsmodels.gam.generalized_additive_model.GLMGam). I don't know what basis functions BSplines uses; I've included the functions above just as an example.
Here is a simple reproducible example where I try to fit a non-linear response:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.gam.smooth_basis import BSplines
from statsmodels.gam.generalized_additive_model import GLMGam

n_pts = 500
# scaling factor for noise in data
noise_scl = 0.2
X = np.arange(n_pts)
# create non-linear response
y = np.sin(np.arange(n_pts) / 150) * np.cos(np.arange(n_pts) / 75) + np.random.normal(
    loc=0, scale=noise_scl, size=n_pts
)
deg_free = 8
degree = 3
# get the basis functions
bs = BSplines(
    X,
    df=deg_free,
    degree=degree,
    include_intercept=True,
)
# get generalized additive model and apply basis functions
base_model = GLMGam(y, X, smoother=bs)
# fit spline regression
res = base_model.fit()
# this equation is wrong but is my best attempt
y_eq = (
    res.params[0]
    + res.params[1] * X
    + res.params[2] * np.maximum(X - bs.smoothers[0].knots[2], 0) ** 3
    + res.params[3] * np.maximum(X - bs.smoothers[0].knots[3], 0) ** 3
    + res.params[4] * np.maximum(X - bs.smoothers[0].knots[4], 0) ** 3
    + res.params[5] * np.maximum(X - bs.smoothers[0].knots[5], 0) ** 3
    + res.params[6] * np.maximum(X - bs.smoothers[0].knots[6], 0) ** 3
    + res.params[7] * np.maximum(X - bs.smoothers[0].knots[7], 0) ** 3
    + res.params[8] * np.maximum(X - bs.smoothers[0].knots[8], 0) ** 3
)
fig, ax = plt.subplots()
ax.scatter(X, y, color="blue")
ax.plot(X, res.fittedvalues, color="red", label="fittedvalues")
ax.plot(X, y_eq, color="orange", label="equation")
fig.legend()
plt.show()
Comments:

- […] `rcspline.restate()` function in his `Hmisc` package implements that directly from a set of knot positions and coefficients, so you could double check against the code for that function. – EdM Jun 06 '23 at 20:44
- […] R, so I'm having a little trouble interpreting your code. But it does look like your answer over there is relevant to my question. – Florent H Jun 06 '23 at 21:10
- […] `ns` version of basis functions in the R `splines` package, and the page includes a brief answer from Frank Harrell about use of the truncated power basis that he prefers and is noted in my prior comment. If there's still something unclear, please edit the question to state the outstanding issues, as comments are easy to overlook and can be deleted. – EdM Jun 06 '23 at 21:16
- […] `mgcv` package which has better documentation than the corresponding parts in statsmodels/patsy. – Josef Jun 07 '23 at 15:48
- […] statsmodels ($d_{statsmodels} = d_{source} - 1$). – Florent H Jun 08 '23 at 07:41