The splines package has functions bs and ns that create spline bases for use with the lm function. You can fit a plain linear model and a model including the splines, then use the anova function to run the full versus reduced model test and see whether the spline model fits significantly better than the linear model.
Here is some example code:
library(splines)

# simulated data with a clearly non-linear relationship
x <- rnorm(1000)
y <- sin(x) + rnorm(1000, 0, 0.5)

fit0 <- lm(y ~ 1)          # intercept-only model
fit1 <- lm(y ~ x)          # straight-line model
fit2 <- lm(y ~ bs(x, 5))   # B-spline model with 5 degrees of freedom

anova(fit1, fit2)   # does the spline fit significantly better than the straight line?
anova(fit0, fit2)   # overall test against the intercept-only model

plot(x, y, pch = '.')
abline(fit1, col = 'red')                       # linear fit
xx <- seq(min(x), max(x), length.out = 250)
yy <- predict(fit2, data.frame(x = xx))
lines(xx, yy, col = 'blue')                     # spline fit
You can also use the poly function to do a polynomial fit and test the non-linear terms as a test of curvature.
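For example, a minimal sketch continuing the simulated data above (the choice of a cubic is arbitrary):

fit3 <- lm(y ~ poly(x, 3))   # orthogonal polynomial of degree 3
summary(fit3)                # t tests on the quadratic and cubic terms test for curvature
anova(fit1, fit3)            # or test all non-linear terms at once against the linear fit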
For the loess fit it is a little more complicated. There are estimates of the equivalent degrees of freedom of a loess fit that could be used, along with the $R^2$ values for the linear and loess models, to construct an F test. I think methods based on bootstrapping and permutation tests may be more intuitive.
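For instance, here is a rough sketch of such a test built from the equivalent number of parameters (enp) and hat-matrix trace that a loess object reports; the degrees of freedom are approximations, so treat the resulting p-value as a guide rather than an exact test:

fit.lo <- loess(y ~ x)                  # default span and degree
rss.lin <- sum(resid(fit1)^2)           # residual SS of the straight line
rss.lo  <- sum(resid(fit.lo)^2)         # residual SS of the loess fit
df1 <- fit.lo$enp - 2                   # extra equivalent parameters beyond the line's 2
df2 <- length(y) - fit.lo$trace.hat     # approximate residual df for loess
Fstat <- ((rss.lin - rss.lo) / df1) / (rss.lo / df2)
pf(Fstat, df1, df2, lower.tail = FALSE) # approximate p-value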
There are also techniques to compute and plot a confidence interval for a loess fit (the ggplot2 package does this by default with geom_smooth). You can plot the confidence band and see whether a straight line would fit within the band (this is not a p-value, but it still gives a yes/no answer).
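A minimal sketch with ggplot2's geom_smooth, which adds a pointwise confidence band by default:

library(ggplot2)
ggplot(data.frame(x = x, y = y), aes(x, y)) +
  geom_point(size = 0.3) +
  geom_smooth(method = "loess") +                        # loess fit with confidence band
  geom_smooth(method = "lm", se = FALSE, col = "red")    # straight line for comparison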
You could fit a linear model, take the residuals, and fit a loess model with the residuals as the response (and the variable of interest as the predictor). If the true model is linear, this fit should be close to a flat line, and reordering the residuals relative to the predictor should not make any difference. You can use this to create a permutation test: fit the loess and find the predicted value farthest from 0; then randomly permute the residuals, fit a new loess, and find its predicted value farthest from 0; repeat a bunch of times. The p-value is the proportion of permuted values that are farther from 0 than the original value.
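A sketch of that permutation test on the simulated data above (the choice of 999 permutations is arbitrary):

r <- resid(fit1)                            # residuals from the linear fit
maxdev <- function(res) {
  lo <- loess(res ~ x)                      # loess of (possibly permuted) residuals on x
  max(abs(predict(lo)))                     # fitted value farthest from 0
}
obs <- maxdev(r)
perm <- replicate(999, maxdev(sample(r)))   # permute residuals relative to the predictor
mean(perm >= obs)                           # permutation p-value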
You may also want to look at cross-validation as a method of choosing the loess bandwidth. This does not give a p-value, but an infinite bandwidth corresponds to a perfectly linear fit. If the cross-validation suggests a very large bandwidth, that suggests a linear model may be reasonable; if the larger bandwidths are clearly inferior to some of the smaller bandwidths, that suggests definite curvature, and linear is not sufficient.
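A minimal 10-fold cross-validation sketch over a grid of span values (loess's version of the bandwidth; the grid here is arbitrary):

d <- data.frame(x = x, y = y)
spans <- c(seq(0.2, 1, by = 0.1), 1.5, 2)    # larger span = smoother fit
folds <- sample(rep(1:10, length.out = nrow(d)))
cv.mse <- sapply(spans, function(s) {
  pred <- numeric(nrow(d))
  for (k in 1:10) {
    lo <- loess(y ~ x, data = d[folds != k, ], span = s,
                control = loess.control(surface = "direct"))  # allow prediction outside the fold's range
    pred[folds == k] <- predict(lo, d[folds == k, ])
  }
  mean((d$y - pred)^2)
})
plot(spans, cv.mse, type = 'b')   # do the large spans compete with the small ones?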
Comments:

…anova with splines approach. For the F test from $R^2$, consider that $R^2$ is SSR divided by SST and $1-R^2$ is SSE divided by SST, so the ratio $\frac{R^2}{1-R^2}$ is just SSR divided by SSE (the two copies of SST cancel out). Include the degrees of freedom and a little algebra and you have the F statistic for overall regression. – Greg Snow Jan 31 '15 at 21:06

1) What is lm(y~bs(x,5)) doing, and why is it not lm(y~I(bs(x,5)))? I am quite confused by this call because the result of bs(x,5) is not a variable... 2) Do I understand it correctly that the p-value I am looking for is the result of anova(fit0,fit2)? – Tomas Jan 31 '15 at 21:23

The bs function creates a matrix of basis splines that is passed to the lm function, and lm knows how to deal with the results of functions like this. The anova with fit0 compares your splines to an overall mean; the one with fit1 compares them to a linear relationship. I think the 2nd is probably the more interesting. The comparison to fit0 is actually the same as the overall F test from summary(fit2). – Greg Snow Jan 31 '15 at 23:59

So lm(y~bs(x,5)) is not ordinary linear regression on the result of bs(x,5), but a completely different spline regression? Maybe this is the confusing part, because I thought lm only does linear regression. – Tomas Feb 01 '15 at 00:09

The bs function creates transforms of the $x$ variable and passes them to lm, which does the linear regression. – Greg Snow Feb 01 '15 at 00:20

…(if or step function, I think, right?). So this still confuses me a bit... – Tomas Feb 01 '15 at 01:40
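In symbols, the overall-regression F statistic sketched in the first comment above, for a model with $p$ predictors and $n$ observations, is

$$F = \frac{R^2/p}{(1-R^2)/(n-p-1)} = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n-p-1)}.$$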