Modeling intercept in ridge regression in a leave-one-out cross-validation test

Question

In a leave-one-out cross validation test, I am calling a ridge regression model for the training samples, and predict the response for the test samples. And my metric for the accuracy is whether the samples are ranked correctly (i.e. spearman correlation) after I get the predictions for all samples (i.e. all folds). Interestingly, for high values of $\lambda$ (300000 for example), the result is extremely different depending on whether or not the intercept is fitted (meaning intercept parameter in glmnet is set to TRUE), while the result is very similar for intercept=TRUE and intercept=FALSE for small values of $\lambda$ (3 for example). Here is the example R code:

library(glmnet)
library(cvTools)

varsize = 500
samsize = 100
lambda = 300000
# lambda = 3

set.seed(333)
X = matrix(rnorm(samsize*varsize), ncol=varsize)
set.seed(343)
w = matrix(rnorm(varsize), ncol=1)
set.seed(353)
eps = matrix(rnorm(samsize), ncol=1)
y = X %*% w + eps

foldcount = samsize
folding = cvFolds(samsize, K = foldcount)

pred_r1 = numeric(samsize)
pred_r2 = numeric(samsize)
for (f in 1:foldcount) {
    trainsamp = folding$subsets[folding$which!=f]
    testsamp = folding$subsets[folding$which==f]
    trainX = X[trainsamp,]
    trainy = y[trainsamp]
    testX = X[testsamp,,drop=F]

    glmob1 = glmnet(trainX, trainy, lambda=lambda, alpha = 0, standardize=F, intercept=T)   
    pred_r1[testsamp] = predict(glmob1, newx=testX)
    glmob2 = glmnet(trainX, trainy, lambda=lambda, alpha = 0, standardize=F, intercept=F)
    pred_r2[testsamp] = predict(glmob2, newx=testX)
}

print(c(cor(pred_r1, y, method='spearman'), cor(pred_r2, y, method='spearman')))

This code prints 0.5240924 0.5241524 for lambda=3 while it prints -0.9993759 0.5116952 for lambda=300000. I am totally confused by the value -0.9993759. It means that samples are ranked almost completely opposite to the true response when lambda=300000 and intercept=T. How come this is possible? Why fitting the intercept or not changes things that severely for high values of $\lambda$? What is the statistical theory behind it?

I read in multiple places (like here or here or here) that the intercept is not penalized in ridge regression, but I would really appreciate if someone explains if 'fitting the intercept' and 'penalizing the intercept' correspond to the same thing, and why fitting it (which is actually the default behavior of glmnet) results in such a strange behavior.

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

First of all: Always include the intercept (i.e. "fitting the intercept") in your model unless you do not have a very good reason to belive that your intercept is 0 (e.g. because you centered your prediction target y and your predictors X). Otherwise your model will be uninterpretable and unfit for prediction.

You are correct that the intercept is not penalized by lambda in ridge regression by default. Therefore, at very large levels of lambda, all predictors will be close to null and predictions will essentially reflect the intercept, which will essentially be the mean of your training sample. At lower levels of lambda, the other parameters will be larger and eventually outweigh the influence the intercept on your predictions. Now, the strong negative correlation for intercept='T' is a result of your leave-one-out cross-validation (LOOCV): As described above, each left-out y will be predicted to be approx. at the mean of the current training sample. Large left-out y's in your sample will always be larger than the training mean and therefore under-estimated. Vice-versa, low values of y will always be over-estimated, since they would drag the mean down — thus the negative correlation. Your cross-validation results basically represent distance of each data-point to the global mean. Just try varying the size of foldcount (e.g. foldcount = 5). The closer the foldcount gets to the sample-size the larger the negative correlation.

Thank you, this is quite helpful. I understand that for very high values of $\lambda$, all estimated coefficients other than the intercept ($w_0$) are negligible, and the result is almost equal to $w_0$. However, I am not very clear about the opposite ranking, though. Opposite ranking means that if the true response is small, then the estimate (so the intercept since $\lambda$ is high) is big. Why is that? — user5054, Feb 10 '17 at 18:28

Modeling intercept in ridge regression in a leave-one-out cross-validation test

1 Answers1