In a leave-one-out cross-validation test, I fit a ridge regression model on the training samples and predict the response for the held-out test sample. My accuracy metric is whether the samples are ranked correctly (i.e. the Spearman correlation between predictions and true responses) once predictions have been obtained for all samples (i.e. all folds). Interestingly, for large values of $\lambda$ (e.g. 300000), the result differs drastically depending on whether or not the intercept is fitted (i.e. whether the intercept argument of glmnet is set to TRUE), while for small values of $\lambda$ (e.g. 3) the results for intercept=TRUE and intercept=FALSE are very similar. Here is the example R code:
library(glmnet)
library(cvTools)
varsize = 500    # number of predictors
samsize = 100    # number of samples
lambda = 300000  # large penalty; swap with the commented line for the small-penalty case
# lambda = 3
set.seed(333)
X = matrix(rnorm(samsize*varsize), ncol=varsize)
set.seed(343)
w = matrix(rnorm(varsize), ncol=1)
set.seed(353)
eps = matrix(rnorm(samsize), ncol=1)
y = X %*% w + eps     # linear response plus noise
foldcount = samsize   # leave-one-out: as many folds as samples
folding = cvFolds(samsize, K = foldcount)
pred_r1 = numeric(samsize)  # predictions with intercept=TRUE
pred_r2 = numeric(samsize)  # predictions with intercept=FALSE
for (f in 1:foldcount) {
  trainsamp = folding$subsets[folding$which != f]
  testsamp = folding$subsets[folding$which == f]
  trainX = X[trainsamp, ]
  trainy = y[trainsamp]
  testX = X[testsamp, , drop = FALSE]
  # ridge fit (alpha = 0) with the intercept fitted
  glmob1 = glmnet(trainX, trainy, lambda = lambda, alpha = 0, standardize = FALSE, intercept = TRUE)
  pred_r1[testsamp] = predict(glmob1, newx = testX)
  # the same ridge fit without an intercept
  glmob2 = glmnet(trainX, trainy, lambda = lambda, alpha = 0, standardize = FALSE, intercept = FALSE)
  pred_r2[testsamp] = predict(glmob2, newx = testX)
}
print(c(cor(pred_r1, y, method='spearman'), cor(pred_r2, y, method='spearman')))
This code prints 0.5240924 0.5241524 for lambda=3, while it prints -0.9993759 0.5116952 for lambda=300000. I am totally confused by the value -0.9993759: it means that the samples are ranked almost exactly opposite to the true response when lambda=300000 and intercept=TRUE. How is this possible? Why does fitting the intercept (or not) change things so drastically for large values of $\lambda$? What is the statistical theory behind this?
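My current guess, which I would like confirmed or corrected: for very large $\lambda$ the penalized coefficients are shrunk essentially to zero, so with intercept=TRUE the prediction for the held-out sample is roughly the fitted intercept, i.e. close to the mean of the training responses mean(y[-i]); and since mean(y[-i]) is a strictly decreasing function of y[i], its Spearman correlation with y is exactly -1. A rough check of this guess (my own addition, to be run after the loop above with lambda = 300000):
# compare the intercept=TRUE predictions to the leave-one-out training means
loo_mean = sapply(1:samsize, function(i) mean(y[-i]))
print(max(abs(pred_r1 - loo_mean)))            # I expect this to be tiny relative to the spread of y
print(cor(loo_mean, y, method = 'spearman'))   # -1 by construction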
I have read in multiple places that the intercept is not penalized in ridge regression, but I would really appreciate it if someone could explain whether 'fitting the intercept' and 'penalizing the intercept' refer to the same thing, and why fitting it (which is the default behavior of glmnet) leads to such strange behavior.
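For reference, my understanding of the problem glmnet solves for alpha = 0 with intercept=TRUE (as I read the documentation; please correct me if this is wrong) is
$$\min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^\top\beta\right)^2+\frac{\lambda}{2}\lVert\beta\rVert_2^2,$$
so the intercept $\beta_0$ is fitted but not penalized, while intercept=FALSE simply forces $\beta_0 = 0$.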