I'm having trouble reconciling different values for the ridge parameter that minimizes mean squared error when using RidgeCV in scikit-learn (Python) and cv.glmnet (R).
First a few things to note:
- My code takes into account this post, which shows how the "lambda" regularization parameter in cv.glmnet relates to the "alpha" parameter in RidgeCV
- In the link above, @eickenberg states that RidgeCV uses unpenalized intercept estimation, while the glmnet documentation states it uses penalized maximum likelihood
Here is my R code which uses the mtcars data:
# Load libraries, get data & set seed for reproducibility
set.seed(123) # seed for reproducibility
library(glmnet) # for ridge regression
library(dplyr) # for data cleaning
data("mtcars")
# Center y, X will be standardized in the modelling function
y <- mtcars %>% select(mpg) %>% scale(center = TRUE, scale = FALSE) %>% as.matrix()
X <- mtcars %>% select(-mpg) %>% as.matrix()
# Perform 10-fold cross-validation to select lambda
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
# Setting alpha = 0 implements ridge regression
ridge_cv <- cv.glmnet(X, y, alpha = 0, lambda = lambdas_to_try,
standardize = TRUE, nfolds = 10)
# Best cross-validated lambda
lambda_cv <- ridge_cv$lambda.min
print(lambda_cv)
which results in 2.983647
My Python code using the same data (imported from a csv I exported in R):
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.linear_model import RidgeCV
# %% Define features of interest
df = pd.read_csv("PATH_TO_MTCARS.csv")
features = ['cyl', 'disp', 'hp', 'drat', 'wt', 'qsec',
'vs', 'am', 'gear', 'carb']
# %% Extract Variables
# Isolate features and target variable
X = df[features]
X = scale(X)
y = df['mpg'] - df['mpg'].mean()
# %% Ridge cross validation
n_alphas = 100
lambdas = np.logspace(-3, 5, n_alphas)
alphas = lambdas * 32 / 2 # convert glmnet lambdas to RidgeCV alphas, n = 32
ridge_cv = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error')
ridge_cv.fit(X, y)
print(ridge_cv.alpha_ * 2 / 32) # convert RidgeCV alpha to glmnet lambda
which results in 0.8111308
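One possible source of the discrepancy that I can at least control for: the two implementations do not split the data into the same folds. With an integer cv, scikit-learn's RidgeCV uses a deterministic, unshuffled KFold, while cv.glmnet assigns observations to folds at random (governed only by R's set.seed). A minimal sketch of making the folds explicit on the Python side, using synthetic data in place of mtcars:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

# Synthetic stand-in for the mtcars design (32 rows, 10 features)
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(32)

alphas = np.logspace(-3, 5, 100)

# Passing an explicit splitter instead of cv=10 pins down the fold assignment
folds = KFold(n_splits=10, shuffle=True, random_state=0)
ridge_cv = RidgeCV(alphas=alphas, cv=folds,
                   scoring='neg_mean_squared_error')
ridge_cv.fit(X, y)
print(ridge_cv.alpha_)  # one value from the alpha grid
```

On the R side, cv.glmnet accepts a foldid argument, so in principle the same fold assignment could be forced on both implementations before comparing the selected penalties.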
Funny thing is, if I set "scale = TRUE" in the y assignment line of the R code and remove the X = scale(X) line from the Python code, both output 0.4641589. To my mind these two changes produce non-equivalent models, so I don't understand why the results would agree.
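One sanity check I can do on the Python side: for RidgeCV with a fixed alpha grid and fixed folds, the ridge solution is linear in y, so rescaling y multiplies every fold's MSE by the same constant and the selected alpha_ cannot change. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)
X = rng.standard_normal((32, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(32)

alphas = np.logspace(-3, 5, 100)

def best_alpha(target):
    # Same grid, same deterministic folds each call
    return RidgeCV(alphas=alphas, cv=10,
                   scoring='neg_mean_squared_error').fit(X, target).alpha_

a1 = best_alpha(y)
a2 = best_alpha(y / y.std())  # standardized response
print(a1, a2)  # same value: MSE(alpha) just scales by 1/var(y)
```

So whatever is driving the different numbers, it should not be the scaling of y on the scikit-learn side.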
My understanding of the functions is:
- the cv.glmnet function standardizes the X variables automatically (i.e., subtracts the mean, then divides by the standard deviation)
- cv.glmnet selects the best lambda using the mean squared error of the residuals averaged across the folds
- the RidgeCV function does not standardize the X variables automatically
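The standardization point above can be checked numerically: sklearn's scale divides by the population (ddof=0, i.e. 1/n) standard deviation, which matches the 1/n convention glmnet uses when standardize = TRUE (as I understand its documentation). A sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import scale

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 10)) * 5 + 3  # non-zero mean, non-unit scale

# Manual standardization with the population (1/n) standard deviation;
# np.std defaults to ddof=0, which is the convention I believe glmnet uses
X_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# sklearn's scale uses the same ddof=0 convention
X_sklearn = scale(X)
print(np.allclose(X_manual, X_sklearn))  # True
```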
Should I expect RidgeCV and cv.glmnet to produce the same ridge parameter that minimizes mse, and if so, why doesn't my code do that?
glmnet does standardize the variables automatically. Check that using ?glmnet in R. It also uses the MSE of the residuals in the CV algorithm for Gaussian models (also check ?cv.glmnet for type.measure) – runr Dec 09 '19 at 10:41