I'm having trouble reconciling different values for the ridge parameter that minimizes mean squared error when using RidgeCV in scikit-learn (Python) and cv.glmnet (R).
First a few things to note:
- My code takes into account this post, which shows how the "lambda" regularization parameter in cv.glmnet relates to the "alpha" parameter in RidgeCV
- In the link above, @eickenberg states that RidgeCV uses unpenalized intercept estimation, while the glmnet documentation states it uses penalized maximum likelihood
Here is my R code which uses the mtcars data:
# Load libraries, get data & set seed for reproducibility
set.seed(123) # seed for reproducibility
library(glmnet) # for ridge regression
library(dplyr) # for data cleaning
data("mtcars")
# Center y, X will be standardized in the modelling function
y <- mtcars %>% select(mpg) %>% scale(center = TRUE, scale = FALSE) %>% as.matrix()
X <- mtcars %>% select(-mpg) %>% as.matrix()
# Perform 10-fold cross-validation to select lambda
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
# Setting alpha = 0 implements ridge regression
ridge_cv <- cv.glmnet(X, y, alpha = 0, lambda = lambdas_to_try,
standardize = TRUE, nfolds = 10)
# Best cross-validated lambda
lambda_cv <- ridge_cv$lambda.min
print(lambda_cv)
which results in 2.983647
My Python code using the same data (imported from a csv I exported in R):
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.linear_model import RidgeCV
# %% Define features of interest
df = pd.read_csv("PATH_TO_MTCARS.csv")
features = ['cyl', 'disp', 'hp', 'drat', 'wt', 'qsec',
'vs', 'am', 'gear', 'carb']
# %% Extract Variables
# Isolate features and target variable
X = df[features]
X = scale(X)
y = df['mpg'] - df['mpg'].mean()
# %% Ridge cross validation
n_alphas = 100
lambdas = np.logspace(-3, 5, n_alphas)
alphas = lambdas * 32 / 2 # convert glmnet lambdas to RidgeCV alphas, n = 32
ridge_cv = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error')
ridge_cv.fit(X, y)
print(ridge_cv.alpha_ * 2 / 32) # convert RidgeCV alpha to glmnet lambda
which results in 0.8111308
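One possible source of the discrepancy that I can at least control for: the two implementations do not split the data into the same folds. With an integer cv, scikit-learn's RidgeCV uses a deterministic, unshuffled KFold, while cv.glmnet assigns observations to folds at random (governed only by R's set.seed). A minimal sketch of making the folds explicit on the Python side, using synthetic data in place of mtcars:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

# Synthetic stand-in for the mtcars design (32 rows, 10 features)
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(32)

alphas = np.logspace(-3, 5, 100)

# Passing an explicit splitter instead of cv=10 pins down the fold assignment
folds = KFold(n_splits=10, shuffle=True, random_state=0)
ridge_cv = RidgeCV(alphas=alphas, cv=folds,
                   scoring='neg_mean_squared_error')
ridge_cv.fit(X, y)
print(ridge_cv.alpha_)  # one value from the alpha grid
```

On the R side, cv.glmnet accepts a foldid argument, so in principle the same fold assignment could be forced on both implementations before comparing the selected penalties.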
Funny thing is, if I set "scale = TRUE" in the y assignment line of the R code and remove the X = scale(X) line from the Python code, both output 0.4641589. To my mind these two changes produce non-equivalent models, so I don't understand why the results would agree.
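One sanity check I can do on the Python side: for RidgeCV with a fixed alpha grid and fixed folds, the ridge solution is linear in y, so rescaling y multiplies every fold's MSE by the same constant and the selected alpha_ cannot change. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)
X = rng.standard_normal((32, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(32)

alphas = np.logspace(-3, 5, 100)

def best_alpha(target):
    # Same grid, same deterministic folds each call
    return RidgeCV(alphas=alphas, cv=10,
                   scoring='neg_mean_squared_error').fit(X, target).alpha_

a1 = best_alpha(y)
a2 = best_alpha(y / y.std())  # standardized response
print(a1, a2)  # same value: MSE(alpha) just scales by 1/var(y)
```

So whatever is driving the different numbers, it should not be the scaling of y on the scikit-learn side.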
My understanding of the functions is:
- the cv.glmnet function standardizes the X variables automatically (i.e., subtracts the mean, then divides by the standard deviation)
- cv.glmnet selects the best lambda using the mean squared error of the residuals averaged across the folds
- the RidgeCV function does not standardize the X variables automatically
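The standardization point above can be checked numerically: sklearn's scale divides by the population (ddof=0, i.e. 1/n) standard deviation, which matches the 1/n convention glmnet uses when standardize = TRUE (as I understand its documentation). A sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import scale

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 10)) * 5 + 3  # non-zero mean, non-unit scale

# Manual standardization with the population (1/n) standard deviation;
# np.std defaults to ddof=0, which is the convention I believe glmnet uses
X_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# sklearn's scale uses the same ddof=0 convention
X_sklearn = scale(X)
print(np.allclose(X_manual, X_sklearn))  # True
```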
Should I expect RidgeCV and cv.glmnet to produce the same ridge parameter that minimizes mse, and if so, why doesn't my code do that?
glmnet does standardize the variables automatically. Check that using ?glmnet in R. It also uses the MSE of the residuals in the CV algorithm for Gaussian models (also check ?cv.glmnet for type.measure) – runr Dec 09 '19 at 10:41