
The anova function doesn't allow comparison of glmnet models:

library(glmnet)

data(QuickStartExample)
x <- QuickStartExample$x; y <- QuickStartExample$y
set.seed(11)
fit1 <- glmnet(x[,1:10], y, lambda=0.1)
fit2 <- glmnet(x[,11:20], y, lambda=0.1)

anova(fit1, fit2)

Error in UseMethod("anova") : no applicable method for 'anova' applied to an object of class "c('elnet', 'glmnet')"

Is there an alternative?

EDIT: Based on Roland's comment, I should perform cross-validation.

For predictors x[,1:10]:

cv.train.fit1 <- cv.glmnet(x[1:70,1:10], y[1:70])
best.lambda <- cv.train.fit1$lambda.min
fit1 <- glmnet(x[71:100,1:10], y[71:100], lambda=best.lambda)

For predictors x[,11:20]:

cv.train.fit2 <- cv.glmnet(x[1:70,11:20], y[1:70])
best.lambda <- cv.train.fit2$lambda.min
fit2 <- glmnet(x[71:100,11:20], y[71:100], lambda=best.lambda)

But the question remains, how can I compare the performance of fit1 and fit2?

locus
  • 1,593
  • What do you want to compare, how the first ten features predict compared to the next ten features? – Dave Dec 07 '23 at 00:53
  • I'm not sure what the goal is here. ANOVA compares nested models and your models are not nested. Also, if you have a reason to test two sets of predictors, you should compare LASSO regression predictive performance, e.g., based on cross-validation. – Roland Dec 07 '23 at 07:01
  • @Dave, yes, the idea would be to test if the predictors in f1 explain y better than the predictors in f2 – locus Dec 07 '23 at 07:58
  • @Roland, yes, that's the idea, testing two sets of predictors against each other. Do you mean running cv.glmnet(x,y) for f1 predictors and f2 predictors separately? – locus Dec 07 '23 at 08:00
  • Well, the function returns the mean cross-validated error for each lambda. But you probably should also test/compare performance with data that wasn't used for tuning lambda. Basically, apply machine learning principles because your problem seems to be a machine learning problem. – Roland Dec 07 '23 at 08:11
  • Thanks @Roland, but I'm a bit confused. Should I run cv.glmnet on training data and then test it on test data for both sets of predictors? But then how do I compare which set has more predictive power? I edited my question with an example – locus Dec 07 '23 at 09:22

1 Answer


Here is what I believe could be useful:

set.seed(42)
f <- 1:70

#tune model on training fold
cv.train.fit1 <- cv.glmnet(x[f,1:10], y[f])
best.lambda <- cv.train.fit1$lambda.min
fit1 <- glmnet(x[f,1:10], y[f], lambda=best.lambda)

cv.train.fit2 <- cv.glmnet(x[f,11:20], y[f])
best.lambda <- cv.train.fit2$lambda.min
fit2 <- glmnet(x[f,11:20], y[f], lambda=best.lambda)

#assess performance on testing fold
sqrt(mean((y[-f] - predict(fit1, newx = x[-f,1:10]))^2))
#[1] 2.188628
sqrt(mean((y[-f] - predict(fit2, newx = x[-f,11:20]))^2))
#[1] 2.549974

Clearly, fit1 performs better for this fold. You could turn this into cross-validation by creating a loop that subsets the data into training and testing folds. You should have additional hold-out data to assess final performance of the selected model.
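
As a minimal sketch of such a loop (my own illustration, not code from the thread: the number of folds k, the random fold assignment, and RMSE as the error metric are all assumptions), one could collect a test error per fold for each predictor set:

#k-fold cross-validation comparing the two predictor sets (illustrative sketch)
k <- 5
set.seed(42)
folds <- sample(rep(1:k, length.out = nrow(x)))  #random, non-overlapping fold labels

rmse1 <- rmse2 <- numeric(k)
for (i in 1:k) {
  train <- which(folds != i)
  test  <- which(folds == i)

  #tune lambda and refit on the training fold, predictors 1:10
  cv1 <- cv.glmnet(x[train,1:10], y[train])
  m1  <- glmnet(x[train,1:10], y[train], lambda=cv1$lambda.min)
  rmse1[i] <- sqrt(mean((y[test] - predict(m1, newx = x[test,1:10]))^2))

  #same procedure for predictors 11:20
  cv2 <- cv.glmnet(x[train,11:20], y[train])
  m2  <- glmnet(x[train,11:20], y[train], lambda=cv2$lambda.min)
  rmse2[i] <- sqrt(mean((y[test] - predict(m2, newx = x[test,11:20]))^2))
}

mean(rmse1); mean(rmse2)  #average test RMSE per predictor set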

Roland
  • 6,611
  • Thanks Roland, that was helpful. If I create a loop that subsets the training data, can the first iteration be samples [1,4,10,33,63] and the second iteration samples [1,4,18,29,63]? Or do the samples always need to be different in each loop, for example, sample1=[1,4,10,22,63], sample2=[3,5,36,59,70], sample3=[2,6,8,11,23], etc.? – locus Dec 07 '23 at 22:59
  • So your suggestion is to create a vector with the mean error from each iteration, right? So let's say fit1.errors=[2.18,1.42,1.23,1.12] and fit2.errors=[2.36,3.34,1.55,3.74]. I can see the average fit1.errors is smaller, but how can I assess whether it's meaningfully smaller? Should I simply run a t.test or something between fit1.errors and fit2.errors? – locus Dec 07 '23 at 23:02
  • I decided to ask a separate question as there are at least three points I'm still unclear about. – locus Dec 08 '23 at 01:04
  • @locus You don't seem to know how cross-validation works. The Wikipedia article might get you started ... – Roland Dec 08 '23 at 06:00
  • According to Wikipedia, both are valid. Partitioning the samples without overlap in the different folds would be K-fold cross-validation, with overlap would be Monte Carlo cross-validation – locus Dec 08 '23 at 10:56
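
Following up on the t.test question in the comments, a hedged sketch (rmse1 and rmse2 are the per-fold error vectors from the loop sketched above, my own illustrative names, not objects from the thread); fold errors are not independent, so treat the resulting p-value with caution:

#naive paired comparison of per-fold errors (illustrative only;
#fold errors are correlated, so interpret the test cautiously)
t.test(rmse1, rmse2, paired = TRUE)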