
The anova function doesn't allow comparison of glmnet models:

library(glmnet)

data(QuickStartExample)
x <- QuickStartExample$x; y <- QuickStartExample$y
set.seed(11)
fit1 <- glmnet(x[,1:10], y, lambda=0.1)
fit2 <- glmnet(x[,11:20], y, lambda=0.1)

anova(fit1, fit2)

Error in UseMethod("anova") : no applicable method for 'anova' applied to an object of class "c('elnet', 'glmnet')"

Is there an alternative?

EDIT: Based on Roland's comment, I should perform cross-validation.

For predictors x[,1:10]:

cv.train.fit1 <- cv.glmnet(x[1:70,1:10], y[1:70])
best.lambda <- cv.train.fit1$lambda.min
fit1 <- glmnet(x[71:100,1:10], y[71:100], lambda=best.lambda)

For predictors x[,11:20]:

cv.train.fit2 <- cv.glmnet(x[1:70,11:20], y[1:70])
best.lambda <- cv.train.fit2$lambda.min
fit2 <- glmnet(x[71:100,11:20], y[71:100], lambda=best.lambda)

But the question remains, how can I compare the performance of fit1 and fit2?

locus
  • 1,593
  • What do you want to compare, how the first ten features predict compared to the next ten features? – Dave Dec 07 '23 at 00:53
  • I'm not sure what the goal is here. ANOVA compares nested models and your models are not nested. Also, if you have a reason to test two sets of predictors, you should compare LASSO regression predictive performance, e.g., based on cross-validation. – Roland Dec 07 '23 at 07:01
  • @Dave, yes, the idea would be to test if the predictors in f1 explain y better than the predictors in f2 – locus Dec 07 '23 at 07:58
  • @Roland, yes, that's the idea, testing two sets of predictors against each other. Do you mean running cv.glmnet(x,y) for f1 predictors and f2 predictors separately? – locus Dec 07 '23 at 08:00
  • Well, the function returns the mean cross-validated error for each lambda. But you probably should also test/compare performance with data that wasn't used for tuning lambda. Basically, apply machine learning principles because your problem seems to be a machine learning problem. – Roland Dec 07 '23 at 08:11
  • Thanks @Roland, but I'm a bit confused. Should I run cv.glmnet on training data and then test it on test data for both sets of predictors? But then how do I compare which set has more predictive power? I edited my question with an example – locus Dec 07 '23 at 09:22

1 Answer


Here is what I believe could be useful:

set.seed(42)
f <- 1:70

#tune model on training fold
cv.train.fit1 <- cv.glmnet(x[f,1:10], y[f])
best.lambda <- cv.train.fit1$lambda.min
fit1 <- glmnet(x[f,1:10], y[f], lambda=best.lambda)

cv.train.fit2 <- cv.glmnet(x[f,11:20], y[f])
best.lambda <- cv.train.fit2$lambda.min
fit2 <- glmnet(x[f,11:20], y[f], lambda=best.lambda)

#assess performance on testing fold
sqrt(mean((y[-f] - predict(fit1, newx = x[-f,1:10]))^2))
#[1] 2.188628
sqrt(mean((y[-f] - predict(fit2, newx = x[-f,11:20]))^2))
#[1] 2.549974

Clearly, fit1 performs better for this fold. You could turn this into cross-validation by creating a loop that subsets the data into training and testing folds. You should have additional hold-out data to assess final performance of the selected model.
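
As a minimal sketch of such a loop (my own illustration, not code from the thread: the number of folds k, the random fold assignment, and RMSE as the error metric are all assumptions), one could collect a test error per fold for each predictor set:

#k-fold cross-validation comparing the two predictor sets (illustrative sketch)
k <- 5
set.seed(42)
folds <- sample(rep(1:k, length.out = nrow(x)))  #random, non-overlapping fold labels

rmse1 <- rmse2 <- numeric(k)
for (i in 1:k) {
  train <- which(folds != i)
  test  <- which(folds == i)

  #tune lambda and refit on the training fold, predictors 1:10
  cv1 <- cv.glmnet(x[train,1:10], y[train])
  m1  <- glmnet(x[train,1:10], y[train], lambda=cv1$lambda.min)
  rmse1[i] <- sqrt(mean((y[test] - predict(m1, newx = x[test,1:10]))^2))

  #same procedure for predictors 11:20
  cv2 <- cv.glmnet(x[train,11:20], y[train])
  m2  <- glmnet(x[train,11:20], y[train], lambda=cv2$lambda.min)
  rmse2[i] <- sqrt(mean((y[test] - predict(m2, newx = x[test,11:20]))^2))
}

mean(rmse1); mean(rmse2)  #average test RMSE per predictor set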

Roland
  • 6,611
  • Thanks Roland, that was helpful. If I create a loop that subsets the training data, can the first iteration be samples [1,4,10,33,63] and the second iteration samples [1,4,18,29,63]? Or do the samples always need to be different in each loop, for example, sample1=[1,4,10,22,63], sample2=[3,5,36,59,70], sample3=[2,6,8,11,23], etc.? – locus Dec 07 '23 at 22:59
  • So your suggestion is to create a vector with the mean error from each iteration, right? So let's say fit1.errors=[2.18,1.42,1.23,1.12] and fit2.errors=[2.36,3.34,1.55,3.74]. I can see the average fit1.errors is smaller, but how can I assess whether it's meaningfully smaller? Should I simply run a t.test or something between fit1.errors and fit2.errors? – locus Dec 07 '23 at 23:02
  • I decided to ask a separate question as there are at least three points I'm still unclear about. – locus Dec 08 '23 at 01:04
  • @locus You don't seem to know how cross-validation works. The Wikipedia article might get you started ... – Roland Dec 08 '23 at 06:00
  • According to Wikipedia, both are valid. Partitioning the samples without overlap in the different folds would be K-fold cross-validation, with overlap would be Monte Carlo cross-validation – locus Dec 08 '23 at 10:56
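
Following up on the t.test question in the comments, a hedged sketch (rmse1 and rmse2 are the per-fold error vectors from the loop sketched above, my own illustrative names, not objects from the thread); fold errors are not independent, so treat the resulting p-value with caution:

#naive paired comparison of per-fold errors (illustrative only;
#fold errors are correlated, so interpret the test cautiously)
t.test(rmse1, rmse2, paired = TRUE)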