
I'd like to compare performance between two models that use different sets of predictors. I'm trying to implement what Roland suggested in his answer:

library(glmnet)

set.seed(42)
data(QuickStartExample)
x <- QuickStartExample$x; y <- QuickStartExample$y

errors <- data.frame(err1 = rep(NA, 10), err2 = rep(NA, 10))

for (i in 1:10) {
  f <- split(sample(1:80), rep(c(1, 2), times = c(64, 16)))
  train <- f$`1`
  test <- f$`2`

  cv.train.fit1 <- cv.glmnet(x[train, 1:10], y[train])
  best.lambda <- cv.train.fit1$lambda.min
  fit1 <- glmnet(x[train, 1:10], y[train], lambda = best.lambda)

  cv.train.fit2 <- cv.glmnet(x[train, 11:20], y[train])
  best.lambda <- cv.train.fit2$lambda.min
  fit2 <- glmnet(x[train, 11:20], y[train], lambda = best.lambda)

  # assess performance on the testing fold
  errors$err1[i] <- sqrt(mean((y[test] - predict(fit1, newx = x[test, 1:10]))^2))
  errors$err2[i] <- sqrt(mean((y[test] - predict(fit2, newx = x[test, 11:20]))^2))
}

holdout.x <- x[81:100, ]
holdout.y <- y[81:100]

1 - Is it correct to sample the same data multiple times? In my example, different training folds (f$`1`) could contain identical elements, e.g., fold1 = [1, 4, 17, 2, ...], fold2 = [14, 1, 84, 2, ...], etc. Is this valid, or do the training (and test) folds always need to contain different elements?

2 - How can I evaluate the performance of fit1 vs fit2? Should I run a t-test between errors$err1 and errors$err2?

3 - Roland suggested assessing the final performance of the selected model on holdout data. I reserved the last 20 cases (holdout.x and holdout.y) for that purpose, but I'm unsure what to do with them...

locus
  • Could that JMLR "Time for a Change" paper by Benavoli apply here? – Dave Dec 08 '23 at 00:19
  • @Dave, do you mean regarding the second point? – locus Dec 08 '23 at 01:11
  • re 3) You want to assess accuracy and precision. Plot observed vs predicted, calculate the root mean squared error, ... for the hold-out data. 20 cases seems kind of small. If you don't have a decent amount of data, the model selection you are aiming for is probably doomed. – Roland Dec 08 '23 at 06:26
  • I think that paper hits all of these issues, at least to some extent. – Dave Dec 08 '23 at 15:17

1 Answer


Roland's example presents a single split of the data into two sets: training (f) and testing (-f). To perform cross-validation, you would build numerous splits, each with a unique collection of data points in the training and testing sets. The objective here is to ensure that each split is distinct, with no overlap between training and testing data within each split. To ensure strong cross-validation, each data point should appear in the test set exactly once across all splits.
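A minimal sketch of one way to build such folds for the 80 non-holdout rows (the names k and fold.id are illustrative, not part of the original code; every row is assigned to exactly one fold, so it appears in the test set exactly once):

# assign each of the 80 rows to one of k folds at random
k <- 5
fold.id <- sample(rep(1:k, length.out = 80))

for (i in 1:k) {
  test  <- which(fold.id == i)   # rows held out in this split
  train <- which(fold.id != i)   # all remaining rows, no overlap with test
  # fit and evaluate on x[train, ] and x[test, ] exactly as in the loop above
}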

To compare 'fit1' with 'fit2', you can tweak Roland's approach by adding a loop that repeats the procedure numerous times, each with a different training/testing split. You would gather error metrics for each iteration. After the cross-validation loop is complete, you can use a statistical test (such as a t-test) on the collected error metrics to compare the performance of 'fit1' with 'fit2'.
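For instance, once errors has been filled in by such a loop, the comparison could be as simple as the following (pairing by fold is a choice on my part, since err1[i] and err2[i] come from the same split; it is not something the question prescribes):

# fold-wise RMSEs from the same split are naturally paired
t.test(errors$err1, errors$err2, paired = TRUE)

# a nonparametric alternative if the differences look non-normal
wilcox.test(errors$err1, errors$err2, paired = TRUE)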

The holdout data, which you have reserved as the last 20 cases (holdout.x and holdout.y), should be used to assess the final performance of your chosen model. This data has not been used in either the training or the cross-validation process, so it provides an unbiased evaluation of the model's performance.

After you have chosen the better model based on your cross-validation results, you can test this model on the holdout data. Calculate the error metric (such as RMSE) on this holdout set to get an estimate of how well your model is likely to perform on unseen data. This final step is crucial for understanding the generalizability of your model.
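For example, if fit1's predictor set wins the cross-validation comparison, the final check could look like this (refitting the selected predictor set on all 80 non-holdout rows before scoring the holdout set is an assumption here, one common choice rather than something prescribed above):

# refit the chosen predictor set on all 80 non-holdout rows (assumption: refit on full training data)
cv.final  <- cv.glmnet(x[1:80, 1:10], y[1:80])
final.fit <- glmnet(x[1:80, 1:10], y[1:80], lambda = cv.final$lambda.min)

# RMSE on the 20 reserved cases
pred <- predict(final.fit, newx = holdout.x[, 1:10])
sqrt(mean((holdout.y - pred)^2))

# observed vs predicted, as Roland suggested in the comments
plot(holdout.y, pred); abline(0, 1)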

To sum up, Roland's approach is an example of a single train-test split, which you can extend into a cross-validation process. After the cross-validation, use the holdout data for a final performance assessment.

Lynchian
  • Thanks for your answer! "To ensure strong cross-validation, each data point should appear in the test set exactly once across all splits." Couldn't I use resampling with replacement for my testing data, similar to bootstrapping? I believe this would be an example of Monte Carlo cross-validation, wouldn't it? – locus Feb 15 '24 at 09:01