
It appears that I have misunderstood cross-validation for several months, so I'd like help clarifying the idea with a concrete example rather than only the theory from some great SE questions. My fundamental misunderstanding comes from looking at the code. Coming off of this article, it's apparent that CV is used to estimate how well the eventual final model will perform. But when using a CV method from trainControl, is that best model selected internally?

TL;DR: what exactly is a "final model"?

See my example for more:

library(psych)   # provides the sat.act dataset
library(dplyr)
library(caret)

data(sat.act)
sat.act <- na.omit(sat.act)

#rename outcome and make as factor
sat.act <- sat.act %>% mutate(gender = ifelse(gender == 1, "male", "female"))
sat.act$gender <- as.factor(sat.act$gender)

#create train and test
indexes <- createDataPartition(y = sat.act$gender, p = 0.7, list = FALSE)
train <- sat.act[indexes, ]
test <- sat.act[-indexes, ]

#set up RF
ctrl <- trainControl(method = "cv", number = 5, savePredictions = TRUE,
                     summaryFunction = twoClassSummary, classProbs = TRUE)

model <- train(gender ~ ., data = train, trControl = ctrl, method = "rf",
               preProc = c("center", "scale"), metric = "ROC", importance = TRUE)

print(model)
# some lines of output omitted:
# ROC was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 3.

> model$resample
        ROC      Sens      Spec Resample
1 0.5861751 0.8225806 0.2285714    Fold1
2 0.6845351 0.8064516 0.3529412    Fold3
3 0.4717742 0.7580645 0.1764706    Fold2
4 0.4817362 0.7419355 0.2647059    Fold5
5 0.6930876 0.8709677 0.4000000    Fold4

If we look at model, we see that it tried three different mtry values and picked mtry = 3. And model$resample gives the results on the held-out folds. But what does that mean for the final model that is predicting on the test data?
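For context, these are the pieces of the fitted train object I have been poking at (as far as I understand caret's accessors; correct me if any of this is off):

model$results    # one row per mtry value tried, with the cross-validated ROC/Sens/Spec
model$bestTune   # the winning tuning value (mtry = 3 here)
model$finalModel # the fitted model stored by train() -- this is the piece I'm trying to understand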

# predict the outcome on a test set
model_pred <- predict(model, test)

# compare predicted outcome and true outcome

confusionMatrix(model_pred, test$gender)

So is model_pred splitting the test set into five parts and then predicting each of the five folds with model, just like it did when we trained the model? Or is model_pred saying: I already found the best mtry, which was the only reason I did CV, so now when I predict, I'm using all of train, basically a one-fold CV, to predict on test? Does model_pred have its own version of resample? I always thought the latter, but I don't think this is correct.

Basically, if we manually pulled the fold indexes out, trained five models, and then predicted on the remaining fold of each using predict, would that give exactly the same thing as model$resample? (I've sketched what I mean just below.)
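To make the question concrete, here is roughly what I mean by "manually pulling the folds out". This is only a sketch: I'm creating fresh folds with createFolds and using pROC for the AUC, so the numbers would only match model$resample if caret's own fold indices were reused, and I've skipped the center/scale preprocessing.

# sketch of redoing the CV loop by hand for mtry = 3
library(randomForest)
library(pROC)

folds <- createFolds(train$gender, k = 5)          # five held-out index sets
fold_roc <- sapply(folds, function(held_out) {
  fit   <- randomForest(gender ~ ., data = train[-held_out, ], mtry = 3)
  probs <- predict(fit, train[held_out, ], type = "prob")[, "female"]
  as.numeric(auc(train$gender[held_out], probs))   # ROC AUC on the held-out fold
})
fold_roc   # compare with the ROC column of model$resample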

  • Have you taken a look at the results of model$finalModel? – Demetri Pananos Jun 11 '20 at 03:27
  • Yes I checked it out, but it's not really clear to me what is going into the final model. – PleaseHelp Jun 11 '20 at 14:47
  • OK, let me try to answer what is going on here, and if it still isn't clear I'll write a real answer. The folds in CV are used to estimate out of sample error. For each combination of parameters, fit a model using 4/5 of the data and predict on the last 1/5. Record the error on that prediction. Do that for each fold. The final model is fit on all the data using the parameters which yielded the best score (e.g. highest ROC). Calling predict uses the best parameters to make a single prediction for each observation. The folds are not used at prediction time, only in training. – Demetri Pananos Jun 11 '20 at 15:15
  • Thanks for the great explanation, I think I'm 80% of the way to understanding! So the final part I want to clarify: when predict is being used, what does that actual model look like? I imagine something along the lines of ctrl <- trainControl(method = "none") and model <- train(gender ~ ., data = train, trControl = ctrl, method = "rf", tuneGrid = data.frame(mtry = 3)) (if we're pulling the mtry from the above example)? I know you would not need to write this line in real life, but for my understanding I'm trying to visualize all the steps that are happening in that one line of caret code. (I've sketched what I mean below these comments.) – PleaseHelp Jun 11 '20 at 15:33
  • When you call predict, model$finalModel is used to make the prediction. The CV step finds the best parameter combo; caret then uses that combo to fit the model on the entire training set, stores the result in model$finalModel, and uses that to predict (though this happens behind the scenes; you don't have to explicitly tell R to use that model, it already knows). – Demetri Pananos Jun 11 '20 at 15:48
  • Sorry for harping on this one point, and thank you for your explanation so far. So model$finalModel yields randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE). So basically, at the end of it, in order to create the final model, it doesn't use train() but rather the randomForest function. So train() is really just a method for finding the best param, but randomForest is actually making the model? – PleaseHelp Jun 11 '20 at 15:55
  • More or less, yes. The train function is really a wrapper for the randomForest function. – Demetri Pananos Jun 11 '20 at 15:57
  • Thank you so much!! You cleared up months of confusion!!! – PleaseHelp Jun 11 '20 at 15:58
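For anyone who lands here later, this is the sketch I mentioned in the comments: my mental picture of what the final model amounts to once CV has settled on mtry = 3. It is not caret's literal internals (note, for instance, that train() also applies the center/scale preProc before fitting), just two roughly equivalent ways to express the refit.

# 1) the caret version: skip resampling and fix the tuning grid
ctrl_none   <- trainControl(method = "none", classProbs = TRUE)
final_caret <- train(gender ~ ., data = train, method = "rf",
                     trControl = ctrl_none, tuneGrid = data.frame(mtry = 3),
                     preProc = c("center", "scale"), importance = TRUE)

# 2) the underlying call, essentially what model$finalModel holds
library(randomForest)
final_rf <- randomForest(x = train[, setdiff(names(train), "gender")],
                         y = train$gender, mtry = 3, importance = TRUE)

# Either way, predict(model, test) uses that single refit model;
# the CV folds play no role at prediction time.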
