It appears that I have completely misunderstood cross-validation for several months, so I want your help clarifying the idea with an example rather than only the theory from some great SE questions. My fundamental misunderstanding comes from looking at the code. Coming off of this article, it's apparent that you use CV to estimate how good the performance of the theoretical final model will be. But when using a CV method from trainControl, is that best model selected internally?
TL;DR: what exactly is a "final model"?
See my example for more:
library(caret)
library(dplyr)
library(psych)   # provides the sat.act data set

data(sat.act)
sat.act <- na.omit(sat.act)
#recode outcome and make it a factor
sat.act <- sat.act %>% mutate(gender = ifelse(gender == 1, "male", "female"))
sat.act$gender <- as.factor(sat.act$gender)
#create train and test sets (70/30, stratified on the outcome)
set.seed(42)   # for a reproducible split
indexes <- createDataPartition(y = sat.act$gender, p = 0.7, list = FALSE)
train <- sat.act[indexes, ]
test  <- sat.act[-indexes, ]
#set up RF with 5-fold CV
ctrl <- trainControl(method = "cv",
                     number = 5,
                     savePredictions = TRUE,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
model <- train(gender ~ ., data = train,
               trControl = ctrl,
               method = "rf",
               preProc = c("center", "scale"),
               metric = "ROC",
               importance = TRUE)
print(model) #some lines omitted below
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
> model$resample
ROC Sens Spec Resample
1 0.5861751 0.8225806 0.2285714 Fold1
2 0.6845351 0.8064516 0.3529412 Fold3
3 0.4717742 0.7580645 0.1764706 Fold2
4 0.4817362 0.7419355 0.2647059 Fold5
5 0.6930876 0.8709677 0.4000000 Fold4
If we look at model, we see that it tried three different values of mtry and picked mtry = 3. And model$resample gives the results on each held-out fold. But what does that mean for the final model that is predicting on the test data?
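For reference, the fitted object stores both pieces directly (standard caret fields; the exact numbers will vary with the random split):

model$bestTune                                        # the winning tuning value, e.g. mtry = 3
colMeans(model$resample[, c("ROC", "Sens", "Spec")])  # average performance over the 5 held-out folds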
# predict the outcome on the test set
model_pred <- predict(model, test)
# compare predicted outcome and true outcome
confusionMatrix(model_pred, test$gender)
So is model_pred splitting the test set into five parts and then predicting each of the five folds with model, just like what happened when we trained the model? Or is model_pred saying: I already found the best mtry, which was the only reason I did CV, so now when I predict I use all of train, basically turning this into a one-fold problem for predicting on test? Does model_pred have its own version of resample? I always thought the latter, but I don't think this is correct.
Basically, if we manually pulled the fold indexes out, trained five models, and then predicted on each remaining fold using predict, would that give exactly the same thing as model$resample?
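Here is a minimal sketch of that manual loop, an assumption on my part rather than caret's exact internals: it reuses the fold indices caret stored in model$control$index and refits with randomForest directly. The numbers will not match model$resample exactly, because caret also centers/scales inside each fold and random forests are stochastic.

library(randomForest)
library(pROC)

event <- levels(train$gender)[1]   # twoClassSummary treats the first factor level as the event
manual_roc <- sapply(model$control$index, function(in_rows) {
  held_out <- train[-in_rows, ]                        # this fold's held-out rows
  fit <- randomForest(gender ~ ., data = train[in_rows, ],
                      mtry = model$bestTune$mtry, importance = TRUE)
  probs <- predict(fit, held_out, type = "prob")[, event]
  as.numeric(auc(held_out$gender, probs))              # per-fold AUC, analogous to model$resample$ROC
})
manual_roc   # one value per fold, named Fold1..Fold5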
Comments:

model$finalModel? – Demetri Pananos Jun 11 '20 at 03:27

predict uses the best parameters to make a single prediction for each observation. The folds are not used at prediction time, only in training. – Demetri Pananos Jun 11 '20 at 15:15

So when predict is being used, what does that actual model look like? I imagine something along the lines of ctrl <- trainControl(method = "none") and model <- train(gender ~ ., data=train, trControl = ctrl, method= "rf", tuneGrid = data.frame(mtry = 3)) (if we're pulling the mtry from the above example)? I know you would not need to write this line in real life, but for my understanding I'm trying to visualize all the steps that are happening in that one line of caret code. – PleaseHelp Jun 11 '20 at 15:33

model$finalModel is used to make the prediction. The CV step finds the best parameter combos, uses the best combo to fit that model on the entire training set, stores the result in model$finalModel, and uses that to predict (though this happens behind the scenes; you don't have to explicitly tell R to use that model, it already knows). – Demetri Pananos Jun 11 '20 at 15:48

model$finalModel yields randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE). So basically, at the end of it, in order to create the final model it doesn't use train() but rather the randomForest function. So train() is really just a method for finding the best param, but randomForest is actually making the model? – PleaseHelp Jun 11 '20 at 15:55

The train function is really a wrapper for the randomForest function. – Demetri Pananos Jun 11 '20 at 15:57
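To visualize those steps, here is a minimal sketch of the refit the comments describe caret doing behind the scenes, using the trainControl(method = "none") idea from the comments; its predictions should agree with predict(model, test) up to random-forest randomness:

#refit on ALL of train with the winning tuning value, no resampling
ctrl_none <- trainControl(method = "none", classProbs = TRUE)
final_only <- train(gender ~ ., data = train,
                    trControl = ctrl_none,
                    method = "rf",
                    preProc = c("center", "scale"),
                    tuneGrid = model$bestTune,   # one row, e.g. data.frame(mtry = 3)
                    importance = TRUE)
final_pred <- predict(final_only, test)          # should mirror predict(model, test), i.e. model$finalModel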