
I know that a random forest is just a collection of many unpruned trees, and that averaging predictions across many trees reduces the variance and gives more consistent predictions. So how can it overfit?
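(As a rough sketch of that averaging intuition, with simulated predictions rather than real trees: the variance of the mean of n independent predictions is 1/n of a single prediction's variance.)

```python
import numpy as np

# Toy illustration (not real trees): each "tree" predicts the true value 0
# plus independent noise with variance 1.
rng = np.random.default_rng(0)
n_trials, n_trees = 10_000, 100
preds = rng.normal(size=(n_trials, n_trees))

print(preds[:, 0].var())         # single tree: ~1.0
print(preds.mean(axis=1).var())  # average of 100 trees: ~0.01
```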

And I have read that if you have too many trees, the ensemble (the random forest) will get really complex. But don't we average the predictions, so that complexity decreases?

One source says you won't face overfitting, and the other says you will.

Do you know which one is true?

  • Using typical pruning and other parameters I’ve seen massive overfitting from random forests. I no longer trust them. Anyone using RF should require smooth unbiased calibration curves before using the result. – Frank Harrell Oct 08 '23 at 12:02

1 Answer


It depends on the hyper-parameters of the random forest, such as the number of trees, the depth of each tree, and the number of features considered per tree.

For example, if you have 4 data features and each tree is fitted on 3 of them, there are 4 choose 3 = 4 different feature subsets. If you also use all the data instances to fit the trees (no bootstrap sampling), you can have at most 4 unique trees. So adding more trees would indeed over-fit the random forest, as the excess trees would simply be repeats of the unique ones that have already been fitted; a quick sketch of counting those subsets follows.
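Here is a minimal sketch of that count (hypothetical feature names, matching the 4-features example above):

```python
from itertools import combinations

# Hypothetical feature names for the 4-feature example.
features = ["f1", "f2", "f3", "f4"]
subsets = list(combinations(features, 3))

print(len(subsets))  # 4 -> at most 4 unique trees if every row is used
for s in subsets:
    print(s)
```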

Indeed, the prediction is a majority vote (for classification) or an average (for regression) over the trees in the forest, so whether the forest over-fits depends heavily on the data you have as well as on the hyper-parameters.
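As a minimal sketch of that dependence (using scikit-learn on arbitrary synthetic data, so the exact numbers are illustrative only), you can compare train and test accuracy for fully grown trees versus depth-limited ones; a large train/test gap indicates over-fitting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, chosen arbitrarily for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = fully grown (unpruned) trees
    rf = RandomForestClassifier(n_estimators=200, max_depth=depth,
                                random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={rf.score(X_tr, y_tr):.2f}, "
          f"test={rf.score(X_te, y_te):.2f}")
```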