I'm trying to wrap my head around the concept of variable importance (for regression) from the randomForest package in R. I'm trying to find a mathematical definition of how the importance measures are calculated, specifically the IncNodePurity measure.
When I run ?importance, the randomForest documentation states:
The second measure (i.e., IncNodePurity) is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
So, if I am interpreting it correctly, for regression, the measure is the total decrease in the residual sum of squares (RSS) after splitting on the variable.
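Reading the documentation literally, I would write the quantity roughly as follows. This is only my own sketch: the notation ($v(n)$ for the variable used to split node $n$, $n_L$ and $n_R$ for its two child nodes, $n_{tree}$ for the number of trees) is an assumption of mine and does not come from the package.

$IncNodePurity(x_j) = \frac{1}{n_{tree}} \sum_{T} \sum_{n \in T:\, v(n) = x_j} \left( RSS_{n} - RSS_{n_L} - RSS_{n_R} \right)$

where $RSS_{n} = \sum_{i \in n} (y_{i} - \bar{y}_{n})^2$ and $\bar{y}_{n}$ is the mean response of the observations falling in node $n$.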
Can anyone help me find a mathematical definition of this method, so I can help clarify this concept in my mind? I have searched quite a bit and although there are a lot of explanations on the internet, no one seems to define this method mathematically.
Would I be correct in saying that it is the difference in MSE measured both before and after a split? If the MSE is given by:
$MSE = \frac{1}{n}\sum_{i=1}^n(y_{i}-y_{i}^p)^2$
and $ \Delta i$ is the decrease from splitting:
$\Delta i = MSE_{before} - MSE_{after}$
The impurity decrease from each split, recorded over all nodes (n) and all trees (T), would then be given by something like:
$IMP = \sum_{T} \sum_{n} \Delta i(n,T)$
I'm basing this on information I found stating that this importance measure is analogous to the Gini index.
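To check how an RSS-based decrease relates to an MSE-based one, here is a small toy sketch in plain Python. It is not the randomForest implementation: the function names, the example responses and the per-tree lists of decreases are all hypothetical, and the MSE after the split is assumed to pool both child nodes over the parent's sample size.

```python
# Toy sketch (not the randomForest implementation): impurity decrease for one
# regression split, measured with RSS and with a pooled MSE.

def rss(ys):
    """Residual sum of squares of the responses around their own mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def rss_decrease(left, right):
    """RSS of the parent node minus the summed RSS of its two children."""
    return rss(left + right) - (rss(left) + rss(right))

def mse_decrease(left, right):
    """Same decrease on the MSE scale (children pooled over the parent size)."""
    n_parent = len(left) + len(right)
    return rss(left + right) / n_parent - (rss(left) + rss(right)) / n_parent

def forest_importance(decreases_by_tree):
    """Sum one variable's decreases within each tree, then average over trees."""
    return sum(sum(tree) for tree in decreases_by_tree) / len(decreases_by_tree)

# Hypothetical responses falling into the two children of a single split.
left, right = [1.0, 2.0, 3.0], [10.0, 11.0, 12.0]
d_rss = rss_decrease(left, right)
d_mse = mse_decrease(left, right)
n_parent = len(left) + len(right)
print(d_rss, d_mse, n_parent * d_mse)

# Hypothetical decreases for one variable: two splits in tree 1, one in tree 2.
print(forest_importance([[d_rss, 4.0], [100.0]]))
```

Under that pooling assumption, the RSS decrease for a split equals the MSE decrease multiplied by the number of observations in the parent node, so the two versions weight splits differently once they are summed over nodes of different sizes.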
Some discussion relating this importance measure to MSE can be found here: In a random forest, is larger %IncMSE better or worse?
IncNodePurity measure for regression. – Electrino Aug 10 '19 at 12:18

mse0 is the usual MSE in the context of regression. It is just the case that the measure is not renamed as IncNodeMSE in the case of regression, but aside from that everything follows in the same way. For example see: Breiman (2001) in section 11 where it directly deals with MSE as the metric used for the generalisation error. – usεr11852 Aug 16 '19 at 23:11

IncNodePurity is calculated, except using MSE instead of RSS... and then summed over all splits and trees? – Electrino Aug 17 '19 at 20:09