I am using the following code to produce a random forest variable importance plot:

statRF <- randomForest(formula = Trend ~ ., data = data[,features], sampsize=c(600,600,600),mtry=6, ntree=500, importance=TRUE)
varImpPlot(statRF, cex=1.2)

[Image: variable importance plot produced by varImpPlot(statRF)]

However, when I extract the Mean Decrease in Accuracy directly from the model, I get completely different variable importance values:

statRF$importance
              Decreasing  Increasing   No Trend         MeanDecreaseAccuracy MeanDecreaseGini
EcoRegion      0.005331568 0.002025101 6.025702e-05         0.0009792462         6.340508
Geology        0.009487879 0.004385796 4.427072e-03         0.0047468217        25.811581
Avg1980        0.068535362 0.026512398 6.766761e-03         0.0165637391       171.622158
Fire_Group     0.114414044 0.023774639 1.941874e-02         0.0269273991        52.122888
FLOW_SUM       0.009836593 0.009120500 5.692553e-03         0.0069617922       130.574740
MEAN_SLOPE     0.011427702 0.003421026 2.723633e-03         0.0034971800       134.810582
MEAN_ELEVATION 0.071074497 0.027537933 3.030051e-02         0.0321650097       167.462789
NEAR_DIST      0.018364729 0.004711747 9.081642e-04         0.0031616073       133.859939
Latitude       0.065935569 0.035386208 2.414563e-02         0.0301581377       176.920755
Longtitude     0.098719411 0.060942430 4.483657e-02         0.0530569867       200.474059

sort(statRF$importance[,4], decreasing=TRUE)
    Longtitude MEAN_ELEVATION       Latitude     Fire_Group        Avg1980       FLOW_SUM        Geology
  0.0530569867   0.0321650097   0.0301581377   0.0269273991   0.0165637391   0.0069617922   0.0047468217
    MEAN_SLOPE      NEAR_DIST      EcoRegion
  0.0034971800   0.0031616073   0.0009792462

Notably, elevation is now the second "most important" variable instead of the fourth, and a few other variables have switched positions as well.

Is the varImpPlot function plotting something different from the MeanDecreaseAccuracy column of the random forest model? If so, how do I get those values?

EDIT: I can get the MeanDecreaseAccuracy values from the first plot with the following code:

var.imp <- varImpPlot(statRF)
var.imp <- as.data.frame(var.imp)

var.imp
               MeanDecreaseAccuracy MeanDecreaseGini
EcoRegion                  4.939973         6.340508
Geology                   16.326295        25.811581
Avg1980                   34.301641       171.622158
Fire_Group                49.419724        52.122888
FLOW_SUM                  18.991762       130.574740
MEAN_SLOPE                12.053575       134.810582
MEAN_ELEVATION            47.251207       167.462789
NEAR_DIST                 10.508457       133.859939
Latitude                  52.898975       176.920755
Longtitude                74.645221       200.474059

But I am still unclear why the scale and order are different in statRF$importance.

1 Answer

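The variable importance values in the plot are scaled by their standard errors. If you check the help page for varImpPlot, the default argument is scale=TRUE, which is passed on to the function importance(). Only the permutation-based accuracy measures are scaled; MeanDecreaseGini is left unchanged, which is why that column is identical in both of your outputs. To get the scaled values, you can use the importance() function as below: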

library(randomForest)
set.seed(111)
fit = randomForest(Species ~ .,data=iris,importance=TRUE)

importance(fit,scale=TRUE)
                setosa versicolor virginica MeanDecreaseAccuracy
Sepal.Length  6.716993  7.4654657  7.697842            10.869088
Sepal.Width   4.581990 -0.5208697  4.224459             3.772957
Petal.Length 22.155981 33.0549839 27.892363            33.272150
Petal.Width  22.497643 31.4966353 31.589361            33.123064
             MeanDecreaseGini
Sepal.Length         9.333510
Sepal.Width          2.425592
Petal.Length        43.324744
Petal.Width         44.146107

Or, to see how this is calculated, you can divide the stored raw importance values by their standard errors:

fit$importance[,1:4] / fit$importanceSD
                setosa versicolor virginica MeanDecreaseAccuracy
Sepal.Length  6.716993  7.4654657  7.697842            10.869088
Sepal.Width   4.581990 -0.5208697  4.224459             3.772957
Petal.Length 22.155981 33.0549839 27.892363            33.272150
Petal.Width  22.497643 31.4966353 31.589361            33.123064
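
As a quick sanity check of the relationship above, here is a minimal sketch that refits the same iris model and compares importance() with and without scaling against the components stored in the fitted object. The names scaled, unscaled and manual are only illustrative. It assumes none of the standard errors is numerically zero, since, as far as I can tell, importance() leaves such entries undivided.

```r
library(randomForest)

set.seed(111)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Importance as varImpPlot() sees it by default (scale = TRUE)
scaled <- importance(fit, scale = TRUE)

# Raw importance; this should match what is stored in fit$importance
unscaled <- importance(fit, scale = FALSE)
print(all.equal(unscaled, fit$importance))

# Manual scaling: raw accuracy columns divided by their standard errors
manual <- fit$importance[, 1:4] / fit$importanceSD
print(all.equal(scaled[, 1:4], manual, check.attributes = FALSE))

# The two rankings can differ because each variable has its own divisor
print(names(sort(scaled[, "MeanDecreaseAccuracy"], decreasing = TRUE)))
print(names(sort(unscaled[, "MeanDecreaseAccuracy"], decreasing = TRUE)))
```

If you want the plot itself to show the raw values, varImpPlot(fit, scale = FALSE) should give a plot whose ordering matches fit$importance.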

StupidWolf
  • Thank you! Do you know of any reason to use unscaled variable importance? – H.Traver Oct 27 '20 at 16:41
  • It's the actual decrease in accuracy. https://stats.stackexchange.com/questions/197827/how-to-interpret-mean-decrease-in-accuracy-and-mean-decrease-gini-in-random-fore – StupidWolf Oct 27 '20 at 16:58
  • However, this measure by itself can be unstable (see the last answer in the linked post). Hence you scale it by the standard error to get something more sensible. I am trying to find the actual link. – StupidWolf Oct 27 '20 at 17:00