7

I've run a Random Forest in R using randomForest package.

The fitted forest I've called: fit.rf.

All I want to know is: when I type fit.rf, the output shows '% Var explained'. Is the % Var explained the out-of-bag variance explained?

Dave
  • 62,186
jc52766
  • 71

3 Answers

10

Yes, % explained variance is a measure of how well the out-of-bag predictions explain the target variance of the training set. Unexplained variance would be due to true random behaviour or lack of fit.

The % explained variance is retrieved by randomForest:::print.randomForest as the last element of fit.rf$rsq, multiplied by 100.

Documentation on rsq:

  • rsq (regression only) “pseudo R-squared”: 1 - mse / Var(y).

Here mse is the mean squared error of the OOB predictions versus the targets, and var(y) is the variance of the targets.
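To make the relationship concrete, here is a minimal sketch on toy data (the data and object names are purely illustrative) that recomputes the pseudo R-squared from the stored mse, following the internal formula, and compares it with the last element of rsq:

```r
library(randomForest)
set.seed(42)

# toy regression data, purely illustrative
x <- matrix(rnorm(200 * 3), ncol = 3)
y <- x[, 1] + rnorm(200)
fit.rf <- randomForest(x, y)

# rsq and mse are per-tree vectors; the last element is the final forest.
# rsq = 1 - mse / (population variance of y), where the population
# variance is var(y) * (n - 1) / n
n <- length(y)
manual <- 1 - fit.rf$mse[length(fit.rf$mse)] / (var(y) * (n - 1) / n)

# expected to agree with the stored value up to floating-point error
all.equal(manual, fit.rf$rsq[length(fit.rf$rsq)])
```

Multiplying the last rsq element by 100 and rounding gives exactly the '% Var explained' figure in the printed summary.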

See this answer also.

gibbone
  • 109
  • 3
  • Would not have thought that was possible, as then either the model variance ('mse') or the total variance ('Var(y)') would have to be negative. I'd like to see code that can produce a >100% performance. It can be negative, though, when the model variance is larger than the total variance. The implication is that you would likely be better off with a simple average than with a random forest model. – Soren Havelund Welling Nov 03 '15 at 16:51
  • 1
    [note] the above comment was a small answer, as somebody asked "What if explained variance is more than 100%?" and later deleted the comment again – Soren Havelund Welling Nov 05 '15 at 10:21
0

To add some details to the content of the other answer, the formula to get the explained variance displayed in the summary is:

#fit.rf <- randomForest(...) 
round(100 * fit.rf$rsq[length(fit.rf$rsq)], digits = 2)

You can check this by looking at what randomForest is printing with the command getAnywhere(print.randomForest).

Furthermore, this is equivalent to the following commands:

# recalculate using model output
round(100* (1 - var(fit.rf$y - fit.rf$predicted) / var(fit.rf$y)), digits = 2)

# recalculate using the formula for rsq used internally,
# see getAnywhere(randomForest.default)
n <- length(fit.rf$y)
rsq <- 1 - fit.rf$mse / (var(fit.rf$y) * (n - 1) / n)
round(100 * rsq[length(rsq)], digits = 2)

gibbone
  • 109
  • 3
0

This seems to be a misinterpretation of extending $R^2$ to more complicated situations than the usual in-sample OLS linear regression. In particular, the "proportion of variance explained" interpretation of $R^2$ is the exception, not the rule. As is derived in the link, that definition only applies when $\overset{N}{\underset{i=1}{\sum}}\left[ \left( y_i - \hat y_i \right)\left( \hat y_i - \bar y \right) \right] = 0$, which is not the case in a random forest regression.

library(randomForest)
set.seed(2023)
N <- 1000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
y <- x1*x2 + x3^2 + rnorm(N)
# d <- data.frame(x1, x2, x3, y)
forest <- randomForest(y ~ x1 + x2 + x3, mtry=3)
y_hat <- forest$predicted
y_bar <- mean(y)
# the cross term from the decomposition above; zero for in-sample OLS
# with an intercept, but not for a random forest
sum((y - y_hat) * (y_hat - y_bar))
# nonzero in general, so the "proportion of variance explained"
# interpretation does not apply

Indeed, the documentation gives this quantity as:

$$ 1-\left( \dfrac{ \text{MSE} }{ \text{var}\left(y\right) } \right) = 1-\left( \dfrac{ \dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}}\left( y_i - \hat y_i \right)^2 }{ \dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}}\left( y_i - \bar y \right)^2 } \right) = 1-\left( \dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i - \hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i - \bar y \right)^2 } \right) $$

The third of the three expressions is a common definition of $R^2$, so the linked information about $R^2$ applies.
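The role of that cross term can be checked directly: for any vector of predictions, the total sum of squares equals the residual plus "explained" sums of squares plus twice the cross term, so the "proportion of variance explained" reading only works when the cross term vanishes (as it does for in-sample OLS with an intercept). A self-contained sketch with arbitrary stand-in predictions:

```r
set.seed(1)
y     <- rnorm(50)
y_hat <- y + rnorm(50, sd = 0.5)  # stand-in for model predictions
y_bar <- mean(y)

SST   <- sum((y - y_bar)^2)                  # total sum of squares
SSE   <- sum((y - y_hat)^2)                  # residual sum of squares
SSR   <- sum((y_hat - y_bar)^2)              # "explained" sum of squares
cross <- sum((y - y_hat) * (y_hat - y_bar))  # cross term

# exact algebraic identity, holds for ANY predictions:
all.equal(SST, SSE + SSR + 2 * cross)  # TRUE
# SSE + SSR alone differs from SST whenever cross != 0
```

Since a random forest's OOB predictions generally leave a nonzero cross term, $1 - \text{MSE}/\text{var}(y)$ cannot be read as a proportion of variance split cleanly into explained and unexplained parts.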

This does not mean that such a value is worthless, however. Indeed, I have lots of thoughts on an $R^2$-style performance metric in complicated settings.

Dave
  • 62,186