
I am fitting a random forest and a classification tree. I only have numeric variables, no factors, so I have some questions regarding the output. Background on the variables: Prob_1 takes values between 0 and 1 (I divided the real values by 100 to get values between 0 and 1); all of the other variables used are between 1 and 100.

First question is regarding this output:

Regression tree:
tree(formula = Prob_1 ~ ., data = P14_Q1_2, subset = train)
Variables actually used in tree construction:
[1] "Flowers"             "Herbaceous_area" "Woodland"     
Number of terminal nodes:  4 
Residual mean deviance:  0.06002 = 3.301 / 55 
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.63250 -0.12570  0.03416  0.00000  0.14140  0.59530 

I can see the residual mean deviance is 0.06002, but when I fit the tree with the values on the 1-100 scale (Prob_1 is just the real values divided by 100), I got a residual mean deviance over 600, which seems a lot. How can I tell whether the residual mean deviance is good, and if it is too high, what are the solutions?

I calculated the MSE:

> mean((yhat - boston.test)^2)
[1] 0.1234633

What are considered to be good values of MSE, when can I say the tree is predicting correctly?

Lastly:

> RF1 <- randomForest(Prob_par ~ ., data = P14_Q1_2,
+                            subset = train, ntree=50000, mtry = 1, importance = TRUE)
> RF1

Call:
 randomForest(formula = Prob_par ~ ., data = P14_Q1_2, ntree = 50000, mtry = 1, importance = TRUE, subset = train)
               Type of random forest: regression
                     Number of trees: 50000
No. of variables tried at each split: 1

      Mean of squared residuals: 0.08347299
                % Var explained: 0.69

The % Var explained = 0.69, which is quite low, and when changing mtry, the value sometimes even has a minus sign (for example -3.22). What are the solutions?

Dave
Sisi

1 Answer


I got the residual mean deviance over 600, which seems a lot?

This is exactly what should happen. By dividing your Prob_1 outcome variable by $100$, you change the units (such as going between centimeters and meters).

When you divide by $100$, the error is $0.06002\space m^2$. When you do not, the error is $600.2\space cm^2$. Because the error is squared, it scales by $100^2 = 10{,}000$ rather than by $100$.

$$0.06002\space m^2 = 600.2\space cm^2$$

You might not be working in meters and centimeters, but these error values all have units, and you are getting the same answer whether you divide by $100$ or not, just in different units.
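The scaling can be checked numerically. Below is a minimal sketch on simulated data (not the data from the question), using `lm` as a stand-in for the tree so that it runs in base R; all variable names are made up for illustration. Fitting the same model to the outcome on the 0-100 scale and on the 0-1 scale should give mean squared errors that differ by a factor of $100^2$.

```r
set.seed(1)                       # simulated data, not the asker's
n <- 60
x <- runif(n, 1, 100)             # a predictor on the 1-100 scale

# Hypothetical outcome on the 0-100 scale, clipped to stay in range
y_pct  <- pmin(pmax(50 + 0.3 * x + rnorm(n, sd = 20), 0), 100)
y_prop <- y_pct / 100             # the same outcome divided by 100

# lm is a stand-in for tree(); the scaling argument is the same
fit_pct  <- lm(y_pct ~ x)
fit_prop <- lm(y_prop ~ x)

mse_pct  <- mean(residuals(fit_pct)^2)
mse_prop <- mean(residuals(fit_prop)^2)

# Ratio of the two errors: 100^2, up to floating-point error
ratio <- mse_pct / mse_prop
print(ratio)
```

Neither number says the model is better or worse; they are the same error expressed in different squared units.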

What are considered to be good values of MSE, when can I say the tree is predicting correctly?

See this for why that requires a context.
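One way to give an MSE some context is to take its square root, which puts the error back in the units of the outcome. This is a minimal sketch using the MSE reported in the question; the variable names are mine, and the last step assumes the test outcome was on the divided-by-100 scale, which the question does not state explicitly.

```r
mse  <- 0.1234633                  # test MSE reported in the question
rmse <- sqrt(mse)                  # root mean squared error, in the outcome's units

# Assumption: the outcome was divided by 100, so multiply back
rmse_original_scale <- 100 * rmse

print(c(rmse = rmse, rmse_original_scale = rmse_original_scale))
```

Whether a typical miss of that size is acceptable depends on what the predictions will be used for.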

The % Var explained = 0.69, which is quite low, and when changing mtry, the value sometimes even has a minus sign (for example -3.22).

Again, whether or not a particular measure of performance is any good requires a context, and it might be that your value of $0.69$ is pretty good! For instance, I have seen papers in top journals with values a tenth as high as that.

Regarding the values below zero, in a nonlinear regression like a random forest, the notion of "proportion of variance explained" is a bit dubious, as I explain here. However, you can regard that value as being a comparison of the mean squared error of your model to that of a baseline model that you must beat. If your value is less than zero, your model is doing a worse job of predicting than that "must beat" model is doing. If you get a result that your model performance can range from a solid value of $0.69$ to a totally unacceptable value less than zero, it would seem that your predictions are unstable.
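The comparison to a baseline can be sketched in a few lines of base R. The helper `pseudo_r2` and the simulated data are hypothetical and only for illustration; the baseline here is a model that always predicts the mean of the outcome. As far as I know, `randomForest` computes its version from out-of-bag predictions and prints it multiplied by 100, so the printed "% Var explained" is on a percentage scale.

```r
# Pseudo R^2: 1 - MSE(model) / MSE(baseline that always predicts the mean)
pseudo_r2 <- function(y, yhat) {
  1 - mean((y - yhat)^2) / mean((y - mean(y))^2)
}

set.seed(42)                          # simulated data, not the asker's
y <- runif(60)                        # outcome in [0, 1]

good  <- y + rnorm(60, sd = 0.05)     # predictions close to the truth
noise <- runif(60)                    # predictions unrelated to the truth
base  <- rep(mean(y), 60)             # the "must beat" baseline itself

r2_good  <- pseudo_r2(y, good)        # close to 1
r2_noise <- pseudo_r2(y, noise)       # below 0: worse than the baseline
r2_base  <- pseudo_r2(y, base)        # 0 by construction

print(c(good = r2_good, noise = r2_noise, baseline = r2_base))
```

A negative value is therefore not a bug in the software; it means the predictions have a larger mean squared error than simply predicting the mean every time.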

Dave