
I've been working on building a random forest model using h2o.ai in R for climate data. I know there is some issue with my understanding of random forests, my code, or my dataset, but I'm not sure exactly what is causing the model to have a very high MSE and a low percent variance explained. My apologies in advance if I've overlooked something very simple; I have spent a lot of time reading and testing but haven't been able to improve it.

So far I've tried adjusting the model parameters, reducing the number of correlated predictors, and checking my formulas and the input data for outliers and non-normality. Based on what I've researched, random forests have been used for similar data in the past. I am using 70 rows in total with a 0.7 train/test split. The full dataset I've created is 12M rows; to create this subset I took the mean for each of 70 regions (a sketch of that aggregation step follows this paragraph). I have also tested on the entire dataset with no significant change. Here are my code, header, and current results:
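For reference, the aggregation step described above might look roughly like the following; dNBR_full and the region column are placeholders for the 12M-row data and its grouping key, which are not shown in the question:

library(dplyr)

#hypothetical sketch: collapse the ~12M-row dataset to one row per region
#by taking the mean of every numeric column (column names are assumed)
dNBR_model <- dNBR_full %>%
  group_by(region) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  select(-region)   #drop the grouping key before modelling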

##Code##

#load packages
library(corrplot)   #cor.mtest(), corrplot()
library(rsample)    #initial_split(), training(), testing()
library(h2o)
library(magrittr)   #%>%

#Check data
head(dNBR_model)
#summary(dNBR_model)

#correlation matrix
corNBR <- cor(dNBR_model)
dNBRcor <- cor.mtest(dNBR_model, conf.level = 0.95)

corrplot(corNBR, p.mat = dNBRcor$p, type = "upper", order = "hclust",
         insig = 'blank', addCoef.col = 'black', tl.col = "black", tl.srt = 45)

#Run RF model
set.seed(561)
dNBR_split <- initial_split(dNBR_model, prop = .7)
dNBR_train <- training(dNBR_split)
dNBR_test <- testing(dNBR_split)

y <- "dNBR"
x <- setdiff(names(dNBR_train), y)

#initialize h2o
h2o.init(max_mem_size = '50G')

#convert train and test sets to h2o frames
train.h2o <- as.h2o(dNBR_train)
dNBR_test.h2o <- as.h2o(dNBR_test)

testDRF <- h2o.randomForest(x, y, ntrees = 500, max_depth = 15, min_rows = 1,
                            mtries = 7, nbins = 20, sample_rate = 0.75,
                            training_frame = train.h2o,
                            validation_frame = dNBR_test.h2o)

testperf <- h2o.performance(testDRF)
summary(testDRF)

#percent variance explained
VE = ((1 - h2o.mse(testDRF))/(h2o.var(train.h2o$dNBR)))*100
print(VE)

#RMSE and RMSE as a percentage of the mean response
RMSE = h2o.mse(testDRF) %>% sqrt()
PRMSE = (RMSE/(mean(dNBR_test$dNBR)))*100
print(PRMSE)

#variable importance
h2o.varimp_plot(testDRF)
varimp <- h2o.varimp(testDRF)

#residual analysis
h2o.residual_analysis_plot(model = testDRF, newdata = dNBR_test.h2o)

##Results##

** Reported on validation data. **

MSE: 4034.157
RMSE: 63.51502
MAE: 47.38234
RMSLE: 0.1174528
Mean Residual Deviance: 4034.157

% variance explained: -167.545

[image: residuals plot]

##Header##

[image: header of the input data]

  • Welcome to Cross Validated! 1) How do you calculate the percent of variance explained? Do you know the math, or at least the software command? 2) When you say that you have 70 rows, do you mean that you have 70 observations in the model, from which you allocate 70% to training and 30% to testing? When you write that you "have tested on the entire dataset with no significant change," do you mean that you have applied your pipeline to all 12 million observations? – Dave Mar 16 '23 at 00:17
  • To clarify: 1) I've calculated percent variance explained as 1 - MSE/var(y) * 100. I couldn't find a command to do this in h2o, so I had to write it myself. 2) Yes, this model has 70 observations with 70% training, 30% testing. Yes, I also tried with all 12 million observations. – DGeospatial Mar 16 '23 at 00:21
  • Next, how is your in-sample performance? // If 1-MSE/var(y) < 0, do you see what that says about MSE/var(y)? – Dave Mar 16 '23 at 00:23
  • RMSE: 66.49103 MAE: 52.36736 RMSLE: 0.139077 Mean Residual Deviance: 4421.057 – DGeospatial Mar 16 '23 at 00:26
  • What about in-sample percent of variance explained? – Dave Mar 16 '23 at 00:27
  • In-sample %VE is -79.849 and out-of-sample is -167.545 (I think I had this mislabeled above). I know that it should not be negative. – DGeospatial Mar 16 '23 at 00:31
  • Is my model overfitting random noise in the predictor variables? – DGeospatial Mar 16 '23 at 01:00
  • Overfitting is a thought, but the fact that you have such poor performance even in-sample suggests another culprit. Overfitting leads to poor out-of-sample performance, but kind of the whole point is that overfitting leads to an in-sample performance that does not generalize, yet your in-sample performance is worse than that of a model that predicts $\bar y$ every time, as I discuss here. This is hardly the sign of a model that has detected coincidences in the data and fit to them; the model cannot even do that much! – Dave Mar 16 '23 at 01:32
  • Thank you Dave, just checked out that post. It's back to the drawing board with this data. I've realized that this model just has terrible performance. – DGeospatial Mar 16 '23 at 01:44
  • I think your question remains legitimate: why can't this model fit in-sample at least as well as a model that has no features? – Dave Mar 16 '23 at 01:48
  • This is why I'm even more confused: when I checked Spearman's rank correlation between y and a few predictors, it showed a significant correlation. So some relationship does exist within the data. – DGeospatial Mar 16 '23 at 02:28
  • This will likely be very hard for us to help you with, since we don't have your data... – Stephan Kolassa Mar 16 '23 at 09:20
  • I think I ended up finding the issue! It looks like I typed the code for % variance explained in wrong. I tried the exact same data with similar parameters in the original randomForest R package and noticed that it and h2o reported similar MSE but not %VE. randomForest calculates pseudo R² as 1 - MSE/Var(y), whereas I had to type this in manually because h2o does not calculate it. Above, I had written the formula as ((1 - h2o.mse(testDRF))/(h2o.var(train.h2o$dNBR)))*100; after changing it to (1 - MSE/var(dNBR))*100 I get 20.08 %VE with randomForest and 24.82 %VE with h2o DRF (see the corrected sketch after these comments). – DGeospatial Mar 16 '23 at 20:35
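For clarity, here is a minimal sketch of the corrected calculation described in the last comment, i.e. randomForest's pseudo R² of 1 - MSE/Var(y) expressed as a percentage, reusing the testDRF model and train.h2o frame from the code above:

#percent variance explained, corrected: (1 - MSE/Var(y)) * 100
#(the original code computed ((1 - MSE)/Var(y)) * 100 by mistake)
VE <- (1 - h2o.mse(testDRF) / h2o.var(train.h2o$dNBR)) * 100
print(VE)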

1 Answer


(In the comments, I see you think it is solved, so this is more general advice.)

To identify the root cause of a poor model, you should start by getting a baseline model to compare against.

As you are using H2O, that is very easy: you can just swap h2o.randomForest for, say, h2o.glm (a generalized linear model). You could also try h2o.deeplearning for a neural net, but a linear model is usually a good baseline because it is quick and its defaults are sensible. A rough sketch of such a swap is shown below.
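As a rough sketch of that swap, reusing x, y, train.h2o, and dNBR_test.h2o from the question's code:

#baseline GLM with default settings, on the same frames as the random forest
testGLM <- h2o.glm(x = x, y = y,
                   training_frame = train.h2o,
                   validation_frame = dNBR_test.h2o)
h2o.performance(testGLM, valid = TRUE)

#optionally, a default-settings neural net as a second comparison
#testDL <- h2o.deeplearning(x = x, y = y,
#                           training_frame = train.h2o,
#                           validation_frame = dNBR_test.h2o)

If the GLM's validation error is in the same ballpark as the forest's, the problem is unlikely to be the random forest settings.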

If the results are equally bad, the data, or how it is loaded, becomes the suspect. If the GLM is much better, the way you are using the random forest becomes the suspect.

With H2O you also have H2O Flow: open a web browser at http://localhost:54321/ to see the models and data you have loaded, and explore and analyze them there.

Another thing, which also came up in the comments: if you are overfitting, you should be seeing good scores when evaluating on your training data. If you don't, I'd be suspicious of the data again. One way to check is to compare the training and validation metrics of the same model, as sketched below.
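For example, assuming the testDRF model from the question, H2O can report both sets of metrics for the same model:

#metrics on the training data vs. the validation data for the same model
h2o.performance(testDRF, train = TRUE)
h2o.performance(testDRF, valid = TRUE)

#or just the MSE from each
h2o.mse(testDRF, train = TRUE, valid = TRUE)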

E.g. with the data below, the very best training accuracy you could hope for is about 0.33, because each distinct input row appears with three different labels, so at most one in three can ever be predicted correctly:

x1,x2,y
1,2,A
2,3,A
1,2,B
2,3,B
1,2,C
2,3,C

This kind of thing can happen if the data has been loaded badly, e.g. the parser thought the data was comma-separated when it was actually semicolon-separated. Again, viewing the data in H2O Flow can confirm whether this has happened; a quick check from R is sketched below.
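A minimal sketch of that kind of check from R, with a placeholder file path: import the file, inspect its shape and first rows, and set the separator explicitly if the parser guessed wrong:

#import and inspect: a single mis-parsed column is an immediate red flag
raw <- h2o.importFile("path/to/data.csv")   #placeholder path
h2o.dim(raw)    #number of rows and columns
h2o.head(raw)   #do the column names and values look right?

#if the separator was guessed wrongly, set it explicitly
raw <- h2o.importFile("path/to/data.csv", sep = ";")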

Darren Cook