Why random forest loses stability on new data, but GMDH works great

Question

I made a toy dataset in which the test part does not look like the training part. I would like to know why random forest performs so badly compared to GMDH (see Wikipedia).

My question: why does random forest lose stability on test data while GMDH does not?

set.seed(111)
# Sinusoid with amplitude a, frequency f and phase p, evaluated at ve
s <- function(ve = 1:500, a, f, p) a * sin(f * ve + p)
# Three noisy components with different amplitudes and frequencies
s1 <- s(a = 1,   f = 0.01, p = 0) + rnorm(500, sd = 0.05)
s2 <- s(a = 0.5, f = 0.06, p = 2) + rnorm(500, sd = 0.05)
s3 <- s(a = 0.1, f = 0.12, p = 4) + rnorm(500, sd = 0.05)
X <- cbind(s1, s2, s3)
Y <- s1 + s2 + s3   # the target is the exact sum of the three predictors
tr <- 1:300         # training rows
ts <- 301:500       # test rows
library(randomForest)
rf <- randomForest(Y[tr] ~ ., X[tr, ], ntree = 500, mtry = ncol(X))
rf_pred <- predict(rf, X)
library(GMDHreg)
gmdh <- gmdh.gia(X = X[tr, ], y = Y[tr], prune = 5, criteria = "PRESS")
gmdh_pred <- predict(gmdh, X)[, 1]
par(mfrow = c(2, 1), mar = c(2, 2, 2, 2))
matplot(X, t = "l", lty = 1)
abline(v = length(tr), col = 1, lty = 2, lwd = 2)
plot(Y, t = "l", col = "gray90", lwd = 10, main = "Target")
abline(v = length(tr), col = 1, lty = 2, lwd = 2)
lines(rf_pred, col = 2, lwd = 1)
lines(gmdh_pred, col = 4, lwd = 1)
legend(x = "bottomleft",
       legend = c("original", "GMDH", "RandomForest"),
       col = c(8, 4, 2), lwd = c(5, 1, 1))

Why even study a case where the test set is not similar to the training set? — Richard Hardy, Apr 15 '23 at 15:01
Might the reason be that GMDH naturally accommodates a model like y=s1+s2+s3, while trees (in a forest) do not? If you tried a linear model, you would get about the same performance as from the GMDH, would you not? — Richard Hardy, Apr 15 '23 at 15:20
I am not an expert in machine learning so my questions may be stupid. — mr.T, Apr 15 '23 at 15:24

Firebug · Accepted Answer · 2023-04-15T15:34:27.960

As shown in "Random forest regression not predicting higher than training data", a random forest predicts by averaging subsets of the response values in the training set. Naïvely, it therefore cannot predict values lower or higher than those seen in the training examples.
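The bound can be checked directly on the toy data from the question. The following is a minimal sketch, assuming the `randomForest` package is installed; it rebuilds the same series in a data frame (the name `d` is introduced here for illustration) and compares the range of the forest's test predictions with the range of the training target.

```r
library(randomForest)

# Rebuild the toy series from the question
set.seed(111)
s  <- function(ve = 1:500, a, f, p) a * sin(f * ve + p)
s1 <- s(a = 1,   f = 0.01, p = 0) + rnorm(500, sd = 0.05)
s2 <- s(a = 0.5, f = 0.06, p = 2) + rnorm(500, sd = 0.05)
s3 <- s(a = 0.1, f = 0.12, p = 4) + rnorm(500, sd = 0.05)
d  <- data.frame(Y = s1 + s2 + s3, s1 = s1, s2 = s2, s3 = s3)
tr <- 1:300
ts <- 301:500

rf      <- randomForest(Y ~ ., data = d[tr, ], ntree = 500, mtry = 3)
rf_pred <- predict(rf, d[ts, ])

# Every prediction is an average of training responses,
# so it stays inside the training range of Y.
range(d$Y[tr])   # range of the target seen in training
range(d$Y[ts])   # range of the target in the test period
range(rf_pred)   # range of the forest's test predictions
```

If the reasoning above holds, the last range sits inside the first one, even though the test-period target moves outside it.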

The GMDH (a regression over basis functions) can have terms that match the data-generating process, which allows it to extrapolate. In fact, you gave it exactly the terms that generated the data: Y is the sum of the three columns of X.
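As suggested in the comments on the question, an ordinary linear model should behave much like the GMDH here. The sketch below uses base R's `lm` as a stand-in, not `GMDHreg` itself, and the data frame `d` is a name introduced for illustration. Because Y is an exact linear function of s1, s2 and s3, the fit can recover that relationship from the training rows and carry it over to the test rows.

```r
# Rebuild the toy series from the question
set.seed(111)
s  <- function(ve = 1:500, a, f, p) a * sin(f * ve + p)
s1 <- s(a = 1,   f = 0.01, p = 0) + rnorm(500, sd = 0.05)
s2 <- s(a = 0.5, f = 0.06, p = 2) + rnorm(500, sd = 0.05)
s3 <- s(a = 0.1, f = 0.12, p = 4) + rnorm(500, sd = 0.05)
d  <- data.frame(Y = s1 + s2 + s3, s1 = s1, s2 = s2, s3 = s3)
tr <- 1:300
ts <- 301:500

# Linear model with the same terms that generated Y
fit     <- lm(Y ~ s1 + s2 + s3, data = d[tr, ])
lm_pred <- predict(fit, d[ts, ])

coef(fit)                      # intercept near 0, slopes near 1
max(abs(lm_pred - d$Y[ts]))    # largest absolute error on the test rows
```

A model whose functional form matches the data-generating process does not depend on the test target staying within the training range, which is the contrast with the forest.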

edited Apr 15 '23 at 15:34

answered Apr 15 '23 at 15:22


Yeah, that is what I noted in a comment to the OP. "on averages" could be expanded into "on averages of subsets of values of the response variable in the training set"; that would make it even clearer (at least to me). – Richard Hardy Apr 15 '23 at 15:32
I incorporated it, thanks @RichardHardy :) – Firebug Apr 15 '23 at 15:33
