
I made a toy dataset in which the test part does not resemble the training part. I would like to know why random forest performs so badly compared to GMDH (see Wikipedia).

My question: why does random forest lose stability on the test data while GMDH does not?

set.seed(111)

# Sinusoid with amplitude a, frequency f, phase p; Gaussian noise is added to each signal
s <- function(ve = 1:500, a, f, p) a * sin(f * ve + p)
s1 <- s(a = 1,   f = 0.01, p = 0) + rnorm(500, sd = 0.05)
s2 <- s(a = 0.5, f = 0.06, p = 2) + rnorm(500, sd = 0.05)
s3 <- s(a = 0.1, f = 0.12, p = 4) + rnorm(500, sd = 0.05)

X <- cbind(s1, s2, s3)
Y <- s1 + s2 + s3

# Train on the first 300 points, test on the remaining 200
tr <- 1:300
ts <- 301:500

library(randomForest)
rf <- randomForest(Y[tr] ~ ., X[tr, ], ntree = 500, mtry = ncol(X))
rf_pred <- predict(rf, X)

library(GMDHreg)
gmdh <- gmdh.gia(X = X[tr, ], y = Y[tr], prune = 5, criteria = "PRESS")
gmdh_pred <- predict(gmdh, X)[, 1]

# Top panel: the three inputs; bottom panel: target and both predictions
par(mfrow = c(2, 1), mar = c(2, 2, 2, 2))
matplot(X, type = "l", lty = 1)
abline(v = length(tr), col = 1, lty = 2, lwd = 2)

plot(Y, type = "l", col = "gray90", lwd = 10, main = "Target")
abline(v = length(tr), col = 1, lty = 2, lwd = 2)
lines(rf_pred, col = 2, lwd = 1)
lines(gmdh_pred, col = 4, lwd = 1)

legend(x = "bottomleft",
       legend = c("original", "GMDH", "RandomForest"),
       col = c(8, 4, 2), lwd = c(5, 1, 1))

Firebug
mr.T
  • Why even study a case where the test set is not similar to the training set? – Richard Hardy Apr 15 '23 at 15:01
  • Well, GMDH works well, as you can see; I wonder why. – mr.T Apr 15 '23 at 15:04
  • Might the reason be that GMDH naturally accommodates a model like y=s1+s2+s3, while trees (in a forest) do not? If you tried a linear model, you would get about the same performance as from the GMDH, would you not? – Richard Hardy Apr 15 '23 at 15:20
  • I am not an expert in machine learning so my questions may be stupid. – mr.T Apr 15 '23 at 15:24

1 Answer


As shown in "Random forest regression not predicting higher than training data", a random forest's predictions are averages of subsets of the response values in the training set. Naïvely, then, it cannot predict values lower or higher than those the training examples themselves contain.
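One way to check this bound on the question's data is the sketch below. It regenerates the same toy signals, fits a forest through the `randomForest(x, y)` interface, and compares the range of the predictions with the range of the training response. The names `train_range`, `pred_range` and `inside`, and the small tolerance `tol`, are illustrative choices, not part of the original code.

```r
library(randomForest)  # same package as in the question

set.seed(111)
n   <- 500
idx <- 1:n
s1 <- 1.0 * sin(0.01 * idx + 0) + rnorm(n, sd = 0.05)
s2 <- 0.5 * sin(0.06 * idx + 2) + rnorm(n, sd = 0.05)
s3 <- 0.1 * sin(0.12 * idx + 4) + rnorm(n, sd = 0.05)
X  <- data.frame(s1, s2, s3)
Y  <- s1 + s2 + s3
tr <- 1:300

rf      <- randomForest(x = X[tr, ], y = Y[tr], ntree = 500, mtry = ncol(X))
rf_pred <- predict(rf, X)  # predictions for all 500 points, train and test

train_range <- range(Y[tr])
pred_range  <- range(rf_pred)
tol <- 1e-8  # assumed tolerance for floating-point rounding

# Every prediction is an average of training responses, so it should
# stay inside the range of the training response.
inside <- pred_range[1] >= train_range[1] - tol &&
          pred_range[2] <= train_range[2] + tol

print(train_range)   # range of the response seen in training
print(range(Y[-tr])) # range of the response in the test part
print(pred_range)    # range of the forest's predictions
print(inside)
```

If the test part of the target falls below the training minimum, as it does for this signal, the forest's predictions flatten out near that minimum instead of following the curve.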

GMDH (a regression over basis functions) can include terms that match the data-generating process, which allows it to extrapolate. In fact, you gave it exactly the terms that generated the data: Y is the sum of the three columns of X.
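To illustrate the point about matching terms without depending on GMDH itself, here is a minimal sketch that swaps in an ordinary linear model (`lm`), as suggested in the comments on the question. This is not the GMDH algorithm, only a stand-in whose terms also match `Y = s1 + s2 + s3`. The names `dat`, `fit`, `lm_pred` and `rmse_test` are illustrative.

```r
set.seed(111)
n   <- 500
idx <- 1:n
s1 <- 1.0 * sin(0.01 * idx + 0) + rnorm(n, sd = 0.05)
s2 <- 0.5 * sin(0.06 * idx + 2) + rnorm(n, sd = 0.05)
s3 <- 0.1 * sin(0.12 * idx + 4) + rnorm(n, sd = 0.05)
dat <- data.frame(s1, s2, s3, Y = s1 + s2 + s3)
tr  <- 1:300

# Fit on the training part only; the model has the same additive
# structure as the process that generated Y.
fit     <- lm(Y ~ s1 + s2 + s3, data = dat[tr, ])
lm_pred <- predict(fit, newdata = dat)

# Error on the held-out part, where Y leaves the training range
rmse_test <- sqrt(mean((lm_pred[-tr] - dat$Y[-tr])^2))

print(coef(fit))  # intercept near 0, slopes near 1
print(rmse_test)  # near zero, since Y is exactly linear in the inputs
```

Because the target is an exact linear combination of the inputs, the fitted coefficients carry over unchanged to the test part, which is why a model of this form can follow the target outside the training range.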

Firebug
  • Yeah, that is what I noted in a comment to the OP. "on averages" could be expanded into "on averages of subsets of values of the response variable in the training set"; that would make it even clearer (at least to me). – Richard Hardy Apr 15 '23 at 15:32
  • I incorporated it, thanks @RichardHardy :) – Firebug Apr 15 '23 at 15:33