There are two popular R packages for building random forests as introduced by Breiman (2001): randomForest and randomForestSRC. I am noticing small yet significant discrepancies in accuracy between the two packages, even when I try to use the same input parameters. I understand we would expect slightly different random forests, but in the example below the randomForestSRC package consistently outperforms the randomForest package; I'm guessing there are other examples where randomForest is superior. Can someone please explain why these packages provide different predictions? Is there a way to generate a random forest using the same methodology in both packages?
In the example, there is no missing data, all values are distinct, mtry=1, and trees are grown with nodesize=5 (the minimum terminal node size). I believe the same bootstrap approach and split rule are used as well. Increasing ntree or the number of observations in the simulated dataset does not change the relative difference between the two packages.
library(randomForest)
library(randomForestSRC)

set.seed(130948)  # Other seeds give similar comparative results
x1 <- runif(1000)
y  <- rnorm(1000, mean = x1, sd = .3)
data <- data.frame(x1 = x1, y = y)

# Compare MSE using OOB samples based on the printed output
(modRF    <- randomForest(y ~ x1, data = data, ntree = 500, nodesize = 5))
(modRFSRC <- rfsrc(y ~ x1, data = data, ntree = 500, nodesize = 5))

# Compare MSE using an independent test sample
x1new <- runif(10000)
ynew  <- rnorm(10000, mean = x1new, sd = .3)
newdata <- data.frame(x1 = x1new, y = ynew)
mean((predict(modRF, newdata = newdata) - newdata$y)^2)               # MSE using randomForest
mean((predict(modRFSRC, newdata = newdata)$predicted - newdata$y)^2)  # MSE using randomForestSRC
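One place the two implementations can diverge is in their sampling and splitting options rather than the forest algorithm itself. As a sketch (the argument names `nsplit`, `samptype`, `bootstrap`, and `splitrule` are taken from randomForestSRC's documentation, and their defaults have changed across versions, so treat this as an assumption to verify against your installed version), one might try pushing rfsrc closer to randomForest's behavior: deterministic split-point search (`nsplit = 0`) and bootstrap sampling with replacement (`samptype = "swr"`):

```r
library(randomForest)
library(randomForestSRC)

set.seed(130948)
x1 <- runif(1000)
y  <- rnorm(1000, mean = x1, sd = .3)
data <- data.frame(x1 = x1, y = y)

# Hypothetical alignment attempt: nsplit = 0 requests exhaustive
# (deterministic) split-point search instead of randomized splitting,
# and samptype = "swr" requests sampling with replacement, which is
# what randomForest uses. Defaults may differ by package version.
modRFSRC2 <- rfsrc(y ~ x1, data = data, ntree = 500, nodesize = 5,
                   mtry = 1, nsplit = 0, splitrule = "mse",
                   bootstrap = "by.root", samptype = "swr")
print(modRFSRC2)
```

Even with these options aligned, residual differences can remain (for example, tie-breaking rules and RNG streams differ between the two C implementations), so bit-identical forests are not guaranteed.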
Comments:

… randomForest, but mainly because it's older. I'd suggest that putting your actual question out there would be more productive: the paradox, what you did with each package, what results you had, etc. You may get a direct answer about the paradox, or perhaps just a discussion of why package A seems to have ~3% less accuracy than B, but adding an RV resulted in 5% higher accuracy than B, which didn't change. – Wayne Jan 17 '16 at 17:18

randomForest and randomForestSRC give different predictions even though both claim to implement the original RF method. If one can explain why the packages give different predictions for the example provided, this would also explain the differences in the paradox. I apologize if my initial question wasn't clear; I am simply trying to understand which part of the two algorithms leads to different results. – Peter Calhoun Jan 18 '16 at 19:17