I would like to quantify the uncertainty of a Random Forest binary classifier.
The idea that came to mind was to fit the Random Forest 100 times with different random seeds. By computing, for each point in a validation set, the variance of the forest's output across these fits, I can then estimate the variance of the model.
Note that this technique is very similar to what is done with neural networks, where we randomly initialize several models and use the ensemble of nets to estimate the variance of a single net.
Do you see any flaws in this procedure?
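To make the procedure concrete, here is a minimal sketch of what I have in mind. The helper name `seed_ensemble_variance` and its parameters are hypothetical, chosen only for illustration, and the dataset is deliberately small so the example runs quickly. The variance it returns reflects only the randomness of the fitting procedure for a fixed training set.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def seed_ensemble_variance(X_train, y_train, X_val, n_repeats=100, n_estimators=100):
    """Refit the forest with different seeds (hypothetical helper).

    Returns the per-point mean and sample variance of the predicted
    class-1 probability across the refits.
    """
    probs = np.empty((n_repeats, X_val.shape[0]))
    for seed in range(n_repeats):
        clf = RandomForestClassifier(
            n_estimators=n_estimators, random_state=seed, n_jobs=-1
        )
        clf.fit(X_train, y_train)
        # Column 1 holds the probability of the second class (label 1 here)
        probs[seed] = clf.predict_proba(X_val)[:, 1]
    return probs.mean(axis=0), probs.var(axis=0, ddof=1)


# Small synthetic data and few repeats, only to keep the sketch fast
X, y = make_blobs(n_samples=2000, n_features=5, centers=2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
mean_prob, var_prob = seed_ensemble_variance(
    X_train, y_train, X_val, n_repeats=10, n_estimators=50
)
print(var_prob.max())
```

With `ddof=1` the sample variance of values in [0, 1] over `n_repeats` fits cannot exceed 0.25 * n_repeats / (n_repeats - 1), which gives a quick sanity check on the output.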
Thank you in advance!
Disclaimer: I am aware of other approaches to estimating the variance of a Random Forest, and of some implementations in Python. However, when the number of samples is large, such estimators require a very large number of trees in the forest in order to converge. In my dataset, I am dealing with ~500k data points and 100 trees. For this reason, using such estimation techniques is not feasible.
Here is a reproducible example showing a situation in which the available estimator mentioned above does not converge:
import numpy as np
import sklearn.datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import forestci as fci

X, y = sklearn.datasets.make_blobs(n_samples=100000, n_features=5, centers=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
clf = RandomForestClassifier(random_state=61, n_estimators=100, n_jobs=10, criterion='entropy')
clf.fit(X_train, y_train)
# Per-test-point variance estimate from forestci
y_test_var = fci.random_forest_error(clf, X_train=X_train, X_test=X_test)
max(y_test_var)  # should be < 0.25, since we are dealing with binary data, but isn't
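For completeness, the 0.25 bound mentioned in the last comment can be justified with a short argument. Here $p$ is my own notation for the quantity whose variance is estimated at a fixed test point (a predicted probability or a 0/1 vote), assumed to take values in $[0, 1]$, so that $p^2 \le p$:

```latex
\[
\operatorname{Var}(p)
  = \mathbb{E}[p^2] - \mathbb{E}[p]^2
  \le \mathbb{E}[p] - \mathbb{E}[p]^2
  = \mu (1 - \mu)
  \le \tfrac{1}{4},
\qquad \mu = \mathbb{E}[p], \quad p \in [0, 1].
\]
```

Any estimate above 0.25 therefore cannot be a valid variance for such a quantity.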
I am fully aware that varying the random seed does not account for the intrinsic variability of the data, but it at least provides an estimator that is far less biased than the other one. Do you have any other suggestions for estimating the variance?
– Niccolò Ajroldi Mar 24 '22 at 08:43
Nevertheless, we have evidence that the classical jackknife approach does not fit cases where n >> B. Do you have any suggestions on how to estimate the variance?
– Niccolò Ajroldi Mar 24 '22 at 13:50