
I would like to quantify the uncertainty of a Random Forest binary classifier.

The idea that popped into my mind was to fit the Random Forest 100 times with different seeds. By computing the variance of the forests' outputs on a validation set, I can then estimate the variance of the model.
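For concreteness, here is a minimal sketch of the procedure I have in mind (assuming X_train, y_train and a held-out validation set X_val are already defined; the names and the number of seeds are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit the same forest several times, changing only the random seed,
# and collect the predicted probabilities on the validation set.
probabilities = []
for seed in range(100):
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    probabilities.append(clf.predict_proba(X_val)[:, 1])

probabilities = np.vstack(probabilities)        # shape: (n_seeds, n_val_points)
per_point_variance = probabilities.var(axis=0)  # seed-to-seed variance per validation point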

Notice that this technique is very similar to what is done with neural networks, where we randomly initialize several models and use the ensemble of nets to estimate the variance of a single net.

Do you see some flaws in this procedure?

Thank you in advance!


Disclaimer: I am aware of different approaches for estimating the variance of a Random Forest, and of some implementations in Python. However, if the number of samples is high, such estimators require a very large number of trees in the forest in order to converge. In my dataset, I am dealing with ~500k data points and 100 trees. For this reason, using such estimation techniques is not possible.


I report here a reproducible example showing a situation where the aforementioned estimator does not converge:

import numpy as np
import sklearn.datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import forestci as fci

X, y = sklearn.datasets.make_blobs(n_samples=100000, n_features=5, centers=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

clf = RandomForestClassifier(random_state=61, n_estimators=100, n_jobs=10, criterion='entropy')
clf.fit(X_train, y_train)

y_test_var = fci.random_forest_error(clf, X_train=X_train, X_test=X_test)
max(y_test_var)  # should be < 0.25, since we are dealing with binary data, but isn't

1 Answer


Why would you do that? As you noticed, there already exists a solution by respected authors, including Trevor Hastie and Bradley Efron, who rigorously studied and empirically validated it. If you are planning to introduce an alternative solution, you should probably start with equally rigorous mathematical proofs and an empirical evaluation against their solution. It sounds like an interesting research paper, but if you want to use it as an applied method, re-inventing the wheel is probably not the best idea.

Your proposed method considers the random seed as the only source of randomness. It does not take into account the randomness in the data due to sampling. If you intend to use it for calculating confidence intervals, it would underestimate the intervals because of that. The same argument applies to varying only the random seed for neural networks. You need more than this: for example, use random samples of the training data (bootstrap) for each of the models in the ensemble used for estimating the variability, and, for neural-network-specific approaches, things like Monte Carlo dropout or injecting randomness into the weights.
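For illustration, here is a rough sketch of what I mean by bootstrapping the training data (the function name and the number of replicates are arbitrary choices of mine, not from the paper):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def bootstrap_forest_variance(X_train, y_train, X_test, n_boot=50):
    # Refit the forest on bootstrap resamples of the training data, so that the
    # variability due to sampling the data is captured, not just the random seed.
    predictions = []
    for b in range(n_boot):
        X_boot, y_boot = resample(X_train, y_train, random_state=b)
        clf = RandomForestClassifier(n_estimators=100, random_state=b)
        clf.fit(X_boot, y_boot)
        predictions.append(clf.predict_proba(X_test)[:, 1])
    predictions = np.vstack(predictions)
    return predictions.var(axis=0)  # per-point variance across bootstrap replicates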

Tim
  • Thank you for the complete response. The problem with such methods is that the corresponding estimator does not converge if the number of trees is "small". In my case I have 100 trees, and I obtain estimated variances greater than 0.25, whereas, since we are dealing with binary data, the variance should be bounded by 0.25 – Niccolò Ajroldi Mar 23 '22 at 16:56
  • Since the output of the model is a random variable bounded between 0 and 1, the variance of such a random variable should be less than 0.25. See e.g. https://en.wikipedia.org/wiki/Popoviciu%27s_inequality_on_variances – Niccolò Ajroldi Mar 24 '22 at 08:35
  • Notice also that convergence results are present in the paper by Wager, Hastie and Efron. Letting n be the number of samples and B the number of trees, their estimator converges for B = O(n). For that reason, their proposal is not an option for my model choice, where I have n >> B.

    I am totally aware that varying the random seed does not account for the intrinsic variability of the data, but it at least provides an estimator far less biased than the other one. Do you have any other suggestions for estimating the variance?

    – Niccolò Ajroldi Mar 24 '22 at 08:43
  • @NiccolòAjroldi how do you know it is "far less biased"? – Tim Mar 24 '22 at 08:48
  • Tim, one is far above 0.25, taking values like 1.5, while the other one is at least bounded by 0.25. I know I have no guarantee that it is a solid estimator, but for sure it is far less biased! – Niccolò Ajroldi Mar 24 '22 at 12:17
  • @NiccolòAjroldi counterexample: if you "predicted" the variance to be 0.01 always, would it be less biased as well? – Tim Mar 24 '22 at 12:19
  • I got your point. I should conduct a simulation on synthetic data in order to prove, at least empirically, the validity of such an approach. Also, I may try to derive some theoretical bounds.

    Nevertheless, we have evidence that the classical Jackknife approach does not fit cases where n>>B. Do you have any suggestions on how to estimate the variance?

    – Niccolò Ajroldi Mar 24 '22 at 13:50
  • @NiccolòAjroldi the bootstrap is always an option for black-box functions like the random forest, but the idea presented in the paper was a suggestion for how we could do better than that. – Tim Mar 24 '22 at 13:52
  • I don't get your last comment. Which technique is always an option for random forest? The one proposed in my question? – Niccolò Ajroldi Mar 24 '22 at 13:56
  • @NiccolòAjroldi using bootstrap. – Tim Mar 24 '22 at 14:03
  • Refitting several Random Forests by changing the random seed is equivalent to using the bootstrap. Basically, I was computing the bootstrap estimator without knowing it. – Niccolò Ajroldi Mar 24 '22 at 14:34
  • @NiccolòAjroldi not exactly, because each random forest is built using the same data; it is only each individual tree that uses a bootstrap sample. That is a subtle difference. – Tim Mar 24 '22 at 14:40