My question is a specific instance of what was discussed, in part, in the answers to "Bootstrapping test set?".
Suppose I train a model for which I cannot derive a confidence interval for the error analytically (think random forest or neural network), and further suppose that the training and test sets have roughly the same distribution of the target variable.
Would repeatedly resampling the test set with replacement and evaluating the model on each resample give a valid confidence interval for the model's error? Specifically, could I be X% confident that a fresh random draw of the same size as the test set from the unsampled population would have an error falling within that interval?
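For concreteness, here is a minimal Python sketch of the procedure I have in mind; the function name, the choice of mean squared error as the metric, and the synthetic data are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_error_interval(y_true, y_pred, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap interval for the test-set error.
    The model is never refit; only the (label, prediction) pairs
    of the fixed test set are resampled with replacement."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    errors = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                       # one bootstrap resample
        errors[b] = np.mean((y_true[idx] - y_pred[idx]) ** 2)  # MSE on that resample
    return np.quantile(errors, [alpha / 2, 1 - alpha / 2])

# Toy usage with synthetic labels and predictions (placeholder data):
y_test = rng.normal(size=500)
preds = y_test + rng.normal(scale=0.5, size=500)
lo, hi = bootstrap_error_interval(y_test, preds)
print(f"95% bootstrap interval for test MSE: [{lo:.3f}, {hi:.3f}]")
```

That is, the interval comes from the empirical quantiles of the resampled error distribution, with the model's predictions held fixed throughout.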