Sample size in R for random forest

Question

For the sample size, in R, the default is sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x))

What I know is that a random forest constructs a large number of trees from random bootstrap samples of the training data.

But in R, if we sample with replacement, the sample size is the full number of observations, so we use all of them. Then, it's not random. Can you please explain what "random bootstrap samples" means when replace is TRUE?

Are you asking about a specific function in R? If so, which function? In which library? There are several random forest implementations in R. // If I understand your question correctly, the software is giving you an option to turn off replacement sampling, which someone might want to do for some reason. — Sycorax, Apr 19 '23 at 21:13
Here is the code:
S3 method for class 'formula'

randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

Default S3 method:

randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500, mtry=if (!is.null(y) && !is.factor(y)) max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))), replace=TRUE, classwt=NULL, cutoff, strata, sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),... — shawn, Apr 19 '23 at 21:23
We are just using all of the rows as the sample size. So, I do not think we construct a large number of trees from random bootstrap samples; instead, we create trees from known bootstrap samples. Can you please explain where I am misunderstanding? — shawn, Apr 19 '23 at 21:25

Sycorax · Answer 1 · 2023-04-20T16:56:13.563

1

The code is just an if-then statement.

If replace==TRUE then random forest just does bootstrap resampling: for each tree, it draws a sample with replacement from the data with size equal to the original sample size.

If replace==FALSE then random forest just does random sampling: for each tree, it draws a sample without replacement from the data with size equal to ceiling(.632*nrow(x)).
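The two cases can be illustrated with base R's `sample()`. This is only a sketch of the sampling idea, not the package's internal code; the variable names (`n`, `boot_idx`, `sub_idx`, `boot_idx2`) and the seed are assumptions made for the example.

```r
set.seed(1)   # arbitrary seed, only so the illustration is reproducible
n <- 1000     # stands in for nrow(x)

# replace = TRUE: a bootstrap sample of n row indices, drawn with replacement
boot_idx <- sample(n, size = n, replace = TRUE)

# replace = FALSE: ceiling(.632 * n) row indices, drawn without replacement
sub_idx <- sample(n, size = ceiling(0.632 * n), replace = FALSE)

length(boot_idx)               # n draws, but some row indices repeat
length(unique(boot_idx)) / n   # fraction of distinct rows, typically close to 0.632
anyDuplicated(sub_idx)         # 0 means no index is repeated

# A second bootstrap draw picks a different mix of rows,
# which is what makes the sample for each tree random.
boot_idx2 <- sample(n, size = n, replace = TRUE)
identical(sort(boot_idx), sort(boot_idx2))
```

Even though the bootstrap sample has nrow(x) entries, it does not contain every row: some rows appear several times and others are left out entirely.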

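The number 0.632 might seem like a "magic number", but it is approximately $1-\exp(-1)$, which is an important number for bootstrap samples: it is the expected fraction of distinct observations in a bootstrap sample when the data set is large.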
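A short derivation of where that number comes from, assuming each of the $n$ draws picks one of the $n$ observations uniformly and independently. A single draw misses a given observation with probability $1 - 1/n$, so all $n$ draws miss it with probability $(1 - 1/n)^n$. Therefore

$$
P(\text{observation } i \text{ appears in the bootstrap sample}) = 1 - \left(1 - \frac{1}{n}\right)^{n} \xrightarrow{\; n \to \infty \;} 1 - e^{-1} \approx 0.632 .
$$

So a bootstrap sample of size nrow(x) contains, on average, about 63.2% of the distinct rows. That is presumably why the default size without replacement is ceiling(.632*nrow(x)): both settings then give each tree roughly the same number of distinct observations.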

edited Apr 20 '23 at 16:56

answered Apr 19 '23 at 21:29

Sycorax


Thank you. So, if we are using replacement, we are creating a model from all the observations. Then, it's not random. Am I understanding correctly? – shawn Apr 19 '23 at 21:33
No, if you use replacement, you're using a [tag:bootstrap] and the samples are drawn with replacement. The size of the bootstrap samples is nrow(x) and some of the data will be repeated because that's what replacement means. – Sycorax Apr 19 '23 at 21:36
Oh. So, let's say we have 1000 observations. We are going to have a bootstrap sample of size n (i.e. let's say n = 100 for example. This "n" can be any number. Please correct me if I am wrong). Then, for the first observation, we draw it from the 1000 observations. Then, for the second observation, we draw it from the same 1000 observations. And so on. For the last 100th observation, we draw it from the same 1000 observations. Is it what you are trying to say? – shawn Apr 19 '23 at 21:39
Yes, that is what replacement means. It's the same reason if you roll a die twice you can see 3 on the first roll and then 3 again on the second roll. – Sycorax Apr 19 '23 at 21:39
Thank you, Sycorax. Everything that I said is correct. – shawn Apr 19 '23 at 21:41
Your first comment is absolutely not correct. – Sycorax Apr 19 '23 at 21:42
Do you mean this one? I.e.: "Oh. So, let's say we have 1000 observations. We are going to have a bootstrap sample of size n (i.e. let's say n = 100 for example. This "n" can be any number. Please correct me if I am wrong). Then, for the first observation, we draw it from the 1000 observations. Then, for the second observation, we draw it from the same 1000 observations. And so on. For the last 100th observation, we draw it from the same 1000 observations. Is it what you are trying to say?" – shawn Apr 19 '23 at 21:43
Hi Sycorax. Just want to confirm. You said this: "The size of the bootstrap samples is nrow(x)." That does not mean that every observation in the bootstrap sample is unique and used only once. Some of them are duplicates. So, that is why we have random bootstrap samples: you don't know which observations are duplicated. Can you please verify this? I am not sure if I understood your comment. Thanks – shawn Apr 20 '23 at 16:52
The behavior depends on replace. I think the misunderstanding is that you don't understand (1) sampling with replacement vs without replacement (2) bootstrapping and (3) random forest. Please take the time to read high-quality resources about each of these topics. We have many threads about these topics; you can find them using search: https://stats.meta.stackexchange.com/questions/5549/faq-best-practices-for-searching-cv – Sycorax Apr 20 '23 at 17:09
