I am working on a project to determine the variables that best predict a binary outcome. I am using a conditional random forest and permimp::permimp to determine variable importance for my subgroup analysis.

Now, I want to determine the cut-offs for the continuous variables that could help to predict the outcome. Could you please suggest a method for doing so? There are so many methods that I got lost.

Kate
    Wouldn’t there be a different splitting value (cutoff) in every tree in the random forest? – Dave Jan 19 '24 at 10:35
  • @Dave something like the average/median cut-off used within forest? – Kate Jan 19 '24 at 10:46
  • What about when a feature has more than one split to fit nonlinear behavior? (I think that can happen.) // Maybe it would be best to ask about your ultimate goal once you get these splits. Extracting the splits is a programming question that is considered off-topic here, but there might be a real statistics question beneath it. – Dave Jan 19 '24 at 10:52
  • @Dave I was asking about other methods; if you had read my question properly you would have seen that – Kate Jan 19 '24 at 11:05
  • Once you clarify the statistical content of your question, we’re all ears. – Dave Jan 19 '24 at 11:10
  • @Dave I already have it in my post – Kate Jan 19 '24 at 11:40
  • Multiple trusted moderators of this community disagree and have put the question on hold (“closed”) until such clarification comes. – Dave Jan 19 '24 at 12:42
  • I don't think the question is unclear; the OP just can't be more specific for lack of specific knowledge. A good answer could point to the tools used to classify using cut-offs (or something of the sort), starting with linear classification and continuing with random forests. https://www.statlearning.com/ may be a good source to start digging. – Pere Jan 20 '24 at 10:09
  • The fact that this only has the [tag:r] tag makes it seem that the question is just looking for some code, which remains off-topic here (not an inherently bad question, just outside the purview of Cross Validated). Perhaps this should be reopened to allow for a statistical response, though, perhaps even one that disputes the utility of the hard cutoffs the OP seems to seek. – Dave Jan 20 '24 at 10:43
  • @Dave which tag shall I add? Anything except the tag? – Kate Jan 20 '24 at 13:31
  • @Pere I am using a conditional random forest and permimp::permimp to determine variable importance for my subgroup analysis. Now I want to find cut-offs for continuous variables, but there are so many methods that I got lost – Kate Jan 20 '24 at 13:34
  • I'm not very familiar with tags (nor with the topic of the question), but I would replace the r tag with the classification tag. – Pere Jan 20 '24 at 13:58
  • As discussed extensively on this site, cutoffs are bad ideas. They don't exist in nature, because discontinuities don't exist in nature unless X = time. Since they don't exist, every analyst will find a different cutoff. – Frank Harrell Jan 23 '24 at 12:36
  • @FrankHarrell sorry for not being clear in the post - I pass continuous variables to fit the random forest and to identify the important variables. Once important variables are identified, I would like to identify cut-offs for the continuous variables, since that is valuable for clinicians. So I am not applying a cutoff before fitting the random forest – Kate Jan 25 '24 at 07:38

1 Answer

Don't do this; it makes particularly little sense for a random forest model. In addition to the many reasons that categorizing a continuous predictor is a bad idea, it undercuts a potential strength of a random forest: its ability to find unsuspected interactions among predictors in a model.

With a random forest, the association of one predictor with outcome can depend on the values of other predictors. Any cutoffs you might choose would therefore necessarily depend on the values of the other predictors. Why not just use the full model to make predictions as needed?

In response to comment

The above applies to post-modeling cutoffs, as constructing the random forest already involves making multiple cutoffs of each continuous-predictor value. At lower branches of each of the many trees that are built, the choice of cutoff for one predictor will depend on the values chosen for all the predictors used at higher levels of that tree. After the model is built there will be no one cutoff for any continuous predictor that is independent of the values of the other predictors.

This is the case even for standard regression models that contain simple interaction terms. For example, if the effect of one continuous variable (e.g., age) depends on another (e.g., hemoglobin A1C) in an interaction term, then you would have to change the cutoff for age depending on the value of hemoglobin A1C.
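To make that interaction point concrete, here is a minimal sketch in R using simulated data. The `age` and `a1c` variables, their ranges, and all coefficients are invented for illustration; nothing here comes from the question's actual data. The point is that the age at which the fitted probability crosses 0.5 shifts with the A1C value, so no single age cutoff exists.

```r
# Hypothetical sketch: simulated data with an age-by-A1C interaction.
# All coefficients and variable ranges are invented for illustration.
set.seed(1)
n   <- 2000
age <- runif(n, 40, 80)
a1c <- runif(n, 5, 10)
lp  <- -14 + 0.10 * age + 0.80 * a1c + 0.01 * age * a1c  # true log-odds
y   <- rbinom(n, 1, plogis(lp))

fit <- glm(y ~ age * a1c, family = binomial)
b   <- coef(fit)  # b[1]=intercept, b[2]=age, b[3]=a1c, b[4]=age:a1c

# Age at which the fitted probability crosses 0.5 (log-odds = 0),
# solved from b1 + b2*age + b3*a1c + b4*age*a1c = 0:
age_cutoff <- function(a1c) -(b[1] + b[3] * a1c) / (b[2] + b[4] * a1c)

age_cutoff(6)  # one "cutoff" for age at A1C = 6
age_cutoff(9)  # a different "cutoff" for age at A1C = 9
```

With the true coefficients used above, the crossing point moves by roughly 20 years of age between A1C = 6 and A1C = 9, even in this much simpler model than a random forest.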

A clinician is presumably interested in the probability of developing the disease. That depends on combinations of values of all the predictors. You can make probability predictions from a properly constructed random forest, based on each new patient's values.
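For instance, here is a sketch of that workflow, assuming party::cforest (the kind of conditional random forest that permimp::permimp accepts) and using a binarized iris dataset purely as a stand-in for clinical data:

```r
# Sketch, assuming party::cforest; iris (made binary) is a stand-in
# for the clinical data in the question.
library(party)

d <- iris[iris$Species != "setosa", ]
d$Species <- droplevels(d$Species)  # binary outcome

cf <- cforest(Species ~ ., data = d,
              controls = cforest_unbiased(ntree = 100, mtry = 2))

# Per-patient class probabilities, rather than a per-variable cutoff:
probs <- predict(cf, newdata = d[1:3, ], type = "prob")
probs  # a list with one probability vector per new observation
```

Each new patient gets a predicted probability that reflects the combination of all their predictor values, which is what the forest actually learned.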

Even if a clinician and patient might then choose some probability cutoff based on the relative costs of false-positive and false-negative assessments, that probability will depend on (usually unknowable, with a random forest) combinations of all the predictors that were identified during model construction.

EdM
  • @Kate I was talking about looking for cutoffs after the model is built. I've added a bit of extra explanation. A random forest isn't like a single decision tree, where the modeling might have found single cutoffs for each continuous predictor. – EdM Jan 25 '24 at 09:17
  • To summarize and highlight @EdM's points: You are proposing identifying cutoff points as if what mattered were bivariate relationships. This is inconsistent with the use of a more sophisticated, multivariate method (random forest) which, like multiple regression, will go beyond such bivariate relationships to show cases that, by virtue of their predictor combinations, have high probabilities regarding the outcome. – rolando2 Jan 25 '24 at 11:16
  • As @EdM stated, the use of cutoffs is a really, really bad idea. Though often used in clinical medicine, users of cutoffs on measurements don't realize that the cutoff of a predictor must be a function of all the actual levels of the other predictors. This is demonstrated in https://hbiostat.org/bbr/info#categorizing-continuous-predictors – Frank Harrell Jan 25 '24 at 14:30
  • Thank you so much for the great explanation! I have another question. For instance, if I initially used 30 variables to fit the random forest, and through permutation, identified 7 variables as important. Now, when we have a new patient, and we want to predict the probability of their response, we may only have the 7 important variables collected, not all 30. Can we still predict the probability of the patient being a responder based on these 7 variables alone, or do we need all 30 variables? Alternatively, should we refit the random forest based only on these 7 variables? – Kate Jan 25 '24 at 15:30
  • @Kate I'm not an expert on random forests and don't use them much. I would suggest posting a separate question on this site about what to do in that scenario. In that question, please provide as much detail as possible about the nature of the outcome and the predictor variables, and the reasons why you might not have data on all the predictor variables. My understanding is that random forests can handle missing data fairly well, but you would be better off getting advice from someone who uses them routinely. – EdM Jan 25 '24 at 15:56