0

I have some trouble finding the best ML approach to solve the following problem: I have a set of continuous variables representing how a specific medical procedure is conducted. I need to find the optimal threshold value for each variable to maximize the probability of a positive outcome.

I have already fitted a Random Forest model for variable selection and a second model that predicts the outcome from the continuous variables, but I do not know how to proceed from here.

fb95
  • 31
  • 3

2 Answers

4

Dichotomization is usually problematic; see "What is the benefit of breaking up a continuous predictor variable?". It will probably be better to use the data as they are, feed them into a probabilistic (!) model, and then use a search to find predictor values that predict a high probability of a good outcome.

Note that most models will assume a monotonic relationship between the predictors and the output, at least over part of the data space. That can lead to nonsensical predictions: if more vitamin C improves outcome X, then your model may recommend eating five pounds of vitamin C per day. So you may need to constrain your optimizer in some way, for example by restricting the search to plausible ranges of each predictor.

Also, medical procedures are often not independent, e.g., medications may interact. You may want to keep this in mind when optimizing.
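To illustrate why the search needs bounds, here is a minimal sketch in plain Python. The function `predicted_probability`, its coefficients, the predictor names and the search ranges are all hypothetical stand-ins for a fitted probabilistic model, not anything estimated from real data. Because the assumed model is monotonic in both inputs, the grid search simply returns whatever upper bounds it is given:

```python
import itertools
import math


def predicted_probability(withdrawals, tissue_mg):
    """Hypothetical stand-in for a fitted probabilistic model.

    Assumed logistic form with made-up coefficients; it increases
    monotonically in both predictors.
    """
    score = -3.0 + 0.4 * withdrawals + 0.05 * tissue_mg
    return 1.0 / (1.0 + math.exp(-score))


def grid_search(withdrawal_values, tissue_values):
    """Return (probability, withdrawals, tissue_mg) with the highest probability."""
    return max(
        (predicted_probability(w, t), w, t)
        for w, t in itertools.product(withdrawal_values, tissue_values)
    )


# Search within assumed, clinically plausible bounds.
bounded = grid_search(range(1, 7), range(10, 61, 10))

# Widen the bounds: the "optimum" just moves to the new upper limits.
wider = grid_search(range(1, 13), range(10, 121, 10))

print("bounded search:", bounded)
print("wider search:  ", wider)
```

In both runs the best point sits in the corner of the search box, so the bounds, not the model, determine the recommendation. That is the reason to choose them on substantive grounds.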

Stephan Kolassa
  • 123,354
  • Thank you for your help. I'm currently working on tissue biopsies, and my goal is to find the optimal number of withdrawals, quantity of tissue, etc. That's why I need the dichotomization. In your opinion, what would be the best method to search for optimal predictor values using the caret package in R? – fb95 Oct 12 '23 at 13:30
  • 1
    I would still say that no, you don't need a dichotomization. Per my answer, I would recommend fitting a model and running an optimization. Likely enough, your model will think that "more withdrawals is better" and "more tissue is better", and will recommend taking an unlimited amount of tissue in an unlimited number of withdrawals. So you will need to constrain it in some way. You could, for instance, look for the minimum number of withdrawals and the minimum tissue amount that give you a certain minimum result. We might be able to help you better if you could provide some sample data. – Stephan Kolassa Oct 12 '23 at 13:42
1

It appears that you have a prescriptive analytics problem, which involves finding the optimal combination of inputs for a desired outcome. The term is commonly used in some fields (such as business) and may help you find appropriate methods for your problem. If you are using software like RapidMiner, you can refer to its documentation on prescriptive analytics (https://docs.rapidminer.com/8.2/studio/operators/scoring/prescriptive_analytics.html) for more information.

If you are using a programming language like R or Python, you can explore constrained optimization packages. Assuming you have a machine learning model $h(\cdot)$ that takes numerical inputs $x$ representing your variables and predicts a numerical score indicating the "probability" of a positive outcome, you essentially have an optimization problem of the form $\text{argmax}_x\ h(x)$. In your specific case, the machine learning model is a Random Forest. It is important to ensure that your model predicts a score rather than a categorical outcome, as many implementations of Random Forest classifiers return class labels by default. In R, you can obtain class probabilities by calling predict(rf, type = "prob", newdata = x).

However, as Kolassa said, if your variables depend on each other, you need to include this dependency as a constraint, since they cannot vary independently. For example, if variables A and B are related such that typically A < 2 * B, then this needs to be added as a constraint.
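As a rough sketch of this constrained problem, the following plain-Python example compares the unconstrained and constrained argmax on a grid. The score function `h`, its peak location, the grid and the helper names are all assumptions made for illustration; in practice `h` would wrap the probability output of the fitted model, and a proper constrained optimizer would replace the brute-force grid.

```python
import math


def h(a, b):
    """Hypothetical model score in (0, 1); stands in for the fitted model's
    predicted probability. Assumed to peak at a = 8, b = 3."""
    score = 2.0 - 0.3 * ((a - 8.0) ** 2 + (b - 3.0) ** 2)
    return 1.0 / (1.0 + math.exp(-score))


def feasible(a, b):
    """Dependency between the variables: A must stay below 2 * B."""
    return a < 2 * b


# Candidate values 0.0, 0.5, ..., 10.0 for both variables (assumed ranges).
grid = [i * 0.5 for i in range(21)]


def argmax_h(constraint=None):
    """Brute-force argmax of h over the grid, optionally filtered by a constraint."""
    candidates = [
        (a, b)
        for a in grid
        for b in grid
        if constraint is None or constraint(a, b)
    ]
    return max(candidates, key=lambda x: h(*x))


unconstrained = argmax_h()
constrained = argmax_h(feasible)

print("unconstrained argmax:", unconstrained, "feasible:", feasible(*unconstrained))
print("constrained argmax:  ", constrained, "score:", round(h(*constrained), 3))
```

Here the unconstrained optimum violates A < 2 * B, so the constrained search settles on a nearby feasible point with a somewhat lower score. A grid is only practical for a handful of variables; with more, a constrained optimization routine is the more realistic choice.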

Jseng
  • 11