
I am looking for ways to combine multiple quantitative values in order to build a ROC curve with specificity and sensitivity. This seems to be common in multiple-biomarker papers, but I can't find any direct way to do it. CombiROC seemed to be made for this, but the Shiny app is no longer working and the package hasn't been updated, which makes me question its use.

I have multiple protein concentrations for a given timepoint (no follow-up) from a cohort of patients with a binary presentation (e.g., control vs. diseased). Using these proteins alone, I can build a ROC curve, and I found some of them to be associated with my disease. I would like to push my analysis further to a model where I can combine/integrate 2 or 3 of these markers to refine the diagnosis and obtain better AUC/specificity/sensitivity values. For instance, having both A and B elevated could carry a worse prognosis compared to just high A.

So far, this is the way I found to do it: I first fit a glm, from which I predict the probability of falling into the group. I then calculate the ROC curve from the predicted vs. actual outcomes.

#Build a glm with the clinical group (binary outcome)
Combined.glm <- glm(Group1 ~ Analyte1 + Analyte2 + Analyte3,
                    data = dataset,
                    family = binomial)

#Use the model to predict the probability of group membership
dataset$prob.Group1 <- predict(Combined.glm, type = "response")

#Build the ROC curve (roc() here comes from the pROC package)
library(pROC)
roc.Group1 <- roc(dataset$Group1, dataset$prob.Group1)

I would like to know if this is a correct way to do this, or if not, how I should do it.

Thanks!

Edit: added clarifications and rephrased

Jeremy Miles
Renaud
  • Hello and Welcome to CV. For some good reasons, XY questions in this site are discouraged. – utobi Dec 18 '23 at 17:35
  • I would be happy to post my question in a different SE forum, but I don't see how the X-Y problem applies here. My problem/question about X still applies, I just showed Y as the way I did it and have the feeling it is not right. Removing my code example doesn't change I still can't find good example of how X is done. – Renaud Dec 18 '23 at 18:18
  • The answer says it all: "The XY problem is asking about your attempted solution rather than your actual problem". Having said that, your post is on topic, however, I'm sorry but I don't find your R code helpful either since we cannot replicate your example. Could you explain what you mean by "combine multiple quantitative values"? Do you have chunks of data coming in, say sequentially in time and you want to update the ROC curve in the light of new data? – utobi Dec 18 '23 at 20:13
  • I also invite you to read the [whole post](https://meta.stackexchange.com/a/269222), which says that focusing on Y is not constructive and doesn't help answer X. Each individual is classified into a group (e.g., control/diseased) and has a series of quantitative analytes. Individually, analytes have the potential to discriminate between groups. I want to find the diagnostic/predictive potential of 2 analytes combined. Imagine going to the MD and him saying "Your A and B proteins are elevated; therefore, you are more at risk of Z complication". – Renaud Dec 18 '23 at 21:31
  • Can you tell us more about your data and research goal? This should go beyond "I want to build an ROC curve because everyone else does it" – Demetri Pananos Dec 18 '23 at 21:39
  • What made you jump to the conclusion that, because I am unsure about a method, I simply want to do this whether or not it adds value? My data are protein concentrations, comparing a cohort of patients and controls. I found that some proteins have diagnostic potential. I want to see if using more than one can give a more precise diagnosis than one alone. – Renaud Dec 18 '23 at 21:47

1 Answer

3

The ROC curve, specificity and sensitivity shouldn't be primary concerns. You need a well-calibrated model of the probability of disease status based on a set of predictors. Once that's available, the ROC curve might not be helpful at all.

You've made a start at that with your binomial generalized linear model, but only a start. Your model implicitly assumes a linear association of each of your analytes with the log-odds of disease status, each independent of the values of the others. That's a pretty strong assumption.

There is extensive information on this site about how to set up regression models in general and about issues specific to binary regression, available from the search function and tags. Frank Harrell's Regression Modeling Strategies is a freely available central resource with Chapters 10, 11 and 12 devoted specifically to binary logistic regression. Follow those guidelines to construct a well-calibrated model that doesn't overfit the data.
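As a sketch of what such a model might look like in R, following the Regression Modeling Strategies approach: this assumes the rms package, and since the question's data aren't available, it uses simulated stand-in data with the question's hypothetical column names. Restricted cubic splines relax the linearity assumption for each analyte.

```r
# Sketch only: assumes the 'rms' package; 'dataset' here is simulated
# stand-in data using the question's hypothetical column names.
library(rms)

set.seed(42)
n <- 200
dataset <- data.frame(Analyte1 = rnorm(n), Analyte2 = rnorm(n), Analyte3 = rnorm(n))
dataset$Group1 <- rbinom(n, 1, plogis(0.8 * dataset$Analyte1 + 0.5 * dataset$Analyte2))

dd <- datadist(dataset); options(datadist = "dd")

# Restricted cubic splines (3 knots) relax the assumption that each
# analyte is linearly associated with the log-odds of disease.
fit <- lrm(Group1 ~ rcs(Analyte1, 3) + rcs(Analyte2, 3) + rcs(Analyte3, 3),
           data = dataset, x = TRUE, y = TRUE)

anova(fit)             # Wald tests, including tests of nonlinearity
validate(fit, B = 200) # bootstrap check for overfitting/optimism
```

With three analytes and 3-knot splines this fit spends 6 degrees of freedom, so the bootstrap validation step matters: a small cohort may not support even this much flexibility.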

Once you have that model, you have pretty much all that you need. It tells you the log-odds of disease based on the analyte levels. That's a quantitative form of "Your A and B proteins are elevated therefore, you are more at risk of Z complication"; the regression model provides information on how much more likely disease is based on how much the proteins are elevated.

Furthermore, the probability model allows a cost-based choice of probability cutoff for classification, if needed, depending on the relative costs of false-positive and false-negative class assignments. Quoting from this page:

With 0 cost of true classification, say that the costs of false positives and false negatives are scaled to sum to 1. Call the cost of a false positive $c$ so that the cost of a false negative is $(1-c)$. Then the optimal probability classification cutoff for minimizing expected cost is at $c$.
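A toy numeric illustration of that rule (the cost values are made up for the example):

```r
# Toy example of the cost-based cutoff rule quoted above.
# Suppose a false positive costs 0.2 and a false negative 0.8
# (scaled to sum to 1); the optimal probability cutoff is then 0.2.
cost_fp <- 0.2
cost_fn <- 1 - cost_fp
cutoff  <- cost_fp   # classify as "diseased" when P(disease) >= 0.2

probs <- c(0.05, 0.15, 0.25, 0.60, 0.90)  # hypothetical model predictions
classified <- as.integer(probs >= cutoff)
classified  # 0 0 1 1 1
```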

You don't need an ROC curve to make that choice for classification. Yes, you can think of that choice as moving along the ROC curve with its associated values of sensitivity and specificity as a function of the model's linear-predictor values. But that ROC curve doesn't directly illustrate the disease probability as a function of the linear predictor.

If you want to show a plot, show plots of how the log-odds of the outcome changes as a continuous function of each analyte value, similar to Figure 11.2 of Regression Modeling Strategies. Show a plot documenting the model's calibration, with good agreement between predicted and observed outcomes, as in Figure 11.5.
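With an rms fit, both kinds of plots take one line each. This sketch again assumes the rms package and simulated stand-in data in place of the question's dataset:

```r
# Sketch: partial-effect and calibration plots, assuming the 'rms'
# package and simulated stand-in data for the question's 'dataset'.
library(rms)

set.seed(7)
n <- 300
dataset <- data.frame(Analyte1 = rnorm(n), Analyte2 = rnorm(n))
dataset$Group1 <- rbinom(n, 1, plogis(dataset$Analyte1 - 0.5 * dataset$Analyte2))

dd <- datadist(dataset); options(datadist = "dd")
fit <- lrm(Group1 ~ rcs(Analyte1, 3) + rcs(Analyte2, 3),
           data = dataset, x = TRUE, y = TRUE)

# Log-odds as a continuous function of each analyte (cf. RMS Fig. 11.2)
plot(Predict(fit))

# Bootstrap calibration: predicted vs. observed probabilities (cf. Fig. 11.5)
plot(calibrate(fit, B = 200))
```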

Similarly, a choice of probability cutoff will have associated values of specificity and sensitivity, but they aren't then of primary concern. The sensitivity/specificity tradeoff should be based on relative costs, for which the probability model is most directly useful.

If you nevertheless want to display an ROC curve, the correct way to do so depends on the documentation for the roc() function, whose package you don't specify. (On this statistics site, software-specific questions are considered off-topic anyway.) If that function gives a plot of sensitivity versus 1 − specificity, then you have an ROC curve. I'd recommend further annotating the curve with associated probabilities and/or linear-predictor values along its length, as those are what's of most general interest and importance.
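If roc() comes from the pROC package (an assumption; the question doesn't name the package), such an annotated curve can be produced with the print.thres argument, shown here with simulated stand-in data:

```r
# Sketch assuming roc() is from the 'pROC' package, with simulated
# stand-in data for the outcome and predicted probabilities.
library(pROC)

set.seed(3)
n <- 200
x <- rnorm(n)
outcome <- rbinom(n, 1, plogis(2 * x))
prob <- plogis(2 * x)  # stand-in for a model's predicted probabilities

roc_obj <- roc(outcome, prob)

# Sensitivity vs. specificity with selected probability thresholds
# printed along the curve.
plot(roc_obj, print.thres = c(0.25, 0.5, 0.75))
auc(roc_obj)
```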

Added to respond to comment

I want to see if using more than one can bring a more precise diagnostic than one alone.

Frank Harrell discusses this in this web post. A likelihood-ratio test comparing a model with the original and the additional predictors to that with only the original predictor can gauge the "statistical significance" of the additional predictors. A related "adequacy index" describes the adequacy of the original model without the additional predictors. He illustrates the approach with a logistic regression model.
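For nested glm fits like the one in the question, such a likelihood-ratio test is available via stats::anova(); a sketch with simulated stand-in data:

```r
# Sketch: likelihood-ratio test of whether Analyte2 adds to Analyte1,
# using simulated stand-in data.
set.seed(11)
n <- 300
d <- data.frame(Analyte1 = rnorm(n), Analyte2 = rnorm(n))
d$Group1 <- rbinom(n, 1, plogis(d$Analyte1 + 0.6 * d$Analyte2))

fit_small <- glm(Group1 ~ Analyte1,            data = d, family = binomial)
fit_big   <- glm(Group1 ~ Analyte1 + Analyte2, data = d, family = binomial)

# Chi-squared likelihood-ratio test for the added predictor
anova(fit_small, fit_big, test = "Chisq")
```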

EdM
  • Thank you very much for this answer; it helps a lot. Coming from a purely biological background, ROC curves are often preferred over tables, as they are more visual, while people in epidemiology/statistics prefer tables with models. I also checked for collinearity and the effect of age and sex as covariates. In general, we think only a few could have a linear association with the disease, while others are elevated because of the disease. I will carefully look at the examples you provided. – Renaud Dec 19 '23 at 20:56
  • @Renaud as I biologist I also often prefer graphs/curves over tables. The trick is to show graphs that convey the most useful information. I have published one or two ROC curves in the past, but as I learned more I came to realize that they convey very little useful information on their own. At the least, annotate the ROC curve with corresponding values of class-probability estimates and/or your quantitative biomarker levels. Plots like those in Figure 11.2 of the current online Regression Modeling Strategies tell a much more complete story. – EdM Dec 20 '23 at 14:44