Best subgroup identification - Cox regression, biomarker performance

Question

We are working on a blood-based biomarker (let's call it NEW_bm) for a specific disease. We have data from 4 years of follow-up and fitted a multivariable Cox regression to see whether it has predictive capacity. We adjusted for several "usual" variables such as age, sex, and some other already established blood values.

We saw that NEW_bm works reasonably well and improves the prediction of the event. Now we want to identify "subgroups" where NEW_bm performs "the best", so we can identify specific risk groups of the disease we are looking at.

Is there any way to do this with statistical methods? For me, the main problem is that I don't know where to start. Should I first look into "diabetic male elderly" patients or into "non-diabetic young sporty" patients? Maybe a method based on residuals? Or an ML algorithm? Just as a starting point. For the sake of completeness, the aim is not to boost NEW_bm; it's more about understanding the underlying illness.

EdM · Answer 1 · 2024-01-11T22:33:32.503

One way to address this would be to start with your and your colleagues' understanding of the subject matter. Presumably you know something (or at least have some informed guesses) about how and why NEW_bm is associated with outcome and the clinical characteristics that are most likely to affect that association of NEW_bm with outcome. Then extend your Cox model to include interaction terms between NEW_bm and those clinical characteristics.
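To illustrate how an interaction term turns into a subgroup-specific effect, here is a minimal sketch. It assumes the linear predictor of the Cox model contains beta_bm * NEW_bm + beta_z * z + beta_interaction * NEW_bm * z, where z is one clinical characteristic (diabetes status is used here only as an example). The function name and the coefficient values are hypothetical and do not come from any fitted model.

```python
import math

def subgroup_hazard_ratio(beta_bm, beta_interaction, z, delta_bm=1.0):
    """Hazard ratio for a delta_bm increase in NEW_bm at modifier value z.

    With an interaction term, the log hazard ratio per unit of NEW_bm is
    beta_bm + beta_interaction * z, so it differs between subgroups.
    """
    return math.exp((beta_bm + beta_interaction * z) * delta_bm)

# Hypothetical coefficients, for illustration only.
beta_bm = 0.30           # log hazard ratio per unit NEW_bm when diabetes == 0
beta_interaction = 0.20  # additional log hazard ratio per unit NEW_bm when diabetes == 1

hr_non_diabetic = subgroup_hazard_ratio(beta_bm, beta_interaction, z=0)
hr_diabetic = subgroup_hazard_ratio(beta_bm, beta_interaction, z=1)

print(f"HR per unit NEW_bm, non-diabetic: {hr_non_diabetic:.2f}")
print(f"HR per unit NEW_bm, diabetic:     {hr_diabetic:.2f}")
```

With coefficients like these, the association of NEW_bm with outcome would be stronger in the diabetic subgroup. In a real analysis the interaction coefficient and its confidence interval come from the fitted model, and that interval is what tells you whether the apparent difference between subgroups is distinguishable from noise.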

Make sure, however, to start with a pre-specified model that is consistent with the model complexity that you can expect to get without overfitting, based on the size of your data set. See Frank Harrell's Regression Modeling Strategies, particularly Chapters 2 and 4 on general issues and the chapters on survival models. Don't fall into the trap of trying lots of interactions until you find one that's "significant," as that type of data dredging runs a big risk of not working on a new data set.
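As a rough way to check a pre-specified model against the data you have, the sketch below compares the number of planned coefficients with a budget derived from the number of observed events. It assumes the commonly quoted rule of thumb of about 15 events per candidate parameter (values between 10 and 20 are often cited); the function names and all numbers are hypothetical.

```python
def parameter_budget(n_events, events_per_parameter=15):
    """Approximate number of coefficients the data can support.

    For survival outcomes the limiting quantity is the number of events,
    not the number of patients. The default of 15 is a rule of thumb.
    """
    return n_events // events_per_parameter

def count_parameters(n_main_effects, n_interactions):
    """Count coefficients, assuming one per term.

    Spline terms or multi-level categorical predictors need more than one
    coefficient each, so this is a lower bound for such models.
    """
    return n_main_effects + n_interactions

# Hypothetical study: 180 events during follow-up.
n_events = 180
budget = parameter_budget(n_events)

# Hypothetical plan: 8 main effects plus 3 interactions with NEW_bm.
planned = count_parameters(n_main_effects=8, n_interactions=3)
within_budget = planned <= budget

print(f"budget: {budget}, planned: {planned}, within budget: {within_budget}")
```

Each interaction with NEW_bm spends part of this budget, which is one reason to pick the few interactions suggested by subject-matter knowledge rather than trying all of them.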

An alternative is to use some type of tree-based survival model. If you're not familiar with tree models, Chapter 8 of An Introduction to Statistical Learning has a good introduction to the general principles. There are implementations of the various tree-based approaches that allow for Cox models and censoring.

Tree-based models, built carefully, can take advantage of the associations of combinations of predictors, like those of your NEW_bm with other clinical variables, without your having to pre-specify the combinations. The best-performing types of models don't directly tell you what those combinations are, as they don't report results similar to the interaction coefficients of multiple regression models. Two-way "partial dependence plots," however, can illustrate how the model's predictions change across two-way combinations of NEW_bm with the other variables. That could point the way to leads for further investigation. The R pdp package can produce such plots for many types of machine-learning models.
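The idea behind a two-way partial dependence plot can be sketched in a few lines: for each pair of grid values, set both features to those values in every row of the data, predict, and average. The sketch below is plain Python rather than the pdp package, and toy_risk_score is a hypothetical stand-in for a fitted model's predicted risk score; the data rows are made up.

```python
from statistics import mean

def partial_dependence_2d(predict, rows, feature_a, grid_a, feature_b, grid_b):
    """Two-way partial dependence of predict on feature_a and feature_b.

    For each (a, b) grid pair, both features are overwritten in every row
    and the predictions are averaged over the rows.
    """
    surface = {}
    for a in grid_a:
        for b in grid_b:
            modified = [{**row, feature_a: a, feature_b: b} for row in rows]
            surface[(a, b)] = mean(predict(r) for r in modified)
    return surface

def toy_risk_score(row):
    # Hypothetical stand-in for a fitted model; includes a NEW_bm x diabetes effect.
    return (0.02 * row["age"]
            + 0.3 * row["new_bm"]
            + 0.2 * row["new_bm"] * row["diabetes"])

# Made-up data, for illustration only.
rows = [
    {"age": 50, "new_bm": 1.0, "diabetes": 0},
    {"age": 60, "new_bm": 2.0, "diabetes": 1},
    {"age": 70, "new_bm": 3.0, "diabetes": 0},
]

pd_surface = partial_dependence_2d(
    toy_risk_score, rows, "new_bm", [1.0, 3.0], "diabetes", [0, 1]
)

# Change in averaged risk score per unit NEW_bm within each diabetes level.
slope_non_diabetic = (pd_surface[(3.0, 0)] - pd_surface[(1.0, 0)]) / 2.0
slope_diabetic = (pd_surface[(3.0, 1)] - pd_surface[(1.0, 1)]) / 2.0
print(slope_non_diabetic, slope_diabetic)
```

If the slope along NEW_bm differs between levels of the second variable, as it does in this toy model, that pair is a candidate for closer investigation. With a real tree-based survival model you would pass its prediction function in place of toy_risk_score.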
