
I'm looking at building a model to find what factors from an electronic record are associated with trial enrolment.

I have three questions about this.

  1. I have a large set of possible covariates and a logit model will not converge with all of them included; therefore I need to select a subset of the available variables. I recently heard in a talk given by Frank Harrell that 'unless you have a gun to your head' you shouldn't perform a variable selection algorithm. To what extent is this true in this situation, and what can I do?
  2. I am not looking for causation in my covariates; however, I am interested in the unbiased effect of each of them, rather than just the most accurate prediction of enrolment. As such, would I be correct to assume that each of the covariates may suffer from potential selection bias? If so, typically there is one variable of interest on which you carry out causal inference methods to control for observed confounding. What alternatives are there when I'm interested in balancing across all covariates within the model? If this is possible, is this the correct approach to take in this situation?
  3. Are there other modelling options other than regression and penalised regression?

Any help is highly appreciated.

Geoff

2 Answers


It would be useful to know in what context Dr. Harrell made that remark.

The classic misadventure with variable selection algorithms is using such an algorithm to identify the hypothesis and then reporting the results of the hypothesis test as if it had been prespecified. When formulating a hypothesis from a multivariable regression model, different combinations of adjustment variables constitute different hypotheses, even with the same main effect.

Developing a predictive model does not usually involve hypothesis testing, so many analysts relax the stringency that usually conserves type 1 error rates: applying many models and picking whichever results are best, or manually tuning penalties, weights, or tradeoffs. Variable selection and prediction are closely related. The type 1 error in prediction has traditionally been underemphasized because its cost is seen as different from, and smaller than, the cost in confirmatory studies. In confirmatory studies, type 1 errors amount to bringing ineffective drugs to market, ruling innocent defendants "guilty", implementing ineffective public policies, and so on. But one could argue that in big data, repeated type 1 errors are contributing to a fizzling hype. So it's not clear on which side of the fence one should fall.

Trying to formally control the type 1 error rate for the many ML/DS methods has not played out well; numerous studies show it's quite hard to formally conserve a type 1 error rate in problems such as adding variables to predictive models or other machine learning applications. One method to enforce internal consistency of predictions is split-sample validation. This, if anything, benefits the analysis by trimming the $N$ so as not to select too many variables. In the context of variable selection, I think this would be fine.
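If it helps, here is a minimal sketch of that split-sample idea in Python; the data are synthetic stand-ins and the L1-based selector is just one illustrative way to do the selection step, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the EHR covariates and a 0/1 enrolment indicator.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           random_state=0)

# Hold out half the sample; variable selection happens only on the training half.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Select variables with an L1-penalized logistic model fit on the training half.
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)))
selector.fit(X_train, y_train)

# Refit a plain logistic model on the selected variables and check
# discrimination on the untouched half.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(selector.transform(X_train), y_train)
auc = roc_auc_score(y_test, model.predict_proba(selector.transform(X_test))[:, 1])
print("held-out AUC:", auc)
```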

When a subject enrolls in a trial, the trial results are highly subject to participation bias, a specific form of "selection bias". Trial participants tend to be white, female, and healthier than the general population. However, if your trial participants were identified from an electronic database, and you have access to database records for all utilizers, then selection bias isn't a problem, and developing an "enrollment" model helps you derive weights and probabilities to correct the selection bias in the trial results.
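As a rough illustration of how such an enrolment model can be turned into participation weights (synthetic data; the names `enrol_model` and `enrolled` are placeholders, and this assumes covariates are available for all utilizers, not just enrollees):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: X holds covariates for every utilizer,
# 'enrolled' flags who actually enrolled in the trial.
X, enrolled = make_classification(n_samples=5000, n_features=10, random_state=0)

# Enrolment (participation) model fit on all utilizers, not just enrollees.
enrol_model = LogisticRegression(max_iter=1000).fit(X, enrolled)
p_enrol = enrol_model.predict_proba(X)[:, 1]

# Inverse-probability-of-participation weights for the enrollees; these can be
# used to reweight trial results back toward the source population.
weights = 1.0 / p_enrol[enrolled == 1]
print(weights[:5])
```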

As for selecting variables in the model, it can really be as simple as statistical significance in the logistic model, or you can even use an ROC regression where statistical significance means the ROC is significantly better; variables can be added or removed via stepwise forward or backward selection. You can use an L1 penalty. You can select models with optimal AIC or BIC...
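For concreteness, here is one hedged way those last two options might look in Python with statsmodels; the data and the particular models being compared are purely illustrative:

```python
import statsmodels.api as sm
from sklearn.datasets import make_classification

# Synthetic stand-in data; in practice X would be the candidate covariates.
X, y = make_classification(n_samples=2000, n_features=6, random_state=1)
X = sm.add_constant(X)

# Compare candidate logistic models by AIC (BIC is available the same way).
full = sm.Logit(y, X).fit(disp=0)
reduced = sm.Logit(y, X[:, :4]).fit(disp=0)   # arbitrarily drops two covariates
print("AIC full:", full.aic, " AIC reduced:", reduced.aic)

# An L1-penalized fit: coefficients shrunk exactly to zero drop out of the model.
l1_fit = sm.Logit(y, X).fit_regularized(method="l1", alpha=5.0, disp=0)
print(l1_fit.params)
```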

AdamO
  • Thanks for this reply. I'm not sure if I'm misunderstanding, but I'm not trying to give the best prediction of enrolment. I'm trying to find what subgroups in the EHR have a lower chance of enrolling and then look into it further. This wouldn't be a stage-one regression for a two-stage health outcome model. What I need is unbiased coefficients in the enrolment model, and as such, I was wondering if it is appropriate to use causal inference methods on the enrolment model, not the outcome model. Would using the methods you suggested be ideal for this? – Geoff Jan 30 '23 at 10:14

First, Frank Harrell discusses many approaches to this problem in Regression Modeling Strategies (RMS), especially Chapters 4 and 5. There's nothing wrong with selecting predictors based on your understanding of the subject matter, grouping related predictors together into single predictors, or other methods of data reduction, provided that you don't use the outcomes in making those choices. If you don't use outcomes in this process, you don't inflate the Type I error rate. That type of data reduction can be a useful first step in any event. Don't underestimate the importance of applying subject-matter knowledge first, something that can be under-emphasized in discussions of machine learning from "big data."
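One hedged example of such outcome-blind data reduction: summarizing a block of related covariates (say, several correlated lab values) by their first principal component before any modelling. The data and column grouping below are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
labs = rng.normal(size=(1000, 4))    # stand-in for four related lab measurements
labs[:, 1:] += labs[:, [0]]          # induce correlation among them

# The outcome is never consulted here, so this doesn't inflate the Type I error rate.
lab_score = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(labs))
# 'lab_score' can replace the four columns as a single predictor in the model.
print(lab_score.shape)
```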

Second, with a binary outcome like yours it's wise to include as many outcome-associated predictors in the model as possible. Otherwise there is a risk of omitted-variable bias even if omitted predictors are uncorrelated with those in the model. The task is thus to try to include as many outcome-associated predictors as possible without overfitting.
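A tiny simulation can show why: with a logistic model, omitting an outcome-associated predictor attenuates the coefficient of a retained predictor even when the two are independent (non-collapsibility of the odds ratio). This is only an illustrative sketch with made-up coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                       # independent of x1
p = 1 / (1 + np.exp(-(1.0 * x1 + 1.0 * x2)))  # true log-odds: x1 + x2
y = rng.binomial(1, p)

both = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
only_x1 = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

# The x1 coefficient shrinks toward zero when x2 is omitted, even though
# x1 and x2 are uncorrelated.
print("x1 with x2 included:", both.params[1])
print("x1 with x2 omitted: ", only_x1.params[1])
```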

Third, L2-penalized maximum-likelihood estimation (extending ridge regression to a binary outcome) is a well respected choice if there are still too many predictors. That includes all predictors while penalizing coefficients to minimize overfitting. If there is a particular predictor of major interest, you could choose not to penalize its coefficient while penalizing those of covariates that you are trying to adjust for. Be careful with penalization when you have categorical predictors, however; see Section 9.11 of RMS and this page and its links.
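Here is a minimal sketch of L2-penalized (ridge) logistic regression with the penalty strength chosen by cross-validation, using scikit-learn on synthetic stand-in data. Note that scikit-learn applies the same penalty to every coefficient; leaving a particular predictor of interest unpenalized would need other tooling (for example, the penalty options in Harrell's rms package in R):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in with more predictors than would comfortably fit unpenalized.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)

# Standardize so a single L2 penalty applies comparably to all coefficients;
# the penalty strength is chosen by 5-fold cross-validation.
ridge_logit = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=20, penalty="l2", cv=5, max_iter=5000))
ridge_logit.fit(X, y)
print(ridge_logit[-1].coef_.ravel()[:5])
```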

Fourth, you could use learning methods like boosted trees. If they learn slowly, they can use all the data and incorporate unsuspected interactions among predictors without overfitting. The resulting model can be very good at prediction but is typically difficult to interpret in terms of the individual predictors. One approach to simplify the model and improve interpretability is to develop the tree-based model, collect its predicted log-odds estimates, and then use those predictions as the outcome in a linear regression on the predictor variables.
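A hedged sketch of that two-stage idea, again on synthetic stand-in data (the particular boosting implementation and settings are just illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LinearRegression

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)

# Slow learning: small learning rate, many shallow trees.
gbm = HistGradientBoostingClassifier(learning_rate=0.05, max_iter=500,
                                     max_depth=3, random_state=0)
gbm.fit(X, y)

# Predicted log-odds from the tree model (probabilities clipped to avoid infinities).
p = np.clip(gbm.predict_proba(X)[:, 1], 1e-6, 1 - 1e-6)
log_odds = np.log(p / (1 - p))

# Interpretable summary: linear regression of the tree model's predicted
# log-odds on the original predictors.
summary = LinearRegression().fit(X, log_odds)
print(summary.coef_[:5])
```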

Types of bias

It's important to distinguish selection bias from the bias introduced by penalized regression.

Selection bias means that the analyzed data sample doesn't adequately represent the underlying population, so that estimates based on the analysis don't adequately represent what you would find in the full population. Consider your situation: a large set of patients with only a subset enrolling for trials.

A trial based on your enrollees would suffer from selection bias if they don't represent the underlying population. A model of who chooses to enroll, based on your electronic records, won't suffer from selection bias provided that your full set of records adequately represent the underlying population.

Penalized regression doesn't produce selection bias. It introduces a different type of bias, a downward bias of the magnitudes of regression coefficient estimates that leads to a corresponding bias in model predictions.

That's a choice made to improve how the model will work on the underlying population, via the bias-variance tradeoff. See Section 2.2.2 of ISLR. A low ratio of cases to predictors in building a model can lead to excessive variance when you apply the model to the broader population. Deliberately introducing a small amount of bias in coefficient estimates and model predictions via penalized regression can provide a more-than-corresponding decrease in variance and greatly improve model performance.

Furthermore, L2 penalization (ridge regression) keeps all predictors in the model and can work when there are more predictors than cases. So you don't face the predictor-selection problem that arises with LASSO or stepwise regression. All the predictors can be penalized to similar degrees if you wish, so you thus can (at least try to) evaluate relative contributions of predictors to the choice of whether to enroll. The estimates of individual coefficients might be biased, but their relative magnitudes can be pretty much maintained.
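An illustrative sketch of that last point: with standardized predictors and a common L2 penalty, the coefficients shrink toward zero but the rank order of their magnitudes is largely preserved. This uses synthetic data and is meant only to show the comparison, not to guarantee the behavior in any particular dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           random_state=2)
Xs = StandardScaler().fit_transform(X)   # same scale, so coefficients are comparable

# Essentially unpenalized fit (very weak penalty) versus a strongly ridged fit.
weak = LogisticRegression(C=1e6, max_iter=5000).fit(Xs, y)
ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=5000).fit(Xs, y)

# The penalized coefficients are smaller, but the ordering of their magnitudes
# (the "relative contributions") is largely the same.
print(np.argsort(-np.abs(weak.coef_.ravel())))
print(np.argsort(-np.abs(ridge.coef_.ravel())))
```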

If you want to model the choice to enroll, penalized regression thus can reliably accomplish what you want. The selection bias will be no different than what's in your full data set, and the coefficient bias will improve model performance with respect to the underlying population beyond your current data set.

EdM
  • Thanks for the reply. Looking into it further, would penalised regression be appropriate here? The idea is to find inequalities in enrollment. Since this type of regression biases coefficients, I'm not sure whether I should use it? – Geoff Jan 30 '23 at 10:23
  • @Geoff penalized regression introduces bias, but (if I understand your meaning correctly) unlike L1/LASSO there's no "selection bias" with L2 penalization (ridge regression). With L2/ridge, all predictors are maintained in the model; all coefficients are penalized. That would seem to meet your requirements even if all coefficients are "biased" from what they would be otherwise. If you want to apply your finding to new data, you do have to worry about predictive performance and the bias-variance tradeoff. See Section 2.2.2 of ISLR. Ridge deals nicely with that. – EdM Jan 31 '23 at 14:31
  • Ah ok. I wasn't talking about selection bias with regard to covariate selection; I meant selection bias with regard to the distribution of each covariate in my dataset. That is, is the way in which the data were collected contributing to confounding? If so, is adjustment needed with, say, propensity score weighting? My concern with penalised regression is the same: I need to make unbiased inferences on the contribution of each covariate to trial enrolment. What are your thoughts on this? – Geoff Jan 31 '23 at 15:39
  • @Geoff if you are just trying to determine propensity-score weights relative to enrollment from your data sample, you don't have to worry so much about overfitting the propensity model. See this answer or this answer, although there are limits. The boosted-tree method of twang shouldn't have convergence problems. If you want to generalize to new data, however, the bias-variance tradeoff matters. – EdM Jan 31 '23 at 16:02
  • I think there has been some confusion, but let me know if not. My enrolment model is my main outcome model; it is not the propensity score model. With my enrolment model being the main model, I am interested in whether the coefficients in this model are biased by selection bias. I am also interested in interpretation. This relates to a policy question: are there certain groups that have a lower chance of enrolment? With this question in mind, I feel like the models suggested are not appropriate. Let me know what you think. – Geoff Jan 31 '23 at 17:33
  • @Geoff I added to the answer to deal with selection bias and penalized regression bias. If you want to model the choice to enroll in trials and you can't decrease the number of predictors adequately by data reduction as outlined in the first paragraph of the answer, then penalized regression is a good choice. The bias in coefficient estimates still allows for interpretation, while it doesn't add to selection bias. – EdM Feb 01 '23 at 23:40