1

I am working with a private medical dataset that includes categorical features derived from patient examinations. The problem is that some patients underwent an MRI, others a scanner, and some underwent both. Thus, scanner-only patients have missing values in the MRI-associated features, and vice versa.

How could I handle this situation? I have thought of three solutions so far:

  • Using an "examination not passed" category to replace missing values, but this would be considered as a full category on itself by machine learning algorithms. They could make correlations such as "exam not passed" => "class number 1" but there is no link between both as the examination rely on availability of imaging devices in the hospitals from where the data were collected. Some just didn't own MRI devices, etc.

  • Treating MRI, scanner, and MRI+scanner patients as three different datasets and training a different model on each one. But doing so would mean writing specific code wrapping scikit-learn objects in order to automate the whole training process (see the sketch after this list).

  • Using a model robust to missing values, such as XGBoost. I don't think this is a good idea: my problem should be handled beforehand, since XGBoost applies its own internal handling of missing values. It just moves the problem elsewhere.
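
A minimal sketch of the second option, assuming a pandas DataFrame df with a hypothetical exam_type column (one of "mri", "scanner", "both"), a target column y, and illustrative feature lists that are already numerically encoded; none of these names come from the original post:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    COMMON_COLS = ["age", "sex"]                    # features available for everyone (placeholder names)
    MRI_COLS = ["mri_feat_1", "mri_feat_2"]         # MRI-specific features (placeholder names)
    SCANNER_COLS = ["scan_feat_1", "scan_feat_2"]   # scanner-specific features (placeholder names)

    SUBGROUP_FEATURES = {
        "mri": COMMON_COLS + MRI_COLS,
        "scanner": COMMON_COLS + SCANNER_COLS,
        "both": COMMON_COLS + MRI_COLS + SCANNER_COLS,
    }

    def fit_per_subgroup(df: pd.DataFrame) -> dict:
        """Train one classifier per examination subgroup, keyed by subgroup name."""
        models = {}
        for group, cols in SUBGROUP_FEATURES.items():
            subset = df[df["exam_type"] == group]
            models[group] = RandomForestClassifier(random_state=0).fit(subset[cols], subset["y"])
        return models

At prediction time, each patient would be routed to the model matching their examination pattern, which is the kind of wrapping code the second option implies.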

aulok
  • Just fill in the missing values with 0 to denote the patients who underwent only the MRI or only the scanner, and proceed with the model? – Jay Ekosanmi Jan 14 '22 at 13:34
  • It would be equivalent to adding an "exam not passed" category for each feature that is not applicable. But models would treat this new category like any other one and compute with it, whereas they should ignore it: it has no link with the output and could mislead them into believing this lack of information is actually information – aulok Jan 14 '22 at 13:45
  • 1
    One problem with your reasoning is that you assume the absence of such a test is not related to the outcome. However, think about how not getting an MRI test relates to other variables that do impact the outcome. For example (this does not have to be true or applicable in your case): non-insured people may avoid getting an MRI to save costs, and they usually tend to have worse health; hospitals in rural areas tend to have MRI machines less often, and maybe rural vs. urban has an effect on the outcome; or less skilled doctors fail to see the need for an MRI scan – Janosch Jun 08 '22 at 13:48
  • Have a look at (and modify accordingly) https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258 – kjetil b halvorsen Jun 08 '22 at 16:43

3 Answers

0

You can encode the MRI-related variables for the not-applicable cases using an out-of-range value, e.g., if your measurements lie between 0 and 1000 you can encode the N/A cases as -1000 (see the sketch after this list). I can think of at least two reasons why you should not leave them missing and rely on XGBoost to handle it:

  1. It is not clear whether XGBoost's implicit imputation can provide unbiased estimators beyond the missing completely at random (MCAR) mechanism, where the missingness is just a random sample of the observed values. Simulation studies are needed to understand its limitations.
  2. This is not real missingness: you know that these patients weren't examined because of a lack of resources, or because they didn't meet the clinical criteria for an MRI examination. So you should not impute these cases.
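
A minimal sketch of this encoding, assuming a pandas DataFrame df and hypothetical MRI-related column names; the -1000 sentinel follows the example above:

    import pandas as pd

    MRI_COLS = ["mri_feat_1", "mri_feat_2"]   # hypothetical MRI-related columns
    SENTINEL = -1000                          # out-of-range value, as in the example above

    # Assumption: when a patient never had an MRI, all MRI columns are NaN at once.
    # Flag those rows and encode them with the sentinel so that "not applicable"
    # stays distinguishable from genuinely missing measurements.
    no_mri = df[MRI_COLS].isna().all(axis=1)
    df.loc[no_mri, MRI_COLS] = SENTINEL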

The lack of information can also be informative. It may be that people who were screened by MRI have a better prognosis, because this is an important examination that prevents patient deterioration by allowing early intervention. Or, even more fundamentally, hospitals without this machine lack resources compared to the ones where an MRI is available, where patient care tends to be better.

However, I can anticipate that performing imputation for the genuinely missing values after encoding this variable can cause an inconsistency problem: for example, one MRI-related feature could keep the out-of-range value while the others are imputed with plausible values inside the measurement range, when all of them should be -1000.

  • What makes you believe "XGBoost imputation only works for missing completely at random (MCAR) mechanism"? (I wouldn't call xgb's handling of missing values "imputation", so perhaps I'm not thinking about the same thing as you?) – Ben Reiniger Jun 08 '22 at 12:08
  • When tree-based methods group missing data and split, they assume that those missing are similar in some way. This can be considered implicit imputation. Here is a simulation study showing that imputation with RF can produce biased estimators under MAR assumption: https://academic.oup.com/aje/article/179/6/764/107562 – The Doctor Jun 08 '22 at 13:12
  • That article is about using random forest in a MICE imputation setting, so the random forest(s) are trained with the missing values as targets. And the abstract seems to suggest they're saying that RF MICE imputation produces unbiased estimates?? // XGBoost's way of choosing to which of the two sides of a split to send missing values is quite different. In particular, I think it works best when the missingness itself is predictive, as you discuss in your penultimate paragraph, so definitely not MCAR. – Ben Reiniger Jun 08 '22 at 14:34
  • MICE RF is a different technique, where an RF engine is used in the MICE setup instead of linear regression in the chained-regression part of the algorithm. It gets unbiased estimators because of the MICE algorithm, not particularly because of RF. They compare this with a purely RF-based package (missForest), which I agree is different from letting the algorithm do implicit imputation, but it at least provides some information. Simulation studies are required for this kind of approach, but using XGBoost blindly without understanding the missingness mechanisms is a bad idea. – The Doctor Jun 08 '22 at 15:15
0

Based on the comments on your original question, I think the easiest solution would be to have two columns per test (four overall). Let's say the patient passed the MRI: then she would receive a 1 in the first column and a 0 in the second.

If she did not pass, i.e. she failed the test, she gets a 0 in the first column and a 1 in the second. In case she did not take the test at all, both columns are set to 0. The same goes for the other test, with two new columns.
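
A minimal sketch of one way to implement this two-columns-per-test encoding, assuming hypothetical raw columns mri_result and scanner_result that hold "passed", "failed", or NaN when the test was not done; the names are illustrative, not from the post:

    import pandas as pd

    def encode_test(df: pd.DataFrame, col: str) -> pd.DataFrame:
        """Expand one test column into two indicators; (0, 0) means the test was not done."""
        out = df.copy()
        out[col + "_passed"] = (df[col] == "passed").astype(int)
        out[col + "_failed"] = (df[col] == "failed").astype(int)
        return out.drop(columns=[col])

    # Four indicator columns overall, two per test.
    df = encode_test(df, "mri_result")
    df = encode_test(df, "scanner_result")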

Janosch
0

I don't think the answers so far by users @Janosch and The Doctor really answer your problem. Use the solution given at How do you deal with "nested" variables in a regression model?, with one indicator variable has_MRI and another indicator variable has_scanner. Then, in R-speak, use a linear predictor including

 has_MRI + has_MRI:MRI + has_scanner + has_scanner:scan + ...
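
A rough Python equivalent of that linear predictor, using statsmodels' R-style formula interface with a logistic regression as an illustrative choice; the DataFrame df, the outcome y, and the measurement columns MRI and scan are assumptions, not taken from the post:

    import statsmodels.formula.api as smf

    # has_MRI / has_scanner are 0/1 indicators; MRI and scan hold the nested
    # measurements and are set to an arbitrary constant (e.g. 0) when the exam
    # was not done -- the interaction with the indicator then cancels them out.
    model = smf.logit("y ~ has_MRI + has_MRI:MRI + has_scanner + has_scanner:scan",
                      data=df)
    result = model.fit()
    print(result.summary())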