
Suppose I have a discrete numerical variable that does not apply to all my observations, e.g. 'years_married'. Not all the people in my dataframe are married, so those who are not have an 'NA' recorded for this variable.

What would be a correct way to proceed in this case? 'years_married' is an important variable for my study (for those who are married), so I don't want to discard it.

One idea is to split the dataframe in two, one part with this variable (for those who are married) and another without it (for singles), and model them separately. But this would drastically reduce the number of observations (at least in one of the subsets) and hurt the prediction accuracy.

Is there any technique or transformation that can handle this situation, or can you recommend an algorithm (e.g. Random Forest) that does?

Thanks :)

Edit: Maybe that was not a good example. The exact case is about the AGE of a device at the time of the study. I have the date when the data from the device was collected, but I do not have the construction date of the device in all cases.
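To make the setup concrete, here is a minimal sketch of how the age variable ends up missing. The column names (`collection_date`, `construction_date`, `age_years`, `age_missing`) and all values are made up for illustration, not taken from the real data.

```python
import pandas as pd

# Hypothetical data: column names and dates are invented for illustration.
devices = pd.DataFrame({
    "device_id": [1, 2, 3],
    "collection_date": pd.to_datetime(["2023-03-01", "2023-03-01", "2023-03-15"]),
    "construction_date": pd.to_datetime(["2015-03-01", None, "2020-03-15"]),
})

# Age in years at collection time; a missing construction date yields NaN.
age_days = (devices["collection_date"] - devices["construction_date"]).dt.days
devices["age_years"] = age_days / 365.25

# Mark rows where the age is unknown (as opposed to "not applicable").
devices["age_missing"] = devices["age_years"].isna()

print(devices[["device_id", "age_years", "age_missing"]])
```

Here the second device has a real but unrecorded age, which is different from the 'years_married' case, where the value does not exist for singles.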

Kaikus
  • Does this answer your question? "80% of missing data in a single variable". The answer from @whuber on that page shows a general approach to this frequent problem. If that doesn't answer your question, please edit your question to specify what's still unclear. – EdM May 01 '23 at 13:51
  • @EdM I have clarified the case, with a clearer description of the data under study – Kaikus May 03 '23 at 07:43
  • This would seem to have an obvious value one can use instead of missing, i.e. someone who has never been married has been married for 0 years. Especially if the model then additionally gets a "married" flag, as well as possibly the interaction between married and the years married, surely this is pretty flexible? – Björn May 03 '23 at 07:48
  • @Kaikus Could you clarify why the duplicate does not answer your question? (There are actually several other similar questions; it isn't the only duplicate.) – Sextus Empiricus May 03 '23 at 09:22
  • As the edited question now explains that the missing values aren't logically impossible but simply unknown, the suggestion I provided isn't appropriate. This type of missing data should be handled by multiple imputation. This reference explains in detail. In R, the mice package is a frequent choice for implementation. Follow the missing-data tag on this site for suggestions. – EdM May 03 '23 at 13:52
