
I am currently learning about missing-value imputation. The popular methods I've seen, such as mean, median, or arbitrary-value imputation, replace all missing values with a single calculated value. Each of these methods can alter the distribution of the variable.

This has given me an idea: why not impute the missing values with several different values, chosen so that the variable keeps the same distribution? For example, when doing mode imputation, if the most frequent values in a categorical variable are E and F, occurring 10 and 5 times respectively, then three missing values could be replaced by E, E, F, which preserves the 2:1 ratio. Obviously this would be scripted as a more generic solution involving the top n values.
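To make the idea concrete, here is a minimal sketch of what such a script might look like. The function name `proportional_impute` and its parameters are hypothetical, not taken from any library. It assumes missing entries are marked with `None`, splits the missing slots among the top n categories with a largest-remainder rule, and assigns the fill values to the missing positions in random order, which is itself an assumption (that missingness is unrelated to anything else in the data).

```python
from collections import Counter
import random


def proportional_impute(values, top_n=2, seed=0):
    """Fill None entries so the top_n categories keep their observed ratio.

    Hypothetical helper for illustration only.
    """
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)
    top = Counter(observed).most_common(top_n)
    if n_missing == 0 or not top:
        return list(values)

    total = sum(count for _, count in top)
    # Integer share and remainder of the missing slots for each category.
    shares = [(cat, *divmod(n_missing * count, total)) for cat, count in top]
    alloc = {cat: base for cat, base, _ in shares}

    # Largest-remainder rule: leftover slots go to the biggest remainders.
    leftover = n_missing - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda s: s[2], reverse=True)
    for cat, _, _ in by_remainder[:leftover]:
        alloc[cat] += 1

    # Assumption: which missing row gets which value is decided at random.
    fills = [cat for cat, k in alloc.items() for _ in range(k)]
    random.Random(seed).shuffle(fills)
    fill_iter = iter(fills)
    return [next(fill_iter) if v is None else v for v in values]


data = ["E"] * 10 + ["F"] * 5 + [None] * 3
filled = proportional_impute(data, top_n=2)
print(Counter(filled))
```

With the E/F example above, the three missing slots are split two to E and one to F, so the counts become 12 and 6 and the 2:1 ratio is unchanged. Categories outside the top n receive no imputed values in this sketch, so their relative share shrinks slightly.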

Would there be any disadvantages to this method? Is this a known standard method (I couldn't find one)?
