4

I've understood that the kNN imputer, being a multivariate imputer, is "better" than univariate approaches like SimpleImputer in the sense that it takes multiple variables into account, which intuitively feels like a more reliable or accurate estimate of the missing value.

But what are the mechanics behind it?
How does it determine what's the nearest neighbour?

Bonus: How is k best determined?

LeLuc
  • 651

1 Answer

2

The $k$-NN algorithm is pretty simple: you need a distance metric, say Euclidean distance, and you use it to compare the sample that has a missing value to every other sample in the dataset. As a prediction for the missing value, you take the average of that variable over the $k$ most similar samples, or their mode if the variable is categorical. $k$ is usually chosen empirically, as the value that gives the best validation set performance.
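To make the mechanics concrete, here is a minimal NumPy sketch of the procedure described above. The function names `nan_euclidean` and `knn_impute` are made up for this illustration, and the rescaling of the distance by the share of usable columns is an assumption borrowed from the convention scikit-learn uses for its NaN-aware Euclidean distance; other choices are possible.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over the columns observed in both rows.

    The squared distance is rescaled by (total columns / shared columns)
    so rows with few shared columns are not favoured. This rescaling is
    an assumed convention, not the only option.
    """
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # nothing to compare on
    sq = np.sum((a[shared] - b[shared]) ** 2)
    return np.sqrt(sq * a.size / shared.sum())

def knn_impute(X, k=2):
    """Fill each missing cell with the mean of that column over the
    k nearest rows that have the column observed."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Only rows with column j observed can donate a value.
        donors = [r for r in range(len(X)) if r != i and not np.isnan(X[r, j])]
        dists = np.array([nan_euclidean(X[i], X[r]) for r in donors])
        order = np.argsort(dists, kind="stable")[:k]
        nearest = [donors[t] for t in order if np.isfinite(dists[t])]
        if nearest:  # leave the cell missing if no usable donor exists
            out[i, j] = X[nearest, j].mean()
    return out

# Toy data: row 0 is missing its third value; rows 1 and 2 are close to it.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [8.0, 9.0, 10.0],
])
X_filled = knn_impute(X, k=2)
print(X_filled[0, 2])  # mean of 3.0 and 3.2, the two nearest donors
```

In practice the features should usually be put on comparable scales first, since a Euclidean distance is dominated by whichever column has the largest range.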

Multivariate methods for imputing missing values do not have to be better than the univariate ones. They will be better if you have relevant, high-quality data. However, if your dataset is small, you may find spurious patterns and start imputing based on them. In such a case, the result will be worse than if you hadn't considered the other variables. Multivariate imputation makes sense only if the other variables enable you to make reasonable predictions for the missing values. For example, if you are missing information about someone's age, knowing their gender would be unlikely to help you guess it, as those properties are not really related in most cases.
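One way to check whether the multivariate approach actually helps on your data, and to pick $k$ at the same time, is to hide some values you do know, impute them, and compare the error. The sketch below uses scikit-learn's `KNNImputer` and `SimpleImputer`; the synthetic dataset, the helper name `masked_rmse`, the 20% masking rate and the grid of $k$ values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): columns 1 and 2
# are strongly related to column 0, so neighbours are informative.
n = 200
x0 = rng.normal(size=n)
X_true = np.column_stack([
    x0,
    2.0 * x0 + rng.normal(scale=0.1, size=n),
    -x0 + rng.normal(scale=0.1, size=n),
])

# Hide 20% of the known values in column 1.
mask = np.zeros_like(X_true, dtype=bool)
hidden_rows = rng.choice(n, size=n // 5, replace=False)
mask[hidden_rows, 1] = True

def masked_rmse(imputer, X_true, mask):
    """RMSE between the imputed and the true values at the hidden cells."""
    X_missing = X_true.copy()
    X_missing[mask] = np.nan
    X_imputed = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))

# Univariate baseline: column mean.
rmse_mean = masked_rmse(SimpleImputer(strategy="mean"), X_true, mask)

# k-NN imputation over a small grid of k values.
rmse_knn = {
    k: masked_rmse(KNNImputer(n_neighbors=k), X_true, mask)
    for k in (1, 3, 5, 10, 20)
}
best_k = min(rmse_knn, key=rmse_knn.get)

print(f"mean imputation RMSE: {rmse_mean:.3f}")
for k, err in rmse_knn.items():
    print(f"k-NN (k={k}) RMSE: {err:.3f}")
print("best k on this check:", best_k)
```

Here the columns are strongly related by construction, so k-NN should win clearly. If the other columns carried no information about the masked one, the same check would tend to show no gain over the column mean. Note also that masking cells at random only mimics data that are missing completely at random.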

While this may or may not be directly related to your question, you always need to consider why the data is missing. If the data is not missing at random, data-based imputation may lead to incorrect results. You may also want to read the "What are the disadvantages of using mean for missing values?" thread.

Tim
  • 138,066
  • Thanks for the detailed answer and mentioning the randomness of missing. – LeLuc May 02 '21 at 14:53
  • @Tim how do you know if the data is MAR, MNAR or MCAR? – spectre Nov 22 '21 at 07:01
  • @spectre https://stats.stackexchange.com/questions/23090/distinguishing-missing-at-random-mar-from-missing-completely-at-random-mcar?noredirect=1&lq=1 – Tim Nov 22 '21 at 07:29
  • How can I measure the distance when missing values exist? Do I compute the distances after replacing the missing values with the mean of that column? – hasic Lim Apr 27 '22 at 10:34
  • @hasicLim you can measure distances only using the matching non-missing columns. – Tim Apr 27 '22 at 12:22