
Using too low a value of $k$ leads to overfitting.

But how is overfitting prevented?

  • How do we make sure $k$ is not too low?
  • Are there any other precautions taken in $k$-NN that help prevent overfitting?

1 Answer


This relates to the number of samples you have and the noise in those samples.

For instance, even if you have two billion samples, using $k=2$ could easily lead to overfitting, even without much noise.

If you have noise, you need to increase the number of neighbors so that the decision is based on a region large enough to be reliable.
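
A minimal sketch of this effect (my illustration, not part of the original answer), using scikit-learn and synthetic data with label noise: with a small $k$, training accuracy stays near perfect while test accuracy drops, which is the overfitting being described.

```python
# Hypothetical illustration: small k fits the noise in the training set,
# so training accuracy is high but test accuracy is noticeably lower.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data; flip_y=0.2 injects label noise.
X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}  train={knn.score(X_train, y_train):.3f}  "
          f"test={knn.score(X_test, y_test):.3f}")
```

With $k=1$ the training accuracy is trivially 1.0 (each training point is its own nearest neighbor), so the train/test gap makes the overfitting visible.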

But for a ballpark estimate, I would start with $k = \log(N)$, where $N$ is the number of samples, and increase $k$ depending on the level of noise in the samples.
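
A sketch of using that heuristic as a starting point (the candidate grid and the cross-validation step are my additions, not part of the answer):

```python
# Start at k ≈ log(N) per the heuristic, then increase k and let
# cross-validation judge; the multiplicative grid is an assumption.
import math

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.2,
                           random_state=0)

k_start = max(1, round(math.log(len(X))))  # log(N) starting point, here ~8
candidates = [k_start, 2 * k_start, 4 * k_start, 8 * k_start]

for k in candidates:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:3d}  cv accuracy={scores.mean():.3f}")
```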

  • It seems like this answer has a number of unstated premises that it relies on. How does the number of samples $N$, number of features and noise level relate to the concept of overfitting? Are there any theorems about $k$-NN and these concepts which you can use to explain why $\ln(N)$ is a good starting point? – Sycorax May 14 '22 at 18:24
  • No theorem, at least as of the last time I checked a few years ago. It's just a heuristic that has worked well for me across different datasets. – Matthieu Brucher May 15 '22 at 19:19