Why does the plot for a KNN model with K = 1 show a strange blue dot to the left of both the Bayes decision boundary and the KNN decision boundary? The picture is taken from ISLR.
-
With $k=1$, you are assigning the value of the single nearest neighbor; the blue dot is a point which has whatever value "blue" represents, and the circle around it contains the region for which the blue dot is the nearest neighbor. With different data, you might get more or fewer such seeming anomalies. – jbowman Dec 11 '19 at 19:24
-
I am sorry, I still do not quite understand. Would you mind giving an example, if possible? – Tsz Chun Leung Dec 11 '19 at 19:26
-
Two questions to help calibrate my response - 1) do you understand how KNN works? 2) Do you understand what the dots in the plot are? – jbowman Dec 11 '19 at 19:31
-
I have a very basic understanding of KNN: first we pick a point, then, based on the value of K, we find the K nearest other points and see which class occurs most frequently among them. Finally, we assign that class (color) to the original point. – Tsz Chun Leung Dec 11 '19 at 19:33
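The procedure described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `knn_predict`, the toy training points, and the use of Euclidean distance are all made up for the example and are not taken from ISLR.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of ((x, y), label) pairs; all names here are hypothetical.
    """
    # Sort the training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Count how often each label occurs among those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (an assumption for illustration): two well-separated groups.
train = [
    ((1.0, 1.0), "orange"), ((1.0, 2.0), "orange"), ((2.0, 1.0), "orange"),
    ((5.0, 5.0), "blue"), ((5.0, 6.0), "blue"), ((6.0, 5.0), "blue"),
]

print(knn_predict(train, (2.0, 2.0), k=3))
print(knn_predict(train, (5.5, 5.5), k=3))
```

With K = 3 the first query is surrounded by orange points and the second by blue points, so the votes are unanimous in both cases.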
-
The dots are the observations, colored by their y labels? – Tsz Chun Leung Dec 11 '19 at 19:58
-
Correct. With $k=1$, you only look at the single nearest neighbor, so if you are nearer to the blue dot than to any orange dot, blue is the value you'll assign. Note that there are orange dots between the isolated blue dot and the rest of the blue dots, so the blue dot is in its own little "island" of blueness. – jbowman Dec 11 '19 at 20:03
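The "island" effect can be reproduced with a small sketch. The coordinates below are invented for illustration (they are not the ISLR data), and the helper `one_nn` is a hypothetical name: one blue observation is surrounded by orange ones, and the main blue cluster sits further to the right.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Made-up training data: an isolated blue point ringed by orange points,
# plus a separate blue cluster on the right.
train = [
    ((-1.0, 1.0), "orange"), ((3.0, 1.0), "orange"),
    ((1.0, 3.0), "orange"), ((1.0, -1.0), "orange"),
    ((1.0, 1.0), "blue"),  # the isolated blue observation
    ((6.0, 1.0), "blue"), ((7.0, 1.0), "blue"), ((6.0, 2.0), "blue"),
]


def one_nn(query):
    """Return the label of the single nearest training point (k = 1)."""
    return min(train, key=lambda item: dist(item[0], query))[1]


# Walk from left to right along the horizontal line y = 1.
xs = [-0.5, 0.5, 1.5, 2.5, 4.0, 5.0]
labels = [one_nn((x, 1.0)) for x in xs]
for x, label in zip(xs, labels):
    print(x, label)
```

Along this line the prediction goes orange, blue, blue, orange, orange, blue: the two blue predictions near x = 1 are the island around the isolated observation, cut off from the main blue region by the orange point at x = 3.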
-
Thank you, it is much clearer now. But may I ask why that dot is blue instead of orange? – Tsz Chun Leung Dec 11 '19 at 20:21
-
Because it's an observed data point. At those x-y co-ordinates, an actual observation of "blue" was made. It may be unlikely that that observation would be blue, but with enough data points, unlikely things will occur. – jbowman Dec 11 '19 at 20:31
-
I remember the author said that the purple dotted curve is the Bayes classifier boundary and that these are all simulated data. So I assume these points are the true population, with no exceptions, so every point to the left of the Bayes classifier boundary should be orange. Am I correct? – Tsz Chun Leung Dec 11 '19 at 20:34
-
No, you're not. 1. The "true" population need not fall entirely on one side or the other of the Bayes classifier boundary. Consider, say, "# cigarettes smoked / day" and "age" as two axes, with the label being "has lung cancer"; there will be a Bayes boundary, but not everyone on one side of it will have lung cancer. 2. It's simulated data, so it can't be the "true" population. – jbowman Dec 11 '19 at 20:37
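A small simulation may make the first point concrete. Everything in it is an assumption chosen for illustration: a single feature `x`, a made-up conditional probability `p_blue(x)`, and an arbitrary random seed. The Bayes classifier predicts "blue" wherever that probability exceeds 0.5, so its boundary is at x = 0, yet individual simulated observations to the left of the boundary can still come out blue.

```python
import math
import random

random.seed(0)  # arbitrary seed, only so that reruns give the same sample


def p_blue(x):
    """Hypothetical true probability that an observation at x is blue."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))


def bayes_label(x):
    """Bayes classifier: predict the more probable class at x."""
    return "blue" if p_blue(x) > 0.5 else "orange"


# Simulate observations: each label is a random draw, not a deterministic rule.
points = []
for _ in range(1000):
    x = random.uniform(-2.0, 2.0)
    label = "blue" if random.random() < p_blue(x) else "orange"
    points.append((x, label))

# Look only at the side where the Bayes classifier says "orange" (x < 0).
left = [label for x, label in points if x < 0.0]
blue_on_left = sum(label == "blue" for label in left)
frac_blue_left = blue_on_left / len(left)

print("Bayes prediction at x = -1:", bayes_label(-1.0))
print("observations left of the boundary:", len(left))
print("of which blue:", blue_on_left)
```

Under this made-up model a minority of the observations left of the boundary are blue even though the Bayes classifier labels that whole side orange. A 1-nearest-neighbor fit would draw a small blue region around each of them.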
-
This is a really good example that clears up my doubt. Many thanks. – Tsz Chun Leung Dec 11 '19 at 20:50
-
You're welcome! That's what we're here for! – jbowman Dec 11 '19 at 20:51
