I know the training error of k=1 is 0 because with k=1 each point picks itself as its closest neighbour. However, what would the training error be if k equals the sample size? Say there are 20 reds and 30 blues: is the training error simply 20/30 = 66.6%?
As noted in the answer below (+1), the accuracy is 30 / (20+30) = 0.6, so the error is 40%. – gunes Jul 27 '21 at 07:59
@gunes oh right haha – Math Avengers Jul 27 '21 at 08:11
1 Answer
In $k$NN, if you set $k$ equal to the training set size, $k=N$, then for classification every training sample is among the $k$ nearest neighbors of any query point. Under majority vote, the prediction is therefore the same everywhere: the largest class. The training accuracy thus equals the proportion of the largest class in the training set (here 30/50 = 60%, so a 40% error rate). For regression, the prediction is the average of the target variable over the training set, $\hat y = \bar y$, so the training mean squared error equals the variance $\tfrac{1}{N} \sum_{i=1}^N (y_i - \bar y)^2$.
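A quick numerical check of both claims, sketched with scikit-learn's `KNeighborsClassifier` and `KNeighborsRegressor` (the 20/30 split and the random features are just illustrative, assuming the default uniform-weight vote):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=(N, 2))

# Classification: 20 "red" vs 30 "blue" as in the question.
y_cls = np.array([0] * 20 + [1] * 30)
knn_cls = KNeighborsClassifier(n_neighbors=N).fit(X, y_cls)
print(knn_cls.score(X, y_cls))  # 0.6 = 30/50, the majority-class share

# Regression: with k = N every prediction is the mean of y,
# so the training MSE equals the variance of y.
y_reg = rng.normal(size=N)
knn_reg = KNeighborsRegressor(n_neighbors=N).fit(X, y_reg)
preds = knn_reg.predict(X)
print(np.mean((preds - y_reg) ** 2), np.var(y_reg))  # identical
```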
By the way, for $k=1$ you can still end up with nonzero training error if the data contain observations with identical feature values but different labels. In that case the algorithm has to break the tie between equidistant neighbours, effectively choosing "randomly" between them, and whichever copy it picks, the other is misclassified.
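A tiny illustration of that point (not from the original answer): any classifier, 1-NN included, must predict a single label at a duplicated feature vector, so at least one of the two copies is always wrong:

```python
import numpy as np

# Two observations share x = 0 but carry different labels.
X = np.array([[0.0], [0.0], [1.0]])
y = np.array([0, 1, 0])

# Whatever single label is predicted at x = 0, one of the two
# duplicates is misclassified: the best training accuracy is 2/3.
for label_at_zero in (0, 1):
    preds = np.where(X[:, 0] == 0.0, label_at_zero, y)
    print((preds == y).mean())  # 0.666... either way
```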