
This is my first time using a machine learning algorithm; it is for a school project. My model attempts to predict, from the inputs Weight, Height, and Age, whether an Olympic athlete (100-meter dash) wins a medal or not. Here is the relevant code:

from sklearn.neighbors import KNeighborsClassifier

#Normalize the data
train_athletics_m = min_max_normalization(train_athletics_m)
test_athletics_m = min_max_normalization(test_athletics_m)

#Training features and labels
X = train_athletics_m[['Age', 'Weight', 'Height']]
y = train_athletics_m['Medal']

#Test features and labels
test = test_athletics_m[['Age', 'Weight', 'Height']]
test_labels = test_athletics_m['Medal']

#Fit the classifier and score it on the test set
classifier = KNeighborsClassifier()
classifier.fit(X, y)
classifier.score(test, test_labels)

I tested various k values. The worst accuracy is at k=1 (88%), but for k > 3 the accuracy stays constant at 94%. I was under the impression that the usual behavior is for accuracy to reach a peak and then decrease until it stabilizes as k grows. Also, my accuracy scores are higher than I expected. I am wondering if this is normal, and if not, where my error would most likely be.

2 Answers

  • Accuracy is not the best metric, and it is impossible to interpret it without further details about your data. I have no access to your data, but I found a similar Kaggle dataset in which 85% of the athletes did not get a medal while 15% did. If your data is similar, this means that if you predicted "no medal" for everybody, you would already get 85% accuracy.
  • To judge the performance, you need some benchmarks. Do you know of any other results obtained on this or a similar dataset? How does your result compare to theirs? You would also like an internal benchmark, i.e. a comparison of your model against some trivial model (e.g. predict the most frequent class for everyone; a sketch of such a baseline follows this list). Without this, no metric can be interpreted.
  • The $k$ hyperparameter controls how many nearest neighbors are averaged to make a prediction. Larger $k$ acts as regularization: with small $k$ you are likely to overfit, with large $k$ to underfit. With $k=N$ (the sample size) you make the same prediction every time. Did you check whether your result is simply due to the predictions becoming constant as $k$ increases? Maybe in your case that starts happening pretty fast (the $k$-sweep sketch after this list is one way to check).
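
For the internal benchmark, one option is scikit-learn's DummyClassifier. The snippet below is only a sketch and assumes the X, y, test, and test_labels variables from your question:

from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier

#Assumes X, y, test, test_labels exist exactly as defined in the question
baseline = DummyClassifier(strategy='most_frequent')   #always predicts the most frequent class
baseline.fit(X, y)
print('Baseline accuracy:', baseline.score(test, test_labels))

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print('KNN accuracy:', knn.score(test, test_labels))

#The class distribution itself is also worth printing
print(y.value_counts(normalize=True))

If the KNN score is barely above the baseline, your 94% mostly reflects the class imbalance rather than the features.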
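
To check the constant-prediction explanation, again assuming the variables from the question, you can sweep $k$ and count how many distinct classes the model actually predicts on the test set:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

#Assumes X, y, test, test_labels exist exactly as defined in the question
for k in [1, 3, 5, 11, 25, 51, 101, len(X)]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    preds = knn.predict(test)
    print('k =', k,
          '| accuracy =', round(knn.score(test, test_labels), 3),
          '| distinct predicted classes =', len(np.unique(preds)))

If the number of distinct predicted classes drops to 1 around the same $k$ at which the accuracy flattens out at 94%, that confirms the explanation above.
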
Tim

The way KNN works is by taking a vote among the nearest points.

If your data is well separated, then as you increase K the model won't get any more confused.

In some cases the performance will decrease; this happens when K is so large that it exceeds the number of samples in a class. The small toy sketch below shows the voting directly.
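
To make the voting concrete, here is a tiny sketch on made-up 2-D points (not your data) that asks a fitted model which neighbors it used and what their labels are:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

#Made-up points: class 0 clustered near the origin, class 1 near (1, 1)
X_toy = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
                  [0.9, 1.0], [1.0, 0.9], [1.1, 1.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

query = np.array([[0.15, 0.15]])
dist, idx = knn.kneighbors(query)          #distances and indices of the 3 nearest training points
print('Neighbor labels:', y_toy[idx[0]])   #[0 0 0] -> a unanimous vote
print('Prediction:', knn.predict(query))   #[0]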

If I were you, I would plot the points to make sure you understand how this behavior comes about.

With algorithms like KNN, simple plotting makes it straightforward to understand why the model made a given prediction; a plotting sketch is given below.
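
A minimal plotting sketch, assuming the train_athletics_m data frame and column names from the question (it shows two of the three features, colored by the medal label):

import matplotlib.pyplot as plt

#Assumes train_athletics_m is the (normalized) training data frame from the question
for label, group in train_athletics_m.groupby('Medal'):
    plt.scatter(group['Height'], group['Weight'], label=str(label), alpha=0.5)

plt.xlabel('Height (normalized)')
plt.ylabel('Weight (normalized)')
plt.legend(title='Medal')
plt.show()

If the medal and no-medal points sit on top of each other, that also explains why adding more neighbors stops changing the predictions.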

Also check that your test data is big enough and that there is no data leakage between the training and test sets; one crude check is sketched below.
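
A crude leakage check, again assuming the data frames from the question, is to count feature combinations that appear in both splits:

#Assumes train_athletics_m and test_athletics_m are the data frames from the question
feature_cols = ['Age', 'Weight', 'Height']
overlap = train_athletics_m[feature_cols].drop_duplicates().merge(
    test_athletics_m[feature_cols].drop_duplicates(), on=feature_cols, how='inner')
print('Test rows:', len(test_athletics_m))
print('Distinct feature combinations shared by train and test:', len(overlap))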

I hope that'll be helpful.