I'm trying to find the best k for the iris dataset using KNeighborsClassifier, so I wrote the code below using scikit-learn.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import f1_score

dataset = load_iris()
X = dataset.data
y = dataset.target

skf = StratifiedKFold(y, n_folds = 2, shuffle = True)

tst_f1_results= np.zeros(shape = (len(skf), len(range(1, 25))))

i = 0
for train_index, test_index in skf: 
    i += 1

    # Getting data
    (X_train,
     X_test,
     y_train,
     y_test) = X[train_index], X[test_index], y[train_index], y[test_index]

    for k in range(1, 25):
        classifier = KNeighborsClassifier(n_neighbors = k)

        # Training the classifier
        classifier.fit(X_train, y_train) 

        # Predicting
        y_pred_test = classifier.predict(X_test)    

        # Save Results
        tst_f1_results[i - 1, k - 1] = f1_score(y_test, y_pred_test, average='macro')


f1_scores = np.average(tst_f1_results, axis=0)
plt.grid()
plt.plot(range(1, 25), f1_scores, 'o-', color="g", label="Cross-validation F1-score") 
plt.show()

I ran it several times and got a different result each time, as these four plots show: [four images, each a plot of cross-validation F1-score against k]

  • The first image suggests that I should choose k (n_neighbors) between 5 and 11.
  • The second image suggests that I should choose k (n_neighbors) = 1 or 19.
  • The third image suggests that I should choose k (n_neighbors) = 3, 7, 8 or 9.
  • The fourth image suggests that I should choose k (n_neighbors) = 5.

These results confuse me. The curve varies a lot from run to run. I'm a newbie in machine learning, so maybe I'm doing something wrong. Is this the correct way to validate a model? Can anyone explain what I should do?

  • Do your results stabilize if you increase the number of folds? In general you would want to average results over either many folds, or over different random splits (i.e. average your plots together), or both. – GeoMatt22 Sep 02 '16 at 05:51
  • @GeoMatt22 I have increased the number of folds, but the results are still unstable. – Overflow012 Sep 02 '16 at 14:24
  • The scores do not seem to vary too much actually, considering a 0 to 1 scale. For example compare to here. For the Iris data set, you might compare to results here. – GeoMatt22 Sep 02 '16 at 16:10
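The averaging suggested in the comments (many folds, several random splits, or both) could look roughly like the sketch below. It is not the asker's code: it assumes a current scikit-learn where the splitters live in sklearn.model_selection, and the choices of 5 folds, 10 repeats and random_state=0 are arbitrary illustrations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumed settings: 5 stratified folds, reshuffled 10 times,
# which gives 50 macro-F1 scores for every value of k.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

k_values = list(range(1, 25))
mean_f1 = []
std_f1 = []
for k in k_values:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=cv, scoring="f1_macro"
    )
    mean_f1.append(scores.mean())
    std_f1.append(scores.std())

mean_f1 = np.array(mean_f1)
std_f1 = np.array(std_f1)

# k with the highest average score; compare the gaps between
# neighbouring k values with std_f1 before trusting the ranking.
best_k = k_values[int(np.argmax(mean_f1))]
print(best_k, mean_f1.max())
```

Looking at std_f1 next to mean_f1 helps judge whether the differences between values of k are larger than the split-to-split noise.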

1 Answer

This happens because the dataset is shuffled each time before it is split: you passed shuffle = True in skf = StratifiedKFold(y, n_folds = 2, shuffle = True) but gave no seed for the random shuffling, so every run splits the data into different sets.

Each time the training data changes, the fitted model and its results change too. To get consistent results, use the same training data on every run.
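A small check of how the shuffle changes the folds, as a sketch only. It assumes the sklearn.model_selection version of StratifiedKFold (not the import used in the question), and it uses two different seeds, 1 and 2, to stand in for two unseeded runs so that the example is reproducible. The helper first_test_fold is made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

def first_test_fold(random_state):
    """Return the test indices of the first fold (hypothetical helper)."""
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=random_state)
    train_index, test_index = next(skf.split(X, y))
    return test_index

# Same seed twice: the fold contains exactly the same samples.
same_seed_matches = np.array_equal(first_test_fold(1), first_test_fold(1))

# Different seeds (a stand-in for two unseeded runs): different samples.
different_seed_matches = np.array_equal(first_test_fold(1), first_test_fold(2))

print(same_seed_matches, different_seed_matches)
```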

To get the same split every time, seed the random shuffling with the random_state parameter. For example, try the line below and you will get consistent graphs.

skf = StratifiedKFold(y, n_folds = 2, shuffle = True,random_state=1)
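The line above uses the old sklearn.cross_validation interface from the question. In current scikit-learn releases the splitter is in sklearn.model_selection, takes n_splits instead of n_folds, and receives the data through split(X, y). A rough equivalent of the seeded loop under that assumption is sketched below. The function name f1_curve is invented for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_values = range(1, 25)

def f1_curve(seed):
    """Mean macro F1 per k over a seeded 2-fold stratified split."""
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    scores = np.zeros((skf.get_n_splits(), len(k_values)))
    for i, (train_index, test_index) in enumerate(skf.split(X, y)):
        for j, k in enumerate(k_values):
            classifier = KNeighborsClassifier(n_neighbors=k)
            classifier.fit(X[train_index], y[train_index])
            y_pred = classifier.predict(X[test_index])
            scores[i, j] = f1_score(y[test_index], y_pred, average="macro")
    return scores.mean(axis=0)

# Two runs with the same seed use identical folds,
# so the two curves can be compared element by element.
first_run = f1_curve(seed=1)
second_run = f1_curve(seed=1)
print(np.array_equal(first_run, second_run))
```

Fixing the seed makes the plot repeatable, but the curve still reflects one particular split, so a different seed may favour a different k.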

See the documentation here for details on the parameters.

phanny
  • 442