
I am trying to find the optimal n_neighbors value for KNeighborsClassifier using GridSearchCV. I am able to get the optimized parameters, but when I plug them into my classifier, the score doesn't match GridSearchCV's best score.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

clf = KNeighborsClassifier(n_neighbors=15, weights='uniform')
clf.fit(features_train, labels_train)
print('Score using optimized parameters: {}'.format(clf.score(features_test, labels_test)))

params = {'n_neighbors': [1, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100],
          'weights': ['uniform', 'distance']}
grid = GridSearchCV(clf, params, cv=10)
grid.fit(features_train, labels_train)

print('Optimized Parameters: {}'.format(grid.best_params_))
print('Best Score from GridSearchCV parameters: {}'.format(grid.best_score_))

Output:

Score using optimized parameters: 0.928

Optimized Parameters: {'n_neighbors': 15, 'weights': 'uniform'}

Best Score from GridSearchCV parameters: 0.962666666667

DSG

1 Answer


The score from your GridSearchCV is biased. You can use cross-validation either to estimate accuracy or to choose hyperparameters, but not both at once. If you use cross-validation to pick the best hyperparameters by measuring the accuracy of each candidate, the accuracy reported for the candidate you chose will tend to overestimate the accuracy you'll see on the test set.

To avoid this bias, hold out a separate validation set for estimating the accuracy of the parameters you selected, or use nested cross-validation (or a scikit-learn Pipeline); a sketch follows below.

See https://datascience.stackexchange.com/a/17835/8560.
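Here is a minimal sketch of nested cross-validation with scikit-learn (the iris dataset stands in for your features_train/labels_train, which aren't shown): the GridSearchCV object performs hyperparameter selection inside each outer fold, and cross_val_score measures accuracy only on data that fold's tuning never touched.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data for illustration; substitute your own features/labels.
X, y = load_iris(return_X_y=True)

params = {'n_neighbors': [1, 5, 10, 15], 'weights': ['uniform', 'distance']}

# Inner loop: picks hyperparameters on each outer training fold.
grid = GridSearchCV(KNeighborsClassifier(), params, cv=10)

# Outer loop: scores the whole selection procedure on held-out folds,
# giving an approximately unbiased accuracy estimate.
nested_scores = cross_val_score(grid, X, y, cv=5)
print('Nested CV accuracy: {:.3f}'.format(nested_scores.mean()))

The simpler alternative is a single hold-out set (for example via train_test_split): run GridSearchCV on the training portion only, then report the fitted grid's score on the untouched hold-out data.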

D.W.