I noticed that in some cases GridSearchCV is applied on the output of KFold, as in the code below. Why is that needed? I thought something equivalent to KFold is already applied inside GridSearchCV via its cv parameter. (For example, with cv=3, isn't GridSearchCV already doing its own 3-fold split, just like KFold? See the sketch after the code for what I mean.)
import numpy as np
import sklearn.neighbors
import sklearn.grid_search
import sklearn.cross_validation
import sklearn.decomposition
import sklearn.metrics

# Candidate values for the number of neighbors
k = np.arange(20) + 1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)

all_scores = []
all_k = []
all_d = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
kFolds = sklearn.cross_validation.KFold(X.shape[0], n_folds=10)

for d in all_d:
    svd = sklearn.decomposition.TruncatedSVD(n_components=d)
    scores = []
    for train_index, test_index in kFolds:
        train_data, test_data = X[train_index], X[test_index]
        train_labels, test_labels = Y[train_index], Y[test_index]
        # Center using statistics from the training fold only
        data_mean = np.mean(train_data, axis=0)
        train_data_centered = train_data - data_mean
        test_data_centered = test_data - data_mean
        # Fit the dimensionality reduction on the training fold only
        X_d = svd.fit_transform(train_data_centered)
        X_d_test = svd.transform(test_data_centered)
        # Inner 10-fold grid search over n_neighbors within this training fold
        clf.fit(X_d, train_labels)
        scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(X_d_test)))
    all_scores.append(scores)
    all_k.append(clf.best_params_['n_neighbors'])
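To make my assumption explicit: my understanding is that GridSearchCV with cv=3 already performs its own internal 3-fold split over whatever training data it is given, roughly like the hand-rolled sketch below. This uses the same pre-0.18 scikit-learn modules as the code above; manual_grid_search, X_train, and y_train are hypothetical names for illustration, not part of the code above.

import numpy as np
import sklearn.neighbors
import sklearn.cross_validation
import sklearn.metrics

# Rough sketch of what GridSearchCV(knn, {'n_neighbors': candidate_k}, cv=3) does internally:
# for each candidate value, average the score over 3 internal folds of the training data,
# then refit the best candidate on all of that training data (like refit=True).
def manual_grid_search(X_train, y_train, candidate_k, n_folds=3):
    inner_folds = sklearn.cross_validation.KFold(X_train.shape[0], n_folds=n_folds)
    mean_scores = {}
    for n_neighbors in candidate_k:
        fold_scores = []
        for tr_idx, va_idx in inner_folds:
            model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)
            model.fit(X_train[tr_idx], y_train[tr_idx])
            pred = model.predict(X_train[va_idx])
            fold_scores.append(sklearn.metrics.accuracy_score(y_train[va_idx], pred))
        mean_scores[n_neighbors] = np.mean(fold_scores)
    best_k = max(mean_scores, key=mean_scores.get)
    best_model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=best_k)
    best_model.fit(X_train, y_train)
    return best_k, best_model

So if GridSearchCV is already splitting internally like this, I don't see what the extra outer KFold loop adds.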
Regarding all_k.append(): I share your struggle to see the point in storing the last, haphazard parameters. To your question about memory, generally it's choosing based only on the current set. This is to get an idea of what the out-of-sample error is, and sometimes to see how robust the parameters are with respect to the training data, etc. – Sean Easter Apr 18 '16 at 21:04
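In other words, the pattern in the question is nested cross-validation: the inner GridSearchCV picks n_neighbors on each outer training fold, and the outer KFold scores that whole tuning procedure on data it never saw. A condensed sketch of the idea, using the same pre-0.18 scikit-learn modules as the question (X and Y are assumed to be the data arrays from the question; the SVD step is omitted to keep the example short):

import numpy as np
import sklearn.neighbors
import sklearn.grid_search
import sklearn.cross_validation
import sklearn.metrics

outer_folds = sklearn.cross_validation.KFold(X.shape[0], n_folds=10)
outer_scores = []
chosen_k = []

for train_index, test_index in outer_folds:
    # Inner loop: GridSearchCV runs its own 10-fold split *inside* the outer training fold
    search = sklearn.grid_search.GridSearchCV(
        sklearn.neighbors.KNeighborsClassifier(),
        {'n_neighbors': np.arange(20) + 1},
        cv=10)
    search.fit(X[train_index], Y[train_index])
    chosen_k.append(search.best_params_['n_neighbors'])
    # Outer loop: score the tuned model on data neither the model nor the search has seen
    pred = search.predict(X[test_index])
    outer_scores.append(sklearn.metrics.accuracy_score(Y[test_index], pred))

# outer_scores estimates the out-of-sample error of the whole tuning procedure,
# and the spread of chosen_k shows how sensitive the selected parameter is to the training data.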