There are already many questions about parameter tuning through cross-validation, and I have read some of them, e.g. this one. However, I still can't understand the details of the process. Here are my questions:
- How do we obtain the space of candidate parameters without a grid search?
- How do we assess the performance of a particular parameter value? For example, with 10 folds, I have seen the cross-validation error of a parameter $\theta$ written as:
$$CV(\theta) = \frac 1 n \sum_{k=1}^{K} \sum_{i \in F_k}\big(y_i - f^k_{\theta}(x_i)\big)^2$$
If $\theta$ is given, then $CV(\theta)$ looks like the average error of $f_{\theta}(x)$ over all 10 folds, i.e. over the whole training data. Then why do we cut the data into 10 folds at all? Why don't we just try all candidate $\theta$s on the whole data and see which one gives the best $f(x)$? (Below my questions I sketch what I think this formula computes.)
- When we evaluate models through CV, there will be an optimal $\theta$ on each set of 9 training folds. So in each round of validation we get a different $\theta$ from the training folds, use it on the 1 held-out fold, and then average the error rates over the 10 folds. My question is: once we know which model is the best, how do we apply it to the test dataset when we have different $\theta$s for different folds? (A second sketch below shows how I currently picture the whole procedure.)
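To make sure I read the formula correctly, here is a small sketch of what I think $CV(\theta)$ computes. The synthetic data, the `Ridge` estimator standing in for $f_\theta$, and the grid of $\theta$ values are just illustrative choices of mine:

```python
# A sketch of what I think the CV(theta) formula above computes.
# Assumptions (mine): synthetic data, Ridge regression standing in for f_theta.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(200)

def cv_error(theta, X, y, K=10):
    """CV(theta) = (1/n) * sum_k sum_{i in F_k} (y_i - f^k_theta(x_i))^2."""
    n = len(y)
    total = 0.0
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        # f^k_theta: fit with this theta on everything EXCEPT fold k ...
        model = Ridge(alpha=theta).fit(X[train_idx], y[train_idx])
        # ... then accumulate its squared errors on fold k only.
        total += np.sum((y[test_idx] - model.predict(X[test_idx])) ** 2)
    return total / n

for theta in [0.01, 0.1, 1.0, 10.0]:   # an illustrative grid of candidate thetas
    print(theta, cv_error(theta, X, y))
```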

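And here is how I currently picture the overall tuning workflow, again with made-up data and `Ridge` as a stand-in model; I am not sure the final refit-with-one-$\theta$ step is the right way to resolve my last question, which is exactly what I am asking:

```python
# How I currently picture the full workflow: 10-fold CV on the training set to
# pick one theta, one refit with that theta, one evaluation on the test set.
# (Synthetic data and Ridge are illustrative; I may well be wrong about the steps.)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

thetas = [0.01, 0.1, 1.0, 10.0]                      # illustrative candidate values
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Mean CV error of each theta, computed on the training set only.
cv_errors = {
    theta: -cross_val_score(Ridge(alpha=theta), X_train, y_train,
                            scoring="neg_mean_squared_error", cv=cv).mean()
    for theta in thetas
}
best_theta = min(cv_errors, key=cv_errors.get)

# A single refit with the single chosen theta, then one test-set evaluation.
final_model = Ridge(alpha=best_theta).fit(X_train, y_train)
test_mse = np.mean((y_test - final_model.predict(X_test)) ** 2)
print(best_theta, test_mse)
```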
2) Let's say 2-fold CV. In my mind, the evaluation through CV should be: train the model on the first half and find a set of good parameters, then use the trained model on the other half and get a result. Do the same thing with the two halves swapped and get another result. Use the mean of the two results as the performance of this model, and compare different models in the same way. – DukeJun May 27 '15 at 04:05

`GridSearchCV` and `LassoCV` both use a grid search. For 2), I don't quite follow what you're proposing; could you please explain what you mean by "get a result" in more detail? Is the result the prediction itself or the estimated error? – Matthew Drury May 27 '15 at 04:24
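For concreteness, a minimal `GridSearchCV` sketch of the same tune-then-refit pattern mentioned in the comments; the data, the `Ridge` estimator, and the `alpha` grid are illustrative and not taken from the thread:

```python
# A minimal GridSearchCV sketch of the same tune-then-refit pattern.
# (Synthetic data; Ridge and the alpha grid are illustrative choices only.)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},    # the "space of parameters"
    scoring="neg_mean_squared_error",
    cv=10,            # 10-fold CV on the training set only
)
search.fit(X_train, y_train)   # refit=True by default: refits the best alpha on all of X_train

print(search.best_params_)
print(search.score(X_test, y_test))   # negative MSE of the refit model on the test set
```

With `refit=True`, a single `alpha` is chosen from the grid and the estimator is refit on the whole training set, so only one parameter value ever touches the test set.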