30

I have read a lot of research papers about sentiment classification and related topics.

Most of them use 10-fold cross validation to train and test classifiers. That means that no separate testing/validation is done. Why is that?

What are the advantages/disadvantages of this approach, especially for those doing research?

jonsca
user18075

5 Answers

20

The main reason is that the k-fold cross-validation estimator has a lower variance than a single hold-out set estimator, which can be very important if the amount of data available is limited. If you have a single hold-out set, where 90% of the data are used for training and 10% for testing, the test set is very small, so there will be a lot of variation in the performance estimate across different samples of data, or across different partitions of the data into training and test sets. k-fold cross-validation reduces this variance by averaging over k different partitions, so the performance estimate is less sensitive to the partitioning of the data. You can go even further with repeated k-fold cross-validation, where the cross-validation is performed using several different partitionings of the data into k subsets, and the average is taken over those as well.
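
As a concrete illustration, here is a minimal sketch of repeated 10-fold cross-validation with scikit-learn; the synthetic dataset, the logistic-regression model and the number of repeats are placeholder choices, not anything prescribed above.

```python
# Minimal sketch: repeated 10-fold CV (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# 10-fold CV repeated 10 times with different partitionings of the data;
# averaging the 100 fold estimates gives a less variable performance
# estimate than a single 90/10 hold-out split.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```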

Note however, all steps of the model fitting procedure (model selection, feature selection etc.) must be performed independently in each fold of the cross-validation procedure, or the resulting performance estimate will be optimistically biased.
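
To make that bias concrete, here is a minimal sketch assuming scikit-learn; the pure-noise data and the choice of SelectKBest plus logistic regression are illustrative. Selecting features on the full dataset before cross-validating leaks information from the test folds, whereas putting the selector inside a Pipeline re-fits it on each training fold.

```python
# Minimal sketch: feature selection outside vs. inside the CV loop.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(100, 1000)            # pure-noise features
y = rng.randint(0, 2, size=100)     # random labels: true accuracy is 50%

# Wrong: select the 20 "best" features on ALL the data, then cross-validate.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=10)

# Right: the selector lives inside the pipeline, so it is re-fit on each
# training fold only, and the estimate stays honest.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=10)

print(f"selection outside CV: {biased.mean():.2f}")  # typically well above 0.5
print(f"selection inside CV:  {honest.mean():.2f}")  # close to 0.5
```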

Dikran Marsupial
19

This is not a problem if the CV is nested, i.e. all optimisations, feature selections and model selections, whether they themselves use CV or not, are wrapped in one big CV.
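
A minimal sketch of such a nested CV, assuming scikit-learn; the SVC model and its parameter grid are illustrative choices, not anything prescribed by the answer.

```python
# Minimal sketch: nested cross-validation (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: hyper-parameter selection by 5-fold CV.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}, cv=5)

# Outer loop: 10-fold CV wrapped around the whole selection procedure, so
# the performance estimate is not biased by the tuning.
outer_scores = cross_val_score(inner, X, y, cv=10)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```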

How does this compare to having an extra validation set? Since the validation set is usually just a more or less randomly selected part of the whole data, it is simply the equivalent of a single CV iteration. In that sense it is actually the worse method, because it can easily be biased by a luckily/unluckily selected or cherry-picked validation set.

The only exception to this are time-series and other data where the object order matters; but they require special treatment either way.

Gala
11

[EDITED in light of the comment]

I think there is a problem if you use CV results to select among multiple models.

CV allows you to use the entire dataset to train and test one model/method, while being able to have a reasonable idea of how well it will generalize. But if you're comparing multiple models, my instinct is that the model comparison uses up the extra level of train-test isolation that CV gives you, so the final result will not be a reasonable estimate of the chosen model's accuracy.

So I'd guess that if you create several models and choose one based on its CV, you're being overly-optimistic about what you've found. Another validation set would be needed to see how well the winner generalizes.
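
A minimal sketch of that workflow, assuming scikit-learn; the two candidate models, the 80/20 split and the synthetic dataset are placeholders. CV on the development data picks the winner, and an untouched test set estimates how well the winner generalizes.

```python
# Minimal sketch: CV for model selection plus a separate hold-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
# CV scores on the development data are used only to pick the winner ...
cv_means = {name: cross_val_score(m, X_dev, y_dev, cv=10).mean()
            for name, m in candidates.items()}
best = max(cv_means, key=cv_means.get)

# ... and the test set, untouched during selection, gives an unbiased
# estimate of the chosen model's accuracy.
final = candidates[best].fit(X_dev, y_dev).score(X_test, y_test)
print(f"chosen: {best}, CV score: {cv_means[best]:.3f}, hold-out score: {final:.3f}")
```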

Wayne
  • Thank you. That's right. But my question was specifically about why research papers lack a final validation. Is there a proper reason? Is it because of too little data, or because CV does a good job and a separate validation isn't needed? – user18075 Feb 10 '13 at 17:19
  • The approach of data splitting is highly inefficient. Unless both training and test sets are enormous, the mean squared error for an estimate of likely future performance for a predictive model is smaller with bootstrapping or with 100 repeats of 10-fold cross-validation, assuming the resampling procedures had access to all modeling steps that involved $Y$. Use data splitting when you also need to validate the measurement process, survey instrument, or other procedures related to the meaning of the data. A good use of data splitting is when instrumentation varies by country. – Frank Harrell Feb 11 '13 at 15:01
9
  • In my experience, the main reason is usually that you don't have enough samples.
    In my field (classification of biological/medical samples), sometimes a test set is kept separate, but often it comprises only a few cases. In that case, the confidence intervals are usually too wide to be of any use.

  • Another advantage of repeated/iterated cross-validation or out-of-bootstrap validation is that you build a bunch of "surrogate" models. These are assumed to be equivalent; if they are not, the models are unstable. You can actually measure this instability (with respect to exchanging a few training cases) by comparing either the surrogate models themselves or the predictions different surrogate models make for the same case (see the first sketch after this list).

  • This paper by Esbensen & Geladi gives a nice discussion of some limitations of cross validation.
    You can take care of most of them, but one important point that cannot be tackled by resampling validation is drift, which is related to mbq's point:

    The only exception to this are time-series and other data where the object order matters

    Drift means that e.g. an instrument's response/true calibration changes slowly over time. So the generalization error for unknown cases may not be the same as for unknown future cases. You arrive at instructions like "redo calibration daily/weekly/..." if you find drift during validation, but this needs test sets systematically acquired later than the training data.
    (You could do "special" splits that take into account acquisition time, if your experiment is planned accordingly, but usually this will not cover as much time as you'd want for drift detection; see the second sketch below.)
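
On the second point above (surrogate-model instability), here is a minimal sketch, assuming scikit-learn and synthetic data; the logistic-regression model and the agreement measure are illustrative choices. Each repeat of the CV predicts every case exactly once, so disagreement across repeats flags models that are unstable with respect to exchanging training cases.

```python
# Minimal sketch: measuring instability via repeated CV "surrogate" models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=150, n_features=20, random_state=0)
n_splits, n_repeats = 10, 20
preds = np.empty((len(y), n_repeats))

cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
for i, (train, test) in enumerate(cv.split(X, y)):
    surrogate = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    preds[test, i // n_splits] = surrogate.predict(X[test])   # column = repeat

# Per-case agreement with the majority prediction across the 20 repeats;
# values well below 1 indicate unstable surrogate models.
majority = np.round(preds.mean(axis=1, keepdims=True))
agreement = np.mean(preds == majority, axis=1)
print(f"mean per-case agreement: {agreement.mean():.2f}")
```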
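
And on the drift point, a minimal sketch of a split that respects acquisition time, assuming scikit-learn; the drifting regression data are made up for illustration. Unlike random k-fold, TimeSeriesSplit always tests on cases acquired after the training cases, which is what drift detection needs.

```python
# Minimal sketch: a time-ordered split instead of a random k-fold.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.RandomState(0)
t = np.arange(500)                      # acquisition order (oldest first)
X = rng.randn(500, 5)
# Hypothetical drifting response: the relationship shifts slowly with time.
y = X[:, 0] + 0.002 * t * X[:, 1] + 0.1 * rng.randn(500)

# Each test block lies strictly later in time than its training data.
scores = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5), scoring="r2")
print("R^2 on later-in-time test blocks:", np.round(scores, 2))
```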

2

Why should we do cross-validation instead of using a separate validation set?

Aurélien Géron talks about this in his book:

To avoid “wasting” too much training data in validation sets, a common technique is to use cross-validation.

Why might we prefer k = 10 over other values of k in cross-validation?

To answer this, I would first like to thank Jason Brownlee, PhD, for his great tutorial on k-fold cross-validation. I am citing one of the books he cites.

Kuhn & Johnson discuss the choice of k in their book:

The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller (i.e., the bias is smaller for k = 10 than k = 5). In this context, the bias is the difference between the estimated and true values of performance.

One may then ask why we do not use leave-one-out cross-validation (LOOCV), since k is maximal there and the bias would therefore be smallest. In the same book, they also explain why we may prefer 10-fold CV over LOOCV.

From a practical viewpoint, larger values of k are more computationally burdensome. In the extreme, LOOCV is most computationally taxing because it requires as many model fits as data points and each model fit uses a subset that is nearly the same size of the training set. Molinaro (2005) found that leave-one-out and k=10-fold cross-validation yielded similar results, indicating that k= 10 is more attractive from the perspective of computational efficiency. Also, small values of k, say 2 or 3, have high bias but are very computationally efficient.
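
To see the computational side of that quote, here is a minimal sketch assuming scikit-learn; the synthetic dataset and classifier are placeholders. With n = 300 samples, LOOCV needs 300 model fits against 10 for 10-fold CV, while the two estimates usually come out similar.

```python
# Minimal sketch: 10-fold CV vs. LOOCV on the same data.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

for name, cv in [("10-fold", KFold(n_splits=10, shuffle=True, random_state=0)),
                 ("LOOCV  ", LeaveOneOut())]:
    start = time.time()
    scores = cross_val_score(clf, X, y, cv=cv)           # accuracy by default
    print(f"{name}: accuracy {scores.mean():.3f}, "
          f"{cv.get_n_splits(X)} model fits, {time.time() - start:.1f} s")
```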

I have read a lot of research papers about sentiment classification and related topics. Most of them use 10-fold cross validation to train and test classifiers. That means that no separate testing/validation is done. Why is that?

If we do not use cross-validation (CV) to select among multiple models (and do not use CV to tune hyper-parameters), we do not need a separate test set. The reason is that the purpose of a separate test set is already served within CV, by the held-out fold in each iteration. Several other SE threads discuss this in detail; you may want to check them.

Finally, feel free to ask me if anything I have written is unclear to you.