
Can we "trust" this method when we have a small data set? Is there a minimum size required to partition the data, e.g. N = 200?

user44677

  • There is no lower limit, but sampling issues grow with smaller sets. For example, see this example of a train/test split working well on the $150$-element iris dataset; it would probably have worked reasonably well with smaller data sets (a minimal sketch of such a split appears after these comments). – Henry Jun 29 '17 at 07:52
  • It's not about whether you can "trust" the method or not. The only way to report the performance of a system is to evaluate it on unseen observations. So, in every case, you need to train your model on some training observations and generate predictions on a separate test set. Usually, the train/test split is around 70/30 (random, repeated several times). When you do 10-fold CV, which is also fairly standard, it is 90/10. Now, if your dataset is very small, you can make the ratio even more unbalanced, 95/5, etc., the extreme being leave-one-out cross-validation. – Antoine Jun 29 '17 at 09:10
  • OK, I did not know that we could make the ratio even more unbalanced (95/5, etc.); I thought it was only 50/50, 70/30, etc. – user44677 Jun 29 '17 at 09:42
  • @Antoine's "leave one out" is for cross-validation; for a single train/test split you wouldn't want to do that. The estimate of accuracy on a test set follows a binomial distribution (for a classification problem), so the smaller your test set, the wider the confidence interval around your estimate of the test error (see the second sketch after these comments). 10-fold cross-validation will give you a better estimate of the true error than repeating a 90/10 split 10 times. For an example (using the Weka software) see section 2.6 here: http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/ – zbicyclist Jun 29 '17 at 13:21
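
The comments above mention a single 70/30 split on the $150$-element iris dataset and 10-fold cross-validation as alternatives. Here is a minimal sketch of both, assuming scikit-learn is available; the logistic-regression classifier is an arbitrary choice for illustration, not anything prescribed in the thread.

```python
# Sketch: single train/test split vs. 10-fold CV on the 150-row iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Single random 70/30 split: accuracy is estimated on only 45 held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("70/30 test accuracy:", clf.score(X_test, y_test))

# 10-fold CV: every row is tested exactly once (a 90/10 split applied
# systematically), which usually gives a more stable error estimate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```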

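zbicyclist's point about the binomial distribution can be made concrete: the width of a confidence interval for a proportion shrinks like $1/\sqrt{n}$, so small test sets give wide intervals. The sketch below uses the normal-approximation (Wald) interval and assumes an observed accuracy of 0.9 purely for illustration; `wald_interval` is a hypothetical helper, not a library function.

```python
# Sketch: how the 95% CI around test-set accuracy widens as the test set shrinks.
import math

def wald_interval(p_hat, n, z=1.96):
    """95% normal-approximation (Wald) CI for a binomial proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# n = 15 and 45 correspond to 10% and 30% of an N = 150 dataset.
for n_test in (15, 45, 100, 1000):
    lo, hi = wald_interval(0.9, n_test)  # 0.9 accuracy assumed for illustration
    print("n=%4d: 95%% CI for accuracy ~ (%.3f, %.3f)" % (n_test, lo, hi))
```
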
0 Answers