
Can we "trust" this method when we have a small data set? Is there a minimum size required to partition the data, e.g. N = 200?

user44677

  • There is no lower limit, but sampling issues grow with smaller sets. For example, see this example of a train/test split working well on the $150$-element iris dataset; it would probably have worked reasonably well with smaller data sets (a minimal sketch of such a split appears after these comments). – Henry Jun 29 '17 at 07:52
  • It's not about whether you can "trust" the method or not. The only way to report the performance of a system is to evaluate it on unseen observations. So, in every case, you need to train your model on some training observations and generate predictions on a separate test set. Usually, the train/test split is around 70/30 (random, repeated several times). When you do 10-fold CV, which is also fairly standard, it is 90/10. Now, if your dataset is very small, you can make the ratio even more unbalanced, 95/5, etc., the extreme being leave-one-out cross-validation. – Antoine Jun 29 '17 at 09:10
  • OK, I did not know that we could make the ratio even more unbalanced (95/5, etc.); I thought it was only 50/50, 70/30, etc. – user44677 Jun 29 '17 at 09:42
  • @Antoine's "leave one out" is for cross-validation; for a single train/test split you wouldn't want to do that. The estimate of accuracy on a test set follows a binomial distribution (for a classification problem), so the smaller your test set, the wider the confidence interval around your estimate of the test error (see the second sketch after these comments). 10-fold cross-validation will give you a better estimate of the true error than repeating a 90/10 split 10 times. For an example (using the Weka software) see section 2.6 here: http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/ – zbicyclist Jun 29 '17 at 13:21
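
The comments above mention a single 70/30 split on the $150$-element iris dataset and 10-fold cross-validation as alternatives. Here is a minimal sketch of both, assuming scikit-learn is available; the logistic-regression classifier is an arbitrary choice for illustration, not anything prescribed in the thread.

```python
# Sketch: single train/test split vs. 10-fold CV on the 150-row iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Single random 70/30 split: accuracy is estimated on only 45 held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("70/30 test accuracy:", clf.score(X_test, y_test))

# 10-fold CV: every row is tested exactly once (a 90/10 split applied
# systematically), which usually gives a more stable error estimate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```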

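zbicyclist's point about the binomial distribution can be made concrete: the width of a confidence interval for a proportion shrinks like $1/\sqrt{n}$, so small test sets give wide intervals. The sketch below uses the normal-approximation (Wald) interval and assumes an observed accuracy of 0.9 purely for illustration; `wald_interval` is a hypothetical helper, not a library function.

```python
# Sketch: how the 95% CI around test-set accuracy widens as the test set shrinks.
import math

def wald_interval(p_hat, n, z=1.96):
    """95% normal-approximation (Wald) CI for a binomial proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# n = 15 and 45 correspond to 10% and 30% of an N = 150 dataset.
for n_test in (15, 45, 100, 1000):
    lo, hi = wald_interval(0.9, n_test)  # 0.9 accuracy assumed for illustration
    print("n=%4d: 95%% CI for accuracy ~ (%.3f, %.3f)" % (n_test, lo, hi))
```
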
0 Answers