
I'm currently working on my thesis, which uses the multilayer perceptron learning method to train a model. What I learned in class is that the data should be partitioned into three groups (a small illustrative split is sketched after the list):

  • Training dataset - This set is used to train the model.
  • Validation dataset - This set is used to find the best model parameters.
  • Testing dataset - This set is used to evaluate the performance of the model.
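
To make it concrete, the split I have in mind looks something like this (the data here are random placeholders and the 60/20/20 ratio is only an illustration; the right ratio is exactly what I cannot find a reference for):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real thesis dataset
X = np.random.rand(1000, 10)          # 1000 samples, 10 features
y = np.random.randint(0, 2, 1000)     # binary labels

# Split off 40%, then halve it -> 60% train / 20% validation / 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```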

The problem is that I cannot find a reference to justify partitioning the samples in my thesis with any particular ratio.

I have tried googling for it but can't find such articles, probably because I am searching with the wrong technical terms.

So, back to the question: can anyone suggest articles that discuss the best ratio, with supporting evidence?

  • How is "finding the best model parameters" different from "training the model"? I guess I am asking you what is the difference between training dataset and validation dataset as per the definition given in your post? – TenaliRaman May 22 '12 at 14:13
  • @TenaliRaman, it's a cross-validation method to avoid overfitting. – Jessada Thutkawkorapin May 22 '12 at 14:35

1 Answer


There isn't enough theory to provide a unique answer. This is one of several reasons to entertain the use of the bootstrap or the double bootstrap. More information about the bootstrap for model validation may be found in http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RmS/rms.pdf.
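
To give a rough idea of how bootstrap validation works, here is a minimal sketch of the optimism-corrected bootstrap; the data, model, metric, and number of resamples below are placeholders, not recommendations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # placeholder data
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

def fit_and_score(X_fit, y_fit, X_eval, y_eval):
    """Fit on one dataset, score (AUC) on another."""
    model = LogisticRegression().fit(X_fit, y_fit)
    return roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

apparent = fit_and_score(X, y, X, y)               # score on the same data the model saw

optimism = []
for _ in range(200):                               # number of resamples is a tuning choice
    idx = rng.integers(0, len(y), len(y))          # bootstrap resample with replacement
    Xb, yb = X[idx], y[idx]
    # optimism = (score on the bootstrap sample) - (score of that fit on the original data)
    optimism.append(fit_and_score(Xb, yb, Xb, yb) - fit_and_score(Xb, yb, X, y))

corrected = apparent - np.mean(optimism)
print(f"apparent AUC {apparent:.3f}, optimism-corrected AUC {corrected:.3f}")
```

The corrected value estimates how the model would perform on new data, without ever setting aside a separate test partition.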

Frank Harrell
  • Gee Frank, you could have plugged my book too! But I gave you a +1 anyway! – Michael R. Chernick May 22 '12 at 13:49
  • I think that some people use different terminology and only use two sets. Training: fitting a model to the data, including the parameter estimates. You start with a specific form for the model, and the training data teaches you what to use for the parameters so the classifier fits the training data best. Test data: data held out of training which is used to evaluate the performance of the classifier. This step both validates and evaluates, since a poor result may indicate an "invalid" model (assuming better performance is possible). – Michael R. Chernick May 22 '12 at 14:24
  • Thanks Michael. It would be good if we could develop a minimum sample size for an adequate validation sample, and a better estimate of the number of times that cross-validation has to be repeated to give near-optimal precision. The bootstrap has fewer tunable parameters - we mainly need to come up with a good number of bootstrap resamples, vs. solving for r and k in "repeat k-fold cross-validation r times" (a rough sketch of that follows these comments). – Frank Harrell May 22 '12 at 15:38
  • Interesting idea Frank. Maybe that would make a good article or part of a thesis topic for a student. – Michael R. Chernick May 22 '12 at 15:57
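
For reference, a minimal sketch of "repeat k-fold cross-validation r times" as mentioned in the comments above; the dataset, model, and the choices k = 10, r = 5 are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder data and model; k = 10 folds and r = 5 repeats are arbitrary illustrative choices
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} fits")
```

Here r x k = 50 model fits are performed; the open question raised in the comments is how large r and k need to be for near-optimal precision.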