
I'm very confused about how to practically handle data sets for deep learning. If I want to use DL for some task, I (usually) don't have all possible variations available to train the network perfectly. So, given some task, one usually starts by searching for a data set to begin with. This data set will then change over time, because more samples of the original data become available or new, untrained data becomes available. To get an answer to my questions, I would like to sketch how I would handle training. This is also heavily related to the question of how to handle training/test/validation sets.


The sketch:

Let's assume I want to build a network that recognizes an animal from a picture. At first I start searching for animal pictures and, with some luck, find a data set of a few hundred pictures containing a mixture of exactly three species: adult cats, dogs and horses.

My first attempt would be to just randomly shuffle all pictures into a 90% training, 5% validation and 5% test set. So all pictures sit in one directory on my hard drive, and I randomly pick 90% for training, and so on. Let's call this shuffle A(90/5/5).
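To make it concrete, here is a minimal sketch of what I mean by shuffle A(90/5/5), assuming the pictures are simply files collected from one directory (the directory name, file extension and ratios are only my example):

```python
import random
from pathlib import Path

# Collect all picture paths from a single directory (hypothetical layout).
paths = sorted(Path("animals").glob("*.jpg"))

random.seed(0)          # fixed seed so shuffle A is reproducible
random.shuffle(paths)

n = len(paths)
n_train = int(0.90 * n)
n_val = int(0.05 * n)

train_set = paths[:n_train]                     # 90% training
val_set = paths[n_train:n_train + n_val]        # 5% validation
test_set = paths[n_train + n_val:]              # remaining ~5% test
```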

The network is trained and its prediction capabilities are evaluated on the 5% validation set. I'm not very impressed, so I change hyperparameters: a deeper network, larger hidden layers, a different learning rate, and so on. For every hyperparameter change, I retrain the model with the same 90% training set and evaluate its performance on the same validation set.
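In code, the trial-and-error loop I have in mind looks roughly like this toy sketch (random dummy feature vectors stand in for the real pictures, and the hidden-layer candidates are arbitrary choices of mine):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Dummy data standing in for the animal pictures (3 classes, 64 features each).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(600, 64)), rng.integers(0, 3, size=600)
X_train, y_train = X[:540], y[:540]        # 90% training
X_val, y_val = X[540:570], y[540:570]      # 5% validation
X_test, y_test = X[570:], y[570:]          # 5% test

best_model, best_acc = None, -1.0
for hidden in [(64,), (128, 64), (256, 128)]:              # hyperparameter candidates
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=300, random_state=0)
    model.fit(X_train, y_train)                            # always the same training set
    acc = accuracy_score(y_val, model.predict(X_val))      # always the same validation set
    if acc > best_acc:
        best_model, best_acc = model, acc

# The 5% test set, used only once at the very end.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
```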

After a few days, I evaluate the best model from my hyperparameter tuning on the 5% test set for the first time. The result is very bad, so I shuffle again, call the new split B(90/5/5), and start over with the same trial-and-error scheme.

Hopefully, after a few tries, I get good results on the 5% test set, so I'm finished.

A few weeks later my dataset changes. I find a new dataset with two new species; one contains hundreds of pictures, the other only 50. I also add puppies and kittens to the previous dataset. I repeat the whole process and hope it gets better.


My thoughts and questions on this:

So this process is a big trial-and-error scheme with the huge drawback that I start from zero after each data set change. If I have a well-working network, a small change in the data set could lead to an enormous number of training/tuning repetitions, and it may take very long until I find good parameters again.

First, I would build the 90/5/5 sets based on the different species and categories I have. For example, the pictures are split into different directories on my hard drive, such as cats/adult, cats/kittens, dogs/adult, dogs/puppies, and so on. For each of those directories I would then randomly select 90%/5%/5% of the pictures and use the union of the 90% sets for training.

The idea is to avoid overfitting to the exact pictures of a species for which I have only a few samples.
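A sketch of this per-directory (stratified) split, assuming the directory layout described above (cats/adult, dogs/puppies, etc. are just my example paths):

```python
import random
from pathlib import Path

def stratified_split(root, ratios=(0.90, 0.05, 0.05), seed=0):
    """Split each class directory separately, then merge the parts."""
    train, val, test = [], [], []
    random.seed(seed)
    for class_dir in sorted(Path(root).glob("*/*")):   # e.g. cats/adult, dogs/puppies
        files = sorted(class_dir.glob("*.jpg"))
        random.shuffle(files)
        n = len(files)
        n_train = int(ratios[0] * n)
        n_val = int(ratios[1] * n)
        train += files[:n_train]                       # union of the per-directory 90% parts
        val += files[n_train:n_train + n_val]
        test += files[n_train + n_val:]
    return train, val, test

train_set, val_set, test_set = stratified_split("animals")
```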

The second idea would be to reuse the same network and just increase some parameters instead of starting from scratch or trying a completely different architecture. The idea here is that a bigger data set needs a bigger network to achieve good results. But I think there is a flaw here: maybe I already overfitted the previous network, so it doesn't make sense to simply enlarge it?
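What I imagine for reusing the network is something like a warm start: load the weights of the previous model and continue training on the new data, only replacing the parts that must change (e.g. the output layer when new species are added). A rough sketch under my own assumptions (the checkpoint file name, image size and layer sizes are placeholders):

```python
import torch
import torch.nn as nn

# Previously trained network for 3 species (architecture is just an example).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64 * 3, 256), nn.ReLU(),
    nn.Linear(256, 3),
)
state = torch.load("model_3_species.pt")   # hypothetical checkpoint from the earlier run
model.load_state_dict(state)

# The new dataset has 5 species: replace only the output layer, keep the rest.
model[-1] = nn.Linear(256, 5)

# Continue training (fine-tuning) on the enlarged dataset instead of starting from scratch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```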

Do I really need to reshuffle the images if the test set performance is bad? How can I speed up iteration and maybe recycle knowledge about hyperparameters? How do I know which parameters to change when my dataset changes?

How do big companies handle this problem? How would you handle a speech recognition system if your users generate hundreds of hours of new data every day, or if your users provide thousands of new images to train on?

  • You are effectively using a 90-10 Train-Validation split, not 90-5-5. Test sets should only be used once, to get an accurate performance estimation.
  • – Laksan Nathan Sep 08 '18 at 20:05
  • You can save your model parameters and use them as the initial setting to start other training routines. Checkpoints can be used to add new data to your model (batch training).
  • – Laksan Nathan Sep 08 '18 at 20:08
  • @lnathan 1. What do you mean by me using 90-10 Train-Validation, not 90-5-5? I intended a 90% training, 5% validation and 5% test set. I'm not using 10% validation. – John Doe Sep 10 '18 at 15:54
  • lnathan, 2. What do you mean by checkpoints? I know batch training, but if I add a new species which didn't exist before, do you mean that I just start with the parameters of my working model and treat the new species as a batch? And just train on this batch? Wouldn't this overfit to the new species? – John Doe Sep 10 '18 at 15:57
  • "after a few try's i get good results on the 5% test set" -> you are using the test set as Validation set. You get an unreliable performance estimation. – Laksan Nathan Sep 10 '18 at 18:11
  • "new species" -> check out Transfer Learning or see this video: https://www.youtube.com/watch?v=vIci3C4JkL0 – Laksan Nathan Sep 10 '18 at 18:12
  • @lnathan Thanks for the hint on transfer learning. Regarding your hint on ... using the test set as validation set: I'm confused here. Is the problem just that I might be using the terms test set and validation set incorrectly, or is the workflow itself wrong? Isn't the point that I use the training set to update the network weights, another 5% to change hyperparameters, and the last 5% to make sure my network produces good/correct results? And those two 5% sets are the validation set (for hyperparameters) and the test set for final verification? – John Doe Sep 11 '18 at 09:31
  • @lnathan Maybe it helps if you can answer the following question: When within the workflow, and how often, should I randomly reshuffle which set (everything, training, test, validation)? – John Doe Sep 11 '18 at 09:32
  • Yes, you use the terms test set and validation set incorrectly. You can train/validate as much as you like. But if you see bad results on your test set and go back to previous steps, you are overfitting the test set (using it as validation). – Laksan Nathan Sep 11 '18 at 10:53