
I've been working on machine learning and bioinformatics for a while, and today I had a conversation with a colleague about the main general issues of data mining.

My colleague (who is a machine learning expert) said that, in his opinion, arguably the most important practical aspect of machine learning is knowing whether you have collected enough data to train your model.

This statement surprised me, because I had never given that much importance to this aspect...

I then looked for more information on the internet and found this post on FastML.com, which gives as a rule of thumb that you need roughly 10 times as many data instances as there are features.

Two questions:

1 - Is this issue really that relevant in machine learning?

2 - Does the 10 times rule actually work? Are there other relevant sources on this topic?

Sean Owen
DavideChicco.it
  • 1. Yes. 2. It's a good baseline, but you can get around it with regularization, which reduces the effective degrees of freedom. This works especially well with deep learning. 3. You can diagnose the situation on your own problem by plotting a learning curve: the error or score as a function of sample size. – Emre Jun 26 '17 at 21:58
  • @Emre Thanks! Can you also suggest some papers or other material to read? – DavideChicco.it Jun 28 '17 at 02:14
  • This will usually be covered alongside cross-validation and other model validation techniques in your textbook. – Emre Jun 28 '17 at 04:32
  • The 10 times rule is great if you can achieve it, but it is just not practical in some business settings. There are many situations where the number of features is much greater than the number of data instances (p>>n). There are machine learning techniques designed specifically to deal with these situations. – acylam Aug 29 '17 at 15:59
  • If you need a detailed explanation which can help you to understand the learning curve graph check this out: https://www.scikit-yb.org/en/latest/api/model_selection/learning_curve.html – shrikanth singh Nov 11 '19 at 17:11
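
The learning-curve diagnostic suggested in the comments can be sketched in a few lines. This is only an illustration under stated assumptions, not a recipe from any of the commenters: the data is synthetic (20 features, a linear target plus noise), the model is a closed-form ridge regression written with NumPy, and the names `ridge_fit` and `learning_curve` are hypothetical helpers defined here. The ridge penalty `alpha` is what keeps the fit well defined even when there are fewer instances than features, which is the regularization point made above.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

def learning_curve(X_train, y_train, X_val, y_val, sizes, alpha):
    """Validation MSE after training on the first n rows, for each n in sizes."""
    errors = []
    for n in sizes:
        w = ridge_fit(X_train[:n], y_train[:n], alpha)
        errors.append(float(np.mean((X_val @ w - y_val) ** 2)))
    return errors

# Synthetic data (an assumption for illustration only):
# 20 features, linear target with unit-variance Gaussian noise.
rng = np.random.default_rng(0)
n_features = 20
true_w = rng.normal(size=n_features)
X = rng.normal(size=(2500, n_features))
y = X @ true_w + rng.normal(scale=1.0, size=2500)

# Hold out the last 500 rows as a fixed validation set.
X_train, y_train = X[:2000], y[:2000]
X_val, y_val = X[2000:], y[2000:]

# Training sizes from n < p (fewer instances than features) up to n = 100 * p.
sizes = [10, 20, 50, 200, 1000, 2000]
curve = learning_curve(X_train, y_train, X_val, y_val, sizes, alpha=1.0)

for n, err in zip(sizes, curve):
    print(f"n={n:5d}  n/p={n / n_features:6.1f}  validation MSE={err:.3f}")
```

Reading the printed table is the diagnostic: if the validation error is still dropping at the largest training size, more data is likely to help; if it has flattened out, collecting more instances will probably not improve this particular model much. On real data you would replace the synthetic arrays with your own train/validation split and, ideally, average over several random splits.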