I just started getting involved with machine learning, and I decided to create a spam filter for my social app using a Naive Bayes classifier. I'm following this guide: https://hackernoon.com/how-to-build-a-simple-spam-detecting-machine-learning-classifier-4471fe6b816e

My app has ~70,000 posts and about 3,000 of them are marked as spam. How many of my non-spam posts should I use to train my model?

1 Answer

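In general, use stratified sampling to create the training/test split, so that each split keeps the class proportions of the full data; otherwise your priors will be biased. Specifically, Naive Bayes estimates the class priors from the training data. Here the spam prior is $3/70 \approx 0.043$; if you instead include equal numbers of spam and non-spam posts, your prior estimate becomes $\pi=0.5$, which can easily harm your predictions, since it pushes the classifier toward labeling posts as spam. In other words, keep the non-spam posts in their natural proportion rather than subsampling them to match the spam count. A typical train/test split follows the 80/20 convention.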
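As a rough illustration of the point above, here is a minimal sketch of a stratified 80/20 split using only the Python standard library. The helper names `stratified_split` and `class_prior`, the label strings, and the synthetic label list (3,000 spam, 67,000 non-spam, matching the counts in the question) are assumptions made for this example, not part of the linked guide. If you use scikit-learn, `train_test_split` with its `stratify` argument should serve the same purpose.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately
    so both splits keep roughly the original class proportions."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def class_prior(labels, idx, cls):
    """Empirical prior of `cls` among the examples selected by `idx`."""
    counts = Counter(labels[i] for i in idx)
    return counts[cls] / len(idx)


# Synthetic labels with the counts from the question (hypothetical data).
labels = ["spam"] * 3000 + ["ham"] * 67000

train_idx, test_idx = stratified_split(labels)
prior_train = class_prior(labels, train_idx, "spam")
prior_test = class_prior(labels, test_idx, "spam")

# For comparison: a balanced subsample of 3,000 spam and 3,000 non-spam.
balanced_idx = list(range(3000)) + list(range(3000, 6000))
prior_balanced = class_prior(labels, balanced_idx, "spam")

print(prior_train, prior_test, prior_balanced)
```

With the stratified split, the spam prior estimated from the training set stays at the population value of $3/70$, whereas the balanced subsample yields $\pi=0.5$.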

gunes
  • 57,205