
I'm performing SVM classification on a relatively large data set (~1M rows, 4 variables). I want to assign a classification score to each row, not evaluate input parameters, so following the top answer here I'm not worrying about cross-validation.

However, the data is too large to fit the classifier on all data points at once. The practical maximum for my use is about 10,000 points. Any more and it takes too long.

What's the best way to proceed in this case? Is it possible to fit multiple models and average them, e.g. fit 100 models on 10,000 rows each, thus sampling each of the 1M data points? If so, would I average the classification scores, or internal model parameters, or something else entirely?

1 Answer


Yes, you can absolutely do this. These are called 'ensemble methods', and the specific one you want, where each model is trained on a random subset of the data drawn without replacement, is called 'pasting'; see http://scikit-learn.org/stable/modules/ensemble.html. Such an ensemble can reach the same or even better accuracy than a single model trained on all the data. However, rather than averaging model parameters, you usually train multiple classifiers and let them vote (or average their predicted scores).
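As a minimal sketch of the pasting approach in scikit-learn, you can wrap an SVM in `BaggingClassifier` with `bootstrap=False` (sampling without replacement) and cap each subset with `max_samples`. The data below is synthetic, standing in for your ~1M-row set; sizes are shrunk so the example runs quickly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the real data (4 variables, as in the question)
X, y = make_classification(n_samples=5000, n_features=4, random_state=0)

# Pasting: each SVM is fit on a random subset drawn WITHOUT replacement
# (bootstrap=False). max_samples keeps each subset at a tractable size,
# analogous to the ~10,000-point limit in the question.
clf = BaggingClassifier(
    SVC(kernel="rbf", probability=True),
    n_estimators=5,
    max_samples=1000,
    bootstrap=False,
    n_jobs=-1,
    random_state=0,
)
clf.fit(X, y)

# predict() takes a majority vote across the ensemble; predict_proba()
# averages the per-estimator class probabilities, which can serve as a
# per-row classification score.
scores = clf.predict_proba(X[:5])[:, 1]
```

Note that `probability=True` is needed so each SVC exposes `predict_proba`; if you only need hard labels, you can drop it and use `clf.predict`.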

rep_ho