How much time do scikit classifiers take to classify?

Question

I am planning to use scikit-learn's linear support vector machine (SVM) classifier for text classification on a corpus of 1 million labeled documents. When a user enters a keyword, the classifier will first assign it to a category, and a subsequent information retrieval query will then run within the documents of that category. I have a few questions:

How do I confirm that classification will not take much time? I don't want users to have to spend time waiting for a classification to finish in order to get better results.
Is using Python's scikit library for websites/web applications suitable for this?
Does anyone know how Amazon or Flipkart perform classification on user queries, or do they use completely different logic?

You can classify all keywords beforehand and then just pull category from the index. — ffriend, Oct 01 '14 at 13:31
@ffriend That seems like an answer for one-word queries. But if the search query consists of more words, or combinations of words, I would have to create an index for all combinations! — user3498, Oct 01 '14 at 13:35
SVC is fast, so if you want to use it for query classification in a moderate-load application, it will work. But classification by a single word (or even several words) is a bad idea in most cases. Take ambiguous words, for example: what if some word belongs to 2 categories with very little difference in probabilities? Are you going to throw the slightly less probable category out of the search? What you most probably want is an additional term in the ranking formula while searching, not rejecting less probable categories altogether. — ffriend, Oct 01 '14 at 13:56

score 4 · Answer 1 · answered Jan 21 '16 at 11:29

I don't see a huge problem here. I'll try to answer all of your questions from a production-level point of view:

How do I confirm that classification will not take much time?

Take a subset of your corpus (a plain random subset is fine; no elaborate sampling scheme is needed), test your algorithm on it, and then extrapolate the timings to the full dataset.

(A linear SVM is comparatively fast at prediction time. Nevertheless, run the test above just to be sure.)

And do test it in the development environment before pushing to production.
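A minimal sketch of such a timing test, assuming a TF-IDF + LinearSVC pipeline (the question does not say which features are used). The documents, labels and queries here are made-up placeholders; you would substitute a random subset of your own corpus. It times single-query predictions, since that is what a web request would do:

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for a random subset of the real corpus.
docs = [
    "cheap android phone with good camera",
    "wireless bluetooth headphones",
    "cotton t-shirt for men",
    "leather running shoes",
    "laptop with 16gb ram",
    "denim jeans slim fit",
] * 50
labels = [
    "electronics", "electronics", "clothing",
    "clothing", "electronics", "clothing",
] * 50

# Assumed setup: TF-IDF features feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

queries = ["android phone", "slim fit jeans", "bluetooth headphones"]

# Classify one query per call, repeated many times to average out noise.
n_repeats = 200
start = time.perf_counter()
for _ in range(n_repeats):
    for query in queries:
        model.predict([query])
elapsed = time.perf_counter() - start

per_query_ms = 1000 * elapsed / (n_repeats * len(queries))
print(f"mean latency per query: {per_query_ms:.3f} ms")
```

The number you get depends on your hardware, vocabulary size and number of categories, so measure it on a machine comparable to your production server.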

Is using Python's scikit library for websites/web applications suitable for this?

Yes, it is. It is already being used by a nice chunk of companies out there.

The third question about Amazon and Flipkart cannot be answered by someone outside their teams.

In addition, I would advise you to use MapReduce techniques for training your models. And as already advised, pickle your trained models so that you don't need to retrain them on every request.
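As a rough illustration of the pickling advice, here is a sketch that trains once, saves the model, and then loads it for serving. The TF-IDF + LinearSVC pipeline, the toy data, the file location and the `classify` helper are all assumptions for the example, not something taken from the question:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data; replace with the real labeled corpus.
docs = [
    "cheap android phone with good camera",
    "wireless bluetooth headphones",
    "cotton t-shirt for men",
    "leather running shoes",
] * 25
labels = ["electronics", "electronics", "clothing", "clothing"] * 25

# Offline step: train once. Pickling the whole pipeline keeps the
# vectorizer's vocabulary together with the classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

model_path = Path(tempfile.gettempdir()) / "query_classifier.pkl"  # hypothetical path
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Serving step: load once at application startup, not on every request.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)


def classify(query: str) -> str:
    """Return the predicted category for a single user query."""
    return str(loaded_model.predict([query])[0])


print(classify("bluetooth headphones"))
```

Only unpickle files you created yourself, and load them with the same scikit-learn version that wrote them.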

score 3 · Answer 2 · answered Oct 02 '14 at 08:17

The only reliable way to see how long it takes is to code it up and give it a shot. Training will take longer than classification, but once it is done you can save your model (pickle) and reuse it later.

user1269942