I am planning to use scikit-learn's linear support vector machine (SVM) classifier for text classification on a corpus of 1 million labeled documents. When a user enters a keyword, the classifier will first assign it to a category, and a subsequent information retrieval query will then run only within the documents of that category. I have a few questions:
- How can I confirm that classification will be fast enough? I don't want users to have to wait for classification to finish just to get better results.
- Is Python's scikit-learn library suitable for use in a website/web application like this?
- Does anyone know how Amazon or Flipkart classify user queries, or do they use completely different logic?
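
To make the first question concrete, here is a minimal sketch of how I imagine measuring per-query classification latency. The names `measure_latency`, `toy_predict`, and `WEIGHTS` are hypothetical and mine; `toy_predict` is only a pure-Python stand-in for a trained classifier with made-up weights, not scikit-learn code, so the numbers it produces say nothing about a real model.

```python
import time
from statistics import median


def measure_latency(predict, queries, repeats=5):
    """Return the median wall-clock time (seconds) of predict([query])."""
    timings = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            predict([query])  # one query per call, as a web request would do
            timings.append(time.perf_counter() - start)
    return median(timings)


# Hypothetical stand-in for a trained linear model: made-up keyword weights
# per category. A real model would learn these from the labeled corpus.
WEIGHTS = {
    "electronics": {"phone": 1.0, "laptop": 1.2, "charger": 0.8},
    "books": {"novel": 1.1, "paperback": 0.9, "author": 0.7},
}


def toy_predict(texts):
    """Score each text against every category; return the best category per text."""
    labels = []
    for text in texts:
        tokens = text.lower().split()
        scores = {
            category: sum(weights.get(token, 0.0) for token in tokens)
            for category, weights in WEIGHTS.items()
        }
        labels.append(max(scores, key=scores.get))
    return labels


sample_queries = ["cheap laptop charger", "paperback novel"]
latency = measure_latency(toy_predict, sample_queries)
print(f"median latency per query: {latency * 1e3:.3f} ms")
```

My assumption is that the same helper could be pointed at a real model by passing the `predict` method of a fitted scikit-learn pipeline (vectorizer plus linear SVM) instead of `toy_predict`, using a sample of realistic user queries.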