How can I improve sentiment analysis of user comments?

Question

I'm implementing sentiment analysis on a set of user comments. All comments are about the same object. For now I have decided to use three classes: negative, neutral and positive. I have a test set of 1500 comments with labelled classes. I tried SVM classification on binary feature vectors, in which each element indicates the presence of some word in the comment. The best accuracy I got was 60%. Published research reports 80% accuracy or better, but it was done on English texts.

One of the problems is the numerous spelling and grammar errors in the comments. Also, the comments are in Russian, which is more complex than English.

I would appreciate advice of any kind. Are there any good tools for analysing Russian text? Maybe SVM isn't the right choice; are there better algorithms for my case? Or should I choose a more efficient feature space?

Your question feels very generic. Check the FAQ: http://stackoverflow.com/faq#questions for the kind of questions you can ask here. On your question: if I were you, I'd look at the results, try to detect patterns in the failed cases, and add features to capture the information I'm missing. It's too hard to tell whether SVM is a bad choice given the info, but you should be able to do better. Choose a wider (possibly less efficient) feature space now, get the numbers, and work on reduction later. - Jun 28 '12 at 13:16
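A minimal sketch of the "look at the failed cases" step suggested in the comment above, assuming you already have the true labels and the classifier's predictions as parallel lists. The function names and the placeholder data are hypothetical, not from the original post.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count every (true, predicted) label pair, correct ones included."""
    return Counter(zip(true_labels, predicted_labels))


def failed_cases(comments, true_labels, predicted_labels):
    """Return (comment, true, predicted) triples the classifier got wrong."""
    return [
        (text, true, pred)
        for text, true, pred in zip(comments, true_labels, predicted_labels)
        if true != pred
    ]


# Made-up placeholder data, only to show the shape of the inputs.
comments = ["comment a", "comment b", "comment c", "comment d"]
true_labels = ["positive", "negative", "neutral", "negative"]
predicted = ["positive", "neutral", "neutral", "neutral"]

counts = confusion_counts(true_labels, predicted)
failed = failed_cases(comments, true_labels, predicted)

# Show the most frequent kinds of mistake first.
for (true, pred), n in counts.most_common():
    if true != pred:
        print(f"{true} predicted as {pred}: {n}")

# Read the misclassified comments themselves to spot missing features.
for text, true, pred in failed:
    print(f"[{true} -> {pred}] {text}")
```

Reading the misclassified comments grouped by error type is usually what suggests the next features to add.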

score 1 · Answer 1 · answered Nov 05 '17 at 14:20

There are many options for improving the performance of your model. You should not mix pre-processing with training/validating/testing an ML model! In my experience, when it comes to NLP, pre-processing is key. I am no expert in Russian, but there are frameworks that help you lemmatize or stem your texts (for example, NLTK with its Snowball stemmer).
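A minimal sketch of the stemming step, assuming NLTK is installed. `SnowballStemmer("russian")` is part of NLTK; the `stem_tokens` helper and the sample words are illustrative only.

```python
from nltk.stem.snowball import SnowballStemmer

# The Snowball stemmer for Russian needs no extra corpus download.
stemmer = SnowballStemmer("russian")


def stem_tokens(tokens):
    """Lowercase each token and reduce it to its Snowball stem."""
    return [stemmer.stem(token.lower()) for token in tokens]


# A few inflected forms of one adjective ("hungry").
forms = ["голодный", "голодная", "голодное", "голодному"]
stems = stem_tokens(forms)
print(stems)
```

The point is that several surface forms collapse to fewer distinct stems, so the binary word-presence features become less sparse.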

Apart from pre-processing, you have the following options on the ML side:

Try different algorithms: SVM, tree-based algorithms, neural nets (your data set might be too small), ensembles (boosting or bagging). In my experience, ensembles work pretty well. Don't overfit though (maybe read this)!
Perform a parameter optimization.
Try to pre-train your model on other publicly available, comparable data sets (might be difficult in your case).
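A rough sketch of the parameter optimization option, assuming scikit-learn is available. It combines binary word-presence features with a linear SVM and grid-searches a few settings. The toy comments, the labels and the grid values are invented for illustration (and in English for readability); the real 1500 labelled comments would replace them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy data: four comments per class.
texts = [
    "great product love it", "really great and useful",
    "love it works great", "useful and great",
    "bad product hate it", "really bad and useless",
    "hate it works bad", "useless and bad",
    "it is a product", "arrived on monday",
    "it is blue", "product arrived",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

pipeline = Pipeline([
    # binary=True gives word-presence vectors rather than counts.
    ("features", CountVectorizer(binary=True)),
    ("svm", LinearSVC()),
])

# Hypothetical grid; the useful ranges depend on the real data.
param_grid = {
    "features__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.01, 0.1, 1, 10],
}

# Macro F1 weights all three classes equally; cv=2 only because the toy set is tiny.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["really useful product"]))
```

Swapping `LinearSVC` for a tree ensemble in the same pipeline is one way to compare algorithms under identical features and cross-validation.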

Also remember: if you have a highly class-imbalanced data set, accuracy is not a good performance metric. Try the F1 score, for example. Hope I could help.
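To illustrate why accuracy can mislead on imbalanced classes, here is a small pure-Python sketch that computes per-class and macro-averaged F1. The label lists are invented: a classifier that always answers "neutral" on a set that is mostly neutral.

```python
def f1_per_class(true_labels, predicted_labels):
    """Return {class: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(true_labels, predicted_labels)
    return sum(scores.values()) / len(scores)


# Invented imbalanced example: 8 neutral, 1 positive, 1 negative.
true = ["neutral"] * 8 + ["positive", "negative"]
pred = ["neutral"] * 10  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
macro = macro_f1(true, pred)

print(f"accuracy = {accuracy:.2f}")
print(f"macro F1 = {macro:.2f}")
```

Accuracy looks respectable here even though the classifier never finds a single positive or negative comment, while macro F1 drops sharply.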

score 1 · Answer 2 · answered Jun 28 '12 at 13:32

One suggestion is to do a simple morphological analysis on the Russian words and remove the cases and genders. This is a major difference between English and Russian. For example, in English, you say:

*hungry* boy
*hungry* girl
*hungry* society
I gave the *hungry* boy a tea-bread.

etc... whereas in Russian, you have (I might have mis-spelled some of these):

*голодный* мальчик
*голодная* девушка
*голодное* общество
Я дал *голодному* мальчику сушку.

All the different declensions of this one adjective (thanks to this site):

голоден, голодна, голодная, голодно, голодного, голодное, голодной, голодном, голодному, голодною, голодную, голодны, голодные, голодный, голодным, голодными, голодных

Thus instead of one word (hungry), you have seventeen. You could replace all the instances of any of these seventeen with just the root - say голод - thus making the input simpler. Granted, you will lose some information, but it will make the source look more like English, which might get you at least closer to 80% accuracy.
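A minimal sketch of that replacement, using an explicit lookup table built from the seventeen forms listed above. In practice a stemmer or morphological analyser would generate this mapping for the whole vocabulary; the `normalize` helper is hypothetical.

```python
# The seventeen forms of the adjective, as listed in this answer.
FORMS = (
    "голоден, голодна, голодная, голодно, голодного, голодное, голодной, "
    "голодном, голодному, голодною, голодную, голодны, голодные, голодный, "
    "голодным, голодными, голодных"
).split(", ")

ROOT = "голод"

# Map every inflected form to the shared root.
NORMALIZE = {form: ROOT for form in FORMS}


def normalize(tokens):
    """Lowercase tokens and replace known inflected forms with their root."""
    return [NORMALIZE.get(token.lower(), token.lower()) for token in tokens]


print(normalize("голодный мальчик".split()))
print(normalize("голодная девушка".split()))
```

Both phrases now share the single feature голод instead of contributing two unrelated word features.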

I'm not sure what to do with misspelled versions of these... maybe you can somehow use Google's search suggestions to find the right spelling for misspelled words (e.g. http://tinyurl.com/cprzf6n).
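As a local alternative to querying search suggestions (a different technique from the one proposed above), one could match each unknown token against a list of known correct words with Python's standard `difflib`. The vocabulary, the `correct` helper and the 0.8 cutoff are assumptions for illustration; a real vocabulary would come from a dictionary or a clean corpus.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words.
VOCABULARY = ["голодный", "мальчик", "девушка", "общество"]


def correct(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is close."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# "обшество" is a deliberate misspelling of "общество".
print(correct("обшество"))
# A token with no close match is left unchanged.
print(correct("xyz"))
```

A high cutoff keeps the correction conservative, which matters because a wrong "fix" can change the sentiment of a word.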
