
I've been trying to classify multi-label texts with different classification algorithms.

I get pretty good results with a linear-kernel SVM, while the results with the other kernels are not good. I understand this happens because text classification problems are usually close to linearly separable in the feature space.

When I use Random Forests, the results are much worse but still acceptable: some labels are classified correctly, but many are not predicted at all.

Finally I used Multinomial Naïve Bayes and the results are very bad; in fact the classifier assigns no labels at all.

Is this normal? Is there any reason for these very poor results with Naïve Bayes?
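
For concreteness, here is a simplified sketch of the kind of setup I mean (not my exact code; it assumes scikit-learn's one-vs-rest wrapper and a synthetic data set):

    # Simplified sketch (not the actual code): one-vs-rest multi-label
    # classification with the three algorithms mentioned above, on synthetic data.
    from sklearn.datasets import make_multilabel_classification
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import f1_score

    # Synthetic stand-in for the tf-idf features; the counts are non-negative,
    # so MultinomialNB can be fitted on them as well.
    X, Y = make_multilabel_classification(n_samples=500, n_features=100,
                                          n_classes=10, random_state=0)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

    classifiers = {
        "linear SVM": OneVsRestClassifier(LinearSVC()),
        "random forest": OneVsRestClassifier(RandomForestClassifier(n_estimators=100)),
        "multinomial NB": OneVsRestClassifier(MultinomialNB()),
    }

    for name, clf in classifiers.items():
        clf.fit(X_train, Y_train)
        print(name, f1_score(Y_test, clf.predict(X_test), average="micro"))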

Blunt
  • What does the data look like? How many classes? Samples per class? Number of dimensions? How do you measure performance? – jpmuc Jun 26 '15 at 09:31
  • The input data for the training are texts, each with several labels.

    I binarize the labels with scikit-learn, and build the features with CountVectorizer followed by TfidfTransformer. Tokenization and stemming are performed beforehand. (A rough sketch of this pipeline is shown right after these comments.)

    There are about a thousand labels.

    For performance I measure precision, recall and F1.

    – Blunt Jun 26 '15 at 09:53
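
As mentioned in the comment above, a rough sketch of the preprocessing (the Snowball stemmer from NLTK is only a stand-in for whatever tokenizer and stemmer are actually used):

    # Rough sketch of the preprocessing described above: stem, count,
    # tf-idf-transform, and binarize the labels.
    from nltk.stem import SnowballStemmer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.preprocessing import MultiLabelBinarizer

    stemmer = SnowballStemmer("english")

    def stem_tokenize(text):
        # naive whitespace tokenization followed by stemming
        return [stemmer.stem(token) for token in text.split()]

    texts = ["a first training document", "another labelled text"]  # placeholders
    labels = [["label_a", "label_b"], ["label_c"]]                   # placeholders

    counts = CountVectorizer(tokenizer=stem_tokenize).fit_transform(texts)
    X = TfidfTransformer().fit_transform(counts)     # tf-idf feature matrix
    Y = MultiLabelBinarizer().fit_transform(labels)  # binary label indicator matrix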

1 Answer


One plausible explanation is high correlation among your features. I am by no means an expert on NLP, but you could check the following hypothesis: tf-idf is proportional to the frequency with which a word appears in a document. Through stemming those frequencies become higher, since you no longer differentiate inflections, conjugations and so on of verbs and nouns.
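
One quick way to probe that hypothesis is to look at how correlated your tf-idf features actually are. A small sketch (it assumes X is your documents-by-terms tf-idf matrix, and the helper name is just illustrative):

    # Sketch: how strongly are the most frequent tf-idf features correlated?
    # X is assumed to be the documents-by-terms tf-idf matrix (SciPy sparse).
    import numpy as np

    def top_feature_correlations(X, n_features=200):
        X = np.asarray(X.todense()) if hasattr(X, "todense") else np.asarray(X)
        top = np.argsort(-X.sum(axis=0))[:n_features]    # most frequent terms
        corr = np.corrcoef(X[:, top], rowvar=False)
        off_diag = np.abs(corr[~np.eye(len(top), dtype=bool)])
        return off_diag.mean(), off_diag.max()

    # mean_corr, max_corr = top_feature_correlations(X)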

The algorithms you mention, such as Multinomial Naive Bayes and Random Forests, are sensitive to highly correlated features, and their performance degrades because of it. SVMs, on the contrary, are much more robust to it.

This interesting paper (Tackling the Poor Assumptions of Naive Bayes Text Classifiers) explains why multinomial naive Bayes is sensitive to correlated features, and suggests ways to overcome it.
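
The complement weighting proposed in that paper is implemented in scikit-learn (version 0.20 and later) as ComplementNB, so it is cheap to try as a drop-in replacement for MultinomialNB; a minimal sketch with placeholder data names:

    # Complement Naive Bayes (Rennie et al. 2003) as a drop-in replacement
    # for MultinomialNB; requires scikit-learn >= 0.20.
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import ComplementNB

    clf = OneVsRestClassifier(ComplementNB())
    # clf.fit(X_train, Y_train); Y_pred = clf.predict(X_test)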

Understanding why random forests are sensitive to correlated features is more involved. This paper (Correlation and variable importance in random forests) explains in great detail why the permutation-based measure of a feature's importance is sensitive to correlation.
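
If you want to see the effect on your own data, scikit-learn (version 0.22 and later) exposes a permutation-based importance you can compute on a held-out set; a minimal sketch with placeholder names:

    # Sketch: permutation importance of an already fitted forest on held-out data.
    # With strongly correlated features, permuting one of them hardly hurts the
    # model (its correlated partners carry the same information), so the
    # importances get diluted.
    from sklearn.inspection import permutation_importance

    result = permutation_importance(fitted_forest, X_test, Y_test,
                                    n_repeats=10, random_state=0)
    print(result.importances_mean)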

This presentation gives some advice on how to tackle the problem with random forests.

jpmuc