
I am working on a piece of software to detect opinions in text. As a simple example, I would like my algorithm to tell me that Andrea has a positive opinion of New York (+1, rather than neutral 0 or negative -1) when I enter the following sentence:

Andrea loves New York.

Now I have hundreds of millions of unstructured text files with sentences just like this one, or far more complex, so analysing these sentences by hand is not an option; hence the piece of software. I have several different methods, based on different papers or on my own experience with opinion mining:

             method_a    method_b    method_c    labelled_by_hand(n)
positive     20%         30%         50%         30%
neutral      80%         30%          0%         20%
negative      0%         40%         50%         50%

sample size n = 100

I would like to know how to assess the quality of a given algorithm. I have no distribution and no statistical model that I can assume to be true, so what I read here and there about AIC, BIC and other methods seems far too elaborate for this task (I may be wrong, but I don't have any common variables with which to compare my models to each other).

So my questions:

  1. How many samples should I label by hand to achieve a given quality for my assessment test?
  2. Which measures are commonly used for a simple comparative task like this? Precision and recall?
ATN
  • Out of curiosity, is the ability to extract the entities and relationships (e.g. Andrea feels x about NY) part of the differentiation between algorithms, or is the exercise purely about evaluating the accuracy of the value assigned to x? – Jonathan Jun 05 '13 at 18:01
  • @Jonathan: in my case, it is a little bit of both. – ATN Jun 05 '13 at 19:49

1 Answer


Given the small number of categories (positive, neutral, negative), it will not take many hand-labelled examples to evaluate the relative performance of each algorithm. The specific number you'll need depends on the type of comparison you perform and on the confidence and power you desire, but it's not worth over-thinking at this point. I would start with 100; you can always label more later if you need to.
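To make your question 1 concrete, here is a minimal Python sketch using the standard normal approximation for a binomial proportion. The margin of error and confidence level are assumptions you would pick yourself, not values implied by your data:

    from math import ceil
    from scipy.stats import norm

    def labels_needed(margin=0.10, confidence=0.95, p=0.5):
        """Sample size so a confidence-level interval for a proportion is within +/- margin.
        p = 0.5 is the worst case when the true accuracy is unknown."""
        z = norm.ppf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95% confidence
        return ceil(z ** 2 * p * (1 - p) / margin ** 2)

    print(labels_needed())        # 97  -> "start with 100" is a sensible round number
    print(labels_needed(0.05))    # 385 -> the labelling effort grows quickly as the margin shrinks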

You will want to start by looking at the confusion matrix, or error matrix, for each algorithm relative to the manual classification. The error matrix is the cross-tabulation of the category you assigned to each example against the category the algorithm assigned to it.

                   Algorithm A
                   Positive    Neutral    Negative
Manual    Positive    10         10         10
          Neutral     10         10         10
          Negative    10         10         10
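If it helps, here is a minimal Python sketch of building such a cross-tabulation; the list names and toy labels are illustrative, not part of your data, and scikit-learn's confusion_matrix is just one convenient way to do it:

    from sklearn.metrics import confusion_matrix

    labels = ["positive", "neutral", "negative"]

    # Toy data: one label per sentence, in the same order in both lists.
    manual      = ["positive", "neutral", "negative", "positive", "neutral"]
    algorithm_a = ["positive", "negative", "negative", "neutral",  "neutral"]

    # Rows follow the manual classification, columns follow Algorithm A,
    # in the same order as the `labels` list (matching the table above).
    matrix = confusion_matrix(manual, algorithm_a, labels=labels)
    print(matrix)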

This will be the basis for exploring the data and beginning to understand the differences in classification. In the example above, you can see that Algorithm A achieved only 33% accuracy ((10 + 10 + 10) / 90 ≈ 33%), and you can compare this simple accuracy percentage across algorithms using standard percentage-based statistics (for example, confidence intervals or tests for proportions).
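Overall accuracy is just the sum of the diagonal divided by the total number of labelled examples. A small sketch, with a simple normal-approximation interval attached so the percentages of different methods can be compared; the choice of interval is my assumption, not something prescribed here:

    import numpy as np

    # The example error matrix above (rows = manual, columns = Algorithm A).
    matrix = np.array([[10, 10, 10],
                       [10, 10, 10],
                       [10, 10, 10]])

    n = matrix.sum()
    accuracy = np.trace(matrix) / n                  # (10 + 10 + 10) / 90 ≈ 0.33
    stderr = np.sqrt(accuracy * (1 - accuracy) / n)  # normal approximation
    print(f"accuracy = {accuracy:.1%} +/- {1.96 * stderr:.1%} (approx. 95% interval)")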

Other metrics you will want to look at, for each category, are the following (both can be read directly off the error matrix, as sketched after the list):

  • User's accuracy: what percentage of examples labelled "positive" by the algorithm are also labelled "positive" by the reference data? (This is precision.)
  • Producer's accuracy: what percentage of examples labelled "positive" by the reference data are also labelled "positive" by the algorithm? (This is recall.)
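A sketch of reading both off the error matrix, one category at a time (same toy matrix as above):

    import numpy as np

    # The example error matrix above (rows = manual, columns = algorithm).
    matrix = np.array([[10, 10, 10],
                       [10, 10, 10],
                       [10, 10, 10]])
    labels = ["positive", "neutral", "negative"]

    for i, label in enumerate(labels):
        user_acc     = matrix[i, i] / matrix[:, i].sum()  # of the algorithm's calls for this label, how many the reference agrees with (precision)
        producer_acc = matrix[i, i] / matrix[i, :].sum()  # of the reference examples with this label, how many the algorithm found (recall)
        print(f"{label:8s} user's accuracy = {user_acc:.0%}, producer's accuracy = {producer_acc:.0%}")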

You can also develop more advanced metrics based on what you want to get from the algorithm. For example, does it "cost" more if the algorithm mistakes a positive for a negative than if it mistakes a positive for a neutral?
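If so, weighting the error matrix by a cost matrix is a simple way to capture that; the cost values below are placeholders you would set for your own application:

    import numpy as np

    # The example error matrix above (rows = manual, columns = algorithm).
    matrix = np.array([[10, 10, 10],
                       [10, 10, 10],
                       [10, 10, 10]])

    # Placeholder cost matrix, same row/column order as the error matrix:
    # confusing positive with negative (or vice versa) costs 2, confusing
    # neutral with either extreme costs 1, and correct labels cost 0.
    costs = np.array([[0, 1, 2],
                      [1, 0, 1],
                      [2, 1, 0]])

    average_cost = (matrix * costs).sum() / matrix.sum()
    print(f"average cost per labelled example: {average_cost:.2f}")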

Examples of different methods of assessing accuracy can be found in the paper: Comparative assessment of the measures of thematic classification accuracy

Jonathan