I am working on a piece of software to detect opinions in text. As a simple example, I would like my algorithm to tell me that Andrea has a positive opinion of New York (+1, rather than neutral 0 or negative -1) when I enter the following sentence:
Andrea loves New York.
Now I have hundreds of millions of unstructured text files with sentences like this one, or far more complex ones, so analysing them by hand is not an option; hence the software. I have several candidate methods, based on different papers or on my own experience with opinion mining. Here is how three of them compare against a hand-labelled sample:
|          | method_a | method_b | method_c | labelled_by_hand |
|----------|----------|----------|----------|------------------|
| positive | 20%      | 30%      | 50%      | 30%              |
| neutral  | 80%      | 30%      | 0%       | 20%              |
| negative | 0%       | 40%      | 50%      | 50%              |

Sample size: n = 100 hand-labelled documents.
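For concreteness, each column above could be computed along these lines once every method (and the hand labelling) produces one label of -1, 0 or +1 per document. This is only a sketch with made-up labels, not my actual data:

```python
from collections import Counter

# Hypothetical per-document labels (-1 = negative, 0 = neutral, +1 = positive).
# In the real setup these come from running one method over the same
# n = 100 documents that were labelled by hand.
hand_labels     = [1, 0, -1, 1, 0]   # labelled_by_hand
method_a_labels = [1, 0, 0, 0, 0]    # output of method_a on the same documents

def label_distribution(labels):
    """Share of positive / neutral / negative labels, i.e. one table column."""
    counts = Counter(labels)
    n = len(labels)
    return {name: counts.get(value, 0) / n
            for name, value in [("positive", 1), ("neutral", 0), ("negative", -1)]}

print(label_distribution(hand_labels))      # the labelled_by_hand column
print(label_distribution(method_a_labels))  # the method_a column
```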
I would like to know how to assess the quality of a given algorithm. I have no distribution and no statistical model that I can assume to be true, so what I read here and there about AIC, BIC and other methods seems far too elaborate (I may be wrong, but I don't have any common variables with which to compare my models to each other).
So my questions:
- how many samples should I label by hand to achieve a given quality for my assessment? (see the sample-size sketch after this list)
- which measures are commonly used for a simple comparative task like this? Precision and recall? (a worked example follows below)
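For the first question, one back-of-the-envelope approach I have considered (just the normal approximation for estimating a proportion, nothing specific to my setup) is:

```python
import math

def sample_size_for_proportion(margin_of_error, confidence_z=1.96, p=0.5):
    """Smallest n such that a proportion (e.g. a method's accuracy) estimated
    from n hand-labelled documents has roughly the requested margin of error
    at the given confidence level (normal approximation; p = 0.5 is the
    worst case and therefore the conservative choice)."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

# To estimate a method's accuracy to within +/- 5 percentage points
# at ~95% confidence, this gives on the order of:
print(sample_size_for_proportion(0.05))  # -> 385
```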
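For the second question, per-class precision and recall plus a confusion matrix are what I have in mind. With scikit-learn, comparing one method against the hand labels would look roughly like this (again with made-up labels):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical hand labels and one method's predictions for the same documents
# (-1 = negative, 0 = neutral, +1 = positive).
y_true = [1, 1, 0, -1, 0, 1, -1, 0, 1, -1]
y_pred = [1, 0, 0, -1, 1, 1, -1, 0, 0, -1]

# Per-class precision, recall and F1, plus macro/weighted averages.
print(classification_report(y_true, y_pred,
                            labels=[1, 0, -1],
                            target_names=["positive", "neutral", "negative"]))

# The confusion matrix shows exactly where the method disagrees with the
# hand labels (e.g. positives systematically labelled neutral).
print(confusion_matrix(y_true, y_pred, labels=[1, 0, -1]))
```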