What is a good measure of inter-rater agreement when the following two conditions hold?
- each annotation is a ranked list of 3 elements (which the annotator chooses from a set of 10 elements),
- there are more than 2 annotators (a.k.a. raters).
Example:
I have 1000 short texts, 10 sentiment types (e.g., "happy", "funny", "sarcastic", or "ironic"), and 5 human annotators. I am asking each annotator to go over each of the 1000 short texts and, for each one, indicate as a ranked list which 3 sentiments are the most tangible in it. For example, human annotator #1 might decide that short text #361 is ["sarcastic", "ironic", "funny"], meaning that short text #361 is more "sarcastic" than "ironic", more "ironic" than "funny", and more "funny" than any of the other 7 sentiments. (A sketch of the resulting data is given below.)
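To make the data shape concrete, here is a minimal Python sketch (the names `SENTIMENTS`, `annotations`, and `to_rank_vector`, as well as the 6 extra sentiment labels, are placeholders I made up for illustration):

```python
# Placeholder list of the 10 possible sentiments; only the first four
# appear in the question, the remaining six are made-up fillers.
SENTIMENTS = [
    "happy", "funny", "sarcastic", "ironic", "angry",
    "sad", "fearful", "surprised", "neutral", "hopeful",
]

# annotations[annotator_id][text_id] is a ranked list of exactly 3 sentiments,
# ordered from most to least tangible.
annotations = {
    1: {361: ["sarcastic", "ironic", "funny"]},   # annotator #1, short text #361
    2: {361: ["ironic", "sarcastic", "happy"]},   # another annotator may disagree
    # ... 5 annotators x 1000 short texts in total
}

def to_rank_vector(top3):
    """Expand a ranked top-3 list into ranks over all 10 sentiments, giving the
    7 unranked sentiments the tied mid-rank 7 (= average of positions 4..10).
    This reflects the reading "more 'funny' than any of the other 7 sentiments"."""
    tied = (4 + len(SENTIMENTS)) / 2
    return [top3.index(s) + 1 if s in top3 else tied for s in SENTIMENTS]

print(to_rank_vector(annotations[1][361]))
# -> [7.0, 3, 1, 2, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0]
```

An agreement measure would presumably operate on something like these per-annotator rank vectors (one per annotator per text), but which measure is appropriate here is exactly my question.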