I am running a human evaluation on a recruiting website. The situation is as follows:
- There are 78 annotators.
- There are 100 questions.
- Each annotator annotates 5 questions.
- So, on average, each question is annotated by about 4 people (78 annotators × 5 questions = 390 ratings spread over 100 questions, i.e. ≈ 3.9 per question). I say "on average" because the number of responses per question may vary.
- Each question is rated on 4 dimensions (grammaticality, meaning, simplicity, all), each measured on a 0-to-5 Likert scale (-2 to 2 in the case of simplicity).
So, for example, looking only at the grammaticality dimension, the data looks something like this:
|  | Q1 | Q2 | Q3 | Q4 | ... | Q100 |
|---|---|---|---|---|---|---|
| A1 | NaN | 2 | NaN | NaN | ... | 3 |
| A2 | NaN | NaN | 0 | NaN | ... | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| A78 | 1 | 2 | 3 | 1 | ... | 5 |
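In case it helps to be concrete, here is a minimal pandas sketch of how such a matrix could be built from a long-format export of the responses (the column names `annotator`, `question`, `grammaticality` and the toy values are made up):

```python
import pandas as pd

# Hypothetical long-format export: one row per (annotator, question) response.
ratings = pd.DataFrame({
    "annotator":      ["A1", "A1", "A2", "A78", "A78", "A78"],
    "question":       ["Q2", "Q100", "Q3", "Q1", "Q2", "Q3"],
    "grammaticality": [2, 3, 0, 1, 2, 3],
})

# Pivot into the annotator x question matrix shown above;
# (annotator, question) pairs with no response become NaN.
matrix = ratings.pivot(index="annotator", columns="question",
                       values="grammaticality")
print(matrix)
```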
Obviously, if one ignores which specific annotator produced each rating, the same data can be represented with one row per question and one column per anonymous rating slot:
| AA | AB | AC | AD | AE |
|---|---|---|---|---|
| 1 | 1 | 2 | 5 | 2 |
| 2 | 2 | 2 | NaN | 2 |
| 3 | 3 | 2 | 3 | 2 |
| 0 | 4 | 5 | 5 | 2 |
| ... | ... | ... | ... | ... |
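Continuing the sketch above (still with made-up data; the slot names AA, AB, ... and the padding logic are just illustrative), this anonymised view can be obtained by collecting the non-NaN ratings of each question:

```python
import numpy as np

# Continuing from `matrix` above: drop annotator identity by keeping, for each
# question, only the ratings it actually received, padded with NaN so that
# every question row has the same number of rating slots.
n_slots = int(matrix.notna().sum().max())   # most ratings any single question received
slot_names = [f"A{chr(ord('A') + i)}" for i in range(n_slots)]  # AA, AB, AC, ...

anonymous = pd.DataFrame(
    [list(matrix[q].dropna()) + [np.nan] * (n_slots - matrix[q].count())
     for q in matrix.columns],
    index=matrix.columns,
    columns=slot_names,
)
print(anonymous)
```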
I am wondering whether there is any inter-annotator agreement measure that can be applied in this setting (missing ratings, a varying number of annotators per question, ordinal scales), or whether there is some other way to estimate the agreement.
PS: Is there a book or paper you would recommend for studying IAA measures beyond the simple case where N "fixed" annotators annotate M examples with binary votes?