
I am running a human evaluation on a recruiting website. The situation is as follows:

  • There are 78 annotators.
  • There are 100 questions.
  • Each annotator annotates 5 questions.
  • So, on average, each question is annotated by about 4 people. I say "on average" because the number of responses per question can vary.
  • Each question is rated on 4 dimensions (grammaticality, meaning, simplicity, all), each measured on a 0-to-5 Likert scale (-2 to 2 in the case of simplicity).

So, for example, looking at the grammaticality dimension, one has something like this:

       Q1    Q2    Q3   Q4   ...  Q100
A1    NaN     2   NaN  NaN   ...     3
A2    NaN   NaN     0  NaN   ...   NaN
...   ...   ...   ...  ...   ...   ...
A78     1     2     3    1   ...     5
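
For concreteness, here is a minimal sketch of how such a matrix could be built, assuming the raw responses are available in long form (one row per annotator–question pair). Pandas and the column names used here are my own illustration, not part of the actual setup:

    import pandas as pd

    # Hypothetical long-form export of the responses: one row per
    # (annotator, question) pair that was actually annotated.
    responses = pd.DataFrame({
        "annotator":      ["A1", "A2", "A78", "A78", "A78"],
        "question":       ["Q2", "Q3", "Q1",  "Q2",  "Q3"],
        "grammaticality": [2,    0,    1,     2,     3],
    })

    # Pivot into an annotator x question matrix; pairs that were never
    # assigned show up as NaN, as in the sketch above.
    matrix = responses.pivot(index="annotator",
                             columns="question",
                             values="grammaticality")
    print(matrix)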

Alternatively, if one does not keep track of which specific annotator gave each rating, the data can be represented per question like this:

 AA   AB   AC   AD   AE
  1    1    2    5    2
  2    2    2  NaN    2
  3    3    2    3    2
  0    4    5    5    2
...  ...  ...  ...  ...
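
A similar minimal sketch of how this annotator-agnostic view could be produced from the same kind of long-form data, again assuming pandas; the "slot" numbering of ratings within each question is purely illustrative:

    import pandas as pd

    # Hypothetical long-form responses, as before.
    responses = pd.DataFrame({
        "annotator":      ["A1", "A5", "A9", "A12", "A2", "A7", "A30"],
        "question":       ["Q1", "Q1", "Q1", "Q1",  "Q2", "Q2", "Q2"],
        "grammaticality": [1,    1,    2,    5,     2,    2,    2],
    })

    # Number the ratings within each question (1st, 2nd, ...) instead of
    # tracking who gave them, then pivot to a question x rating-slot table.
    responses["slot"] = responses.groupby("question").cumcount()
    anonymous = responses.pivot(index="question",
                                columns="slot",
                                values="grammaticality")
    print(anonymous)  # questions with fewer ratings get NaN in unused slots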

I am wondering whether there is an inter-annotator agreement measure that can be applied in this setting (sparse annotations with a varying number of raters per question), or whether there is some other way to estimate the agreement.

PS: Is there a book/paper you would recommend for studying IAA measures beyond the simple case where N "fixed" annotators annotate M examples with binary votes?
