I have data on performance in two types of judgment task in which people (judges) evaluated other people (targets). The two task types differ in the format of the ground truth. All judges completed both tasks and evaluated all targets; there were 24 targets per task, so 48 distinct targets in total.
- For the first task, judgments ranged from 0 to 100 and the ground truth was also on a 0–100 scale.
- For the second task, judgments ranged from 0 to 100 but the ground truth was binary (0 or 1).
For the first task, I measured accuracy with absolute error. For the second task, I measured accuracy with Brier scores.
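For concreteness, here is a minimal sketch of how I compute the two accuracy metrics (shown with simulated data for one judge; the variable names and the rescaling of judgments to probabilities for the Brier score are my own choices, not a standard pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one judge's ratings for 24 targets per task.
judgments_task1 = rng.uniform(0, 100, 24)   # judgments on a 0-100 scale
truth_task1 = rng.uniform(0, 100, 24)       # continuous ground truth, 0-100

judgments_task2 = rng.uniform(0, 100, 24)   # judgments on a 0-100 scale
truth_task2 = rng.integers(0, 2, 24)        # binary ground truth (0 or 1)

# Task 1: mean absolute error between judgment and ground truth.
mae = np.mean(np.abs(judgments_task1 - truth_task1))

# Task 2: rescale judgments to [0, 1] so they read as probabilities,
# then take the mean squared difference from the binary truth (Brier score).
probs = judgments_task2 / 100.0
brier = np.mean((probs - truth_task2) ** 2)

print(f"Task 1 MAE:   {mae:.2f}")    # lower is better, on a 0-100 scale
print(f"Task 2 Brier: {brier:.3f}")  # lower is better, on a 0-1 scale
```

As the comments note, the two scores live on different scales (0–100 vs. 0–1), which is part of why a direct comparison is not straightforward.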
I have two questions:
1. I would like to compare accuracy on the first task with accuracy on the second (the same judges completed both). However, I believe Brier scores do not make sense given the continuous ground truth of the first task, and I'm not sure that absolute error is appropriate for the second. How can I best compare accuracy across the two tasks?
2. I would like to establish whether performance on each task is mostly attributable to skill/ability/talent or to luck/chance. How can I best accomplish this?