I have data on performance in two types of judgment task in which people (judges) evaluated other people (targets). The two task types differ in the format of the ground truth. All judges completed both tasks and evaluated all targets; there were 24 targets per task, so 48 distinct targets in total.
- For the first task, judgments ranged from 0 to 100 and the ground truth was also on a 0–100 scale.
- For the second task, judgments ranged from 0 to 100 but the ground truth was binary (0 or 1).
For the first task, I measured accuracy with absolute error. For the second task, I measured accuracy with Brier scores.
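For concreteness, here is a minimal sketch of how I compute the two accuracy metrics (shown with simulated data for one judge; the variable names and the rescaling of judgments to probabilities for the Brier score are my own choices, not a standard pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one judge's ratings for 24 targets per task.
judgments_task1 = rng.uniform(0, 100, 24)   # judgments on a 0-100 scale
truth_task1 = rng.uniform(0, 100, 24)       # continuous ground truth, 0-100

judgments_task2 = rng.uniform(0, 100, 24)   # judgments on a 0-100 scale
truth_task2 = rng.integers(0, 2, 24)        # binary ground truth (0 or 1)

# Task 1: mean absolute error between judgment and ground truth.
mae = np.mean(np.abs(judgments_task1 - truth_task1))

# Task 2: rescale judgments to [0, 1] so they read as probabilities,
# then take the mean squared difference from the binary truth (Brier score).
probs = judgments_task2 / 100.0
brier = np.mean((probs - truth_task2) ** 2)

print(f"Task 1 MAE:   {mae:.2f}")    # lower is better, on a 0-100 scale
print(f"Task 2 Brier: {brier:.3f}")  # lower is better, on a 0-1 scale
```

As the comments note, the two scores live on different scales (0–100 vs. 0–1), which is part of why a direct comparison is not straightforward.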
I have two questions:
1. I would like to compare accuracy on the first task with accuracy on the second (the same judges completed both). However, I believe Brier scores do not make sense given the continuous ground truth of the first task, and I'm not sure that absolute error is appropriate for the second. How can I best compare accuracy across the two tasks?
2. I would like to establish whether performance on each task is mostly attributable to skill/ability/talent or to luck/chance. How can I best accomplish this?