
I'm working with an imbalanced data set with 12 classes. I was looking for a metric that I can optimize in my objective function during hyperparameter tuning and use for the final evaluation of the model.

Since accuracy doesn't seem to be a good choice, I read about Cohen's kappa statistic. I used the scikit-learn implementation (sklearn's cohen_kappa_score) and I'm pretty happy with the results of the hyperparameter tuning; they are quite a bit better than what I get with accuracy or weighted recall, for example. I just use it like this:

    from sklearn.metrics import cohen_kappa_score

    score = cohen_kappa_score(y_test_cv, y_pred)

Anyway, I'm a bit confused about how it is calculated when there are multiple classes. There are a few websites (The Data Scientist) which state that it is a good metric for problems with multiple classes and imbalanced data sets (which is exactly what I'm working with).

How do I (or how does scikit-learn) calculate Cohen's kappa score for multiple classes, and is it a good idea for my use case (12 classes, imbalanced data)?

I don't plan to evaluate my final model solely based on the kappa score. I'll have a look at the confusion matrix and multiple metrics. But in the parameter tuning stage I'm currently only relying on kappa and I'm happy with the results.

German Wikipedia (wikipedia) shows an example but lists two different options for calculating $p_c$.

  • If you are looking for the source code of sklearn's Cohen's kappa score, you can access this page: https://github.com/scikit-learn/scikit-learn/blob/1495f6924/sklearn/metrics/classification.py#L500 – Shark Deng Aug 17 '19 at 03:20

1 Answer


From the sklearn documentation:

$$ \kappa = (p_o - p_e) / (1 - p_e) $$

where $p_o$ is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and $p_e$ is the expected agreement when both annotators assign labels randomly. $p_e$ is estimated using a per-annotator empirical prior over the class labels.

Let's unpack the documentation.

$p_o$ is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio)

The observed agreement ratio is the classification accuracy. This makes sense, as the classification accuracy is the number of predicted labels that agree with the true labels divided by the total number of attempts.

$p_e$ is the expected agreement when both annotators assign labels randomly

This means that $p_e$ is the expected classification accuracy when predicted labels are randomly assigned and true labels are randomly assigned.

$p_e$ is estimated using a per-annotator empirical prior over the class labels.

This means that the random labels respect each annotator's own empirical class distribution: the random predictions are sampled according to the class ratios of the predicted labels, and the random true labels according to the class ratios of the true labels. The reference given for how exactly this is calculated is Artstein and Poesio (2008), with the derivation completed on page $8$, which appears to be the same as the calculation given on Wikipedia.

Let $N$ be the total number of classification attempts; let there be $K$ categories; let $n_{k1}$ be the number of times label $k$ appears in the predictions; and let $n_{k2}$ be the number of times label $k$ is a true label. Then: $$p_e = \dfrac{1}{N^2}\overset{K}{\underset{k=1}{\sum}}n_{k1}n_{k2}$$

With these definitions for $p_o$ and $p_e$, we arrive at the sklearn calculation: $$ \kappa = (p_o - p_e) / (1 - p_e) $$
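To make the walkthrough concrete, here is a short sketch with a small, made-up three-class example that computes $p_o$ and $p_e$ from these definitions and compares the result with sklearn's cohen_kappa_score. The labels are arbitrary and only serve to check that the by-hand formula matches the library.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    # Small made-up 3-class example, only for checking the formula.
    y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
    y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0, 2])

    N = len(y_true)
    labels = np.unique(np.concatenate([y_true, y_pred]))

    # p_o: the observed agreement ratio, i.e. plain classification accuracy.
    p_o = np.mean(y_true == y_pred)

    # p_e: chance agreement from the per-annotator label marginals.
    n_k1 = np.array([(y_pred == k).sum() for k in labels])  # prediction counts
    n_k2 = np.array([(y_true == k).sum() for k in labels])  # true-label counts
    p_e = (n_k1 * n_k2).sum() / N**2

    kappa_by_hand = (p_o - p_e) / (1 - p_e)
    print(kappa_by_hand, cohen_kappa_score(y_true, y_pred))  # both are about 0.516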

and is it a good idea for my use case

Cohen's $\kappa$ is a function of the classification accuracy, so if you are interested in the classification accuracy, Cohen's $\kappa$ might be a statistic that gives a context for that accuracy. In particular, Cohen's $\kappa$ can be seen as a comparison between the classification accuracy of your model and the classification accuracy that comes from randomly assigning labels. An advantage of transforming the accuracy this way is that it exposes performance worse than random. For instance, a common complaint about classification accuracy is that it can be high for an imbalanced problem yet not indicate good performance, such as getting $95\%$ accuracy when $99\%$ of the observations belong to one category. While the $95\%$ accuracy looks high, running such a situation through the Cohen's $\kappa$ calculation is likely to expose such performance as being worse than it would be for random guessing. If this sounds appealing, then Cohen's $\kappa$ might be a good measure of performance for you.
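As an illustration of that last point (the numbers below are made up, not from the question's data): with $99\%$ of samples in one class, a classifier that reaches $95\%$ accuracy while never getting the minority class right ends up with a slightly negative $\kappa$, i.e. below chance level.

    import numpy as np
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    # Hypothetical imbalanced scenario: 990 samples of class 0, 10 of class 1.
    y_true = np.array([0] * 990 + [1] * 10)

    # A classifier that is right on 950 majority samples, wrong on the other
    # 40, and misses every minority sample: 95% accuracy overall.
    y_pred = np.array([0] * 950 + [1] * 40 + [0] * 10)

    print(accuracy_score(y_true, y_pred))     # 0.95
    print(cohen_kappa_score(y_true, y_pred))  # about -0.016, worse than chance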

A drawback of Cohen's $\kappa$ is that it requires you to bin continuous model outputs, such as those given by logistic regressions and neural networks. The usual criticisms of such threshold-based, discontinuous performance measures do not mention Cohen's $\kappa$ in particular, but all of them apply.
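For example (a minimal sketch with invented probabilities, not the asker's model), cohen_kappa_score only accepts discrete labels, so continuous outputs such as predicted class probabilities have to be collapsed, e.g. via argmax, before $\kappa$ can be computed:

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    # Invented class probabilities for 4 samples and 3 classes.
    proba = np.array([[0.7, 0.2, 0.1],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.3, 0.5],
                      [0.1, 0.1, 0.8]])
    y_true = np.array([0, 2, 2, 2])

    # Binning step: the probabilities are discarded in favour of hard labels.
    y_pred = proba.argmax(axis=1)
    print(cohen_kappa_score(y_true, y_pred))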

REFERENCES

Artstein, Ron, and Massimo Poesio. "Inter-coder agreement for computational linguistics." Computational linguistics 34.4 (2008): 555-596.
