Let's consider a true classification problem, that is, one where the predictor makes categorical predictions (not probabilities).
It makes sense to assess the accuracy of such a predictor. However, that accuracy should be given context. The easiest example of this is class imbalance: a model can achieve a $90\%$ accuracy that sounds impressive, but if $95\%$ of the outcomes belong to a single category, then that $90\%$ accuracy is worse than what could have been achieved by predicting the dominant category every time.
UCLA gives a name, adjusted count, to a metric that compares a model's accuracy to the accuracy of always predicting the majority label. For reasons that I discuss here and here, I call it $R^2_{\text{accuracy}}$.
However, this assumes that the right "baseline" level of performance is that which comes from always guessing the majority category. An alternative strategy is to randomly guess all of the possible labels according to their relative proportions. If half of the labels are dogs, guess "dog" half the time. If $40\%$ of the labels are cats, guess "cat" $40\%$ of the time. If $10\%$ of the labels are aardvarks, guess "aardvark" $10\%$ of the time. Then the baseline performance would be the expected performance of such a strategy.
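For the example proportions above, the expected accuracy of this proportional-guessing baseline can be worked out directly, assuming each guess is drawn independently of the true label:

$$
\Pr(\text{correct}) \;=\; \sum_c p_c^2 \;=\; 0.5^2 + 0.4^2 + 0.1^2 \;=\; 0.25 + 0.16 + 0.01 \;=\; 0.42,
$$

which is below the $0.5$ expected accuracy of always guessing "dog".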
Given that we have the proportions for each label, what would be the distribution and expected value of the accuracy from randomly guessing this way? Simulation is straightforward enough (just sample from the true labels many times and measure the accuracy each time), but I feel like there should be a closed-form, algebraic solution where we input the proportions and get a distribution.
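The simulation described above can be sketched as follows; the proportions, sample size, and number of runs are hypothetical choices, not anything fixed by the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical label proportions from the example: dog, cat, aardvark
p = np.array([0.5, 0.4, 0.1])
n = 1_000        # assumed number of observations
n_sims = 10_000  # assumed number of simulated guessing runs

# Fix a set of "true" labels drawn according to the proportions
y_true = rng.choice(len(p), size=n, p=p)

# Each run guesses every label independently with those same proportions
guesses = rng.choice(len(p), size=(n_sims, n), p=p)
accuracies = (guesses == y_true).mean(axis=1)

print(accuracies.mean())  # close to the sum of squared proportions
print((p ** 2).sum())
```

As for a closed form: conditional on the counts $n_c$ of each true label, the number of correct guesses is a sum of independent binomials, $\sum_c \mathrm{Binomial}(n_c, p_c)$, since each of the $n_c$ instances of class $c$ is guessed correctly with probability $p_c$. Its mean is $\sum_c n_c p_c \approx n \sum_c p_c^2$, and dividing by $n$ gives the distribution of the accuracy itself.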
(My simulations show that this guessing strategy never has higher expected accuracy than always predicting the majority category, but I still want to know the distribution of accuracy values it produces.)
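For what it's worth, the simulation result about expected performance has a one-line sketch of a justification (writing $p_{\max}$ for the largest proportion): since every $p_c \le p_{\max}$,

$$
\sum_c p_c^2 \;\le\; \sum_c p_c \, p_{\max} \;=\; p_{\max},
$$

so proportional guessing cannot beat the majority rule in expectation.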
> The alternative strategy is always a worse strategy.

@Glen_b I'll probably ask another question about why that is the case. Until then, though, I am curious about the distribution of accuracy values under the alternative strategy. – Dave Sep 07 '23 at 00:19