
Here, UCLA gives a number of pseudo $R^2$ values for evaluating logistic regression models. The last two deal with hard classifications rather than the probabilistic model outputs, despite the issues with evaluating models that way.

The second-to-last pseudo $R^2$, "count", makes sense, as it is just the proportion classified correctly ("accuracy"). What is the interpretation of the final pseudo $R^2$, the "adjusted count"?

$$ R^2_{\text{AdjustedCount}} = \dfrac{\text{Correct} - n}{\text{Total} - n} $$

Dave

1 Answer


This equals the proportional decrease in error rate that I discuss here and call $R^2_{\text{accuracy}}$, though it takes some algebra to see why.

$$ R^2_{\text{accuracy}} = 1 - \dfrac{ \text{Error rate of the model under consideration} }{ \text{Error rate of a model that naïvely predicts the majority class every time} } $$

To simplify the calculation, I will shorten the notation.

$$ E_1 = \text{Error rate of the model under consideration} $$
$$ E_0 = \text{Error rate of a model that naïvely predicts the majority class every time} $$
$$ N = \text{Number of classification attempts (sample size)} $$

$$ R^2_{\text{accuracy}} = 1 - \dfrac{ E_1 }{ E_0 }= \dfrac{ E_0 - E_1 }{ E_0 } $$
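For concreteness, here is a minimal R sketch of that ratio on made-up labels (the vectors `true` and `pred` below are invented purely for illustration, not taken from the UCLA page):

# Made-up hard classifications, purely for illustration
true <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1)   # observed labels (majority class is 1, appearing 7 times)
pred <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)   # hard predictions from some model (2 mistakes)

E1 <- mean(true != pred)                   # error rate of the model: 2/10 = 0.2
E0 <- 1 - max(table(true)) / length(true)  # error rate of always predicting the majority class: 3/10 = 0.3
1 - E1 / E0                                # R^2_accuracy: 1 - 0.2/0.3 = 1/3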

Next, let's break down what the three components of the UCLA fraction mean in this terminology.

For "correct", multiply the accuracy of your model by the total number of classification attempts. Since $E_1$ is the error rate of your model, $1-E_1$ is the accuracy, so $\text{Correct} = N(1-E_1)$.

For "n", apply similar logic but to the model that naïvely predicts the majority class every time. The error rate for such a model is $E_0$, so its accuracy is $1-E_0$. Consequently, the total number of correct predictions by the model that naïvely predicts the majority class every time is $N(1-E_0)$.

Finally, "total" is easy: it's exactly $N$.

Now it's time to plug in and do the algebra.

$$
\begin{aligned}
R^2_{\text{AdjustedCount}} &= \dfrac{\text{Correct} - n}{\text{Total} - n} = \dfrac{ N(1-E_1) - N(1-E_0) }{ N - N(1-E_0) }\\
&= \dfrac{ (1-E_1) - (1-E_0) }{ 1 - (1-E_0) }\\
&= \dfrac{ 1-E_1 - 1 + E_0 }{ 1-1+E_0 }\\
&= \dfrac{ E_0 - E_1 }{ E_0 }\\
&= \dfrac{E_0}{E_0}-\dfrac{E_1}{E_0}\\
&= 1 -\dfrac{ E_1 }{ E_0 }\\
&= R^2_{\text{accuracy}}
\end{aligned}
$$

$\square$
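As a quick numeric check with invented numbers: suppose $N = 1000$, the naïve majority-class model errs $30\%$ of the time ($E_0 = 0.3$), and the model under consideration errs $15\%$ of the time ($E_1 = 0.15$). Then $\text{Correct} = N(1-E_1) = 850$, $n = N(1-E_0) = 700$, and $\text{Total} = 1000$, and both expressions give the same value:

$$ R^2_{\text{AdjustedCount}} = \dfrac{850 - 700}{1000 - 700} = \dfrac{150}{300} = 0.5, \qquad R^2_{\text{accuracy}} = 1 - \dfrac{0.15}{0.30} = 0.5 $$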

EDIT

An R simulation could be fun to show that the two are equal.

set.seed(2023)
R <- 10000 # Number of times to repeat the loop
N <- 1000  # Number of samples within each loop

# Function to calculate UCLA's "count"
count <- function(correct, total_count){
  return( correct/total_count )
}

# Function to calculate UCLA's "adjusted count"
count_adj <- function(correct, total_count, n_most_common){
  return( (correct - n_most_common) / (total_count - n_most_common) )
}

# Function to calculate my R^2_accuracy
r2_accuracy <- function(model_error_rate, naive_error_rate){
  return( 1 - (model_error_rate)/(naive_error_rate) )
}

# Blank vector to hold differences between adjusted count and R^2_accuracy
d <- rep(NA, R)

# Loop R-many times
for (i in 1:R){

  # Define the true event probabilities
  p1 <- runif(N, 0.1, 0.9)

  # Simulate 0/1 events with probability p1
  true <- rbinom(N, 1, p1)

  # Define probability of a model making a mistake
  p2 <- runif(N, 0.1, 0.9)

  # Define the predictions as the true values plus some noise term
  # Then mod by 2 so all values are 0 or 1
  pred <- (true + rbinom(N, 1, p2)) %% 2

  # Define the number of correct predictions
  n_correct <- length(true) - sum((true - pred)^2)

  # Define the sample size
  total_count <- length(true)

  # Define the number of values belonging to the most common label
  n_most_common <- max(table(true))

  # Define the accuracy of the predictions using the "count" function
  # (Yes, it's proportion classified correctly instead of accuracy percentage)
  model_accuracy <- count(length(true) - sum((true - pred)^2), length(true))

  # Define the error rate of the predictions
  model_error_rate <- 1 - model_accuracy

  # Define the accuracy of naively predicting the majority category every time
  # (Yes, it's proportion classified correctly instead of accuracy percentage)
  naive_accuracy <- max(table(true))/length(true)

  # Define the error rate of naively predicting the majority category every time
  naive_error_rate <- 1 - naive_accuracy

  # Calculate and store the difference between UCLA's adjusted count and
  # my R^2_accuracy
  d[i] <- count_adj( n_correct, total_count, n_most_common ) -
          r2_accuracy( model_error_rate, naive_error_rate )
}

# Print a summary of the differences between my calculation and UCLA's,
# revealing the two to be the same (up to differences that can be attributed
# to doing math on a computer (floating point errors))
summary(d)

################################################################################
# OUTPUT
################################################################################

> summary(d)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-2.498e-16 -2.776e-17  1.344e-17  2.025e-17  6.939e-17  2.776e-16

The differences between my calculations and the UCLA adjusted count calculations are on the order of $10^{-16}$. This is R's way of saying that the difference between the UCLA adjusted count and my $R^2_{\text{accuracy}}$ is zero in every one of the ten thousand checks. (Such differences are attributable to floating point errors coming from doing math on a computer.)
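As a single deterministic check outside the loop, one could also feed the functions defined above a small fixed example (reusing the made-up labels from the sketch earlier in this answer), which avoids floating point noise entirely:

# A single fixed example, labels invented for illustration
true <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1)
pred <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)

n_correct     <- sum(true == pred)   # 8 correct predictions
total_count   <- length(true)        # 10 classification attempts
n_most_common <- max(table(true))    # 7 cases in the majority class

count_adj(n_correct, total_count, n_most_common)                        # (8 - 7)/(10 - 7) = 1/3
r2_accuracy(1 - n_correct/total_count, 1 - n_most_common/total_count)   # 1 - 0.2/0.3   = 1/3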

Dave
  • As usual, there are issues with mapping the rich probabilistic output of a logistic regression to discrete categories, particularly if a software-default threshold of $0.5$ is used without the statistician thinking. Nonetheless, there can be times when discrete categorical predictions must be assessed, and I was quite happy to learn that others have thought about this $R^2_{accuracy}$, even if their construction of it differs from my own. – Dave Feb 18 '23 at 01:34