
Here, UCLA gives a number of pseudo $R^2$ values for evaluating logistic regression models. The last two deal with hard classifications rather than the probabilistic model outputs, despite the issues with evaluating models that way.

The second-to-last pseudo $R^2$, "count", makes sense, as it is just the proportion classified correctly ("accuracy"). What is the interpretation of the final pseudo $R^2$, the "adjusted count"?

$$ R^2_{\text{AdjustedCount}} = \dfrac{\text{Correct} - n}{\text{Total} - n} $$

Dave

1 Answer


This equals the proportional decrease in error rate that I discuss here and call $R^2_{\text{accuracy}}$, though it takes some algebra to see why.

$$ R^2_{\text{accuracy}} = 1 - \dfrac{ \text{Error rate of the model under consideration} }{ \text{Error rate of a model that naïvely predicts the majority class every time} } $$

To simplify the calculation, I will shorten the notation.

$$ E_1 = \text{Error rate of the model under consideration} $$
$$ E_0 = \text{Error rate of a model that naïvely predicts the majority class every time} $$
$$ N = \text{Number of classification attempts (sample size)} $$

$$ R^2_{\text{accuracy}} = 1 - \dfrac{ E_1 }{ E_0 }= \dfrac{ E_0 - E_1 }{ E_0 } $$
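For concreteness, here is a minimal R sketch of that ratio on made-up labels (the vectors `true` and `pred` below are invented purely for illustration, not taken from the UCLA page):

# Made-up hard classifications, purely for illustration
true <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1)   # observed labels (majority class is 1, appearing 7 times)
pred <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)   # hard predictions from some model (2 mistakes)

E1 <- mean(true != pred)                   # error rate of the model: 2/10 = 0.2
E0 <- 1 - max(table(true)) / length(true)  # error rate of always predicting the majority class: 3/10 = 0.3
1 - E1 / E0                                # R^2_accuracy: 1 - 0.2/0.3 = 1/3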

Next, let's break down what the three components of the UCLA fraction mean in this terminology.

For "correct", multiply the accuracy of your model by the total number of classification attempts. Since $E_1$ is the error rate of your model, $1-E_1$ is the accuracy, so $\text{Correct} = N(1-E_1)$.

For "n", apply similar logic but to the model that naïvely predicts the majority class every time. The error rate for such a model is $E_0$, so its accuracy is $1-E_0$. Consequently, the total number of correct predictions by the model that naïvely predicts the majority class every time is $N(1-E_0)$.

Finally, "total" is easy: it's exactly $N$.

Now it's time to plug in and do the algebra.

$$
\begin{aligned}
R^2_{\text{AdjustedCount}} &= \dfrac{\text{Correct} - n}{\text{Total} - n} = \dfrac{ N(1-E_1) - N(1-E_0) }{ N - N(1-E_0) }\\
&= \dfrac{ (1-E_1) - (1-E_0) }{ 1 - (1-E_0) }\\
&= \dfrac{ 1-E_1 - 1 + E_0 }{ 1-1+E_0 }\\
&= \dfrac{ E_0 - E_1 }{ E_0 }\\
&= \dfrac{E_0}{E_0}-\dfrac{E_1}{E_0}\\
&= 1 -\dfrac{ E_1 }{ E_0 }\\
&= R^2_{\text{accuracy}}
\end{aligned}
$$

$\square$
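As a quick numeric check with invented numbers: suppose $N = 1000$, the naïve majority-class model errs $30\%$ of the time ($E_0 = 0.3$), and the model under consideration errs $15\%$ of the time ($E_1 = 0.15$). Then $\text{Correct} = N(1-E_1) = 850$, $n = N(1-E_0) = 700$, and $\text{Total} = 1000$, and both expressions give the same value:

$$ R^2_{\text{AdjustedCount}} = \dfrac{850 - 700}{1000 - 700} = \dfrac{150}{300} = 0.5, \qquad R^2_{\text{accuracy}} = 1 - \dfrac{0.15}{0.30} = 0.5 $$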

EDIT

An R simulation could be fun to show that the two are equal.

set.seed(2023)
R <- 10000 # Number of times to repeat the loop
N <- 1000  # Number of samples within each loop

# Function to calculate UCLA's "count"
count <- function(correct, total_count){
  return( correct/total_count )
}

# Function to calculate UCLA's "adjusted count"
count_adj <- function(correct, total_count, n_most_common){
  return( (correct - n_most_common) / (total_count - n_most_common) )
}

# Function to calculate my R^2_accuracy
r2_accuracy <- function(model_error_rate, naive_error_rate){
  return( 1 - (model_error_rate)/(naive_error_rate) )
}

# Blank vector to hold differences between adjusted count and R^2_accuracy
d <- rep(NA, R)

# Loop R-many times
for (i in 1:R){

  # Define the true event probabilities
  p1 <- runif(N, 0.1, 0.9)

  # Simulate 0/1 events with probability p1
  true <- rbinom(N, 1, p1)

  # Define probability of a model making a mistake
  p2 <- runif(N, 0.1, 0.9)

  # Define the predictions as the true values plus some noise term
  # Then mod by 2 so all values are 0 or 1
  pred <- (true + rbinom(N, 1, p2)) %% 2

  # Define the number of correct predictions
  n_correct <- length(true) - sum((true - pred)^2)

  # Define the sample size
  total_count <- length(true)

  # Define the number of values belonging to the most common label
  n_most_common <- max(table(true))

  # Define the accuracy of the predictions using the "count" function
  # (Yes, it's proportion classified correctly instead of accuracy percentage)
  model_accuracy <- count(length(true) - sum((true - pred)^2), length(true))

  # Define the error rate of the predictions
  model_error_rate <- 1 - model_accuracy

  # Define the accuracy of naively predicting the majority category every time
  # (Yes, it's proportion classified correctly instead of accuracy percentage)
  naive_accuracy <- max(table(true))/length(true)

  # Define the error rate of naively predicting the majority category every time
  naive_error_rate <- 1 - naive_accuracy

  # Calculate and store the difference between UCLA's adjusted count and
  # my R^2_accuracy
  d[i] <- count_adj( n_correct, total_count, n_most_common ) -
          r2_accuracy( model_error_rate, naive_error_rate )
}

# Print a summary of the differences between my calculation and UCLA's,
# revealing the two to be the same (up to differences that can be attributed
# to doing math on a computer (floating point errors))
summary(d)

################################################################################
# OUTPUT
################################################################################

> summary(d)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-2.498e-16 -2.776e-17  1.344e-17  2.025e-17  6.939e-17  2.776e-16

The differences between my calculations and the UCLA adjusted count calculations are on the order of $10^{-16}$. This is R's way of saying that the difference between the UCLA adjusted count and my $R^2_{\text{accuracy}}$ is zero in every one of the ten thousand checks. (Such differences are attributable to floating point errors coming from doing math on a computer.)
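As a single deterministic check outside the loop, one could also feed the functions defined above a small fixed example (reusing the made-up labels from the sketch earlier in this answer), which avoids floating point noise entirely:

# A single fixed example, labels invented for illustration
true <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1)
pred <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)

n_correct     <- sum(true == pred)   # 8 correct predictions
total_count   <- length(true)        # 10 classification attempts
n_most_common <- max(table(true))    # 7 cases in the majority class

count_adj(n_correct, total_count, n_most_common)                        # (8 - 7)/(10 - 7) = 1/3
r2_accuracy(1 - n_correct/total_count, 1 - n_most_common/total_count)   # 1 - 0.2/0.3   = 1/3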

Dave
  • As usual, there are issues with mapping the rich probabilistic output of a logistic regression to discrete categories, particularly if a software-default threshold of $0.5$ is used without the statistician thinking. Nonetheless, there can be times when discrete categorical predictions must be assessed, and I was quite happy to learn that others have thought about this $R^2_{accuracy}$, even if their construction of it differs from my own. – Dave Feb 18 '23 at 01:34