Pseudo $R^2$ for probit model: In-sample or out-of-sample?

Question

I have a dataset test_data that measures mortality in response to dosage of a pesticide. I used a probit model to evaluate the efficacy of a single pesticide; for example, we would want to determine the median lethal dosage of pesticide A at 3 days. In the model, x=log10(dose) and y=mortality (0-100%), weighted by the total number of individuals tested.

My goal is to determine the reliability of the model and to compare it with that of other models, e.g. the median lethal dosage at 3 days for pesticide A vs. pesticide B. I am able to calculate the goodness of fit, but would like an additional test. I think pseudo-R^2 might be a good option.

Here is a reproducible example:

> dput(test_data)
structure(list(trt = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B", "B"), dose = c(5L, 50L, 500L, 5000L, 50000L, 500000L, 
5L, 50L, 500L, 5000L, 50000L, 500000L), proportion_dead = c(0, 
0.016666667, 0.25, 0.583333333, 0.916666667, 1, 0, 0.041666667, 
0.05, 0.416666667, 0.833333333, 1), total = c(120L, 120L, 120L, 
120L, 120L, 120L, 120L, 120L, 120L, 120L, 120L, 120L)), class = "data.frame", row.names = c(NA, 
-12L))

Here I build a model for pesticide A and B.

To calculate McFadden's pseudo-R^2, I compute 1 - residual deviance / null deviance from the model m1 or m2. I believe that is correct for in-sample pseudo-R^2? But my question is: Can I use in-sample pseudo-R^2? I think I could because the model is based on data that was collected in the past.

m1<- glm(proportion_dead ~ log10(dose), weights=total, data=test_data[test_data$trt=='A',], family=binomial(link='probit'))
m2<- glm(proportion_dead ~ log10(dose), weights=total, data=test_data[test_data$trt=='B',], family=binomial(link='probit'))
pr21 <- 1 - m1$deviance / m1$null.deviance
pr22 <- 1 - m2$deviance / m2$null.deviance
output:
> pr21
[1] 0.9932642
> pr22
[1] 0.9715011
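
For reference, McFadden's $R^2$ is usually written as $1 - \ell_{\text{model}} / \ell_{\text{null}}$, with $\ell$ the log-likelihood. With one row per individual, the deviance ratio above gives the same number, but with grouped data (one row per dose) the saturated model's log-likelihood is not zero and the two need not agree. The following is a minimal sketch for pesticide A, assuming the counts can be recovered by rounding proportion_dead * total; the names dat_a, bern_loglik, r2_deviance and r2_loglik are illustrative and not from the post.

```r
# Pesticide A rows, rebuilt from the dput() output above
dat_a <- data.frame(
  dose = c(5, 50, 500, 5000, 50000, 500000),
  proportion_dead = c(0, 0.016666667, 0.25, 0.583333333, 0.916666667, 1),
  total = rep(120, 6)
)
# Assumption: counts are recovered by rounding proportion * total
dat_a$dead  <- round(dat_a$proportion_dead * dat_a$total)
dat_a$alive <- dat_a$total - dat_a$dead

# Same probit model, written with counts instead of weighted proportions
fit <- glm(cbind(dead, alive) ~ log10(dose), data = dat_a,
           family = binomial(link = "probit"))

# Per-individual (Bernoulli) log-likelihood for given probabilities
bern_loglik <- function(dead, alive, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)  # guard against log(0)
  sum(dead * log(p) + alive * log(1 - p))
}

ll_model <- bern_loglik(dat_a$dead, dat_a$alive, fitted(fit))
# Intercept-only model: every individual gets the pooled mortality
ll_null  <- bern_loglik(dat_a$dead, dat_a$alive,
                        sum(dat_a$dead) / sum(dat_a$total))

r2_deviance <- 1 - fit$deviance / fit$null.deviance  # as computed in the post
r2_loglik   <- 1 - ll_model / ll_null                # individual-level McFadden
print(c(deviance_ratio = r2_deviance, loglik_ratio = r2_loglik))
```

The deviance ratio measures how close the fit is to the saturated model on the observed dose groups, while the log-likelihood ratio measures improvement over the pooled mortality per individual, so the second is typically noticeably smaller on grouped data.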

I am learning statistics, so any suggestions would be great. Thanks!

In a sample of only 12 observations, split sample validation will have obvious disadvantages. — AdamO, Nov 16 '23 at 18:32
For sure. This only includes dummy data from one experimental replicate. Many comparisons would be made with ~50 observations. Still not ideal, I know. — scott.pilgrim.vs.r, Nov 16 '23 at 18:38
It's a curse of dimensionality problem. The greater the complexity of your model, the greater the need for independent validation, but also the greater the impact of limited sample size on the model parameters. Harrell and Tibshirani have mentioned cross-validation and bootstrapping new datasets as a hybrid method to get the best of both worlds. — AdamO, Nov 16 '23 at 18:46

Dave · Accepted Answer · 2023-11-16T19:41:38.823

BEWARE OVERFITTING!

The danger of using an in-sample measure of performance is that it is quite easy to overfit to coincidences in the sample instead of to the real trends. That is, you can have strong measures of performance on the in-sample data, but when you go to use your model to make new predictions, those coincidences are not present, and your performance is poor.

Out-of-sample testing mimics the use case of making truly new predictions, so you can catch whether your model generalizes beyond the training data.

You might be able to drive your in-sample McFadden $R^2$ close to a perfect $1$. However, you need to do something to account for model complexity. Out-of-sample testing is one option. It has its drawbacks (for instance, you withhold valuable training data from model development, which could be quite damaging to studies like yours that have limited data), yet it is highly common.
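
As a rough illustration of out-of-sample evaluation on dose-response data like that in the question, here is a sketch that leaves out one dose group at a time, refits the probit model on the rest, and scores the held-out group. Leave-one-dose-out is only one possible scheme, chosen here because so few rows are available; the counts are the pesticide A mortalities converted from proportions, and the names dat_a, clip and r2_cv are made up for this example.

```r
# Pesticide A counts (proportion_dead * total from the question's data)
dat_a <- data.frame(
  dose  = c(5, 50, 500, 5000, 50000, 500000),
  dead  = c(0, 2, 30, 70, 110, 120),
  total = rep(120, 6)
)
dat_a$alive <- dat_a$total - dat_a$dead

# Keep probabilities away from 0 and 1 so that log() stays finite
clip <- function(p, eps = 1e-15) pmin(pmax(p, eps), 1 - eps)

ll_model <- ll_null <- numeric(nrow(dat_a))
for (k in seq_len(nrow(dat_a))) {
  train <- dat_a[-k, ]
  test  <- dat_a[k, ]

  # Refit without the held-out dose group
  fit <- glm(cbind(dead, alive) ~ log10(dose), data = train,
             family = binomial(link = "probit"))

  # Model prediction vs. the training-set pooled mortality as baseline
  p_hat  <- clip(predict(fit, newdata = test, type = "response"))
  p_base <- clip(sum(train$dead) / sum(train$total))

  # Held-out Bernoulli log-likelihoods
  ll_model[k] <- test$dead * log(p_hat)  + test$alive * log(1 - p_hat)
  ll_null[k]  <- test$dead * log(p_base) + test$alive * log(1 - p_base)
}

# Cross-validated McFadden-style R^2
r2_cv <- 1 - sum(ll_model) / sum(ll_null)
print(r2_cv)
```

Holding out the lowest or highest dose forces the model to extrapolate, so this scheme is fairly demanding; with more replicates, holding out whole replicates might be a more natural choice.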

Example

(This is a modification of a post of mine on Data Science.)

library(MLmetrics)
set.seed(2023)

# Function to calculate McFadden's R^2 the way I believe is correct
# See: https://stats.stackexchange.com/q/590199/247274
#
r2_mcfadden <- function(y_true, y_pred, y_mean){
  return(
    1 - (MLmetrics::LogLoss(y_pred, y_true))/(MLmetrics::LogLoss(y_mean, y_true))
  )
}

# Define sample size
#
N <- 1000

# Define number of parameters
#
p <- 750

# Define number of simulations to do
#
R <- 250

# Simulate data
#
X <- matrix(rnorm(N*p), N, p)

# Define the parameter vector to be 1, 0, 0, ..., 0, 0
#
B <- c(1, rep(0, p-1))

# Define the probability values from the probit link
# (stored as prob so the parameter count p is not overwritten)
#
prob <- pnorm(X %*% B)

in_sample <- out_of_sample <- rep(NA, R)
for (i in 1:R){

  # Simulate the binary outcome
  #
  y <- rbinom(N, 1, prob)

  # Fit a probit regression to all variables
  #
  model <- glm(y ~ X, family = binomial(link = "probit"))

  # Make probability predictions
  #
  preds <- predict.glm(model, type = "response")

  # Calculate the in-sample McFadden R^2
  #
  in_sample[i] <- r2_mcfadden(y, preds, mean(y))

  # Simulate some new data and calculate the out-of-sample McFadden R^2
  #
  y_test <- rbinom(N, 1, prob)
  out_of_sample[i] <- r2_mcfadden(y_test, preds, mean(y))

  if (i %% 50 == 0 || i < 6){
    print(paste(i/R*100, "% done"))
  }
}

# Summarize results
#
boxplot(in_sample, out_of_sample, names=c("in-sample", "out-of-sample"), main="McFadden R^2")
summary(in_sample)
summary(out_of_sample)

In this example, the in-sample McFadden $R^2$ is (within rounding) equal to a perfect $1$ every time. However, every out-of-sample McFadden $R^2$ is less than zero, indicating that the predictions would have been better (in terms of log loss) by predicting the in-sample mean every time, which I claim is a reasonable baseline model with "must-beat" performance.

(It took me $\sim 30$ minutes to run that block of code, which is longer than I like for an example on here, but you can lower R to get a quicker run time.)
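
For a version that should finish in seconds, the same simulation can be scaled down. The sketch below keeps the ratio of parameters to observations at 0.75 but uses N = 200, 150 predictors and 5 repetitions, values picked here only for speed. It also replaces MLmetrics::LogLoss with a small base-R helper (log_loss, a name introduced for this example) so that no extra package is needed, and it suppresses the convergence warnings that glm emits when the fit is this overparameterized.

```r
set.seed(2023)

# Base-R log loss, clipped so that log(0) never occurs
log_loss <- function(y_true, y_pred, eps = 1e-15) {
  y_pred <- pmin(pmax(y_pred, eps), 1 - eps)
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}

# McFadden's R^2 relative to always predicting the training mean
r2_mcfadden <- function(y_true, y_pred, y_mean) {
  1 - log_loss(y_true, y_pred) / log_loss(y_true, rep(y_mean, length(y_true)))
}

N <- 200       # sample size (reduced for speed)
n_pred <- 150  # number of predictors (same 0.75 ratio to N)
R <- 5         # number of repetitions

X <- matrix(rnorm(N * n_pred), N, n_pred)
B <- c(1, rep(0, n_pred - 1))          # only the first predictor matters
prob <- as.vector(pnorm(X %*% B))      # true probabilities, probit link

in_sample <- out_of_sample <- rep(NA_real_, R)
for (i in seq_len(R)) {
  y <- rbinom(N, 1, prob)
  # Warnings about fitted probabilities of 0 or 1 are expected here
  model <- suppressWarnings(glm(y ~ X, family = binomial(link = "probit")))
  preds <- predict(model, type = "response")

  in_sample[i] <- r2_mcfadden(y, preds, mean(y))

  # Fresh outcomes for the same X, scored against the same predictions
  y_test <- rbinom(N, 1, prob)
  out_of_sample[i] <- r2_mcfadden(y_test, preds, mean(y))
}

summary(in_sample)
summary(out_of_sample)
```

With so few repetitions the summaries are noisy, but the gap between the in-sample and out-of-sample values should still be easy to see.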
