
Code using glm:

model <- glm(vs ~ hp, data = mtcars, family = binomial)
newdata <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
newdata$vs <- predict(model, newdata, type = "response")
plot(vs ~ hp, data = mtcars)
lines(vs ~ hp, newdata, lwd = 2)
plogis(predict(model, newdata = data.frame(hp = 150)))  # 0.1294217

Code using cdplot:

mtcarscdplot <- cdplot(as.factor(vs) ~ hp, data = mtcars)
mtcarscdplot$`1`(150)  # 0.3168849

The probabilities at hp = 150 should at least be close, but the glm gives 0.1294217 while the cdplot gives 0.3168849 - a massive difference. Could anyone shed light on this, please? Which probability is the correct (and reportable) one, and where is the other one going wrong?

1 Answer


It helps to think about what lies behind each of these and how they are generated.

The conditional density plot (cdplot) is an exploratory data analysis approach for looking at the relationship between a binary and a continuous variable. It is not a model. Basically, two kernel density estimates are made (one for each level of the binary variable), and you get the first as a proportion of the sum of both densities. I illustrate the process here: Any necessary EDA before logistic? I make them occasionally, mostly because they're quick and easy. There are other EDA methods as well.
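
As a rough sketch of that idea (this mirrors the resampling code further down, not cdplot's exact internals, whose smoothing defaults differ, so the number will not match the cdplot output exactly), you can form the class-wise kernel densities of hp yourself and take their ratio at the grid point nearest hp = 150:

# Kernel density of hp within each level of vs, on a common grid
d1 <- with(mtcars, density(hp[vs == 1], from = 50, to = 335))
d0 <- with(mtcars, density(hp[vs == 0], from = 50, to = 335))

# Proportion of the total density attributable to vs == 1 at each grid point
p1 <- d1$y / (d1$y + d0$y)

# Estimated proportion at the grid point closest to hp = 150
p1[which.min(abs(d1$x - 150))]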

The logistic regression model is a statistical model. However, your model only fits a straight line (in the log-odds space). It's quite possible that a straight line is not appropriate. We can try adding a squared term, but: 1) with only N = 32 cars (14 with vs = 1 and 18 with vs = 0), we just don't have much information with which to fit a flexible curve; and 2) this still isn't as flexible as the conditional density plot. Notice the predicted probability is now even lower!

m2 <- glm(vs ~ poly(hp, 2), data = mtcars, family = binomial)
lines(50:335, predict(m2, data.frame(hp = 50:335), type = "response"),
      col = "red")
predict(m2, data.frame(hp = 150), type = "response")  # 0.1041799

[Figure: scatterplot of vs against hp with the straight-line logistic fit (black) and the quadratic fit (red)]

Assuming the straight-line fit is correct, you can get a 95% confidence interval for that predicted probability. It definitely includes .317:

# CI on the link (log-odds) scale, then back-transformed to the probability scale
y.hat.150 <- predict(model, data.frame(hp = 150), type = "link", se.fit = TRUE)
with(y.hat.150, plogis(c(fit - (1.96 * se.fit), fit + (1.96 * se.fit))))
# 0.01580684 0.57913339

If you're married to the CD plot, and you're sure the kernel and the default bandwidth estimation are appropriate, you could resample to get a sense of the uncertainty there. Admittedly I should put more effort into polishing this; it's kind of a mess:

set.seed(7454)
mat <- matrix(NA, nrow = 512, ncol = 100)  # 512 grid points x 100 bootstrap resamples
for(i in 1:100){
  dat <- mtcars[sample(1:32, size = 32, replace = TRUE), c("hp", "vs")]
  dy  <- with(dat, density(hp[vs == 1], from = 50, to = 335))
  dn  <- with(dat, density(hp[vs == 0], from = 50, to = 335))
  sm  <- dy$y + dn$y
  mat[, i] <- dy$y / sm  # estimated proportion for vs == 1 at each grid point
}
windows()  # opens a new graphics device on Windows; use dev.new() on other platforms
  plot(1, 1, xlim = c(50, 335), ylim = c(0, 1), ylab = "CD estimated proportion",
       xlab = "hp")
  for(i in 1:100){
    lines(seq(50, 335, length.out = 512), mat[, i],
          col = rgb(.25, .25, .25, alpha = .25), lwd = 2)
  }

[Figure: 100 bootstrapped CD-plot proportion curves for vs = 1 across hp]

Oddly, there's a big dip in the maximum at 150 hp, but we can get a crude approximation of a 95% CI there (it'd be better to run more iterations to get better precision in the tails):

dy$x[180]  # grid point closest to 150: [1] 149.8337
round(sort(mat[180, ])[c(3, 98)], 3)  # crude 95% interval: [1] 0.000 0.289
  • Thank you very much, @gung, this was hugely helpful. In my real data I need to explain (and potentially predict) a binary response variable. When you say the cdplot is not a model - does this mean that passing an x into the function created by cdplot (mtcarscdplot$`1`(150) in my question) cannot be justifiably used for this? – Reader 123 Mar 30 '22 at 09:53
  • It would never have even occurred to me to use cdplot as a classifier, @Reader123. If the only thing you care about is prediction, & you want to try it, go ahead. By using a Gaussian kernel, it's sort of a non-parametric version of discriminant analysis. I would nest it in a cross-validation scheme to assess its out-of-sample predictive performance relative to other options. But really, my default would be to use logistic regression with a spline function & an elastic net penalty, w/ the hyperparameters determined by CV performance (a rough sketch of that follows below). – gung - Reinstate Monica Mar 30 '22 at 11:16
  • Maybe I'm just weird. :-D Thank you, @gung, I will work on the logistic regression then. Thanks so much - you have already answered so many of my questions on this board, just want to say that your generosity with your time and knowledge is much appreciated. – Reader 123 Mar 30 '22 at 11:22
  • You're welcome, @Reader123, although I think you're giving me too much credit. I don't recall having answered a question from you before. – gung - Reinstate Monica Mar 30 '22 at 16:32
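
As a rough sketch of the spline-plus-elastic-net approach gung mentions in the comments above (not code from the answer): the df = 4 spline basis, the alpha = 0.5 mixing parameter, and the 5 folds below are purely illustrative choices, and in practice they would themselves be tuned by cross-validated performance, as the comment suggests.

library(splines)  # ns(): natural cubic spline basis
library(glmnet)   # cv.glmnet(): cross-validated elastic net

# Spline basis for hp (df = 4 is an arbitrary illustrative choice)
X <- ns(mtcars$hp, df = 4)
y <- mtcars$vs

# Elastic-net logistic regression; alpha = 0.5 mixes ridge and lasso,
# lambda is chosen by cross-validation (the folds are small with only 32 cars)
set.seed(1)
cvfit <- cv.glmnet(x = X, y = y, family = "binomial", alpha = 0.5, nfolds = 5)

# Predicted probability at hp = 150, using the spline basis evaluated there
newX <- predict(X, newx = 150)
predict(cvfit, newx = newX, s = "lambda.min", type = "response")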