
I have fitted a logistic regression to my data; a reproducible example is below:

    library(dplyr)
    data(mtcars)

    Model = glm(vs ~ wt + disp, data = mtcars, family = "binomial")

    Pred = predict(Model)
    Pred_Prob = exp(Pred) / (1 + exp(Pred))
    Error = (mtcars[, 'vs'] - Pred_Prob) / sqrt(Pred_Prob * (1 - Pred_Prob))
    plot(Pred_Prob, Error)

This gives the plot below:

[plot: Error against Pred_Prob, showing two downward-sloping, converging lines]

Is this the plot one would naturally expect from fitting a logistic regression? Why am I seeing two converging lines? If it is not, what else can be done to improve the model fit for the above dataset?

Bogaso
  • predict.glm already returns probabilities. What are you trying to show on this plot? – Tim Mar 27 '22 at 15:21
  • predict.glm(model) gives values like -5.193, so I don't think it returns probabilities directly; it looks like it returns the log odds. I am trying to understand how this plot should look for a good logistic fit. Is the above model a good fit? Is there any clustering in the residuals? – Bogaso Mar 27 '22 at 15:26
  • The lower line is for those cars with V-shaped cylinder layouts ($vs=0$) and the upper line for those cars with straight cylinders ($vs=1$). This is what you would expect to see on a residuals plot when predicting a binary variable – Henry Mar 27 '22 at 15:37
  • @Henry Yes, that was a mistake. I have corrected the code and modified the plot. Could you please elaborate on your statement? Why do I see a downward trend in this error plot? Shouldn't it look like random fluctuation for an ideal fit? – Bogaso Mar 27 '22 at 15:39
  • Since Pred_Prob must be between $0$ and $1$, the residuals mtcars[, 'vs'] - Pred_Prob must be positive when $vs=1$ and negative when $vs=0$: this leads to the two lines. Since you have fitted the model to the data, you would expect the residuals on the lower line to be closer to $0$ when the predicted probabilities are closer to $0$ (the left side of the lower line) and the residuals on the upper line to be closer to $0$ when the predicted probabilities are closer to $1$ (the right side of the upper line), making both lines downward sloping. – Henry Mar 27 '22 at 16:46
  • If you had not divided by sqrt(Pred_Prob*(1-Pred_Prob)) then both lines would be straight with slopes of $-1$ (see the sketch just after these comments) – Henry Mar 27 '22 at 16:47
  • Okay, that makes sense. So in that case, can it be concluded that the fitted model is a good fit to the data? – Bogaso Mar 27 '22 at 17:12
  • One more important question: if I need to check the residuals, e.g. for independence, with a qq-plot, etc., should I do this separately for vs = 1 and vs = 0, i.e. separate checks for the positive and negative errors? – Bogaso Mar 27 '22 at 17:21
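
A quick sketch of what Henry describes in the comments above, using R's built-in residuals(..., type = "pearson") and predict(..., type = "response"), which reproduce the manual Error and Pred_Prob from the question:

    ## refit the model from the question; type = "response" gives fitted
    ## probabilities directly, and residuals(..., type = "pearson") matches
    ## the manual Error above
    Model     = glm(vs ~ wt + disp, data = mtcars, family = "binomial")
    Pred_Prob = predict(Model, type = "response")
    Raw       = mtcars$vs - Pred_Prob                 # two straight lines with slope -1
    Pearson   = residuals(Model, type = "pearson")    # the two converging, curved lines

    par(mfrow = c(1, 2))
    plot(Pred_Prob, Raw,     main = "raw residuals")
    plot(Pred_Prob, Pearson, main = "Pearson residuals")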

1 Answer


It may help you to read my answer to Interpretation of plot (glm.model). In brief, I would not (do not) use those plots for logistic regression. The first question I would ask is: what are you trying to find out? Is a plot helpful for discovering that? What plot might do so? Etc. For example, you mention looking at a qq-plot. For a logistic regression, the outcome is binary; thus, the distribution is the Bernoulli or binomial. The data are not normal, nor need they be. We know this in advance. There could be a case where there is value in looking at the qq-plot, but I'm not sure what it would be in general.
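
As a small sketch of that point (using a simulated dataset, not the data from the question, with an arbitrary intercept and slope), the Pearson residuals look clearly non-normal on a qq-plot even when the logistic model is correct by construction:

    ## simulate from a logistic model that is correct by construction, then
    ## look at a normal qq-plot of its Pearson residuals: they are clearly
    ## non-normal even though the model is exactly right
    set.seed(1)
    n   = 500
    x   = rnorm(n)
    p   = plogis(-0.5 + 1.2*x)           # true probabilities
    y   = rbinom(n, size=1, prob=p)      # Bernoulli outcomes
    fit = glm(y ~ x, family="binomial")

    qqnorm(residuals(fit, type="pearson"))
    qqline(residuals(fit, type="pearson"))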

With respect to assessing a fitted logistic regression model, one thing you might want to check is whether a linear function of X was appropriate (assuming you used that). In such a case, rather than looking at a plot of the residuals vs X, as I might with a linear model, I might compare the fitted model to something non-parametric, such as a LOWESS fit. To illustrate this idea, consider the example below (adapted from ROC curve crossing the diagonal):

    ## data
    Cond.1 = c(2.9, 3.0, 3.1, 3.1, 3.1, 3.3, 3.3, 3.4, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.7,
               3.8, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0, 4.1, 4.1, 4.2, 4.4, 4.5, 4.5, 4.5, 4.6,
               4.6, 4.6, 4.7, 4.8, 4.9, 4.9, 5.5, 5.5, 5.7)
    Cond.2 = c(2.3, 2.4, 2.6, 3.1, 3.7, 3.7, 3.8, 4.0, 4.2, 4.8, 4.9, 5.5, 5.5, 5.5, 5.7,
               5.8, 5.9, 5.9, 6.0, 6.0, 6.1, 6.1, 6.3, 6.5, 6.7, 6.8, 6.9, 7.1, 7.1, 7.1,
               7.2, 7.2, 7.4, 7.5, 7.6, 7.6, 10, 10.1, 12.5)
    dat    = stack(list(cond1=Cond.1, cond2=Cond.2))
    ord    = order(dat$values)
    dat    = dat[ord,]  # now the data are sorted

    ## logistic regression model & LOWESS fit
    lr.model1 = glm(ind~values, dat, family="binomial")   # straight line fit
    lr.preds1 = predict(lr.model1, data.frame(values=seq(2.3,12.5,by=.1)), type="response")
    LOWESS    = loess((as.numeric(ind)-1)~values, dat, degree=0)   # non-parametric fit
    LOWESS.pd = predict(LOWESS, data.frame(values=seq(2.3,12.5,by=.1)))

    ## here I plot the data, the model, & overlay a LOWESS fit
    windows()   # opens a new graphics device on Windows; use dev.new() elsewhere
    with(dat, plot(values, ifelse(ind=="cond2",1,0),
                   ylab="predicted probability of condition2"))
    lines(seq(2.3,12.5,by=.1), lr.preds1, lwd=2, col="red")
    lines(seq(2.3,12.5,by=.1), LOWESS.pd, lwd=2, col="blue")
    legend("bottomright", legend=c("model", "LOWESS"), lwd=2, col=c("red", "blue"))

[plot: data points with the fitted logistic curve (red) and the LOWESS fit (blue) overlaid]
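
The same kind of check could be sketched for the model in the question, assuming disp is held at its mean so the fit can be drawn against wt alone (with only 32 cars the LOWESS fit will be quite noisy):

    ## compare the fitted logistic curve in wt (disp fixed at its mean)
    ## with a non-parametric LOWESS fit of vs on wt
    Model   = glm(vs ~ wt + disp, data = mtcars, family = "binomial")
    wt.grid = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100)
    newdat  = data.frame(wt = wt.grid, disp = mean(mtcars$disp))
    fit.pr  = predict(Model, newdat, type = "response")
    LOW     = loess(vs ~ wt, mtcars, degree = 0)
    LOW.pr  = predict(LOW, data.frame(wt = wt.grid))

    with(mtcars, plot(wt, vs, ylab = "probability of vs = 1"))
    lines(wt.grid, fit.pr, lwd = 2, col = "red")
    lines(wt.grid, LOW.pr, lwd = 2, col = "blue")
    legend("topright", legend = c("model", "LOWESS"), lwd = 2, col = c("red", "blue"))

If the LOWESS curve tracks the red model curve reasonably closely, a linear function of wt on the log-odds scale looks adequate; a large, systematic departure would suggest that the linear form is not.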