
I received abnormal results when using glmrob (an R function from the robustbase package) to assess the association of a binary independent variable X0 with a binary dependent variable Y using logistic regression. The odds ratio for X0 jumped from 16 (exp(2.74)) to a whopping 200 (exp(5.30)) after adjusting for the covariates X1, X2, and X3.

X0 is a binary variable with only one exposed observation in the Y = 0 group (see row #2 below); glmrob treats this observation as an outlier and downweights it. Specifically, 1/32 (3%) of the Y = 0 observations were exposed to X0, while 9/27 (33%) of the Y = 1 observations were exposed.

Contingency table

     X:0  X:1
Y:0   31    1
Y:1   18    9
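
For reference, the table can be reproduced from the data frame given below (a minimal check, assuming the data frame is named data, as in the reproducible code at the end of this post):

    # cross-tabulate outcome Y against exposure X0
    with(data, table(Y, X0))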

Is it appropriate to use glmrob for such skewed proportions? The adjusted odds ratio seems extremely high, especially compared to the unadjusted one. Is there a more appropriate method for such data? What do you suggest? It would be great to have references to justify the change.

Below is reproducible code.

Best regards, Ali

library(dplyr)

data <- data.frame(
  Y  = c(1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1),
  X0 = c(0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,
         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0),
  X1 = c('B','B','C','B','A','B','B','B','B','B','A','B','B','B','B','B','A','B',
         'A','B','A','A','C','B','B','B','B','B','C','B','B','B','B','A','B','B',
         'B','B','B','B','B','B','B','A','B','B','A','B','C','B','B','B','B','A',
         'B','B','A','B','A'),
  X2 = c(1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,0,1,0,1,1,1,1,0,0,1,1,0,1,0,1,1,1,
         0,1,1,0,1,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,1,1),
  X3 = c(18.9,5.4,18.4,18.6,15.3,18.2,17.3,17,12.6,6.9,17.6,15.4,16.1,8.4,11.6,
         16.2,10.7,14.9,19.1,21,16.5,13.2,19.3,16.7,18.3,18.5,13.8,17.7,17.7,
         16.8,16.9,16.2,14.6,15.8,13.4,16.2,15,21.2,20.3,20.7,13.8,17.5,18.8,
         10.2,8.2,9.8,14.9,16.6,19.6,16.3,15.9,16.2,15,15,15.7,19.9,18.1,18.2,19)
)

> data <- data[order(data$Y, decreasing = FALSE), ]
> row.names(data) <- NULL

> data

    Y X0 X1 X2   X3
 1  0  0  A  0 17.6
 2  0  1  B  1 15.4
 3  0  0  B  1 16.1
 4  0  0  B  1  8.4
 5  0  0  B  1 11.6
 6  0  0  B  0 16.2
 7  0  0  A  1 10.7
 8  0  0  B  0 14.9
 9  0  0  B  0 14.6
10  0  0  A  1 15.8
11  0  0  B  1 13.4
12  0  0  B  0 16.2
13  0  0  B  1 15.0
14  0  0  B  1 21.2
15  0  0  B  1 20.3
16  0  0  B  0 20.7
17  0  0  B  0 13.8
18  0  0  B  0 17.5
19  0  0  B  1 18.8
20  0  0  A  0 10.2
21  0  0  B  0  8.2
22  0  0  B  0  9.8
23  0  0  A  1 14.9
24  0  0  B  1 16.6
25  0  0  C  0 19.6
26  0  0  B  1 16.3
27  0  0  B  1 15.9
28  0  0  B  1 16.2
29  0  0  B  1 15.0
30  0  0  A  1 15.0
31  0  0  B  1 15.7
32  0  0  B  0 19.9
33  1  0  B  1 18.9
34  1  1  B  0  5.4
35  1  1  C  1 18.4
36  1  1  B  1 18.6
37  1  0  A  1 15.3
38  1  0  B  1 18.2
39  1  1  B  0 17.3
40  1  0  B  1 17.0
41  1  0  B  1 12.6
42  1  0  B  1  6.9
43  1  0  A  1 19.1
44  1  0  B  1 21.0
45  1  0  A  1 16.5
46  1  0  A  1 13.2
47  1  0  C  0 19.3
48  1  1  B  0 16.7
49  1  0  B  1 18.3
50  1  1  B  1 18.5
51  1  0  B  0 13.8
52  1  0  B  1 17.7
53  1  0  C  0 17.7
54  1  1  B  1 16.8
55  1  0  B  1 16.9
56  1  0  B  1 16.2
57  1  1  A  1 18.1
58  1  1  B  1 18.2
59  1  0  A  1 19.0

> robustbase::glmrob(Y ~ X0 , data=data, family = binomial(link = "logit")) %>% summary()

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.5436     0.2963  -1.834   0.0666 .
X0            2.7408     1.1396   2.405   0.0162 *


Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robustness weights w.r * w.x:
 58 weights are ~= 1. The remaining one is
     2
0.4483

> robustbase::glmrob(Y ~ X0 + X1 + X2 + X3, data=data, family = binomial(link = "logit")) %>% summary()

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45216    2.06462  -1.672   0.0945 .
X0           5.29736    2.54417   2.082   0.0373 *
X1B         -0.58513    0.77066  -0.759   0.4477
X1C          2.30901    1.77568   1.300   0.1935
X2           2.07431    1.12265   1.848   0.0646 .
X3           0.09954    0.10882   0.915   0.3603


Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robustness weights w.r * w.x:
 54 weights are ~= 1. The remaining 5 ones are
     2     25     41     42     51
0.1180 0.8981 0.9436 0.7106 0.3551
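
The downweighted observations can also be pulled out of the fitted object directly; a small sketch (w.r holds the robustness weights of a glmrob fit):

    fit <- robustbase::glmrob(Y ~ X0 + X1 + X2 + X3, data = data,
                              family = binomial(link = "logit"))
    round(fit$w.r[fit$w.r < 0.99], 4)  # only the clearly downweighted rows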

# Using regular glm doesn't cause this issue
> glm(Y ~ X0 + X1 + X2, data=data, family = binomial(link = "logit")) %>% summary()

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.5188     0.9734  -1.560  0.11869
X0            3.2219     1.2200   2.641  0.00827 **
X1B          -0.4763     0.7288  -0.654  0.51343
X1C           2.2179     1.5590   1.423  0.15484
X2            1.5971     0.8384   1.905  0.05679 .

For the sake of consistency I have been using glmrob for all the regression analyses in my study, so if I switch to standard glm for this one model, I need a justification for the one-time change.

am313

2 Answers


In my opinion, the glmrob method is not appropriate for binary data. glmrob relies on quasi-likelihood and observation weights, neither of which is at all reasonable for binary data, for the reasons explained here: What are weights in a binary glm and how to calculate them?

Stepping back a bit, the very idea of identifying outliers when the response variable can take only two possible values is intuitively suspect.

Gordon Smyth
  • Could you share a research paper reference that explains why using quasi-likelihood or observation weights is problematic when using a dichotomized predictor variable in a logistic regression model? – am313 Nov 23 '22 at 23:39
  • @am313 The only reference I know is Section 9.10 of my book with Peter Dunn (Dunn & Smyth 2018, Generalized Linear Models With Examples in R, Springer). I've also explained the reasons in the thread I link to above. It just doesn't justify a research paper. – Gordon Smyth Nov 24 '22 at 01:05
  • @am313 Also, the problem has nothing to do with predictor variables, whether dichotomized or not. When I say "binary data", I mean that the response variable is binary, as it is for your data. The form of the predictor variables is irrelevant. – Gordon Smyth Nov 24 '22 at 01:52

What you're describing is quasi-complete separation in your data. A strategy for dealing with this is Firth's (bias-reduced) logistic regression, which uses penalized likelihood estimation.
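
You can confirm the separation programmatically before refitting. A minimal sketch using the detectseparation package (my suggestion here, not something required by the Firth approach itself); it reports whether any ML estimates diverge to infinity:

    library(detectseparation)
    # prints Inf for coefficients affected by (quasi-)complete separation
    glm(Y ~ X0 + X1 + X2 + X3, data = data,
        family = binomial(link = "logit"), method = "detect_separation")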

This can be done in R using the brglm and brglm2 packages. In your glm call, the specific bias-reduction technique can be specified via the method and type arguments (the default is mean bias reduction). Example:

    library(brglm2)  # provides the brglmFit fitting method
    mod1 <- glm(Y ~ X0 + X1 + X2 + X3, data = data, family = binomial(link = "logit"),
                method = "brglmFit", type = "AS_mean")
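
Because the penalized estimates stay finite even under separation, the adjusted odds ratios can then be read off directly (a quick follow-up to the sketch above):

    exp(coef(mod1))  # adjusted odds ratios, including the X0 of interest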

References for changing methods:

  1. Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38.
  2. Heinze, G., & Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21(16), 2409–2419. doi:10.1002/sim.1047. PMID: 12210625.
Jack