
I received abnormal results when using glmrob (an R function from the robustbase package) to assess the association of a binary independent variable X0 with a binary dependent variable Y using logistic regression. The odds ratio for X0 jumped from 16 (exp(2.74)) to a whopping 200 (exp(5.30)) after adjusting for the covariates X1, X2, and X3.

X0 is a binary variable with only one exposed observation in the Y = 0 group (see row #2 below); glmrob treats this observation as an outlier and downweights it. Specifically, 1/32 (3%) of the Y = 0 observations were exposed to X0, while 9/27 (33%) of the Y = 1 observations were exposed.

Contingency table

     X:0  X:1
Y:0   31    1
Y:1   18    9
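
For reference, the table can be reproduced from the data frame given below (a minimal check, assuming the data frame is named data, as in the reproducible code at the end of this post):

    # cross-tabulate outcome Y against exposure X0
    with(data, table(Y, X0))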

Is it appropriate to use glmrob for such skewed proportions? The adjusted odds ratio seems extremely high, especially compared to the unadjusted one. Is there a more appropriate method for such data? What do you suggest? It would be great to have references to justify the change.

Below is reproducible code.

Best regards, Ali

library(dplyr)

data <- data.frame(
  Y  = c(1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1),
  X0 = c(0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,
         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0),
  X1 = c('B','B','C','B','A','B','B','B','B','B','A','B','B','B','B','B','A','B',
         'A','B','A','A','C','B','B','B','B','B','C','B','B','B','B','A','B','B',
         'B','B','B','B','B','B','B','A','B','B','A','B','C','B','B','B','B','A',
         'B','B','A','B','A'),
  X2 = c(1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,0,1,0,1,1,1,1,0,0,1,1,0,1,0,1,1,1,
         0,1,1,0,1,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,1,1),
  X3 = c(18.9,5.4,18.4,18.6,15.3,18.2,17.3,17,12.6,6.9,17.6,15.4,16.1,8.4,11.6,
         16.2,10.7,14.9,19.1,21,16.5,13.2,19.3,16.7,18.3,18.5,13.8,17.7,17.7,
         16.8,16.9,16.2,14.6,15.8,13.4,16.2,15,21.2,20.3,20.7,13.8,17.5,18.8,
         10.2,8.2,9.8,14.9,16.6,19.6,16.3,15.9,16.2,15,15,15.7,19.9,18.1,18.2,19)
)

> data <- data[order(data$Y, decreasing = FALSE), ]
> row.names(data) <- NULL

> data

    Y X0 X1 X2   X3
 1  0  0  A  0 17.6
 2  0  1  B  1 15.4
 3  0  0  B  1 16.1
 4  0  0  B  1  8.4
 5  0  0  B  1 11.6
 6  0  0  B  0 16.2
 7  0  0  A  1 10.7
 8  0  0  B  0 14.9
 9  0  0  B  0 14.6
10  0  0  A  1 15.8
11  0  0  B  1 13.4
12  0  0  B  0 16.2
13  0  0  B  1 15.0
14  0  0  B  1 21.2
15  0  0  B  1 20.3
16  0  0  B  0 20.7
17  0  0  B  0 13.8
18  0  0  B  0 17.5
19  0  0  B  1 18.8
20  0  0  A  0 10.2
21  0  0  B  0  8.2
22  0  0  B  0  9.8
23  0  0  A  1 14.9
24  0  0  B  1 16.6
25  0  0  C  0 19.6
26  0  0  B  1 16.3
27  0  0  B  1 15.9
28  0  0  B  1 16.2
29  0  0  B  1 15.0
30  0  0  A  1 15.0
31  0  0  B  1 15.7
32  0  0  B  0 19.9
33  1  0  B  1 18.9
34  1  1  B  0  5.4
35  1  1  C  1 18.4
36  1  1  B  1 18.6
37  1  0  A  1 15.3
38  1  0  B  1 18.2
39  1  1  B  0 17.3
40  1  0  B  1 17.0
41  1  0  B  1 12.6
42  1  0  B  1  6.9
43  1  0  A  1 19.1
44  1  0  B  1 21.0
45  1  0  A  1 16.5
46  1  0  A  1 13.2
47  1  0  C  0 19.3
48  1  1  B  0 16.7
49  1  0  B  1 18.3
50  1  1  B  1 18.5
51  1  0  B  0 13.8
52  1  0  B  1 17.7
53  1  0  C  0 17.7
54  1  1  B  1 16.8
55  1  0  B  1 16.9
56  1  0  B  1 16.2
57  1  1  A  1 18.1
58  1  1  B  1 18.2
59  1  0  A  1 19.0

> robustbase::glmrob(Y ~ X0 , data=data, family = binomial(link = "logit")) %>% summary()

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.5436     0.2963  -1.834   0.0666 .
X0            2.7408     1.1396   2.405   0.0162 *


Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robustness weights w.r * w.x:
 58 weights are ~= 1. The remaining one is
     2
0.4483

> robustbase::glmrob(Y ~ X0 + X1 + X2 + X3, data=data, family = binomial(link = "logit")) %>% summary()

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45216    2.06462  -1.672   0.0945 .
X0           5.29736    2.54417   2.082   0.0373 *
X1B         -0.58513    0.77066  -0.759   0.4477
X1C          2.30901    1.77568   1.300   0.1935
X2           2.07431    1.12265   1.848   0.0646 .
X3           0.09954    0.10882   0.915   0.3603


Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robustness weights w.r * w.x:
 54 weights are ~= 1. The remaining 5 ones are
     2     25     41     42     51
0.1180 0.8981 0.9436 0.7106 0.3551
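
The downweighted observations can also be pulled out of the fitted object directly; a small sketch (w.r holds the robustness weights of a glmrob fit):

    fit <- robustbase::glmrob(Y ~ X0 + X1 + X2 + X3, data = data,
                              family = binomial(link = "logit"))
    round(fit$w.r[fit$w.r < 0.99], 4)  # only the clearly downweighted rows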

# Using regular glm doesn't cause this issue
> glm(Y ~ X0 + X1 + X2, data=data, family = binomial(link = "logit")) %>% summary()

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.5188     0.9734  -1.560  0.11869
X0            3.2219     1.2200   2.641  0.00827 **
X1B          -0.4763     0.7288  -0.654  0.51343
X1C           2.2179     1.5590   1.423  0.15484
X2            1.5971     0.8384   1.905  0.05679 .

For the sake of consistency I have been using glmrob for all the regression analyses in my study, so if I switch to standard glm for this one model, I need a justification for the one-time change.

am313

2 Answers


In my opinion, the glmrob method is not appropriate for binary data. glmrob relies on quasi-likelihood and observation weights, neither of which is at all reasonable for binary data, for the reasons explained here: What are weights in a binary glm and how to calculate them?

Stepping back a bit, the very idea of identifying outliers when the response variable can take only two possible values is intuitively suspect.

Gordon Smyth
  • Could you share a research paper reference that explains why using quasi-likelihood or observation weights is problematic when using a dichotomized predictor variable in a logistic regression model? – am313 Nov 23 '22 at 23:39
  • @am313 The only reference I know is Section 9.10 of my book with Peter Dunn (Dunn & Smyth 2018, Generalized Linear Models With Examples in R, Springer). I've also explained the reasons in the thread I link to above. It just doesn't justify a research paper. – Gordon Smyth Nov 24 '22 at 01:05
  • @am313 Also, the problem has nothing to do with predictor variables, whether dichotomized or not. When I say "binary data", I mean that the response variable is binary, as it is for your data. The form of the predictor variables is irrelevant. – Gordon Smyth Nov 24 '22 at 01:52

What you're describing is quasi-complete separation in your data. A strategy for dealing with this is Firth's (bias-reduced) logistic regression, which uses penalized likelihood estimation.
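
You can confirm the separation programmatically before refitting. A minimal sketch using the detectseparation package (my suggestion here, not something required by the Firth approach itself); it reports whether any ML estimates diverge to infinity:

    library(detectseparation)
    # prints Inf for coefficients affected by (quasi-)complete separation
    glm(Y ~ X0 + X1 + X2 + X3, data = data,
        family = binomial(link = "logit"), method = "detect_separation")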

This can be done in R using the brglm and brglm2 packages. In your glm call, the specific bias-reduction technique can be specified via the method and type arguments (the default is mean bias reduction). Example:

    library(brglm2)  # provides the brglmFit fitting method
    mod1 <- glm(Y ~ X0 + X1 + X2 + X3, data = data, family = binomial(link = "logit"),
                method = "brglmFit", type = "AS_mean")
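
Because the penalized estimates stay finite even under separation, the adjusted odds ratios can then be read off directly (a quick follow-up to the sketch above):

    exp(coef(mod1))  # adjusted odds ratios, including the X0 of interest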

References for changing methods:

  1. Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38.
  2. Heinze, G., & Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21(16), 2409–2419. doi:10.1002/sim.1047. PMID: 12210625.
Jack