Logistic Regression Simulation Apparent Bias

Question

Hi Other Stats Humans!

I am simulating some data and I am observing some unexpected results. After simulating the data, I found that when the true odds ratio is less than one, the estimated odds ratio is almost always greater than the truth. On the other end, when the true odds ratio is greater than one, the estimated odds ratio is almost always less than the true odds ratio.

I found this Bias of maximum likelihood estimators for logistic regression, which suggests the estimates are indeed biased, but even after bumping the sample size to 100k, the relative magnitude of the observed bias didn't appear to decrease.

I was wondering if this is an example of the bias I might expect, or if I am making a mistake somewhere?

Below is the setup for the simulation:

Let $Y_i \sim Bernoulli(P[Y = 1|X_i])$ with $X_i = \{x_{1,i}, x_{2,i}, \dots, x_{100,i}\}$ and $\mathbf{X}_i =\begin{bmatrix} x_{1,i} & x_{2,i} & \cdots & x_{100,i} \end{bmatrix}$. For this purposes of this simulation I assume each $\mathbf{X}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{1})$. Assume the $E[Y_i] = Pr[Y = 1|X_i]$ is linked to the linear predictor, $\eta_i = \beta_0 + \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + \dots + \beta_{100}x_{100,i}$, through the logit function (e.g. logistic regression).

$$ logit(E[Y_i]) = ln(\frac{Pr[Y = 1|X_i]}{1 -Pr[Y = 1|X_i]}) = \beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + \dots + \beta_{100}x_{100,i} $$

Doing a little bit of algebra, the conditional probability can be derived.

$$ Pr[Y = 1 | x_i] = \frac{e^{\beta_0 + \beta_{1} + \beta_{2}x_{1,i} + \dots + \beta_{100}x_{100,i}}}{(1 + e^{\beta_0 + \beta_{1} + \beta_{2}x_{1,i} + \dots + \beta_{100}x_{100,i}})} $$

Let $a \in \mathbb{R}$ and $j \in \{1, 2, \dots, 100\}$. The marginal effect for a unit increase in any $x_{j,i} \in X_i$ is $logit(E[Y | x_{j,i} = a+1]) - logit(E[Y | x_{j,i} = a])$. A bit more math, and we can see that the log odds ratio for a unit change in $a$ is a constant that no longer depends on $x_j$.

$$ ln(\frac{\frac{Pr[Y = 1|x_{j,i} = a+1]}{1 -Pr[Y = 1|x_{j,i} = a+1]}}{\frac{Pr[Y = 1|x_{j,i} = a]}{1 -Pr[Y = 1|x_{j,i} = a]}}) = \beta_j $$

Simulation Results

library("mvtnorm")

set.seed(765)

# Generate the odds ratios.  We assume that there are 25 true effects with odds ratios between 0.5 and 1.5.
Beta <- runif(100, log(0.5), log(1.5))
Beta[sample(1:100, 75)] <- log(1) # Create sparsity

# Generate random normal variates.
X <- rmvnorm(n = 1000, mean = rep(0, 100), sigma = diag(100))
colnames(X) <- paste("x_", 1:100)

# The linear predictor
Beta_lp <- as.matrix(c(0, Beta)) # Zero for the log odds ratio when all of the other beta's are zero.
X_lp <- cbind(intercept = rep(1, dim(X)[1]), X) # Column of 1's for the intercept
Eta = X_lp%*%Beta_lp

# The probability of Y is calculated and then converted to 0, 1 data by sampling from a Bernoulli distribution.
Y_prob = exp(Eta)/(1 + exp(Eta))
Y = rbinom(1000, 1, Y_prob)

# Fit all of the models to get estimates and p-values.
univFit <- function(x, y) glm(y ~ x, family = binomial)
fitRes <- apply(X, 2, univFit, y = Y)

# Recover the estimates
univCoef <- function(x) summary(x)$coefficients["x", "Estimate"]
est <- unlist(lapply(fitRes, univCoef))

Beta_compare <- cbind(exp(Beta), exp(est), exp(Beta) - exp(est))
Beta_compare <- Beta_compare[order(Beta_compare[, 1]), ]
Beta_compare <- Beta_compare[which(Beta_compare[, 1] != 1), ]
colnames(Beta_compare) <- c("True Odds Ratio", "Estimated Odds Ratio", "Difference")
Beta_compare

       True Odds Ratio Estimated Odds Ratio   Difference
x_ 51        0.5385276            0.5668356 -0.028307937
x_ 16        0.5397476            0.6918808 -0.152133154
x_ 67        0.5407665            0.6016156 -0.060849130
x_ 100       0.6126987            0.6949004 -0.082201669
x_ 53        0.6238660            0.7861676 -0.162301651
x_ 15        0.7193909            0.7926078 -0.073216839
x_ 66        0.7485787            0.7794177 -0.030839078
x_ 59        0.7719291            0.7695683  0.002360802
x_ 52        0.7720026            0.8378629 -0.065860309
x_ 6         0.7792257            0.8727507 -0.093525064
x_ 8         0.7896498            0.8111636 -0.021513880
x_ 9         0.8386983            0.9057300 -0.067031684
x_ 89        0.8404852            0.8733146 -0.032829333
x_ 49        0.8758568            0.9353507 -0.059493814
x_ 60        0.9113235            0.9736910 -0.062367499
x_ 20        0.9859473            1.0232799 -0.037332522
x_ 90        1.0402532            1.0516565 -0.011403221
x_ 95        1.0871565            0.9522206  0.134935908
x_ 4         1.1357620            1.0015317  0.134230252
x_ 35        1.1447927            1.0454190  0.099373638
x_ 3         1.1569112            1.1197491  0.037162148
x_ 45        1.3042183            1.1206980  0.183520318
x_ 48        1.3545098            1.2355708  0.118938949
x_ 40        1.3632327            1.2339785  0.129254174
x_ 41        1.4433416            1.2152798  0.228061783

Logistic Regression Simulation Apparent Bias

0 Answers0