Let's say that I generate observations from logistic regression:
n <- 1000 #n. of observations
p <- 5 #n. of covariates
u <- dnorm(runif(n*p, min = 0, max = 1))
x <- matrix(u, nrow = n, byrow = TRUE)
b <- c(2, 1, 2, 3, 6, -3)
prob <- 1/(1 + exp(-b%*%t(cbind(1,x))))
y <- rbinom(n,1,prob)
But now if I fit a model:
df = data.frame(y, x)
mod <- glm(y ~ . , data = df, family = "binomial")
coefficients(mod)
Coefficients I get are:
(Intercept) X1 X2 X3 X4 X5
-7.0004768 16.5132458 -0.6912072 11.7006775 1.5552072 7.0992499
It might well be a very dumb question showing a lack of knowledge, but why are the coefficients returned by glm different from those in b? A shot in the dark: while generating y I rely on probabilities fed into a binomial draw, so there is room for randomness. Am I right? In the sense that drawing from Binomial(4, 0.5) I could get, e.g., both (1,1,0,0) and (0,1,0,0).
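The sampling-variability point can be illustrated directly in R (a standalone toy example, separate from the simulation above):

```r
# Two draws from the same Binomial(size = 1, prob = 0.5) setup
# will generally differ, even though the probability is identical.
set.seed(123)  # arbitrary seed, just for reproducibility
rbinom(4, size = 1, prob = 0.5)  # one realisation of four coin flips
rbinom(4, size = 1, prob = 0.5)  # a second, generally different, realisation
```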
[…] by a couple of orders of magnitude rectifies the problem. I show how to simulate some responses here. – Demetri Pananos Apr 25 '21 at 12:55
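A minimal sketch of the kind of simulation the comment alludes to (illustrative coefficient values, not the commenter's actual code). Note that dnorm() only evaluates the normal density; to sample normal covariates you want rnorm(). With properly sampled covariates and moderate coefficients, glm() recovers b up to sampling error:

```r
# Sketch: simulate logistic-regression data so that glm() recovers the
# true coefficients. Assumed values throughout are illustrative.
set.seed(1)
n <- 10000                               # larger n -> tighter estimates
p <- 5
x <- matrix(rnorm(n * p), nrow = n)      # rnorm() samples; dnorm() does not
b <- c(0.5, 1, -1, 0.25, -0.5, 0.75)     # intercept + 5 slopes (illustrative)
eta  <- cbind(1, x) %*% b                # linear predictor
prob <- 1 / (1 + exp(-eta))              # inverse-logit
y <- rbinom(n, 1, prob)

mod <- glm(y ~ x, family = binomial)
cbind(true = b, estimated = coef(mod))   # estimates should be close to b
```

With very large coefficients (like 6 in the original b) many fitted probabilities sit at essentially 0 or 1, which makes the likelihood surface flat and the estimates unstable; keeping the linear predictor in a moderate range avoids that.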