
I used the code below to create the random variable x1 and binary variable y, and fit the regression with y and x1. My questions are:

  1. Why are the regression coefficient estimates not close to 2 and 10 (the values I set up)?
  2. Why the warning "did not converge"?

Thank you very much in advance for any thoughts!

n = 1000
x1 = rnorm(n)                # predictor
xBeta = 2 + 10*x1            # linear predictor with intercept 2 and slope 10
p = 1/(1 + exp(-xBeta))      # inverse-logit transform
y = (p > 0.5) + 0            # binary response via a hard 0.5 threshold
dat = data.frame(y = y, x1 = x1)
model <- glm(y ~ x1, family = binomial(link = 'logit'), data = dat)
summary(model)

Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

Call:
glm(formula = y ~ x1, family = binomial(link = "logit"), data = dat)

Coefficients:
                  Estimate      Std. Error z value Pr(>|z|)
(Intercept) 1209.197686004 11464.422053391 0.10547  0.91600
x1          5992.507564124 56805.078196891 0.10549  0.91599

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3698653260741e+03  on 999 degrees of freedom
Residual deviance: 6.2933474676058e-05  on 998 degrees of freedom
AIC: 4.0000629334747

Number of Fisher Scoring iterations: 25

Watchung
    I have posted R code to simulate logistic regression data in several threads, including https://stats.stackexchange.com/a/69873/919, https://stats.stackexchange.com/a/525877/919, https://stats.stackexchange.com/a/260977/919, https://stats.stackexchange.com/a/449216/919, etc. So have many others: see this site search. – whuber Sep 29 '22 at 16:25

2 Answers


The way you simulated your data, everyone with a value of p greater than .5 gets a 1, and everyone with a value of p less than .5 gets a 0. To get a p greater than .5 is to have xBeta greater than 0, and therefore to have x1 greater than -.2. So, everyone in your population who has x1 greater than -.2 gets a 1, and everyone in your population who has x1 less than -.2 gets a 0.
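
A quick way to see this (a minimal check, assuming the question's simulation code has just been run):

# y is an exact deterministic function of x1: no randomness remains
all(y == as.integer(x1 > -0.2))   # TRUE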

This does not correspond to a logistic regression data-generating process; it corresponds to a sharp threshold, with no randomness given a unit's value of x1. Logistic regression is for "non-deterministic" processes in which each individual has some probability strictly between 0 and 1 of getting the outcome, and the goal is to model that probability. In your case, you set each unit's probability of getting the outcome to exactly 0 (if x1 < -.2) or exactly 1 (if x1 > -.2). The fitted model tries to mimic that sharp threshold by producing an extreme odds ratio, which is why the coefficient estimates are so large. The coefficient governs the steepness of the bend in the logistic S-curve, and here the true "curve" is a step function, jumping vertically at x1 = -.2, because you programmed a strict threshold. This is complete separation, which is also why glm warns that the algorithm did not converge: the likelihood keeps improving as the estimated slope grows, without bound.
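
You can verify how extreme the fitted curve is using the coefficient estimates from the output above (plogis is R's inverse logit; the numbers are taken from the question's summary):

# Fitted probabilities just below and just above the threshold x1 = -0.2
plogis(1209.2 + 5992.5 * c(-0.21, -0.19))   # essentially 0, then essentially 1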

If you want p to be the probability that each unit gets the outcome, you need to flip a weighted coin for each unit, where the probability of heads is that unit's value of p. To do that in R, you can use rbinom(n, 1, p). This returns a vector of 0s and 1s, where the probability of a 1 for each unit is equal to its p (the 1 in the function call indicates that we're flipping one coin, i.e., running one Bernoulli trial, per unit). So, as @utobi said, you need to replace

y=(p > 0.5)+0

with

y <- rbinom(n, 1, p)

to correctly simulate the logistic process.
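
As a quick sanity check on rbinom (a standalone sketch, independent of the simulation above): with a fixed success probability, the sample proportion of 1s should be close to that probability.

# One million Bernoulli(0.3) draws; the mean should be approximately 0.3
mean(rbinom(1e6, 1, 0.3))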

Noah

The answer to both questions is that you are generating the response in the wrong way: you are treating the response as fixed, conditional on $x_i$, when the model assumes it is random.

Indeed, the binary logistic regression model can be written as $$ Y_i \sim \text{Bernoulli}(p_i), $$ $$ p_i = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}},\,\,i=1,\ldots,n. $$ According to your assumptions, $\beta_0=2,\beta_1=10$, $x_i\sim \text{N}(0,1)$ iid and $n=1000$.

The correct way to generate the data from the model is thus

set.seed(1)                      # for reproducibility
n = 1000
x1 = rnorm(n)
xBeta = 2 + 10*x1
p = 1/(1 + exp(-xBeta))
y = rbinom(n, 1, p)              # draw y ~ Bernoulli(p): this is the random step
dat = data.frame(y = y, x1 = x1)
model <- glm(y ~ x1, family = binomial(link = 'logit'), data = dat)
summary(model)
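
With the response drawn randomly this way, the estimates should land near the true values, up to sampling error (with a slope this steep the standard errors can still be sizable):

coef(model)   # should be close to the true values 2 and 10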
utobi
    Can you give some more explanation about why the OP's way was wrong? – Stephan Kolassa Sep 29 '22 at 14:35
    @StephanKolassa Thank you for the comments. Here is my understanding of utobi's answer. Logistic regression uses MLE with the Bernoulli likelihood p^y * (1-p)^(1-y), which assumes the data y are random binary outcomes. However, the way I did it, y = (p > 0.5) + 0, is deterministic, without any randomness, so the Bernoulli distribution is the wrong one to assume in the MLE. More importantly, the MLE can then push every fitted probability with p > 0.5 arbitrarily close to 1 (e.g., by making xBeta extremely large) without any cost or penalty; this is essentially why the coefficients are so big (see the sketch after these comments). – Watchung Sep 29 '22 at 15:04
    @StephanKolassa, it seems that the OP got the point. Anyway, I updated my post with a few details, for the sake of completeness. – utobi Sep 29 '22 at 16:37
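
To make Watchung's point concrete, here is a minimal sketch (assuming x1 and y come from the question's original, deterministic simulation): under complete separation the Bernoulli log-likelihood keeps increasing as the coefficients are scaled up, so there is no finite maximum for glm to converge to.

# Bernoulli log-likelihood at coefficients b = (intercept, slope)
loglik <- function(b, x, y) sum(dbinom(y, 1, plogis(b[1] + b[2]*x), log = TRUE))
loglik(c(2, 10), x1, y)        # negative
loglik(c(20, 100), x1, y)      # larger (closer to 0)
loglik(c(200, 1000), x1, y)    # larger still: the likelihood never peaks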