
My outcome variable is a series of Bernoulli trials where some values are missing

$y \in \{0, 1, \text{NA}\}$

How do you impute NA values for an outcome variable in rstan in the context of a GLM, assuming they are missing at random?

Reproducible example: the code below runs with y. How do I modify it to impute the NA values in y_miss?

library(rstan)
set.seed(123)

#generate data
y <- rbinom(20, 1, 0.5)
N <- length(y)
x <- rnorm(20)

#outcome variable with missing data
y_miss <- c(y[1:17], NA, NA, NA) 

#data as a list
data <- list(y=y, x=x, N=N) 

#stan code
model_code <- '
data {
int N;                         //number of observations (#20)
int<lower=0, upper=1> y[N];    //Bernoulli distributed outcome variable
vector[N] x;                   //continuous explanatory variable
}
parameters {
real a;                        //intercept
real beta;                     //slope
}
model {
vector[N] p;                   //vector to hold values of linear model
a ~ normal(0,5);               //prior for intercept
beta ~ normal(0,5);            //prior for slope
for (i in 1:N) {
p[i] = a + beta*x[i];}         //linear model
y ~ bernoulli_logit(p);        //likelihood with link function
}' 

#run model
m <- stan(model_code=model_code, data=data, iter=4000)    

3 Answers


Each missing value of y_miss can be either 0 or 1, so you need to marginalize over them with a statement such as

for (i in 1:n_miss) {
  real eta = a + beta * x_miss[i];
  target += log_mix(inv_logit(eta), bernoulli_logit_lpmf(1 | eta), 
                                    bernoulli_logit_lpmf(0 | eta));
}

where x_miss is the vector of predictor values, of size n_miss, for the observations whose outcome is missing. This is an important conceptual point rather than merely a coding exercise: you need a posterior distribution that is differentiable with respect to the unknowns a and beta in order to draw from it efficiently enough to reach reliable conclusions with a finite number of draws. Then, in generated quantities, you can draw realizations of the missing values with

generated quantities {
  int y_imp[n_miss];
  for (i in 1:n_miss) {
    real eta = a + beta * x_miss[i];
    y_imp[i] = bernoulli_logit_rng(eta);
  }
}
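For completeness, the data for a model along these lines could be assembled in R roughly as follows (a sketch only: the answer does not show its data block, so the names `miss` and `stan_data` are mine, and I am assuming the observed part uses N, y, and x as in the question):

```r
set.seed(123)
y <- rbinom(20, 1, 0.5)
x <- rnorm(20)
y_miss <- c(y[1:17], NA, NA, NA)

# split the data into observed and missing parts
miss <- is.na(y_miss)
stan_data <- list(
  N      = sum(!miss),     # number of observed outcomes
  y      = y_miss[!miss],  # observed outcomes only (no NAs)
  x      = x[!miss],       # predictors for observed outcomes
  n_miss = sum(miss),      # number of missing outcomes
  x_miss = x[miss]         # predictors where the outcome is missing
)
```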
Ben Goodrich
    When I try this with simulated data, I get biased inferences of beta, and the bias grows with the amount of data. If you have any ideas, I have detailed the issue here: https://stats.stackexchange.com/questions/563129/marginalizing-out-discrete-response-variables-in-stan – user_15 Feb 07 '22 at 09:12
    The accepted answer to this question is wrong because it provides very biased estimates of beta, as I detail here: https://stats.stackexchange.com/questions/563129/marginalizing-out-discrete-response-variables-in-stan – user_15 Feb 10 '22 at 11:28

I cannot comment, but I agree that the accepted answer here seems problematic. Per this discussion on the Stan forums, it is clear that the provided code amounts to incrementing the log probability by:

$$\log(p_i^2 + (1-p_i)^2)$$

for each missing observation $y_i$, where $p_i = \mathrm{invlogit}(\eta_i)$.

In other words, letting $M(y_i)$ be an indicator that $y_i$ is missing, this implies that

$$P(M(y_i) \mid p_i) = p_i^2 + (1-p_i)^2$$

for each missing $y_i$.

But this makes the presence of any missing values $y_i$ strongly informative about the corresponding $p_i$ and thus about $a$ and $\beta$. Specifically, it will cause the model to prefer values of $a$ and $\beta$ that make $\eta_i$ of large absolute value (and therefore $p_i$ close to 0 or 1) for all missing observations $y_i$, as a plot of the quadratic $P(M(y_i) \mid p_i) = p_i^2 + (1-p_i)^2$ makes clear:

Plot of P(M(y_i) | p_i) as a function of p_i
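The equivalence between the log_mix statement in the accepted answer and the quadratic expression above can be checked numerically in R (a small sketch; the value 0.7 is an arbitrary choice of $\eta_i$):

```r
# log_mix(theta, lp1, lp2) in Stan is log(theta*exp(lp1) + (1-theta)*exp(lp2));
# with theta = p, lp1 = log(p), lp2 = log(1-p) this collapses to log(p^2 + (1-p)^2)
eta <- 0.7                      # arbitrary linear predictor value
p <- plogis(eta)                # inv_logit(eta)
log_mix   <- log(p * exp(log(p)) + (1 - p) * exp(log(1 - p)))
quadratic <- log(p^2 + (1 - p)^2)
all.equal(log_mix, quadratic)   # TRUE
```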


As another answer mentioned, you can do this in the generated quantities block in Stan. However, I believe that answer is incorrect in marginalizing over the unknown values: the unknown values shouldn't affect target, i.e. the model log probability. I think the model below gives unbiased estimates of the model parameters while inferring distributions for the missing y values.

N_mis <- 3
data <- list(y=y, x=x, N=N, N_mis=N_mis)

model_code <- '
data {
int N;                            //number of observations (#20)
int N_mis;                        //number of missing outcomes (at the end)
array[N] int<lower=0, upper=1> y; //outcome; only the first N - N_mis entries enter the likelihood
vector[N] x;                      //continuous explanatory variable
}
transformed data {
int N_obs = N - N_mis;
}
parameters {
real a;                           //intercept
real beta;                        //slope
}
model {
a ~ normal(0,5);                  //prior for intercept
beta ~ normal(0,5);               //prior for slope
y[1:N_obs] ~ bernoulli_logit(a + beta * x[1:N_obs]);
}
generated quantities {
array[N_mis] int<lower=0, upper=1> y_mis = bernoulli_logit_rng(a + beta * x[N_obs+1:N]);
}
'
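If you prefer, the same imputation can also be done after fitting, in plain R, from the posterior draws of a and beta. A sketch under stated assumptions: a_draws and beta_draws stand in for rstan::extract(m)$a and rstan::extract(m)$beta, and x_mis is hypothetical predictor values where y is missing.

```r
set.seed(123)
a_draws    <- rnorm(100, 0, 0.1)    # stand-ins for rstan::extract(m)$a
beta_draws <- rnorm(100, 1, 0.1)    # stand-ins for rstan::extract(m)$beta
x_mis <- c(-0.5, 0.2, 1.1)          # predictor values where y is missing

# one row per posterior draw, one column per missing outcome
p_mis <- plogis(sapply(x_mis, function(xm) a_draws + beta_draws * xm))
y_imp <- matrix(rbinom(length(p_mis), 1, p_mis), nrow = nrow(p_mis))
```

Each column of y_imp is then a posterior predictive sample for one missing outcome, mirroring the bernoulli_logit_rng call in generated quantities.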