
My outcome variable is a series of Bernoulli trials where some values are missing

$y \in \{0, 1, \text{NA}\}$

How do you impute NA values for an outcome variable in rstan in the context of a GLM, assuming they are missing at random?

Reproducible example: the code below runs with y. How do I modify it to impute the NA values in y_miss?

library(rstan)
set.seed(123)

#generate data
y <- rbinom(20, 1, 0.5)
N <- length(y)
x <- rnorm(20)

#outcome variable with missing data
y_miss <- c(y[1:17], NA, NA, NA) 

#data as a list
data <- list(y=y, x=x, N=N) 

#stan code
model_code <- '
data {
int N;                         //number of observations (#20)
int<lower=0, upper=1> y[N];    //Bernoulli distributed outcome variable
vector[N] x;                   //continuous explanatory variable
}
parameters {
real a;                        //intercept
real beta;                     //slope
}
model {
vector[N] p;                   //vector to hold values of linear model
a ~ normal(0,5);               //prior for intercept
beta ~ normal(0,5);            //prior for slope
for (i in 1:N) {
p[i] = a + beta*x[i];}         //linear model
y ~ bernoulli_logit(p);        //likelihood with link function
}' 

#run model
m <- stan(model_code=model_code, data=data, iter=4000)    

3 Answers


Each missing value of y_miss can be either 0 or 1, so you need to marginalize over them with a statement such as

for (i in 1:n_miss) {
  real eta = a + beta * x_miss[i];
  target += log_mix(inv_logit(eta), bernoulli_logit_lpmf(1 | eta), 
                                    bernoulli_logit_lpmf(0 | eta));
}

where x_miss is the vector of predictor values, of size n_miss, for the observations whose outcome is missing. This is an important conceptual point rather than merely a coding exercise: you need a posterior distribution that is differentiable with respect to the unknowns a and beta in order to draw from it efficiently enough to reach reliable conclusions with a finite number of draws. Then, in generated quantities, you can draw realizations of the missing values with

generated quantities {
  int y_imp[n_miss];
  for (i in 1:n_miss) {
    real eta = a + beta * x_miss[i];
    y_imp[i] = bernoulli_logit_rng(eta);
  }
}
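For completeness, the data for a model along these lines could be assembled in R roughly as follows (a sketch only: the answer does not show its data block, so the names `miss` and `stan_data` are mine, and I am assuming the observed part uses N, y, and x as in the question):

```r
set.seed(123)
y <- rbinom(20, 1, 0.5)
x <- rnorm(20)
y_miss <- c(y[1:17], NA, NA, NA)

# split the data into observed and missing parts
miss <- is.na(y_miss)
stan_data <- list(
  N      = sum(!miss),     # number of observed outcomes
  y      = y_miss[!miss],  # observed outcomes only (no NAs)
  x      = x[!miss],       # predictors for observed outcomes
  n_miss = sum(miss),      # number of missing outcomes
  x_miss = x[miss]         # predictors where the outcome is missing
)
```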
Ben Goodrich
    When I try this with simulated data, I get biased inferences of beta, and the bias grows with the amount of data. If you have any ideas, I have detailed the issue here: https://stats.stackexchange.com/questions/563129/marginalizing-out-discrete-response-variables-in-stan – user_15 Feb 07 '22 at 09:12
    The accepted answer to this question is wrong because it provides very biased estimates of beta, as I detail here: https://stats.stackexchange.com/questions/563129/marginalizing-out-discrete-response-variables-in-stan – user_15 Feb 10 '22 at 11:28

I cannot comment, but I agree that the accepted answer here seems problematic. Per this discussion on the Stan forums, it is clear that the provided code amounts to incrementing the log probability by:

$$\log(p_i^2 + (1-p_i)^2)$$

for each missing observation $y_i$, where $p_i = \mathrm{invlogit}(\eta_i)$.

In other words, letting $M(y_i)$ be an indicator that $y_i$ is missing, this implies that

$$P(M(y_i) \mid p_i) = p_i^2 + (1-p_i)^2$$

for each missing $y_i$.

But this makes the presence of any missing values $y_i$ strongly informative about the corresponding $p_i$ and thus about $a$ and $\beta$. Specifically, it will cause the model to prefer values of $a$ and $\beta$ that make $\eta_i$ of large absolute value (and therefore $p_i$ close to 0 or 1) for all missing observations $y_i$, as a plot of the quadratic $P(M(y_i) \mid p_i) = p_i^2 + (1-p_i)^2$ makes clear:

Plot of P(M(y_i) | p_i) as a function of p_i
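The equivalence between the log_mix statement in the accepted answer and the quadratic expression above can be checked numerically in R (a small sketch; the value 0.7 is an arbitrary choice of $\eta_i$):

```r
# log_mix(theta, lp1, lp2) in Stan is log(theta*exp(lp1) + (1-theta)*exp(lp2));
# with theta = p, lp1 = log(p), lp2 = log(1-p) this collapses to log(p^2 + (1-p)^2)
eta <- 0.7                      # arbitrary linear predictor value
p <- plogis(eta)                # inv_logit(eta)
log_mix   <- log(p * exp(log(p)) + (1 - p) * exp(log(1 - p)))
quadratic <- log(p^2 + (1 - p)^2)
all.equal(log_mix, quadratic)   # TRUE
```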


As another answer mentioned, you can do this in the generated quantities block in Stan. However, I believe that answer is incorrect in marginalizing over the unknown values: the unknown values shouldn't affect target, i.e. the model log probability. I think the model below gives unbiased estimates of the model parameters while inferring distributions for the missing y values.

N_mis <- 3
data <- list(y=y, x=x, N=N, N_mis=N_mis)

model_code <- '
data {
int N;                            //number of observations (#20)
int N_mis;                        //number of missing outcomes (at the end)
array[N] int<lower=0, upper=1> y; //outcome; only the first N - N_mis entries enter the likelihood
vector[N] x;                      //continuous explanatory variable
}
transformed data {
int N_obs = N - N_mis;
}
parameters {
real a;                           //intercept
real beta;                        //slope
}
model {
a ~ normal(0,5);                  //prior for intercept
beta ~ normal(0,5);               //prior for slope
y[1:N_obs] ~ bernoulli_logit(a + beta * x[1:N_obs]);
}
generated quantities {
array[N_mis] int<lower=0, upper=1> y_mis = bernoulli_logit_rng(a + beta * x[N_obs+1:N]);
}
'
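If you prefer, the same imputation can also be done after fitting, in plain R, from the posterior draws of a and beta. A sketch under stated assumptions: a_draws and beta_draws stand in for rstan::extract(m)$a and rstan::extract(m)$beta, and x_mis is hypothetical predictor values where y is missing.

```r
set.seed(123)
a_draws    <- rnorm(100, 0, 0.1)    # stand-ins for rstan::extract(m)$a
beta_draws <- rnorm(100, 1, 0.1)    # stand-ins for rstan::extract(m)$beta
x_mis <- c(-0.5, 0.2, 1.1)          # predictor values where y is missing

# one row per posterior draw, one column per missing outcome
p_mis <- plogis(sapply(x_mis, function(xm) a_draws + beta_draws * xm))
y_imp <- matrix(rbinom(length(p_mis), 1, p_mis), nrow = nrow(p_mis))
```

Each column of y_imp is then a posterior predictive sample for one missing outcome, mirroring the bernoulli_logit_rng call in generated quantities.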