I have a large data set with binary outcomes $\vec{y} = (y_1,\dots,y_n)$ with $y_i\in\{0,1\}$ and covariates $\mathbf{X} = (\vec{x}_1, \dots ,\vec{x}_n)^\top$. One of the goals of the model I am fitting is to forecast the risk-weighted sum
$$r = \sum_{i=1}^{n_{\text{future}}} z_i \widetilde{y}_i$$
where the $\widetilde{y}_i\in\{0,1\}$ for $i = 1,\dots, n_{\text{future}}$ are unknown future outcomes, and I know the current covariates, the $\widetilde{\vec{x}}_i$s, and the $z_i$s. I want to provide a 90% prediction interval for $r$.
I have fitted a logistic regression so far. A naive approach would be to compute the predicted probabilities, the $\widehat{\widetilde{p}}_i$s, simulate the $\widetilde{y}_i$s, compute $r$, and find the 5% and 95% quantiles. As shown below, this fails to account for the uncertainty in the parameter estimates and yields coverage far from the nominal level (though the example uses a small sample).
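For concreteness, here is a minimal sketch of the naive plug-in interval. The names are hypothetical: `fit` is a fitted `glm` object, `new_dat` a data frame with the future covariates, and `z` the weight vector (the `sim_1` part of the full example at the end does the same thing):

# plug-in interval: simulate outcomes from the estimated probabilities
p_hat <- predict(fit, new_dat, type = "response")
r_sim <- replicate(1000, sum(z * (runif(length(p_hat)) < p_hat)))
quantile(r_sim, c(.05, .95))  # naive 90% prediction interval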
I have three questions below, highlighted in bold, and a simulation example at the end to clarify the setup and back up my claims.
Good arguments that the parameter uncertainty is not important
What are good arguments that the parameter uncertainty is not important? I have a large data set (more than 1 million observations and 40k cases with 50-100 parameters), and the standard errors of the estimates are quite small. Is there a metric I can use to quantify this?
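One candidate metric (my suggestion, not something established in the question) is a delta-method variance decomposition: compare the variance of $r$ given the estimates, $\sum_i z_i^2 \widehat{\widetilde{p}}_i(1-\widehat{\widetilde{p}}_i)$, with the variance of $E(r)$ induced by the parameter uncertainty. A sketch with the hypothetical `fit`, `new_dat`, and `z` from above:

mm <- model.matrix(terms(fit), new_dat)
p_hat <- predict(fit, new_dat, type = "response")
var_binom <- sum(z^2 * p_hat * (1 - p_hat))  # Var(r | estimates)
g <- crossprod(mm, z * p_hat * (1 - p_hat))  # gradient of E(r) w.r.t. beta
var_param <- drop(t(g) %*% vcov(fit) %*% g)  # delta-method variance of E(r)
var_param / var_binom  # a small ratio suggests parameter uncertainty matters little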
Bootstrap / simulation
Another idea would be to use the large-sample approximation that
$$\widehat{\vec{\beta}} - \vec{\beta} \sim N(\vec{0}, (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1})$$
where $\mathbf{X}$ is the design matrix and $\mathbf{W}$ is the diagonal matrix of working weights at convergence. Thus, one could draw $\widetilde{\vec{\beta}}$ from $N(\widehat{\vec{\beta}}, (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1})$, compute the predicted probabilities, the $\widehat{\widetilde{p}}_i$s, using $\widetilde{\vec{\beta}}$, simulate the $\widetilde{y}_i$s, compute $r$, and find the 5% and 95% quantiles. I gather the results will be very close to a Bayesian credible interval had I used an uninformative prior.
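A minimal, serial sketch of this procedure (again with the hypothetical `fit`, `new_dat`, and `z`; the `sim_2` part of the full example at the end does the same thing with a Cholesky factor instead of `MASS::mvrnorm`):

library(MASS)  # for mvrnorm
mm <- model.matrix(terms(fit), new_dat)
r_sim <- replicate(1000, {
  # draw parameters from the asymptotic normal approximation
  beta_sim <- mvrnorm(1, coef(fit), vcov(fit))
  p_sim <- 1 / (1 + exp(-drop(mm %*% beta_sim)))  # inverse logit
  sum(z * (runif(nrow(mm)) < p_sim))
})
quantile(r_sim, c(.05, .95))  # 90% interval including parameter uncertainty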
The above is nice since I do not have to refit the model. However, I am not sure what the pitfalls might be or whether this is an acceptable method. It does seem to have reasonably close coverage in the example. Further, it is along the lines of the suggestion in chapter 4 of
Wood, Simon N. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.
which is neat since I also have a GAM fit and want to do a similar thing there. I gather an alternative would be a parametric or non-parametric bootstrap, but I would prefer to avoid that due to the computation time (especially for the GAM).
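For reference, this is roughly what the parametric bootstrap I would like to avoid looks like (a sketch with the hypothetical `fit`, `new_dat`, and `z` from above; the refit inside each replication is what makes it costly):

dat_boot <- fit$model  # model frame from the original fit
r_boot <- replicate(200, {
  # simulate new training responses from the fitted probabilities and refit
  dat_boot[[1]] <- runif(nrow(dat_boot)) < fitted(fit)
  fit_boot <- glm(formula(fit), binomial(), dat_boot)
  # propagate the refitted estimates through the outcome simulation
  p_boot <- predict(fit_boot, new_dat, type = "response")
  sum(z * (runif(length(p_boot)) < p_boot))
})
quantile(r_boot, c(.05, .95))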
Example
Here is a simulation example which may make the setup clearer:
#####
# define function to simulate the coverage of the two intervals
sim_coverage <- function(
  formula_fit = y ~ X1 + X2 + X3, spec = 6, n_trials = 1e3){
  #####
  # setup cluster to decrease computation time
  require(parallel)
  cl <- makeCluster(spec)
  on.exit(stopCluster(cl))

  clusterEvalQ(cl, {
    #####
    # parameters for simulation
    n_test <- 10000
    n_train <- 1000
    n_sim <- 1000
    n <- n_test + n_train
    is_test <- 1:n_test
    invisible()
  })
  clusterSetRNGStream(cl)

  #####
  # simulate
  out <- parSapply(cl, 1:n_trials, function(...){
    # simulate covariates, true parameters, outcomes, and weights
    X <- matrix(runif(3 * n, -1, 1), ncol = 3)
    beta. <- runif(3, -1, 1)
    y <- 1/(1 + exp(-(X %*% beta. - 0.5))) > runif(n)
    z <- exp(rnorm(n_test))

    test_dat  <- data.frame(y = y[ is_test], X[ is_test, ])
    train_dat <- data.frame(y = y[-is_test], X[-is_test, ])

    # fit model
    fit <- glm(formula_fit, binomial(), train_dat)

    # simulate outcomes for the weighted sum given the model estimates
    y_hat <- predict(fit, test_dat, type = "response")
    sim_1 <- replicate(n_sim, {
      y_sim <- y_hat > runif(n_test)
      sum(y_sim * z)
    })

    # draw beta and then simulate outcomes
    beta_hat <- fit$coefficients
    cov_chol <- chol(vcov(fit))
    mm <- model.matrix(terms(fit), test_dat)
    sim_2 <- replicate(n_sim, {
      beta_sim <- drop(beta_hat + t(cov_chol) %*% rnorm(length(beta_hat)))
      y_sim <- 1/(1 + exp(-(mm %*% beta_sim))) > runif(n_test)
      sum(y_sim * z)
    })

    # compute whether the 5%-95% intervals cover. The coverage should be 90%
    q1 <- quantile(sim_1, c(.05, .95))
    q2 <- quantile(sim_2, c(.05, .95))
    actual <- sum(test_dat$y * z)
    c(
      `conditional on estimates` = q1[1] <= actual && q1[2] >= actual,
      `simulated parameters`     = q2[1] <= actual && q2[2] >= actual)
  })

  # return coverage
  rowSums(out) / ncol(out)
}
# use the function with the correctly specified model
library(parallel)
set.seed(seed <- 39160006)
sim_coverage(formula_fit = y ~ X1 + X2 + X3)
#R> conditional on estimates     simulated parameters
#R>                    0.552                    0.910

# use the function with a misspecified model
set.seed(seed)
sim_coverage(formula_fit = y ~ X1)
#R> conditional on estimates     simulated parameters
#R>                    0.558                    0.885
My questions are vaguely related to this answer. However, the OP's question is about a single response from an exponential family, not about a variable which is a function of many observations from an exponential family (the function being non-linear w.r.t. the linear predictors).
To clarify, there are `n_test <- 10000` unknown binary responses in the simulation. The number of unknown binary variables and the corresponding known covariates, the $\widetilde{\vec{x}}_i$s, and weights, the $z_i$s, are fixed and not random.