
I have a large data set with binary outcomes $\vec{y} = (y_1,\dots,y_n)$ with $y_i\in\{0,1\}$ and covariates $\mathbf{X} = (\vec{x}_1, \dots ,\vec{x}_n)^\top$. One of the goals of the model I am fitting is to forecast the risk weighted sum

$$r = \sum_{i=1}^{n_{\text{future}}} z_i \widetilde{y}_i$$

where the $\widetilde{y}_i\in\{0,1\}$ for $i = 1,\dots, n_{\text{future}}$ are unknown future outcomes, and I know the current covariates, the $\widetilde{\vec{x}}_i$s, and the weights, the $z_i$s. I want to provide a 90% prediction interval for $r$.

I have fitted a logistic regression so far. A naive approach would be to compute the predicted probabilities, $\widehat{\widetilde{p}}_i$s, simulate the $\widetilde{y}_i$s, compute $r$, and find the 5% and 95% quantiles. As shown below, this fails to account for the uncertainty in the parameter estimates and gives far from the right coverage (though the example uses a small sample).
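For concreteness, here is a condensed sketch of the naive approach (the object names are placeholders: fit is the fitted glm, new_dat holds the future covariates, and z is the vector of known weights; sim_1 in the example at the end does the same thing):

p_hat <- predict(fit, newdata = new_dat, type = "response")         # predicted probabilities
r_sim <- replicate(1000, sum(z * (runif(length(p_hat)) < p_hat)))   # simulate future outcomes and r
quantile(r_sim, c(.05, .95))                                        # naive 90% interval for r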

I have three questions below and a simulation example at the end to clarify the setup and back up my claims.

Good arguments that the parameter uncertainty is not important

What are good arguments that the parameter uncertainty is not important? I have a large data set (more than 1 million observations and 40k cases with 50-100 parameters) and the standard errors of the estimates are quite small. Are there any metrics I can use to quantify this?

Bootstrap / simulation

Another idea would be to use the large sample approximation that

$$\widehat{\vec{\beta}} - \vec{\beta} \sim N(\vec{0}, (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1})$$

where $\mathbf{X}$ is the design matrix and $\mathbf{W}$ is the diagonal matrix of working weights at convergence. Thus, one could draw $\widetilde{\vec{\beta}}$ from $N(\widehat{\vec{\beta}}, (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1})$, compute the predicted probabilities, $\widehat{\widetilde{p}}_i$s, using $\widetilde{\vec{\beta}}$, simulate the $\widetilde{y}_i$s, compute $r$, and find the 5% and 95% quantiles. I gather the results would be very close to a Bayesian credible interval had I used an uninformative prior.
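In code, the only extra step is the parameter draw. Here is a minimal sketch with placeholder objects (fit is the fitted glm, new_dat the future covariates, z the known weights), using MASS::mvrnorm rather than the Cholesky factor used in the example at the end:

library(MASS)
beta_sim <- mvrnorm(1, coef(fit), vcov(fit))                  # one draw from N(beta_hat, vcov)
eta <- drop(model.matrix(terms(fit), new_dat) %*% beta_sim)   # linear predictor under the draw
p_sim <- 1 / (1 + exp(-eta))                                  # predicted probabilities
r_draw <- sum(z * (runif(length(p_sim)) < p_sim))             # one simulated value of r

Repeating this, say, 1000 times and taking the 5% and 95% quantiles of the r_draw values gives the interval; sim_2 in the example below does exactly this.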

The above is nice since I do not have to refit the model. However, I am not sure what the pitfalls might be or whether this is an acceptable method. It does seem to have reasonably close coverage in the example. Further, it is along the lines of the suggestion in chapter 4 of

Wood, Simon N. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.

which is neat since I also have a GAM fit and want to do a similar thing. I gather an alternative would be a parametric or non-parametric bootstrap. However, I would prefer to avoid this due to the computation time (especially for GAM).
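For the GAM case, here is a minimal sketch of how I picture the same thing with mgcv (fit_gam is a placeholder for a binomial mgcv::gam fit, and new_dat, z and n_sim are placeholders as in the sketches above; vcov.gam by default returns the Bayesian posterior covariance matrix, which is what Wood's simulation approach uses):

library(mgcv)
library(MASS)

Xp <- predict(fit_gam, newdata = new_dat, type = "lpmatrix")   # linear predictor matrix for the new data
beta_draws <- mvrnorm(n_sim, coef(fit_gam), vcov(fit_gam))     # draws from the approximate posterior

r_sim <- apply(beta_draws, 1, function(b){
  p <- 1 / (1 + exp(-drop(Xp %*% b)))   # probabilities under this parameter draw
  sum(z * (runif(length(p)) < p))       # simulate outcomes and form the weighted sum
})
quantile(r_sim, c(.05, .95))            # 90% prediction interval for r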

Example

Here is a simulation example which may make the setup clearer:

#####
# define function to estimate coverage; spec is the number of cluster workers
# and n_trials the number of simulated data sets
sim_coverage <- function(
  formula_fit = y ~ X1 + X2 + X3, spec = 6, n_trials = 1e3){
  #####
  # setup cluster to decrease computation time
  require(parallel)
  cl <- makeCluster(spec)
  on.exit(stopCluster(cl))
  clusterEvalQ(cl, {
    #####
    # parameters for the simulation
    n_test  <- 10000
    n_train <- 1000
    n_sim <- 1000

    n <- n_test + n_train
    is_test <- 1:n_test

    invisible()
  })
  clusterSetRNGStream(cl)

  #####
  # simulate
  out <- parSapply(cl, 1:n_trials, function(...){
    # simulate variables
    X <- matrix(runif(3 * n, -1, 1), ncol = 3)
    beta. <- runif(3, -1, 1)
    y <- 1/(1 + exp(-(X %*% beta. - 0.5))) > runif(n) 
    z <- exp(rnorm(n_test))

    test_dat  <- data.frame(y = y[ is_test], X[ is_test, ])
    train_dat <- data.frame(y = y[-is_test], X[-is_test, ])

    # fit model
    fit <- glm(formula_fit, binomial(), train_dat)

    # simulate outcomes given the point estimates and compute the weighted sum
    y_hat <- predict(fit, test_dat, type = "response")
    sim_1 <- replicate(n_sim, {
      y_sim <- y_hat > runif(n_test)
      sum(y_sim * z)
    })

    # draw beta from its estimated asymptotic distribution and then simulate outcomes
    beta_hat <- fit$coefficients
    cov_chol <- chol(vcov(fit))
    mm <- model.matrix(terms(fit), test_dat)
    sim_2 <- replicate(
      n_sim, {
        beta_sim <- drop(beta_hat + t(cov_chol) %*% rnorm(length(beta_hat))) 
        y_sim <- 1/(1 + exp(-(mm %*% beta_sim))) > runif(n_test)
        sum(y_sim * z)
      })

    # check whether the 90% intervals (5% to 95% quantiles) cover the realized value
    q1 <- quantile(sim_1, c(.05, .95))
    q2 <- quantile(sim_2, c(.05, .95))

    actual <- sum(test_dat$y * z)
    c(
      `conditional on estimates` = q1[1] <= actual && q1[2] >= actual, 
      `simulated parameters`     = q2[1] <= actual && q2[2] >= actual)
  })

  # return coverage
  rowSums(out) / ncol(out)
}

# use function with correctly specified model
library(parallel)
set.seed(seed <- 39160006)
sim_coverage(formula_fit = y ~ X1 + X2 + X3)
#R> conditional on estimates     simulated parameters 
#R>                    0.552                    0.910 

# use function with misspecified model
set.seed(seed)
sim_coverage(formula_fit = y ~ X1)
#R> conditional on estimates     simulated parameters 
#R>                    0.558                    0.885 

My questions are vaguely related to this answer. However, the OP's question is about a single response from an exponential family and not about a variable which is a function of many observations from an exponential family (the function being non-linear w.r.t. the linear predictors).

  • I sympathize a lot with the Bayesian approach, especially when the number of samples at each factor level combination is not large. If you've got a large dataset, you can check on separate test data (i.e. not used in the model development) whether those predicted to have 0 to 5%, >5 to 10% etc. probability really do have those event rates (or at least fall within the prediction interval). – Björn Mar 25 '18 at 07:14
  • Thanks @Björn. Your answer is that you think my second approach is valid. I don't get how the latter part answers my questions. It seems like you talk about multiple (binary) responses whereas I talk about a confidence interval for a single variable which is a function of binary variables. I get that I can use e.g. cross validation to check that the method works. This is not what I am asking for, though. – Benjamin Christoffersen Mar 27 '18 at 09:45
  • You were asking about evaluation metrics; that would be an obvious one for prediction. – Björn Mar 27 '18 at 10:29
  • True -- good point! I was looking more for something I could compute directly from the fit, e.g., by looking at $\mathbf{X}^\top\mathbf{W}\mathbf{X}$. – Benjamin Christoffersen Mar 27 '18 at 10:52
  • What is the upper limit of the risk weighted summation? If I read your code correctly, it's $n_{\text{test}}=1000$ in your example, and it's a fixed number, not a random variable. – DeltaIV Mar 29 '18 at 16:24
  • @DeltaIV The upper bound for $r$ is the sum of the $z_i$s. There are n_test <- 10000 unknown binary responses in the simulation. The number of unknown binary variables and corresponding known covariates $\widetilde{\vec{x}}_i$s and weights $z_i$s is fixed and not random. – Benjamin Christoffersen Mar 30 '18 at 10:10
  • @BenjaminChristoffersen thanks. What about the weights $z_i$, instead? From your code it seems to me they are log-normal random variables, iid. Is that correct? Do you need the prediction interval to incorporate the uncertainty in the weights? – DeltaIV Mar 30 '18 at 12:26
  • @DeltaIV No, those are known in advance. I simulate them in the code only to get weights that differ. – Benjamin Christoffersen Mar 30 '18 at 13:35
  • Excellent! Then I think it's fairly easy to answer your question. I'm a bit busy these days, so I'll write a draft answer now. Let me know if I got the gist of the issue right: if so, then I'll refine it in the next days. – DeltaIV Mar 30 '18 at 14:04
  • By the way, when you generate the future outcomes $\widetilde{y}_i\in\{0,1\}$ for $i = 1,\dots, n_{\text{future}}$, the $\widetilde{\vec{x}}_i$ are fixed, right? They don't vary from one experiment to another. If they do, the question is ill-posed - we need to know the distribution of the $\widetilde{\vec{x}}_i$ to have a chance of answering. – DeltaIV Mar 30 '18 at 14:34
  • Yes, the $\widetilde{\vec{x}}_i$s are fixed when I generate the $\widetilde{y}_i$s. The $\widetilde{\vec{x}}_i$s are known, and we assume the probability that each $\widetilde{y}_i$ equals 1 depends deterministically on the corresponding $\widetilde{\vec{x}}_i$. – Benjamin Christoffersen Mar 30 '18 at 15:20

0 Answers