Consider the plot below, in which I simulated data as follows. We look at a binary outcome $y_{obs}$ for which the true probability of being 1 is indicated by the black line. The functional relationship between a covariate $x$ and $p(y_{obs}=1 | x)$ is a 3rd-order polynomial with a logistic link (so it is non-linear in two ways).
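For reference, here is a minimal sketch of how data like this could be simulated. The coefficients, sample size, and the Beta-distributed design that makes $x$ sparse on the left are illustrative assumptions, not the values behind the plotted data set (the actual data is linked at the bottom of the post):

set.seed(1)
n <- 100
x <- rbeta(n, 2, 1)                  # right-skewed design, so x is sparse on the left (assumed)
eta <- -1 + 8*x - 21*x^2 + 14*x^3    # assumed 3rd-order polynomial on the link scale
p_true <- plogis(eta)                # logistic link: true p(y_obs = 1 | x)
y_obs <- rbinom(n, size=1, prob=p_true)
data <- data.frame(x=x, y_obs=y_obs)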
The green line is the GLM logistic regression fit, where $x$ is entered as a 3rd-order polynomial. The dashed green lines are the 95% confidence intervals around the prediction $p(y_{obs}=1 | x, \hat{\beta})$, where $\hat{\beta}$ are the fitted regression coefficients. I used R's glm and predict.glm for this.
Similarly, the purple line is the posterior mean with the 95% credible interval for $p(y_{obs}=1 | x, \beta)$ from a Bayesian logistic regression model with a uniform prior. I used the package MCMCpack with the function MCMClogit for this (setting B0=0 gives a uniform, uninformative prior).
The red dots denote observations in the data set for which $y_{obs}=1$; the black dots are observations with $y_{obs}=0$. Note that, as is common in classification / discrete analysis, $y_{obs}$ is observed but $p(y_{obs}=1 | x)$ is not.
Several things can be seen:
- I deliberately simulated $x$ to be sparse on the left-hand side. I want the confidence and credible intervals to become wide here due to the lack of information (observations).
- Both predictions are biased upward on the left. This bias is caused by the four red points denoting $y_{obs}=1$ observations, which wrongly suggest that the true functional form goes up here. The algorithm has insufficient information to conclude that the true functional form bends downward.
- The confidence interval widens as expected, whereas the credible interval does not. In fact, the confidence interval ends up enclosing the entire range of possible probabilities, as it should given the lack of information.
It seems the credible interval is wrong / too optimistic here for part of the range of $x$. It is really undesirable behavior for a credible interval to become narrow where information is sparse or absent altogether. Usually this is not how a credible interval reacts. Can somebody explain:
- What are the reasons for this?
- What steps can I take to arrive at a better credible interval? (That is, one that at least encloses the true functional form, or, better, becomes as wide as the confidence interval.)
The code used to obtain the prediction intervals in the graphic is printed here:
# Frequentist fit: logistic GLM with a 3rd-order polynomial in x
fit <- glm(y_obs ~ x + I(x^2) + I(x^3), data=data, family=binomial)
x_pred <- seq(0, 1, by=0.01)
# Predictions with standard errors on the link (log-odds) scale
pred <- predict(fit, newdata = data.frame(x=x_pred), se.fit = TRUE)
plot(plogis(pred$fit), type='l')
# 95% Wald confidence band: +/- 1.96 SE on the link scale, then transformed with plogis
matlines(plogis(pred$fit + pred$se.fit %o% c(-1.96, 1.96)), type='l', col='black', lty=2)

# Bayesian fit via MCMCpack; B0=0 (the default) gives the flat, uninformative prior
library(MCMCpack)
mcmcfit <- MCMClogit(y_obs ~ x + I(x^2) + I(x^3), data=data)
gibbs_samps <- as.mcmc(mcmcfit)
# Design matrix at the prediction points
x_pred_dm <- model.matrix(~ x + I(x^2) + I(x^3), data=data.frame('x'=x_pred))
# Linear predictor for every posterior draw (rows = prediction points, columns = draws)
gibbs_preds <- apply(gibbs_samps, 1, `%*%`, t(x_pred_dm))
# Pointwise 95% credible interval: quantiles across draws, then transformed with plogis
gibbs_pis <- plogis(apply(gibbs_preds, 1, quantile, c(0.025, 0.975)))
matlines(t(gibbs_pis), col='red', lty=2)
Data access: https://pastebin.com/1H2iXiew (thanks @DeltaIV and @AdamO).
