Consider the plot below, in which I simulated data as follows. We look at a binary outcome $y_{obs}$ for which the true probability of being 1 is indicated by the black line. The functional relationship between a covariate $x$ and $p(y_{obs}=1 | x)$ is a 3rd-order polynomial with a logistic link (so it is non-linear in two ways).
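For reference, here is a minimal sketch of how data like this could be simulated. The coefficients, sample size, and the Beta-distributed design that makes $x$ sparse on the left are illustrative assumptions, not the values behind the plotted data set (the actual data is linked at the bottom of the post):

set.seed(1)
n <- 100
x <- rbeta(n, 2, 1)                  # right-skewed design, so x is sparse on the left (assumed)
eta <- -1 + 8*x - 21*x^2 + 14*x^3    # assumed 3rd-order polynomial on the link scale
p_true <- plogis(eta)                # logistic link: true p(y_obs = 1 | x)
y_obs <- rbinom(n, size=1, prob=p_true)
data <- data.frame(x=x, y_obs=y_obs)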
The green line is the GLM logistic regression fit, where $x$ is entered as a 3rd-order polynomial. The dashed green lines are the 95% confidence intervals around the prediction $p(y_{obs}=1 | x, \hat{\beta})$, where $\hat{\beta}$ are the fitted regression coefficients. I used R's glm and predict.glm for this.
Similarly, the purple line is the posterior mean with the 95% credible interval for $p(y_{obs}=1 | x, \beta)$ from a Bayesian logistic regression model with a uniform prior. I used the package MCMCpack with the function MCMClogit for this (setting B0=0 gives a uniform, uninformative prior).
The red dots denote observations in the data set for which $y_{obs}=1$; the black dots are observations with $y_{obs}=0$. Note that, as is common in classification / discrete analysis, $y_{obs}$ is observed but $p(y_{obs}=1 | x)$ is not.
Several things can be seen:
- I deliberately simulated $x$ to be sparse on the left-hand side. I want the confidence and credible intervals to become wide here due to the lack of information (observations).
- Both predictions are biased upward on the left. This bias is caused by the four red points denoting $y_{obs}=1$ observations, which wrongly suggest that the true functional form goes up here. The algorithm has insufficient information to conclude that the true functional form bends downward.
- The confidence interval widens as expected, whereas the credible interval does not. In fact, the confidence interval ends up enclosing the entire range of possible probabilities, as it should given the lack of information.
It seems the credible interval is wrong / too optimistic here for part of the range of $x$. It is really undesirable behavior for a credible interval to become narrow where information is sparse or absent altogether. Usually this is not how a credible interval reacts. Can somebody explain:
- What are the reasons for this?
- What steps can I take to arrive at a better credible interval? (That is, one that at least encloses the true functional form, or, better, becomes as wide as the confidence interval.)
The code used to obtain the prediction intervals in the graphic is printed here:
# Frequentist fit: logistic GLM with a 3rd-order polynomial in x
fit <- glm(y_obs ~ x + I(x^2) + I(x^3), data=data, family=binomial)
x_pred <- seq(0, 1, by=0.01)
# Predictions with standard errors on the link (log-odds) scale
pred <- predict(fit, newdata = data.frame(x=x_pred), se.fit = TRUE)
plot(plogis(pred$fit), type='l')
# 95% Wald confidence band: +/- 1.96 SE on the link scale, then transformed with plogis
matlines(plogis(pred$fit + pred$se.fit %o% c(-1.96, 1.96)), type='l', col='black', lty=2)

# Bayesian fit via MCMCpack; B0=0 (the default) gives the flat, uninformative prior
library(MCMCpack)
mcmcfit <- MCMClogit(y_obs ~ x + I(x^2) + I(x^3), data=data)
gibbs_samps <- as.mcmc(mcmcfit)
# Design matrix at the prediction points
x_pred_dm <- model.matrix(~ x + I(x^2) + I(x^3), data=data.frame('x'=x_pred))
# Linear predictor for every posterior draw (rows = prediction points, columns = draws)
gibbs_preds <- apply(gibbs_samps, 1, `%*%`, t(x_pred_dm))
# Pointwise 95% credible interval: quantiles across draws, then transformed with plogis
gibbs_pis <- plogis(apply(gibbs_preds, 1, quantile, c(0.025, 0.975)))
matlines(t(gibbs_pis), col='red', lty=2)
Data access: https://pastebin.com/1H2iXiew (thanks @DeltaIV and @AdamO).
