I'm using R to run some logistic regression. My variables were continuous, but I used cut() to bucket the data. Some particular buckets for these variables always result in the dependent variable being equal to 1. As expected, the coefficient estimates for these buckets are very high, but the p-values are also high. There are roughly 90 observations in each of these buckets and around 800 observations in total, so I don't think it's a problem of sample size. Also, these variables should not be related to the other predictors, which could otherwise explain the high p-values.
Are there any other plausible explanations for the high p-value?
Example:
myData <- read.csv("application.csv", header = TRUE)
myData$FICO <- cut(myData$FICO, c(0, 660, 680, 700, 720, 740, 780, Inf), right = FALSE)
myData$CLTV <- cut(myData$CLTV, c(0, 70, 80, 90, 95, 100, 125, Inf), right = FALSE)
fit <- glm(Denied ~ CLTV + FICO, data = myData, family=binomial())
Results are something like this:
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.53831  -0.77944  -0.62487   0.00027   2.09771

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.33630    0.23250  -5.747 9.06e-09 ***
CLTV(70,80]   -0.54961    0.34864  -1.576 0.114930
CLTV(80,90]   -0.51413    0.31230  -1.646 0.099715 .
CLTV(90,95]   -0.74648    0.37221  -2.006 0.044904 *
CLTV(95,100]   0.38370    0.37709   1.018 0.308906
CLTV(100,125] -0.01554    0.25187  -0.062 0.950792
CLTV(125,Inf] 18.49557  443.55550   0.042 0.966739
FICO[0,660)   19.64884 3956.18034   0.005 0.996037
FICO[660,680)  1.77008    0.47653   3.715 0.000204 ***
FICO[680,700)  0.98575    0.30859   3.194 0.001402 **
FICO[700,720)  1.31767    0.27166   4.850 1.23e-06 ***
FICO[720,740)  0.62720    0.29819   2.103 0.035434 *
FICO[740,780)  0.31605    0.23369   1.352 0.176236
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1037.43  on 810  degrees of freedom
Residual deviance:  803.88  on 798  degrees of freedom
AIC: 829.88

Number of Fisher Scoring iterations: 16
FICO in the range [0, 660) and CLTV in the range (125, Inf] do indeed always result in Denied = 1, so their coefficients are very large, but why are they also "insignificant"?
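To make the separation concrete, here is a minimal sketch of how one might check it directly (assuming the same myData and cut() definitions as above, and that Denied is coded 0/1):

# Cross-tabulate each bucketed predictor against the outcome
# (assumes myData, FICO and CLTV are defined as in the example above)
with(myData, table(FICO, Denied))
with(myData, table(CLTV, Denied))

# Proportion denied within each bucket; a proportion of exactly 1 (or 0)
# means that bucket's dummy perfectly predicts the outcome (complete separation)
with(myData, tapply(Denied, FICO, mean))
with(myData, tapply(Denied, CLTV, mean))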
…brglm package) -- although I agree that they should think carefully about the cutting first ... – Ben Bolker Jun 09 '14 at 17:50
My post is a simplified example, so it may not be clear from the example alone, but basically there are several "rules" that are supposed to be followed when determining Denial. There are also some ranges (of FICO or CLTV, for example) where Denial becomes discretionary. Bottom line: my client wants to see cut variables because that is most intuitive for them given the guidelines, even though there is also some discretion. – ch-pub Jun 09 '14 at 17:53
brglm implements Firth's penalization, which offsets an $O(n^{-1})$ term in the bias of the MLEs: Firth (1993), "Bias reduction of maximum likelihood estimates", Biometrika, 80, pp. 27–38. I'm curious as to why you say "infinite parameter estimates do not present problems in prediction", because that's where, intuitively, they can be most problematic: your model says the probability is 100% for a given value of the predictor on which separation occurred, whatever the values of any other predictors. – Scortchi - Reinstate Monica Jun 10 '14 at 11:18
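For completeness, a minimal sketch of the Firth-penalized fit suggested in the comments (assuming the brglm package is installed and myData is defined as above; logistf is a common alternative):

# Firth-penalized logistic regression, as suggested in the comments
# (assumes the brglm package is installed and myData is defined as above)
library(brglm)
fit.firth <- brglm(Denied ~ CLTV + FICO, family = binomial(), data = myData)
summary(fit.firth)   # separated buckets now get finite estimates and standard errors
# alternatively: logistf::logistf(Denied ~ CLTV + FICO, data = myData)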