I have a dataset with more than 20 predictors and a single binary response variable. With only $n=181$ observations (64 deaths, 117 survivors), I decided to fit a penalized logistic regression with all predictors included (so that I avoid the problems associated with model selection). Nevertheless, I also have to produce a "simpler" model (i.e. one that is simple enough to be suitable for a nomogram-style hand calculation in a clinical setting). To that end, I intend to use rms's fastbw.
To exemplify my questions, I'll use the support dataset from Hmisc:
library( rms )
getHdata( support )
fit <- lrm( hospdead ~ rcs( age ) + sex + rcs( meanbp ) + rcs( crea ) + rcs( ph ) + rcs( sod ), data = support, x = TRUE, y = TRUE )
fit
First, I apply penalization:
p <- pentrace( fit, seq( 0, 10, by = 0.01 ) )
plot( p )
fitPen <- update( fit, penalty = p$penalty )
fitPen
I hope I'm correct up to this point.
Next, I validate the model and calculate its calibration curve. If I understand it correctly, I shouldn't validate/calibrate the simpler model directly; rather, I have to run the necessary functions on the original model, but with bw=TRUE. That is:
validate( fitPen, B = 1000, bw = TRUE )
plot( calibrate( fitPen, B = 1000, bw = TRUE ) )
Question #1: Am I correct in this? I.e., is it true that to get the simpler model's validation/calibration I have to run these functions not on the simpler model, but on the original one (with bw=TRUE)? And will the results then pertain to the simpler model, even though I haven't run validation/calibration on the simpler model itself?
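As a side note on inspecting what the bootstrap selection actually did, my understanding (please correct me if I'm wrong) is that validate with bw=TRUE records which factors were retained in each resample, so one can at least check how stable the selection is:

```r
## Hedged sketch: with bw = TRUE, validate() prints (and, if I read the rms
## documentation correctly, stores in the "kept" attribute) which factors were
## retained in each bootstrap resample.
v <- validate( fitPen, B = 200, bw = TRUE )
v                    # includes a summary of factors retained per resample
attr( v, "kept" )    # logical matrix: resamples x factors
```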
Next, I try to come up with the simpler model explicitly. Interestingly, Harrell (1998) uses a method that is based on calculating the logits for the observations, then modeling them with OLS, and then narrowing that model with fastbw. Although this is surely my statistical shortcoming, I simply can't understand why this is necessary.
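For concreteness, the approximation approach as I understand it would look something like the following sketch (my own reconstruction, not the paper's code verbatim; note that the rows used for the OLS fit must match the complete cases used by the original lrm fit):

```r
## Sketch of the logit-approximation approach (my reconstruction): take the
## penalized model's linear predictor as the response of an OLS model on the
## same predictors, then run fastbw on that OLS fit.
vars <- c( "hospdead", "age", "sex", "meanbp", "crea", "ph", "sod" )
sup  <- support[ complete.cases( support[ , vars ] ), ]   # same rows as the lrm fit

lp <- predict( fitPen )   # predicted logits from the penalized fit
fitOls <- ols( lp ~ rcs( age ) + sex + rcs( meanbp ) + rcs( crea ) +
               rcs( ph ) + rcs( sod ), data = sup, sigma = 1 )
fastbw( fitOls, aics = 10000 )   # rank predictors by loss in approximation R^2
```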
Question #2: Why can't we directly use fastbw on the logistic regression model? Such as:
bw <- fastbw( fitPen )
bw
## Note: bw$names.kept contains the bare variable names ("age", not "rcs(age)"),
## so any spline terms have to be restored by hand in the reduced formula.
fitApprox <- lrm( as.formula( paste( "hospdead ~",
                  paste( bw$names.kept, collapse = " + " ) ) ),
                  data = support, x = TRUE, y = TRUE )
And finally, I am not completely sure where I should apply penalization in the whole process.
Question #3: Should I penalize the original model, then run fastbw (see above), and then re-penalize the obtained model? I.e.
p <- pentrace( fitApprox, seq( 0, 10, by = 0.01 ) )
plot( p )
fitApproxPen <- update( fitApprox, penalty = p$penalty )
fitApproxPen
Or do I not have to re-penalize the narrowed model? Or do I not have to penalize the original model at all, it being sufficient to penalize the simpler one? (I suspect that the very first option is the correct one, but I'm not entirely sure.)
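Whichever variant is correct, the end product would then be turned into the nomogram mentioned at the start; a minimal sketch, assuming the re-penalized reduced model fitApproxPen from the code above:

```r
## Hypothetical final step: turn the reduced (re-penalized) model into a
## nomogram for hand calculation; fun = plogis maps the logit scale to a
## probability. nomogram() needs a datadist object to know predictor ranges.
ddist <- datadist( support )
options( datadist = "ddist" )
plot( nomogram( fitApproxPen, fun = plogis,
                funlabel = "Probability of in-hospital death" ) )
```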
Comments:

- Do validate/calibrate (with bw=TRUE) also not take penalization into account? (If fastbw doesn't.) If so, does that mean that the results obtained with validate/calibrate will be misleading in this sense? – Tamas Ferenci Jun 01 '15 at 08:34
- validate and calibrate do take penalization into account, by making the assumption that the optimum penalty is a constant -- the penalty found from running pentrace on the original sample. Besides considering data reduction (masked to $Y$), consider whether the whole exercise is going to yield estimates that have sufficient precision in light of my initial comments. – Frank Harrell Jun 01 '15 at 12:25
- So validate/calibrate extract the value of the penalty from the model that was passed to them? I didn't realize that (although it'd have been easy to check in the source code, I'm sorry). Thank you, everything is clear now! – Tamas Ferenci Jun 02 '15 at 16:53