In the credit risk industry (and the finance industry as a whole, at least here in the UK), there is a very common and widely accepted 'proper' way to build scorecards.
The general framework seems to be:
- Binning your predictors, merging neighboring bins with similar Weight of Evidence (WOE) values, generally aiming for a monotonic relationship between predictor and target
- Selecting variables by filter methods: calculating the Information Value (IV) of each predictor and removing those with low IV
- WOE-transforming the surviving predictors (a form of target encoding) and fitting a logistic regression model to the transformed data
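To fix notation, these are the definitions I'm working with (sign conventions vary between sources): for bin $i$ containing $g_i$ goods and $b_i$ bads, out of $G$ goods and $B$ bads overall,

$$\mathrm{WOE}_i = \ln\frac{g_i/G}{b_i/B} = \ln\frac{g_i}{b_i} - \ln\frac{G}{B}, \qquad \mathrm{IV} = \sum_i \left(\frac{g_i}{G} - \frac{b_i}{B}\right)\mathrm{WOE}_i,$$

so each bin's WOE is its log-odds of being good, shifted by a constant that is the same for every bin. I lean on that identity below.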
Does anyone know why this practice is followed? It seems like a dated and unusually specific approach, and I never see the blanket recipe of binning, WOE encoding, and IV filtering used in any other industry. The equivalent for regression tasks would be to bin and mean-encode all predictors before fitting a linear regression, but I have never seen or heard of that being done anywhere, including in credit risk.
I expand on my confusion below:
- Why binning? Binning numeric predictors discards information and adds a lot of arbitrariness and manual work to the process. It does help deal with outliers and missing values with less thought, but so do missing-value flags, median imputation, and winsorization, all of which are easy to automate and still leave room for spline terms, interaction terms, etc. (see the pipeline sketch after this list)
- Why is Information Value used? If filter methods are genuinely needed for variable selection (e.g. you have thousands of predictors), why not evaluate the performance (e.g. the out-of-sample AUC) of many one-variable logistic regression models and filter on that, which has the benefit of aligning with the target metric? (see the AUC-filter sketch below)
- For a bin, the WOE is just that bin's log-odds plus a constant (per the definitions above), so I suppose it is a form of target encoding suited to logistic regression. That makes some sense to me once the data are already binned (if target encoding actually improves performance), but it still leaves me wondering why we binned the predictors in the first place. Most sources say it is to 'establish a monotonic relationship' between predictor and target, but plain logistic regression, without nonlinear transformations or interaction terms, already gives a monotonic relationship in each predictor, no? To me this just complicates model interpretation compared with fitting logistic regression on the raw predictors, and using any form of target encoding means being that much more careful about target leakage (see the out-of-fold sketch below)
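For the first point, here is a minimal sketch of the automated alternative I have in mind, using scikit-learn (the column names are hypothetical, and I've left winsorization out for brevity):

```python
# Missing-value flags + median imputation + spline terms, with no manual binning.
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

numeric_cols = ["age", "income", "utilisation"]  # hypothetical predictors

preprocess = ColumnTransformer([
    # median-impute, then a spline basis in place of manual bins
    ("splines", make_pipeline(SimpleImputer(strategy="median"),
                              SplineTransformer(degree=3, n_knots=5)), numeric_cols),
    # separate 0/1 flags marking which values were missing
    ("miss_flags", MissingIndicator(features="all"), numeric_cols),
])

model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
# model.fit(X[numeric_cols], y)   # X: pandas DataFrame, y: 0/1 default flag
```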
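For the second point, the AUC filter I'm describing is something like this (again just a sketch; `X` is a pandas DataFrame of predictors and `y` the 0/1 target, and the 0.55 threshold is arbitrary):

```python
# Score each predictor by the cross-validated AUC of a one-variable
# logistic regression, then filter on that score.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def single_variable_auc(X, y, cv=5):
    """Out-of-sample AUC of a one-variable logistic regression, per column."""
    model = make_pipeline(SimpleImputer(strategy="median"),
                          LogisticRegression(max_iter=1000))
    return {col: cross_val_score(model, X[[col]], y, cv=cv,
                                 scoring="roc_auc").mean()
            for col in X.columns}

# e.g. keep predictors whose out-of-sample AUC clears a threshold
# selected = [c for c, auc in single_variable_auc(X, y).items() if auc > 0.55]
```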
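And on leakage: the usual precaution with any target encoding, WOE included, is something like out-of-fold encoding, where each row's WOE is computed only from the other folds' data. A sketch of what I mean (`bins` is an already-binned column with missings assigned to their own bin, `y` the 0/1 target, and the smoothing constant is arbitrary):

```python
# Out-of-fold WOE encoding: each row's WOE comes from the other folds only,
# so no row's own target value feeds into its encoded feature.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_woe_encode(bins, y, n_splits=5, smoothing=0.5):
    cats = bins.unique()
    woe = pd.Series(np.nan, index=bins.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(bins):
        tr_bins, tr_y = bins.iloc[train_idx], y.iloc[train_idx]
        # smoothed good/bad counts per bin, from the training folds only
        goods = tr_bins[tr_y == 0].value_counts().reindex(cats, fill_value=0) + smoothing
        bads = tr_bins[tr_y == 1].value_counts().reindex(cats, fill_value=0) + smoothing
        woe_map = np.log((goods / goods.sum()) / (bads / bads.sum()))
        woe.iloc[test_idx] = bins.iloc[test_idx].map(woe_map).to_numpy()
    return woe
```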
There must be a reason that this approach seems to be used in all the big banks and financial institutions but is rarely used elsewhere when preprocessing data for logistic regression. Are there any obvious strengths or reasons that I'm missing, or are there historical reasons for how this approach became so universal?