I have a logistic regression model used in credit scoring, and I am studying the metrics used in the credit scoring field to evaluate the quality of the model. My ultimate objective is to understand whether the same metrics could work for other ML models, for example a random forest. Based on what I have read, I think this is the standard pipeline for credit scoring:
1. Select the predictive variables for the model.
2. If a variable is numerical, split it into bins.
3. For each variable $X$, compute the weight of evidence (WoE) of each bin $x_j$:
$$\mathrm{WoE}_j = \log\dfrac{P(X=x_j \mid Y=1)}{P(X=x_j \mid Y=0)}$$
4. Replace the original variables by their WoE values and build the model on them.

Once the model is built:

- Evaluate the predictive power of each variable using the Information Value (IV):
$$\mathrm{IV} = \sum_j \left(P(X=x_j \mid Y=1) - P(X=x_j \mid Y=0)\right)\log\dfrac{P(X=x_j \mid Y=1)}{P(X=x_j \mid Y=0)}$$
- Apply a rule of thumb such as:
  - IV < 0.02: variable is not predictive
  - 0.02 < IV < 0.3: predictive, but needs review
  - IV > 0.3: predictive variable
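To make sure I have the mechanics right, this is how I currently compute the WoE and IV for one already-binned variable (a minimal pandas sketch on made-up data; the smoothing constant `eps` is my own addition to avoid $\log(0)$ on empty bins):

```python
import numpy as np
import pandas as pd

def woe_iv(x, y, eps=0.5):
    """Per-bin WoE and total IV for a binned variable x against a
    binary target y (1 = event, 0 = non-event).
    eps is a smoothing count so empty bins do not produce log(0)."""
    df = pd.DataFrame({"x": x, "y": y})
    grouped = df.groupby("x")["y"].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]
    # Distribution of events / non-events across bins, with smoothing
    n_bins = len(grouped)
    p_event = (grouped["events"] + eps) / (grouped["events"].sum() + eps * n_bins)
    p_non = (grouped["non_events"] + eps) / (grouped["non_events"].sum() + eps * n_bins)
    woe = np.log(p_event / p_non)           # one WoE value per bin
    iv = ((p_event - p_non) * woe).sum()    # single IV for the variable
    return woe, iv

# Hypothetical example: bin "a" has a much higher event rate than bin "b"
x = ["a"] * 50 + ["b"] * 50
y = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
woe, iv = woe_iv(x, y)
print(woe)  # positive for "a", negative for "b"
print(iv)
```

By the rule of thumb above, this toy variable would come out as clearly predictive.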
Now I have some questions about all of this:
First question: why should we use the Information Value to decide whether a variable is important for our model? Why not use standard inference from the logistic regression itself (e.g., the significance of its coefficients)?
Second question: I have read on many sites that using WoE and IV helps build a linear relationship with the output of the logistic regression model. I do not know what this means. Logistic regression is built on the equation:
$y=\dfrac{1}{1+e^{-x^t\beta}}$
So I guess that by using the WoE instead of the original variables, the logarithm in the WoE "cancels out" with the exponential term in the formula of the model. But I do not know how to interpret the output, or why this is beneficial.
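To be concrete, this is the kind of setup I mean: a minimal sklearn sketch where the single feature is already a per-row WoE value. The data is simulated by me (not real credit data), and I deliberately generate the target so that the log-odds equals the WoE:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one binned variable already replaced by its bin's WoE value.
# The four WoE levels are hypothetical, chosen just for illustration.
rng = np.random.default_rng(0)
woe_values = rng.choice([-1.2, -0.3, 0.4, 1.1], size=1000)

# Simulate the target so that the log-odds is exactly the WoE
p = 1 / (1 + np.exp(-woe_values))
y = rng.binomial(1, p)

# Large C to effectively disable sklearn's default L2 regularization
model = LogisticRegression(C=1e6).fit(woe_values.reshape(-1, 1), y)
print(model.coef_)  # in this simulation the coefficient comes out close to 1
```

Is this recovery of a coefficient near 1 what the "linear relationship" claim refers to?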
Third (and last) question: I have read that, because of how it is defined, the IV is not recommended outside of logistic regression, so for example it would not properly select important variables for a random forest model. Are there any alternatives to the IV in that context? Maybe just the variable importance provided by the random forest itself?
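For reference, this is the variable importance I am referring to (a minimal sklearn sketch on made-up data, where the first column is informative and the second is pure noise):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: x0 drives the target, x1 is noise (both hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)  # impurity-based importances, sum to 1
```

Would ranking variables by `feature_importances_` play the same role that the IV plays in the logistic regression pipeline?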