4

I am working on the problem of loan application acceptance/rejection. I have historical data of about 500K applications and about 70K loans that got funded out of these applications for various loan products and their performance histories. I want to build a predictive model based on this to evaluate and accept/reject future loan applications, and if accepted, which loan products to offer to the borrower.

There are various loan products that could be offered to the borrower depending on the borrower's credit ratings and other borrower metrics. The loan products come with fixed loan amounts, interest rates, loan terms and origination fees. For example, the loan products could look like this:

term_in_months,amount,interest_rate,origination_fee
48,45000,13,750
48,45000,19,750
60,45000,18,900
36,25000,23,275
48,25000,28,500
24,10000,35,100

When a loan application comes in, we need to see if it qualifies for any of our loan products. If you look at the example above, the first and the second products have the same terms, amounts and origination fees, but the first one has a lower interest rate, so higher borrower ratings would be required for the first product.

My first question is how to translate these four (outcome) variables into a single variable so that the different loan products can be ordered. This would be a way of measuring the quality of the loan product.

Also, I am reading Siddiqi's Credit Risk Scorecards, which seems to be written for managers and not developers/modelers. Can someone suggest better references, or how to approach this problem from a practitioner's perspective? How does a company like, say, Lending Club solve this problem?

PS: I have a decent background in statistics, machine learning and R and several years of programming experience, but have difficulty following theoretical Math. For example, I could easily follow Hastie et al's "Introduction to Statistical Learning" and Kuhn's "Applied Predictive Modeling", but not Hastie et al's "Elements of Statistical Learning" or Bishop's "Pattern Recognition and Machine Learning".

arun
  • 390
  • Quick question - how did the bank set the initial product rates? – 114 Feb 13 '17 at 17:38
  • There is a risk management team that decides on rate, term, etc. These keep changing depending on market conditions. Unfortunately I don't have visibility into how these are decided. – arun Feb 13 '17 at 19:54

3 Answers

3

As Luminita said, you need to split it into 2 problems:

a) prediction of default

Two common approaches are logistic regression (predicting defaults that occur within, e.g., 1 year after issuance) and survival models (a Cox model or a discrete survival model, i.e. logistic regression applied to predict the probability of surviving a given month after issuance). It is commonly accepted that there is a term structure to defaulting (e.g. a borrower is unlikely to default after 2/3 of the loan is paid off). There are lots of blogs analysing Lending Club investments, e.g. http://peerlendingserver.com/uncategorized/1310/ (keywords: IRR, default, vintage).
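To make the discrete survival idea concrete, here is a minimal sketch, assuming each loan is recorded as a hypothetical `(loan_id, months_observed, defaulted)` tuple. It expands loans into one row per month at risk and computes the empirical monthly default rate. This is only the covariate-free special case; a logistic regression fitted on the same loan-month rows, with borrower features added, would be the discrete survival model described above.

```python
from collections import defaultdict


def to_loan_months(loans):
    """Expand each loan into one (loan_id, month, event) row per month at risk.

    Assumption: a defaulted loan defaults in its last observed month;
    a non-defaulted loan is simply censored after its last observed month.
    """
    rows = []
    for loan_id, months_observed, defaulted in loans:
        for month in range(1, months_observed + 1):
            event = int(defaulted and month == months_observed)
            rows.append((loan_id, month, event))
    return rows


def monthly_hazard(rows):
    """Empirical hazard: defaults in month m divided by loans at risk in month m."""
    at_risk = defaultdict(int)
    events = defaultdict(int)
    for _, month, event in rows:
        at_risk[month] += 1
        events[month] += event
    return {m: events[m] / at_risk[m] for m in sorted(at_risk)}


# Made-up toy data, for illustration only.
loans = [("A", 3, True), ("B", 5, False), ("C", 2, True), ("D", 5, False)]
rows = to_loan_months(loans)
hazard = monthly_hazard(rows)
print(hazard)
```

Plotting the hazard by month since issuance on real data is one way to see the term structure of defaulting mentioned above.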

b) optimisation of loan offer

Here you take the model of defaulting (e.g. the probability of defaulting this month, given no default in previous months, is constant) and calculate the bank's IRR on the loan, taking into account defaulting and Loss Given Default (LGD). You might then specify that on average you want to achieve an x% return on each loan, and work out, for a given interest rate, what default probability will achieve this.
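A rough sketch of that IRR calculation follows. All of the modelling choices here are assumptions for illustration and do not come from any bank's actual method: a constant monthly hazard, level monthly payments on a declining balance, the origination fee collected at disbursement, and a recovery of `(1 - lgd)` times the outstanding balance in the month of default.

```python
def level_payment(principal, annual_rate, term_months):
    """Monthly payment on a standard amortising (declining-balance) loan."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -term_months)


def balance_after(principal, annual_rate, term_months, payments_made):
    """Outstanding balance after a given number of payments."""
    r = annual_rate / 12
    pmt = level_payment(principal, annual_rate, term_months)
    growth = (1 + r) ** payments_made
    return principal * growth - pmt * (growth - 1) / r


def expected_cashflows(principal, annual_rate, term_months, fee, hazard, lgd):
    """Expected lender cash flows, month 0 to term_months."""
    pmt = level_payment(principal, annual_rate, term_months)
    flows = [fee - principal]  # month 0: disburse principal, collect fee
    survival = 1.0  # probability the loan has not defaulted so far
    for month in range(1, term_months + 1):
        balance = balance_after(principal, annual_rate, term_months, month - 1)
        p_default_now = survival * hazard
        survival *= 1 - hazard
        # Survivors pay the instalment; defaulters yield a partial recovery.
        flows.append(survival * pmt + p_default_now * (1 - lgd) * balance)
    return flows


def monthly_irr(flows, lo=-0.5, hi=1.0, iterations=200):
    """Monthly IRR by bisection; assumes NPV falls as the rate rises."""
    def npv(rate):
        return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

    for _ in range(iterations):
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


safe = monthly_irr(expected_cashflows(45000, 0.13, 48, 0, 0.0, 0.6))
risky = monthly_irr(expected_cashflows(45000, 0.13, 48, 0, 0.005, 0.6))
print(f"no default: {12 * safe:.4f}  with 0.5%/month hazard: {12 * risky:.4f}")
```

With no defaults and no fee, the IRR recovers the contract rate, which is a useful sanity check. To find the default probability that achieves a target return, one could search over `hazard` until the IRR hits the target.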

A book that might give you some insights is 'Modeling Structured Finance Cash Flows with Microsoft Excel'. I am sure there are [much] better books.

seanv507
  • 6,743
  • Thx, your reference looks great. Can you elaborate a bit on reject inference as well, as to how banks typically implement it? – arun Sep 30 '16 at 17:38
  • I don't know too much about that I'm afraid. – seanv507 Sep 30 '16 at 18:12
  • Thx. I was split between your answer and Luminita's about whom to give the bounty to. Since the other answer came in first and had valid points, I gave the bounty to that answer. In any case, I don't think the problem itself is that straightforward, since it involves a lot more elements like reject inference, available capital, minimum required capital reserves, prime vs. near-prime vs. sub-prime lending, peer-to-peer vs. non peer-to-peer lending, to name a few. – arun Oct 02 '16 at 20:30
2

I see two research aims here:

1) Building a prediction model to help you evaluate whether future loan requests should be accepted/rejected;

2) Building an optimization algorithm to identify which loan product you should offer to the accepted requests.

1) For prediction, I would build multiple prediction models and assess their performance. Then, select the best model or combine models. In my experience, non-parametric models perform best, but it does not hurt to also try a logistic regression. In any case, try a bootstrap aggregating algorithm (such as random forests) and a boosting algorithm (xgboost is very trendy nowadays). Usually, the choice of model lies between these two types of algorithms, depending on whether you need to reduce variance or bias.

2) For the optimization algorithm you focus on the data related to the past granted loans and how they performed. To take into account all four dimensions you indicate, I see two options: First, ask the bank to give you the total compounded values of the loans (compounded principal + interest + fees). There are different methods of computing the monthly installments (the most commonly used ones are declining balance and flat rates) and this results in different compounded loan values. As there are multiple loan products, there is no way you can compare them, unless you have this compounded value.

Second, if the bank does not help, you can break down the dataset into multiple subsets, one for each loan product, and do a separate analysis for each loan product by computing the Future Value of the installments yourself, as of the time the loan is fully repaid (i.e. at 48 months for the first two observations, at 60 months for the third observation, and so on). This will combine amount, interest_rate and term_in_months. Then you just add the origination_fee (ideally, you would also compute its future value, but for that you need a rate of return which the bank should give you, not the interest_rate). This is not very correct from a financial standpoint, but to be really precise you need the bank's collaboration, because you should consider inflation and the bank's risk to actually compute the value of a loan for the bank. The second approach is less desirable because the computations do not depend on the method used for computing the installments. But as you apply this formula to identical product designs (remember you broke down the dataset by loan product), you should be able to compare the resulting values of similar loan products.

Here is an example for the first observation, in case you need to go with the second option (I assume the interest rate is 13%, which seems quite high):

The Future Value of the loan = (45000/48)*(1+0.13)^48 + (45000/48)*(1+0.13)^47 + … + (45000/48)*(1+0.13)^1 + 750

Then, your aim is to maximize this value. You can either go with a simple OLS regression with the outcome variable being the above calculated loan Future Value (and then use this model to predict the loan amount that the new borrower should receive), or you can try something more fancy such as a genetic algorithm…
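The Future Value formula above can be evaluated directly with a small helper. This is a sketch that follows the formula as written: equal principal installments of amount/term, each compounded at the given rate per period, plus the fee. The function name `future_value` is mine. Note that the formula as written compounds 13% per month; dividing the annual rate by 12 to get a monthly rate is an alternative shown here as an assumption, not something stated in the answer.

```python
def future_value(amount, term_months, rate_per_period, fee):
    """Sum of (amount/term) * (1 + rate)^k for k = 1..term, plus the fee."""
    installment = amount / term_months
    compounded = sum(
        installment * (1 + rate_per_period) ** k
        for k in range(1, term_months + 1)
    )
    return compounded + fee


# First product in the question: 48 months, 45000, 13%, fee 750.
as_written = future_value(45000, 48, 0.13, 750)        # 13% applied per month
monthly_rate = future_value(45000, 48, 0.13 / 12, 750)  # assumption: 13% is annual
print(round(as_written, 2), round(monthly_rate, 2))
```

Whichever rate convention is used, it should be the same for every loan so that the values stay comparable within a product.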

Hope this helps. Good luck!

NRLP
  • 278
0

Hastie's Introduction to Statistical Learning has a good example to begin the loan allocation problem. Logistic regression could be a good starting point.

The loan products you mentioned are outputs of the loan applicant selection process and could be treated independently. Calculated probability of default could be used to evaluate the default risk and a range of parameters including the above-mentioned parameters could be iterated as inputs to obtain parametric sensitivities. An offer to a product with highest return or least risk could be made depending on the risk appetite.
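As a toy illustration of choosing an offer by risk appetite, the sketch below scores each product from the question's table with a deliberately simple one-shot expected profit. Everything beyond the product table is an assumption: the per-product default probabilities would come from a default model, and the `lgd` and `max_default_prob` values and the profit formula itself are hypothetical simplifications that ignore the timing of default.

```python
# (term_in_months, amount, interest_rate in %, origination_fee), from the question.
PRODUCTS = [
    (48, 45000, 13, 750),
    (48, 45000, 19, 750),
    (60, 45000, 18, 900),
    (36, 25000, 23, 275),
    (48, 25000, 28, 500),
    (24, 10000, 35, 100),
]


def total_interest(amount, annual_rate_pct, term_months):
    """Total interest paid on a level-payment loan that runs to maturity."""
    r = annual_rate_pct / 100 / 12
    payment = amount * r / (1 - (1 + r) ** -term_months)
    return payment * term_months - amount


def expected_profit(product, prob_default, lgd):
    """Crude expected profit: full interest if repaid, lgd * amount lost if not."""
    term, amount, rate, fee = product
    interest = total_interest(amount, rate, term)
    return (1 - prob_default) * interest + fee - prob_default * lgd * amount


def best_offer(products, prob_default_by_product, lgd, max_default_prob):
    """Highest expected profit among products within the risk appetite."""
    candidates = [
        p for p in products if prob_default_by_product[p] <= max_default_prob
    ]
    if not candidates:
        return None  # reject the application
    return max(
        candidates,
        key=lambda p: expected_profit(p, prob_default_by_product[p], lgd),
    )


# Hypothetical model output: the same 5% default probability on every product.
probs = {p: 0.05 for p in PRODUCTS}
print(best_offer(PRODUCTS, probs, lgd=0.6, max_default_prob=0.10))
```

Lowering `max_default_prob` corresponds to a more conservative risk appetite. Replacing `expected_profit` with a cash-flow based measure such as IRR would make the ranking account for when defaults happen.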