How to decide better threshold values for variables with skewed distributions

Question

In the context of optimizing loan advance decisions for customers with a gold loan history, I aim to establish threshold values (x and y) to categorize customers into four groups based on loan count and loan amount. The goal is to prioritize customers with a positive history so that they receive a higher advance. The groups are: 1. loan count>x and loan amount>y; 2. loan amount>x and loan count<y; 3. loan amount<x and loan count>y; 4. loan amount<x and loan count<y. This is also the order of importance. Given the positively skewed distributions of loan amount and loan count, what statistical methods or techniques can you recommend for determining these threshold values?

I don't know anything about banking, but I would NOT make thresholds or categories at all. Instead I would use a formula, where the advance you could get is a function of loan counts, loan amounts, and other things as well. – Peter Flom, Nov 22 '23 at 12:01
Do you have training data, i.e., data where you know both your input variables before obtaining a loan, and the later loan "outcome"? – Christian Hennig, Nov 22 '23 at 13:09
@PeterFlom Yes, of course. That's the next step. Every customer gets a score based on various criteria, calculated from a formula. Right now I'm working on a subscore which boosts the main score. Before calculating this subscore I need to give a weight to each group. To do that I need a reasonable way of grouping customers, i.e., I need to find the optimal threshold (cutoff) values. – Thimali Fernando, Nov 23 '23 at 02:46

score 4 · Answer 1 · answered Nov 22 '23 at 12:47


Thresholding of input variables is never appropriate. That’s because thresholding loses information, is arbitrary, and most importantly, gives the wrong answer. The last problem arises because you can demonstrate that the threshold for one variable must be a function of the actual values of the other predictors. This is explained in detail here. If thresholding (binning) is necessary (it usually is not), it must be done on the output of a risk or prediction model.
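To make the point about thresholds depending on the other predictors concrete, here is a minimal sketch. It assumes a hypothetical logistic model for a good repayment outcome; the coefficients, the function names `prob_good` and `count_cutoff`, and the 0.5 target are all made up for illustration and are not estimated from any data. Under such a model, the loan count at which the predicted probability reaches the target is not a single number: it moves with the loan amount.

```python
import math

# Hypothetical coefficients for an illustrative logistic model of a good
# repayment outcome. They are invented for this sketch, not fitted to data.
INTERCEPT = -4.0
B_COUNT = 0.5        # effect per prior loan
B_LOG_AMOUNT = 0.3   # effect per unit of log(total loan amount)


def prob_good(loan_count, loan_amount):
    """Modelled probability of a good outcome for one customer."""
    lp = INTERCEPT + B_COUNT * loan_count + B_LOG_AMOUNT * math.log(loan_amount)
    return 1.0 / (1.0 + math.exp(-lp))


def count_cutoff(loan_amount, target=0.5):
    """Loan count at which the modelled probability reaches `target`."""
    logit = math.log(target / (1.0 - target))
    return (logit - INTERCEPT - B_LOG_AMOUNT * math.log(loan_amount)) / B_COUNT


# The "threshold" on loan count shifts as the loan amount changes.
for amount in (1_000, 10_000, 100_000):
    print(amount, round(count_cutoff(amount), 2))
```

In this sketch the implied cutoff on loan count falls as the loan amount rises, so any fixed x would be wrong for most customers. Thresholding the model output (`prob_good`) instead keeps both inputs in play.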

The best way to proceed is to specify a flexible model that uses all the information in the data. One part of that process is not to assume linearity of effects of predictors, as covered here.


Frank Harrell


Frank, while I certainly don't disagree with you, I do feel that it's worth acknowledging that categorisation can be useful. As a clinician, I occasionally have to explain certain types of risk to patients, and I find that they usually understand things far better when results are framed in terms of categories (e.g. high vs. low blood pressure). You might retort that this is lazy on my part, and you could be right. I do need to try to get better at explaining things properly. It's one of the reasons why I try to spend an hour or so every day on here, and I always check your latest responses first :) – camhsdoc Jan 09 '24 at 17:53

You are mixing two ideas: simplification of inputs and simplification of outputs. It is never appropriate to simplify blood pressure because it matters greatly how high a high blood pressure is, and the risk due to elevated blood pressure can be offset by excellent physical condition, lack of smoking, and low LDL cholesterol. To handle such tradeoffs properly the patient needs to have all the risk factor levels defined, i.e., we use complete conditioning and not partial conditioning which just records whether something exceeds a threshold. – Frank Harrell Jan 10 '24 at 07:05

score 0 · Answer 2 · answered Nov 22 '23 at 12:01


Setting aside the disadvantages of categorizing continuous variables, a common way to find thresholds is to base them on the quantiles of the distribution. With quantiles, you look at how much of the distribution's mass lies below a given point, rather than at the values themselves, so to me this makes more sense for skewed data than defining cutoffs directly on the raw scale.
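As a rough illustration of the quantile idea, the sketch below uses only the Python standard library. The loan amounts are simulated from a lognormal distribution purely as a stand-in for positively skewed data; the parameters and variable names are assumptions, not anything from the question.

```python
import random
import statistics

random.seed(0)

# Hypothetical, simulated loan amounts with a positive (right) skew.
loan_amounts = [random.lognormvariate(10, 1) for _ in range(5000)]

# Quartile cut points: three values that split the sample into four
# groups of roughly equal size.
q1, median, q3 = statistics.quantiles(loan_amounts, n=4)

# Compare a cutoff at the median with a cutoff at the mean.
mean = statistics.fmean(loan_amounts)
share_above_median = sum(a > median for a in loan_amounts) / len(loan_amounts)
share_above_mean = sum(a > mean for a in loan_amounts) / len(loan_amounts)

print(f"quartiles: {q1:.0f}, {median:.0f}, {q3:.0f}")
print(f"share above median: {share_above_median:.2f}")
print(f"share above mean:   {share_above_mean:.2f}")
```

With right-skewed data the mean is pulled up by the long tail, so a cutoff at the mean puts well under half of the customers in the "high" group, while the median splits them evenly. Note that these cut points are properties of this particular sample and will move if the customer mix changes.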


Mathemagician777


Quantiles are "demographic" quantities, will change as the sampling of individuals changes, and have nothing to do with biology or physics. Quantiles are useful if people are competing against each other but not useful for characterizing individual risk. – Frank Harrell Jan 10 '24 at 07:06
