
I am working on a classification problem where my outcome variable is either "Approved" or "Denied". Approvals make up roughly 60% of my dataset and denials roughly 30%. I have tried multiple models (random forest, decision tree, neural network, and gradient boosting machine). The highest specificity I can achieve is 0.69, with the random forest. I also tried to balance the data within the "train" function of the caret package by down-sampling, over-sampling, SMOTE, and ROSE, performing the sampling only within the training dataset using 10-fold cross-validation. I am pretty new to machine learning, so any advice is appreciated. Unfortunately, I cannot provide the datasets or any code that I have written, so I am just looking for general suggestions relating to unbalanced datasets.
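For reference, in-fold resampling with caret of the sort described above would typically look something like the sketch below. This is illustrative only, not the poster's actual setup; `train_df` and the outcome column `status` (levels "Approved"/"Denied") are hypothetical names.

## Illustrative sketch only: resampling applied inside each cross-validation
## fold via caret. Object names are hypothetical.
library(caret)

ctrl <- trainControl(method = "cv", number = 10,    # 10-fold cross-validation
                     sampling = "down",             # or "up", "smote", "rose"
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(status ~ ., data = train_df,
             method = "rf",          # random forest
             trControl = ctrl,
             metric = "ROC")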

  • ... what are the remaining 10%, after you've considered the 60% approvals and 30% denials? And... this is hardly an unbalanced data set, not that unbalanced data is really a problem anyway: https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – jbowman Jul 23 '19 at 15:26
  • Sorry, it's really 68.731% approvals and 31.269% denials. – user254529 Jul 23 '19 at 15:32
  • I actually tried running the same models without balancing the dataset, and the specificity was worse than with balancing. – user254529 Jul 23 '19 at 15:33
  • How many observations & features do you have in your RF, roughly? – jbowman Jul 23 '19 at 15:34
  • 23,348 rows for my training set. 24 features. 7 of those are continuous and the rest are categorical. The largest number of levels for my categorical variables is 28. – user254529 Jul 23 '19 at 15:38
  • An alternative is to use weights on your observations. – user2974951 Jul 24 '19 at 13:36
  • Yes, I read about this, but I couldn't find any application of it using the caret package in R. I'm not sure exactly what I would set the weights to, either. Can you give an example? I can post the R code that I have for training the model if that would help. – user254529 Jul 24 '19 at 13:40
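To illustrate the weighting idea raised in the comments above, here is a rough, hedged sketch. The object names `train_df` and `status` are made up, and caret's `weights` argument only affects models that accept case weights (gbm is one such model).

## Rough sketch (hypothetical names): give each "Denied" case more weight so
## the two classes contribute roughly equally to the fit.
library(caret)

n_app <- sum(train_df$status == "Approved")
n_den <- sum(train_df$status == "Denied")

w <- ifelse(train_df$status == "Denied", n_app / n_den, 1)

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit_w <- train(status ~ ., data = train_df,
               method = "gbm",    # gbm accepts case weights
               weights = w,       # only used by models that support them
               trControl = ctrl,
               metric = "ROC",
               verbose = FALSE)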

1 Answer


The easiest way to increase specificity is to change the classification threshold. As you develop your machine learning knowledge, you will see that you destroy much potentially useful information by having any threshold at all, but for now, just changing the threshold is a reasonable first step.

When you run, for instance, a neural network, you get predictions on a continuum. The standard continuum in this situation is on the interval $[0,1]$. Many software functions turn these into hard classifications by setting a threshold of $0.5$: above the threshold gets classified as one category, and below the threshold gets classified as the other category.

However, you do not have to use $0.5$ as the threshold. If you want more specificity, that is, if you want it to be harder for a case to be classified as positive, you can raise the threshold. Perhaps set it to $0.6$ or $0.8$ to get the specificity you desire.
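As a hedged illustration, suppose `p_hat` holds predicted probabilities of approval for a hold-out set and `y_true` the observed labels (both hypothetical names, not from your model); thresholding might look like this:

## Hedged sketch with hypothetical objects `p_hat` and `y_true`.
pred_05 <- ifelse(p_hat >= 0.5, "Approved", "Denied")  # default threshold
pred_07 <- ifelse(p_hat >= 0.7, "Approved", "Denied")  # stricter threshold

## A higher threshold makes "Approved" calls rarer, trading sensitivity
## for specificity.
table(predicted = pred_05, actual = y_true)
table(predicted = pred_07, actual = y_true)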

Raising the threshold will improve your specificity at the expense of sensitivity. The tradeoff can be visualized in receiver operating characteristic (ROC) curves, such as those implemented by pROC::roc in R. This function even prints the sensitivity and specificity achieved at each threshold. Below, I give a quick demonstration, and I discuss this in more detail here.

library(pROC)
N <- 25
p <- rbeta(N, 1, 1)   # simulated continuous predictions on [0, 1]
y <- rbinom(N, 1, p)  # simulated binary outcomes
r <- pROC::roc(y, p)  # ROC curve for the simulated predictions
d <- data.frame(      # sensitivity and specificity at every threshold
  threshold = r$thresholds,
  sensitivity = r$sensitivities,
  specificity = r$specificities
)
d

################################################################################

OUTPUT

################################################################################

    threshold sensitivity specificity
1        -Inf  1.00000000  0.00000000
2  0.03986668  1.00000000  0.07142857
3  0.05315755  1.00000000  0.14285714
4  0.07842079  1.00000000  0.21428571
5  0.12086679  1.00000000  0.28571429
6  0.14478513  1.00000000  0.35714286
7  0.16003195  1.00000000  0.42857143
8  0.21402721  1.00000000  0.50000000
9  0.26453714  1.00000000  0.57142857
10 0.31080317  1.00000000  0.64285714
11 0.35289509  0.90909091  0.64285714
12 0.37692100  0.90909091  0.71428571
13 0.43799047  0.81818182  0.71428571
14 0.49503947  0.81818182  0.78571429
15 0.54152179  0.81818182  0.85714286
16 0.58273907  0.81818182  0.92857143
17 0.60398583  0.72727273  0.92857143
18 0.63121729  0.63636364  0.92857143
19 0.66352988  0.63636364  1.00000000
20 0.73563750  0.54545455  1.00000000
21 0.84121309  0.45454545  1.00000000
22 0.89278788  0.36363636  1.00000000
23 0.92770504  0.27272727  1.00000000
24 0.96569430  0.18181818  1.00000000
25 0.98375395  0.09090909  1.00000000
26        Inf  0.00000000  1.00000000

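If you have a particular specificity in mind, the coords function in pROC can pull the corresponding point off the curve. A quick sketch, continuing from the `r` object above:

## Point on the ROC curve at (or nearest) a target specificity of 0.90.
coords(r, x = 0.90, input = "specificity",
       ret = c("threshold", "sensitivity", "specificity"))

## Or ask for the threshold that maximizes Youden's J statistic.
coords(r, x = "best", best.method = "youden")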
(There are critics of ROC curves and even sensitivity and specificity in general, among them being Frank Harrell, whose criticisms of these are worth reading.)

Most "classifiers" actually make predictions on a continuum that are then binned according to some threshold to make categorical predictions. If caret does not allow you to access those continuous predictions, the package is less user-friendly than it first seems. (My guess is that you can get them, however.)
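For what it's worth, caret can return the continuous predictions when class probabilities are turned on during training. A hedged sketch, with hypothetical `fit` and `test_df` objects:

## Hedged sketch: `fit` is a caret train() object built with
## classProbs = TRUE in trainControl(), and `test_df` is hold-out data.
prob <- predict(fit, newdata = test_df, type = "prob")
head(prob)   # one column of probabilities per class (Approved, Denied)

## Apply whatever threshold you like instead of the default 0.5.
pred <- ifelse(prob$Denied >= 0.6, "Denied", "Approved")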

Dave