
I am working on a classification problem where my outcome variable is either "Approved" or "Denied". Approvals make up roughly 60% of my dataset and denials roughly 30%. I have tried multiple models (random forest, decision tree, neural network, and gradient boosting machine). The highest specificity I can achieve is 0.69, with the random forest. I also tried to balance the data within the "train" function of the caret package by down-sampling, over-sampling, SMOTE, and ROSE, performing the sampling only within the training dataset using 10-fold cross-validation. I am pretty new to machine learning, so any advice is appreciated. Unfortunately, I cannot provide the datasets or any code that I have written, so I am just looking for general suggestions relating to unbalanced datasets.
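For reference, in-fold resampling with caret of the sort described above would typically look something like the sketch below. This is illustrative only, not the poster's actual setup; `train_df` and the outcome column `status` (levels "Approved"/"Denied") are hypothetical names.

## Illustrative sketch only: resampling applied inside each cross-validation
## fold via caret. Object names are hypothetical.
library(caret)

ctrl <- trainControl(method = "cv", number = 10,    # 10-fold cross-validation
                     sampling = "down",             # or "up", "smote", "rose"
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(status ~ ., data = train_df,
             method = "rf",          # random forest
             trControl = ctrl,
             metric = "ROC")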

  • ... what are the remaining 10%, after you've considered the 60% approvals and 30% denials? And... this is hardly an unbalanced data set, not that unbalanced data is really a problem anyway: https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – jbowman Jul 23 '19 at 15:26
  • Sorry, it's really 68.731% approvals and 31.269% denials. – user254529 Jul 23 '19 at 15:32
  • I actually tried running the same models without balancing the dataset, and the specificity was worse than with balancing. – user254529 Jul 23 '19 at 15:33
  • How many observations & features do you have in your RF, roughly? – jbowman Jul 23 '19 at 15:34
  • 23,348 rows for my training set. 24 features. 7 of those are continuous and the rest are categorical. The largest number of levels for my categorical variables is 28. – user254529 Jul 23 '19 at 15:38
  • An alternative is to use weights on your observations. – user2974951 Jul 24 '19 at 13:36
  • Yes, I read about this, but I couldn't find any application of it using the caret package in R. I'm not sure exactly what I would set the weights to, either. Can you give an example? I can post the R code that I have for training the model if that would help. – user254529 Jul 24 '19 at 13:40
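To illustrate the weighting idea raised in the comments above, here is a rough, hedged sketch. The object names `train_df` and `status` are made up, and caret's `weights` argument only affects models that accept case weights (gbm is one such model).

## Rough sketch (hypothetical names): give each "Denied" case more weight so
## the two classes contribute roughly equally to the fit.
library(caret)

n_app <- sum(train_df$status == "Approved")
n_den <- sum(train_df$status == "Denied")

w <- ifelse(train_df$status == "Denied", n_app / n_den, 1)

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit_w <- train(status ~ ., data = train_df,
               method = "gbm",    # gbm accepts case weights
               weights = w,       # only used by models that support them
               trControl = ctrl,
               metric = "ROC",
               verbose = FALSE)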

1 Answer


The easiest way to increase specificity is to change the classification threshold. As you develop your machine learning knowledge, you will see that you destroy much potentially useful information by having any threshold at all, but for now, just changing the threshold is a reasonable first step.

When you run, for instance, a neural network, you get predictions on a continuum. The standard continuum in this situation is on the interval $[0,1]$. Many software functions turn these into hard classifications by setting a threshold of $0.5$: above the threshold gets classified as one category, and below the threshold gets classified as the other category.

However, you do not have to use $0.5$ as the threshold. If you want more specificity, that is, if you want it to be harder for a case to be classified as positive, you can raise the threshold. Perhaps set it to $0.6$ or $0.8$ to get the specificity you desire.
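As a hedged illustration, suppose `p_hat` holds predicted probabilities of approval for a hold-out set and `y_true` the observed labels (both hypothetical names, not from your model); thresholding might look like this:

## Hedged sketch with hypothetical objects `p_hat` and `y_true`.
pred_05 <- ifelse(p_hat >= 0.5, "Approved", "Denied")  # default threshold
pred_07 <- ifelse(p_hat >= 0.7, "Approved", "Denied")  # stricter threshold

## A higher threshold makes "Approved" calls rarer, trading sensitivity
## for specificity.
table(predicted = pred_05, actual = y_true)
table(predicted = pred_07, actual = y_true)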

Raising the threshold will improve your specificity at the expense of sensitivity. The tradeoff can be visualized in receiver operating characteristic (ROC) curves, such as those implemented by pROC::roc in R. This function even prints the sensitivity and specificity achieved at each threshold. Below, I give a quick demonstration, and I discuss this in more detail here.

library(pROC)
N <- 25
p <- rbeta(N, 1, 1)   # simulated continuous predictions on [0, 1]
y <- rbinom(N, 1, p)  # simulated binary outcomes
r <- pROC::roc(y, p)  # ROC curve for the simulated predictions
d <- data.frame(      # sensitivity and specificity at every threshold
  threshold = r$thresholds,
  sensitivity = r$sensitivities,
  specificity = r$specificities
)
d

################################################################################

OUTPUT

################################################################################

    threshold sensitivity specificity
1        -Inf  1.00000000  0.00000000
2  0.03986668  1.00000000  0.07142857
3  0.05315755  1.00000000  0.14285714
4  0.07842079  1.00000000  0.21428571
5  0.12086679  1.00000000  0.28571429
6  0.14478513  1.00000000  0.35714286
7  0.16003195  1.00000000  0.42857143
8  0.21402721  1.00000000  0.50000000
9  0.26453714  1.00000000  0.57142857
10 0.31080317  1.00000000  0.64285714
11 0.35289509  0.90909091  0.64285714
12 0.37692100  0.90909091  0.71428571
13 0.43799047  0.81818182  0.71428571
14 0.49503947  0.81818182  0.78571429
15 0.54152179  0.81818182  0.85714286
16 0.58273907  0.81818182  0.92857143
17 0.60398583  0.72727273  0.92857143
18 0.63121729  0.63636364  0.92857143
19 0.66352988  0.63636364  1.00000000
20 0.73563750  0.54545455  1.00000000
21 0.84121309  0.45454545  1.00000000
22 0.89278788  0.36363636  1.00000000
23 0.92770504  0.27272727  1.00000000
24 0.96569430  0.18181818  1.00000000
25 0.98375395  0.09090909  1.00000000
26        Inf  0.00000000  1.00000000

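If you have a particular specificity in mind, the coords function in pROC can pull the corresponding point off the curve. A quick sketch, continuing from the `r` object above:

## Point on the ROC curve at (or nearest) a target specificity of 0.90.
coords(r, x = 0.90, input = "specificity",
       ret = c("threshold", "sensitivity", "specificity"))

## Or ask for the threshold that maximizes Youden's J statistic.
coords(r, x = "best", best.method = "youden")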
(There are critics of ROC curves and even sensitivity and specificity in general, among them being Frank Harrell, whose criticisms of these are worth reading.)

Most "classifiers" actually make predictions on a continuum that are then binned according to some threshold to make categorical predictions. If caret does not allow you to access those continuous predictions, the package is less user-friendly than it first seems. (My guess is that you can get them, however.)
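For what it's worth, caret can return the continuous predictions when class probabilities are turned on during training. A hedged sketch, with hypothetical `fit` and `test_df` objects:

## Hedged sketch: `fit` is a caret train() object built with
## classProbs = TRUE in trainControl(), and `test_df` is hold-out data.
prob <- predict(fit, newdata = test_df, type = "prob")
head(prob)   # one column of probabilities per class (Approved, Denied)

## Apply whatever threshold you like instead of the default 0.5.
pred <- ifelse(prob$Denied >= 0.6, "Denied", "Approved")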

Dave