I have the following question: Are there any Standard Methods for Converting a Continuous Response Variable into a Categorical Variable?
To give my question some context, I give the following example (using the R programming language). Suppose you have the following data (I have also posted the code used to generate this data, to make my example more reproducible):
#generate data (all vectors have length 1000 so they fit in one data frame)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE,
                  prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)
#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)
#view data
head(d)
a b c cat_var old_response_variable
1 -2.153779 15.135098 7.903363 B 233.7632
2 10.529895 5.055633 4.959639 B 372.3922
3 20.600232 10.333690 12.749611 B 349.6630
4 41.885899 17.280700 26.760988 B 164.3122
5 17.174567 11.878346 -3.306771 A 272.9595
6 21.524126 12.449084 6.911237 A 179.7316
In the above dataset, the variables a, b, c and cat_var are the predictor variables (covariates) and "old_response_variable" is the (continuous) response variable. I am interested in converting "old_response_variable" into a (binary) categorical response variable - and then training a statistical model (e.g. a decision tree) on these data for the purpose of supervised classification.
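(For illustration, the simplest off-the-shelf conversion is a fixed split at a sample quantile, e.g. the median, which guarantees balanced classes by construction. A minimal sketch, using the d data frame from above:)
#baseline conversion: split at the median, giving roughly 50/50 classes
d$new_response_variable <- factor(
  ifelse(d$old_response_variable < median(d$old_response_variable), 0, 1))
table(d$new_response_variable)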
Proposed Strategy:
The plot of the "old_response_variable" looks like this:
plot(density(d$old_response_variable),
     main = "Distribution of the Old Response Variable")
The "old response variable" can take values between 0 and 600. Since I am interested in binary classification, I thought I could:
1) Make a random split (i.e. choose a threshold) in the "old_response_variable" (e.g. if old_response_variable < 250 then new_response_variable = "0", else "1")
2) Train a decision tree model on the data from 1)
3) Record performance metrics (e.g. accuracy, sensitivity, specificity) for the model from 2)
4) Repeat steps 1) - 3) many times, and choose the final threshold that has "suitable" values of accuracy, sensitivity and specificity (e.g. a threshold where the decision tree has high accuracy and high sensitivity but low specificity might be less advantageous than one where it has medium accuracy, medium sensitivity and medium specificity).
Here is the R code corresponding to this strategy (on a small scale):
library(ggplot2)
library(caret)
library(rpart)
#generate data (all vectors have length 1000)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE,
                  prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)
#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)
e <- d   #keep an untouched copy of the data
#candidate thresholds: 50 values sampled from 250-300
vec1 <- sample(250:300, 50)
df <- expand.grid(Var1 = vec1)
df$Accuracy <- NA
df$sens <- NA
df$spec <- NA
for (i in seq_along(vec1)) {
    #binarise the response at the i-th candidate threshold
    d$new_response_variable <- as.factor(
        as.integer(ifelse(d$old_response_variable < vec1[i], 0, 1)))
    #2-fold CV, repeated once (kept small so the loop runs quickly)
    fitControl <- trainControl(method = "repeatedcv",
                               number = 2,
                               repeats = 1)
    #fit a decision tree, excluding the old (continuous) response
    TreeFit <- train(new_response_variable ~ .,
                     data = d[, -5],
                     method = "rpart",
                     trControl = fitControl)
    #predict on the training data and record metrics
    #(confusionMatrix() expects predictions first, reference second)
    pred <- predict(TreeFit, d[, -5])
    con <- confusionMatrix(pred, d$new_response_variable)
    df$Accuracy[i] <- con$overall["Accuracy"]
    df$sens[i] <- con$byClass["Sensitivity"]
    df$spec[i] <- con$byClass["Specificity"]
}
#view final results ("Var1" is the threshold)
head(df)
Var1 Accuracy sens spec
1 299 0.682 0.8125000 0.6798780
2 289 0.657 0.7358491 0.6525871
3 271 0.573 0.8125000 0.5691057
4 278 0.622 0.6491228 0.6185102
5 253 0.540 0.5352564 0.6093750
6 258 0.549 0.5305623 0.6318681
We can visualize the results of this strategy:
ggplot(df, aes(Var1)) +
  geom_line(aes(y = Accuracy, colour = "Accuracy")) +
  geom_line(aes(y = sens, colour = "sens")) +
  geom_line(aes(y = spec, colour = "spec")) +
  ggtitle("Results of Threshold Splitting Strategy")
According to the results in the plot above, a splitting threshold of approximately 280 (if old_response_variable < 280 then new_response_variable = "0", else "1") appears to be a suitable choice (balanced accuracy, sensitivity and specificity).
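(Instead of reading the threshold off the plot, the same criterion could be applied programmatically. A minimal sketch that picks the threshold minimising the gap between sensitivity and specificity - one possible definition of "balanced" - using the df results table from above:)
#pick the threshold whose sensitivity and specificity are closest
best_row <- df[which.min(abs(df$sens - df$spec)), ]
best_row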
Question: Based on this strategy that I have outlined for choosing splitting thresholds, are there any major statistical flaws? The main flaw I can think of is that the "suitability" of the threshold is decided by how well one particular model (the decision tree) performs - it is very possible that the chosen threshold is not a naturally occurring or ideal threshold, but rather one that happens to suit the (multiple) models we trained (someone could ask "why wasn't a KNN or an SVM model used to evaluate potential splitting thresholds?"). In essence, we might have projected our biases onto the data - but to some extent, this is inevitable in statistical modelling.
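(As a side note on that last point: because the threshold search is wrapped around caret's train(), checking whether the chosen threshold is model-specific is a one-line change inside the loop. A sketch, assuming d, fitControl and new_response_variable exist as in the loop above:)
#re-evaluate a candidate threshold with k-nearest neighbours instead of a tree
KnnFit <- train(new_response_variable ~ .,
                data = d[, -5],
                method = "knn",
                trControl = fitControl)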
But in general, have I outlined a "reasonable" strategy for converting a continuous response variable into a categorical response variable? Can someone please comment on the approach I have used?
Thanks!
Note: I know that converting a continuous variable into a categorical variable will inevitably result in a loss of information - but what if the client/your boss specifically requests that this problem be solved as a classification problem? (e.g. similar problems in the industry are treated as classification, and the goal is to define new thresholds/classes from a continuous variable).
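(If the redefined classes ever need to be more than binary, quantile-based binning with cut() is a standard way to do this. A minimal sketch creating three classes of roughly equal size - the column name three_class is just illustrative - assuming the d data frame from above:)
#quantile binning: three classes of roughly equal size
d$three_class <- cut(d$old_response_variable,
                     breaks = quantile(d$old_response_variable,
                                       probs = c(0, 1/3, 2/3, 1)),
                     labels = c("low", "medium", "high"),
                     include.lowest = TRUE)
table(d$three_class)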


