I have the following question: Are there any Standard Methods for Converting a Continuous Response Variable into a Categorical Variable?
To give my question some context, I give the following example (using the R programming language). Suppose you have the following data (I have also posted the code used to generate this data, to make my example more reproducible):
#generate data (all vectors have length 1000 so they fit in one data frame)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE,
                  prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)
#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)
#view data
head(d)
a b c cat_var old_response_variable
1 -2.153779 15.135098 7.903363 B 233.7632
2 10.529895 5.055633 4.959639 B 372.3922
3 20.600232 10.333690 12.749611 B 349.6630
4 41.885899 17.280700 26.760988 B 164.3122
5 17.174567 11.878346 -3.306771 A 272.9595
6 21.524126 12.449084 6.911237 A 179.7316
In the above dataset, the variables a, b, c and cat_var are the predictor variables (covariates) and "old_response_variable" is the (continuous) response variable. I am interested in converting "old_response_variable" into a (binary) categorical response variable - and then training a statistical model (e.g. a decision tree) on these data for the purpose of supervised classification.
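(For illustration, the simplest off-the-shelf conversion is a fixed split at a sample quantile, e.g. the median, which guarantees balanced classes by construction. A minimal sketch, using the d data frame from above:)
#baseline conversion: split at the median, giving roughly 50/50 classes
d$new_response_variable <- factor(
  ifelse(d$old_response_variable < median(d$old_response_variable), 0, 1))
table(d$new_response_variable)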
Proposed Strategy:
The plot of the "old_response_variable" looks like this:
plot(density(d$old_response_variable),
     main = "Distribution of the Old Response Variable")
The "old response variable" can take values between 0 and 600. Since I am interested in binary classification, I thought I could:
1) Make a random split (i.e. choose a threshold) in the "old_response_variable" (e.g. if old_response_variable < 250 then new_response_variable = "0", else "1")
2) Train a decision tree model on the data from 1)
3) Record performance metrics (e.g. accuracy, sensitivity, specificity) for the model from 2)
4) Repeat steps 1) - 3) many times, and choose the final threshold that has "suitable" values of accuracy, sensitivity and specificity (e.g. a threshold where the decision tree has high accuracy and high sensitivity but low specificity might be less advantageous than one where it has medium accuracy, medium sensitivity and medium specificity).
Here is the R code corresponding to this strategy (on a small scale):
library(ggplot2)
library(caret)
library(rpart)
#generate data (all vectors have length 1000)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE,
                  prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)
#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)
e <- d   #keep an untouched copy of the data
#candidate thresholds: 50 values sampled from 250-300
vec1 <- sample(250:300, 50)
df <- expand.grid(Var1 = vec1)
df$Accuracy <- NA
df$sens <- NA
df$spec <- NA
for (i in seq_along(vec1)) {
    #binarise the response at the i-th candidate threshold
    d$new_response_variable <- as.factor(
        as.integer(ifelse(d$old_response_variable < vec1[i], 0, 1)))
    #2-fold CV, repeated once (kept small so the loop runs quickly)
    fitControl <- trainControl(method = "repeatedcv",
                               number = 2,
                               repeats = 1)
    #fit a decision tree, excluding the old (continuous) response
    TreeFit <- train(new_response_variable ~ .,
                     data = d[, -5],
                     method = "rpart",
                     trControl = fitControl)
    #predict on the training data and record metrics
    #(confusionMatrix() expects predictions first, reference second)
    pred <- predict(TreeFit, d[, -5])
    con <- confusionMatrix(pred, d$new_response_variable)
    df$Accuracy[i] <- con$overall["Accuracy"]
    df$sens[i] <- con$byClass["Sensitivity"]
    df$spec[i] <- con$byClass["Specificity"]
}
#view final results ("Var1" is the threshold)
head(df)
Var1 Accuracy sens spec
1 299 0.682 0.8125000 0.6798780
2 289 0.657 0.7358491 0.6525871
3 271 0.573 0.8125000 0.5691057
4 278 0.622 0.6491228 0.6185102
5 253 0.540 0.5352564 0.6093750
6 258 0.549 0.5305623 0.6318681
We can visualize the results of this strategy:
ggplot(df, aes(Var1)) +
  geom_line(aes(y = Accuracy, colour = "Accuracy")) +
  geom_line(aes(y = sens, colour = "sens")) +
  geom_line(aes(y = spec, colour = "spec")) +
  ggtitle("Results of Threshold Splitting Strategy")
According to the results in the plot above, a splitting threshold of approximately 280 (if old_response_variable < 280 then new_response_variable = "0", else "1") appears to be a suitable choice (balanced accuracy, sensitivity and specificity).
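(Instead of reading the threshold off the plot, the same criterion could be applied programmatically. A minimal sketch that picks the threshold minimising the gap between sensitivity and specificity - one possible definition of "balanced" - using the df results table from above:)
#pick the threshold whose sensitivity and specificity are closest
best_row <- df[which.min(abs(df$sens - df$spec)), ]
best_row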
Question: Based on this strategy that I have outlined for choosing splitting thresholds, are there any major statistical flaws? The main flaw I can think of is that the "suitability" of the threshold is decided by how well one particular model (the decision tree) performs - it is very possible that the chosen threshold is not a naturally occurring or ideal threshold, but rather one that happens to suit the (multiple) models we trained (someone could ask "why wasn't a KNN or an SVM model used to evaluate potential splitting thresholds?"). In essence, we might have projected our biases onto the data - but to some extent, this is inevitable in statistical modelling.
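(As a side note on that last point: because the threshold search is wrapped around caret's train(), checking whether the chosen threshold is model-specific is a one-line change inside the loop. A sketch, assuming d, fitControl and new_response_variable exist as in the loop above:)
#re-evaluate a candidate threshold with k-nearest neighbours instead of a tree
KnnFit <- train(new_response_variable ~ .,
                data = d[, -5],
                method = "knn",
                trControl = fitControl)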
But in general, have I outlined a "reasonable" strategy for converting a continuous response variable into a categorical response variable? Can someone please comment on the approach I have used?
Thanks!
Note: I know that converting a continuous variable into a categorical variable will inevitably result in a loss of information - but what if the client/your boss specifically requests that this problem be solved as a classification problem? (e.g. similar problems in the industry are treated as classification, and the goal is to define new thresholds/classes from a continuous variable).
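(If the redefined classes ever need to be more than binary, quantile-based binning with cut() is a standard way to do this. A minimal sketch creating three classes of roughly equal size - the column name three_class is just illustrative - assuming the d data frame from above:)
#quantile binning: three classes of roughly equal size
d$three_class <- cut(d$old_response_variable,
                     breaks = quantile(d$old_response_variable,
                                       probs = c(0, 1/3, 2/3, 1)),
                     labels = c("low", "medium", "high"),
                     include.lowest = TRUE)
table(d$three_class)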


