Changing the regression problem to a classification problem

Question

I'd like to do classification on crime data from around the country. However, my label is a crime coefficient that ranges from 0 to 1. I'd like to define intervals, such as 0 to 0.3 for a low crime rate, 0.3 to 0.5 for medium, and so on, and then use those classes in an xgboost or random forest model. Is there anything I need to pay attention to before doing the classification? For example, do I have to check whether the label is normally distributed? What if the data is skewed? What would you do if you were working on this project?

This would be the worst possible way to analyze the data, throwing away a tremendous amount of information and introducing a huge amount of arbitrariness into the result. You need to just be careful in choosing a distribution for crime rate, and possibly use Bayesian borrowing of information in the small area estimation sense. — Frank Harrell, Apr 21 '19 at 20:02
Note that random forests and gradient boosting work perfectly well for regression. — EdM, Apr 21 '19 at 20:04

score 1 · Answer 1 · answered Apr 21 '19 at 21:53

In this case, my suggestion would be to predict the actual numeric crime rate and then apply thresholds to the predictions to arrive at the low/medium crime-rate classes.
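
A minimal sketch of the thresholding step, assuming the regression predictions are already available as plain floats (for example, from a random forest or gradient boosting regressor). The cut points 0.3 and 0.5 come from the question; the "high" label for everything from 0.5 upward, the function name `to_crime_level`, and the sample predictions are illustrative assumptions.

```python
from bisect import bisect_right

# Cut points from the question: below 0.3 is low, 0.3 to 0.5 is medium.
# Treating everything from 0.5 upward as "high" is an assumption.
THRESHOLDS = [0.3, 0.5]
LABELS = ["low", "medium", "high"]


def to_crime_level(predicted_rate):
    """Map a predicted crime coefficient in [0, 1] to a class label."""
    # bisect_right counts how many thresholds are <= the value, so a
    # value exactly on a cut point falls into the upper bin.
    return LABELS[bisect_right(THRESHOLDS, predicted_rate)]


# Hypothetical regressor outputs, not real data.
predictions = [0.12, 0.30, 0.47, 0.81]
levels = [to_crime_level(p) for p in predictions]
```

Because the model is trained on the numeric label, the cut points can be changed later without retraining anything.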

To come up with the thresholds, you can create intervals on the predicted crime rate by following some of the great suggestions for clustering 1D data here. If you prefer simplicity over finding natural separations, you can hard-code quartiles (e.g., the 1st quartile is low, the 2nd and 3rd are medium, and the 4th is high).
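
The quartile option could be sketched as follows, using only the standard library. The function name `quartile_labels`, the choice of the "inclusive" quantile method, and the sample predictions are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import quantiles

LABELS = ["low", "medium", "high"]


def quartile_labels(predicted_rates):
    """Label predictions: 1st quartile low, 2nd/3rd medium, 4th high."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    # Only Q1 and Q3 are needed, since the middle two quartiles share
    # the "medium" label.
    q1, _, q3 = quantiles(predicted_rates, n=4, method="inclusive")
    cuts = [q1, q3]
    # A value exactly equal to a cut point falls into the upper bin.
    return [LABELS[bisect_right(cuts, r)] for r in predicted_rates]


# Hypothetical regressor outputs, not real data.
predictions = [0.05, 0.10, 0.20, 0.25, 0.40, 0.45, 0.60, 0.90]
labels = quartile_labels(predictions)
```

Note that quartile cut points depend on the sample they are computed from, so the same predicted rate can receive a different label on a different dataset.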
