I'd like to run a classification on crime data from around the country. However, the label I have is a crime coefficient ranging from 0 to 1. I'd like to define intervals, e.g. 0~0.3 as low crime rate, 0.3~0.5 as medium, etc., and then use them as classes in an XGBoost or random forest model. Is there anything I need to pay attention to before doing the classification? For example, do I have to check whether the label is normally distributed? What if the data is skewed? What would you do if you were working on this project?

Jan Kukacka
  • 11,421

1 Answer

In this case, my suggestion would be to predict the actual numeric crime rate (i.e. treat it as a regression problem) and then apply thresholds to the prediction to arrive at the low/medium crime rate categories.
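A minimal sketch of the thresholding step, assuming the regressor's predictions are already available as plain floats. The cut points 0.3 and 0.5 come from the question; the "high" label for everything above 0.5, the function name `to_crime_level`, and the sample predictions are made up for illustration.

```python
from bisect import bisect_right

# Cut points from the question's example intervals (0~0.3 low, 0.3~0.5 medium).
# The "high" label for values above 0.5 is an assumption.
THRESHOLDS = [0.3, 0.5]
LABELS = ["low", "medium", "high"]


def to_crime_level(predicted_rate, thresholds=THRESHOLDS, labels=LABELS):
    """Map a predicted crime coefficient in [0, 1] to a categorical level.

    A value exactly on a cut point falls into the higher category,
    e.g. 0.3 is labelled "medium".
    """
    return labels[bisect_right(thresholds, predicted_rate)]


# Stand-in for the output of a regressor, e.g. model.predict(X_test).
predicted = [0.12, 0.30, 0.47, 0.81]
levels = [to_crime_level(p) for p in predicted]
print(levels)
```

Because the thresholds are applied only after prediction, they can be changed later without retraining the model.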

To come up with the thresholds, you can create intervals on the predicted crime rate by following some great suggestions for clustering 1D data here. If you prefer simplicity over finding natural separations, you can hard-code quartiles (e.g. the 1st quartile is low, the 2nd and 3rd are medium, and the 4th is high).
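The quartile option could look roughly like this, using only the standard library. The helper name `quartile_labels` and the sample values are hypothetical, and the choice of the "inclusive" quantile method is an assumption; other quantile definitions would shift the cut points slightly.

```python
from statistics import quantiles


def quartile_labels(predicted_rates):
    """Label each rate by quartile: 1st = low, 2nd and 3rd = medium, 4th = high."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(predicted_rates, n=4, method="inclusive")

    def label(rate):
        if rate <= q1:
            return "low"
        if rate <= q3:
            return "medium"
        return "high"

    return [label(r) for r in predicted_rates]


# Made-up predicted crime rates, for illustration only.
predicted = [0.05, 0.10, 0.20, 0.25, 0.40, 0.45, 0.60, 0.90]
labels = quartile_labels(predicted)
print(labels)
```

With quartile cut points, roughly a quarter of the regions end up "low", half "medium", and a quarter "high" regardless of how skewed the crime rate is, so the categories are relative rather than absolute.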

behold
  • 473