
I'm working on a heart disease dataset, doing binary classification with RPART trees on severely imbalanced data: only 8% of the instances are positives.

For pruning I'm trying to apply the 1-SE rule to select the right number of splits. This is what my printcp output looks like (some rows omitted).

           CP nsplit rel error  xerror      xstd
1  0.09056369      0   1.00000 1.00000 0.0022309
2  0.02295328      2   0.81887 0.81887 0.0021026
3  0.00858510      4   0.77297 0.77297 0.0020630
4  0.00390896      5   0.76438 0.76438 0.0020552
...
31 0.00027765     60   0.69529 0.70599 0.0019993
32 0.00027582     61   0.69501 0.70577 0.0019991
33 0.00027034     67   0.69287 0.70582 0.0019991
34 0.00026851     68   0.69260 0.70558 0.0019989
35 0.00026303     72   0.69153 0.70515 0.0019985
36 0.00025816     73   0.69127 0.70503 0.0019983
37 0.00025207     76   0.69049 0.70467 0.0019980
38 0.00024842     78   0.68999 0.70432 0.0019976
39 0.00024477     81   0.68924 0.70439 0.0019977
40 0.00024111     83   0.68875 0.70404 0.0019974
41 0.00022650     87   0.68779 0.70358 0.0019969
42 0.00020945     88   0.68756 0.70331 0.0019966
43 0.00020702     91   0.68693 0.70327 0.0019966
44 0.00020458     94   0.68631 0.70327 0.0019966
45 0.00019727     96   0.68590 0.70276 0.0019961
46 0.00018997     97   0.68570 0.70239 0.0019957

The lowest xerror is at row 46 (97 splits), and applying the 1-SE rule I'd take the CP at row 40 (83 splits).
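For reference, this is roughly how I'm applying the 1-SE rule programmatically; the object name `fit` is just a placeholder for my fitted rpart tree:

    cp_tab <- fit$cptable
    best   <- which.min(cp_tab[, "xerror"])
    thresh <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]

    # smallest tree whose cross-validated error is within one SE of the minimum
    one_se <- min(which(cp_tab[, "xerror"] <= thresh))
    pruned <- prune(fit, cp = cp_tab[one_se, "CP"])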

I'm clearly overfitting the data I have (300k+ rows), but I think that's acceptable because I'm not necessarily trying to predict with this tree; I'm more interested in understanding which variables are important predictors (please comment on this assumption).
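Since variable importance is what I mainly care about, my plan is to read it straight off the fitted object, along these lines (again, `fit` is a placeholder):

    fit$variable.importance   # named vector, largest contribution first
    barplot(sort(fit$variable.importance), horiz = TRUE, las = 1)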

Since the data is severely imbalanced, I'm trying to achieve a good enough F1 score, which would show that the tree actually identifies positives rather than just predicting everything as negative.
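Concretely, this is the kind of F1 computation I have in mind on a held-out set (the data frame `test`, the outcome column `target` and the positive level "1" are placeholders):

    pred <- predict(pruned, newdata = test, type = "class")
    tp <- sum(pred == "1" & test$target == "1")
    fp <- sum(pred == "1" & test$target == "0")
    fn <- sum(pred == "0" & test$target == "1")

    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    f1        <- 2 * precision * recall / (precision + recall)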

The way I'm doing that is by setting class weights when initializing the tree. At the moment they are set to 92:8, the inverse of the class distribution in my data. Is this the right way to use the weights?
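For clarity, these are the two ways I know of passing such weights to rpart, via the class priors or via a loss matrix (formula, data name and factor level order are placeholders; the numbers just mirror the 92:8 ratio above, and the level order should be checked with levels(train$target)):

    library(rpart)

    # 1) shift the priors: first entry matches the first factor level,
    #    assumed here to be "0" (negative); 0.08/0.92 mirrors the inverted 92:8 split
    fit_prior <- rpart(target ~ ., data = train, method = "class",
                       parms = list(prior = c(0.08, 0.92)),
                       control = rpart.control(cp = 1e-4, xval = 10))

    # 2) or penalise false negatives with a loss matrix
    #    (L[i, j] = cost of classifying a true class-i case as class j, zeros on the diagonal)
    fit_loss <- rpart(target ~ ., data = train, method = "class",
                      parms = list(loss = matrix(c(0, 11.5, 1, 0), nrow = 2)),
                      control = rpart.control(cp = 1e-4, xval = 10))
    # 11.5 ≈ 92/8 is the cost of missing a positive, assuming "1" is the second level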

Going back to my question about pruning: since I don't care about accuracy and would rather decide based on F1 or precision, is xerror still what I should be looking at? And how is xerror calculated?

  • Welcome to Cross Validated! If you don’t care about the accuracy and just want to catch all of the positive cases, why not call everything a positive case? You’ll never miss another positive case that way! – Dave Jun 25 '22 at 02:32
  • @Dave How is this a constructive comment? Without weights the tree is in fact doing exactly that: predicting everything as negative and achieving 92% accuracy. – cZeph Jun 25 '22 at 07:43
  • @Dave: You could have been more constructive by briefly linking posts such as https://stats.stackexchange.com/questions/359909/is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting. cZeph: This comes up so often on this site that people sometimes give very short answers! Another helpful post for you is https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem, and there are many others! – kjetil b halvorsen Jun 25 '22 at 13:17
  • @kjetilbhalvorsen Wait a second though. Does my question come across as just asking whether accuracy is a good metric for my problem? Because I was honestly trying to give it more depth. I know perfectly well that accuracy is not what I'm supposed to use. I was asking whether pruning on xerror is the right choice when I have set class weights, and whether I'm actually using the class weights correctly, since I only know of the one approach I used. – cZeph Jun 25 '22 at 16:22
  • OK; but I must leave that Q for others. – kjetil b halvorsen Jun 25 '22 at 16:53
