I'm working on a heart disease dataset, doing binary classification with rpart trees. The data is heavily imbalanced: only 8% of the instances are positive.
For pruning, I'm trying to apply the 1-SE rule to select the number of splits. This is my printcp output (some rows are omitted):
           CP nsplit rel error  xerror      xstd
1  0.09056369      0   1.00000 1.00000 0.0022309
2  0.02295328      2   0.81887 0.81887 0.0021026
3  0.00858510      4   0.77297 0.77297 0.0020630
4  0.00390896      5   0.76438 0.76438 0.0020552
...
31 0.00027765     60   0.69529 0.70599 0.0019993
32 0.00027582     61   0.69501 0.70577 0.0019991
33 0.00027034     67   0.69287 0.70582 0.0019991
34 0.00026851     68   0.69260 0.70558 0.0019989
35 0.00026303     72   0.69153 0.70515 0.0019985
36 0.00025816     73   0.69127 0.70503 0.0019983
37 0.00025207     76   0.69049 0.70467 0.0019980
38 0.00024842     78   0.68999 0.70432 0.0019976
39 0.00024477     81   0.68924 0.70439 0.0019977
40 0.00024111     83   0.68875 0.70404 0.0019974
41 0.00022650     87   0.68779 0.70358 0.0019969
42 0.00020945     88   0.68756 0.70331 0.0019966
43 0.00020702     91   0.68693 0.70327 0.0019966
44 0.00020458     94   0.68631 0.70327 0.0019966
45 0.00019727     96   0.68590 0.70276 0.0019961
46 0.00018997     97   0.68570 0.70239 0.0019957
The lowest xerror is at row 46 (97 splits), and applying the 1-SE rule I'd take the cp at row 40 (83 splits).
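To make the rule concrete, here is a minimal sketch of the 1-SE selection as I understand it: find the row with the lowest xerror, add that row's xstd to get a threshold, then take the smallest tree whose xerror is at or below the threshold. It is plain Python rather than rpart code, the function name `one_se_choice` is made up for illustration, and it only sees the rows visible in the table (rows 5 to 30 are omitted, so they are not considered).

```python
# Rows copied from the visible tail of the printcp output above:
# (table row, nsplit, xerror, xstd). Omitted rows 5-30 are NOT included.
rows = [
    (31, 60, 0.70599, 0.0019993),
    (32, 61, 0.70577, 0.0019991),
    (33, 67, 0.70582, 0.0019991),
    (34, 68, 0.70558, 0.0019989),
    (35, 72, 0.70515, 0.0019985),
    (36, 73, 0.70503, 0.0019983),
    (37, 76, 0.70467, 0.0019980),
    (38, 78, 0.70432, 0.0019976),
    (39, 81, 0.70439, 0.0019977),
    (40, 83, 0.70404, 0.0019974),
    (41, 87, 0.70358, 0.0019969),
    (42, 88, 0.70331, 0.0019966),
    (43, 91, 0.70327, 0.0019966),
    (44, 94, 0.70327, 0.0019966),
    (45, 96, 0.70276, 0.0019961),
    (46, 97, 0.70239, 0.0019957),
]


def one_se_choice(table):
    """Apply the 1-SE rule to (row, nsplit, xerror, xstd) tuples.

    Returns (best_row, threshold, chosen_row), where chosen_row is the
    tree with the fewest splits whose xerror <= min(xerror) + its xstd.
    """
    best = min(table, key=lambda r: r[2])      # row with the lowest xerror
    threshold = best[2] + best[3]              # one standard error above it
    candidates = [r for r in table if r[2] <= threshold]
    chosen = min(candidates, key=lambda r: r[1])  # fewest splits wins
    return best, threshold, chosen


best, threshold, chosen = one_se_choice(rows)
print("lowest xerror at row", best[0], "with", best[1], "splits")
print("1-SE threshold:", round(threshold, 5))
print("1-SE choice: row", chosen[0], "with", chosen[1], "splits")
```

Run on the visible rows only, this selects row 38 (78 splits) with a threshold of about 0.70439, so the choice stated above is worth double-checking against the full table.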
I'm clearly overfitting the data (300k+ rows), but I think that's fine: I'm not necessarily trying to predict with this tree, I'm more interested in understanding which variables are important predictors. Please comment on this assumption.
Since the data is severely imbalanced, I'm focusing on achieving a good enough F1 score, which would show that the tree can identify the positives rather than just predicting all negatives.
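To illustrate why accuracy is misleading at 8% positives, here is a small sketch that computes accuracy, precision, recall and F1 from confusion-matrix counts. The counts are hypothetical (1,000 instances with 80 positives, not my real data), and `classification_metrics` is just an illustrative helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Convention (an assumption of this sketch): precision, recall and F1 are
    reported as 0.0 when their denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical case 1: a tree that predicts "negative" for everything.
all_negative = classification_metrics(tp=0, fp=0, fn=80, tn=920)

# Hypothetical case 2: a tree that finds half of the positives
# at the cost of some false alarms.
some_positives = classification_metrics(tp=40, fp=60, fn=40, tn=860)

print(all_negative)
print(some_positives)
```

In this made-up example the all-negative tree has the higher accuracy (0.92 versus 0.90) but an F1 of 0, while the second tree has an F1 of about 0.44.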
I do that by setting class weights when fitting the tree. At the moment they are 92:8, the inverse of the class distribution in my data. Is this the best way to use the weights?
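As a sanity check on what 92:8 weights do, here is a minimal arithmetic sketch. It assumes the weight of 92 goes to the positive class and 8 to the negative class, and it uses a hypothetical sample of 100 instances matching the 8% prevalence. It says nothing about how rpart consumes the weights internally; it only shows the effective class balance they imply.

```python
def weighted_class_share(n_pos, n_neg, w_pos, w_neg):
    """Return the share of total weight carried by the positive class."""
    pos_mass = n_pos * w_pos   # total weight on positive instances
    neg_mass = n_neg * w_neg   # total weight on negative instances
    return pos_mass / (pos_mass + neg_mass)


# Hypothetical sample: 8 positives and 92 negatives per 100 instances.
unweighted_share = weighted_class_share(n_pos=8, n_neg=92, w_pos=1, w_neg=1)

# Inverse-frequency weights (assumed assignment: 92 for positives, 8 for negatives).
weighted_share = weighted_class_share(n_pos=8, n_neg=92, w_pos=92, w_neg=8)

print("positive share without weights:", unweighted_share)
print("positive share with 92:8 weights:", weighted_share)
```

Under these assumptions the weights make the two classes carry equal total weight, so the tree is effectively fitted as if the data were balanced 50:50.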
Going back to my question about pruning: since I care less about accuracy and would prefer to decide based on F1 or precision, is xerror still the quantity I should be looking at? How is xerror calculated?