I have an XGBoost regression model that predicts a numeric target y. y is quite right-skewed when I plot its histogram; for example, the top 5% of the data accounts for 50% of the total sum. I do not want to filter those rows out, because they are important for my analysis, and besides, there is no fixed cut-off point.
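For context, the kind of concentration I mean can be checked like this (a minimal sketch; the lognormal y here is just a synthetic stand-in for my actual target):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed target, standing in for my real y
y = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)

cut = np.quantile(y, 0.95)                # 95th-percentile cutoff
top_share = y[y >= cut].sum() / y.sum()   # share of the total sum held by the top 5%
print(f"Top 5% of rows account for {top_share:.0%} of the total sum")
```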
When I train the model using XGBRegressor with the squared-error objective, I get an R2 of 0.76 on the test set. I am curious about how well the model fits different ranges of the target, so I use sklearn's R2 score to measure the predictive power of my model on different subsets. I split the test data at target-value percentiles (y < i-th percentile vs. y >= i-th percentile). R2_high (R2_low) refers to the subset of the test data where target values are higher (lower) than the percentile value. See the results below:
| Percentile | R2_high | R2_low | R2_overall |
|---|---|---|---|
| 0.10 | 0.758 | -35.9 | 0.76 |
| 0.20 | 0.75 | -15.4 | 0.76 |
| 0.50 | 0.73 | -2.54 | 0.76 |
| 0.90 | 0.63 | 0.375 | 0.76 |
| 0.95 | 0.56 | 0.53 | 0.76 |
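The subset scores above were computed roughly as in the sketch below (assuming `model`, `X_test`, and `y_test` from my setup; `r2_score` on a subset is equivalent to calling `score` on that subset):

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_percentile(y_test, y_pred, q):
    """Split the test set at the q-th percentile of y and score each side."""
    cut = np.quantile(y_test, q)
    high, low = y_test >= cut, y_test < cut
    return (r2_score(y_test[high], y_pred[high]),   # R2_high
            r2_score(y_test[low], y_pred[low]))     # R2_low

y_pred = model.predict(X_test)
for q in (0.10, 0.20, 0.50, 0.90, 0.95):
    r2_high, r2_low = r2_by_percentile(y_test, y_pred, q)
    print(f"q={q:.2f}  R2_high={r2_high:.3f}  R2_low={r2_low:.3f}")
```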
- Overall, I am satisfied with an R2 of 0.76. However, I would expect the R2 value not to deviate so much as long as there are enough data points (the test set has ~100,000 rows), yet it deviates extremely. Is this normal?
- It is reasonable that the model may fit the large values more closely due to the squared-error objective. However, this is not exactly the case: for the top 5%, the R2 is only 0.56.
- For some subsets the R2 is terrible, with negative values.
- Does training separate models make sense? For example, training 10 models where dataset 1 contains the top 10% of y, dataset 2 the second top 10%, and so on (see the sketch after this list)?
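To be concrete, the per-decile idea would look something like this (a hypothetical sketch assuming training data `X_train` and `y_train`; not something I have tried yet):

```python
import pandas as pd
from xgboost import XGBRegressor

# Bin the training rows by deciles of y and fit one model per bin
deciles = pd.qcut(y_train, q=10, labels=False)  # decile index 0..9 for each row

models = {}
for d in range(10):
    mask = deciles == d
    m = XGBRegressor(objective="reg:squarederror")
    m.fit(X_train[mask], y_train[mask])
    models[d] = m

# Note: at prediction time the decile of a new row is unknown,
# so routing rows to the right model would also need to be solved.
```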
I would greatly appreciate any help or suggestions.