
I have a dataset that I have split into a training set and a test set (~2400 / ~600 observations).

After training an XGBoost regression model on the training set, I would like to statistically test whether the model really performs better on the test set than a naïve model that always predicts the mean of the training-set target.

The residuals (of both predictions), the residual differences (between the two predictions) and the squared-residual differences (between the two predictions) are all non-normal (tested with Shapiro–Wilk), not even approximately so. So I think I should use a non-parametric test, maybe something like the Mann–Whitney U test, but if I understand it correctly, that test does not solve my problem (it compares two independent samples, whereas my samples are paired).

Unless I am mistaken, this problem is almost equivalent to proving that a sort of R² of the model is non-zero.

from sklearn.metrics import mean_squared_error, r2_score

# Model predictions vs. a naive baseline that always predicts the training-set mean
y_pred = trained.predict(X_test)
baseground = [y_train.mean()] * len(y_pred)
mse = mean_squared_error(y_test, y_pred)
baseground_mse = mean_squared_error(y_test, baseground)

Model mse: 0.3606197869571293
Baseground mse: 0.42712517870196076

# Residuals and squared residuals for the model and for the naive baseline,
# plus their element-wise (paired) differences
residuals = [y_p - y_t for y_p, y_t in zip(y_pred, y_test)]
squarred_residuals = [(y_p - y_t) ** 2 for y_p, y_t in zip(y_pred, y_test)]
residuals_naive = [y_b - y_t for y_b, y_t in zip(baseground, y_test)]
squarred_residuals_naive = [(y_b - y_t) ** 2 for y_b, y_t in zip(baseground, y_test)]
residuals_differences = [r - rn for r, rn in zip(residuals, residuals_naive)]
squarred_residuals_differences = [sr - srn for sr, srn in zip(squarred_residuals, squarred_residuals_naive)]

residuals:  ShapiroResult(statistic=0.9887793660163879, pvalue=4.4769913074560463e-05)
residuals_naive:  ShapiroResult(statistic=0.8029045462608337, pvalue=5.698079444623409e-28)
residuals_differences:  ShapiroResult(statistic=0.9659961462020874, pvalue=1.8076425772894922e-11)
squarred_residuals_differences:  ShapiroResult(statistic=0.8816245794296265, pvalue=2.258187421269406e-22)

r2 = r2_score(y_test, y_pred)

R² =  0.1557046857947879
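As a sanity check (sklearn's r2_score uses the test-set mean as its baseline, while baseground_mse uses the training-set mean, so the two baselines are close but not identical here), the R² above is essentially 1 - mse / baseground_mse, and the mean of squarred_residuals_differences is, by construction, exactly mse - baseground_mse:

import numpy as np

# The mean of the paired loss differential equals the difference of the two MSEs
print(np.mean(squarred_residuals_differences))  # == mse - baseground_mse (~ -0.0665)

# r2_score takes the test-set mean as its baseline, so this is only approximately R²
print(1 - mse / baseground_mse)                 # ~ 0.1557, very close to the R² above

So a significance statement about mean(squarred_residuals_differences) < 0 is the same thing as a significance statement about the model's MSE beating the baseline's.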

This Cross Validated thread may be related.

TL;DR: How can I statistically test, for two paired non-normal distributions d1 and d2, that mean(d1) > mean(d2)?

  • Welcome to Cross Validated! Squared residuals cannot be normal, since values below zero are always possible with normal distributions. What do you want to learn by testing if the mean of one set of squared residuals exceeds the mean of another set of squared residuals? // The Benavoli (JMLR 2017) “time for a change” article might be of interest. They address a somewhat different problem, but maybe that should be the problem you address. – Dave Apr 26 '23 at 13:43
  • Thank you! Yes, I agree that the squared residuals can't be normal, but the distribution of the difference between the squared residuals of the model and the squared residuals of the baseground naive prediction could have been normal (I think so, at least). My ultimate goal is to "prove" in a rigorous way that my model has a better MSE on the test set than the baseline model. By proving, I mean with a certain significance level. – Guilhem Nespoulous Apr 26 '23 at 13:51
  • Showing that one model has a statistically significantly better MSE (or any other performance score of interest) is exactly the topic addressed in that Benavoli article. They make a case for specific methods in Bayesian statistics, but they also go through more classical methods that you might find appealing. // Also, most normality testing is worthless. – Dave Apr 26 '23 at 13:54
  • Okay, thank you very much I'll check that Benavoli article. – Guilhem Nespoulous Apr 26 '23 at 14:00

1 Answer


This sounds like a case for the Diebold-Mariano test. The test does not assume the prediction errors or losses to be normal. It simply compares the means of the two sets of losses using a $t$-test, possibly using HAC standard errors. In R, the test is implemented in the dm.test function in the forecast package.
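For the cross-sectional setting in the question (a single test set with no time ordering), here is a minimal sketch in Python, reusing the variable names from the question and assuming the test-set errors can be treated as independent, so that no HAC correction is needed; it is simply a one-sided test on the loss differential:

import numpy as np
from scipy import stats

# Squared-error losses of the XGBoost model and of the naive training-mean forecast
loss_model = (np.asarray(y_test) - np.asarray(y_pred)) ** 2
loss_naive = (np.asarray(y_test) - np.asarray(baseground)) ** 2

# Loss differential d; H0: E[d] = 0, H1: E[d] < 0 (the model's loss is smaller on average)
d = loss_model - loss_naive
t_stat, p_value = stats.ttest_1samp(d, popmean=0, alternative="less")
print(t_stat, p_value)

With independent errors this reduces to an ordinary $t$-test on the loss differential; the HAC variance estimator in dm.test matters mainly when forecast errors are autocorrelated, e.g. for multi-step time-series forecasts. Thanks to the central limit theorem, the test on the mean differential does not require the losses themselves to be normal.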

Richard Hardy
  • Does Diebold-Mariano consider multiple losses or multiple residuals? I see a subtle but real difference between the two. (I just checked the function documentation: "Forecast errors" makes it sound like the individual residuals.) – Dave Apr 26 '23 at 15:56
  • @Dave, I am not entirely sure what you are asking about. The null hypothesis of the DM test is that the expected value of the loss differential is zero. The loss differential is the difference between the loss from one forecast and the loss from another forecast. – Richard Hardy Apr 26 '23 at 16:07
  • I meant if the test considers residuals (or individual contributors to the overall loss function) as opposed to something like the results from cross validation (the subject of that Benavoli paper). Both the function documentation and your comment here make it pretty clear that the test considers the former, which sounds like what the OP wants. – Dave Apr 26 '23 at 16:08
  • @Dave, the test considers out-of-sample forecast errors, not in-sample residuals. It seems to me the OP is interested in out-of-sample forecast errors, too. – Richard Hardy Apr 26 '23 at 16:37
  • Yes, but those errors come from one test set, not multiple sets (each of which provides one loss score for the entire set) like in the Benavoli paper. – Dave Apr 26 '23 at 16:38
  • Yes, in the DM test the errors come from one test set. – Richard Hardy Apr 26 '23 at 16:45