2

I am trying to compare the errors from two statistical models in order to provide evidence that one is "better" than the other, in the sense of lower prediction error.

To formalize this, I thought that a test of stochastic dominance between two collections of random variables (the out-of-sample (OOS) errors) would be a good idea. Ideally the null hypothesis would be:

$$\mathbb{H}_0: F(x) \ge G(x) \quad \forall x \in \mathbb{R}$$

I have found resources pointing me to the Kruskal-Wallis test, but unfortunately I cannot find a paper explicitly stating and proving one of these (or similar) null hypotheses. Many sources I checked simply state that the null is that the medians of the two distributions are equal, but this is not what I want to check. Any help is appreciated.

  • 1
    Since you have two samples, I believe the Mann-Whitney test would work for you: https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test. Look at the first sentence under "Assumptions and formal statement of hypotheses". – jbowman Jan 12 '18 at 19:42
  • 1
    It seems to be a paired sample, right? – Michael M Jan 12 '18 at 20:02

2 Answers

2

I don't see that you need a special proof for this. It is foundational that these nonparametric tests are testing for stochastic dominance. If you need a reference to cite, I would just use a basic nonparametric statistics textbook. I tend to use Hollander & Wolfe (2013) for that sort of thing.

Regarding your situation, here are a couple additional points:

  1. The Kruskal-Wallis test generalizes the Mann-Whitney U-test to more than two groups, but you only seem to have two, so it's not really necessary.
  2. The Kruskal-Wallis and the Mann-Whitney tests are for independent groups, but you presumably have paired data. That is, you have the prediction error for a given observation from each of two models. Those two errors are meaningfully paired. You need to take that into account, so something like the Wilcoxon signed rank test is presumably appropriate.
  3. These tests are for stochastic dominance, but that isn't quite the same as being tests of medians. It is common to call them tests of medians, but that is only true under very restrictive circumstances (cf. "What exactly does a non-parametric test accomplish & What do you do with the results?").
  4. Lastly, it isn't clear what you mean by "robust test for stochastic dominance" in the title. Nonparametric tests are already robust in the sense typically meant in statistics.
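
To make the paired-data point concrete, here is a minimal sketch of a one-sided Wilcoxon signed-rank test in plain Python. The data are made up for illustration, `wilcoxon_signed_rank` is a hypothetical helper name, and the p-value uses the large-sample normal approximation with no tie or continuity correction, so it is only rough for small samples. In practice a library routine such as `scipy.stats.wilcoxon` with `alternative="less"` would be the more usual choice.

```python
import math

def wilcoxon_signed_rank(errors_a, errors_b):
    """One-sided Wilcoxon signed-rank test (normal approximation).

    Alternative hypothesis: errors_a tend to be smaller than errors_b.
    Returns (w_plus, z, p_value).
    """
    # Paired differences; zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences (A worse than B).
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Null mean and standard deviation of W+ (no tie correction).
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd

    # Lower-tail probability: a small W+ is evidence that A's errors are smaller.
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return w_plus, z, p_value

# Hypothetical absolute prediction errors (arbitrary units) on the same 10 observations.
errors_a = [8, 11, 5, 9, 14, 7, 10, 6, 12, 4]
errors_b = [10, 15, 9, 10, 19, 13, 8, 14, 15, 13]

w_plus, z, p_value = wilcoxon_signed_rank(errors_a, errors_b)
print(w_plus, round(z, 3), round(p_value, 4))
```

The differences are taken within each observation, which is what respects the pairing; feeding the same two lists to a Mann-Whitney test would ignore it.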
1
  1. You could rank your observations, report the proportion of higher-ranked values in each group, and say something like "the proportion of higher-ranked values in A is greater than in B". Visualising the mean rank helps. Here is a resource that helped me: Kruskal–Wallis test: compare more then two groups by Dr. Yury Zablotski.

Quoting that resource: "As you can see, the mean (in red) and the median (in blue) of group salaries are showing similar trend, but neither of them catches the high-salaries-bump of “Advanced” education group. The mean-rank (in black) does."

  2. If you specifically want to answer a question such as "Is the performance of A better than that of B?" or "Does A have a higher proportion of greater scores (values) than B?", then the Vargha-Delaney A effect size measure might be one way to go. Here are a couple of resources:

1. A nice blog post talking about when and how to use it

Instead, we could use the Vargha-Delaney A measure, which tells us how often, on average, one technique outperforms the other. When applied to two populations (like the results of our two techniques), the A measure is a value between 0 and 1: when the A measure is exactly 0.5, then the two techniques achieve equal performance; when A is less than 0.5, the first technique is worse; and when A is more than 0.5, the second technique is worse. The closer to 0.5, the smaller the difference between the techniques; the farther from 0.5, the larger the difference.

2. Vargha and Delaney's paper

Note: It only compares two groups.
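
As a rough sketch of how the A measure can be computed: it estimates P(X > Y) + 0.5 P(X = Y) by counting over all cross-group pairs. The function name and the data below are hypothetical and only for illustration. Two caveats: for prediction errors lower is better, so an A below 0.5 would favour the first model, and this count treats the two groups as independent samples, so it does not use any pairing.

```python
def vargha_delaney_a(xs, ys):
    """Estimate A = P(X > Y) + 0.5 * P(X == Y) from two samples."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    # Count, over every (x, y) pair, how often x wins and how often they tie.
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

# Hypothetical scores for two techniques.
scores_a = [1, 2, 3]
scores_b = [2, 3, 4]

a_measure = vargha_delaney_a(scores_a, scores_b)   # below 0.5: A's values tend to be smaller
a_identical = vargha_delaney_a([1, 2], [1, 2])     # identical samples give exactly 0.5
print(a_measure, a_identical)
```

The same quantity can be obtained from the Mann-Whitney U statistic as U divided by the product of the two sample sizes, which is why it pairs naturally with that test.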