
Here's the situation: say we have $n$ data points. There are two treatments, A and B. If treatment A was better than treatment B we give a score of 1, in the opposite case we give a -1, and if there is no perceived difference we give a 0. How do I then test whether one treatment is "better" than the other?

I've looked into the Mann-Whitney test, but it ignores the cases where we have 0s, i.e., more 0s don't affect the result. To me, they should, as more 0s would indicate more "sameness". See here.

So I want a test that says whether one treatment is more effective than the other while making use of the cases with 0 values. The only thing I can think of off the top of my head is to somehow use the proportions of 1s and -1s, but I feel like I'd be violating assumptions doing that (maybe I'm wrong, though).

Any thoughts?

DanE
  • Is your null hypothesis (a) the mean comparison score is $0$ or (b) the median comparison score is $0$? They will give different results: (a) is saying the population proportion of $+1$s is equal to the population proportion of $-1$s, while (b) is making the less precise statement that the population proportion of $+1$s and the population proportion of $-1$s are each less than $\frac12$ but need not be equal to each other. – Henry Jun 16 '23 at 19:22
  • The cases with 0 values give you no relevant information. One way to intuit this is to suppose that underlying each treatment is a continuous score. Underlying the data is a random sample of independent score differences. However, you have discretized them into three bins where, under the null hypothesis, the $\pm 1$ bins have equal chances (say $p$) and therefore the $0$ bin has a chance of $1-p$. The count in the $0$ bin reveals only how coarse the discretization is, but it directly tells you nothing about how A compares to B. So, your problem is a standard Binomial one. – whuber Jun 16 '23 at 20:17
  • @whuber, the 0s are important to me, practically speaking. If I have (5,-1s), (10000,0s), and (20,1s) I don't care that I get 4x as many 1s as -1s...because the vast majority of the time it makes no difference. I get that many tests don't use those 0s, but I need one that does. – DanE Jun 16 '23 at 21:11
  • Whether they are important or not, they convey no information concerning the test you ask for. Are you sure you have correctly described the hypothesis you want to test? – whuber Jun 16 '23 at 21:13
  • @Henry, well I guess that's sort of the issue...I don't think I can just calculate a mean and determine whether it's $0$ or not...because I think the distribution of the data wouldn't allow me to do this (please let me know if I'm wrong). I can't really use the median score, otherwise I run into the issue where all of the 0s are irrelevant. – DanE Jun 16 '23 at 21:14
  • @whuber...I suppose what I want to test is whether Treatment A is superior to Treatment B...and if the vast majority of the time they are the same, then this would not be the case...maybe I did not correctly articulate the problem at hand. – DanE Jun 16 '23 at 21:15
  • This whole discussion, just as in the post linked in the question, boils down to the concern that neglecting the uninformative information, as suggested in @whuber's first comment, feels "unfair against the null". But it is perfectly legal from a frequentist testing point of view: a simple binomial test of the probability of "-1" being equal to the probability of "+1" would have the correct size. Such a test would be extremely sensitive when you have a broad "0" bin, like the 10000 0s but only 20 1s and 5 -1s you mentioned as an example. – Ute Jun 17 '23 at 07:02
  • This is similar to someone selling a steel armored umbrella, promoting it as making the user 7 times less likely to be killed by a meteor. The claim may very well be true (e.g. over a 50 year period, 1 chance in a billion versus 1 chance in 7 billion), but one would have to be very gullible to carry this heavy protection wherever one goes. What's needed is a measure of how meaningful the claim actually is (e.g. as compared with a similar claim about wearing seat-belts, where the difference is significant). – Ray Butterworth Jun 17 '23 at 13:09
  • You are confusing the magnitude of the effect with whether or not there is an effect (practical vs. statistical significance). If you want to test the former, you'll need to define a "magnitude" above which you consider the effect to be practically significant, then test for that. What you are asking for is a test of whether or not there is an effect, for which, as has been pointed out above, the zeroes contain no information. – jbowman Jun 17 '23 at 14:24

1 Answer


Test using a normal approximation

Under some assumptions, it should be possible to derive an approximate test that is a bit "softer" in that it allows a less restrictive null hypothesis. This requires specifying a model and formulating that null hypothesis within the model. I'll try this below.

Model

To quote @whuber's comment:

One way to intuit this is to suppose that underlying each treatment is a continuous score. Underlying the data is a random sample of independent score differences. However, you have discretized them into three bins where, under the null hypothesis, the $\pm 1$ bins have equal chances (say $p$) and therefore the $0$ bin has a chance of $1-p$.

Now, let $p_A$ be the probability that treatment A truly works better in an experiment, and $p_{0A}$ the probability that A works better but is given a $0$ score. Denote the corresponding probabilities for treatment B by $p_B$ and $p_{0B}$. If experiments are independent, then your data have a multinomial distribution with probabilities $p_{+1} = p_A - p_{0A}$, $p_{-1} = p_B - p_{0B}$, and $p_0 = p_{0A} + p_{0B}$ for the respective outcomes.
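For concreteness, here is a minimal simulation sketch of this model in R; all four probabilities are made-up illustrative values, chosen so that $p_A + p_B = 1$ and $p_{0A} = p_{0B}$:

    # Hypothetical probabilities: A truly better 52% of the time, but the
    # difference is too small to perceive (score 0) in 90% of experiments
    pA  <- 0.52; pB  <- 0.48
    p0A <- 0.45; p0B <- 0.45
    probs <- c(pB - p0B, p0A + p0B, pA - p0A)   # P(-1), P(0), P(+1)
    set.seed(1)
    scores <- sample(c(-1, 0, 1), 10000, replace = TRUE, prob = probs)
    table(scores)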

This feeling of doing the "wrong" thing may be due to the fact that we don't really want to test the hypothesis $p_A - p_B = 0$ but rather "$p_A - p_B$ is small".

Null hypothesis

In a first round, we are interested in testing the hypothesis $p_A = p_B = 1/2$, or, equivalently, $p_A - p_B = 0$.

Testing $p_A=p_B$

Indeed, to maximize the power of this test, you should condition on the number of zeros, and thus ignore them in your analysis, since they carry no information about $p_A - p_B$. Assume that $p_0$ is quite large, but so is $n$; then the cases where there are only zeros are negligibly rare, and you virtually never run into the problem of being left without data.
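Conditionally on the number of nonzero scores, the count of "+1"s is binomial, so this is just a binomial test. A sketch in R, using the counts from the comments:

    # 20 "+1"s and 5 "-1"s among 25 nonzero scores;
    # under H0: p_A = p_B, each nonzero score is +1 with probability 1/2
    binom.test(20, 25, p = 0.5)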

It still does "feel wrong" to only consider the "-1"s and "+1"s; the example you gave in the comment makes sense:

If I have (5,-1s), (10000,0s), and (20,1s) I don't care that I get 4x as many 1s as -1s...because the vast majority of the time it makes no difference.

If treatments A and B were equally effective, then you would perhaps have 5005 cases where B was better and 5020 cases where A was better, that is, a difference of $p_A - p_B \approx 0.0015$ - indeed almost negligible. It is not so satisfactory to get a $p$-value close to zero for that (in R: binom.test(5, 25)).

"Small difference hypothesis" $p_A - p_B <\delta_0$

Since only $p_{-1}$, $p_{0}$ and $p_{+1}$ are estimable from the data, we need extra assumptions about the probabilities of being given a zero score in order to retrieve $\delta := p_A-p_B$.

  • The simplest option is $p_{0A} = p_{0B}=p_0/2$. Then $\delta = p_{+1}-p_{-1}$.
  • Another option is $p_{0A} = p_{A}\cdot p_{0}$, $p_{0B} = p_{B}\cdot p_{0}$. Then $p_{+1} = p_A(1-p_0)$ and $p_{-1} = p_B(1-p_0)$, so $\delta = (p_{+1}-p_{-1})/(1-p_0)$.

Special underlying models (before discretization) may require more complicated assumptions.
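To see how much this identifying assumption matters, here is the estimated $\delta$ for the example counts from the comments under each of the two options above (a sketch; the variable names are mine):

    # Example counts: 5 "-1"s, 10000 "0"s, 20 "+1"s
    a <- 20; b <- 5; n <- 10025
    p_plus <- a / n; p_minus <- b / n; p_zero <- 1 - p_plus - p_minus
    p_plus - p_minus                    # ~0.0015 under p_{0A} = p_{0B} = p_0/2
    (p_plus - p_minus) / (1 - p_zero)   # 0.6 under p_{0A} = p_A * p_0

The two assumptions lead to radically different estimates (0.0015 vs. 0.6), so this choice deserves some thought.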

Approximate test for $H_0: \delta = p_A - p_B <\delta_0$, given $p_{0A}=p_{0B}$

For large $n$, we may use a normal approximation to test hypotheses about the probabilities - this should be OK when the expected numbers of "+1"s and "-1"s are at least around five. In the following, we assume $p_{0A} = p_{0B}$, which seems reasonable for cases where the original distributions of the treatment effects are supposed to be "almost equal".

Let $a$ denote the number of "+1" and $b$ denote the number of "-1" outcomes. A suitable test statistic is the estimator for $\delta$ given by $d = (a-b)/n$. Since the covariance between $a$ and $b$ is $-np_{-1}p_{+1}$, the variance of $a-b$ is $n\left(p_{+1}(1-p_{+1})+p_{-1}(1-p_{-1})+2p_{-1}p_{+1}\right)=n\left((p_{+1}+p_{-1})-(p_{+1}-p_{-1})^2\right)$.

Using a normal approximation for the number of outcomes, we obtain $$ \text{pval} = 1 - \Phi\left(\frac{a - b - n\delta_0}{\sqrt{a+b-(a-b)^2/n}}\right). $$

For sufficiently large $p_0$, we can treat the outcomes for "-1" and "+1" as (almost) independent.
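Putting this together as a small R function (a sketch; the function name delta_test and the example values of $\delta_0$ are mine, and the test assumes $p_{0A} = p_{0B}$):

    # One-sided approximate test of H0: delta <= delta0 vs H1: delta > delta0
    delta_test <- function(a, b, n, delta0 = 0) {
      d  <- (a - b) / n                      # estimate of delta = p_A - p_B
      se <- sqrt(a + b - (a - b)^2 / n) / n  # plug-in standard error of d
      z  <- (d - delta0) / se                # = (a - b - n*delta0) / sqrt(a + b - (a-b)^2/n)
      c(estimate = d, z = z, p.value = 1 - pnorm(z))
    }

    delta_test(20, 5, 10025, delta0 = 0)     # delta0 = 0: small p-value
    delta_test(20, 5, 10025, delta0 = 0.01)  # practically relevant delta0: p-value near 1

With $\delta_0 = 0$ this reproduces the "unsatisfactory" near-zero $p$-value, while a practically motivated $\delta_0$ makes the 10000 zeros count against declaring A superior.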

Disclaimer: The test is approximate and assumption-based.

Remarks

  • The normal approximation would also yield confidence intervals, which may be more appropriate than a mere $p$-value in the present context. You get an approximate confidence interval for the difference in proportions with the R function prop.test. Again, assuming that you can treat the "+1"s and the "-1"s as independent, for the example (5,-1s), (10000,0s), and (20,1s) you would call prop.test(c(5, 20), c(10025, 10025)) - see the sketch after this list.

  • This could be a case for Bayesian analysis, giving soft bounds :-)
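For completeness, a sketch of that confidence-interval call from the first remark (95% is prop.test's default confidence level):

    # Approximate 95% CI for p_{-1} - p_{+1}, treating the "-1" and "+1"
    # counts as independent binomials (reasonable when p_0 is large)
    prop.test(c(5, 20), c(10025, 10025))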

Ute