Scenario Medicine
Let's say I'm developing a new medicine. I have two groups, `M0` (placebo) and `M1` (real drug), where 1 means "healing observed" and 0 means "no healing observed". I run my experiment and get results which might look like this:
M0 = [ 0, 0, 0, 1, 0, 1, 0, 0, 1, ... ]
M1 = [ 1, 0, 0, 1, 1, 1, 1, 0, 1, ... ]
I would now perform a Kruskal-Wallis test `KW(M0, M1)`, resulting in a p-value, say `PM = 0.04`. The interpretation of `PM` is roughly:

> If we assumed `M0` and `M1` were based on the same distribution, then only in 4 out of 100 times would we find two samples looking as distinct as `M0` and `M1` do.
Practically,

- a low `PM` is "good",
- a low `PM` can be used to guide a "bet" on the medicine, e.g., whether a company should invest, or a patient should be treated, if the costs and benefits of being right or wrong about that bet are known,
- companies can pre-compute a `PM` cutoff value below which they declare themselves "certain enough" and commit to production.
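As a concrete sketch of this scenario (the data below are made up for illustration, assuming Python with SciPy), the test itself is a one-liner:

```python
# Kruskal-Wallis test on two binary outcome groups.
# The observations here are illustrative, not real trial data.
from scipy.stats import kruskal

M0 = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]  # placebo: 1 = healing observed
M1 = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]  # real drug

stat, p = kruskal(M0, M1)
print(f"H = {stat:.3f}, p = {p:.3f}")  # low p suggests the groups differ
```

With binary 0/1 outcomes this is essentially a rank test on proportions; `kruskal` handles the heavy ties via its tie correction.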
Scenario Disc Brake
Let's instead say I'm developing a cheaper replacement disc brake for a car. I again form two groups, my replacement disc `DR` and the original disc `DO`, and run an experiment where I measure whether my brakes break (0) or persist (1) under certain environmental conditions:
DR = [ 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, ...]
DO = [ 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...]
If I now perform a Kruskal-Wallis test `KW(DR, DO)` I will get another p-value, say `PD = 0.82`.
Here things become unclear to me, both how to interpret this value practically and whether a better test is available. In particular:

- `PD` seems to indicate that in 82 out of 100 times (under the assumption my replacement disc is identical to the original) I'd see such an outcome. Does that mean I should not bet on my replacement disc if I wanted a 5% confidence level?
- In other words, would I need `PD >= 0.95` to semantically reach the same level of confidence (relative to the medicine case) that my disc types are actually identical?
- From practical experiments it seems to be much harder (larger `N` needed) to reach such high p-values in "replacement disc"-type experiments. Intuitively I sort of get that testing for the absence of differences is harder than showing that differences exist, but I am wondering if there are better tests I could do in "disc brake"-type studies than in "medicine" studies.
- Is there something else I'm missing here?
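One standard tool for the "disc brake" question is equivalence testing, e.g. TOST (two one-sided tests): instead of asking "is there any difference?", you fix a margin `delta` you would still accept and test whether the true difference lies inside `±delta`. A minimal hand-rolled sketch for two proportions follows; the counts and the margin are assumptions for illustration, not real measurements, and the normal approximation only holds for moderate `n` with proportions away from 0 and 1.

```python
import math

def tost_two_proportions(x1, n1, x2, n2, delta):
    """TOST p-value for H0: |p1 - p2| >= delta vs H1: |p1 - p2| < delta,
    using a simple normal approximation for the difference of proportions."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

    def phi(z):  # standard normal CDF
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    p_lower = 1.0 - phi((diff + delta) / se)  # one-sided test of H0: diff <= -delta
    p_upper = phi((diff - delta) / se)        # one-sided test of H0: diff >= +delta
    return max(p_lower, p_upper)              # reject both -> claim equivalence

# Hypothetical numbers: replacement disc held in 90/100 trials, original in
# 95/100, and we decide we can tolerate at most a 10-point difference.
p = tost_two_proportions(90, 100, 95, 100, delta=0.10)
print(f"TOST p = {p:.3f}")  # LOW p is evidence FOR equivalence
```

With these made-up numbers the TOST p-value comes out around 0.09: even though a Kruskal-Wallis test would return a high (unalarming) p-value, you still cannot claim the discs are interchangeable at the 5% level with this margin and sample size. That asymmetry is exactly why "no difference found" and "equivalent" are different claims.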
[Edit - Background]: I'm writing a paper and am interested in a mixture of both scenarios. I have a system that, on theoretical grounds, should behave differently from a reference for some parameters and identically for others. What tests would ideally be needed to reasonably convince myself and my readers that they are the same when used in some configurations, but are different in other configurations?
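For what it's worth, one way to structure that mixed scenario (a sketch only, with made-up binary data, a made-up equivalence margin `delta`, and a hand-rolled normal-approximation TOST) is to run both a difference test and an equivalence test per configuration and report one of three verdicts:

```python
import math
from scipy.stats import kruskal

def tost_p(x1, n1, x2, n2, delta):
    """TOST p-value for |p1 - p2| < delta (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    d = p1 - p2
    return max(1.0 - phi((d + delta) / se), phi((d - delta) / se))

def classify(sys, ref, delta=0.10, alpha=0.05):
    """Per-configuration verdict: 'different', 'equivalent', or 'inconclusive'."""
    _, p_diff = kruskal(sys, ref)                                 # any difference?
    p_eq = tost_p(sum(sys), len(sys), sum(ref), len(ref), delta)  # within margin?
    if p_diff < alpha:
        return "different"
    if p_eq < alpha:
        return "equivalent"
    return "inconclusive"  # data support neither claim yet

# Made-up binary outcomes: two configurations of the system vs a reference.
ref      = [1] * 90 + [0] * 10
config_a = [1] * 60 + [0] * 40   # expected to behave differently
config_b = [1] * 88 + [0] * 12   # expected to behave the same

print(classify(config_a, ref))
print(classify(config_b, ref))
```

The key design point is that "not different" is never reported as "equivalent": a configuration only earns the "equivalent" verdict if the equivalence test itself is significant, and everything else stays "inconclusive" (usually meaning more samples are needed).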
Comments:

- … `DR`/`DO`, what is the best analytical approach I should take to get to actionable results? – left4bread Jan 15 '21 at 12:09
- … whether `DR` and `DO` are, for the purpose of breaking down under stress, the same? (Assuming I know the cost of betting right or wrong, I just want a likelihood that, based on my experiment trial and test, I might be right or wrong.) – left4bread Jan 15 '21 at 12:24
- … `M_chocolate` and `M_titanium` that a reasonable person would use to convince himself that using chocolate is not a good idea, but titanium is. What is that metric? Edit: saw you updated your post, will look into TOST. – left4bread Jan 15 '21 at 12:30