Scenario Medicine
Let's say I'm developing a new medicine. I have two groups, `M0` (placebo) and `M1` (real drug), where 1 means "healing observed" and 0 means "no healing observed". I run my experiment and get results which might look like this:
M0 = [ 0, 0, 0, 1, 0, 1, 0, 0, 1, ... ]
M1 = [ 1, 0, 0, 1, 1, 1, 1, 0, 1, ... ]
I would now perform a Kruskal-Wallis test `KW(M0, M1)`, resulting in a p-value, say `PM = 0.04`. The interpretation of `PM` is roughly:

> If we assumed `M0` and `M1` were based on the same distribution, then only in 4 out of 100 times would we find two samples looking as distinct as `M0` and `M1` do.
Practically,

- a low `PM` is "good",
- a low `PM` can be used to guide a "bet" on the medicine, e.g., whether a company should invest, or a patient should be treated, if the costs and benefits of being right or wrong about that bet are known,
- companies can pre-compute a `PM` cutoff value below which they declare themselves "certain enough" and commit to production.
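As a concrete sketch of this scenario (the data below are made up for illustration, assuming Python with SciPy), the test itself is a one-liner:

```python
# Kruskal-Wallis test on two binary outcome groups.
# The observations here are illustrative, not real trial data.
from scipy.stats import kruskal

M0 = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]  # placebo: 1 = healing observed
M1 = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]  # real drug

stat, p = kruskal(M0, M1)
print(f"H = {stat:.3f}, p = {p:.3f}")  # low p suggests the groups differ
```

With binary 0/1 outcomes this is essentially a rank test on proportions; `kruskal` handles the heavy ties via its tie correction.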
Scenario Disc Brake
Let's instead say I'm developing a cheaper replacement disc brake for a car. I again form two groups, my replacement disc `DR` and the original disc `DO`, and run an experiment where I measure whether my brakes break (0) or persist (1) under certain environmental conditions:
DR = [ 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, ...]
DO = [ 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...]
If I now perform a Kruskal-Wallis test `KW(DR, DO)` I will get another p-value, say `PD = 0.82`.
Here things become unclear to me, both how to interpret this value practically and whether a better test is available. In particular:

- `PD` seems to indicate that in 82 out of 100 times (under the assumption my replacement disc is identical to the original) I'd see such an outcome. Does that mean I should not bet on my replacement disc if I wanted a 5% confidence level?
- In other words, would I need `PD >= 0.95` to semantically reach the same level of confidence (relative to the medicine case) that my disc types are actually identical?
- From practical experiments it seems to be much harder (larger `N` needed) to reach such high p-values in "replacement disc"-type experiments. Intuitively I sort of get that testing for the absence of differences is harder than showing that differences exist, but I am wondering if there are better tests I could do in "disc brake"-type studies than in "medicine" studies.
- Is there something else I'm missing here?
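One standard tool for the "disc brake" question is equivalence testing, e.g. TOST (two one-sided tests): instead of asking "is there any difference?", you fix a margin `delta` you would still accept and test whether the true difference lies inside `±delta`. A minimal hand-rolled sketch for two proportions follows; the counts and the margin are assumptions for illustration, not real measurements, and the normal approximation only holds for moderate `n` with proportions away from 0 and 1.

```python
import math

def tost_two_proportions(x1, n1, x2, n2, delta):
    """TOST p-value for H0: |p1 - p2| >= delta vs H1: |p1 - p2| < delta,
    using a simple normal approximation for the difference of proportions."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

    def phi(z):  # standard normal CDF
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    p_lower = 1.0 - phi((diff + delta) / se)  # one-sided test of H0: diff <= -delta
    p_upper = phi((diff - delta) / se)        # one-sided test of H0: diff >= +delta
    return max(p_lower, p_upper)              # reject both -> claim equivalence

# Hypothetical numbers: replacement disc held in 90/100 trials, original in
# 95/100, and we decide we can tolerate at most a 10-point difference.
p = tost_two_proportions(90, 100, 95, 100, delta=0.10)
print(f"TOST p = {p:.3f}")  # LOW p is evidence FOR equivalence
```

With these made-up numbers the TOST p-value comes out around 0.09: even though a Kruskal-Wallis test would return a high (unalarming) p-value, you still cannot claim the discs are interchangeable at the 5% level with this margin and sample size. That asymmetry is exactly why "no difference found" and "equivalent" are different claims.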
[Edit - Background]: I'm writing a paper and am interested in a mixture of both scenarios. I have a system that, on theoretical grounds, should behave differently from a reference for some parameters and identically for others. What tests would ideally be needed to reasonably convince myself and my readers that they are the same when used in some configurations, but are different in other configurations?
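For what it's worth, one way to structure that mixed scenario (a sketch only, with made-up binary data, a made-up equivalence margin `delta`, and a hand-rolled normal-approximation TOST) is to run both a difference test and an equivalence test per configuration and report one of three verdicts:

```python
import math
from scipy.stats import kruskal

def tost_p(x1, n1, x2, n2, delta):
    """TOST p-value for |p1 - p2| < delta (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    d = p1 - p2
    return max(1.0 - phi((d + delta) / se), phi((d - delta) / se))

def classify(sys, ref, delta=0.10, alpha=0.05):
    """Per-configuration verdict: 'different', 'equivalent', or 'inconclusive'."""
    _, p_diff = kruskal(sys, ref)                                 # any difference?
    p_eq = tost_p(sum(sys), len(sys), sum(ref), len(ref), delta)  # within margin?
    if p_diff < alpha:
        return "different"
    if p_eq < alpha:
        return "equivalent"
    return "inconclusive"  # data support neither claim yet

# Made-up binary outcomes: two configurations of the system vs a reference.
ref      = [1] * 90 + [0] * 10
config_a = [1] * 60 + [0] * 40   # expected to behave differently
config_b = [1] * 88 + [0] * 12   # expected to behave the same

print(classify(config_a, ref))
print(classify(config_b, ref))
```

The key design point is that "not different" is never reported as "equivalent": a configuration only earns the "equivalent" verdict if the equivalence test itself is significant, and everything else stays "inconclusive" (usually meaning more samples are needed).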
Comments:

- … `DR`/`DO`, what is the best analytical approach I should take to get to actionable results? – left4bread Jan 15 '21 at 12:09
- … whether `DR` and `DO` are, for the purpose of breaking down under stress, the same? (Assuming I know the cost of betting right or wrong, I just want a likelihood that, based on my experiment trial and test, I might be right or wrong.) – left4bread Jan 15 '21 at 12:24
- … `M_chocolate` and `M_titanium` that a reasonable person would use to convince himself that using chocolate is not a good idea, but titanium is. What is that metric? Edit: saw you updated your post, will look into TOST. – left4bread Jan 15 '21 at 12:30