
Within the scientific literature, there is a tentative proposal to change the default significance level from p = 0.05 to p = 0.005:

http://www.nature.com/articles/s41562-017-0189-z

I understand there is a lot of nuance to this proposal and don't want to get too deep into the pros and cons!

To test the performance of this proposal in the real world, I have collated primary-endpoint p-values for a large number of scientific studies, and have assigned Likert-type ordinal scores to describe the value of each study, where 1 = low-importance study and 5 = highly important study (based on a composite calculation taking journal impact factor, number of citations, h-index and a few other factors into account).

So I have two columns of data as follows:

Column A: 1-5 (Ordinal) - Where 1 = Low Importance Study; 5 = High Importance Study

Column B: 0-1 (Categorical, Dichotomous - does the study's primary endpoint meet p < 0.005? YES or NO, represented as 1 or 0)

I can see from the data that the primary endpoint of most low-value studies does not meet the significance threshold of p < 0.005 (20%), while most high-value studies do meet it (89%). For the studies I have analysed so far, the breakdown by Likert score is as follows (a sketch of how these proportions can be computed is shown after the list):

1: 20% meet p < 0.005

2: 63% meet p < 0.005

3: 85% meet p < 0.005

4: 93% meet p < 0.005

5: 89% meet p < 0.005
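For concreteness, here is a minimal sketch of how these per-score proportions could be computed, assuming the two columns are loaded into a pandas DataFrame. The column names `value_score` and `meets_0005` are hypothetical placeholders, not names from the linked dataset:

```python
import pandas as pd

# Hypothetical layout: one row per study, with the ordinal value score (1-5)
# and the binary indicator for whether the primary endpoint meets p < 0.005.
df = pd.DataFrame({
    "value_score": [1, 1, 2, 3, 4, 5, 5],  # placeholder values
    "meets_0005":  [0, 0, 1, 1, 1, 1, 0],  # placeholder values
})

# Proportion of studies meeting p < 0.005 within each value score
per_score = df.groupby("value_score")["meets_0005"].mean()
print(per_score)
```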

If I group studies scoring 1-2 as "Not valuable" and studies scoring 3-4-5 as "Moderately/Very Valuable", I get the following (see the 2x2 sketch after this short list):

1-2: 45% meet p < 0.005

3-4-5: 89% meet p < 0.005
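Continuing the sketch above, this dichotomization is just a 2x2 contingency table; assuming the same hypothetical DataFrame, it could be built like this:

```python
# Dichotomize the ordinal score: 1-2 = "Not valuable" (0),
# 3-4-5 = "Moderately/Very Valuable" (1)
df["valuable"] = (df["value_score"] >= 3).astype(int)

# 2x2 contingency table: value group vs. meeting p < 0.005
table = pd.crosstab(df["valuable"], df["meets_0005"])
print(table)

# Proportion meeting p < 0.005 within each group
print(df.groupby("valuable")["meets_0005"].mean())
```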

I am wondering how I can describe this better in statistical terms, and what test would be appropriate to describe the association between study value and the binary metric of meeting p < 0.005. In layman's terms, I would like to describe how efficiently this new threshold identifies and excludes low-quality papers, as well as how well it identifies and preserves high-quality papers.

Is Spearman's rho appropriate here? Or would I be better off describing this with receiver operating characteristic (ROC) curves and the language of sensitivity, specificity, etc.?
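Both options are easy to compute side by side. Below is a hedged sketch of each, continuing with the hypothetical DataFrame from above (scipy's `spearmanr` and scikit-learn's `roc_auc_score` are real library functions; the column names remain placeholders):

```python
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Spearman's rho between the ordinal value score and the binary indicator
rho, p_val = spearmanr(df["value_score"], df["meets_0005"])
print(f"Spearman's rho = {rho:.3f} (p = {p_val:.3g})")

# One possible ROC-style framing: treat "meets p < 0.005" as a binary
# classifier of study value, with the dichotomized "valuable" column as
# ground truth. Sensitivity = proportion of valuable studies retained;
# specificity = proportion of not-valuable studies excluded.
sensitivity = df.loc[df["valuable"] == 1, "meets_0005"].mean()
specificity = 1 - df.loc[df["valuable"] == 0, "meets_0005"].mean()
auc = roc_auc_score(df["valuable"], df["meets_0005"])  # binary score: single-point ROC
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, AUC = {auc:.3f}")
```

Note that with a binary "classifier" the ROC curve collapses to a single operating point, so the sensitivity/specificity vocabulary may be the more natural way to phrase the excluding-low-quality versus preserving-high-quality question.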

For interest, my data is here https://ufile.io/k3abnh1s

• How would your analysis account for publication bias and p-hacking? It isn't as if these p values are all reported faithfully. – Demetri Pananos Jan 09 '22 at 06:40
• 1. While I agree that often 5% is much too low (but sometimes not), the very notion of "a" significance level for "science" seems strange to me, is likely risible to your average Bayesian, and surely would be a bad joke to the 5-sigma physicists. 2. Aren't some of the items in your valuableness score likely in part determined by p-value? – Glen_b Jan 09 '22 at 06:57
• ... i.e. measuring the extent to which low p-values are associated with low p-values. 3. Note that a test does not describe association. If you have a hypothesis, name it. If you don't, don't think in terms of tests (e.g. maybe you're after estimation or perhaps even just diagnostics). – Glen_b Jan 09 '22 at 07:08
  • @Glen_b Can you provide an example of results of a (frequentist) hypothesis test which are not evidence for or against association in a general sense? All the positivist nulls seem to be in accord with "no association," the negativist nulls to be in accord with "association at least yay big," and one-sided nulls to be in accord with "no positive association" or "no negative association" to me, but perhaps I haven't thought that through too well. That said, I am in accord with your first comment. :) – Alexis Jan 09 '22 at 07:26
• As an example of @Glen_b's "(but sometimes not)", when comparing the performance of novel machine learning algorithms against state-of-the-art baselines (e.g. Random Forest, Support Vector Machines, etc.), p < 0.05 is likely to stifle research, as you wouldn't expect to be significantly better than the state of the art with very stringent significance levels. The significance level should be set according to the needs of the analysis, and we shouldn't use default values without thinking about what is appropriate. – Dikran Marsupial Jan 09 '22 at 12:02
• @Alexis I think we may be talking at cross purposes. I don't think I said that "results of a hypothesis test are not evidence for or against association in a general sense". There was a point where I was talking about the distinction between (i) describing association - which doesn't in any way require a hypothesis (but in which descriptive statistics would play a role, such as plots or measures of association, for example) - and (ii) testing it, in which a hypothesis is definitely required. (If I have misunderstood your intent there, I apologize.) ... – Glen_b Jan 09 '22 at 12:18
• ... There's a common and seemingly increasing tendency to throw hypothesis tests (and most especially, point nulls) at every kind of statistical task or question, no matter whether the issue is phrased as one of estimation (how much ... questions), description, or whatever else, even though those tasks needn't have any hint of hypothesis about them. – Glen_b Jan 09 '22 at 12:25
• I think that all you really need to know is that (1) lowering the threshold will result in worse selection bias just as it does in stepwise variable selection, (2) gaming (p-hacking) will continue and may get worse, (3) the cost of lower power is being ignored. – Frank Harrell Jan 09 '22 at 14:51
  • @Alexis 1. I'd like to turn your question around. "[C]an you provide an example of results of a (frequentist) hypothesis test which [ARE] evidence for or against association in a general sense?" – Michael Lew Jan 09 '22 at 19:52
• @Alexis 2. The result of a hypothesis test is a decision, and that decision depends on the evidence in, at best, a stepwise all-or-none manner. The result of a significance test - the p-value - depends on the evidence in a continuous manner. – Michael Lew Jan 09 '22 at 19:53
  • @Alexis 3. P-values may well be used as if they are decision surrogates but they can be much more useful to scientific inferences when their inherent information is seen more fully. – Michael Lew Jan 09 '22 at 19:56
  • @Alexis 4. You may know all of this already, but the way you write about it encourages erroneous interpretation of significance tests and encourages the use of statistical inferences in place of scientific inferences. – Michael Lew Jan 09 '22 at 19:57
• @MichaelLew (1) Yes. Every frequentist hypothesis test. (Caveat: I would not use "evidence for or against"; I would say "evidence for, or absence of evidence for", or, for negativist nulls, "evidence against, or absence of evidence against".) (2) Sure, you have kinda described the purpose of a hypothesis test: to make a decision about the parameter of some distribution of interest with respect to a statement about that parameter (dependent on a bunch of assumptions). – Alexis Jan 09 '22 at 20:36
• @MichaelLew (3) I try not to use $p$ values as surrogates for CIs, thanks, and find them merely instrumental for that decision-making purpose of hypothesis tests. If I want to look at the "extremity" of evidence in a non-instrumental sense, I look at the data and the distribution assumed under the null explicitly. (4) We disagree. Have a nice day. (Also: Glen_b clarified his point for me quite nicely in his follow-up comment.) – Alexis Jan 09 '22 at 20:36