15

Much emphasis is placed on relying on and reporting effect sizes, rather than p-values, in applied research (see, e.g., the quotes further below).

But isn't an effect size, just like a p-value, a random variable that varies from sample to sample when the same experiment is repeated? In other words, what statistical features (e.g., being less variable from sample to sample than a p-value) would make effect sizes better evidence-measuring indices than p-values?

I should, however, mention an important fact that separates a p-value from an effect size: an effect size is something to be estimated, because it corresponds to a population parameter, whereas a p-value is not an estimate of anything, because there is no population parameter behind it.

To me, an effect size is simply a metric that, in certain areas of research (e.g., human research), helps transform empirical findings obtained from various researcher-developed measurement tools into a common metric (it is fair to say that, with this metric, human research can better fit into the quantitative research club).

Maybe, if we take a simple proportion as an effect size, the following (in R) shows the supremacy of effect sizes over p-values: the p-value changes but the effect size doesn't.

binom.test(55, 100, .5)  ## p-value = 0.3682  ## proportion of success 55% 

binom.test(550, 1000, .5) ## p-value = 0.001731 ## proportion of success 55%
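
Both quantities are of course random variables; a small simulation sketch of my own (not from any quoted source) makes the sample-to-sample variation of each explicit for the n = 100 experiment above:

set.seed(123)
sims <- replicate(1000, {
  x <- rbinom(1, size = 100, prob = 0.55)             # one repetition of the experiment, true proportion 0.55
  c(prop = x / 100,                                   # effect size: the sample proportion
    p.value = binom.test(x, 100, p = 0.5)$p.value)    # p-value against the null of 0.5
})
apply(sims, 1, sd)                                    # how much each quantity varies across repetitions

What I want to know is whether one of the two is a more dependable basis for judging the evidence.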

Note that most effect sizes are linearly related to a test statistic. Thus, it is an easy step to do null-hypothesis testing using effect sizes.

For example, the t statistic resulting from a pre-post design can easily be converted to a corresponding Cohen's d effect size. As such, the distribution of Cohen's d is simply a location-scale transformation of a t distribution.
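
For instance, here is a minimal sketch of that conversion for a paired pre-post design (the data are made up purely for illustration), using the identity $d = t/\sqrt{n}$ for paired data:

set.seed(1)
pre  <- rnorm(30, mean = 100, sd = 15)            # hypothetical pre-test scores
post <- pre + rnorm(30, mean = 5, sd = 10)        # hypothetical post-test scores
tt   <- t.test(post, pre, paired = TRUE)          # paired t-test on the gain scores
d    <- unname(tt$statistic) / sqrt(length(pre))  # Cohen's d based on the SD of the differences (d_z)
c(t = unname(tt$statistic), d = d)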

The quotes:

Because p-values are confounded indices, in theory 100 studies with varying sample sizes and 100 different effect sizes could each have the same single p-value, and 100 studies with the same single effect size could each have 100 different values for p-value.

or

p-value is a random variable that varies from sample to sample. . . . Consequently, it is not appropriate to compare the p-values from two distinct experiments, or from tests on two variables measured in the same experiment, and declare that one is more significant than the other.

Citations:

Thompson, B. (2006). Foundations of behavioral statistics: An insight-based approach. New York, NY: Guilford Press.

Good, P. I., & Hardin, J. W. (2003). Common errors in statistics (and how to avoid them). New York: Wiley.

Nakx
  • 504
rnorouzian
  • 3,986
  • 12
    I don't draw the same conclusions from the quotations (that effect sizes are "superior" or should be reported instead of p-values). I am aware some people have overreacted by making statements like that (such as the BASP ban on p-values). It isn't a one-or-the-other situation: it's a case of pointing out that p-values and effect sizes give different kinds of useful information. Ordinarily one should not be examined without considering it in the context of the other. – whuber Aug 17 '17 at 19:28
  • 2
    I personally think reporting an estimate along with a confidence interval is enough. It gives the effect size (practical significance) and hypothesis testing (statistical significance) at the same time. – Jirapat Samranvedhya Aug 17 '17 at 20:23
  • 1
    Whether p values or effect sizes are 'superior' depends on your perspective. The former follows from the Fisherian NHST tradition, while the latter from Neyman-Pearson tradition. In some fields (biological sciences, humanities), effect sizes tend to be very small, making p values attractive. Conversely, as others note, p-values can be 'forced' smaller through changes in design, like increased N. – HEITZ Aug 17 '17 at 20:51
  • 3
    Is a screwdriver superior to a hammer? – kjetil b halvorsen Oct 27 '17 at 11:23
  • 2
    Is a nut superior to a bolt? – Sextus Empiricus Feb 02 '19 at 08:23

5 Answers

29

The advice to provide effect sizes rather than P-values is based on a false dichotomy and is silly. Why not present both?

Scientific conclusions should be based on a rational assessment of available evidence and theory. P-values and observed effect sizes alone or together are not enough.

Neither of the quoted passages that you supply is helpful. Of course P-values vary from experiment to experiment: the strength of evidence in the data varies from experiment to experiment. The P-value is just a numerical extraction of that evidence by way of the statistical model. Given the nature of the P-value, it is very rarely relevant for analytical purposes to compare one P-value with another, so perhaps that is what the quote's author is trying to convey.

If you find yourself wanting to compare P-values then you probably should have performed a significance test on a different arrangement of the data in order to sensibly answer the question of interest. See these questions: p-values for p-values? and If one group's mean differs from zero but the other does not, can we conclude that the groups are different?

So, the answer to your question is complex. I do not find dichotomous responses to data based on either P-values or effect sizes to be useful, so are effect sizes superior to P-values? Yes, no, sometimes, maybe, and it depends on your purpose.

Michael Lew
  • 15,102
  • 2
    I think it would be preferable to present the effect size and its confidence interval, provided the analyst is correctly able to state what a meaningful effect size is for the study at hand. The confidence interval, unlike the p-value, gives the reader a sense of both the precision of the estimate as well as its extremity. – AdamO Aug 18 '17 at 22:15
  • 1
    @AdamO Yes, I largely agree, but the P-value has two things to offer and should not be omitted. It is an index of the strength of evidence against the null, something that can only be gotten from a confidence interval by a very experienced eye, and an exact P-value does not directly invite the dichotomy of inside/outside that the confidence interval does. Of course, a likelihood function offers advantages over both. – Michael Lew Aug 19 '17 at 00:13
19

In the context of applied research, effect sizes are necessary for readers to interpret the practical significance (as opposed to statistical significance) of the findings. In general, p-values are far more sensitive to sample size than effect sizes are. If an experiment measures an effect size accurately (i.e. it is sufficiently close to the population parameter it is estimating) but yields a non-significant p-value then, all things being equal, increasing the sample size will result in the same effect size but a lower p-value. This can be demonstrated with power analyses or simulations.

In light of this, it is possible to achieve highly significant p-values for effect sizes that have no practical significance. In contrast, study designs with low power can produce non-significant p-values for effect sizes of great practical importance.

It is difficult to discuss the concepts of statistical significance vis-a-vis effect size without a specific real-world application. As an example, consider an experiment that evaluates the effect of a new studying method on students' grade point average (GPA). I would argue that an effect size of 0.01 grade points has little practical significance (i.e. 2.50 compared to 2.51). Assuming a sample size of 2,000 students in both treatment and control groups, and a population standard deviation of 0.5 grade points:

set.seed(12345)
control.data <- rnorm(n=2000, mean = 2.5, sd = 0.5)
set.seed(12345)  # resetting the seed makes the treatment draws an exact 0.01-point shift of the control draws
treatment.data <- rnorm(n=2000, mean = 2.51, sd = 0.5)
t.test(x = control.data, y = treatment.data, alternative = "two.sided", var.equal = TRUE)

treatment sample mean = 2.51

control sample mean = 2.50

effect size = 2.51 - 2.50 = 0.01

p = 0.53

Increasing the sample size to 20,000 students and holding everything else constant yields a significant p-value:

set.seed(12345)
control.data <- rnorm(n=20000, mean = 2.5, sd = 0.5)
set.seed(12345)
treatment.data <- rnorm(n=20000, mean = 2.51, sd = 0.5)
t.test(x = control.data, y = treatment.data, alternative = "two.sided", var.equal = TRUE)  

treatment sample mean = 2.51

control sample mean = 2.50

effect size = 2.51 - 2.50 = 0.01

p = 0.044

Obviously it's no trivial thing to increase the sample size by an order of magnitude! However, I think we can all agree that the practical improvement offered by this study method is negligible. If we relied solely on the p-value then we might believe otherwise in the n=20,000 case.
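
For reference, a quick power calculation of my own (not part of the original answer) gives the per-group sample size needed to detect a 0.01-point difference with 80% power under these assumptions; it comes out in the tens of thousands of students per group.

power.t.test(delta = 0.01, sd = 0.5, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")   # 'n' in the output is the sample size per group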

Personally I advocate for reporting both p-values and effect sizes. And bonus points for t- or F-statistics, degrees of freedom and model diagnostics!

Darren James
  • 1,231
  • 8
    @Darren James There is no practical importance in a difference between p=0.065 and p=0.043 beyond the unfortunate assumption that p=0.05 is a bright line that should be respected. Neither P-value represents compelling evidence for or against anything by itself. – Michael Lew Aug 17 '17 at 22:07
  • @Michael Lew Yes, I agree! – Darren James Aug 17 '17 at 22:13
  • Another example taught by my professor was a difference in height of 1 mm in African giraffes. If you measure enough of them, you are bound to get a small p eventually, concluding difference in height between populations. What I get from this is that analysis should be reproducible (i.e. a report with data and code) and from there on, it doesn't really matter what statistic you report. It's all there for you to judge. – Roman Luštrik Aug 18 '17 at 05:52
  • 1
    James, given your code and explanations, you seem to have completely misunderstood the OP's point. Your R code is also wrong, because you have not set var.equal = TRUE even though your SDs are equal. With such a background, I'm not sure why you even posted a response like this. The OP is asking a question that doesn't have an easy answer, at least at the present time! – user138773 Aug 18 '17 at 15:32
  • 1
    I've added var.equal = TRUE to the code. But it's unnecessary in this case. The same p-values are obtained with both var.equal = TRUE and the default var.equal = FALSE. – Darren James Aug 18 '17 at 19:26
  • @user138773 If you take issue with Darren's post then you should specify why you believe he has missed OP's point. The pedantic tone and borderline irrelevant pointing out of the var.equal issue doesn't really add anything to this conversation. – klumbard Aug 18 '17 at 19:37
  • @klumbard, in my question I had asked about the simultaneous comparison of the "obtained p-value" from a sample and the "obtained effect size" from that same sample, and about which of these two might more reliably tell us about the population. What I noticed in Darren's answer is that Darren is comparing the p-value from a sample BUT an effect size that is for the population. – rnorouzian Aug 18 '17 at 20:41
  • I just added the sample effect size in my response. – Darren James Aug 18 '17 at 21:15
  • "Obviously it's no trivial thing to increase the sample size by an order of magnitude!" Ironically enough, in my field is sometimes is. "I want a lower p-value" mostly means running something over the weekend. – Fomite Aug 18 '17 at 21:29
  • Dear Darren, I never asked about what you provided. I mainly wanted to know, for a sample result (NOT with 20,000), what "statistical features", i.e., distributional properties, would make an effect size a better fit. The effect size you are calculating is basically a population effect size and not a sample effect size. Stated differently, I asked for this through simulation and shared code with you, but you keep misunderstanding the question and conflating the concepts. – rnorouzian Aug 18 '17 at 21:29
  • @Fomite: are you sure you are correcting for various types of pseudoreplication? – David Ernst Sep 14 '17 at 23:56
  • @DavidErnst Yep. – Fomite Sep 14 '17 at 23:58
7

I currently work in the data science field, and before that I worked in education research. At each "career" I have collaborated with people who did not come from a formal background in statistics and for whom statistical (and practical) significance rested heavily on the p-value. I've learned to include and emphasize effect sizes in my analyses because there is a difference between statistical significance and practical significance.

Generally, the people I worked with cared about one thing: "Does our program/feature make an impact, yes or no?" To a question like this, you can do something as simple as a t-test and report to them, "Yes, your program/feature makes a difference." But how large or small is this "difference"?

First, before delving into this topic, I'd like to summarize what we refer to when speaking of effect sizes:

Effect size is simply a way of quantifying the size of the difference between two groups. [...] It is particularly valuable for quantifying the effectiveness of a particular intervention, relative to some comparison. It allows us to move beyond the simplistic, 'Does it work or not?' to the far more sophisticated, 'How well does it work in a range of contexts?' Moreover, by placing the emphasis on the most important aspect of an intervention - the size of the effect - rather than its statistical significance (which conflates effect size and sample size), it promotes a more scientific approach to the accumulation of knowledge. For these reasons, effect size is an important tool in reporting and interpreting effectiveness.

It's the Effect Size, Stupid: What effect size is and why it is important

Next, what is a p-value, and what information does it provide us? In as few words as possible, a p-value is the probability of observing a difference at least as extreme as the one we observed, assuming the null hypothesis is true. We therefore reject (or fail to reject) the null hypothesis when this p-value is smaller than a chosen threshold ($\alpha$).
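
To make that definition concrete, here is a small simulation sketch of my own (not from the quoted sources) that approximates the two-sided p-value for the binom.test(55, 100, .5) example in the question by counting how often results at least as extreme as 55/100 occur when the null (p = 0.5) is true:

set.seed(42)
null.props <- rbinom(100000, size = 100, prob = 0.5) / 100   # sample proportions generated under the null
mean(abs(null.props - 0.5) >= abs(0.55 - 0.5))               # fraction at least as extreme as the observed 0.55
## roughly 0.37, close to the exact p-value of 0.3682 reported by binom.test(55, 100, .5)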

Why Isn't the P Value Enough?

Statistical significance is the probability that the observed difference between two groups is due to chance. If the P value is larger than the alpha level chosen (eg, .05), any observed difference is assumed to be explained by sampling variability. With a sufficiently large sample, a statistical test will almost always demonstrate a significant difference, unless there is no effect whatsoever, that is, when the effect size is exactly zero; yet very small differences, even if significant, are often meaningless. Thus, reporting only the significant P value for an analysis is not adequate for readers to fully understand the results.

And to corroborate @DarrenJames's comments regarding large sample sizes

For example, if a sample size is 10 000, a significant P value is likely to be found even when the difference in outcomes between groups is negligible and may not justify an expensive or time-consuming intervention over another. The level of significance by itself does not predict effect size. Unlike significance tests, effect size is independent of sample size. Statistical significance, on the other hand, depends upon both sample size and effect size. For this reason, P values are considered to be confounded because of their dependence on sample size. Sometimes a statistically significant result means only that a huge sample size was used. [There is a mistaken view that this behaviour represents a bias against the null hypothesis. Why does frequentist hypothesis testing become biased towards rejecting the null hypothesis with sufficiently large samples? ]

Using Effect Size—or Why the P Value Is Not Enough

Report Both P-values and Effect Sizes

Now, to answer the question: are effect sizes superior to p-values? I would argue that each serves as an important component of a statistical analysis, that they cannot be compared in such terms, and that they should be reported together. The p-value is a statistic that indicates statistical significance (a difference from the null distribution), whereas the effect size describes how much of a difference there is.

As an example, say your supervisor, Bob, who is not very stats-friendly, is interested in seeing whether there is a significant relationship between wt (weight) and mpg (miles per gallon). You start the analysis with the hypotheses

$$ H_0: \beta_{wt} = 0 \text{ vs } H_A: \beta_{wt} \neq 0 $$

being tested at $\alpha = 0.05$

> data("mtcars")
> 
> fit = lm(formula = mpg ~ wt, data = mtcars)
> 
> summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

From the summary output we can see that we have a t-statistic with a very small p-value. We can comfortably reject the null hypothesis and report that $\beta_{wt} \neq 0$. However, your boss asks, "Well, how different is it?" You can tell Bob, "Well, it looks like there is a negative linear relationship between mpg and wt; for every one-unit increase in wt, there is a decrease of about 5.34 in mpg."
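
As a small addition of my own (not in the original answer), confint() on the same fit gives an interval estimate for the slope, which, as several commenters note, conveys the size of the effect and its precision at once:

> confint(fit, "wt", level = 0.95)

For these data the interval runs from roughly -6.5 to -4.2 mpg per unit of wt: clearly away from zero (the statistical-significance story) and large enough to matter practically (the effect-size story).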

Thus, you were able to conclude that results were statistically significant, and communicate the significance in practical terms.

I hope this was useful in answering your question.

Jon
  • 2,330
  • Jon, thanks, there are LOTS of grey areas that I was hoping to hear more about but didn't. In lots of situations effect sizes and p-values don't agree, and many trust effect sizes in such situations, which is what I wanted to understand. I was hoping to hear more about simulations that could show the important points. Regarding the matter you brought up, i.e., that an effect size might be tiny but not exactly zero: methods of equivalence testing have been in place for several years now; I like Bayesian equivalence testing even more. Anyway, I probably didn't ask my question clearly enough. -- Thanks – rnorouzian Aug 18 '17 at 18:58
  • BTW, a colleague commented that Darren's R code is wrong, and it seems s/he is right: he has not put var.equal = TRUE. – rnorouzian Aug 18 '17 at 18:59
  • "In lots of situations effect sizes and p-values don't agree" -- can you provide more information on this? An example? Regarding the matter you brought up, i.e., that the effect size might be tiny but not exactly zero -- this situation can arise with a large sample size. Thus, if the effect size is nearly zero, then the variable of interest may not impact the outcome significantly, or the relationship may be incorrectly specified (e.g. linear vs nonlinear). – Jon Aug 18 '17 at 19:06
  • Just try this tool. Also see this document. It seems I will need to ask another question at a later time using some code for clarity. -- Thank you. – rnorouzian Aug 18 '17 at 19:27
  • @rnorouzian, okay, I ran your code. What's your point? – Jon Aug 18 '17 at 19:57
  • Jon, my point was that, OTHER THAN what effect sizes are and what they do ... etc., I was wondering whether, if you run a simulation enough times, there is less variability in the obtained effect sizes than in the p-values. In other words, both of these, as random variables, could jump around from sample to sample, but this variability is smaller for effect sizes. Now, if a researcher gets into a situation where the effect size is considerable in magnitude (e.g., 6 in standardized mean difference units) but the p-value doesn't denote rejection of the null, an effect size is more dependable. – rnorouzian Aug 18 '17 at 20:27
  • In other words, in my question I had asked about the simultaneous comparison of the "obtained p-value" from a sample and the "obtained effect size" from that same sample, and about which of these two might more reliably tell us about the population. What I noticed in Darren's answer is that Darren is comparing the p-value from a sample BUT an effect size that is for the population. – rnorouzian Aug 18 '17 at 20:43
  • If your main aim was to ask about this simulation, then I think you worded your question poorly. The two former answers directly comment on the "usefulness" of effect sizes. They do not touch on the variability of p-values and effect sizes in simulation, primarily because that is not what you asked for. It seems you didn't address this point in your question at all: "I was wondering if you run a simulation large enough times, it seems that there is less variability in the obtained effect sizes than in p-values." You may want to open up a new question for this and explain what you mean. – Jon Aug 18 '17 at 21:02
  • You're right, but what do you think now? – rnorouzian Aug 18 '17 at 21:17
  • Great explanation, thx Jon! – Etienne Juneau Oct 20 '18 at 20:02