
The effect-size tag has no wiki. The Wikipedia page on effect size does not provide a precise general definition, and I have never seen a general definition of effect size. However, when reading discussions such as this one, I am under the impression that people have in mind a general notion of effect size in the context of statistical tests. I have already seen the standardized mean $\theta=\mu/\sigma$ called the effect size for a normal model ${\cal N}(\mu,\sigma^2)$, and the standardized mean difference $\theta=(\mu_1-\mu_2)/\sigma$ for a "two Gaussian means" model. But what about a general definition? The interesting property shared by the two examples above is that, as far as I can see, the power depends on the parameters only through $\theta$, and it is an increasing function of $|\theta|$, when we consider the usual tests of $H_0:\{\mu=0\}$ in the first case and $H_0:\{\mu_1=\mu_2\}$ in the second.

Is this property the underlying idea behind the notion of effect size? That would mean that the effect size is only defined up to a monotone one-to-one transformation. Or is there a more precise general definition?
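To make the property concrete, here is a minimal numerical sketch (the function name and parameter values are mine, chosen for illustration, and I use the $z$-test with known $\sigma$ for simplicity): the power depends on $(\mu, \sigma)$ only through $\theta = \mu/\sigma$ and grows with $|\theta|$.

```python
# Illustrative sketch: power of the two-sided z-test of H0: mu = 0 in a
# N(mu, sigma^2) model with known sigma. It depends on (mu, sigma) only
# through theta = mu / sigma and increases with |theta|.
from scipy.stats import norm

def z_test_power(mu, sigma, n, alpha=0.05):
    """Exact power of the two-sided z-test of H0: mu = 0 with known sigma."""
    z_crit = norm.ppf(1 - alpha / 2)
    delta = (mu / sigma) * n ** 0.5          # sqrt(n) * theta
    return norm.cdf(-z_crit + delta) + norm.cdf(-z_crit - delta)

n = 25
print(z_test_power(mu=0.5, sigma=1.0, n=n))   # theta = 0.5  -> ~0.70
print(z_test_power(mu=5.0, sigma=10.0, n=n))  # same theta   -> same power
print(z_test_power(mu=0.8, sigma=1.0, n=n))   # larger theta -> ~0.98
```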

  • +1, great question. One way to think about effect size is that p-values simultaneously measure magnitude and N, so the ES is p decoupled from N (this is, of course, only a loose way of putting it). – gung - Reinstate Monica May 20 '13 at 21:44
  • Effect size is only easy to pin down in some specific cases. With a two-sample test of means, the notion of effect size is straightforward, but add in a third sample and it becomes less clear (if you do ANOVA, you can write it in terms of variance, though). For some tests it just boils down to nothing clearer than "whatever this test statistic measures". – Glen_b May 20 '13 at 22:33
  • great question too! +1 – Tim May 21 '13 at 02:00
  • @Glen_b For any Gaussian linear model the power of an $F$-test is an increasing function of the noncentrality parameter (see the second part of my answer here http://stats.stackexchange.com/a/59428/8402). It is something like $(\sum \alpha_i^2)/\sigma^2$ for ANOVA. – Stéphane Laurent May 21 '13 at 05:26
  • Yes, I was trying to give a nontechnical sense of what the noncentrality parameter represents in terms of the original problem. Relating it to a variance is an oversimplification, but at least with ANOVA we're used to multiple measures of variation (such as between means as well as around means). It's not much good saying "the effect size is monotonic in the noncentrality parameter" if the person reading a tag wiki won't follow. Your question would need to guide us as to where you want the comments pitched, if you don't want fairly basic responses (which is normally where I'd aim). – Glen_b May 21 '13 at 05:32
  • So let me rephrase: "Effect size is only easy to pin down in a non-technical way in some specific cases" – Glen_b May 21 '13 at 05:35
  • @Glen_b I have nothing against basic answers! Any comment is welcome. Thanks. – Stéphane Laurent May 21 '13 at 07:44

2 Answers


I don't think there can be a general and precise answer. There can be general answers that are loose, and specific answers that are precise.

Most generally (and most loosely) an effect size is a statistical measure of how big some relationship or difference is.

In regression-type problems, one type of effect size is a measure of how much of the dependent variable's variance is accounted for by the model. But this is only precisely answerable (AFAIK) in OLS regression, by $R^2$; there are "pseudo-$R^2$" measures for other regression models. There are also effect size measures for individual independent variables: the parameter estimates (and transformations of them).
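For instance, a minimal sketch of the $R^2$ case on simulated data (the variable names and numbers below are arbitrary):

```python
# Illustrative sketch: R^2 as an effect-size measure in OLS, i.e. the share of
# the dependent variable's variance accounted for by the model.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates
resid = y - X @ beta
r_squared = 1 - resid.var() / y.var()          # proportion of variance explained
print(beta, r_squared)
```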

In a t-test, a good effect size is the standardized difference of the means (this also works in ANOVA, and may work in regression if we pick particular values of the independent variables).
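A minimal sketch of that computation on simulated (arbitrary) data, using the pooled standard deviation:

```python
# Illustrative sketch: the standardized mean difference (Cohen's d) as the
# effect size accompanying a pooled-variance two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=40)
y = rng.normal(loc=0.5, scale=1.0, size=50)

t, p = stats.ttest_ind(x, y)                   # ordinary pooled-variance t-test
sp = np.sqrt(((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1))
             / (len(x) + len(y) - 2))          # pooled standard deviation
d = (x.mean() - y.mean()) / sp                 # Cohen's d
print(t, p, d)
```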

and so on.

There are whole books on the subject; I used to have one, and I believe that Ellis's book is an updated version of it (the title sounds familiar).

Peter Flom
  • Hello Peter. Why do you say that the standardized difference $\theta$ is a good choice for the $t$-test? Is it because of the property I pointed out, namely that the power depends on the parameters $\mu_1$, $\mu_2$, $\sigma$ only through $\theta$ and is an increasing function of $|\theta|$? – Stéphane Laurent May 21 '13 at 05:29
  • Hi @StéphaneLaurent , yes, that is a more formal way of putting it. Or, you could say that it gets bigger as the difference gets bigger, but is not affected by scaling. – Peter Flom May 21 '13 at 10:05

Jacob Cohen popularized the use of effect sizes for social scientists in his classic, Statistical Power Analysis for the Behavioral Sciences (2nd Ed. 1988). (You can find the full text on the Web -- search by the full title and look among the first hits.) In this book Cohen defined and examined many different effect sizes, beginning with the eponymous Cohen's d (based on the t test for means) and going on from there to correlation coefficients, proportions, contingency tables, ANOVA, and multiple regression.

How did Cohen define and describe "effect sizes"? He didn't provide a general definition, but he left clues. Here are some relevant quotations. (Page numbers are from the referenced edition and all emphases are in the original.)

The power of a statistical test depends upon three parameters: the significance criterion, the reliability of the sample results, and the "effect size," that is, the degree to which the phenomenon exists. [p. 4]

The reliability (or precision) of a sample value is the closeness with which it can be expected to approximate the relevant population value. ... For example, one conventional means for assessing the reliability of a statistic is the standard error (SE) of the statistic. [p. 6]

Without intending any necessary implication of causality, it is convenient to use the phrase "effect size" to mean "the degree to which the phenomenon is present in the population," or "the degree to which the null hypothesis is false." [pp 9 - 10]

Whatever the manner of representation of a phenomenon in a particular research in the present treatment, the null hypothesis always means that the effect size is zero. [p. 10]

... when the null hypothesis is false, it is false to some specific degree, i.e., the effect size (ES) is some specific nonzero value in the population. [p. 10]

Thus, whether measured in one unit or another, whether expressed as a difference between two population parameters or the departure of a population parameter from a constant or in any other suitable way, the ES can itself be treated as a parameter which takes the value zero when the null hypothesis is true and some other specific nonzero value when the null hypothesis is false, and in this way the ES serves as an index of degree of departure from the null hypothesis. [p. 10]

Cohen attributes the relative unfamiliarity of behavioral scientists with the ES to "the difference in null hypothesis testing between the procedures of Fisher (1949) and those of Neyman and Pearson (1928, 1933)." [p. 10] In the Fisherian formula, according to Cohen, "no basis for statistical power analysis exists" [p. 11], because a sufficiently quantitative alternative hypothesis is not formulated. However,

By contrast, the Neyman-Pearson formulation posits an exact alternative for the ES, i.e., the exact size of the effect the experiment is designed to detect. [p. 11]

The concept is clear, but so far the discussion has not established how one quantifies the ES. Cohen turns next to this issue.

In any given statistical test, [the ES] must be indexed or measured in some defined unit appropriate to the data, test, and statistical model employed. [p. 11]

He then argues that there are "formidable mathematical-statistical problems in the way" of defining a "universal ES index" and that even if one existed, "the result would express ES in terms so unfamiliar to the researcher in behavioral science as to be self-defeating," [p. 11]. Whether these assertions are correct or not, they show that part of Cohen's motivation is for the ES to be interpretable to the researcher.

He then argues that "some generalization is obviously necessary" in order to prepare a set of tables (as he did, and included in his book) for power analysis. These tables are (of course!) based on the standardized values appropriate to the statistical tests they address [pp 11 - 12].

I have concluded from a study of these statements (and of the rest of the book) that generally

  1. An effect size is defined in the context of a (planned) null hypothesis test to be carried out in the Neyman-Pearson paradigm, where
  • a null hypothesis $H_0$ (one or more distributions) and an alternative hypothesis $H_A$ (typically a set of many distributions) are specified,
  • a test statistic $t$ is proposed, and
  • the power of the test (its chance of rejecting $H_0$) depends on a single parameter (property of the distributions in $H_A$) $\Delta.$
  2. Wherever possible, to be readily interpretable, the effect size $\Delta$ should be directly proportional to some population property that the test statistic $t$ can be considered to estimate (a worked instance for the two-sample $t$ test follows this list).
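To make item 2 concrete (this worked instance is mine, not a formula Cohen states in this form), consider the pooled two-sample $t$ test at level $\alpha$: the power depends on $(\mu_1, \mu_2, \sigma, n_1, n_2)$ only through the noncentrality parameter $\delta$, which factors into the effect size $\Delta$ and a pure sample-size term,

$$\Delta = \frac{\mu_1-\mu_2}{\sigma}, \qquad \delta = \Delta\sqrt{\frac{n_1 n_2}{n_1+n_2}}, \qquad \text{power} = \Pr\left(|T_{\nu}(\delta)| > t_{\nu,\,1-\alpha/2}\right), \quad \nu = n_1+n_2-2,$$

where $T_{\nu}(\delta)$ denotes a noncentral $t$ variable with $\nu$ degrees of freedom.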

(2) precludes freely applying monotonic transformations to the effect size. Quite generally, null hypothesis tests that fit this framework usually suggest a unique effect size: it "looks like" a standardized version of the test statistic with the influence of the sample size removed. For instance, Cohen's d arises in a Student t test setting where the statistic is $t = (\bar x - \bar y) / SE(\bar x - \bar y).$ The formulas for the standard error involve estimates of (or assumptions about) the variances of the populations (or subpopulations) sampled by $x$ and $y.$ The corresponding formula for Cohen's d (there are many, depending on the setting and the assumptions) can always be found by ignoring the sample sizes.
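Here is a minimal numerical check of that last remark (simulated data; the variable names are mine): stripping the sample-size factor out of the pooled two-sample $t$ statistic recovers Cohen's d, since $t = d / \sqrt{1/n_1 + 1/n_2}$ in that setting.

```python
# Illustrative sketch: Cohen's d equals the pooled two-sample t statistic with
# the sample-size factor removed, d = t * sqrt(1/n1 + 1/n2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.6, 1.0, size=45)
n1, n2 = len(x), len(y)

t, _ = stats.ttest_ind(x, y)                   # pooled-variance t statistic
sp = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2))
d_direct = (x.mean() - y.mean()) / sp          # Cohen's d computed directly
d_from_t = t * np.sqrt(1 / n1 + 1 / n2)        # same value, via the t statistic
print(d_direct, d_from_t)                      # agree up to floating point
```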

What makes this program difficult, and a little unclear, is that in complicated settings (such as comparing means from populations assumed to have different variances), the SE of the test statistic is not a simple function of a single sample size. Cohen made choices determined by his need to tabulate values of the power: in the 1960s, when he was doing this work, computing devices were not widely available to researchers, and people relied on printed tables. In some situations Cohen realized it was possible to publish relatively short approximate tables if the ES was suitably defined. For instance, if one uses a Welch t-test in the unequal-variances setting, standardizing the difference in means by a suitably pooled standard deviation gives a value of $\Delta$ that is closely, but only approximately, related to the power of the test. That is, $\Delta$ itself cannot exactly determine the power, which really depends on both sample sizes and on both variances (look at the formulas).
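A small simulation illustrates the point (the parameters below are my own illustrative choices, not Cohen's): two configurations with the same standardized difference and total sample size, but different allocations of variance and sample size, give visibly different Welch-test power.

```python
# Illustrative Monte Carlo sketch: with unequal variances, the same standardized
# difference Delta can correspond to different Welch-test power, because power
# also depends on how the variances and sample sizes pair up.
import numpy as np
from scipy import stats

def welch_power(mu_diff, sd1, sd2, n1, n2, alpha=0.05, n_sim=10_000, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(mu_diff, sd1, size=n1)
        y = rng.normal(0.0, sd2, size=n2)
        _, p = stats.ttest_ind(x, y, equal_var=False)   # Welch t-test
        rejections += p < alpha
    return rejections / n_sim

# Same Delta = 0.8 (difference / root-mean-square SD) and same total N = 40:
print(welch_power(0.8, sd1=1.0, sd2=1.0,   n1=20, n2=20))  # roughly 0.7
print(welch_power(0.8, sd1=0.5, sd2=1.323, n1=30, n2=10))  # noticeably lower
```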

Consequently, there is room for creativity in proposing suitable effect sizes in complex testing settings. This is why Cohen asserted (above) that a general formula can't be given. We have to settle for understanding the underlying principles: an effect size is intimately related to the test you are planning to use, it is influenced by the desire to derive a simple (even if approximate) formula (or table) for power in terms of effect size, and it is intended to be scientifically meaningful and interpretable.

whuber