Jacob Cohen popularized the use of effect sizes for social scientists in his classic, Statistical Power Analysis for the Behavioral Sciences (2nd Ed. 1988). (You can find the full text on the Web -- search by the full title and look among the first hits.) In this book Cohen defined and examined many different effect sizes, beginning with the eponymous Cohen's d (based on the t test for means) and going on from there to correlation coefficients, proportions, contingency tables, ANOVA, and multiple regression.
How did Cohen define and describe "effect sizes"? He didn't provide a general definition, but he left clues. Here are some relevant quotations. (Page numbers are from the referenced edition and all emphases are in the original.)
The power of a statistical test depends upon three parameters: the significance criterion, the reliability of the sample results, and the "effect size," that is, the degree to which the phenomenon exists. [p. 4]
The reliability (or precision) of a sample value is the closeness with which it can be expected to approximate the relevant population value. ... For example, one conventional means for assessing the reliability of a statistic is the standard error (SE) of the statistic. [p. 6]
Without intending any necessary implication of causality, it is convenient to use the phrase "effect size" to mean "the degree to which the phenomenon is present in the population," or "the degree to which the null hypothesis is false." [pp 9 - 10]
Whatever the manner of representation of a phenomenon in a particular research in the present treatment, the null hypothesis always means that the effect size is zero. [p. 10]
... when the null hypothesis is false, it is false to some specific degree, i.e., the effect size (ES) is some specific nonzero value in the population. [p. 10]
Thus, whether measured in one unit or another, whether expressed as a difference between two population parameters or the departure of a population parameter from a constant or in any other suitable way, the ES can itself be treated as a parameter which takes the value zero when the null hypothesis is true and some other specific nonzero value when the null hypothesis is false, and in this way the ES serves as an index of degree of departure from the null hypothesis. [p. 10]
Cohen attributes the relative unfamiliarity of behavioral scientists with the ES to "the difference in null hypothesis testing between the procedures of Fisher (1949) and those of Neyman and Pearson (1928, 1933)." [p. 10] In the Fisherian formulation, according to Cohen, "no basis for statistical power analysis exists" [p. 11], because a sufficiently quantitative alternative hypothesis is not formulated. However,
By contrast, the Neyman-Pearson formulation posits an exact alternative for the ES, i.e., the exact size of the effect the experiment is designed to detect. [p. 11]
The concept is clear, but so far the discussion has not established how one quantifies the ES. Cohen turns next to this issue.
In any given statistical test, [the ES] must be indexed or measured in some defined unit appropriate to the data, test, and statistical model employed. [p. 11]
He then argues that there are "formidable mathematical-statistical problems in the way" of defining a "universal ES index" and that even if one existed, "the result would express ES in terms so unfamiliar to the researcher in behavioral science as to be self-defeating" [p. 11]. Whether or not these assertions are correct, they show that part of Cohen's motivation was that the ES be interpretable to the researcher.
He then argues that "some generalization is obviously necessary" in order to prepare a set of tables (as he did, and included in his book) for power analysis. These tables are (of course!) based on the standardized values appropriate to the statistical tests they address [pp 11 - 12].
I have concluded from a study of these statements (and of the rest of the book) that, generally,

1. An effect size is defined in the context of a (planned) null hypothesis test to be carried out in the Neyman-Pearson paradigm, where
   - a null hypothesis $H_0$ (one or more distributions) and an alternative hypothesis $H_A$ (typically a set of many distributions) are specified,
   - a test statistic $t$ is proposed, and
   - the power of the test (its chance of rejecting $H_0$) depends on a single parameter $\Delta$ (a property of the distributions in $H_A$).
2. Wherever possible, to be readily interpretable, the effect size $\Delta$ should be directly proportional to some population property that the test statistic $t$ can be considered to estimate.
Requirement (2) precludes freely applying monotonic transformations of $t.$ Null hypothesis tests that fit this framework usually suggest a unique effect size: it "looks like" a standardized version of the test statistic with the influence of the sample size removed. For instance, Cohen's d arises in a Student t test setting, where the statistic is $t = (\bar x - \bar y) / SE(\bar x - \bar y).$ The formulas for the standard error involve estimates of (or assumptions about) the variances of the populations (or subpopulations) sampled by $x$ and $y.$ The corresponding formula for Cohen's d (there are many, depending on the setting and the assumptions) can always be found by ignoring the sample sizes.
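To see "ignoring the sample sizes" at work in the simplest setting (two independent groups with a common variance, writing $s_p$ for the pooled standard deviation), the test statistic factors as

$$t \;=\; \frac{\bar x - \bar y}{s_p\sqrt{1/n_1 + 1/n_2}} \;=\; d\,\sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \qquad d = \frac{\bar x - \bar y}{s_p}.$$

Stripping off the factor that involves only $n_1$ and $n_2$ leaves Cohen's d, and in this equal-variance case the power of the test is determined by (the population version of) $d$ together with the sample sizes.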
What makes this program difficult, and a little unclear, is that in complicated settings (such as comparing means from populations assumed to have different variances), the SE of the test statistic is not a simple function of a single sample size. Cohen made choices dictated by his need to tabulate values of the power: in the 1960s, when he was doing this work, computing devices were not widely available to researchers, and people relied on printed tables. In some situations Cohen realized it was possible to publish relatively short approximate tables if the ES was defined suitably. For instance, if one is using a Welch t-test in the unequal-variances setting, standardizing the difference in means by a suitably pooled standard deviation gives a value of $\Delta$ that is closely, but only approximately, related to the power of the test. That is, $\Delta$ by itself cannot exactly determine the power, which really depends on both sample sizes and on both variances (look at the formulas).
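Here is a small numerical sketch of that last claim. The helper function and the choice to standardize by the root mean square of the two standard deviations are mine, for illustration only, and are not Cohen's exact recipe: two designs with the same standardized effect size $\Delta = 0.5$ can have appreciably different power under the Welch test.

```python
import numpy as np
from scipy import stats

def welch_power(mu_diff, sd1, sd2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided Welch t-test, via the noncentral t distribution."""
    se = np.sqrt(sd1**2 / n1 + sd2**2 / n2)       # SE of the difference in means
    ncp = mu_diff / se                            # noncentrality parameter
    # Welch-Satterthwaite approximate degrees of freedom
    df = se**4 / ((sd1**2 / n1)**2 / (n1 - 1) + (sd2**2 / n2)**2 / (n2 - 1))
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(tcrit, df, ncp)) + stats.nct.cdf(-tcrit, df, ncp)

# Same standardized effect size Delta = 0.5 (standardizing by sqrt((sd1^2 + sd2^2)/2)),
# same sample sizes, but different variances -- and different power.
for sd1, sd2 in [(1.0, 1.0), (1.4, 0.2)]:
    mu_diff = 0.5 * np.sqrt((sd1**2 + sd2**2) / 2)   # mean difference giving Delta = 0.5
    print(f"sd1={sd1}, sd2={sd2}: power = {welch_power(mu_diff, sd1, sd2, n1=10, n2=40):.3f}")
```

Both designs report the same effect size but not the same power, which is exactly the sense in which the tabulated relationship between $\Delta$ and power is approximate.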
Consequently, there is room for creativity in proposing suitable effect sizes in complex testing settings. This is why Cohen asserted (above) that no general formula can be given. We have to settle for understanding the underlying principles: an effect size is intimately tied to the test you plan to use, shaped by the desire for a simple (even if approximate) formula or table giving power in terms of the effect size, and chosen with the intent that it be scientifically meaningful and interpretable.