
This post is reprinted from question #3375492 on math.stackexchange.com, where it was recommended that I ask this community instead.

My motivations
I often see claims that post-hoc power is nonsense. Editorials of this kind are mass-produced and published in many established journals. I can easily access their definitions, but those definitions are never chunked down into formulas or code.

However, it is unclear what the post-hoc power they criticize actually is. Certainly, they write a definition in words, but it is not chunked down into formulas or calculation code. Therefore, what they want to criticize is not identified, or at least not shared with me. (Both Code 01 and Code 02 below seem to meet their common verbal definitions, yet they give different results, in different ways.)

The strange thing is that even though post-hoc power has been criticized so much, the answer to "what is post-hoc power?" does not seem to be clear. Isn't it strange to accept opinions like "it is meaningless because it is uniquely determined once the other variables are set" or "it is circular reasoning" about an object whose calculation method is never shown? This looks like a fruitless shadow battle fought under unclear premises.

Give the calculation procedure before criticizing it! (This probably applies to all the statistics-ethics editorials that have been mass-produced recently.)

The verbal explanations are already written in the mass-produced editorials; they are not what I want. Please show me formulas or code instead of words. Please chunk the words down into formulas.

I require explanations in formulas and code instead of words.

I know that there is no "correct" post-hoc analysis, as the mass-produced editorials so often proclaim. The "correct post-hoc analysis" I refer to is synonymous with "the post-hoc analysis that many people criticize."

My Question

What is the post-hoc power in the following experiment?

Experiment:
We randomly divide 20 animals into two groups, Group A and Group B. Group A is then fed Food A, and Group B is fed Food B. After a certain period, body weight was measured, giving the following data.

Group_A :40.2, 40.4, 40.6, 40.8, 41.0, 41.2, 41.4, 41.6, 41.8
Group_B :30.1, 30.3, 30.5, 30.7, 30.9, 31.1, 31.3, 31.5, 31.7, 31.9, 32.1

I would like to conduct a two-sided test at a significance level of 0.05 to see whether there is a significant difference between the two groups.

I think it is one of the following two methods. Both codes are written in R; the source code can be downloaded from the following link.

The difference between Method 1 and Method 2 is whether the power calculation uses the predetermined significance level (α = 0.05, in the code of Method 1) or the calculated p-value (in Method 2).

Method 1
Code 01

# Load data
Group_A = c(40.2, 40.4, 40.6, 40.8, 41.0, 41.2, 41.4, 41.6, 41.8)
Group_B = c(30.1, 30.3, 30.5, 30.7, 30.9, 31.1, 31.3, 31.5, 31.7, 31.9, 32.1)

# Welch two-sample t-test
t.test(Group_A, Group_B)

library(effsize)
library(pwr)

# Effect size (Cohen's d) estimated from the observed data
cd = cohen.d(Group_A, Group_B)
cd

# Power at the predetermined significance level alpha = 0.05
pwr.t2n.test(n1 = 9, n2 = 11, d = cd$estimate, sig.level = 0.05, power = NULL,
             alternative = "two.sided")

Method 2
Code 02

# Load data
Group_A = c(40.2, 40.4, 40.6, 40.8, 41.0, 41.2, 41.4, 41.6, 41.8)
Group_B = c(30.1, 30.3, 30.5, 30.7, 30.9, 31.1, 31.3, 31.5, 31.7, 31.9, 32.1)

# Welch two-sample t-test
twel = t.test(Group_A, Group_B)
twel

# Observed p-value, reused below as the significance level
pwel = twel$p.value

library(effsize)
library(pwr)

# Effect size (Cohen's d) estimated from the observed data
cd = cohen.d(Group_A, Group_B)
cd

# Power with the observed p-value in place of alpha
pwr.t2n.test(n1 = 9, n2 = 11, d = cd$estimate, sig.level = pwel, power = NULL,
             alternative = "two.sided")

Which is the “correct” post-hoc power calculation code?

Notes:
If your "R" environment does not have packages named "effsize" and "pwr", you need to install them previously. If the following command is executed on R while connected to the Internet, installation should start automatically.

install.packages("effsize")
install.packages("pwr")

【Post-Hoc Notes】 (added after 2019/10/06 00:56 JST)

(1) Relationship between effect size and power (based on Method 1)
Fig. PHN01 shows the relationship between effect size and power when using Code 01 above, for sig.level = 0.05, 0.025, and 0.01, with n1 = 9 and n2 = 11.

[Fig. PHN01: Relationship between effect size and power (image not reproduced)]

These were calculated in R in the same manner as the following code.

Code PHN 01

library(pwr)
pv = 0.025   # significance level used for the power calculation
pwr.t2n.test(n1 = 9, n2 = 11, d = 4, sig.level = pv, power = NULL,
             alternative = "two.sided")
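
For completeness, here is a sketch of how the curves in Fig. PHN01 could be traced (my reconstruction; the original plotting code is not shown). It sweeps the effect size d over a grid and evaluates the power at each fixed sig.level:

library(pwr)

d_grid = seq(0.1, 4, by = 0.1)
pow_at = function(d, pv) pwr.t2n.test(n1 = 9, n2 = 11, d = d, sig.level = pv,
                                      alternative = "two.sided")$power

# One curve per sig.level, as in Fig. PHN01
plot(d_grid, sapply(d_grid, pow_at, pv = 0.05), type = "l",
     xlab = "effect size d", ylab = "power")
lines(d_grid, sapply(d_grid, pow_at, pv = 0.025), lty = 2)
lines(d_grid, sapply(d_grid, pow_at, pv = 0.01), lty = 3)
legend("bottomright", legend = c("0.05", "0.025", "0.01"), lty = 1:3,
       title = "sig.level")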

(2) Relationship between effect size and power (based on Method 2)
Fig. PHN02 shows the relationship between effect size and power when using Code 02, with n1 = 9 and n2 = 11.

[Fig. PHN02: Relationship between effect size and power (image not reproduced)]

Code PHN 02

library(effsize)
library(pwr)

offc = 1.6
offc = 0.1 + offc   # offset added to Group A; varied step by step to trace Fig. PHN02

Group_A = c(30.2, 30.4, 30.6, 30.8, 31.0, 31.2, 31.4, 31.6, 31.8) + offc
Group_B = c(30.1, 30.3, 30.5, 30.7, 30.9, 31.1, 31.3, 31.5, 31.7, 31.9, 32.1)
print(mean(Group_A) - mean(Group_B))

# Welch two-sample t-test and its p-value
twel = t.test(Group_A, Group_B)
pwel = twel$p.value
cd = cohen.d(Group_A, Group_B)

# Power with the observed p-value in place of alpha (Method 2)
pwr.t2n.test(n1 = 9, n2 = 11, d = cd$estimate, sig.level = pwel, power = NULL,
             alternative = "two.sided")
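
A sketch that automates this sweep (my reconstruction of how Fig. PHN02 could be traced; the original plotting code is not shown). It varies the offset, recomputes the Welch p-value and Cohen's d at each step, and uses that p-value as sig.level, exactly as Method 2 does:

library(effsize)
library(pwr)

Group_B = c(30.1, 30.3, 30.5, 30.7, 30.9, 31.1, 31.3, 31.5, 31.7, 31.9, 32.1)
base_A = c(30.2, 30.4, 30.6, 30.8, 31.0, 31.2, 31.4, 31.6, 31.8)
res = t(sapply(seq(0, 2, by = 0.1), function(offc) {
  Group_A = base_A + offc
  pwel = t.test(Group_A, Group_B)$p.value         # Welch p-value at this offset
  d = unname(cohen.d(Group_A, Group_B)$estimate)  # Cohen's d at this offset
  pow = pwr.t2n.test(n1 = 9, n2 = 11, d = d, sig.level = pwel,
                     alternative = "two.sided")$power
  c(d = d, power = pow)
}))
plot(res[, "d"], res[, "power"], type = "b",
     xlab = "effect size d", ylab = "power (sig.level = observed p)")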

(3) Comment on Welch's correction
There was a comment that "it is better to remove the Welch correction". Certainly, R does not ship a function that calculates the power itself under the Welch correction for the n1 ≠ n2 case.

Please forget the following code.

Code PHN 03

library(effsize)

offc = 1.6
offc = 0.1 + offc

Group_A = c(30.2, 30.4, 30.6, 30.8, 31.0, 31.2, 31.4, 31.6, 31.8) + offc
Group_B = c(30.1, 30.3, 30.5, 30.7, 30.9, 31.1, 31.3, 31.5, 31.7, 31.9, 32.1)
print(mean(Group_A) - mean(Group_B))

# Option 1: pooled-variance t-test ("True" in the original was a typo for TRUE)
twel = t.test(Group_A, Group_B, var.equal = TRUE)
pwel = twel$p.value

# Option 2: hedges.correction = FALSE; Option 3: var.equal = FALSE
cd = cohen.d(Group_A, Group_B, hedges.correction = FALSE, var.equal = FALSE)

sqrt((9 + 11) / (9 * 11))
cd$estimate / twel$statistic
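
If the power under the Welch correction with n1 ≠ n2 is really wanted, one option (my own sketch, not taken from any editorial) is to estimate it by simulation, under the assumption of normal populations whose means and SDs equal the observed ones:

set.seed(1)
Group_A = c(40.2, 40.4, 40.6, 40.8, 41.0, 41.2, 41.4, 41.6, 41.8)
Group_B = c(30.1, 30.3, 30.5, 30.7, 30.9, 31.1, 31.3, 31.5, 31.7, 31.9, 32.1)

welch_power = function(mA, sA, nA, mB, sB, nB, alpha = 0.05, nsim = 10000) {
  rejections = replicate(nsim, {
    # draw a new study and apply the Welch test (t.test's default)
    t.test(rnorm(nA, mA, sA), rnorm(nB, mB, sB))$p.value < alpha
  })
  mean(rejections)  # proportion of simulated studies that reject H0
}

welch_power(mean(Group_A), sd(Group_A), 9, mean(Group_B), sd(Group_B), 11)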

(4)The "correct" post-hoc power calculation method for when welch's correction is not required

This part has been split into the following thread:
"The calculation method of post-hoc power in a t-test without Welch's correction"

https://gpsych.bmj.com/content/32/4/e100069

Only the case where the Welch correction is not necessary is covered, but I found a paper in which the "correct" post-hoc power calculation method is written out in mathematical formulas. Here, "correct" means "the method criticized by the mass-produced editorials".

Post-hoc power seems to be calculated by the following formula.

Here, α is given in advance, so it can be considered essentially the same as the method of Code 01. However, it differs from my setting in that it is not the Welch test.

[Equation PHN04-01: image not reproduced]

Here,
[Equation PHN04-02: image not reproduced]
[Equation PHN04-03: image not reproduced]
And, use the following d for δ:
[Equation PHN04-04: image not reproduced]

However, I could not work out the distribution of the following statistic. (Maybe a noncentral t distribution, but then what is the value of the noncentrality parameter?)

[Equation PHN04-05: image not reproduced]

What is this ${Z}_{\alpha /2}$? The upper α point of which distribution is $Z_{\alpha}$? Is it the upper α/2 point of a t-distribution?

And

How can it be extended to Welch's case?
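
For reference, here is my reconstruction of the standard calculation that pwr.t2n.test implements (a sketch under the equal-variance assumption; the paper's own notation may differ, since its images are not reproduced above). With $\nu = n_1 + n_2 - 2$ degrees of freedom,

$$1-\beta = P\left(T > t_{1-\alpha/2,\,\nu}\right) + P\left(T < -t_{1-\alpha/2,\,\nu}\right),$$

where $T$ follows a noncentral $t$ distribution with $\nu$ degrees of freedom and noncentrality parameter

$$\mathrm{ncp} = d\,\sqrt{\frac{n_1 n_2}{n_1+n_2}}, \qquad d = \frac{\bar{x}_1-\bar{x}_2}{s_{\mathrm{pooled}}}.$$

In R this is pt(qt(1 - 0.05/2, nu), nu, ncp = ncp, lower.tail = FALSE) + pt(-qt(1 - 0.05/2, nu), nu, ncp = ncp). If the paper instead uses the usual large-sample normal approximation, $1-\beta \approx \Phi\left(\mathrm{ncp} - z_{\alpha/2}\right)$, then $z_{\alpha/2}$ would be the upper $\alpha/2$ point of the standard normal distribution rather than of a $t$ distribution.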

【P.S.】 I am not very good at English, so I apologize for any impolite or unclear expressions. I welcome corrections and English review. (You may edit my question and description to improve them.)

  • Is the only difference between the two scripts: sig.level = pwel and sig.level = 0.05? – Jeremy Miles Oct 04 '19 at 16:23
  • @Jeremy Miles Thank you for your comments. Yes. – Blue Various Oct 04 '19 at 16:30
  • If the other method is “right”, please let me know. – Blue Various Oct 04 '19 at 16:30
  • I can't install the packages easily, but I think the first should give you the post hoc power, which is a transformation of the p-value, and the second should give you 0.5. – Jeremy Miles Oct 04 '19 at 18:29
  • @Jeremy Miles I use the open-source edition of RStudio, and it works well in that environment. I will add this to the main body as a comment. If you are using the open-source edition of RStudio, you can also install packages via the GUI. See here for details: https://danmaclean.github.io/ggplotbook/ – Blue Various Oct 04 '19 at 18:40
  • Do you mean the second should give you 0.05? – Blue Various Oct 04 '19 at 18:46
  • Second should give you 0.5. If you set d = 0, you will get power of 0.05 (or whatever you set sig.level to be). – Jeremy Miles Oct 04 '19 at 19:34
  • [I know how to install a package, thanks. I can't easily install packages for complex reasons to do with security policies where I work, not because of the edition of R that I use.] – Jeremy Miles Oct 04 '19 at 19:36
  • You've added some code and some charts to the question since I answered. The code doesn't draw the charts though. Do you have an additional question? – Jeremy Miles Oct 06 '19 at 00:51
  • @Jeremy Miles I am preparing a post-hoc note to ask questions and confirm my understanding of your answers. Please wait a little. Due to my limited English, I cannot have a detailed discussion "on air"; preferably, code, formulas, and figures should be numbered. – Blue Various Oct 06 '19 at 08:42
  • @Jeremy Miles >I know how to install a package ... Roger that. I had forgotten that installation trouble is not a universal problem. – Blue Various Oct 06 '19 at 13:49

2 Answers


Let's examine the well-accepted statistical definitions of "power," "power analysis," and "post-hoc," using this site's tag information as a guide.

Power

is a property of a hypothesis testing method: the probability of rejecting the null hypothesis given that it is false, i.e. the probability of not making a type II error. The power of a test depends on sample size, effect size, and the significance (α) level of the test.

Let's ignore for now the post-hoc issue. From that definition you can see that either of your approaches to power could be considered "correct": Method 1 is based on a significance (α) level of 0.05, while Method 2 is based on the significance (α) level that you happened to find, about 0.17.

For what is useful, however, consider power analysis:

An inquiry into the quality of a statistical test by calculating the power - the probability of rejecting the null hypothesis given that it is false - under certain circumstances. Power analysis is often used when planning a study to determine the sample size required to achieve a nominal level of power (e.g. 80%) for a given effect size.

In the design phase of a study, where the importance of power analysis is unquestioned, you attempt to estimate the number of cases needed to detect a "statistically significant" effect. This typically means basing the calculations on a significance (α) level of 0.05. It would be hard to come up with any rationale for choosing instead a level of 0.17. So for power analysis in the a priori design phase of a study your Method 1 would be the only one to make sense.

Now consider post-hoc:

"Post-hoc" refers to analyses that are decided upon after the data has been collected, as opposed to "a priori".

We need to distinguish 2 types of post-hoc analysis related to power calculations. One is to treat the just-completed study as a pilot study to inform the design of a more detailed study. You use the observed difference between the groups and the observed variance of the difference as estimates of the true population values. Based on those estimates, you determine the sample size needed in a subsequent study to provide adequate power (say, 80%) to detect a statistically significant difference (say, p < 0.05). That's quite appropriate. That is "post-hoc" in the sense of being based on already obtained data, but it is used to inform the design of the next study.
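
For example, a minimal sketch of that design calculation in R (the pilot estimate d = 0.6 here is hypothetical, purely for illustration):

library(pwr)
# Solve for the per-group n that achieves 80% power at alpha = 0.05,
# treating the pilot estimate d = 0.6 (hypothetical) as the true effect size
pwr.t.test(d = 0.6, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")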

In most cases, however, that is not how the phrase "post-hoc power analysis" is used or the way you are using the phrase. You (and many others) seek to plug into a formula to determine some type of "power" of the study and analysis you have already done.

This type of "post-hoc power analysis" is fundamentally flawed, as noted for example by Hoenig and Heisey in The Abuse of Power. They describe two variants of such analysis. One is the "observed power," "that is, assuming the observed treatment effects and variability are equal to the true parameter values, the probability of rejecting the null hypothesis." (Note that this null hypothesis is typically tested at < 0.05, your Method 1, and is based on the sample size at hand. This seems to be what you have in mind.) Yet this "observed power" calculation adds nothing:

Observed power can never fulfill the goals of its advocates because the observed significance level of a test ("p value") also determines the observed power; for any test the observed power is a 1:1 function of the p value.

That's the point that Jeremy Miles makes with his example calculations based on your two Methods. In this type of post-hoc analysis, neither Method adds any useful information. That's why you find both of us effectively saying that there is no "correct" post-hoc power calculation code. Yes, you can plug numbers correctly into a formula, but to call the analysis "correct" from a statistical perspective would be an abuse of terminology.
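
To make that 1:1 relationship concrete, here is a sketch (illustrative only, using the pooled-variance t machinery that the pwr package assumes): at fixed n1 and n2, the p-value determines |t|, |t| determines the observed d, and d determines the "observed power", so the power is a function of the p-value alone.

library(pwr)
observed_power = function(p, n1 = 9, n2 = 11) {
  df = n1 + n2 - 2
  t_obs = qt(1 - p/2, df)                      # |t| recovered from the p-value
  d_obs = t_obs * sqrt((n1 + n2) / (n1 * n2))  # Cohen's d implied by that |t|
  pwr.t2n.test(n1 = n1, n2 = n2, d = d_obs, sig.level = 0.05,
               alternative = "two.sided")$power
}
sapply(c(0.01, 0.05, 0.17, 0.5), observed_power)  # power falls as p rises; ~0.5 at p = 0.05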

There is a second (ab)use of power calculations post-hoc, which does not seem to be what you have in mind but which should be addressed for completeness: "finding the hypothetical true difference that would have resulted in a particular power, say .9." Hoenig and Heisey show that this approach can lead to nonsensical conclusions, based on what they call:

the “power approach paradox” (PAP): higher observed power does not imply stronger evidence for a null hypothesis that is not rejected.

So the statistical advice (which is what one should expect from this site) is to refrain from post-hoc power tests in the sense that you wish to use them.

EdM
  • Thank you for your comment. I'm sorry, but as I said in the main body, verbal explanations are not welcome; those can be found by googling the editorials. Please show me formulas or code instead of words. Please chunk the words down into formulas. – Blue Various Oct 11 '19 at 03:30
  • I know that there is no "correct" post-hoc analysis, as the mass-produced editorials so often proclaim. The "correct post-hoc analysis" I refer to is synonymous with "the post-hoc analysis that many people criticize." I have added this to the main body with emphasis, since my wording was poor. I want to hear why there is a one-to-one correspondence between the p-value and power when no formula or code is given. It looks like a one-to-one correspondence, but it also seems to depend on the effect size and sample size ... – Blue Various Oct 11 '19 at 03:31
  • In post hoc power, you know the effect size and sample size. They don't change. Hence there is a 1:1 correspondence. – Jeremy Miles Oct 11 '19 at 15:12
  • @BlueVarious p-values map to effect sizes at fixed sample size and test type. If sample size doesn't matter in the original test (e.g., a Z-test), neither does it matter to the 1:1 relationship between the p-value and "post-hoc power"; Hoenig and Heisey show a graph for one-sided Z-tests. For tests where sample size matters (t-tests and F-tests), Russ Lenth has corresponding tables and formulas here. How to handle unequal variances or sample sizes is inherent in the tests themselves and has nothing extra to do with power calculations. – EdM Oct 18 '19 at 14:50

Here's the thing. Post hoc power tells you the probability that you would have detected a significant result, based on the result that you have. That is, if the estimate you just found is the population parameter, what is the probability that another study, exactly the same as the one you did, would obtain a statistically significant result?

If your p-value is 0.05, your post hoc power is 0.5.
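
A quick numerical check of that claim (a sketch using the pooled-variance t machinery that pwr assumes, rather than the Welch version):

library(pwr)
n1 = 9; n2 = 11
t_crit = qt(0.975, n1 + n2 - 2)                  # the |t| that gives exactly p = 0.05
d_implied = t_crit * sqrt((n1 + n2) / (n1 * n2)) # Cohen's d implied by that |t|
pwr.t2n.test(n1 = n1, n2 = n2, d = d_implied, sig.level = 0.05,
             alternative = "two.sided")$power    # approximately 0.5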

In your first analysis, you ask "What is the power to detect an effect, if I use an alpha that is equal to the p-value that I found, and the effect size that I found?" The answer is:

 power = 0.4985284

i.e., 0.50 within the limits of numerical precision.

The second analysis says "What's the probability I would get a significant effect, given the effect I found?" You had a very low p-value, so you have lots and lots of power. Hence the power is 1.00.

Let's try it again with different data:

#Load data
Group_A = c(40.2, 40.4, 40.6, 40.8, 41.0, 41.2, 41.4, 41.6, 41.8)
Group_B = c(40.2, 40.4, 40.6, 40.8, 41.0, 41.2, 41.4, 41.6, 41.8, 31.9, 32.1)

The t-test is not statistically significant:

 p-value = 0.1741

Hence, the first power estimate tells me that my power is less than 50%.

> pwr.t2n.test(n1 = 9, n2= 11, d = cd$estimate, sig.level = 0.05, power = NULL,
+              alternative = c("two.sided"))

     t test power calculation 

             n1 = 9
             n2 = 11
              d = 0.5923485
      sig.level = 0.05
          power = 0.2389704

The second analysis tells me that my power, if I use the same alpha as I found, is (approximately) 50%.

> pwr.t2n.test(n1 = 9, n2= 11, d = cd$estimate, sig.level = pwel, power = NULL, 
+              alternative = c("two.sided"))

     t test power calculation 

             n1 = 9
             n2 = 11
              d = 0.5923485
      sig.level = 0.1740843
          power = 0.4740473
    alternative = two.sided

You get a little closer if you don't use the Welch correction (use var.equal = TRUE in t.test).

Post hoc power is nonsense because it doesn't tell you anything you didn't already know.

The first analysis you did is a transformation of p: the lower the p, the higher the power. This is what is conventionally referred to as post hoc power. The second analysis you did gives a result of 50%, whatever your data look like.

Jeremy Miles
  • Thank you for your answer. I have some questions about it. First: what do you mean by "your first analysis," and how is the "power = 0.4985284" in the first box of your answer calculated? Is what you say the same as what is written in my Post-Hoc Note (2)? – Blue Various Oct 06 '19 at 14:18
  • Second: the value in the third box of your answer seems to be calculated with code that replaces the measured data in my Code 02 with the data written in the second box of your answer. Is that correct? When calculated on my PC, it was "sig.level = 0.1740843, power = 0.4740473". – Blue Various Oct 06 '19 at 14:33
  • Third: "Hence, the first power estimate tells me that my power is less than 50%." Is this a comment on Method 2? In Method 1, 0.5 does not seem to appear (see Fig. PHN01). And even if it is a comment on Method 2, shouldn't it be "more than 50%" rather than "less than 50%"? (See Fig. PHN02.) – Blue Various Oct 06 '19 at 14:42
  • Fourth: "Post hoc power is nonsense because it doesn't tell you anything you didn't already know." Before we discuss that, I want to clarify what post-hoc power is. What is the "right" post-hoc power calculation method? (While editorials about statistics are mass-produced, they seem too careless to clarify what they are criticizing.) – Blue Various Oct 06 '19 at 15:39
  • Fifth: what does "anything you didn't already know" mean? Certainly, once n1, n2, d, and α are determined, the power is calculated uniquely. However, many quantities are meaningful even though they are uniquely determined once the values of the variables are fixed. Do you want to say, "when the P value is calculated from observed data, the concept of type 2 error is meaningless"? It is also hard to accept as obvious a value that cannot be calculated without a complex distribution, the noncentral t-distribution. – Blue Various Oct 06 '19 at 15:47
  • As for the effect size correction, I'd like to sort out my thoughts a little more. – Blue Various Oct 06 '19 at 16:21
  • @BlueVarious please see this page about post-hoc power calculations: "You've got the data, did the analysis, and did not achieve 'significance.' So you compute power retrospectively to see if the test was powerful enough or not. This is an empty question. Of course it wasn't powerful enough – that's why the result isn't significant. Power calculations are useful for design, not analysis." Just because you can calculate a value doesn't mean that it is meaningful. Use instead what you found to help design an adequately powered study in the future. – EdM Oct 06 '19 at 18:11
  • @EdM Thank you for your comment. But I cannot find any mathematical equations or code defining "what post-hoc analysis is" at your link. Give definitions and a calculation procedure before criticizing it. I know there are many claims that "post-hoc power is meaningless." Editorials of this kind are mass-produced and published in many established journals. I can easily access the definitions, but they are not chunked down into formulas or code. – Blue Various Oct 07 '19 at 03:06
  • The strange thing is that even though post-hoc power has been criticized so much, "what is post-hoc power?" does not seem to be clear. Isn't it strange to accept opinions like "it is meaningless because it is unique once the other variables are set" or "it is circular" about an object whose calculation method is not shown? This looks like a fruitless shadow battle under unclear premises. – Blue Various Oct 07 '19 at 03:07
  • Hi @BlueVarious there's a lot going on in this question and comments and answers. If you asked one question at a time, that would be easier to answer. You could start with the graphs you added to the question. – Jeremy Miles Oct 07 '19 at 04:42
  • To answer your earlier question: post hoc power is the probability of obtaining a significant result, with the effect you have in your data and the sample size you have in your data. If you have a significant effect, your post hoc power will be over 50%. If you don't, it will be less than 50%. If p = 0.05, power is 50%. – Jeremy Miles Oct 07 '19 at 04:44
  • You asked earlier: 'Do you want to say, "When the P value is calculated from observed data, the concept of type 2 error is meaningless"?' No, I don't. – Jeremy Miles Oct 07 '19 at 04:45
  • @Jeremy Miles First, clarify with Yes or No: is post-hoc power calculated with calculation Code 1? (I know the verbal definitions of post-hoc power, but I cannot chunk them down to a calculation method. I think both my Code 1 and Code 2 match your definition.) – Blue Various Oct 07 '19 at 14:11
  • Yes. [I can't write a one-word comment, so these are more words.] – Jeremy Miles Oct 07 '19 at 15:31