
I have observations collected over a time period of 12 months. For each month, I recorded the total number of observations (n_per_month) and the number of "success" events (n_success), and then computed the success rate as n_success/n_per_month (success_rate). Here is an example of the data:

data_df = data.frame(month = 5:11, 
     n_per_month = c(17, 21, 15, 8, 14, 14, 1), 
     n_success = c(13, 13, 6, 6, 10, 14, 1), 
     success_rate = c(0.7647059, 0.6190476, 0.4, 0.75, 0.7142857, 
                      1, 1))

Then I fitted a logistic regression using the glm R function:

model_obs <- glm(success_rate ~ month, data = data_df, 
                 family = binomial(), weights=n_per_month)

And added the corresponding fitted values to the original data frame:

data_df$fitted = as.numeric(predict.glm(model_obs, 
         type = "response"))

Following a previously posted question, Simulation of logistic regression power analysis - designed experiments, I modified that R code to perform a post-hoc power analysis. That is, I wanted to know how reliable the success rate I computed from the observed data is, given the sample size of each month. For this example code, I repeat the analysis 10 times (repetitions) and record the results of the power analysis in the significant matrix. In this post-hoc power analysis I assume that the rates estimated from the observed data are true.

repetitions = 10

significant = matrix(nrow=repetitions, ncol=3)

for(i in 1:repetitions){
  final_rep = NULL
  for(j in 1:nrow(data_df)){
    # simulate 0/1 outcomes for month j, treating the fitted rate as the true probability
    resp_val = rbinom(n=data_df$n_per_month[j], size=1, prob=data_df$fitted[j])
    tmp_df = data.frame(month=rep(data_df$month[j], length(resp_val)), events=resp_val)
    final_rep = rbind(final_rep, tmp_df)
  }

  # rebuild counts and success rates for the simulated data
  n_success_df = aggregate(events ~ month, data = final_rep, sum)
  n_month_df = aggregate(events ~ month, data = final_rep, length)
  rep_df = data.frame(month=n_success_df$month,
                      n_per_month=n_month_df$events,
                      n_success=n_success_df$events)
  rep_df$success_rate = rep_df$n_success/rep_df$n_per_month

  model <- glm(success_rate ~ month, data = rep_df,
               family = binomial(), weights=n_per_month)

  significant[i, 1] = (summary(model)$coefficients[2, 4] < .05)   # Wald test on the month slope
  significant[i, 2] = sum(significant[i, 1])
  modelDev = model$null.deviance - model$deviance
  significant[i, 3] = (1 - pchisq(modelDev, 1)) < .05             # likelihood-ratio test

  rm(tmp_df, n_success_df, n_month_df, rep_df, model, modelDev)
}

sum(significant[, 2]==1)/repetitions

Would you agree with this procedure?

2 Answers


Would you agree with this procedure?

No, unless your intent is to inform design of a future study. And in that case, you should be doing your simulation in a different way.

"Post-hoc power analysis," in the way the term is typically used, is (a) inappropriate, as @AdamO explains in another answer, and (b) not the right way to evaluate "how reliable is the success rate I computed from the observed data, given the sample size of each month."

For the reliability of probability estimates month by month, outside of the modeling, this page shows the formula for the variance of the probability estimate from binomial sampling. Associated types of confidence intervals are described here. Those would be appropriate estimates of "how reliable is the success rate" month by month in a saturated model. For your last 2 months all observations are successes. Note that with only 1 observation in the last month, there is at least a 5% chance of a success when the true value of $p$ is as small as 0.05, so the confidence interval is very wide. For 14 successes out of 14, the 95% confidence limit for $p$ could be as low as about 0.8. See this Wikipedia page.
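
As a concrete illustration of those month-by-month intervals (my own sketch, not part of the original answer), base R's binom.test() gives the Clopper-Pearson "exact" interval and prop.test(..., correct = FALSE) gives the Wilson score interval, applied here to the data_df from the question:

# Month-by-month 95% intervals for the success probability, straight from the
# counts (no regression model involved). prop.test() may warn about the
# chi-squared approximation for the smallest months; the intervals are still returned.
ci_list <- lapply(seq_len(nrow(data_df)), function(j) {
  x <- data_df$n_success[j]
  n <- data_df$n_per_month[j]
  data.frame(month     = data_df$month[j],
             rate      = x / n,
             exact_lo  = binom.test(x, n)$conf.int[1],
             exact_hi  = binom.test(x, n)$conf.int[2],
             wilson_lo = prop.test(x, n, correct = FALSE)$conf.int[1],
             wilson_hi = prop.test(x, n, correct = FALSE)$conf.int[2])
})
do.call(rbind, ci_list)  # note how wide the intervals are for the months with small n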

In terms of estimating the reliability of estimates from your model,* "given the sample size of each month," simulations like those you propose (with the number of observations per month fixed at those in your data sample) might make some sense. What you would want is an estimate of the variances of the coefficient estimates among simulations, or variances of month-by-month predictions from the models. That's different from your "post-hoc power calculation," which is based solely on whether coefficients from a model of a simulated data set happened to pass the arbitrary p < 0.05 criterion.
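
A minimal sketch of what that alternative simulation might look like, reusing data_df and its fitted column from the question; the 1000 repetitions and the month-level rbinom() call are my own choices for illustration, and what gets recorded is the slope estimate and the month-by-month predictions rather than a significance indicator:

set.seed(1)
repetitions <- 1000

slopes <- numeric(repetitions)
preds  <- matrix(nrow = repetitions, ncol = nrow(data_df))

for (i in seq_len(repetitions)) {
  # simulate successes with n per month fixed at the observed values
  sim_success <- rbinom(n = nrow(data_df),
                        size = data_df$n_per_month,
                        prob = data_df$fitted)
  sim_df <- data.frame(month = data_df$month,
                       n_per_month = data_df$n_per_month,
                       success_rate = sim_success / data_df$n_per_month)
  fit <- glm(success_rate ~ month, data = sim_df,
             family = binomial(), weights = n_per_month)
  # with such small n, occasional separation (all 0s or all 1s) can produce extreme slopes
  slopes[i]  <- coef(fit)[["month"]]
  preds[i, ] <- predict(fit, type = "response")
}

sd(slopes)                      # spread of the slope estimate across simulations
apply(preds, 2, quantile,       # month-by-month 2.5% / 97.5% bands for the
      probs = c(0.025, 0.975))  # model-predicted success rates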

It's OK to use these data to inform design of a future study, but that's really an a priori power analysis for that future study. If you were to do so, you probably should also include an estimate of the sources of variability of the total number of observations per month.
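
For example (a hypothetical extension, not something proposed in the thread), the monthly sample sizes could themselves be drawn at random inside the simulation, here from a Poisson distribution with the mean of the observed counts as an assumed rate:

# Hypothetical: let the monthly sample size vary too. The Poisson assumption
# and its mean are purely illustrative, not estimated carefully from the data.
mean_n <- mean(data_df$n_per_month)
sim_n  <- pmax(rpois(nrow(data_df), lambda = mean_n), 1)   # at least 1 observation per month
sim_success <- rbinom(nrow(data_df), size = sim_n, prob = data_df$fitted)
sim_df <- data.frame(month        = data_df$month,
                     n_per_month  = sim_n,
                     success_rate = sim_success / sim_n)
# sim_df can then be fed to the same glm() fit used inside the simulation loop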


*A warning on your model: you are modeling as if the log-odds of success is linearly related to the numeric month. Make sure that's really what you want. I find it hard to think of a scenario where that would be realistic.

EdM
  • For the first reason, in my script I actually included 'weights', which takes as its argument exactly the number of observations for each month. So that point does not apply, if I understood you correctly. Concerning the additional warning, I assumed that my data follow a logistic regression, so yes, this is what I want (again, if I understood what you wrote). For the second reason, yes, I could estimate the variance, but I am also interested in knowing which sample size would give me accurate results. What do you think? – CafféSospeso Feb 23 '23 at 19:31
  • +1, although I think part (a) of the second point is the meat of the answer; I may add another "answer" to address that point on its own. – AdamO Feb 23 '23 at 19:34
  • @CafféSospeso sorry, I missed the "weights" argument that was hidden to the right of my screen. I'll edit later when I get a chance. – EdM Feb 23 '23 at 19:40
  • Ok, Thanks. No problem. I just wanted to be sure to have got your point. And yes, I would be curious to know whether with this post-hoc analysis (or by simulations) I could achieve the general objective of my question. – CafféSospeso Feb 23 '23 at 19:44
  • Thanks a lot @EdM, this clearly answers my initial question. And to take full account of your *warning (no specific assumption on the relationship between odds and month), I am actually going to estimate confidence intervals of the month-by-month predictions (as you said), using two approaches: i) the simulations I proposed; and ii) the Wilson score interval (as it is robust to low sample sizes). Does it make sense? – CafféSospeso Feb 24 '23 at 16:41
  • @CafféSospeso if you go month by month without a model for changes over the months, the CIs from simulations and the Wilson intervals should be close. You could consider a model intermediate between a month-by-month evaluation and your strictly linear modeling of months, for example a regression spline for months (a short sketch of that follows these comments). If you want to evaluate the quality of a model of changes over months, then simulations would be a good way to go. – EdM Feb 24 '23 at 17:38
  • Really nice @EdM! Thanks for your tips! I learned and clarified a lot. – CafféSospeso Feb 24 '23 at 17:40
  • It appears that perhaps doing simulations will not be a good idea, given the issues arising for sample proportions equal to zero or one, since they will underestimate confidence intervals. This is discussed in https://stats.stackexchange.com/questions/401883/how-to-simulate-sample-proportions-near-or-at-zero-in-a-monte-carlo-simulation – CafféSospeso Feb 26 '23 at 20:54
  • @CafféSospeso if you are using this to design a future study, then you might design the future study in a way that avoids observed proportions of 0 or 1. If the proportions are truly 0 or 1 that's one thing. On the other hand, if it's a problem of sampling too few cases when actual proportions of interest are close to 0 or 1, then you could use simulations to estimate the necessary sample sizes if simple binomial theory isn't adequate. – EdM Feb 28 '23 at 14:59
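
Picking up EdM's regression-spline suggestion from the comment above, a minimal sketch of such an intermediate model, using splines::ns() with an arbitrarily chosen 2 degrees of freedom:

library(splines)

# Intermediate between a month-by-month (saturated) description and the strictly
# linear log-odds model: a natural spline in month. With only 7 months and small
# n, little flexibility is affordable, so df = 2 is only for illustration.
model_spline <- glm(success_rate ~ ns(month, df = 2), data = data_df,
                    family = binomial(), weights = n_per_month)
cbind(month = data_df$month, fitted = predict(model_spline, type = "response"))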

Building on part (a) of the second point in @EdM's answer.

As the recently defamed RA Fisher said nearly 100 years ago to the First Indian Statistical Conference,

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

He continues:

To utilise this kind of experience he [the statistician, who may in fact be a she or other still] must be induced to use his imagination, and to foresee in advance the difficulties and uncertainties with which, if they are not foreseen, his investigations are beset.

So the point of this is that any analysis requires quantitative forethought. Typically, this takes the form of a "power analysis". That is, bearing no data at all upon which to base your opinion, what is the expected effect, rate of recruitment, missingness, distribution of blocking or strata variables, etc. etc. (ad infinitum or ad nauseam)... Often, the final result amounts to drawing a line in the sand and saying, "We expect the OR to be 1.5, and with 1:1 randomization, we'll have 90% power with N=XX". I have used lines such as these verbatim in protocols or grant applications.
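
For readers who want to see what such a calculation looks like in practice, here is a sketch with base R's power.prop.test(); the control event rate of 0.30 is a purely hypothetical assumption, not something taken from this thread:

# Hypothetical design assumptions: control event rate 0.30, target odds ratio 1.5,
# 1:1 randomization, two-sided alpha = 0.05, 90% power.
p1 <- 0.30
or <- 1.5
p2 <- (or * p1 / (1 - p1)) / (1 + or * p1 / (1 - p1))   # event rate in the second arm

power.prop.test(p1 = p1, p2 = p2, sig.level = 0.05, power = 0.90)
# the returned n is the required sample size per arm under these assumptions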

And yet, in the preponderance of the literature, there is rarely a mention of the a priori power calculation. This is disappointing for several reasons. Because of this, you often find the readership at a loss as to how to interpret a non-significant result - or a significant one, although many blindly "trust" significant results and ask fewer questions of them.

In either case, I strongly argue that the statistician must revisit their assumptions upon reporting the final analysis. "We thought the death rate would be X but it was Y, we thought the distribution of age would be this but it was that," etc. etc. This is in the spirit of working in the service of science. Some assumptions are sensitive and require further analysis to make sure the result isn't spurious due to completely unforeseen phenomena, like effect modification, informative missingness, etc.

Despite all the insight above, a post-hoc power calculation does not fit into this line of inquiry. It doesn't shed any light at all on these issues because you replicate many issues in your analysis that may be flukes without identifying precisely what they were. I find reviewers or readers may ask for "post-hoc power calculations" because they are subconsciously aware of these "post mortem" issues, but they don't know how to ask meaningful questions of a statistician.

Many are the papers supporting this idea.

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C48&q=do+not+perform+post-hoc+power+calculations&btnG=

AdamO
  • Thanks @AdamO for this thoughtful answer. I fully agree with your point. That being said, it is important to consider that it is not always possible to follow such a 'correct' approach. For instance, in my case, I am using data collected over more than 20 years of wild animal observations. Meaning that although one can do an a priori power analysis, this was not done over these years, plus it is very difficult to control sample size when collecting observations in the wild. I am now more interested in how to assess my original question a posteriori, that is, once the data have already been collected. – CafféSospeso Feb 23 '23 at 20:14
  • @CafféSospeso that actually doesn't have any bearing on this issue; the $p$-value does. In fact, if the question boils down to "what value is it to continue collecting another 5, 10, or 20 years of additional data?", you can use well-established sequential methods for this purpose. – AdamO Feb 23 '23 at 20:22
  • Yes, I can use "established sequential methods" for quantifying the required sample size for the next 5, 10 or 20 years. Still, I won't trash the data already collected. In fact, I am interested in exploring and understanding the limits and the information contained in there anyway. What I mean is that your criticism of post-hoc power analysis is completely fair and should be considered; however, I cannot go back in time and ask all collaborators to perform a power analysis. I am rather looking for an approach to address my initial question. Perhaps computing the Wilson score, as suggested by @EdM. – CafféSospeso Feb 23 '23 at 20:45
  • @CafféSospeso I think for what you are intending to do, it could or should just be called a "power and sample size calculation", since you are not speaking in particular about the study at hand. You can borrow as many or as few elements as you like from the original analysis, but the point is that you are speaking of a different study, which is a worthwhile consideration for other purposes. As I was trying to say, in a case like yours, you can point out efficiency gains from sampling more sites, or more timepoints, as is possible... – AdamO Feb 23 '23 at 20:51
  • Fisher was "recently defamed"? I would be interested to learn more. PS. Checked his Wikipedia page and the section about his beliefs about race and smoking reads as measured as ever. – dipetkov Mar 11 '23 at 18:27