
My problem: I have a fixed, imbalanced case-control cohort in which I am using two low-occurrence predictors (rare variants) to look for an association between the predictors and disease. I will be doing this on a gene-by-gene basis, but I want to figure out the minimum predictor counts I should see for a well-powered test, so that I can reduce the total number of tests, as a large portion of the genes will have very low predictor counts. I am using R for this analysis.

  1. The data is:
  • Sample ID: ~2963 cases and ~5015 controls
  • Binary outcome: disease, no disease
  • Per-sample number of rare mutations of type 1 from whole-exome sequencing data: in principle any count from 0 upward; in practice a binomial on 0-2 (one heterozygous or one homozygous variant), heavily weighted toward 0, with frequency <= 1%
  • Per sample number of rare mutations of type 2 from whole exome sequencing data: with the same characteristics and distribution as mutations of type 1
  • Covariates: 4 principal components that are related to ancestry and sex
  2. My predictors are the type 1 and type 2 mutations and my response is disease (the hypothesis is that type 1 or type 2 mutations cause disease under a dominant model, whereby a single mutation (heterozygote) leads to disease).
  3. I am using a generalized linear model to see if the effect of type 1 mutations + type 2 mutations > 0 (H0: beta_1 + beta_2 = 0; H1: beta_1 + beta_2 > 0) at a significance level of, let's say, 0.01. HOWEVER, we can start with a test statistic based on a single parameter (H0: beta_1 = 0; H1: beta_1 > 0).

Here is an example of the data I expect to see:

df <- data.frame(id = 1:8001, disease = 0, type1 = 0, type2 = 0, PC1 = seq(-0.009,0.035, by= 0.0000055), sex =0)
df$disease[sample(nrow(df), 3000)] <- 1
df$type1[sample(nrow(df), 30)] <- 1
df$type2[sample(nrow(df), 60)] <- 1
df$sex[sample(nrow(df), 3511)] <- 1

I set up the GLM like this:

test.model.null <- glm(disease ~ PC1 + sex, family = binomial(link = "logit"), data = df)
test.model.full <- glm(disease ~ type1 + type2 + PC1 + sex, family = binomial(link = "logit"), data = df)
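
For the combined hypothesis (H0: beta_1 + beta_2 = 0; H1: beta_1 + beta_2 > 0), one way I could test it from these fitted models is a one-sided Wald test on the sum of the two coefficients; here is a sketch using only base R:

# One-sided Wald test of H0: beta_1 + beta_2 = 0 vs H1: beta_1 + beta_2 > 0 (a sketch)
b <- coef(test.model.full)[c("type1", "type2")]   # estimated coefficients
V <- vcov(test.model.full)[c("type1", "type2"),
                           c("type1", "type2")]   # their covariance block
z <- sum(b) / sqrt(sum(V))                        # (b1 + b2) / sqrt(var1 + var2 + 2*cov)
p.one.sided <- pnorm(z, lower.tail = FALSE)

# A (two-sided, 2-df) likelihood-ratio test of both mutation terms together:
anova(test.model.null, test.model.full, test = "LRT")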

Question:

  1. How can I set up simulations to test how effect size affects power in these models, translating this into the minimum number of type1 and/or type2 mutations needed for a well-powered test?

I know I can randomize my covariates and then refit the model over and over, but I do not know how to evaluate false positives (for Type I error I would run simulations under the null model, but how do I know if I have a false positive?) and false negatives (simulations under my full model?).
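
My current understanding of the two error rates: under the null, the predictor is generated with no relation to disease, and the fraction of runs with p <= alpha estimates the Type I error; under the alternative, the predictor is generated at different rates in cases and controls, and that fraction estimates power. A minimal sketch of the null-model side (single predictor, a made-up 0.1% per-allele rate, no covariates):

alpha <- 0.01
sims <- 200
null.p <- replicate(sims, {
  # predictor generated at the SAME rate in cases and controls, so any rejection is a false positive
  d <- data.frame(disease = rep(c(0, 1), c(5015, 2963)),
                  type1 = rbinom(7978, size = 2, prob = 0.001))
  summary(glm(disease ~ type1, family = binomial, data = d))$coefficients["type1", 4]
})
mean(null.p <= alpha) # should be close to alpha if the test is well calibrated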

My simulation development, based on a simplified model with only a single parameter:

Here is my attempt at simulations with an outcome (0 for controls, Y0; 1 for cases, Y1) and just a single predictor ($predictor), where power is tested at a specific predictor occurrence rate (the prob argument of the rbinom() call), based on a link I found in another thread on Cross Validated:

alpha <- 0.05 # Standard significance level
sims <- 500   # Number of simulations to conduct

significant.experiments <- rep(NA, sims) # Empty object to count significant experiments

# Loop to conduct experiments "sims" times
for (i in 1:sims){

  # 5015 controls get the predictor (0, 1, or 2) drawn from a binomial distribution at a particular rate
  Y0 <- data.frame(outcome = rep(0, 5015), predictor = rbinom(5015, size = 2, prob = 2/5015))

  # 2963 cases get the predictor (0, 1, or 2) drawn from a binomial distribution at a particular rate
  tau <- 1 # Outcome label for cases (disease)
  Y1 <- data.frame(outcome = rep(tau, 2963), predictor = rbinom(2963, size = 2, prob = 11/2963))
  Y.sim <- rbind(Y0, Y1)

  # Do the analysis (simple logistic regression)
  fit.sim <- glm(outcome ~ predictor, family = binomial(link = "logit"), data = Y.sim)
  p.value <- summary(fit.sim)$coefficients[2, 4]   # Extract the p-value for the predictor
  significant.experiments[i] <- (p.value <= alpha) # TRUE if p <= alpha
}

power <- mean(significant.experiments) # average rejection rate = power

power

Next, I adjusted my code to test power across a fixed probability of the predictor in controls (prob.sim.Y0), compared to a set of probabilities of that same predictor in cases (prob.sim.Y1):

# Set the probability of having an allele for controls and cases based on a hypothetical # of alleles
prob.sim.Y0 <- 1/5015 # we will fix the probability of an allele in controls
prob.sim.Y1 <- seq(from = 2/2963, to = 7/2963, by = 1/2963) # and we will vary the probability in cases

# Set up power object and other parameters
powers <- rep(NA, length(prob.sim.Y1)) # Empty object to collect simulation estimates
alpha <- 0.05 # Standard significance level
sims <- 500   # Number of simulations to conduct at each frequency

for (j in 1:length(prob.sim.Y1)){ # loop over the predictor frequencies

  significant.experiments <- rep(NA, sims) # Empty object to count significant experiments

  # Loop to conduct experiments "sims" times
  for (i in 1:sims){

    # Ensure there is at least 1 predictor carrier, because if there are 0 carriers in
    # both cases and controls the glm does not return a beta coefficient
    Y.sim <- data.frame(outcome = 0, predictor = 0)
    while(sum(Y.sim$predictor) == 0) {
      # 5015 controls get the predictor (0, 1, or 2) drawn from a binomial distribution at the fixed rate
      Y0 <- data.frame(outcome = rep(0, 5015), predictor = rbinom(5015, size = 2, prob = prob.sim.Y0))

      tau <- 1 # Outcome label for cases (disease)
      # 2963 cases get the predictor (0, 1, or 2) drawn at the rate being tested
      Y1 <- data.frame(outcome = rep(tau, 2963), predictor = rbinom(2963, size = 2, prob = prob.sim.Y1[j]))
      Y.sim <- rbind(Y0, Y1) # the joint data set
    }

    # Do the analysis (simple logistic regression)
    fit.sim <- glm(outcome ~ predictor, family = binomial(link = "logit"), data = Y.sim)
    p.value <- summary(fit.sim)$coefficients[2, 4]   # Extract the p-value for the predictor
    significant.experiments[i] <- (p.value <= alpha) # TRUE or FALSE according to whether p <= alpha
  }

  powers[j] <- mean(significant.experiments) # power at this frequency = mean of a TRUE/FALSE vector
}
powers

plot(prob.sim.Y1, powers, ylim=c(0,1))

  • If you already have the data, there is no need to evaluate power. After you build your model, you either will have had sufficient power to find a "statistically significant" result or you won't. Power analysis is for design; you might, for example, use the current study as a basis for power analysis to design a future study. See this page among many others on the topic of "post hoc power." – EdM Apr 19 '23 at 19:13
  • @EdM I will be performing this analysis multiple times; each time corresponds to a different transcript with a different # of type 1 and type 2 mutations. I want to find out the minimum number of type 1 and type 2 mutations I should observe in a given transcript to have a well-powered test. In this way, I can 1. see if any of my tests would be well-powered, or whether this analysis is a waste of time; 2. prune the number of tests I would be conducting: if I have 8000 transcripts but only 200 are worth testing, the multiple-testing correction is much smaller. – Sky Scraper Apr 19 '23 at 20:21
  • @EdM This is a valid use of post-hoc power analysis in my field. For example, see this publication with over 5,000 citations: https://www.nature.com/articles/s41586-020-2308-7 "For this dataset, we computed a median of 17.9 expected pLoF variants per gene (Fig. 2c) and found that 72.1% of genes have more than 10 pLoF variants (powered to be classified into the most constrained genes)" – Sky Scraper Apr 19 '23 at 20:24
  • The covariates will be the same across all tested transcripts, only the number of type 1 and type 2 mutations changes. – Sky Scraper Apr 19 '23 at 20:29
  • So I want to be able to see what effect sizes I would need, and correspond that back into # of type 1 mutations and # of type 2 mutations. @EdM – Sky Scraper Apr 19 '23 at 20:37
  • In cases and controls, that is. For example, I tried playing around with an odds ratio calculator (https://www.campbellcollaboration.org/escalc/html/EffectSizeCalculator-OR2.php), inputting different "proportion with event" values to reach an odds ratio or risk ratio of 9.5, which I determined to be the minimum odds ratio I need to obtain 0.7 power using an online power calculator (https://csg.sph.umich.edu/abecasis/gas_power_calculator/), and I determined that if I have 1 mutation in controls for my given sample size, I would need 6 in cases. If I had 2 mutations in controls, – Sky Scraper Apr 19 '23 at 20:45
  • I would need 11 mutations in patients, etc... But I do not think the power calculator I used is applicable to my data and model (I simplified my model and made some assumptions to work with it) so I want to be able to make the power calculations on my own.

    @EdM

    – Sky Scraper Apr 19 '23 at 20:46
  • The fact that you want to do this to minimize the number of transcripts to test is important and does make prior power analysis worthwhile. To help readers get that point, please edit the question to include that reason for being interested in power analysis early on. I'll work on an answer to this question, but in the meantime look at this answer, which might point the way to a simple solution for pre-screening the transcripts to examine without looking at the outcome values. – EdM Apr 21 '23 at 15:26
  • @EdM I have added clarifications and added an attempt at solving my problem. Thank you and sorry for the confusion. I probably also over explained/added too much information. – Sky Scraper Apr 21 '23 at 17:14

1 Answer


If your goal is to minimize the number of genes to test so that you aren't hampered by too large a correction for multiple comparisons, then you have to be careful not to use knowledge about the association between mutations and case/control status. Otherwise, you've already started using the outcomes to choose the model, invalidating the assumptions underlying standard hypothesis tests.

That, however, is what you seem to intend when you say in comments: "if I have 1 mutation in controls for my given sample size, I would need 6 in cases. If I had 2 mutations in controls, I would need 11 mutations in patients, etc.." That would be identifying genes to test based on the mutation distribution between cases and controls. Your simulations also seem to be intended to support that type of decision making. I don't think it would be helpful to figure out the problems with your code in that case (and coding-specific questions are off-topic on this site, anyway).

You certainly can, however, use information about the predictors and their associations with each other in the entire data set to restrict tests to genes that can provide adequate power to find an association with outcome of at least a desired level. In your situation you have a fixed number of samples broken down into cases and controls, so it makes sense to consider the minimum magnitude of a log-odds ratio $\pm \beta_j^{\alpha}$ for predictor $j$ that can be detected at a given level of significance $\alpha$ and power $\gamma$. This answer provides an approximate formula:

$$ \pm \beta_j^{\alpha} = \frac{z_{1-\alpha/2}+z_\gamma}{\sigma_{x_j}\sqrt{np(1-p)(1-\rho_j^2)}}, $$

where the z's are the corresponding quantiles of the standard normal distribution, $\sigma_{x_j}$ is the standard deviation of predictor $j$ values, $n$ is the total sample size, $p$ is the proportion of cases in the sample, and $\rho_j$ is the multiple correlation of predictor $j$ with the other predictors. With a binary yes/no predictor for mutation status, $\sigma_{x_j} = \sqrt{f_j(1-f_j)}$, where $f_j$ is the fraction of total samples with the mutation. The following R function implements that:

beta_minMag <- function(alpha, power, fracMarker, N, fracCase, markerCorr) {
    # minimum detectable log-odds magnitude for a binary (yes/no) predictor
    (qnorm(1 - alpha/2) + qnorm(power)) /
        (sqrt(fracMarker * (1 - fracMarker)) *
         sqrt(N * fracCase * (1 - fracCase) * (1 - markerCorr^2)))
}

with fracMarker the fraction of samples with the mutation, fracCase the fraction of samples that are cases, and markerCorr the multiple correlation of mutation status with all the other predictors in the model.
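
For example, a sketch of how those two quantities might be obtained for the type1 predictor, using only the predictor columns of the toy df from the question (never the outcome):

x1 <- as.numeric(df$type1 > 0)  # carrier status for type 1 mutations in this gene
fracMarker <- mean(x1)          # fraction of all samples carrying the mutation
# multiple correlation with the OTHER predictors in the model (type2, PC1, sex):
markerCorr <- sqrt(summary(lm(x1 ~ type2 + PC1 + sex, data = df))$r.squared)

beta_minMag(alpha = 0.01, power = 0.9, fracMarker = fracMarker,
            N = nrow(df), fracCase = mean(df$disease), # case fraction is fixed by the study design
            markerCorr = markerCorr)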

Once you've chosen the significance level and power, with a fixed sample size and known case fraction, all that's left to specify is fracMarker and markerCorr, which you can get gene by gene (again, without looking at outcomes). A few examples follow, based on 3000 cases, 5000 controls, and 90% desired power (alpha of 0.01 in the first example, 0.05 in the rest).

For the minimum detectable log-odds at 1% mutation prevalence and no correlation with other predictors:

beta_minMag(alpha=0.01, power=0.9, fracMarker=0.01,
    N=8000, fracCase=3000/8000, markerCorr=0)
# [1] 0.8953118

As above but at alpha of 0.05, with 0.8 multiple correlation with other predictors:

beta_minMag(alpha=0.05, power=0.9, fracMarker=0.01,
    N=8000, fracCase=3000/8000, markerCorr=0.8)
# [1] 1.253945

Same high correlation, but 10% mutation prevalence in the entire sample:

beta_minMag(alpha=0.05, power=0.9, fracMarker=0.1,
    N=8000, fracCase=3000/8000, markerCorr=0.8)
# [1] 0.4158866

No correlation, still 10% mutation prevalence:

beta_minMag(alpha=0.05, power=0.9, fracMarker=0.1,
    N=8000, fracCase=3000/8000, markerCorr=0)
# [1] 0.249532

No correlation, but 0.1% mutation prevalence:

beta_minMag(alpha=0.05, power=0.9, fracMarker=0.001,
    N=8000, fracCase=3000/8000, markerCorr=0)
# [1] 2.368453

It's certainly a good idea to consider simulation to check this approximation. But recognize that you can't decide what genes to exclude from testing based on outcome-associated considerations like "if I have 1 mutation in controls for my given sample size, I would need 6 in cases. If I had 2 mutations in controls, I would need 11 mutations in patients, etc.."
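
For instance, a rough simulation check might look like this (a sketch under simple assumptions: one binary carrier indicator at 1% prevalence, no covariates, and disease generated from a logistic model whose slope equals the beta_minMag value):

set.seed(1)
alpha <- 0.01
beta <- beta_minMag(alpha = alpha, power = 0.9, fracMarker = 0.01,
                    N = 8000, fracCase = 3000/8000, markerCorr = 0)
b0 <- qlogis(3000/8000)                        # intercept giving roughly the observed case fraction
hits <- replicate(500, {
  x <- rbinom(8000, 1, 0.01)                   # carrier indicator at 1% prevalence
  y <- rbinom(8000, 1, plogis(b0 + beta * x))  # disease simulated at the "minimum detectable" log-odds
  summary(glm(y ~ x, family = binomial))$coefficients["x", 4] <= alpha
})
mean(hits)                                     # should come out somewhere near the nominal 90% power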

In response to comments

For this approach to work, you must use only the predictor values to decide on genes to omit from hypothesis testing. If you use the above approximate formula, all you need is the prevalence of mutations in the gene and the multiple correlation of the gene's mutation status with the other predictors in the model. There's no need to evaluate the "statistical significance" of those relationships among the predictors to get the estimate of the "minimum detectable" log-odds value for that gene's association with the disease-status outcome; this is just doing the best you can with the data that you have.

What you don't want to do is to look at the associations of the predictors with the outcome when you choose which genes to include or omit in your hypothesis testing. The same principles apply if you simulate data instead of using the formula: choices of genes to omit must be based only on the predictor values, not on the associations with outcome.

In terms of how to use things like these "minimum detectable" log-odds estimates, you make a tradeoff between running too many tests and potentially missing some very high associations with outcome. For example, you can choose a combination of significance level and power and simply not evaluate genes that can only then be found "significant" if they have a very large log-odds value.

You might then decide to exclude genes from further analysis that have a "minimum detectable" log-odds value of some particular value, say 1, or greater. The risk is that you might thus exclude a low-prevalence gene that has a very high association with outcome.
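
A sketch of what that screening might look like in code, where geneCarrier is a hypothetical samples-by-genes 0/1 carrier matrix, covars is a data frame of the covariates, and the cutoff of 1 is only an example:

minDetect <- apply(geneCarrier, 2, function(x) {
  f <- mean(x)                                              # mutation prevalence in the whole sample
  rho <- sqrt(summary(lm(x ~ ., data = covars))$r.squared)  # multiple correlation with the covariates
  beta_minMag(alpha = 0.01, power = 0.9, fracMarker = f,
              N = length(x), fracCase = 2963/7978, markerCorr = rho)
})
genesToTest <- colnames(geneCarrier)[minDetect <= 1]        # drop genes that could only show huge effects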

If you suspect that there are such genes, you could instead test all genes (ignoring these power issues) and use false-discovery rates for multiple-comparison correction. That might find low-prevalence high-association mutations, but at the risk of missing some higher-prevalence mutations with useful outcome associations.
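
In that alternative the correction itself is simple; for instance, with pvals a hypothetical named vector of per-gene p-values from testing every gene:

qvals <- p.adjust(pvals, method = "BH") # Benjamini-Hochberg false-discovery-rate adjustment
names(qvals)[qvals <= 0.05]             # genes called at an FDR of 5%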

Making those tradeoffs requires applying your knowledge of the subject matter.

This situation is very closely related to what's frequently done in microarray or RNA-seq gene-expression studies, when genes with low variance in expression among samples are removed from the study at the start. In your situation, genes having low mutation prevalence have low variance across samples, and genes whose mutations have high correlations with other predictors can only be detected as outcome-related if they have very high associations with disease status.

EdM
  • I fixed the code bugs, but I totally get your reasoning as to why my particular simulation approach should probably be avoided.

    I wish I had realized this earlier, but I think a good way to model my mutation predictor is a binomial with 2 draws, to represent that a sample is either a heterozygote (1) or a homozygote (2). How can I update the standard-normal formula to reflect this? If it is super easy to figure out, let me know and I will dig into it. Thanks, and sorry. I am currently reading your answer to completion!

    – Sky Scraper Apr 21 '23 at 21:41
  • the link you provided will assist me in finding markerCorr, the multiple correlation of predictor with the other predictors. https://stats.stackexchange.com/questions/162294/multiple-logistic-regression-power-analysis/396681#396681 – Sky Scraper Apr 21 '23 at 21:45
  • When you say "fracMarker and markerCorr, which you can get gene by gene without looking at outcomes," that means extracting the values/coefficients without looking at the test statistic values (like |z| and p-val)? – Sky Scraper Apr 21 '23 at 21:47
  • I finished reading and the final question I have is: what do I then do with the minimum detectable log-odds? Do I then apply my model to my real data and go gene-by-gene discarding all models where the beta coefficient returned by the model is lower than the minimum detectable log-odds calculated, factoring in the fraction of samples with the mutation type? – Sky Scraper Apr 21 '23 at 22:12
  • Sorry, I did not see your update sooner.
    I am a little puzzled: why would you ever exclude genes from further analysis that have a "minimum detectable" log-odds value of some particular value or greater? Would you not want to exclude genes from further analysis that have a log-odds value less than the "minimum detectable" log-odds value at that frequency?
    – Sky Scraper Apr 25 '23 at 20:51
  • @SkyScraper remember that you are doing this without looking at the outcomes. The "minimum detectable" value for a gene mutation is the smallest association with outcome that you are likely to identify based solely on the characteristics of the predictors. If you are able to detect a fairly small association with outcome, fine, that gene might provide useful information. If you can only detect a very large association of some gene with outcome on that basis, however, then you might decide to omit that gene to cut down on the number of gene tests and the multiple-comparison problem. – EdM Apr 25 '23 at 21:02
  • The problem is how to choose a reasonable cutoff. Is 4 a reasonable "minimum detectable" log-odds value, 8, 2 or 1.2? I do not have literature to base my decision on, as the most popular tool for rare variant analysis (SKAT package), does not report effect size. As I am conducting rare-variant analysis, I believe effect sizes from GWAS studies are not useful, as I expect that rare variants must have large effect sizes to be associated with disease. I guess my data already has an upper bounds (only looking at variants with MAF < 1%) and now I am looking for a reasonable lower bounds. – Sky Scraper Apr 25 '23 at 21:17
  • In writing my above comment, I come to reason that I believe I am looking to identify a "highest reasonable effect size" in my study.... – Sky Scraper Apr 25 '23 at 21:18
  • This might be a good resource for me to methodically go through: "Searching for missing heritability: Designing rare variant association studies" Zuk et al. 2013 – Sky Scraper Apr 25 '23 at 21:36
  • @SkyScraper you can only tell that a gene has a large effect if you look at the outcomes. If you mostly care about large effects, then power estimates probably aren't of much help and (absent information based on knowledge of the subject matter) you need to take the risk of a large number of comparisons and do multiple-comparison corrections accordingly. The false-discovery rate is often used for genome-wide studies; you accept that some fraction of "positive" results will be invalid. – EdM Apr 25 '23 at 21:37
  • I am familiar with this approach but am looking for an alternative as I previously used this approach in a related genetic analysis and was unable to find anything because most genes had very few alleles. I will look at the literature to see what I could find in the space. Further, could I not use simulations of my predictor (based on a binomial) to then determine a reasonable effect size to set my minimum detectable log-odds value?

    Thank you so much for your help in understanding all of this. It has been super appreciated.

    – Sky Scraper Apr 25 '23 at 21:51
  • @SkyScraper the choices are controlling false-discovery rate or family-wise error rate. For both, setting a minimum detectable log-odds value (whether done by the formula or by simulation) can help cut down on the number of hypotheses but at the risk of missing some rare very strong associations. You might also consider a LASSO binomial regression with all the genes, a principled way to select predictors while penalizing their regression coefficients to minimize overfitting. – EdM Apr 26 '23 at 13:55
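
A sketch of that LASSO suggestion with the glmnet package, reusing the hypothetical geneCarrier matrix and covars data frame from above, with disease as the 0/1 outcome vector:

library(glmnet)
X <- cbind(geneCarrier, model.matrix(~ . - 1, data = covars)) # all genes plus covariates in one matrix
fit <- cv.glmnet(X, disease, family = "binomial", alpha = 1)  # alpha = 1 is the LASSO penalty
coef(fit, s = "lambda.min")                                   # genes with nonzero coefficients are retained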