
I am analyzing data from ~8,000 patients who have COVID-19. This is my scenario: each patient is classified as either severe or non-severe in how COVID-19 affected them, and I also have the set of all mutations that each patient had.

Thus, I generated frequencies as follows: Frequency = P(Having Mutation X and a Severe Clinical Response) divided by P(Mutation X), i.e., the conditional probability of a severe response given Mutation X. I did this for every mutation carried by at least 30 patients, to satisfy the sample-size conditions of the statistical test.
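In code terms, what I'm computing for each mutation is essentially this (a sketch with made-up vector names):

## has_mut: TRUE if the patient carries Mutation X (hypothetical vectors)
## severe:  TRUE if the patient had a severe clinical response
frequency <- sum(has_mut & severe) / sum(has_mut)  ## P(severe | Mutation X)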

From my original dataset of 8,000 patients, I know that about 1,000 had a severe clinical response. Thus, my null hypothesis is that, for any given mutation with no association to a severe response, the frequency calculated above should equal 1,000/8,000 = 1/8. What is the best way to perform a statistical test that compares the severity frequency of each mutation to the population proportion (1/8) and also accounts for sample size? (E.g., if one mutation has a frequency of 0.9 but only 30 patients have it, it's likely not as significant as a mutation that 5,000 patients have, even if that mutation's frequency is lower, at 0.5.)
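For example, I imagine something like an exact binomial test against the 1/8 null would weigh those two cases differently, but I'm not sure it's the right choice:

## frequency 0.9 among only 30 carriers (27 severe)
binom.test(x = 27, n = 30, p = 1/8)$p.value
## frequency 0.5 among 5,000 carriers (2,500 severe)
binom.test(x = 2500, n = 5000, p = 1/8)$p.value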

I am thinking of doing a chi-squared test of homogeneity, but I'm not sure how to go about it, because I don't think it compares against the proportion of the entire dataset, and I'm not sure whether it accounts for sample size either. I've already done feature selection, so I'm not looking to do that again; I'm trying to run statistical tests on my original data set. Thanks.

1 Answer


A properly constructed contingency table lets a chi-square test take the overall sample size into account. For your situation, it's a cross-tabulation with severe/not-severe in one dimension and mutation present/absent in the other. In your first example, with 8000 total observations, 1000 with severe disease, 30 with the mutation, and 27 of those 30 having severe disease, the contingency table looks like this:

## 2 x 2 table: rows = mutation present/absent, columns = severe/not
mat1 <- matrix(c(27, 3, (1000-27), (7000-3)), byrow=TRUE, nrow=2,
               dimnames=list(mutation=c("present","absent"),
                             severity=c("severe","not")))
mat1
#           severity
# mutation  severe  not
#   present     27    3
#   absent     973 6997
chisq.test(mat1)$p.value ## warning about small expected counts omitted
# [1] 2.62529e-36 
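As an aside, the omitted warning reflects the small expected count (30 * 1000/8000 = 3.75) in the present/severe cell; Fisher's exact test on the same table avoids the chi-square approximation:

fisher.test(mat1)$p.value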

If you do similar tests based on 20% of those with a mutation having severe disease, however, you will not find a "significant" result if only 30 patients have the mutation. If 300 patients have the mutation and the same fraction have severe disease, you will get a highly significant association.
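A quick sketch of both scenarios (cell counts made up to match those fractions, keeping 1000 of 8000 severe overall):

## 30 with the mutation, 6 (20%) severe: not significant
mat_30 <- matrix(c(6, 24, (1000-6), (7000-24)), byrow=TRUE, nrow=2)
chisq.test(mat_30)$p.value

## 300 with the mutation, 60 (20%) severe: highly significant
mat_300 <- matrix(c(60, 240, (1000-60), (7000-240)), byrow=TRUE, nrow=2)
chisq.test(mat_300)$p.value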

There are nevertheless a few problems with your approach. First, if you have many mutations then you must deal with the multiple-comparisons problem.
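One standard way to handle that in R, assuming you've collected the per-mutation p-values into a vector, is a false-discovery-rate adjustment:

## p_values: one chi-square p-value per mutation (hypothetical vector)
p.adjust(p_values, method = "BH")  ## Benjamini-Hochberg FDR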

Second, from your last paragraph it seems that you are undertaking a regression model or similar analysis of severity. In that case you should incorporate the model's predictors into a binary (e.g., logistic) regression model instead of looking at the raw contingency tables from the original data set. Omitting any outcome-associated predictor from a binary regression model tends to bias the magnitude of the estimated association for the included predictor downward. But that's just what you would be doing if you ignore the other predictors and look only at mutation and disease status, as you propose.
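A minimal sketch of such a model, with made-up names (age and sex standing in for whatever other predictors your analysis uses):

## severe: 0/1 outcome; mutationX: 0/1 carrier status; one row per patient
fit <- glm(severe ~ mutationX + age + sex, data = patients, family = binomial)
summary(fit)  ## coefficient for mutationX is the adjusted log-odds association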

Third, this one-mutation-at-a-time approach means that you will miss potentially important associations that depend on combinations of mutations. You might consider a principled approach like LASSO that can evaluate multiple genes together.
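For example, with the glmnet package (a sketch; mutations is a hypothetical patients-by-mutations 0/1 matrix and severe the 0/1 outcome vector):

library(glmnet)
cv_fit <- cv.glmnet(mutations, severe, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.1se")  ## mutations with nonzero coefficients are selected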

EdM