I am analyzing data from ~8,000 patients who have COVID-19. My scenario: each patient is classified as either severe or non-severe according to how COVID-19 affected them, and for each patient I also have the set of mutations they carried.
From this, I generated a frequency for each mutation: Frequency = P(having Mutation X and a severe clinical response) / P(Mutation X), which is just the conditional probability P(severe | Mutation X). I did this for every mutation carried by at least 30 patients, to satisfy the usual sample-size conditions for a statistical test.
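For concreteness, here is a minimal sketch of how I compute these frequencies, assuming a patient-level table with one row per (patient, mutation) pair; the column names and toy data are placeholders, not my actual schema:

    import pandas as pd

    # Hypothetical layout: one row per (patient, mutation) pair, where
    # severe = 1 if that patient had a severe clinical response, else 0
    df = pd.DataFrame({
        "patient_id": [1, 1, 2, 3, 3, 4],
        "mutation":   ["X", "Y", "X", "X", "Z", "Y"],
        "severe":     [1, 1, 0, 1, 1, 0],
    })

    # Per-mutation carrier count, severe-carrier count, and frequency,
    # i.e. the conditional probability P(severe | mutation)
    stats = (
        df.groupby("mutation")["severe"]
          .agg(n_carriers="size", n_severe="sum")
          .assign(frequency=lambda t: t["n_severe"] / t["n_carriers"])
    )

    # Keep only mutations carried by at least 30 patients
    # (the toy data above is of course too small to survive this filter)
    stats = stats[stats["n_carriers"] >= 30]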
From my original dataset of 8,000 patients, I know that about 1,000 had a severe clinical response. Thus, my null hypothesis is that for any mutation with no correlation to a severe response, the frequency as calculated above should equal 1000/8000 = 1/8 = 0.125. What is the best way to perform a statistical test that compares each mutation's severity frequency to the population proportion (1/8) and also accounts for sample size? For example, if one mutation has a frequency of 0.9 but only 30 carriers, that is likely less significant than a mutation carried by 5,000 patients with a lower frequency of 0.5.
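To make the comparison concrete, here is a sketch of the kind of per-mutation test I have in mind, using a one-sample binomial test (scipy.stats.binomtest) against the population proportion. The counts are made up, and I am not claiming this is the right test; that is exactly my question:

    from scipy.stats import binomtest

    p_null = 1000 / 8000  # population proportion of severe responses, 1/8

    # Two hypothetical mutations: (severe carriers, total carriers)
    examples = {
        "rare_mutation":   (27, 30),      # frequency 0.90, only 30 carriers
        "common_mutation": (2500, 5000),  # frequency 0.50, 5,000 carriers
    }

    for name, (n_severe, n_carriers) in examples.items():
        result = binomtest(n_severe, n_carriers, p_null, alternative="two-sided")
        print(f"{name}: freq={n_severe / n_carriers:.2f}, p={result.pvalue:.3g}")

As far as I understand, a test like this accounts for sample size through n, since the p-value comes from the Binomial(n, 1/8) distribution of severe counts among the n carriers, but I would like confirmation that this is the right approach.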
I am thinking of doing a chi-squared test for homogeneity, but I'm not sure how to set it up: I don't think it compares against the proportion of the entire dataset, and I'm not sure whether it accounts for sample size either (a rough sketch of what I mean is below). I've already done feature selection, so I'm not looking to do that again; I'm trying to run statistical tests on my original dataset.
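For reference, this is roughly how I would set up the chi-squared version for a single mutation (counts hypothetical, chosen to be consistent with 8,000 patients and 1,000 severe cases); my worry is that it compares carriers against non-carriers rather than against the fixed 1/8 proportion:

    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 contingency table for one mutation:
    #                      severe   non-severe
    # has mutation X           27            3
    # lacks mutation X        973         6997
    table = [[27, 3],
             [973, 6997]]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p_value:.3g}, dof={dof}")

Thanks.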