1

I'm studying the association between a rare disease and smoking. Because the disease is rare, my contingency table is highly unbalanced, with far more non-diseased than diseased individuals regardless of smoking status.

           | NonDiseased | Diseased
-----------+-------------+---------
Smoker     |        4312 |       16
Non-Smoker |       21329 |       20

Is there a way to correct the p-value of a chi-square test done on this table to reflect the fact that there are very few Diseased individuals?

je2018
  • 51

3 Answers

2

In addition to the other excellent answers:

To get more precise inference, you can fit a logistic regression model, which is always possible for a $2\times 2$ contingency table, and then use likelihood methods. Below I use the profile likelihood to get a more accurate confidence interval for the log-odds ratio; a confidence interval is also more informative than a p-value alone. With R code:

Smoker   <- rep(c("Yes", "No", "Yes", "No"),
                c(4312, 21329, 16, 20))
Diseased <- rep(c("No", "No", "Yes", "Yes"),
                c(4312, 21329, 16, 20))

mydata <- data.frame(Smoker = as.factor(Smoker), Diseased = as.factor(Diseased))
rm(Smoker, Diseased)

mod0 <- glm(Diseased ~ Smoker, data = mydata, family = binomial)
summary(mod0)

confint(mod0, 2)

Call:
glm(formula = Diseased ~ Smoker, family = binomial, data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.0861  -0.0433  -0.0433  -0.0433   3.7344

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -6.9721     0.2237 -31.170  < 2e-16 ***
SmokerYes     1.3755     0.3358   4.096  4.2e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 544.98  on 25676  degrees of freedom
Residual deviance: 530.05  on 25675  degrees of freedom
AIC: 534.05

Number of Fisher Scoring iterations: 9

Waiting for profiling to be done ...
    2.5 %    97.5 %
0.7032746 2.0316117
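For readers who want to see where the profile-likelihood interval printed above comes from, here is a minimal sketch that reproduces it by hand. It assumes the same counts in grouped form (which gives the same estimates as the individual-level fit); the names `dat`, `fit`, and `prof_dev` are made up for this illustration. The idea is to fix the log-odds ratio at a trial value through an offset, refit the intercept, and keep the values whose deviance is within the 95% chi-squared cutoff of the best fit.

```r
# Grouped version of the question's data (hypothetical object names).
dat <- data.frame(smoker   = c(1, 0),
                  diseased = c(16, 20),
                  healthy  = c(4312, 21329))

# Unconstrained fit: same coefficient estimates as the individual-level model.
fit <- glm(cbind(diseased, healthy) ~ smoker, family = binomial, data = dat)

# Deviance increase when the smoker coefficient is held fixed at `beta`
# (the intercept is re-estimated each time).
prof_dev <- function(beta) {
  fit0 <- glm(cbind(diseased, healthy) ~ 1 + offset(beta * smoker),
              family = binomial, data = dat)
  deviance(fit0) - deviance(fit)
}

# The 95% limits are where the deviance increase equals the chi-squared(1) cutoff.
crit  <- qchisq(0.95, df = 1)
b_hat <- unname(coef(fit)["smoker"])
lower <- uniroot(function(b) prof_dev(b) - crit, c(0, b_hat))$root
upper <- uniroot(function(b) prof_dev(b) - crit, c(b_hat, 4))$root
c(lower, upper)
```

The search brackets `c(0, b_hat)` and `c(b_hat, 4)` are choices made for this data set; the limits should agree with the `confint()` output up to small numerical differences.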

Exponentiating the confidence limits then gives a confidence interval for the odds ratio:

exp(c(0.7032746,  2.0316117 ))
[1] 2.020358 7.626368

which is a direct measure of effect size. Note that this confidence interval is based on likelihood profiling, which is usually more accurate than the Wald confidence interval, which we can get by

confint.default(mod0, 2)
              2.5 %   97.5 %
SmokerYes 0.7173564 2.033688

which in this case is not very different, probably because the sample size is so large.
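As a cross-check, the Wald interval can also be computed directly from the four cell counts, without fitting a model. This is a small sketch using the standard formula for the standard error of a log-odds ratio in a 2×2 table; the variable names are made up for the illustration.

```r
# Cell counts from the question's table (hypothetical variable names).
smok_dis  <- 16      # smokers, diseased
smok_non  <- 4312    # smokers, non-diseased
nons_dis  <- 20      # non-smokers, diseased
nons_non  <- 21329   # non-smokers, non-diseased

# Sample log-odds ratio and its large-sample standard error.
log_or <- log((smok_dis * nons_non) / (smok_non * nons_dis))
se     <- sqrt(1 / smok_dis + 1 / smok_non + 1 / nons_dis + 1 / nons_non)

# 95% Wald interval on the log-odds scale, then on the odds-ratio scale.
wald <- log_or + c(-1, 1) * qnorm(0.975) * se
wald
exp(wald)
```

The estimate and standard error match the `SmokerYes` row of the model summary to the digits shown there, because the logistic model is saturated for a 2×2 table.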

If you instead want a more direct measure, such as the difference in proportions, use the (similar) methods described in "Finding a confidence interval for difference of proportions".
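One readily available option for the difference in proportions is base R's `prop.test()`. This is only a sketch: its interval is a Wald-type interval with a continuity correction, which may or may not be the method discussed in the linked thread, and the object names are made up for the illustration.

```r
# Diseased counts and group sizes from the question's table.
diseased <- c(smoker = 16, nonsmoker = 20)
group_n  <- c(smoker = 4312 + 16, nonsmoker = 21329 + 20)

# Two-sample test of equal proportions; the result also carries a
# confidence interval for the difference (smokers minus non-smokers).
res <- prop.test(x = diseased, n = group_n)
res$estimate   # sample proportion of diseased in each group
res$conf.int   # 95% interval for the difference in proportions
```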

There is also higher-order likelihood inference, available in R through the packages hoa and cond. We can try it to see whether it makes much of a difference; the conclusion is that it does not, again probably because of the large sample size. Some results are below:

library(cond)

mod0.cond <- cond( mod0, offset=SmokerYes)

summary(mod0.cond)

Formula:  Diseased ~ Smoker
Family:   binomial
Offset:   SmokerYes

         Estimate  Std. Error
uncond.     1.376      0.3358
cond.       1.375      0.3359

Confidence intervals

level = 95 %
                                          lower  two-sided  upper
Wald pivot                               0.7174             2.034
Wald pivot (cond. MLE)                   0.7172             2.034
Likelihood root                          0.7033             2.032
Modified likelihood root                 0.7071             2.032
Modified likelihood root (cont. corr.)   0.6500             2.085

Diagnostics:

      INF        NP
0.0139565 0.0003413

Approximation based on 20 points

The modified likelihood root interval is supposed to be the most accurate, but the different intervals given here are very similar.

1

The bottom line is that the proportions of diseased subjects among smokers and non-smokers are about 0.0037 (16/4328) and 0.0009 (20/21349), respectively, and they are highly significantly different.

Because the counts 16 and 20 are relatively small, some statisticians might use the Yates continuity correction, which is conservative (it makes the chi-squared statistic smaller, hence the P-value larger). With or without this 'correction', your P-value is very small.

Computations in R below:

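# The construction of TBL is not shown in the answer; this is one way to get the table printed below.
TBL <- rbind(smok = c(4312, 16), nons = c(21329, 20))
TBL
      [,1] [,2]
smok  4312   16
nons 21329   20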

chisq.test(TBL)

    Pearson's Chi-squared test
    with Yates' continuity correction

data:  TBL
X-squared = 17.658, df = 1, p-value = 2.644e-05

chisq.test(TBL, correct = FALSE)

    Pearson's Chi-squared test

data:  TBL
X-squared = 19.58, df = 1, p-value = 9.649e-06
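To make the effect of the continuity correction concrete, the two statistics reported above can be reproduced by hand. This is a sketch with made-up object names (`obs`, `expd`); note that `chisq.test()` caps the correction at the smallest |observed − expected| difference, which makes no difference here because every difference is far larger than 0.5.

```r
# Observed counts (rows: smokers, non-smokers; columns: non-diseased, diseased).
obs <- rbind(smok = c(4312, 16), nons = c(21329, 20))

# Expected counts under independence: row total * column total / grand total.
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic without and with Yates' continuity correction.
x2_plain <- sum((obs - expd)^2 / expd)
x2_yates <- sum((abs(obs - expd) - 0.5)^2 / expd)

# Upper-tail P-values from the chi-squared distribution with 1 degree of freedom.
p_plain <- pchisq(x2_plain, df = 1, lower.tail = FALSE)
p_yates <- pchisq(x2_yates, df = 1, lower.tail = FALSE)
c(plain = x2_plain, yates = x2_yates)
```

Subtracting 0.5 from each absolute difference before squaring is what shrinks the statistic and therefore enlarges the P-value.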

The expected counts in this chi-squared test (all larger than 5) are sufficiently large for the null distribution of the statistic to be well approximated by $\mathsf{Chisq}(\nu = 1).$

chisq.test(TBL, correct = FALSE)$expected
          [,1]      [,2]
smok  4321.932  6.067999
nons 21319.068 29.932001
BruceET
  • 56,185
1

Imbalance alone is not an issue for a chi-squared test, although small absolute counts can be: applying a chi-squared test to a 100:1 imbalanced dataset will work fine if you have a million samples, but not if you have a hundred. With sufficient sample size, a chi-squared test can be appropriately applied to data with any level of imbalance. As long as there are enough counts in the rare group, it doesn't really matter what proportion of the whole they are.
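A quick way to see this is to look at the smallest expected cell count, which is what the usual "at least 5" rule of thumb checks. The sketch below uses a made-up helper, `min_expected()`, and assumes for illustration a 50/50 split on one variable and a 100:1 split on the other.

```r
# Smallest expected cell count in a 2x2 table under independence, given the
# total sample size and the marginal proportions (hypothetical helper).
min_expected <- function(n, p_row = 0.5, p_rare = 1 / 101) {
  n * min(p_row, 1 - p_row) * min(p_rare, 1 - p_rare)
}

min_expected(100)   # well below 5: the chi-squared approximation is doubtful
min_expected(1e6)   # in the thousands: the same imbalance is harmless
```

The imbalance is identical in both calls; only the absolute number of observations in the rare cells changes.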