
I've been trying to figure out how to test (in R) whether there are significant differences between the group means of my data, because the data seem to violate the assumptions of the tests that do this (ANOVA, Kruskal-Wallis). The obvious problem is that my data have large differences in group variability and relatively small sample sizes per group (11 to 36 observations).

> dput(df_count)
structure(list(Year = c(2018, 2019, 2020, 2017, 2010, 2011, 2013, 
2016, 2021, 2009, 2012), n = c(36L, 34L, 25L, 24L, 12L, 12L, 
12L, 12L, 12L, 11L, 11L)), row.names = c(NA, -11L), class = c("tbl_df", 
"tbl", "data.frame"))

I've read that I can't run a Kruskal-Wallis test without homogeneity of variance, or with small sample sizes like these. One possible workaround I got from a colleague would be to convert my density values to presence/absence (anything above 0 gets turned into a 1) and do a chi-square test on the proportions of 0s and 1s (a sketch of this idea follows below), but we weren't sure if that was a good solution. I'd also like to do a pairwise comparison, but again wasn't sure what problems would arise given the properties of this dataset.
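For what it's worth, here is a minimal sketch of the colleague's suggestion (not an endorsement); it assumes the `df`, `Year`, and `den` objects shown below:

library(dplyr)

# Convert density to presence/absence: anything above 0 becomes a 1
df_pa <- df %>%
  mutate(present = as.integer(den > 0))

# Years x {absent, present} contingency table
tab <- table(df_pa$Year, df_pa$present)

# Chi-square test of equal presence proportions across years;
# expected counts will be small here, so the approximation is shaky
chisq.test(tab)

# A simulated-p-value Fisher test is one alternative when counts are small
fisher.test(tab, simulate.p.value = TRUE)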

My data:

> dput(df)
structure(list(Year = c(2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2018, 2018, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2013, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2009, 2009, 
2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2010, 2010, 
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021, 
2021, 2021, 2021, 2021, 2021, 2021, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017), den = c(0.010339884588578, 
0.00455728179952938, 0.00343679685937641, 0, 0, 0, 0.0026099691969192, 
0.00261493942751616, 0, 0.00758841788700932, 0.00261098669259391, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00258166076347894, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00511179097342, 0, 0, 0, 0, 0, 
0, 0, 0.0344957485650341, 0, 0.00260181030586538, 0, 0, 0, 0.0090545203588682, 
0, 0.00264281685601483, 0, 0.0378316032295272, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0.0654230184885466, 0, 0, 0, 0, 0, 0, 0, 0, 0.00183150183150183, 
0, 0.00535045478865704, 0.00260213374967473, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0.00273945867001107, 0, 0, 0, 0, 0.0093124763796496, 
0, 0.0104115475163763, 0.176865881398321, 0, 0.0168260157921274, 
0.00986181407592425, 0.0031893056137876, 0.0254639674849916, 
0, 0, 0, 0, 0, 0, 0.0211521212726769, 0, 0, 0, 0, 0.00516893200484304, 
0, 0, 0.0104424761130064, 0, 0, 0, 0.0421221519735819, 0, 0.00252790953920362, 
0.0206951796482847, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0.00777067045723764, 0.0104825822946628, 0, 0, 0, 0, 0, 0.0129942159892243, 
0, 0, 0, 0.0103953727913794, 0, 0, 0.00259452618256605, 0, 0, 
0, 0, 0.00260317460317461, 0, 0, 0, 0.00260752047062565, 0, 0.00259452618256605, 
0.0026038446052897, 0.0226762329646423, 0, 0.00252708407882677, 
0, 0.00548065139740094, 0, 0, 0, 0.0494424158277803, 0, 0.00748566757311657
)), row.names = c(NA, -201L), class = c("tbl_df", "tbl", "data.frame"
))

Plot of mean density ± standard error bars [image]

Nate
  • Kruskal-Wallis is not a test of group means. The usual thing would be Welch(-Satterthwaite) ANOVA (a sketch follows these comments), but the small sample size means that your significance level would be sensitive to non-normality. However, I think a big issue is likely to be the assumption of independence you need no matter which of the more usual tests you consider. – Glen_b May 11 '22 at 23:37
  • Each value is also associated with the location in which it was collected (“den” = no. of shrimp/area surveyed at each site and season/year sampled without replacement - nothing thrown back). – Nate May 12 '22 at 02:18
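For reference, a minimal sketch of the Welch(-Satterthwaite) ANOVA mentioned in the first comment, assuming the `df` data frame from the question (with the caveat from that same comment that non-normality makes its significance level questionable here):

# Welch's one-way ANOVA (does not assume equal group variances)
oneway.test(den ~ factor(Year), data = df, var.equal = FALSE)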

1 Answer


Kruskal-Wallis is a non-parametric rank-based test. Under its null hypothesis, observations in one group are not systematically larger than observations in any other group. So Kruskal-Wallis is not a test of group means; it is often described as comparing medians, though that interpretation requires additional assumptions (for example, that the group distributions differ only by a location shift).

The issue with using Kruskal-Wallis on your data is that it contains 77% zeros and so there are a lot of ties. The p-value has to be corrected for all those ties.

Note: Variance is not an appropriate summary for your data because it consists mostly of zeros and the distribution of the densities is very skewed. Tests that are sensitive to non-normality are not appropriate, and symmetric confidence intervals like those shown in your plot don't make much sense either.

kruskal.test(den ~ Year, data = data)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  den by Year
#> Kruskal-Wallis chi-squared = 29.435, df = 10, p-value = 0.001059
# p-value adjusted for ties
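To see the ties at a glance, here is a small sketch (assuming dplyr and the same `data` object used above) that tabulates the share of zero densities per year:

library(dplyr)

data %>%
  group_by(Year) %>%
  summarise(
    n = n(),                    # observations per year
    prop_zero = mean(den == 0)  # fraction of exact zeros
  )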

However, why treat time (in years) as a categorical variable? You have observations from 11 years spanning 2009 to 2021. A better plot of your data shows that the proportion of non-zero data points increases between 2016 and 2018 and then declines again.

This suggests treating time as continuous and modeling its effect with a smooth nonlinear function. Here is an analysis using proportional odds regression with restricted cubic splines, as implemented in the rms package.

Note: Proportional odds regression generalizes the Kruskal-Wallis test [1].

library("rms")

anova(orm(den ~ rcs(Year, 4), data = data))
#> Wald Statistics          Response: den
#>
#>  Factor      Chi-Square d.f. P
#>  Year        10.04      3    0.0182
#>   Nonlinear   9.26      2    0.0098
#>  TOTAL       10.04      3    0.0182
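As discussed in the comments below, it is informative to look at the entire fitted function of Year rather than individual comparisons. A minimal sketch of how that curve could be plotted, assuming the same orm fit as above (rms needs a datadist set for Predict):

library(rms)

dd <- datadist(data)          # rms housekeeping for Predict()
options(datadist = "dd")

fit <- orm(den ~ rcs(Year, 4), data = data)

# Estimated effect of Year (log-odds scale) with pointwise confidence bands
plot(Predict(fit, Year))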

[1] Biostatistics for Biomedical Research course notes. Available online.


R code to make the small multiples plot above.

library(dplyr)
library(ggplot2)

data %>%
  mutate(
    den = round(den, 3)   # round the densities so equal values stack in the bars
  ) %>%
  ggplot(
    aes(den)
  ) +
  geom_bar(
    width = 0.001
  ) +
  facet_wrap(              # one panel per year
    ~Year,
    ncol = 4
  )
dipetkov
  • Aha, thank you! Would you mind sharing the code for the plot you made? Also, I'm not familiar with proportional odds regression, how would you interpret the results of that test? Is there a way to do a pairwise comparison as well? – Nate May 11 '22 at 22:47
  • By pairwise comparisons, do you mean comparing two years? Yes, you can do it but it would be more informative to look at the entire function rcs(Year). Read the course notes to learn more. – dipetkov May 11 '22 at 22:51
  • And if you are interested in pairwise comparisons, it will still be better to fit the regression model. This will give you model-based p-values for the comparisons (also called contrasts; see the sketch after these comments). If you compare pairs of years separately, you'll have to make a multiplicity adjustment for the multiple tests. – dipetkov May 11 '22 at 22:54
  • You would still need to assume equal population variances under the null (or you would lose exchangeability), but yes, there's no requirement that it be the case under the alternative. As a practical matter you wouldn't want the variances to change too quickly as you move from small effect sizes to large, though. – Glen_b May 11 '22 at 23:34
  • @Glen_b Thank you for the correction. I've updated my answer to explain the null hypothesis of Kruskal-Wallis better. I interpreted the part about the variances to mean that, to apply Kruskal-Wallis, it's necessary to first test for equality of variances; it isn't. – dipetkov May 11 '22 at 23:53
  • No dispute on the testing. Testing of assumptions isn't answering the right question in any circumstance and people should avoid looking at the data to choose what test they run on the same data (even more so if that leads them to change their specific hypothesis, as it often appears to) – Glen_b May 12 '22 at 00:38
  • @Glen_b, could you expand on this (looking at data to choose what test is to be run) a little more? Or maybe there was a post in which you already have? I don't quite understand what you mean... – Nate May 12 '22 at 00:43
  • Imagine there are two tests you're choosing between, on the basis of what you find in the sample you want to run your test on. The frequentist properties of the tests you run are no longer what you want them to be. For example, the actual significance level for both tests will tend to shift away from the level you chose. The problem is related to this issue: https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data but in a somewhat less direct form. – Glen_b May 12 '22 at 00:47
  • There are a number of posts on site (and also comments) where I discuss the issue, but you can also find papers that relate to it (including specifically on not testing heteroskedasticity, but instead simply not assuming equal variance under $H_0$ when you don't have cause to and so in those situations choosing a test that doesn't require it). However the issue is quite general. – Glen_b May 12 '22 at 00:50
  • In the closely related case of t-tests, see https://stats.stackexchange.com/questions/100934/does-testing-for-assumptions-affect-type-i-error/100941#100941 ... there are some links given there which in turn should have a reference or two for the case of heteroskedasticity ... It doesn't actually matter whether you use a test or some other criterion, any informative data-based choice will have similar problems. – Glen_b May 12 '22 at 00:52
  • To add to the issue of screwing up the properties of the test you want to use, testing its assumptions just doesn't answer the question you need it to in the first place. Diagnostics come closer to looking at the right thing - it's at least addressing effect size (so can get close to 'how much might this assumption's failure affect my results') but it still doesn't actually answer the right question (about what you should assume under $H_0$) and still screws up your test properties. – Glen_b May 12 '22 at 00:58
  • Some of the posts/comments at https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless are also relevant - for example Harvey Motulsky's answer there is short but on point. I have some much longer answers in other threads but I don't wish to labor the point. That thread is not about heteroskedasticity, but the comments on the usefulness of testing largely carry over. – Glen_b May 12 '22 at 01:00
  • @Glen_b, if I only needed to construct margins of error around those means (95% confidence intervals), would bootstrapping be an appropriate method? – Nate May 14 '22 at 17:08
  • No, not really, since your data is mostly 0s and you have very few observations per group. (You'll have to do a block bootstrap if you want to do it correctly.) Subtle discussions of how much deviation from normality is not too much don't seem especially relevant in your case. – dipetkov May 14 '22 at 17:14
  • @dipetkov Sorry, what does the last sentence mean? – Nate May 14 '22 at 17:27
  • I suspect that you don't distinguish between the following two cases: 1. The data fails a normality test but it's still "close enough" to normal that the inference (p-values, confidence intervals, etc.) will be approximately correct also. 2. There is nothing close to normal about the data. The linked post discusses case #1, your data is case #2. – dipetkov May 14 '22 at 17:31
  • Honestly, the confidence intervals are my boss’s request, but it sounds to me like they are mostly for symmetric data - which I don’t have, and I don’t mind that it’s not normal (I understand why it’s not). I would like to confirm for myself though if this is true. If I can build them using these data I would like to; if I can’t, it’s fine - I just want to understand the why, which I think I do. So, thank you! – Nate May 14 '22 at 17:38
  • And my question should have been about confidence intervals in the first place, but the responses I’ve gotten have still generated useful insights. – Nate May 14 '22 at 17:47
  • Confidence intervals don't have to be symmetric. Many CIs have the form mean +/- 2 SE and so are symmetric, but that's not a requirement. I'm not sure CIs make sense for your data, which is bounded below by 0 and mostly made up of 0s. But you could consider asking a new question explicitly about CIs. No-one else is reading such a long thread of comments.... – dipetkov May 14 '22 at 18:19
  • Someone else already asked that question, that’s why I thought bootstrapping might work. I’ll end here, haha. – Nate May 14 '22 at 18:26
  • Last comment from me also: Bootstrapping is a good idea, but you'd run into another issue. To estimate a CI with bootstrapping you need to estimate quantiles. In your case, [0, X95], where X95 is the 95th quantile, seems a reasonable CI. The issue is that for each year you have few observations, so it's hard to see how you can bootstrap the value such that Pr{X < X95} = .95. Limits are harder to estimate than central quantities like the mean and median. You need more observations to estimate what's happening in the tails, so to speak. – dipetkov May 14 '22 at 18:34
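As mentioned in the comments above, pairwise year comparisons can be obtained as model-based contrasts from the fitted proportional odds model. A minimal sketch using rms::contrast, assuming the orm fit from the answer (the pair 2018 vs 2016 is only an illustration):

library(rms)

dd <- datadist(data)          # rms housekeeping
options(datadist = "dd")

fit <- orm(den ~ rcs(Year, 4), data = data)

# Model-based contrast (difference in log odds) between two specific years,
# reported with a Wald test and confidence interval
contrast(fit, list(Year = 2018), list(Year = 2016))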