I am quite new to statistical analysis and have what is probably a newbie question. I have a dataset that contains year and state data for two different groups. I am trying to compare the group means for each state and year combination (around 40 unique state-year combinations in total), but I am running into a few problems:

  1. The distributions for each group by year and state are not normal.
  2. The sample sizes are erratic and unbalanced. For example, for 2018 + Kansas, group 1 may have a sample size of 1,100 and group 2 a sample size of 3,600. In extreme cases, group 1 may have a sample size of 10,000 and group 2 a sample size of 3,000.
  3. In general, the variances of the two groups are similar (though not always).

Based on this, I felt that a non-parametric Mann-Whitney test would be the best way to compare the distributions of the two groups within each state and year. However, I have been doing some reading and came across claims that the statistical power of the Mann-Whitney test is reduced when sample sizes are very large.

Is there any guidance on the best statistical test to compare these various combinations? Any resources or feedback would be greatly appreciated!

  • It might depend on what the distribution does look like. Are they counts? Probabilities? How non-normal? Generally, the variances being different doesn't matter. – Jeremy Miles Aug 22 '23 at 20:14
  • @JeremyMiles: my understanding is that, e.g., ANOVA is non-robust if variances differ and group sizes are different "enough", no? – Stephan Kolassa Aug 22 '23 at 20:17
  • @StephanKolassa - yes, agree. But it's pretty straightforward to deal with. I wouldn't switch to a different technique because of that. – Jeremy Miles Aug 22 '23 at 21:25
  • The distributions of your data do not matter. What does matter are the sampling distributions of the means. Unless your data are extremely skewed (see https://stats.stackexchange.com/questions/69898 for a discussion and example), the sampling distributions of the means will be close enough to Normal for your work. You have more important things to worry about, such as your implicit assumption that your samples can be analyzed as if they were random. – whuber Aug 22 '23 at 23:04
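To make the points in the comments concrete, here is a minimal sketch of one way to compare two group means in each state-year cell without assuming equal variances (Welch's unequal-variance test). It is an illustration under stated assumptions, not a recommendation for this particular dataset: the data are simulated and right-skewed, the "Ohio 2019" cell and the `welch_test` helper are hypothetical, and the p-value uses a normal approximation that is only reasonable because both samples are large. It also treats each sample as if it were random, which the last comment flags as the bigger concern.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def welch_test(x, y):
    """Welch's unequal-variance comparison of two means.

    Returns (difference in means, test statistic, approximate two-sided p-value).
    The p-value uses a normal approximation to the t distribution, which is
    only reasonable when both samples are large (assumed here).
    """
    diff = fmean(x) - fmean(y)
    # Standard error with separate variances, so unequal variances and
    # unbalanced group sizes are both handled.
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    t = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return diff, t, p


# Simulated, right-skewed (exponential) data; sample sizes mimic the question.
rng = random.Random(0)
cells = {
    # Same true mean (10) in both groups.
    ("Kansas", 2018): (
        [rng.expovariate(1 / 10) for _ in range(1100)],
        [rng.expovariate(1 / 10) for _ in range(3600)],
    ),
    # Hypothetical cell with different true means (10 vs. 12).
    ("Ohio", 2019): (
        [rng.expovariate(1 / 10) for _ in range(10000)],
        [rng.expovariate(1 / 12) for _ in range(3000)],
    ),
}

results = {key: welch_test(g1, g2) for key, (g1, g2) in cells.items()}
for (state, year), (diff, t, p) in results.items():
    print(f"{state} {year}: diff={diff:.2f}, t={t:.2f}, p~{p:.4f}")
```

With samples in the thousands, the sampling distribution of each mean is close to normal even though the raw data are clearly skewed, which is why a means-based comparison can still be reasonable here. With very small or extremely skewed samples this sketch would not be appropriate.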
