
Say I have a group of 100 persons and 3 features: age, height, income. I want to analyze how age is distributed. It turns out that it is nicely normally distributed with mean $\mu$. I then discover that 80% of the persons are women, so to account for that I split the group into men and women. Looking at age again, I find that $\mu_{\text{men}} = 60$ and $\mu_{\text{women}} = 45$, i.e. reporting a mean age of $\mu$ would be "wrong", since it makes a big difference whether we pick a woman or a man.

If we then look at where the persons come from, we notice that persons from Asia have a greater average age than persons from Europe, i.e. we need to take that into account as well, which now gives 4 groups.

So we can descend deeper and deeper into the "splits" and create more and more groups. How can we then say something statistical about a feature? Say I have a feature f1 and I want to see whether there is a statistically significant difference between two groups (e.g. in mean age). We might find that there is a difference between men and women, but not between men from Europe and women from Asia. Concluding that there is a significant difference in age between men and women is then not wrong, but it is not the complete truth either.

1 Answer


You have stumbled upon the problem of multiplicity, or multiple comparisons. Andrew Gelman has written about this extensively on his blog; see The garden of forking paths (and search that blog for more). This is a big topic!
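To make the problem concrete, here is a minimal simulation sketch of the setup in the question. Everything in it is an illustrative assumption rather than something from the original data: ages are drawn from one normal distribution (mean 50, known standard deviation 10) for everyone, so no subgroup truly differs, and all pairs of sex-by-region groups are compared with a two-sided z-test at the 5% level.

```python
import random
from itertools import combinations
from statistics import NormalDist, fmean

SIGMA = 10.0  # assumed known standard deviation of age (illustrative)


def any_false_positive(rng, n=100, alpha=0.05):
    """Simulate one data set with NO true group differences and report
    whether at least one pairwise subgroup comparison looks significant."""
    groups = {}
    for _ in range(n):
        sex = "woman" if rng.random() < 0.8 else "man"
        region = "Asia" if rng.random() < 0.5 else "Europe"
        age = rng.gauss(50.0, SIGMA)  # same distribution for every group
        groups.setdefault((sex, region), []).append(age)

    # Compare every pair of non-empty subgroups with a two-sided z-test.
    for a, b in combinations(groups.values(), 2):
        se = SIGMA * (1 / len(a) + 1 / len(b)) ** 0.5
        z = (fmean(a) - fmean(b)) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        if p < alpha:
            return True
    return False


def simulate(n_sims=2000, seed=0):
    """Fraction of simulated data sets with at least one false positive."""
    rng = random.Random(seed)
    hits = sum(any_false_positive(rng) for _ in range(n_sims))
    return hits / n_sims


rate = simulate()
print(f"Share of null data sets with some 'significant' pair: {rate:.3f}")
```

Under these assumptions each single test has a 5% false-positive rate, yet the share of data sets with at least one "significant" pair comes out well above 5%, and it grows as more splits are added.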

One standard way to "solve" it is to say that you should have your research questions very clear, preferably stated in a written document, before starting the analysis. That is to say, only do planned comparisons ...
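A rough back-of-the-envelope sketch shows why limiting yourself to a few planned comparisons helps. It assumes the tests are independent and each is run at level 0.05; subgroup comparisons that share data are generally not independent, so treat the numbers as an approximation. The function name is only for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive among m independent
    tests, each carried out at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m


# 1 planned comparison vs. all 6 pairs of 4 groups vs. 28 pairs of 8 groups
for m in (1, 6, 28):
    print(f"{m:2d} comparisons: {familywise_error_rate(m):.3f}")
```

With a single planned comparison the error rate stays at the nominal 5%, while comparing all pairs of the 4 groups from the question already pushes it to roughly a quarter under the independence assumption.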

The linked blog post is about a completely different viewpoint, the multiverse.