In our statistics class, we discussed the following problem:
- Suppose a data set is collected that contains the city, height, weight, age, and salary of different people. For each of these people, we know if they have asthma or not.
- Suppose that, for the entire data set, you take the height variable and break it into 5 equal-count ranges (ntiles) that together span the entire data set (e.g., 150–155 cm, 156–161 cm, etc.). Each of these ranges contains 20% of the data.
- You now repeat this for the weight, age, and salary variables.
- Finally, you take every unique group of city, height-range, weight-range, age-range, and salary-range and calculate the proportion of people that have asthma.
The question is: suppose that for each of these proportions you calculate 95% confidence intervals. In such a situation, is hypothesis testing a valid approach for comparing asthma rates between different groups? As an example: are people in Buffalo, NY between 150–155 cm, 90–92 kg, 20–25 years old and earning between 30,000 and 40,000 dollars more likely to have asthma compared to people in Orlando, Florida between 156–161 cm, 90–92 kg, 20–25 years old and earning between 30,000 and 40,000 dollars?
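To make the setup concrete, here is a minimal sketch of the procedure as I understand it, using only the Python standard library. Everything about the data is an assumption on my part: the three city labels, the sample size, the distributions, and the 10% asthma rate are all simulated, and I picked the Wilson score interval as one common choice of 95% interval (the class problem did not specify one).

```python
import math
import random
from collections import defaultdict
from statistics import quantiles

random.seed(0)

CITIES = ["Buffalo", "Orlando", "Denver"]  # hypothetical city labels
N = 5000                                   # hypothetical sample size

# Simulated people: every value here is made up purely for illustration.
people = [
    {
        "city": random.choice(CITIES),
        "height": random.gauss(170, 10),
        "weight": random.gauss(75, 15),
        "age": random.uniform(18, 80),
        "salary": random.lognormvariate(10.8, 0.5),
        "asthma": random.random() < 0.10,
    }
    for _ in range(N)
]

# Four cut points per variable split the whole data set into 5 equal-count ranges.
cuts = {
    var: quantiles([p[var] for p in people], n=5)
    for var in ("height", "weight", "age", "salary")
}

def quintile_index(value, var_cuts):
    """Return 0..4: how many cut points lie strictly below the value."""
    return sum(value > c for c in var_cuts)

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# cell key (city, height bin, weight bin, age bin, salary bin)
# -> [number with asthma, number of people]
cells = defaultdict(lambda: [0, 0])
for p in people:
    key = (p["city"],) + tuple(quintile_index(p[v], cuts[v]) for v in cuts)
    cells[key][0] += p["asthma"]
    cells[key][1] += 1

sizes = sorted(count for _, count in cells.values())
print("possible cells:", len(CITIES) * 5**4)
print("occupied cells:", len(cells))
print("median cell size:", sizes[len(sizes) // 2])

# Interval for the most populated cell
key, (k, n) = max(cells.items(), key=lambda kv: kv[1][1])
lo, hi = wilson_interval(k, n)
print(key, f"n={n}, rate={k / n:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Printing the cell sizes alongside the intervals seems useful here, since the width of each interval depends directly on how many people end up in that particular cell.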
Based on the nature of this question and the tone in which this question was framed to us, I infer the following (and think that this is what the professor wants us to answer):
- Here, the choice of "5 groups" is arbitrary – why not choose 4 or 6?
- Binning/discretizing a continuous variable loses information
- It is possible that some of these groups might have very small counts that will not allow for hypothesis testing (e.g., small sample sizes)
- Depending on the binning criteria, some arrangements might suppress legitimate trends whereas other arrangements might show false trends
- Overall, logistic regression is more suitable for such problems
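To put a rough number on the small-counts point, here is a back-of-the-envelope calculation. The figures are hypothetical and chosen only for illustration: $C$ is the number of cities and $N$ the number of people.

$$
\text{number of cells} = C \times 5^4 = 625\,C, \qquad \text{average cell size} = \frac{N}{625\,C}
$$

For example, with $N = 100{,}000$ and $C = 10$ there are $6{,}250$ possible cells and an average of $16$ people per cell. At an asthma prevalence of 10%, that is about $1.6$ expected cases per cell. The ntiles are only equal-count marginally, so the actual cells can be far more uneven than this average suggests whenever the variables are correlated (e.g., height and weight).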
But in reality, is it really not suitable to compare proportions across different groups from "ntile" based bins? I understand that there will likely always be some level of arbitrariness, comparisons might not be possible on groups with smaller counts and other statistical approaches might be better – but are hypothesis tests based on such comparisons fundamentally wrong? Does this also mean we never statistically compare proportions calculated from different quantiles?
(UPDATE) Food For Thought (from comment section):
Suppose I collect data on people's weights and whether they have asthma or not - now, a medical doctor comes along and tells me that the industry definition of overweight is "> x kg" - I could now split my data into two groups (e.g. > x kg and < x kg) and see if the proportion of asthma in overweight people is different from that in non-overweight people.
Now, in a second example, suppose I collect the same data and this time there is no medical doctor to give me advice. I plot my data and observe (e.g. visually, or via some clustering algorithm) that there seem to be two very distinct groups of people: people with a weight > y kg and people with a weight < y kg. I can break the data into these two groups and compare the proportion of asthma in these two POST-HOC groups. In a very general sense, what is the difference between these two examples? Why is one valid and the other inherently not valid? Could I not just use some simulation technique (e.g. cross validation, bootstrapping) to "create new datasets" from the existing data to see if this "y kg" threshold that I "stumbled across" is meaningful or not?
Isn't identifying POST-HOC groups (e.g. statistical hypothesis testing) and eventually using these POST-HOC groupings for some sort of analysis the whole point of Clustering (e.g. K-Means)?
What if I POST-HOC find some interesting pattern I didn't know about prior to collecting the data? How could I have known about this in advance? Should I not investigate and report on this and test the significance of this newly found pattern via simulation?
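One way I can think of to probe this by simulation is sketched below (standard library only). It is my own illustration under stated assumptions, not an established recipe: weights are drawn from a normal distribution, asthma is generated independently of weight (so any "significant" difference is a false positive), and the test is a pooled two-proportion z-test at the 5% level. It compares two ways of choosing the cut: splitting at the sample median weight, which looks only at the weights, and searching over the nine decile cut points for the split that gives the largest asthma difference, which looks at the outcome.

```python
import math
import random

random.seed(1)

def two_prop_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic (0 if the pooled rate is 0 or 1)."""
    p = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (k1 / n1 - k2 / n2) / se

def z_at_cut(data, i):
    """z statistic comparing the i lightest people with everyone heavier.

    `data` must be a list of (weight, has_asthma) pairs sorted by weight.
    """
    k_low = sum(asthma for _, asthma in data[:i])
    k_high = sum(asthma for _, asthma in data[i:])
    return two_prop_z(k_low, i, k_high, len(data) - i)

def simulate(n=200, reps=2000, prevalence=0.2, z_crit=1.96):
    """Return false-positive rates for (median split, best-of-deciles split)."""
    median_hits = searched_hits = 0
    decile_cuts = range(n // 10, n - n // 10 + 1, n // 10)  # 10%, 20%, ..., 90%
    for _ in range(reps):
        # Null world: asthma status is generated independently of weight.
        data = sorted(
            (random.gauss(75, 15), random.random() < prevalence)
            for _ in range(n)
        )
        # (a) Cut defined by the weights alone: the sample median.
        if abs(z_at_cut(data, n // 2)) > z_crit:
            median_hits += 1
        # (b) Cut chosen after looking at the outcome: keep the best decile split.
        if max(abs(z_at_cut(data, i)) for i in decile_cuts) > z_crit:
            searched_hits += 1
    return median_hits / reps, searched_hits / reps

median_rate, searched_rate = simulate()
print(f"false-positive rate, median split:    {median_rate:.3f}")
print(f"false-positive rate, searched split:  {searched_rate:.3f}")
```

In this particular setup the median split should reject at roughly the nominal 5% rate, while the searched split rejects more often, because the median is always one of the nine candidates being searched over. This only speaks to false positives under this specific null; it says nothing about whether a given threshold is meaningful, and a cut found by clustering the weights alone (without looking at asthma) is closer to case (a) than to case (b).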
Another idea: What if I said prior to collecting data: "In whatever data I am about to collect, I think people who will be above the median weight in this future dataset are more likely to have asthma than people below the median weight in this future dataset." Or, "In whatever data I am about to collect, I think people who will be in the first quartile (based on weight) in this future dataset are likely to have asthma at a different prevalence compared to people in the last quartile." In both of these cases, I have not explicitly defined these ranges - these ranges are only defined in terms of statistical functions. Are these still unsuitable approaches?
As a concluding statement, my (naïve and likely incorrect) opinion would be: although there is a significant risk of complicating the analysis when using POST-HOC groups (e.g. bias, small sample sizes, arbitrariness), it is not necessarily true that the analysis will be invalidated when statistical comparisons are performed on POST-HOC groups.