
In our statistics class, we discussed the following problem:

  • Suppose a data set is collected that contains the city, height, weight, age, and salary of different people. For each of these people, we know if they have asthma or not.
  • Suppose for the entire data set, you take the height variable and break this variable into 5 even ranges (ntiles) that now span the entire data set (e.g., 150–155 cm, 156–161 cm, etc.). Each of these ranges contains 20% of the data.
  • You now repeat this for the weight, age and salary variables
  • Finally, you take every unique group of city, height-range, weight-range, age-range, salary-range and calculate the proportion of people that have asthma
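The binning procedure in the list above can be sketched roughly as follows. This is only an illustration on synthetic data: the two cities, the two binned variables (instead of four), the 10% asthma rate and the helper name `quintile_index` are all assumptions made up for the example.

```python
import random
from collections import defaultdict
from statistics import quantiles


def quintile_index(value, cuts):
    """Return 0..4: the number of cut points strictly below `value`."""
    return sum(value > c for c in cuts)


random.seed(0)
# Hypothetical synthetic data; asthma is generated independently of everything else.
people = [
    {
        "city": random.choice(["Buffalo", "Orlando"]),
        "height": random.gauss(170, 10),
        "weight": random.gauss(75, 12),
        "asthma": random.random() < 0.10,
    }
    for _ in range(5000)
]

# The cut points are computed from the collected data itself,
# which is what makes the bins data-dependent.
cuts = {
    var: quantiles([p[var] for p in people], n=5)  # 4 cut points -> 5 bins
    for var in ("height", "weight")
}

cells = defaultdict(lambda: [0, 0])  # cell -> [asthma count, total count]
for p in people:
    key = (p["city"],) + tuple(
        quintile_index(p[v], cuts[v]) for v in ("height", "weight")
    )
    cells[key][0] += p["asthma"]
    cells[key][1] += 1

# Proportion of people with asthma in each (city, height bin, weight bin) cell.
proportions = {key: cases / n for key, (cases, n) in cells.items()}
```

With all four binned variables the same loop applies; only the tuple that forms the cell key gets longer.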

The question being – suppose for each of these proportions, you calculate 95% confidence intervals: In such a situation, is hypothesis testing a valid approach for comparing asthma rates between different groups? As an example – are people in Buffalo, NY between 150–155 cm, 90–92 kg, 20–25 years old and earning between \$30,000 and \$40,000 more likely to have asthma compared to people in Orlando, Florida between 156–161 cm, 90–92 kg, 20–25 years old and earning between \$30,000 and \$40,000?
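For concreteness, here is a minimal sketch of what a 95% interval and a comparison of two such cells might look like. It assumes a Wilson score interval and a pooled two-proportion z-test, which are common choices but not ones specified in the problem; the function names are made up. Note that both calculations treat the two groups as if they had been fixed before seeing the data, which is exactly the assumption under discussion.

```python
from math import erfc, sqrt


def wilson_interval(cases, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion cases/n."""
    p = cases / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def two_proportion_z_test(cases1, n1, cases2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p1, p2 = cases1 / n1, cases2 / n2
    pooled = (cases1 + cases2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))  # erfc(|z|/sqrt(2)) = 2 * (1 - Phi(|z|))


# Hypothetical counts for two cells: 30 of 100 versus 10 of 100 with asthma.
low, high = wilson_interval(30, 100)
z, p_value = two_proportion_z_test(30, 100, 10, 100)
```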

Based on the nature of this question and the tone in which this question was framed to us, I infer the following (and think that this is what the professor wants us to answer):

  • Here, the choice of "5 groups" is arbitrary – why not choose 4 or 6?
  • Binning/discretizing a continuous variable loses information
  • It is possible that some of these groups might have very small counts that will not allow for hypothesis testing (e.g., small sample sizes)
  • Depending on the binning criteria, some arrangements might suppress legitimate trends whereas other arrangements might show false trends
  • Overall, logistic regression is more suitable for such problems
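To get a feel for the small-count concern, a quick back-of-the-envelope calculation helps. The sample size of 100,000 people and the 10 cities below are hypothetical numbers chosen for illustration, and `cell_budget` is a made-up helper.

```python
from math import comb


def cell_budget(n_people, n_cities, bins_per_variable=5, n_binned_variables=4):
    """Count the cells created by the binning and the possible pairwise comparisons."""
    n_cells = n_cities * bins_per_variable**n_binned_variables
    return {
        "cells": n_cells,
        "mean_people_per_cell": n_people / n_cells,
        "pairwise_comparisons": comb(n_cells, 2),
    }


budget = cell_budget(n_people=100_000, n_cities=10)
# 10 cities * 5**4 bins = 6250 cells, i.e. 16 people per cell on average,
# and 19,528,125 possible pairs of cells to compare.
```

Even with a fairly large data set, the average cell is small, and the number of possible pairwise comparisons is very large.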

But in reality, is it really not suitable to compare proportions across different groups from "ntile" based bins? I understand that there will likely always be some level of arbitrariness, comparisons might not be possible on groups with smaller counts and other statistical approaches might be better – but are hypothesis tests based on such comparisons fundamentally wrong? Does this also mean we never statistically compare proportions calculated from different quantiles?

(UPDATE) Food For Thought (from comment section):

  • Suppose I collect data on people's weights and whether they have asthma or not - now, a medical doctor comes along and tells me that the industry definition of overweight is " > x kg" - I could now split my data into two groups (e.g. > x kg and < x kg) and see if the proportion of asthma in overweight people is different from non-overweight.

  • Now, in a second example, suppose I collect the same data and this time there is no medical doctor to give me advice. I plot my data and observe (e.g. visually, or via some clustering algorithm) that there seem to be two very distinct groups of people : people with a weight > y kg and people with a weight < y kg . I can break the data into these two groups and compare the proportion of asthma in these two POST-HOC groups. In a very general sense, what is the difference between these two examples? Why is one valid and the other inherently not valid? Could I not just use some simulation technique (e.g. cross validation, bootstrapping) to "create new datasets" from the existing data to see if this "y kg" threshold that I "stumbled across" is meaningful or not?

  • Isn't identifying POST-HOC groups and eventually using these POST-HOC groupings for some sort of analysis (e.g. statistical hypothesis testing) the whole point of Clustering (e.g. K-Means)?

  • What if I POST-HOC find some interesting pattern I didn't know about prior to collecting the data? How can I know about this in advance? Should I not investigate and report on this and test the significance of this newfound pattern via simulation?

  • Another idea : What if I said prior to collecting data: "In whatever data I am about to collect, I think people that will be above the median weight in this future dataset are more likely to have asthma than below the median weight in this future dataset." Or, "In whatever data I am about to collect, I think people that will be in the first quartile (based on weight) in this future dataset are likely to have asthma at a different prevalence compared to people in the last quartile." In both of these cases, I have not explicitly defined these ranges - these ranges are only defined in terms of statistical functions. Are these still unsuitable approaches?

  • As a final concluding statement, my (naïve and likely incorrect) opinion would be : Although there is a significant risk of complicating the analysis using POST-HOC groups (e.g. bias, small sample sizes, arbitrariness) - it is not necessarily true that the analysis will be invalidated when statistical comparisons are performed on POST-HOC groups?
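One way to explore these questions by simulation is sketched below. It generates data in which asthma is independent of weight, then compares a split chosen by a fixed rule (the median) with a split chosen after scanning several thresholds and keeping the one with the smallest p-value. All parameters (sample size, 20% asthma rate, decile candidates) and function names are assumptions for illustration. The sketch covers only this one kind of post hoc choice and is not a verdict on every data-dependent grouping.

```python
import random
from math import erfc, sqrt
from statistics import quantiles


def z_test_p(cases1, n1, cases2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (1.0 if undefined)."""
    pooled = (cases1 + cases2) / (n1 + n2)
    if pooled in (0, 1):
        return 1.0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return erfc(abs(cases1 / n1 - cases2 / n2) / se / sqrt(2))


def split_p(weights, asthma, threshold):
    """p-value for asthma proportion above versus at-or-below `threshold`."""
    above = [a for w, a in zip(weights, asthma) if w > threshold]
    below = [a for w, a in zip(weights, asthma) if w <= threshold]
    return z_test_p(sum(above), len(above), sum(below), len(below))


def simulate(n_sims=400, n=300, alpha=0.05, seed=1):
    """Return the rejection rates (median split, best-of-nine split) under the null."""
    rng = random.Random(seed)
    median_hits = scanned_hits = 0
    for _ in range(n_sims):
        weights = [rng.gauss(75, 12) for _ in range(n)]
        asthma = [rng.random() < 0.2 for _ in range(n)]  # independent of weight
        deciles = quantiles(weights, n=10)  # 9 candidate thresholds
        median_hits += split_p(weights, asthma, deciles[4]) < alpha
        scanned_hits += min(split_p(weights, asthma, t) for t in deciles) < alpha
    return median_hits / n_sims, scanned_hits / n_sims


median_rate, scanned_rate = simulate()
```

Since no real effect exists in the simulated data, every rejection is a false positive, so comparing `median_rate` with `scanned_rate` shows how much the threshold search alone contributes.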

stats_noob
  • 2
    One thing that did not feature in your "inferred" list is the multiple comparisons problem, which I guess would be one of the main discussion points your professor would like to touch on. – B.Liu Jan 08 '23 at 09:15
  • 1
    Binning does not just 'lose information,' but also adds bias. – Alexis Jan 08 '23 at 17:39
  • 3
    You are right to be concerned. I discuss this very issue in an answer (to a totally different question) at https://stats.stackexchange.com/a/17148/919: see the analysis and example of a chi-squared using quantile-based bins. – whuber Jan 08 '23 at 19:40
  • @ B.Liu: Thank you for bringing this point up! – stats_noob Jan 08 '23 at 21:47
  • @ Whuber: You posted a detailed answer which is too advanced (e.g. about manifolds) for me lol! In simpler words - is this approach based on breaking data based on ntile groups, calculating proportions within these ntile groups and analyzing these proportions .... is this approach NEVER acceptable? Even if you have large amounts of data within each of these ntile groups? – stats_noob Jan 08 '23 at 21:49
  • The relevant portion of the answer is completely elementary. – whuber Jan 08 '23 at 23:36
  • @ Whuber: according to this wikipedia article https://en.m.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data - "cross validation" can be used to counter these problems? – stats_noob Jan 09 '23 at 01:24
  • It can, but it can be difficult to implement because you have to cross-validate the entire procedure, including the decisions about how to bin the data. Usually those decisions are not conducted according to any definite algorithms and accordingly they can become difficult to reproduce. – whuber Jan 09 '23 at 18:18
  • @ Whuber: Thank you for your reply! Our prof showed us this paper today here: https://www.popdata.bc.ca/sites/default/files/documents/events/etu/2016%20Sutradhar%20-%20MSM%20for%20adherence%20-%20JMS.pdf - over here, "Age" and "Neighborhood Income" are binned into quintiles. Do you feel as though this "binning" process should not have been done (i.e. arbitrary) and have ideally been treated as a continuous variable? Does this "binning" process take away from the results in this paper? – stats_noob Jan 12 '23 at 03:19
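As a rough illustration of the point about validating the entire procedure, here is a single hold-out split, which is simpler than full cross-validation. The threshold is chosen on one half of the data and the test is run only on the other half, so the data that suggested the grouping are not reused to test it. The threshold-selection rule, the synthetic data and the function names are assumptions for the example.

```python
import random
from math import erfc, sqrt
from statistics import quantiles


def z_test_p(cases1, n1, cases2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (1.0 if undefined)."""
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = (cases1 + cases2) / (n1 + n2)
    if pooled in (0, 1):
        return 1.0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return erfc(abs(cases1 / n1 - cases2 / n2) / se / sqrt(2))


def split_p(rows, threshold):
    """p-value for asthma proportion above versus at-or-below `threshold`."""
    above = [a for w, a in rows if w > threshold]
    below = [a for w, a in rows if w <= threshold]
    return z_test_p(sum(above), len(above), sum(below), len(below))


def holdout_test(rows, seed=0):
    """Pick a threshold on one random half, then test it on the other half."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    explore, confirm = shuffled[:half], shuffled[half:]
    candidates = quantiles([w for w, _ in explore], n=10)
    # The whole selection step (here: smallest p-value) uses `explore` only.
    chosen = min(candidates, key=lambda t: split_p(explore, t))
    return chosen, split_p(confirm, chosen)


# Hypothetical data: (weight in kg, has asthma), generated independently.
data_rng = random.Random(42)
rows = [(data_rng.gauss(75, 12), data_rng.random() < 0.2) for _ in range(1000)]
threshold, p_value = holdout_test(rows)
```

The sketch only works because the selection rule is an explicit algorithm; a binning decision made by eye would be much harder to reproduce inside `holdout_test`.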

1 Answer


In reality, it is 100% appropriate and routine to compare proportions across different groups.

Hypothesis tests based on contrasting proportions across groups are not, in and of themselves, fundamentally wrong.

For example, the quantitative heart of epidemiology depends on such contrasts and such tests, for instance when contrasting proportions or contrasting incidence rates. That ranges from contrasting incidence rates between different arms in randomized controlled trials to contrasting rates and proportions between geopolitical or jurisdictional boundaries in observational population research. Literally any general epidemiology textbook will illustrate this amply.

There are specific two-sample and omnibus (multi-sample) tests to do precisely this in the two-variable case, both when representing such data as proportions, and as counts (e.g., Z test, $\chi^2$ test, Cochran's Q test, Friedman's test, etc.). There are also direct and indirect adjustment techniques for addressing comparisons when there are additional variables.
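As a small illustration of an omnibus test on counts, here is a sketch of the Pearson $\chi^2$ statistic for comparing a proportion across several groups. The function name and the counts are hypothetical, and the sketch returns only the statistic and its degrees of freedom, to be compared against a $\chi^2$ table.

```python
def chi_squared_homogeneity(cases, totals):
    """Pearson chi-squared statistic for equal proportions across groups.

    `cases[i]` is the number of people with the condition in group i and
    `totals[i]` the group size. Returns (statistic, degrees of freedom).
    Assumes the overall proportion is strictly between 0 and 1.
    """
    overall = sum(cases) / sum(totals)
    stat = 0.0
    for c, n in zip(cases, totals):
        expected_cases = n * overall
        expected_non_cases = n * (1 - overall)
        stat += (c - expected_cases) ** 2 / expected_cases
        stat += ((n - c) - expected_non_cases) ** 2 / expected_non_cases
    return stat, len(cases) - 1


# Hypothetical counts: 10 of 100 versus 30 of 100 with asthma.
stat, df = chi_squared_homogeneity([10, 30], [100, 100])
# stat is 12.5 with df = 1; the 5% critical value for df = 1 is 3.841.
```

With two groups the statistic equals the square of the pooled two-proportion z statistic, so the two tests agree in that case.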

Finally, the question of whether to compare proportions across different quantiles depends substantively on your theory of aggregation. For example, "California" isn't a randomly selected group of 40 million people across a geographic continuum, "minors" aren't a group of people spanning any arbitrary 18 year age bracket, etc. (If the number of quantiles across some continuum is relatively large, as in the case of direct age adjustment using population pyramids with 17 or more age brackets, sometimes even age brackets by year, then both loss of power and bias from aggregation are much reduced.)
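To make direct adjustment concrete, here is a minimal sketch: each group's stratum-specific rates are averaged using the same standard population as weights, so that groups with different age structures become comparable. The two age brackets, the rates and the standard population are hypothetical; a real application would use many more brackets, as noted above.

```python
def directly_standardized_rate(stratum_rates, standard_population):
    """Weighted average of stratum-specific rates, weighted by a standard population."""
    total = sum(standard_population.values())
    return sum(
        stratum_rates[stratum] * count
        for stratum, count in standard_population.items()
    ) / total


# Hypothetical standard population and asthma rates by age bracket.
standard_population = {"under_40": 600, "40_plus": 400}
city_a_rates = {"under_40": 0.05, "40_plus": 0.10}
city_b_rates = {"under_40": 0.06, "40_plus": 0.12}

adjusted_a = directly_standardized_rate(city_a_rates, standard_population)
adjusted_b = directly_standardized_rate(city_b_rates, standard_population)
# adjusted_a is (0.05*600 + 0.10*400) / 1000 = 0.07
# adjusted_b is (0.06*600 + 0.12*400) / 1000 = 0.084
```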

Alexis
  • 1
    This doesn't address the question, which concerns post hoc binning based on quantiles of data. As suggested in the question, it indeed is incorrect to treat those as if they were pre-specified or established independently of the data. What often rescues this error is the sheer amount of data. – whuber Jan 08 '23 at 19:39
  • 1
    @whuber Can you say a little more about what you mean by post hoc binning? Apparently I am missing something in the question, which (still) reads to me about comparing proportions across groups, including when there are third variables, and using, e.g., direct adjustment methods (which is where I thought binning was coming into the question, and why I mentioned direct age adjustment with population pyramids). Happy to be enlightened by you as always (I learn so much! :) – Alexis Jan 08 '23 at 20:05
  • 1
    The groups are constructed from the data: "you take the height variable and break this variable into 5 even ranges (ntiles)..." In a comment to the question I offered a link to a post where I explain and illustrate the kind of problems this can create for applying statistical tests. – whuber Jan 08 '23 at 21:05
  • @ Alexis: Thank you so much for your answer! – stats_noob Jan 08 '23 at 21:50
  • @ Whuber: Excuse my ignorance, but I have difficulty understanding this point. Suppose I collect data on people's weights and whether they have asthma or not - now, a medical doctor comes along and tells me that the industry definition of overweight is " > x kg" - I could now split my data into two groups and see if the proportion of asthma in overweight people is different from non-overweight. – stats_noob Jan 08 '23 at 22:06
  • Now, in a second example, suppose I collect the same data and this time there is no medical doctor to give me advice. I plot my data and observe (e.g. visually, or via some clustering algorithm) that there seem to be two very distinct groups of people : people with a weight > y kg and people with a weight < y kg . I can break the data into these two groups and compare the proportion of asthma in these two POST-HOC groups. In a very general sense, what is the difference between these two examples? Why is one valid and the other not valid? – stats_noob Jan 08 '23 at 22:07
  • Isn't POST-HOC grouping and performing analysis (e.g. statistical hypothesis testing) on these POST-HOC groupings the whole point of clustering (e.g. k-means)? – stats_noob Jan 08 '23 at 22:09
  • Clustering is not a formal hypothesis test: you seem to be changing the subject matter of your question as you go along. Once again: the explanation in the post I referred you to answers these questions explicitly and provides a fully worked and very simple example. – whuber Jan 09 '23 at 00:01