In our Statistics Class, we are learning about something called "P-Hacking" (also called "Data Snooping", "Data Dredging", etc.).
Here is the example we are working with in our class: we have a large dataset (over 5 million rows) containing medical data on patients (each row is one patient). This includes information such as their age, gender, height, geographical location of residence, weight, income, highest education level obtained - and whether or not they have asthma. We are discussing strategies to analyze this dataset - for example, do certain groups of people have asthma at higher rates than other groups?
Our professor told us we can consider different comparisons - for example:
- Do Men have asthma at higher rates than Women?
- Do people with University Degrees have asthma at higher rates compared to people without University Degrees?
- Do Men with University Degrees have asthma at higher rates compared to Women without University Degrees?
But as we can see, the possible comparisons using only the categorical variables are already numerous - and once we start factoring in the continuous variables (e.g. Men with University Degrees over the Age of 53 vs. Men with University Degrees under the Age of 53), the number of possible comparisons grows even larger.
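Just to make the multiplicity concrete, here is a rough back-of-the-envelope count (the variable counts here are made up purely for illustration):

```python
from math import comb

# Toy illustration: suppose we boil things down to just 4 binary variables
# (e.g. gender, university degree, plus cutoffs on age and income).
n_binary_vars = 4
n_subgroups = 2 ** n_binary_vars      # 16 distinct subgroups
n_pairwise = comb(n_subgroups, 2)     # 120 possible pairwise comparisons

print(n_subgroups, n_pairwise)        # 16 120
# Every additional cutoff on a continuous variable (age > 53, income > 44k, ...)
# multiplies the number of possible subgroups - and hence comparisons - further.
```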
Several of us in the class had the idea of "exploring the data" to see what interesting relationships we could find, and then incorporating our findings into new hypothesis questions. But to our surprise, our professor heavily criticized this idea and called it a "classic example of P-Hacking". Our professor told us that for research to be taken seriously, all hypotheses (i.e. which comparisons will be made) must be clearly stated prior to doing the research. Supposedly, this reduces the chances of "getting lucky" and reporting coincidental findings that are merely the product of chance. Our professor listed several examples of P-Hacking:
- Removing and inserting variables into a regression model arbitrarily until desirable p-values are found
- "Fishing" for different hypotheses until one of these hypotheses results in a desirable p-value
While the examples the professor listed do seem relevant to me, I am not sure I agree that these practices are "inherently" wrong. For example:
In Machine Learning, there are entire families of algorithms for "Feature Selection" (e.g. Stepwise Selection, Genetic Algorithms, https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html) that try different combinations of features (and even create new features from old features) until a good model is produced. As I understand it, these techniques can also be used for regression models. I understand that these methods can be prone to abuse with regard to P-Hacking (and can result in overfitted models that are too complex to understand) - but if large datasets are available, couldn't procedures like Cross Validation be used to rigorously test these models, to make sure that a "newly discovered insight" is not just a product of chance, but rather consistently appears across different levels of variability within the data?
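For example, here is a minimal sketch of the kind of safeguard I have in mind (using scikit-learn rather than caret; the file and column names are hypothetical). The feature-selection step sits inside the pipeline, so every cross-validation fold re-runs the selection on its own training data and the "discovered" features have to hold up on data they were never selected on:

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical dataset: one row per patient, binary "asthma" outcome.
df = pd.read_csv("patients.csv")
X = df[["age", "height", "weight", "income", "education_years", "is_male"]]
y = df["asthma"]

# The feature selection (a stepwise-style search) lives INSIDE the pipeline,
# so each CV fold repeats the selection using only that fold's training data.
model = Pipeline([
    ("select", SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                         n_features_to_select=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation: an "insight" that only held by chance on one split
# should not keep reappearing on the held-out folds.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())
```

If the cross-validated performance stays consistently above chance across the folds, that seems like at least some evidence that the selected features are not a one-off fluke.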
Regarding the second point, I also agree that "newly discovered trends, patterns and results" can be prone to abuse and can be misleading - but aren't "preconceived hypotheses" equally subject to such abuse, and don't they have the potential to be equally misleading? Suppose I take two arbitrary subsets of medical patients (e.g. Men over 34 with University Degrees earning under 44k vs. Men under 34 with University Degrees earning more than 44k). I run some statistical comparison tests and find that "Men over 34 with University Degrees earning less than 44k" have asthma at very high rates. In the spirit of Machine Learning, could I not use a Cross-Validation-themed approach (such as the Bootstrap), take repeated random resamples of "Men over 34 with University Degrees earning less than 44k" (provided the sample size of this subgroup is large enough), and plot the distribution/histogram of their asthma rate? Could I not also use a similar approach to test the statistical significance of the difference between groups? Could such an approach not be used to partly determine whether P-Hacking is occurring?
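Concretely, something like this is what I am picturing (a sketch with a made-up file and column names; the subgroup definitions are the ones from my example above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.read_csv("patients.csv")  # hypothetical file; columns assumed below

# The two (made-up) subgroups from the example above.
mask_a = (df["is_male"]) & (df["age"] > 34) & (df["has_degree"]) & (df["income"] < 44_000)
mask_b = (df["is_male"]) & (df["age"] < 34) & (df["has_degree"]) & (df["income"] > 44_000)
grp_a = df.loc[mask_a, "asthma"].to_numpy()  # 1 = has asthma, 0 = does not
grp_b = df.loc[mask_b, "asthma"].to_numpy()

# Bootstrap the difference in asthma rates between the two subgroups.
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    rate_a = rng.choice(grp_a, size=grp_a.size, replace=True).mean()
    rate_b = rng.choice(grp_b, size=grp_b.size, replace=True).mean()
    diffs[i] = rate_a - rate_b

# A rough 95% bootstrap interval for the difference in asthma rates; if it sits
# far from 0, the gap at least replicates across resamples rather than resting
# on a single lucky draw.
low, high = np.percentile(diffs, [2.5, 97.5])
print(low, high)
```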
The prof brought up that "if you fish for something long enough, you will eventually find it". By this, the prof meant that testing many different hypotheses will inevitably result in some false hypotheses being accepted (and some true hypotheses being rejected). Doesn't the Bonferroni Correction help here, by adjusting the significance threshold to be more "strict", so that hypotheses on the "cusp of being accepted" fall on the rejection side?
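For example (the p-values here are invented just to show the mechanics): with m tests, Bonferroni compares each p-value against alpha / m instead of alpha, so results hovering just under 0.05 no longer pass.

```python
import numpy as np

# Invented p-values from m = 5 hypothetical subgroup comparisons.
p_values = np.array([0.004, 0.030, 0.048, 0.20, 0.65])
alpha = 0.05
m = len(p_values)

# Bonferroni: declare significance only if p < alpha / m
# (equivalent to comparing m * p against alpha).
threshold = alpha / m                    # 0.01 - the stricter per-test threshold
significant = p_values < threshold

print(threshold)                         # 0.01
print(significant)                       # [ True False False False False]
# The "cusp" results (0.030, 0.048) would have counted as significant at
# alpha = 0.05, but no longer do once the correction is applied.
```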
And finally, if the entire research procedure - all the trial and error, every hypothesis considered, and all results (even the "less attractive" ones with insignificant p-values) - is formally reported, couldn't this also be used to argue that P-Hacking and other malicious research practices did not occur?
Can someone please comment on this - if my logic is correct, wouldn't this put aspects of Machine Learning and Exploratory Data Analysis fundamentally at odds with these warnings about P-Hacking? Or can large datasets and Cross Validation be used to partly circumvent and mitigate such issues?
