7

In our Statistics Class, we are learning about something called "P-Hacking" (also called "Data Snooping", "Data Dredging", etc.).

Here is the example we are working with in our class - we have a large dataset (e.g. over 5 million rows of data) that contains medical data on patients (each row is a medical patient). This includes information such as their age, gender, height, geographical location of residence, weight, income, highest obtained education level - and whether or not they have asthma. We are discussing strategies to perform analysis on this dataset - for example, whether certain groups of people have asthma at higher rates compared to other groups of people.

Our professor told us we can consider different comparisons - for example:

  • Do Men have asthma at higher rates than Women?
  • Do people with University Degrees have asthma at higher rates compared to people without University Degrees?
  • Do Men with University Degrees have asthma at higher rates compared to Women without University Degrees?

But as we can see, the possible comparisons we can do with only the categorical variables are numerous - and once we start factoring in the continuous variables (e.g. Men with University Degrees over the Age of 53 vs. Men with University Degrees under the Age of 53), the number of possible comparisons becomes even larger.

Several of us in the class had this idea of "exploring the data" and seeing what interesting relationships we could find, and then trying to incorporate our findings into new hypothesis questions. But to our surprise, our professor heavily criticized this idea and called it a "classic example of P-Hacking". Our professor told us that for research to be taken seriously - all hypotheses (e.g. which comparisons) must be clearly stated prior to doing the research. Supposedly, this reduces the chances of "getting lucky" and finding some coincidental findings that are the product of chance. Our professor listed several examples of P-Hacking:

  • Removing and inserting variables into a regression model arbitrarily until desirable p-values are found
  • "Fishing" for different hypotheses until one of these hypotheses results in a desirable p-value

While these examples that the professor listed do seem relevant to me - I am not sure if I agree that these practices are "inherently" wrong. For example:

  • In Machine Learning, there are entire algorithms for "Feature Selection" (e.g. Stepwise Selection, Genetic Algorithm, https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html) that try different combinations of features (and even create new features from old features) until a good model is produced. As I understand, these techniques can be used for a regression model. I understand that these methods can be prone to abuse with regard to P-Hacking (and result in overfitted models that are too complex to understand) - but if large datasets are available, couldn't procedures like Cross Validation be used to rigorously test these models to make sure that a "newly discovered insight" is not occurring by chance, but rather that this new insight consistently appears across different levels of variability within the data?

  • Regarding the second point, I also agree that "newly discovered trends, patterns and results" can be prone to abuse and can be misleading - but aren't "preconceived hypotheses" equally subject to such abuse, with the potential to be equally misleading? Suppose I take two random subsets of medical patients (e.g. Men over 34 with University Degrees earning under 44k vs. Men under 34 with University Degrees earning more than 44k) - I run some statistical comparison tests and find that "Men over 34 with University Degrees earning less than 44k" have asthma at very high rates. In the spirit of Machine Learning, could I not use some Cross Validation themed approach (such as the Bootstrap), take repeated random samples of "Men over 34 with University Degrees earning less than 44k" (provided the sample size of this subgroup is large enough) and plot the distribution/histogram of their asthma rate (a rough sketch of this idea follows this list)? Could I also not use a similar approach to test the statistical significance between groups? Could such an approach not be used to partly determine if P-Hacking is occurring?

  • The prof brought up that "if you fish for something long enough, you will eventually find it". By this, the prof meant that testing many different hypotheses will inevitably result in some false hypotheses being accepted and some true hypotheses being rejected. Doesn't the Bonferroni Correction assist in adjusting the statistical significance so that the probability thresholds are more "strict", resulting in hypotheses on the "cusp of being accepted" falling on the rejection side?

  • And finally, if the entire research procedure, trials and errors, all hypotheses and results are formally stated (e.g. even the results that look "less attractive" with insignificant p-values) - couldn't this also be used to argue the case that P-Hacking and other malicious research practices did not occur?
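
As a rough illustration of the bootstrap idea in the second bullet above, here is a minimal sketch in Python. Everything here is hypothetical: the data are simulated and the column names (sex, age, university_degree, income, asthma) are made up, since I don't have the class dataset.

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the class dataset (purely hypothetical columns/values).
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "sex": rng.choice(["M", "F"], n_rows),
    "age": rng.integers(18, 90, n_rows),
    "university_degree": rng.integers(0, 2, n_rows),
    "income": rng.integers(20_000, 120_000, n_rows),
    "asthma": rng.integers(0, 2, n_rows),  # 0/1 outcome
})

# The subgroup from the example: men over 34 with a degree earning under 44k.
subgroup = df[(df["sex"] == "M") & (df["age"] > 34)
              & (df["university_degree"] == 1) & (df["income"] < 44_000)]

# Bootstrap: resample the subgroup with replacement and record the asthma rate.
values = subgroup["asthma"].to_numpy()
boot_rates = np.array([
    rng.choice(values, size=len(values), replace=True).mean()
    for _ in range(5_000)
])

print("Observed asthma rate:", values.mean())
print("95% bootstrap interval:", np.percentile(boot_rates, [2.5, 97.5]))
```

Note that a bootstrap like this only quantifies sampling variability within the same data that suggested the subgroup in the first place; as the answers below point out, it cannot by itself rule out the multiple-comparisons problem.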

Can someone please comment on this - if my logic is correct, wouldn't this make aspects within Machine Learning and Exploratory Data Analysis fundamentally at odds with P-Hacking? Or can large datasets and Cross Validation be used to partly circumvent and mitigate such issues?

stats_noob

4 Answers

7

The TLDR

The short answer to your question is no. You can engage in both null hypothesis significance testing (NHST) as well as exploratory data analysis (EDA) without engaging in morally corrupt research practices. The longer answer to that question can be summarized with "it also depends."

A Practical Example

If we step outside the frame of formal scientific research, we can apply both of these types of data analysis to everyday life and see why they are both important (and consequently, why they are not always at odds with each other). Let's say my computer breaks one day. I hypothesize that this is due to the power button being dysfunctional. My null hypothesis would be that pressing the power button should have no effect on the outcome. So let's say I press the button a thousand times and nothing happens. I have now failed to reject the null hypothesis.

But of course somebody who has broken their computer wouldn't just throw their digital baby into the trash bin. They would try to find some way to figure out the problem, and thus would explore other options. That can be done either with formal hypotheses ("perhaps it is because of the charger") or by simply mashing all the buttons until something works ("idk what is going on with my computer but maybe blunt force trauma will do the trick"). You can see how this would be a solution rather than a problem, because now you are combining exploratory and confirmatory methods. Consider the below chart of the scientific method:

[Figure: flowchart of the scientific method, showing both confirmatory steps and exploratory/troubleshooting steps]

You can see that some parts of this method are confirmatory and others are exploratory. Take for example the section that shows where one would "troubleshoot." We do this all the time with broken computers, and that would also apply with our scenario above. You would also engage in updates to your hypotheses ("maybe it is broken because of X, Y, or Z"). Only by exploring further can we figure this out. Drawing up hypotheses from nowhere is a crackpot's idea of science anyway.

Now if we were to inject a p-hacking scenario into this real-life situation, it may look something like the following. Let's say my roommate also has a broken computer and I have no pre-conceived notion of what is causing the issue. So I mash all of the buttons on his keyboard until it decides to turn on. I don't know what actually fixed it, but because one of the buttons was the power button, I tell him afterward that I had deduced it was the power button the whole time. Obviously this is lying...I didn't fix the computer because I knew the cause already, I simply explored around and conceived of the explanation after the fact.

This situation is more where you will find issues with blending NHST and EDA. If you decide to dig around until you get something you want, then naturally you cannot say with certainty that you hypothesized it. But that's not to say that you shouldn't engage in EDA to facilitate NHST. It only means that you should engage in other practices, such as reporting your failed hypotheses, including newly developed hypotheses, and providing justification for why EDA was used (usually to motivate a decision on what the causes of a phenomenon are).

Final Remarks

The reason I suspect your professor has made this hard distinction in your class is that failing to state some explicit ideas before data analysis can be very problematic if you are going to be a researcher or statistician, and he is probably attempting to get you to think in a more scientifically rigorous way. This isn't inherently wrong, but I do think that stating this in such a rigid way isn't always helpful. Perhaps some others here may disagree with me, but that is my assumption based on my intuition on the subject.

Edit

In the comments, Karolis mentioned a potential issue with the power button analogy: it may be a false example because you always get a definitive answer about the outcome. But as I mentioned there, this is actually a probabilistic outcome disguised as a closed solution. I give a "power surge scenario" in the comments, but there are other factors that can influence this outcome (and are not limited to just these instances):

  • Your battery is dead for your first several trials. While you are gone, somebody replaces it; you press the button again and suddenly the computer turns on. You have falsely rejected the null because the button wasn't the issue.
  • Your glasses were not on that day and you can't see the buttons clearly. You think you are pressing the power button, but you actually pressed something else and now the computer is on. So your perception was the computer was turned on by the power button, but it was actually something entirely different.
  • The computer doesn't turn on because you didn't apply enough pressure when pushing down with your finger. You retain the null hypothesis but this in fact isn't true...you simply didn't push down hard enough on the button.

This is actually in line with what the professor said...getting the laptop to power on or not power on may be purely by chance, but having a solid theory and evidence behind your claim will help reduce the risk of chance outcomes and give you a solid starting point for attacking the problem.

To your question in the comments about cross-validation...you have asked a lot here so I will simply state that cross-validation doesn't become a magic bullet for solving the dilemma. Karolis has provided some reasons for why this can be an issue that I agree with.

  • Quite well articulated. +1. – User1865345 Jan 15 '23 at 07:24
  • @Shawn Hemelstrand: Thank you so much for your answer! – stats_noob Jan 15 '23 at 07:37
  • This question is related to a previous question I had asked earlier: https://stats.stackexchange.com/questions/601224/statistical-validity-of-comparing-proportions-across-groups-post-hoc – stats_noob Jan 15 '23 at 07:39
  • Well set out answer. One query. The power button analogy seems reasonable on the surface. However, the power button turned out to be a true result. So does p-hacking become an issue if, in fact, it actually verified a (hidden) truth? – Mari153 Jan 15 '23 at 07:40
  • That is an interesting question. Consider if you are a computer tech employee at a company in a similar situation and you are promoted because of a "confirmatory" approach you did not in fact posit. Your colleagues may have a false sense of your troubleshooting skills (EDA) as well as your ability to posit practical questions to problems (NHST) because of unethical practices. The issue is not solving the problem by serendipitous events...it is only an issue when posed as intentionally devised, in my opinion (at least those are my first thoughts on it). – Shawn Hemelstrand Jan 15 '23 at 07:46
  • I do not like the computer-button example. Pressing a button and seeing how the computer responds always tells us the answer, so there is no room for a false positive result. This is like doing exploratory analysis and having an oracle which for each significant result says whether it's real or false. There would never be an issue with p-hacking if we had that. – Karolis Koncevičius Jan 15 '23 at 07:53
  • That's not actually true (or at least based off my thinking). Pressing a button and it turning on/off doesn't always tell you the answer. A situation I can think of is when there is a power surge in your apartment, and then suddenly at the moment your electricity turns on, you press the button and the computer is now able to turn on. You may have potentially answered the question at this point, but it's not guaranteed to be true. – Shawn Hemelstrand Jan 15 '23 at 07:55
  • Thanks everyone! Can either of you comment on the cross validation idea in the asthma example? – stats_noob Jan 15 '23 at 08:04
  • @stats_noob in my impression your questions are too open ended. A good place to start is to clarify the question you have as best as possible so that you can get a clear answer. Now for example this question has multiple parts, and then you link it to another question with multiple parts and edits and follow up questions that change the original question itself. My suggestion is to really distill the question you have into a simple example and a few sentences. One question mark allowed. Then it will be easier for both sides: easier to give an answer and easier to understand the answer given. – Karolis Koncevičius Jan 15 '23 at 08:20
  • I agree. The question is already very convoluted and I think both Karolis and I have provided comprehensive answers already. I've edited my answer for some additional clarifications but I mention that Karolis already gave a good enough answer to the CV question. – Shawn Hemelstrand Jan 15 '23 at 11:10
  • I agree with @Karolis. An even better place to start would be to first search through Cross Validated for similar threads. Your most open-ended questions tend to have been already asked and answered (because they are indeed interesting & good questions and others have studied them before.) – dipetkov Jan 15 '23 at 14:43
5

The problem with p-hacking in my view is not the hacking itself, but rather the interpretation of the outcome. Your professor is right in saying that if you test and test and test, you will at some point find significance even if nothing is going on. This means that significances found in this way cannot be reliably interpreted as meaningful.

However, if you come at things from an exploratory angle, this should not be the point. Whatever turns up from an exploratory analysis should not be interpreted as a meaningful final result (some people argue that significances from testing should never be interpreted in this way, but I leave this discussion out here). Rather, it should be interpreted as a potential hint at something, possibly to be investigated on new independent data. If you look at certain data, test one hypothesis and get $p=0.44$ and another and get $p=0.0002$, the data that you have are more critical against the second null hypothesis than against the first, and there may be something going on. This kind of information is gradual and not dichotomous; if you run lots of tests, you should know that $p=0.04$ isn't particularly convincing, $p=0.012$ or $p=0.009$ may still occasionally happen accidentally, but $p=10^{-8}$ is still quite a strong indication (unless you have so much data that basically everything becomes significant; and also always look at effect sizes!).
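
As a rough illustration of why this is so (a back-of-the-envelope calculation assuming $m$ independent tests of true null hypotheses, which informal EDA will only approximate): the chance of seeing at least one p-value below $\alpha$ is
$$P\left(\min_i p_i \le \alpha\right) = 1-(1-\alpha)^m,$$
so with $m=100$ such tests this is about $0.60$ for $\alpha = 0.009$ but only about $10^{-6}$ for $\alpha = 10^{-8}$, which is why a very small p-value remains informative even after a lot of looking.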

Digression: There are formal procedures for multiple testing such as Bonferroni, Benjamini-Hochberg etc. I think it is good to know about these, because they give some information about what kind of p-value to expect in situations where multiple testing goes on. However, I think in EDA their worth is that they provide some orientation, but they shouldn't be literally followed and "trusted". They are based on simplifying assumptions (e.g., Bonferroni can be very conservative), and they can't take into account all the informal, visual snooping around that is done in EDA. Keep in mind that the "p-hacking" problem does not only come from running multiple tests, but also from running tests conditionally on informal decisions ("I run test $T$ because this connection here looks nonlinear, so I try out adding a squared term and test it"), which runs counter to the standard assumptions of the tests. (End digression.)
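
For orientation, here is a small sketch of how two of these corrections behave on a vector of p-values, using Python's statsmodels (the p-values below are invented for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Invented p-values: a few "promising" ones mixed with unremarkable ones.
pvals = np.array([0.0002, 0.009, 0.012, 0.04, 0.20, 0.44, 0.75])

for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, np.round(p_adj, 4), reject)

# Bonferroni multiplies each p-value by the number of tests (capped at 1) and
# is often very conservative; Benjamini-Hochberg controls the false discovery
# rate instead and rejects more of the small p-values.
```

As noted above, such corrections only cover the tests you can actually enumerate, not the informal looking around that precedes them.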

Ultimately I'd use p-values as exploratory indicators for things that may be worthwhile investigating, without interpreting them as a final meaningful result.

Another issue is the automatic use of multiple significance tests inside a procedure that ultimately does something else, for example a variable selection/prediction routine. You are right that the success of these procedures can be tested and validated on independent data, and the procedure can be seen as good to the extent that it leads to a model with good prediction power, if that's your ultimate aim. The issue here is the same as before: The significance tests are not interpreted as individually meaningful; i.e., you shouldn't claim that just because a model with variable $X_4$ was selected in such a procedure you have significant evidence that variable $X_4$ is important in a meaningful way. The tests here are not used as tests in the usual sense, but rather as building blocks of a method that actually does something else, and has to be assessed on its own merits.

Now of course one can ask whether it is a good idea to use significance tests in situations in which they cannot be interpreted in the standard way. My answer would be: it depends. You can often hear these days that for variable selection, regularisation using ideas such as the Lasso is better than stepwise selection using significance tests, sometimes with the implicit suspicion that this is because the significance tests are used "wrongly". Well, my experience is, sometimes this is true and sometimes it isn't. I have seen data sets on which cross-validated prediction worked better using stepwise regression than the Lasso. If you don't have very many variables, this may not even be that exceptional.

Also, when doing exploratory analysis, I like to look at p-values just to address the question of how "surprising" a find should actually be seen given a model in which nothing meaningful is going on behind it. However, it is clear that this is just one bit of information, and the detailed visual impression and things such as effect sizes are still very relevant. Also, of course, I should not forget that "p-hacking" is going on, i.e., that the "model in which nothing meaningful is going on" is already violated to some extent by me snooping around. If I could formalise my "snooping" in advance, i.e., formulate a battery of things to test and maybe even some things to look at conditional on earlier results, I could generate data from a null model and in fact explore what kind of p-values I should expect to find in such a situation using the whole battery ($10^{-8}$ not often, I suspect).

A key to suitable interpretation may be to say goodbye to the idea that significance is something inherently good and meaningful. If you look at the data in order to find significances, you will find significances, many of which will likely be meaningless. We should try to address the questions that are meaningful to us using the data, and then a (strong) significance is just something of a marker for further attention.

  • Well said. I'm not as well versed in ML as some others are here, but I know it often deals with big data, and with that p values become less and less indicative of an effect on their own. EDA can at least attempt to elucidate why these effects get flagged. – Shawn Hemelstrand Jan 15 '23 at 11:31
  • Nice answer. I agree that there is nothing wrong with calculating p-values in EDA as, say, a measure of goodness of fit of competing models/hypotheses, just don't use them to claim statistical significance/insignificance. Significance claims would need separate data. – Graham Bornholt Jan 15 '23 at 19:07
3

Here is my take at the answer.

The problem with exploratory analysis

Your professor is correct - exploratory analysis, if done for long enough, will always find some trend that will seem interesting. You can easily check this yourself - just fill the data you had in class with random numbers (or shuffle it between participants so that all the real trends are gone) and perform an exploratory analysis. If you use hypothesis testing with a traditional 0.05 p-value cut-off, you should find that around every 20th result appears to be significant. But this cannot be true, since the data now is a random sea of numbers. Hence all these "findings" are false positives.
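
Here is a quick sketch of that check in Python, with purely simulated data (no real dataset assumed): an outcome with no relationship to any of the features still produces roughly 5% "significant" tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_features = 5_000, 200

# Pure noise: a binary outcome and continuous features with no real relationship.
outcome = rng.integers(0, 2, n)
features = rng.normal(size=(n, n_features))

# One two-sample t-test per feature, comparing outcome == 1 vs outcome == 0.
pvals = np.array([
    stats.ttest_ind(features[outcome == 1, j], features[outcome == 0, j]).pvalue
    for j in range(n_features)
])

print("Significant at 0.05:", (pvals < 0.05).sum(), "out of", n_features)
# Expect roughly 200 * 0.05 = 10 "findings", all of them false positives.
```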

The logic is quite simple - you cannot use such a procedure (extensive exploratory analysis) to claim you have found something when the same procedure would also find something where there is nothing to find.

The best way to address the issue

Of course in science everyone is doing exploratory analyses all the time and they are wonderful. Testing only one predefined hypothesis would be a waste of scientific resources. Imagine we collect 1000 biopsies from cancer patients and perform gene expression analysis (which can be an experiment worth hundreds of thousands of dollars). And then we only test whether one pre-selected gene is different between cases and controls. If it isn't, we throw the dataset away (because not doing so would be performing an exploratory analysis) and collect another sample to test another hypothesis. Nobody does that.

The way to deal with the issue is to have a "confirmatory" dataset which is then used to check whether the results found in the exploratory phase can be replicated. In an ideal case this confirmatory dataset should come from a separate collection of samples - not merely one collection split into two (a.k.a. "train" and "test" in machine learning). However, splitting into two is still better than having no dataset for confirmation.

So the whole procedure looks like this (a minimal code sketch follows the list):

  1. Collect a sample of data for exploratory analysis.
  2. Explore the trends in the data by trying anything you like to try.
  3. Out of the things you found - select a few that are strongest and most interesting.
  4. Collect another sample for confirmatory analysis.
  5. Perform a predefined test on the confirmatory dataset, based on what you found and selected in steps 2-3.
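
A minimal sketch of these steps in Python, assuming the data can simply be split in two (in practice a genuinely separate collection of samples is preferable, as noted above); the DataFrame, column names and the "finding" are all hypothetical:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data; in practice this would be the real patient dataset.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 100_000),
    "sex": rng.choice(["M", "F"], 100_000),
    "asthma": rng.integers(0, 2, 100_000),
})

# Step 1: set aside part of the data for exploration only.
explore = df.sample(frac=0.5, random_state=0)
confirm = df.drop(explore.index)

# Steps 2-3: explore freely on `explore` (compare asthma rates across many
# subgroups, plot, etc.) and keep the single most interesting comparison.
# Exploration is open-ended; here we just pretend "men over 50" was the finding.
finding = lambda d: (d["sex"] == "M") & (d["age"] > 50)

# Steps 4-5: one predefined test on the untouched confirmatory half.
table = pd.crosstab(finding(confirm), confirm["asthma"])
chi2, p, _, _ = stats.chi2_contingency(table)
print("Confirmatory p-value:", p)
```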

Another way of dealing with the issue

Another way to deal with the issue, as you have mentioned yourself, is to do a correction for multiple testing. So if in the exploratory phase you performed 100 hypothesis tests, you then use something like the Bonferroni correction to adjust the p-values such that, instead of every 20th being significant, the chance of at least one significant result out of these 100 (in data with no real trends) would be 5% (one in 20).
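
A small simulation of that family-wise guarantee (again with purely synthetic null results, so every "significant" result is false):

```python
import numpy as np

rng = np.random.default_rng(3)
n_reps, n_tests = 10_000, 100

# Under true null hypotheses, p-values are uniformly distributed on [0, 1],
# so we can simulate an "exploratory phase" of 100 null tests directly.
pvals = rng.uniform(size=(n_reps, n_tests))

print("P(at least one raw 'finding' at 0.05):  ",
      (pvals < 0.05).any(axis=1).mean())             # ~0.99
print("P(at least one Bonferroni 'finding'):   ",
      (pvals < 0.05 / n_tests).any(axis=1).mean())   # ~0.05
```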

The issue here of course is that it is hard to track the number of tests performed in the exploratory phase. Also, exploration involves a lot of peeking at the data. Say you make some kind of scatter plot and you see that there is a difference between adults and children. Did you perform one test here? No. Surprisingly, you might have performed 10 tests with just one glance. Probably, unknowingly, you also checked for differences between men and women, between different socioeconomic statuses, etc. And you just formally checked the one that had the best chance of turning out significant. So in the end it's hard to estimate how to correct for multiple testing.

On top of that, many real world experiments nowadays (e.g. the gene expression example) involve multiple testing even in the exploratory phase. For example, imagine you are looking at which genes are expressed differently between cases and controls, between cases and controls stratified by age, between cases and controls stratified by sex, etc. Here at each stage you are checking thousands of genes at once, and you should do some form of multiple testing correction (probably something like FDR) at each exploratory step.

So multiple testing correction is only feasible in quite restricted exploratory analyses where you know how many tests you performed. Otherwise it gets out of hand.

Is cross validation an answer

Just like in machine learning you can't train a model on the full data and then check it by doing cross validation, here you cannot do exploratory analysis on the whole dataset and then cross-validate the hypotheses that were significant. If the whole dataset shows a trend (real or perceived), this already predisposes all the subsets of that dataset to follow this "detected" pattern.

So this tells us that we have to split the data into parts before doing exploratory analysis and then perform the exploration only on one part and test on the others. But is there a need for multiple parts as in cross validation, or is one "confirmatory" dataset enough - that is a good question. I think the answer is - the more confirmations, the better. It should boil down to the probability of accepting a false-positive. If something replicates on one confirmatory dataset with a p-value of 0.05 - there is a 5% chance of a false positive. If it replicates on two the chance is 0.25%. But probably we can achieve the same thing by simply adjusting our p-value for confirmation to be below 0.05. Also here we should remember that a real confirmation comes from a different sample altogether, not the same sample split into multiple parts.

Why is this not a problem with predefined hypotheses

Well, no. There is a difference between testing a single hypothesis and fitting a single classifier. A classifier will go through an optimisation stage and will try to fit the data as well as it can. A predefined hypothesis in this case would be more similar to the final model of a classifier. If you have a final model (say a neural network with all weights set in place) - you can test its performance on a dataset once, without cross validation, as there is nothing to cross-validate. Same idea here. When you have a pre-defined hypothesis it is not adapting to the data anymore. The test just measures how likely that effect is to be obtained by chance.

Hence, predefined hypotheses are not potentially misleading, when tested correctly. Also note that sample size does not change the probability of a false positive - you will have the same chance of getting a significant result on random data with 10 samples and with 1,000,000 samples. The main thing sample size gives you is statistical power - that is, it increases the chance of detecting TRUE trends.

Why is stating all the procedures and all the results not an answer

Think about the proposed exploratory experiment on random data. We can test 1000 hypotheses and would find around 50 significant trends. We can then publish the results of our exploration, say that we tested 1000 hypotheses, describe each of them, and report that we found 50 to be significant and showcase these. But all of these are false positives. So merely describing how we p-hacked does not prevent us from p-hacking, and there is no way to judge which of these results are true, if any.

  • @Karolis Koncevičius: Thank you for your answer! – stats_noob Jan 15 '23 at 07:44
  • I think your answer makes sense too. You do have to weigh the cost-benefit of how many hypotheses you define, as well as consider the context of your situation. A situation like ML somewhat glides between both worlds conceptually and isn't totally black and white in my opinion. – Shawn Hemelstrand Jan 15 '23 at 07:50
1

couldn't procedures like Cross Validation be used

Yes, p-hacking can certainly be reduced by techniques like these that represent the significance correctly.

But that's the whole point about p-hacking: it is what happens when techniques such as cross validation are not used, and when the expression of statistical significance is corrupted by ignoring the multiple repetitions that were used to observe some result.

P-hacking is a mistake where we express significance incorrectly or out of context. Sure, this mistake can be 'solved' by not making the mistake - by expressing the significance correctly or by adding more independent data to test the significance of a potential effect.

Sidenote: Cross validation can itself be prone to p-hacking, as the same validation set is fitted by multiple models (the point of cross validation is to select a model that has the best generalising performance, the best balance between variance and bias). It is another (third) test dataset, which is only used once on a single model, that gives the correct indication of statistical significance. (What is the difference between test set and validation set?)
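
To make the sidenote concrete, here is a minimal sketch of a train/validation/test split in Python with scikit-learn; the model family, the split fractions and the simulated data are arbitrary choices for illustration, not anything from the thread:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Simulated stand-in data; X would be the patient features, y the asthma label.
rng = np.random.default_rng(4)
X, y = rng.normal(size=(10_000, 5)), rng.integers(0, 2, 10_000)

# Hold out a test set that is touched exactly once, at the very end.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the rest into train (for fitting) and validation (for comparing models).
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1.0)]
scores = [m.fit(X_train, y_train).score(X_val, y_val) for m in candidates]
best = candidates[int(np.argmax(scores))]

# Only the single selected model sees the test set, and only once.
print("Validation scores:", np.round(scores, 3))
print("Final test score: ", round(best.score(X_test, y_test), 3))
```

Many candidate models are compared on the validation set, which is exactly why its scores are optimistically biased; the held-out test set is looked at once, for the single selected model.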