
This question is related to the post How much do we know about p-hacking "in the wild"?. That post does not clearly delineate, for me, the boundary between what counts as forking and what does not.

Suppose I have a count model with some predetermined covariates. Assume the counts contain many zeroes, but I do not know whether I should use zero-inflated regression. Assume further that there are no extra data available to validate the model, and that I only care about inference.

Step 1: Because the response is a count, the naive choice is Poisson regression. Inspecting the Pearson residuals of that regression, I concluded there is a possibility of overdispersion/underdispersion. I then decided to use a zero-inflated (hurdle) regression with log and logit links for the count and binomial components, respectively.

Step 2: After inspecting the regression diagnostics from Step 1, I decided the link function for the count component may not be appropriate, so I changed the count regression's link function.

Q1: I think I have committed forking in Step 1. Is that correct? Should one simply jump to the zero-inflated model in the first step? That sounds as if one should always start with a fairly complex model.

Q2: Does Step 2 alone count as forking? Changing the link function changes the model, and so does adding or dropping covariates after discovering confounding (or its absence). Should I treat Step 2, and adding/dropping covariates, as forking?

Q3: Since the data are fixed, even if Step 1 or Step 2 is not forking, the p-value is fixed regardless of whether I look at the summary of the result. So it seems that, given fixed data, the p-value is determined for whatever model is applied, independently of whether I look at the summary or not. On this view, any data-dependent model adjustment seems to be forking. If that is the case, why do we select models for inference after obtaining the data in observational/experimental studies?

user45765