I am comparing several modeling approaches to semi-continuous data (many exact zeros and continuous positive cost outcomes) to assess the effect of the main predictor "disease" on cost.
The models I'm comparing are a Tweedie model and a gamma hurdle model.
The Tweedie model is in the form:
glm(cost ~ disease + age + gender + offset(log(days_in_study)),
data = df, family = tweedie(link.power = 0, var.power = xi.max))
The hurdle model is in the form:
binary component:
glm(I(cost > 0) ~ disease + age + gender + offset(days_in_study),
data = df, family = "binomial")
continuous component:
glm(cost ~ disease + age + gender + offset(log(days_in_study)),
data = subset(df, cost > 0), family = Gamma(link = "log"))
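To make the two approaches comparable on one scale, here is a minimal sketch of how the two hurdle parts combine into a single expected cost, P(cost > 0) * E[cost | cost > 0], which is the same quantity the Tweedie model's fitted mean targets. It runs on simulated data, not my real data: the sample size, the coefficients, and the choice to enter log(days_in_study) as a covariate in the binary part are all assumptions made for illustration.

```r
# Simulated stand-in for df; every value and coefficient here is made up.
set.seed(1)
n  <- 500
df <- data.frame(
  disease       = rbinom(n, 1, 0.3),
  age           = round(runif(n, 20, 80)),
  gender        = factor(sample(c("F", "M"), n, replace = TRUE)),
  days_in_study = sample(30:365, n, replace = TRUE)
)
p_pos   <- plogis(-1 + 0.8 * df$disease + 0.01 * df$age)  # P(cost > 0)
mu_pos  <- df$days_in_study * exp(1 + 0.3 * df$disease)   # E[cost | cost > 0]
df$cost <- rbinom(n, 1, p_pos) * rgamma(n, shape = 2, rate = 2 / mu_pos)

# Part 1: probability of any cost.
# Assumption: exposure enters as a covariate rather than an offset.
fit_bin <- glm(I(cost > 0) ~ disease + age + gender + log(days_in_study),
               data = df, family = binomial)

# Part 2: mean cost among people with positive cost.
fit_pos <- glm(cost ~ disease + age + gender + offset(log(days_in_study)),
               data = subset(df, cost > 0), family = Gamma(link = "log"))

# Combined expected cost per person: P(cost > 0) * E[cost | cost > 0].
df$pred_hurdle <- predict(fit_bin, newdata = df, type = "response") *
                  predict(fit_pos, newdata = df, type = "response")

# Average predicted cost by disease status.
aggregate(pred_hurdle ~ disease, data = df, FUN = mean)
```

The point of the sketch is that "disease" can act through either part or both, so a weak coefficient in one part alone does not by itself mean a weak effect on overall expected cost.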
How do I compare these models to choose between them? The Tweedie model indicates that the variable "disease" is strongly predictive of costs. The hurdle model indicates that it is not. So while I know no model is per se "right," my choice of model determines the entire outcome. Common measures like the AIC don't seem to be well-defined for two-part models, per various CrossValidated posts.