I am comparing several modeling approaches to semi-continuous data (many exact zeros and continuous positive cost outcomes) to assess the effect of the main predictor "disease" on cost.

The models I'm comparing are a Tweedie model and a gamma hurdle model.

The Tweedie model is in the form:

library(statmod)  # provides the tweedie() family
glm(cost ~ disease + age + gender + offset(log(days_in_study)),
    data = df, family = tweedie(link.power = 0, var.power = xi.max))

The hurdle model is in the form:

binary component:

# response is the indicator of any positive cost
glm(I(cost > 0) ~ disease + age + gender + offset(log(days_in_study)),
    data = df, family = "binomial")

continuous component:

glm(cost ~ disease + age + gender + offset(log(days_in_study)),
    data = subset(df, cost > 0), family = Gamma(link = "log"))
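
To put the two parts of the hurdle model on the same scale as the Tweedie mean, they can be combined as E[cost] = P(cost > 0) × E[cost | cost > 0]. The following is only a minimal sketch under stated assumptions: the data are simulated stand-ins (every coefficient and distribution below is made up), the binary part is fitted on the indicator cost > 0, and the log offset in the binary part is an assumption carried over from the other models.

```r
# Simulated stand-in data; all values are hypothetical, not real study data.
set.seed(1)
n  <- 500
df <- data.frame(
  disease       = rbinom(n, 1, 0.3),
  age           = round(runif(n, 20, 80)),
  gender        = factor(sample(c("F", "M"), n, replace = TRUE)),
  days_in_study = sample(30:365, n, replace = TRUE)
)
# Some subjects incur no cost at all; positive costs are gamma-distributed
# with a mean proportional to the time spent in the study.
any_cost <- rbinom(n, 1, plogis(-0.5 + 0.8 * df$disease + 0.01 * df$age))
mu_true  <- exp(1 + 0.2 * df$disease + 0.01 * df$age) * df$days_in_study
df$cost  <- any_cost * rgamma(n, shape = 2, rate = 2 / mu_true)

# Binary component: probability of any positive cost.
fit_zero <- glm(I(cost > 0) ~ disease + age + gender + offset(log(days_in_study)),
                data = df, family = binomial)

# Continuous component: mean cost among subjects with positive cost.
fit_pos <- glm(cost ~ disease + age + gender + offset(log(days_in_study)),
               data = subset(df, cost > 0), family = Gamma(link = "log"))

# Combine the two parts into one unconditional expected cost per subject.
p_pos  <- predict(fit_zero, newdata = df, type = "response")
mu_hat <- predict(fit_pos,  newdata = df, type = "response")
df$expected_cost <- p_pos * mu_hat
```

In this sketch, "disease" can affect expected_cost through either component, so its coefficient in the gamma part alone describes the effect only among subjects with positive cost.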

How do I compare these models to choose between them? The Tweedie model indicates that the variable "disease" is strongly predictive of costs. The hurdle model indicates that it is not. So while I know no model is per se "right," my choice of model determines the conclusion. Common measures such as the AIC do not seem to be well defined for two-part models, according to various CrossValidated posts.
