From a technical perspective, which approach should I follow for the modelling problem described below?
I have a target variable Y, a continuous random variable defined on the interval [0, ∞). For this reason (and because the data themselves support it), I decided to use a Tweedie distribution. I would also like a multiplicative model, so I am using a log link function.
I also know that Y depends linearly on the time variable: the longer the time, the higher the value of Y is assumed to be.
Given these conditions, I tried two different approaches:
- Modeling the variable directly and using `time` as a log offset. In R syntax the model would look like the following: `glm(Y ~ X1 + X2 + ... + offset(log(time)), family = tweedie(link = "log"))`
- Modeling the ratio of `Y` and `time` and using `time` as training weights. Defining `Y_time = Y / time`, we have `glm(Y_time ~ X1 + X2 + ..., weights = time, family = tweedie(link = "log"))`
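
To make the comparison concrete, here is how I understand the two specifications when written out. This is only a sketch under assumptions that are not stated above: $t_i$ denotes `time` for observation $i$, $\lambda_i = \exp(x_i^\top \beta)$ is the per-unit-time mean, $\phi$ is the dispersion, $p$ is the Tweedie variance power, and prior weights $w_i$ enter through the usual GLM convention $\operatorname{Var} = \phi V(\mu) / w_i$.

$$
\begin{aligned}
\text{Offset model:}\quad
  & \log \mathbb{E}[Y_i] = x_i^\top \beta + \log t_i,
  \qquad \operatorname{Var}(Y_i) = \phi \, \mathbb{E}[Y_i]^{p} = \phi \, t_i^{p} \, \lambda_i^{p} \\
\text{Weighted ratio model:}\quad
  & \log \mathbb{E}\!\left[\frac{Y_i}{t_i}\right] = x_i^\top \beta,
  \qquad \operatorname{Var}\!\left(\frac{Y_i}{t_i}\right) = \frac{\phi \, \lambda_i^{p}}{t_i}
  \;\Longrightarrow\; \operatorname{Var}(Y_i) = \phi \, t_i \, \lambda_i^{p}
\end{aligned}
$$

If this is right, both models imply the same mean, $\mathbb{E}[Y_i] = t_i \lambda_i$, and they differ only in how the variance of $Y_i$ is assumed to scale with time ($t_i^{p}$ versus $t_i$).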
Which approach is more theoretically sound?