Hello: I am interested in modelling the purchase rate over time (an integer number of purchases, observed daily, for a number of shoppers) for a collection of 3 different products.
The data look a bit like this:
| Person ID | Date | Product ID | Product Name | Purchase Count |
|---|---|---|---|---|
| A | Mon | X | craft beer 6pack | 0 |
| A | Mon | Y | 24ct case generic lager | 0 |
| A | Mon | Z | 750mL bottle | 1 |
| A | Tue | X | craft beer 6pack | 0 |
| A | Tue | Y | 24ct case generic lager | 2 |
| A | Tue | Z | 750mL bottle | 0 |
Because the outcome of interest is a rate based on a count variable that includes true zeroes, something like Poisson regression appears to be a compelling option. However, I'm struggling to decide how to handle the fact that each product has a different case count/volume (e.g., a 6-pack of 12 oz cans is ~2129 mL of beer, so I would expect this item to be purchased less frequently than a 750 mL bottle, all else equal).
Some of my colleagues have suggested converting all of the products to a common measurement (e.g., mL), but this results in some non-integer values due to the introduction of a conversion factor. In response, I considered switching to a Gaussian model, but this is not ideal because the data are still very granular and bounded below by zero (values like {0, 1, 2} become {0, 2129.29, 4258.59}).
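To make the conversion issue concrete, here is a small Python sketch of the arithmetic for product X. The constant and variable names are my own, and I am assuming US fluid ounces.

```python
# Assumption: US fluid ounces, where 1 US fl oz = 29.5735295625 mL.
ML_PER_US_FL_OZ = 29.5735295625

# Volume of a single purchase of product X (a 6-pack of 12 oz cans).
ml_per_six_pack = 6 * 12 * ML_PER_US_FL_OZ

# Daily purchase counts, and the same outcomes expressed in mL.
purchase_counts = [0, 1, 2]
volumes_ml = [round(n * ml_per_six_pack, 2) for n in purchase_counts]

# Check whether the converted outcome is still integer-valued.
still_integer = all(v == int(v) for v in volumes_ml)

print(volumes_ml)
print(still_integer)
```

The converted values are no longer integers, yet they still sit on a coarse grid of multiples of the pack volume, which is why neither a count model on mL nor a Gaussian model feels right to me.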
One option that seems compelling, but that I am unsure about, is to use the offset term to capture the unit information for each product (e.g., use the mL-per-purchase or its inverse as the offset term).
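Writing the idea out, as far as I understand how offsets work with a log link (the notation is my own): let $y_{ijt}$ be the number of purchases of product $j$ by person $i$ on day $t$, let $v_j$ be the volume in mL of one purchase of product $j$, and let $\mathbf{x}_{ijt}$ be the covariates. Using $-\log v_j$ (the log of the inverse) as the offset would give

$$
\log \mathbb{E}[y_{ijt}] = -\log v_j + \mathbf{x}_{ijt}^\top \boldsymbol{\beta}
\quad\Longleftrightarrow\quad
\mathbb{E}[v_j \, y_{ijt}] = \exp\!\left(\mathbf{x}_{ijt}^\top \boldsymbol{\beta}\right),
$$

so the linear predictor would describe the expected mL purchased per person-day while the response itself stays an integer count. Using $+\log v_j$ instead would force the expected purchase count to rise in proportion to pack volume, which is the opposite of what I expect. I am not sure whether fixing the coefficient on $\log v_j$ at $-1$ in this way is defensible, or whether it should be estimated as an ordinary covariate.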
Has anybody ever used such a technique? If so, how did it work? What exactly did you use as the offset and how did you construct your model?