I think I know what I need to do here, but I want a gut check, and I might need some direction on which packages and processing steps to use.
My goal is to discover the latent value of products that people buy, where the products are packaged together in somewhat unintuitive ways. I have several levels of packaging that I need to pick through. The first is the "interface": the product that people actually use, in the sense of being the front end. Then there is the "package", which is what people actually pay money for. Then there is the "content set". A package is made up of multiple content sets and an interface. I want to determine the value of the interface. An analogy is a cable TV subscription. I pay for "gold+sports". What is the value of "BBC America"? What is the value of the cable subscription itself (the ability to access the content on the platform)? Also, some of my customers have streaming only, some have traditional cable, and some have both.
To extend the analogy, my data looks like this:
| Customer | Price Paid (total monthly) | Interface | line item cost | Package | Channel |
|---|---|---|---|---|---|
| Bill | 100 | cable | 100 | gold | Local channel 1 (denver) |
| Bill | 100 | cable | 100 | gold | Local channel 2 (denver) |
| Bill | 100 | cable | 100 | gold | BBC America |
| Bill | 100 | cable | 100 | gold | History Channel |
| Bill | 100 | cable | 100 | gold | Comedy Central |
| Suzy | 150 | cable + app | 100 | gold | Local channel 1 |
| Suzy | 150 | cable + app | 100 | gold | Local channel 2 |
| Suzy | 150 | cable + app | 100 | gold | BBC America |
| Suzy | 150 | cable + app | 100 | gold | History Channel |
| Suzy | 150 | cable + app | 100 | gold | Comedy Central |
| Suzy | 150 | cable + app | 50 | sports | NBCSN |
| Suzy | 150 | cable + app | 50 | sports | FS1 |
| Jimmy | 60 | app | 60 | bronze | History Channel |
| Jimmy | 60 | app | 60 | bronze | Comedy Central |
| Clara | 50 | app | 50 | silver | History Channel |
| Clara | 50 | app | 50 | silver | Comedy Central |
| Clara | 50 | app | 50 | silver | BBC America |
| Clara | 50 | app | 50 | silver | App-only National News |
The problem is that I have 1000 different customers, with 100,000 possible "channels".
I think what I need is a GLM or maybe a GLME (generalized linear mixed-effects) model. It seems like what I'm trying to build is a hierarchical linear model where I have the top-line value at the customer level, discover the interface value, have the line item value for each package, and discover the channel value. Does that sound right? What package/code makes the most sense for that? Either R or Python is fine. I think I'm trying to predict total value by customer (random effect) + package (fixed) + package line item price (random) + channel.
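One way the additive structure described above could be written down, as a sketch only. The symbols are illustrative assumptions rather than anything fixed by the data: $y_c$ is the total monthly price paid by customer $c$; $I_c$, $P_c$, and $K_c$ are the sets of interfaces, packages, and channels that customer has; and the choice of which terms are fixed or random effects is deliberately left open.

$$
y_c = \mu + \sum_{i \in I_c} \alpha_i + \sum_{p \in P_c} \beta_p + \sum_{k \in K_c} \gamma_k + \varepsilon_c
$$

Under this reading, $\alpha_i$ would be the interface value and $\gamma_k$ the channel value. For Suzy in the table, $y_c = 150$, $I_c = \{\text{cable}, \text{app}\}$, $P_c = \{\text{gold}, \text{sports}\}$, and $K_c$ contains her seven channels.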
If the right approach is GLME, I'm not sure what the right encoding strategy would be for all of these combinations of interface, package, and channel. It seems like one-hot or dummy encoding would give me a dataframe that is 100,000 columns wide and very sparse.
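To make the encoding concern concrete, here is a minimal sketch of a customer-by-channel indicator matrix held in a sparse format, so that only the nonzero entries are stored. It assumes `pandas`, `numpy`, and `scipy` are available; the rows are toy data copied from the table above (local channels omitted for brevity), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy (customer, channel) pairs copied from the table above.
rows = [
    ("Bill", "BBC America"), ("Bill", "History Channel"), ("Bill", "Comedy Central"),
    ("Suzy", "BBC America"), ("Suzy", "History Channel"), ("Suzy", "Comedy Central"),
    ("Suzy", "NBCSN"), ("Suzy", "FS1"),
    ("Jimmy", "History Channel"), ("Jimmy", "Comedy Central"),
    ("Clara", "History Channel"), ("Clara", "Comedy Central"),
    ("Clara", "BBC America"), ("Clara", "App-only National News"),
]
df = pd.DataFrame(rows, columns=["customer", "channel"]).drop_duplicates()

# Categorical dtypes give each customer and channel an integer code
# (categories are ordered alphabetically for strings).
cust = df["customer"].astype("category")
chan = df["channel"].astype("category")

# Build the indicator matrix: one row per customer, one column per channel,
# with a 1 wherever that customer has that channel.
X = sparse.csr_matrix(
    (np.ones(len(df)), (cust.cat.codes.to_numpy(), chan.cat.codes.to_numpy())),
    shape=(len(cust.cat.categories), len(chan.cat.categories)),
)

print(X.shape, X.nnz)
```

The same construction would work for interface and package indicators, and `scipy.sparse.hstack` can join the resulting blocks into a single design matrix. Whether a given mixed-model package accepts a sparse design matrix is a separate question that this sketch does not answer.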