Let's say I have data on 10 stores that sell widgets. Each store received num_orders orders in a certain timeframe and sold a total of total_qty widgets over that timeframe. In other words, for store i, the average quantity per order is total_qty_i / num_orders_i.
My goal is to use this data to produce a prediction interval for total_qty that accounts for the fact that there are two layers of uncertainty:
- We don't know how many orders a store will get: there is a distribution of num_orders.
- Even once we know how many orders the store gets, there is uncertainty because we don't know what the quantity for each order will be (a quick simulation below illustrates the two layers).
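For intuition, here is a two-layer simulation assuming for the moment that both rates are known (the values 50 and 2 are made up for illustration, not estimated from anything):

set.seed(1)
lambda_orders <- 50   # assumed average number of orders (made up)
lambda_qty    <- 2    # assumed average quantity per order (made up)

n_sims <- 10000
orders    <- rpois(n_sims, lambda_orders)        # layer 1: how many orders arrive
total_qty <- rpois(n_sims, lambda_qty * orders)  # layer 2: sum of per-order quantities
quantile(total_qty, c(0.05, 0.95))               # 90% interval with known parameters

A full answer should also account for not knowing lambda_orders and lambda_qty, which is where the Bayesian approach described below comes in.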
What's the best way to do this?
Let's say the data look like this (in reality, of course, I don't know the distributions that gave rise to the data):
library(tibble)

set.seed(123)  # for reproducibility
num_stores <- 10
df <-
  tibble(store_id   = 1:num_stores,
         num_orders = rpois(num_stores, 50),
         # note: in this toy data total_qty is drawn independently of num_orders
         total_qty  = rpois(num_stores, 100))
My idea is to use a Bayesian approach:
- Write down a generative probability model
- Fit the model to the data and get samples of parameters from the posterior distribution
- Simulate a large number of new data points using those posterior samples (i.e. draw from the posterior predictive distribution), and take the 5th and 95th quantiles of the simulated values.
Is this a sensible approach?
As an example, this is the probability model I might start with:
total_qty ~ dpois(lambda_qty*orders) # sum of Poissons is Poisson
orders ~ dpois(lambda_orders)
lambda_qty ~ dnorm(30, 2) # prior for avg qty per order
lambda_orders ~ dexp(1) # prior for avg num orders
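To make steps 2 and 3 concrete, here is a minimal base-R sketch. It swaps the dnorm/dexp priors above for conjugate Gamma priors, purely so the posterior can be sampled directly without Stan or JAGS; the Gamma hyperparameters (a0, b0, c0, d0) are arbitrary weak-prior assumptions.

library(tibble)

set.seed(42)
num_stores <- 10
df <- tibble(store_id   = 1:num_stores,
             num_orders = rpois(num_stores, 50),
             total_qty  = rpois(num_stores, 100))

n_draws <- 10000

# Step 2a: posterior for lambda_orders.
# Poisson likelihood + Gamma(a0, b0) prior
# => Gamma(a0 + sum(num_orders), b0 + num_stores)
a0 <- 1; b0 <- 0.02                 # assumed weak prior
lambda_orders <- rgamma(n_draws,
                        a0 + sum(df$num_orders),
                        b0 + num_stores)

# Step 2b: posterior for lambda_qty (per-order rate).
# total_qty_i ~ Poisson(lambda_qty * num_orders_i) + Gamma(c0, d0) prior
# => Gamma(c0 + sum(total_qty), d0 + sum(num_orders))
c0 <- 1; d0 <- 0.5                  # assumed weak prior
lambda_qty <- rgamma(n_draws,
                     c0 + sum(df$total_qty),
                     d0 + sum(df$num_orders))

# Step 3: posterior predictive for a new store, with both layers of uncertainty
orders_new <- rpois(n_draws, lambda_orders)
qty_new    <- rpois(n_draws, lambda_qty * orders_new)
quantile(qty_new, c(0.05, 0.95))    # 90% prediction interval

With the priors as originally written, the same three steps would be carried out in Stan, JAGS, or brms; only the posterior-sampling step changes.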
I'm pretty new to Bayesian analysis, so would appreciate any tips.
I meant total_qty as a discrete variable - e.g. one order could be for 2 widgets, or 3 widgets, but not 2.5. But I can see that the way I named it might suggest dollar amounts instead, which would be continuous. In any case, your response is very valuable in validating the general idea I had. – Nayef Oct 27 '20 at 06:09