Let's say I have data on 10 stores that sell widgets. Each store received num_orders orders in a certain timeframe and sold a total of total_qty widgets over that timeframe. In other words, for store i, the average quantity per order is total_qty_i / num_orders_i.
My goal is to use this data to produce a prediction interval for total_qty that accounts for the fact that there are two layers of uncertainty:
- We don't know how many orders a store will get: there is a distribution of num_orders.
- Even once we know how many orders the store gets, there is uncertainty because we don't know what the quantity for each order will be (a quick simulation below illustrates the two layers).
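For intuition, here is a two-layer simulation assuming for the moment that both rates are known (the values 50 and 2 are made up for illustration, not estimated from anything):

set.seed(1)
lambda_orders <- 50   # assumed average number of orders (made up)
lambda_qty    <- 2    # assumed average quantity per order (made up)

n_sims <- 10000
orders    <- rpois(n_sims, lambda_orders)        # layer 1: how many orders arrive
total_qty <- rpois(n_sims, lambda_qty * orders)  # layer 2: sum of per-order quantities
quantile(total_qty, c(0.05, 0.95))               # 90% interval with known parameters

A full answer should also account for not knowing lambda_orders and lambda_qty, which is where the Bayesian approach described below comes in.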
What's the best way to do this?
Let's say the data look like this (in reality, of course, I don't know the distributions that gave rise to the data):
library(tibble)

set.seed(123)  # for reproducibility
num_stores <- 10
df <-
  tibble(store_id   = 1:num_stores,
         num_orders = rpois(num_stores, 50),
         # note: in this toy data total_qty is drawn independently of num_orders
         total_qty  = rpois(num_stores, 100))
My idea is to use a Bayesian approach:
- Write down a generative probability model
- Fit the model to the data and get samples of parameters from the posterior distribution
- Simulate a large number of new data points using those posterior samples (i.e. draw from the posterior predictive distribution), and take the 5th and 95th quantiles of the simulated values.
Is this a sensible approach?
As an example, this is the probability model I might start with:
total_qty ~ dpois(lambda_qty*orders) # sum of Poissons is Poisson
orders ~ dpois(lambda_orders)
lambda_qty ~ dnorm(30, 2) # prior for avg qty per order
lambda_orders ~ dexp(1) # prior for avg num orders
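To make steps 2 and 3 concrete, here is a minimal base-R sketch. It swaps the dnorm/dexp priors above for conjugate Gamma priors, purely so the posterior can be sampled directly without Stan or JAGS; the Gamma hyperparameters (a0, b0, c0, d0) are arbitrary weak-prior assumptions.

library(tibble)

set.seed(42)
num_stores <- 10
df <- tibble(store_id   = 1:num_stores,
             num_orders = rpois(num_stores, 50),
             total_qty  = rpois(num_stores, 100))

n_draws <- 10000

# Step 2a: posterior for lambda_orders.
# Poisson likelihood + Gamma(a0, b0) prior
# => Gamma(a0 + sum(num_orders), b0 + num_stores)
a0 <- 1; b0 <- 0.02                 # assumed weak prior
lambda_orders <- rgamma(n_draws,
                        a0 + sum(df$num_orders),
                        b0 + num_stores)

# Step 2b: posterior for lambda_qty (per-order rate).
# total_qty_i ~ Poisson(lambda_qty * num_orders_i) + Gamma(c0, d0) prior
# => Gamma(c0 + sum(total_qty), d0 + sum(num_orders))
c0 <- 1; d0 <- 0.5                  # assumed weak prior
lambda_qty <- rgamma(n_draws,
                     c0 + sum(df$total_qty),
                     d0 + sum(df$num_orders))

# Step 3: posterior predictive for a new store, with both layers of uncertainty
orders_new <- rpois(n_draws, lambda_orders)
qty_new    <- rpois(n_draws, lambda_qty * orders_new)
quantile(qty_new, c(0.05, 0.95))    # 90% prediction interval

With the priors as originally written, the same three steps would be carried out in Stan, JAGS, or brms; only the posterior-sampling step changes.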
I'm pretty new to Bayesian analysis, so would appreciate any tips.
I meant total_qty as a discrete variable - e.g. one order could be for 2 widgets, or 3 widgets, but not 2.5. But I can see that the way I named it might suggest dollar amounts instead, which would be continuous. In any case, your response is very valuable in validating the general idea I had. – Nayef Oct 27 '20 at 06:09