I'm sending a number of DNA test kits to customers each day. The customers swab their cheeks to gather DNA and send back the kit for processing.

I have data about each test kit that has been sent for the past two years. The data includes the date it was sent, the state that it was sent to, and the date that it was received back (if it has been received).

We process each kit the day it is returned. I'm trying to forecast the number of kits that will be returned each day, looking forward 60 days, so I can staff accordingly for processing. Each state will process its own tests so I need to estimate the number of kits to be processed in each state for each of the next 60 days (3000 estimates updated daily).

Many customers return their kits within one week, but some take a few weeks, sometimes up to two months. Some never return their kits at all. (We can treat kits sent more than two months ago as if they will never be returned if it simplifies the problem.)

We will update the forecasts each day. Forecasts far out are expected to be less reliable; most of the kits that will be processed in 60 days haven't been sent out or even ordered yet. Forecasts for dates within the next few days should be more reliable since we know how many kits are outstanding.

A confidence interval estimate, e.g. 30-50 kits at 80% confidence in NY on June 1, 2023, would be great, but a point estimate would be helpful too.

The data looks like this:

Sent,Returned,State
2023-01-02,2023-01-10,CA
2023-01-02,2023-01-15,NY
2023-01-04,NA,CA

What's a good way of modelling this problem?

I've been looking into using ARIMA on the number of kits returned each day, but I'm not sure how to make the predictions factor in the number of kits sent but not yet returned.

I also started looking at Cox survival modeling, but I'm not clear on how to use it to predict the number of kits that would be returned each day.

I also thought about using multiple linear regression with 60 columns for the number of kits sent 1 day ago, 2 days ago, etc., but I'm not sure how to deal with the sparseness as we try to predict further and further out.

Sycorax
cwarden

1 Answer

I think that you can solve this pretty easily using some simple methods that are commonly used in survival analysis, also called time-to-event analysis.

For each kit that we have not received, we know that a duration $d$ has elapsed, and we want to know the probability that the total duration will be between $d$ and $T=d + k$ for some $k>0$. In other words, we've waited $d$ days to receive it so far, and we want to know the probability that we will have to wait $d+1, d+2, d + 3.17, \dots, d + 59.9, d+60$ days.

We can do this if we can estimate the probability distribution of $t$, the total time elapsed between sending the kit out and when the kit is returned. For simplicity, we can assume this is an absolutely continuous random variable.

From probability theory we know that $$p=\mathbb P(t < T | t > d) = \frac{F(T)-F(d)}{1 - F(d)} $$ for $F$ the CDF. So we just need to estimate a suitable distribution $F,$ whence we can compute the probability of receiving each kit on each day between $k=1$ and $k=60$ given the $d$ that we know for each kit.
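To get the probability for one specific day rather than "by day $T$", take the difference of two values of this conditional probability: $\mathbb P(d+k-1 < t \le d+k \mid t > d) = \frac{F(d+k)-F(d+k-1)}{1-F(d)}$. For a kit sent 5 days ago, the probability of a return 3 days from now is therefore $(F(8)-F(7))/(1-F(5))$. Here is a minimal sketch in Python; the Weibull parameters and the function names are made up for illustration and are not estimated from any data.

```python
import math

def weibull_cdf(t, shape, scale):
    """CDF of a Weibull distribution; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

def prob_returned_on_day(d, k, cdf):
    """P(d + k - 1 < t <= d + k | t > d) for a kit outstanding d days."""
    survival = 1.0 - cdf(d)
    if survival <= 0.0:
        return 0.0  # the fitted F says the kit should already be back
    return (cdf(d + k) - cdf(d + k - 1)) / survival

# Hypothetical parameters, for illustration only.
F = lambda t: weibull_cdf(t, shape=1.2, scale=10.0)

# Kit sent 5 days ago: probability of arrival on each of the next 60 days.
daily = [prob_returned_on_day(5, k, F) for k in range(1, 61)]

# The daily probabilities add up to the cumulative formula with T = d + 60.
by_day_60 = (F(65) - F(5)) / (1.0 - F(5))
```

The daily values sum to the cumulative probability because the terms telescope, so the two views are consistent with each other.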

So you need to estimate $F$. If you wish to use a parametric distribution, my recommendation for $F$ is to choose a distribution that has support only for positive values (because you're measuring duration). Some suggestions:

  • Exponential
  • Gamma
  • Weibull

Of course, there are lots of other parametric options. You could also use a non-parametric estimate of $F$, such as the empirical CDF (ECDF).
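As a rough sketch of the non-parametric route, an ECDF built from the observed return times could look like the following. The durations are invented, and this version uses returned kits only, so it ignores kits that are still outstanding.

```python
from bisect import bisect_right

def make_ecdf(durations):
    """Return a function t -> fraction of observed durations <= t."""
    xs = sorted(durations)
    n = len(xs)

    def ecdf(t):
        return bisect_right(xs, t) / n

    return ecdf

# Made-up days-to-return for kits that have come back.
returned_days = [3, 5, 5, 8, 13, 21]
F_hat = make_ecdf(returned_days)
```

`F_hat` can then be plugged in wherever $F$ appears in the formula above.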

You'll have to decide what to do about the censoring that is present in your data.

  • The simplest option is to estimate $F$ using only the returned kits. This might work well enough if the number of non-returned kits is very small.
  • However, the simplest option will bias the estimated return times downward relative to what you would get with complete data; estimating the distribution in a way that accounts for the non-returned kits seems more prudent.
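To illustrate the difference between the two options, here is a sketch using the exponential distribution, where the censored maximum likelihood estimate has a closed form: the rate is the number of returned kits divided by the total observed time, counting the days elapsed so far for kits still outstanding. The data and the function name are hypothetical, and the sketch assumes every outstanding kit would eventually be returned, which is not true for kits that never come back.

```python
import math

def fit_exponential_censored(returned, outstanding):
    """MLE of the exponential rate with right-censored observations.

    returned:    days-to-return for kits that came back (events)
    outstanding: days elapsed so far for kits not yet back (censored)
    """
    exposure = sum(returned) + sum(outstanding)
    if not returned or exposure <= 0:
        raise ValueError("need at least one returned kit and positive exposure")
    return len(returned) / exposure

# Made-up data, for illustration only.
returned = [3, 5, 5, 8, 13, 21]
outstanding = [2, 10, 40]

rate_naive = len(returned) / sum(returned)  # returned kits only
rate_censored = fit_exponential_censored(returned, outstanding)

# Fitted CDF using the censoring-aware rate.
F_exp = lambda t: 1.0 - math.exp(-rate_censored * t) if t > 0 else 0.0
```

The naive rate is higher, which means a shorter estimated mean return time, and that is the direction of the bias described above.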

You can find a worked example in A. Clifford Cohen (1965) "Maximum Likelihood Estimation in the Weibull Distribution Based On Complete and On Censored Samples", Technometrics, 7:4, 579-588, DOI: 10.1080/00401706.1965.10490300

I would not expect the data about states to change any aspect of this -- my assumption is that all states would have essentially the same distribution $F$. If you find that is not the case, then the simplest thing is to build 50 models, one for each state. While this is simple, these models might differ wildly, especially for the states where data are scarce. A hierarchical model might ameliorate that.

After we have $F$, then estimating the number of kits we receive each day is trivial. If we assume that all kits are independent, then the expected number of kits received on a given day is the sum of the probabilities of receiving the kits (each kit $i$ is a Bernoulli trial with probability $p_i$). The easiest way to get an interval estimate is to do a Monte Carlo simulation, but there are probably better ways.
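A minimal sketch of that last step for one state and one forecast day, assuming independent kits. The $p_i$ values and the function name are invented, and the forecast covers only kits that have already been sent.

```python
import random

def forecast_day(probs, n_sims=5000, level=0.80, seed=0):
    """Expected count and a Monte Carlo interval for one forecast day.

    probs: per-kit probabilities p_i of arriving on that day.
    """
    rng = random.Random(seed)
    expected = sum(probs)  # mean of a sum of independent Bernoulli trials
    # Simulate the day's total n_sims times, one Bernoulli draw per kit.
    totals = sorted(
        sum(rng.random() < p for p in probs) for _ in range(n_sims)
    )
    lo_idx = int((1 - level) / 2 * n_sims)
    hi_idx = int((1 + level) / 2 * n_sims) - 1
    return expected, totals[lo_idx], totals[hi_idx]

# Hypothetical p_i for 500 outstanding kits in one state.
probs = [0.10] * 200 + [0.02] * 300
expected, lo, hi = forecast_day(probs)
```

Running this once per state and per forecast day gives the full grid of point estimates and intervals. The interval reflects only the randomness in returns given $F$; it does not include uncertainty in the estimate of $F$ itself.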

Sycorax
  • If it's been 5 days since a kit was sent, and I want to estimate the probability that the kit will be returned 3 days from now, is it (F(5+3) - F(5)) / (1 - F(5)), or is it (F(5+3) - F(5+3-1)) / (1 - F(5+3-1)) because in order for it to be returned on day 8, it can't have been returned on or before day 7? – cwarden May 16 '23 at 20:43
  • The formula means what it says, but perhaps you are considering discrete $F$ instead of absolutely continuous $F$? – Sycorax May 16 '23 at 21:26
  • I'd say discrete since we're only tracking returns at the granularity of a day. But I'm unclear on the implications of this assumption. If assuming a continuous distribution is simpler, that may be acceptable. – cwarden May 16 '23 at 21:50
  • The answer https://stats.stackexchange.com/questions/416193/conditional-survival-probability-up-to-time-t-given-t-s explains the implications – Sycorax May 16 '23 at 23:41
  • The linked question doesn't explain how to calculate the probability if we assume a discrete distribution. Assuming a continuous distribution, I'm still not clear on how to estimate the probability that a kit will be returned between day 7 and day 8 if it's currently day 3, as opposed to calculating the probability that it will be returned by day 8 if it's day 3. – cwarden May 17 '23 at 11:38
  • Both of these points can be addressed by consulting the definition of a CDF. – Sycorax May 17 '23 at 11:41