Out of curiosity, I want to understand how to model this problem. I've heard people suggest linear regression, but I am not sure how to encode the problem in R (my attempt is included below), as I am a complete beginner in this area.
I have a task that can be run any number of times (each run is a task instance). Every time a task instance completes another 1%, I record the time elapsed since the instance started. So for each instance I have about 100 points (one per 1% increment) at which the elapsed time was recorded.
Given that I have this data for many instances, is it possible to predict the finish time of a new task instance?
TaskID Percent TimeElapsed
1: 1 0 0.2035333
2: 1 1 0.2062833
3: 1 2 0.2137167
4: 1 3 0.2180833
5: 1 4 0.2490833
---
3127: 31 96 4.9391667
3128: 31 97 4.9970500
3129: 31 98 5.5644500
3130: 31 99 5.6532667
3131: 31 100 5.8359833
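For reference, here is roughly how I read the data in and plot one trajectory per task (a sketch only; the file name is a placeholder for my paste, and ggplot2 is just what I happened to use):

# Sketch: load the dput() output and plot one elapsed-time trajectory per task.
library(data.table)
library(ggplot2)

dt <- as.data.table(dget("task_times_dput.R"))   # placeholder file name; columns TaskID, Percent, TimeElapsed

ggplot(dt, aes(Percent, TimeElapsed, group = TaskID)) +
  geom_line(alpha = 0.4) +
  labs(x = "Percent complete", y = "Time elapsed")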
A quick look at the task behavior (below) tells me there is a fair amount of variance in how the task behaves, so it's hinting that the output should not just be a time prediction but a time prediction with some confidence attached?
In addition, I'm thinking that just using the information about the current progress of the task might not be sufficient - the task may have slowed down at some of its earlier progress points, which would affect the finish time. Should this information somehow be encoded into the model as well?
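For instance, I could imagine using the elapsed time at the current percentage as a predictor of the total time, something along these lines (an untested sketch; p_now and t_now are made-up values for a hypothetical new instance):

# Sketch: predict total time from the elapsed time the new instance has at its
# current percentage, using the historical instances at that same percentage.
library(data.table)

totals <- dt[Percent == 100, .(TotalTime = TimeElapsed), by = TaskID]

p_now <- 40     # hypothetical: new instance is 40% done
t_now <- 2.1    # hypothetical: seconds elapsed so far for the new instance

train <- merge(dt[Percent == p_now, .(TaskID, TimeAtP = TimeElapsed)], totals, by = "TaskID")
fit   <- lm(TotalTime ~ TimeAtP, data = train)

predict(fit, newdata = data.frame(TimeAtP = t_now), interval = "prediction")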

I am particularly interested in how to do this using R. I have included my initial attempt at linear regression below, but the result does not look good to me. Any suggestions on how to improve it, or on other methods to use?
I have posted the output of dput() (on a data.table: install.packages("data.table")) on pastebin. If you want a data.frame instead, please see this paste.
EDIT: Attempt at using linear regression
The thick black line is the median at every point. The thick red line is the regression line fit to the median line.
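Roughly, what I did is the following (a sketch; I computed the median at each percentage and fit a plain lm() to that median curve):

# Sketch of the attempt: per-percent median (thick black) and a straight-line
# lm() fit to that median curve (thick red).
library(data.table)

med <- dt[, .(MedTime = median(TimeElapsed)), by = Percent]
setorder(med, Percent)
fit <- lm(MedTime ~ Percent, data = med)

plot(dt$Percent, dt$TimeElapsed, col = "grey", pch = 16, cex = 0.4,
     xlab = "Percent complete", ylab = "Time elapsed")
lines(med$Percent, med$MedTime, lwd = 3)     # median at every point (black)
abline(fit, col = "red", lwd = 3)            # regression line fit to the median (red)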


Now the mean prediction at any time is easy - just sum up the means of the remaining time increments. I don't know a good way of doing confidence intervals right now (apart from a bootstrap with, e.g., the mean and variance/covariance at each percentage).
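A simple bootstrap version could just resample whole tasks, something like this (a sketch, assuming a data.table dt with the columns shown in the question and a new instance currently at p_now percent):

# Sketch: bootstrap the mean remaining time beyond p_now by resampling tasks.
library(data.table)

p_now <- 40   # hypothetical current progress of the new instance

remaining <- dt[, .(Remaining = TimeElapsed[Percent == 100] - TimeElapsed[Percent == p_now]),
                by = TaskID]

set.seed(1)
boot_means <- replicate(2000, mean(sample(remaining$Remaining, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))   # approximate 95% CI for the mean remaining time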
A simple way would be to assume each 1% increment is independent normal, then add up the remaining means and variances to get an approximate normal distribution for the remaining time, and take a prediction interval from that.
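In R, with the same dt as in the question, that could look something like this (a sketch under the independence assumption; p_now and t_now are placeholders for the new instance's current state):

# Sketch: per-1% increments, their means and variances across tasks, then a
# normal approximation for the finish time of a partially completed instance.
library(data.table)

setkey(dt, TaskID, Percent)
dt[, Incr := TimeElapsed - shift(TimeElapsed), by = TaskID]   # time taken for each 1% step

stats <- dt[!is.na(Incr), .(m = mean(Incr), v = var(Incr)), by = Percent]

p_now <- 40    # placeholder: current percentage of the new instance
t_now <- 2.1   # placeholder: its elapsed time so far

rest  <- stats[Percent > p_now]
mu    <- t_now + sum(rest$m)      # predicted finish time
sigma <- sqrt(sum(rest$v))        # sd if increments were independent
mu + c(-1.96, 1.96) * sigma       # rough 95% prediction interval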