
The context

I was asked not to give the full context here, but I hope the following will suffice:

  • There are human-caused events, usually occurring about weekly.
  • An event, as seen by us, carries no data, just a timestamp. Imagine a user entering some data and finally clicking the "send" button: the click gets reported to us, the data do not.
  • We currently have about 1000 independent data sets.
  • We want to send reminders for expected events.

The data sets are mostly independent, as each data set corresponds to an action by a different user (there are some exceptions, like one user involved in two or maybe three data sets, but let's ignore them).

We need a prediction of future events, so that we can send a reminder to the user a few hours before they'll produce the next event.

This reminder is meant as a service to the users. Some mispredictions are fine, but too many of them would turn the service into a nuisance. If no prediction can be made, there'll be no reminder - that's fine.

Real examples

Some real data sets; the format is yymmdd-day_of_the_week-hhmm.

Data set A

  • 180109-Tue-2130
  • 180116-Tue-2259
  • 180124-Wed-1140
  • 180130-Tue-2316
  • 180207-Wed-0105
  • 180213-Tue-2223
  • 180221-Wed-0028
  • 180227-Tue-2116
  • 180307-Wed-0156
  • 180313-Tue-0037

I'd ignore the third and last events as outliers, since all the others fall in the range Tue-2116 to Wed-0105. Predicting the next event on 2018-03-13 at about 23:10, plus or minus two hours, with a probability of about 80% might be the best guess.

I was told that the last event actually belongs with the others, and that no further event should be expected on 2018-03-14; this turned out to be correct.
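The gap-based reasoning above (treat the series as roughly weekly and predict the last event plus a typical gap) could be sketched as follows. This is just my illustration of one possible heuristic, not an established method; the function names are made up, and using the median gap is merely a crude way to be robust against outliers:

```python
from datetime import datetime, timedelta
from statistics import median

def parse_event(stamp: str) -> datetime:
    """Parse a 'yymmdd-day_of_the_week-hhmm' timestamp; the weekday is redundant."""
    date_part, _weekday, time_part = stamp.split("-")
    return datetime.strptime(date_part + time_part, "%y%m%d%H%M")

def predict_next(stamps: list[str]) -> datetime:
    """Predict the next event as the last event plus the median inter-event gap."""
    times = sorted(parse_event(s) for s in stamps)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return times[-1] + timedelta(seconds=median(gaps))

data_set_a = [
    "180109-Tue-2130", "180116-Tue-2259", "180124-Wed-1140",
    "180130-Tue-2316", "180207-Wed-0105", "180213-Tue-2223",
    "180221-Wed-0028", "180227-Tue-2116", "180307-Wed-0156",
    "180313-Tue-0037",
]
print(predict_next(data_set_a))  # prints 2018-03-20 02:06:00
```

Note that even this naive sketch lands on the right date (2018-03-20), though not the right time of day; a real solution would also have to attach a probability and an uncertainty window to the prediction.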

Data set B

  • 171023-Mon-0857
  • 171023-Mon-1618
  • 171109-Thu-2301
  • 171122-Wed-1438
  • 180107-Sun-1452
  • 180131-Wed-1512
  • 180205-Mon-2242
  • 180209-Fri-0040

This looks pretty hopeless. Refusing to generate any prediction sounds best.

Data set C

  • 180204-Sun-2311
  • 180211-Sun-2335
  • 180219-Mon-0110
  • 180226-Mon-0006
  • 180304-Sun-2318
  • 180311-Sun-2208

Predicting the next event on 2018-03-18 at about 23:30, plus or minus two hours, sounds like a sure bet.

Data set D

  • 180212-Mon-1410
  • 180219-Mon-1205
  • 180226-Mon-1449
  • 180226-Mon-1449
  • 180305-Mon-1834
  • 180312-Mon-2329

This one is fairly regular, too, except for the repeated timestamp on 2018-02-26. Such repetitions are corrections (the user forgot something and added it the next time) and can safely be ignored.

I'd predict the next event on 2018-03-19 at 18:00, plus or minus six hours. It might even slide into the next day, but that wouldn't make our reminder completely wrong.
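Since exact repetitions are corrections, a prediction pipeline would presumably drop duplicate timestamps before computing gaps. A trivial sketch (my own, hypothetical helper):

```python
def drop_corrections(stamps: list[str]) -> list[str]:
    """Drop exact duplicate timestamps, keeping one copy of each; per the
    description above, repetitions are corrections and can be ignored."""
    seen: set[str] = set()
    unique = []
    for s in stamps:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique

data_set_d = [
    "180212-Mon-1410", "180219-Mon-1205", "180226-Mon-1449",
    "180226-Mon-1449", "180305-Mon-1834", "180312-Mon-2329",
]
print(drop_corrections(data_set_d))  # the duplicate 180226 entry appears once
```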

I was told there are data sets containing two events per week, but I haven't found such an example yet.

Expected output as computed on 2018-03-13 at 01:00

  • For data set A (my original thought):
    Next event on 2018-03-13 at about 23:10 plus minus two hours with a probability of about 80%.
    => Produce a corresponding reminder.

  • For data set A (improved):
    Next event on 2018-03-20 at about 23:10 plus minus two hours with a probability of about 80%.
    => No reminder for now, as it's too far in the future (the computation will be repeated at least daily).

  • For data set B:
    Unpredictable.
    => No reminder.

  • For data set C:
    Next event on 2018-03-18 at about 23:30 plus minus two hours with a probability of about 90%.
    => A reminder.

When we predict a next event within the next few hours with sufficient probability, we'll send a reminder to the user. Currently, we plan to send the reminder 6 hours before the expected time and to require a probability of at least 60%, but this will change according to the feedback we get.
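The send-or-not rule just described could look like the sketch below; the thresholds are the tentative values from the text, and the function name is made up:

```python
from datetime import datetime, timedelta

REMINDER_LEAD = timedelta(hours=6)  # tentative value from the text
MIN_PROBABILITY = 0.60              # tentative value from the text

def should_remind(now: datetime, predicted: datetime, probability: float) -> bool:
    """Send a reminder iff the prediction is confident enough and the
    predicted event falls within the lead window; distant predictions
    simply wait for a later run of the daily computation."""
    return (probability >= MIN_PROBABILITY
            and now <= predicted <= now + REMINDER_LEAD)

# Data set C on 2018-03-18 at 18:00: event predicted for 23:30 with p = 0.9.
print(should_remind(datetime(2018, 3, 18, 18, 0),
                    datetime(2018, 3, 18, 23, 30), 0.9))  # prints True
```

With this rule, the "improved" prediction for data set A (a week away as of 2018-03-13 01:00) correctly produces no reminder yet.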

Clarifications to the comments

We get all the events immediately as they are produced. The above examples were produced from the state as of 2018-03-03 01:00, and you can obtain this state from the linked file by simply removing all later events.

Obviously, you can reconstruct our data at any earlier time t by simply removing all events newer than t.
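In code, reconstructing such a historical state is just a filter; a trivial sketch:

```python
from datetime import datetime

def state_at(events: list[datetime], t: datetime) -> list[datetime]:
    """Return the data set as it looked at time t: all events strictly before t."""
    return [e for e in events if e < t]

# Two events from data set C; only the first one existed on 2018-03-10.
events = [datetime(2018, 3, 4, 23, 18), datetime(2018, 3, 11, 22, 8)]
print(state_at(events, datetime(2018, 3, 10)))  # only the first event
```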

At any time t, our data contain all events before t (in particular the first event). Nonetheless, we believe that only a few (maybe 10) of the most recent events are relevant for the prediction (it's just a gut feeling).

There may be correlations between the data sets, but they're probably too minor to matter in our limited data. They'd be useful if we knew the future of some data sets, but we obviously don't.

The events may be influenced by major TV events and by holidays, but IMHO that would only become visible with many more data sets. We can most probably ignore all external events, as they fall below the noise level.

The question

How can I produce such predictions?

I'm linking a file containing about 1000 data sets, as of 2018-03-25 at about 03:30: https://www.dropbox.com/s/ukwqlqxb0dgzq6g/stats-334010.txt?dl=0

Data sets containing fewer than three events were removed as uninteresting, since we need no prediction for them.

  • What is your question? Eg, I don't see a "?" anywhere. – gung - Reinstate Monica Mar 19 '18 at 13:36
  • @gung I've added a question mark. :D And more... – maaartinus Mar 25 '18 at 18:13
  • I see that, thank you. My impression is that this is still too broad by our standards. It would be helpful if some or our forecasting experts chimed in here. – gung - Reinstate Monica Mar 25 '18 at 20:03
  • @gung: thanks for drawing my attention to this. I'll take a look in the next days. – Stephan Kolassa Mar 25 '18 at 20:27
  • Given that a big database is presented and also the intentions/goals are made very clear ("We need a prediction of future events, so that we can send a reminder to the user a few hours before they'll produce the next event.") I would say to open this question. It might be considered still broad or doing basic development work, but in that case the question should have been deleted long before. I believe that there is at least some value to answer this question and the OP seems to be willing to adjust the question if it would be useful to make it more generic than the current specific setting – Sextus Empiricus Mar 25 '18 at 20:39
  • @MartijnWeterings I've just made the question less generic to counter the "too broad" problem. I don't expect anyone to solve this exact problem for me, I just expect some pointers to what theory is applicable. I guess, I myself could define some error function to be minimized and get some solution (machine learning, maybe), but I'm looking for something more scientific, if possible. – maaartinus Mar 25 '18 at 20:49
  • @maaartinus, I believe I was arguing in favor of your case (+1). However, the biggest problem with the current question is still that it is very much applied and requires a 'custom made' thorough approach rather than a few references to some textbook paragraphs. If you would translate this to stack-overflow language, then I guess it is something like 'help me to create this user interface with such and such specifics'. It is an extremely applied problem (not really a, specific, question) and little theoretic (yet, I like the problem, but thats besides t point about the quality of the problem). – Sextus Empiricus Mar 25 '18 at 21:01
  • @MartijnWeterings I acknowledge, you were arguing in favor of it. I guess, I'll be able to customize the approach myself, it's just that I'd like to get an advice about what makes most sense. I'm not looking for an ultimate solution - more data and more requirement will come later (in a few months maybe) and something better may be developed. +++ Your translation to an SO problem sounds very broad to me. I'd translate it like "given a set of points, find the "smallest" set of rectangles containing them all + some explanation how "smallest" is to be understood and what it's good for). – maaartinus Mar 25 '18 at 21:27
  • IMHO, this has turned into an interesting well-framed question (+1). Thank you for continuing to work on it. – whuber Mar 25 '18 at 21:35
  • I think it's pretty clear now (+1). A few more questions: (1) Are those complete histories for each user? - so that it would make sense to call the first record in each data-set the first event. (2) You give an example of making a prediction at a later date than that of the latest record in each data-set - do you know that no further events have occurred between those dates? (3) You say that you haven't found data-sets containing more than two events per week, but Data-Set B contains two events on the same day. Did you say what you meant? – Scortchi - Reinstate Monica Mar 26 '18 at 10:38
  • @MartijnWeterings: Isn't it just a harder question than most? That reduces the odds of getting an answer, & especially a fully worked-out solution, but doesn't make it a poor fit for our site. I think you'd find relevant theory in an advanced textbook on survival analysis, dealing with panel data for multiple events, frailty, & time-varying coefficients. – Scortchi - Reinstate Monica Mar 26 '18 at 10:53
  • @Scortchi I find the question on the broader side because - it is mostly about application than theory - it is more about a complete problem/task/work-package than an individual issue. I would not say that it is a harder question than most, except that it might be harder because it is not an abstract problem that can be cast in clear formula. Especially the lack of information about the underlying data (how it is generated) makes it hard to apply any theory. Choosing a strategy for predictions starts with prior information, the meta data, and not with just a dataset. – Sextus Empiricus Mar 26 '18 at 11:13
  • For instance, relating to the lack of information, a simple auxiliary question would be about the data being censored or not (such that the distribution of the length of the data set may have some meaning and can be incorporated into making predictions). – Sextus Empiricus Mar 26 '18 at 11:29
  • Relating to the broadness of the problem and the problem being more a big piece of work rather than a clear single task: e.g. Some approach would be to make predictions using recent events (the question is whether this is allowed or not, but this is not an abstract problem with clear borders). For instance one can learn a pattern occurring among a cluster of users (e.g. the event might be related to some weekly tv-show) and in that case the behavior of others can be used , a la minute, to improve the predictions for others. (e.g. a shift of time, or canceling the weekly event, holidays, etc). – Sextus Empiricus Mar 26 '18 at 11:33
  • @Scortchi (1) Yes, the first event is there. (2) Yes, each event is available immediately when it happens. (3) I meant: I didn't find anything looking like "one event on Monday, one event on Thursday, both repeated" leading to two reminders per week. – maaartinus Mar 26 '18 at 14:03
  • @MartijnWeterings The data is not censored, I just left out sequences containing less than three events as we won't do any prediction for them. The events may be influenced by major TV events and by holidays, but IMHO that would be visible with much more data sets only. We most probably can ignore all external events as they fall below the noise level. If we knew the future of some data sets, the we probably could use it for others, but we know no future, so I believe, there's no use for an inter-data-set analysis. – maaartinus Mar 26 '18 at 14:19
  • The data have to be censored. There is a timepoint when the datasets were generated. These users can still produce events after that, but haven't yet. That's the definition of censoring. – gung - Reinstate Monica Mar 26 '18 at 14:25
  • @gung OK, I don't know the terms. We always have all the events until the current time (there's no delay and we miss no past events). – maaartinus Mar 26 '18 at 14:35
  • That's what I was trying to get at in my second question, a bit clumsily. So is the current time for the data you've provided 2018-03-03 01:00 ? – Scortchi - Reinstate Monica Mar 26 '18 at 14:51
  • @Scortchi The newest timestamp in the file is 180325-Sun-0321, so it's newer. When you drop all events after 2018-03-03 01:00, then you get exactly what you'd get on 2018-03-03 01:00. As we get everything online, you can choose an arbitrary smaller timestamp, filter the data accordingly and generate a prediction. – maaartinus Mar 26 '18 at 18:02
  • I see. But you don't actually know for these data-sets? Could you include these smaller clarifications in the question as well, please? - I shouldn't expect many readers to follow the comment thread so far down. – Scortchi - Reinstate Monica Mar 27 '18 at 12:13
  • @Scortchi I've added some clarifications to the question. I don't understand your first question. – maaartinus Mar 28 '18 at 00:20
  • It's an interesting question, but I agree with @MartijnWeterings: one can either (1) do a very simple approach (e.g., check whether a user's events follow a weekly pattern to a certain degree, and if so, issue the reminder; if not, don't), and this would be a very useful benchmark for anything more complicated. And I assume that you have thought of the simple approaches yourself, and we likely won't be able to be very helpful. – Stephan Kolassa Mar 29 '18 at 10:55
  • (2) Or one could learn much more about the actual problem and what drives the data, then build a more sophisticated prediction model. Which in turn sounds like a fully-fledged statistics/data science consulting project, or possibly a M.Sc. thesis. But too broad for us here. – Stephan Kolassa Mar 29 '18 at 10:55
