I have the following setup: a data frame in the following format:


df <- data.frame(prevArrivalTime = c(1676193057, 1676193112, 1676193180, 1676193277, 1676193358, 1676193469, 1676193581, 1676102575, 1676102613),
                 hours = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

dummy <- model.matrix(~ factor(hours) - 1, data = df)
df <- cbind(df[, 1], dummy)
colnames(df) <- c("prevArrivalTime", "hour_1", "hour_2", "hour_3")
df <- cbind(df, actualTravelTime = c(55, 68, 97, 81, 111, 112, 126, 38, 73))

The prevArrivalTime is the recorded time at which a bus arrived at the previous stop, stored as a Unix timestamp, i.e. the number of seconds that have passed since 1970-01-01. For instance, 1676193057 is "Sunday, February 12, 2023 9:10:57 AM". The variables hour_1, hour_2 and hour_3 are dummy-encoded indicators for the hour of the recording. The actualTravelTime is the variable I'm trying to predict, namely how long it took the bus to travel from the previous stop to the next.
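To make the timestamp encoding concrete, here is a minimal sketch of decoding the example value in R. It assumes the timestamps are meant to be read in UTC (the `tz = "UTC"` choice is an assumption, not something stated above), and the names `ts`, `arrival`, `hour_of_day` and `secs_since_midnight` are purely illustrative.

```r
# Example Unix timestamp from the data (seconds since 1970-01-01 00:00:00 UTC)
ts <- 1676193057

# Convert to a date-time; tz = "UTC" is an assumption about the data
arrival <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
arrival_str <- format(arrival, "%Y-%m-%d %H:%M:%S")
arrival_str
# "2023-02-12 09:10:57"

# Hour of the day, the kind of value an hour dummy could be built from
hour_of_day <- as.integer(format(arrival, "%H"))

# Seconds since midnight: the date part is dropped, only the time of day remains
secs_since_midnight <- ts %% 86400
```

Unlike the raw timestamp, `hour_of_day` and `secs_since_midnight` do not keep growing from one day to the next.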

My data stretches over multiple days and weeks, so the variable prevArrivalTime keeps increasing as time passes. I want to estimate $\hat{y}$ = estimatedTravelTime using linear regression.

The way I envision doing this is through the formula:

$\hat{y} = b_0 \cdot \text{prevArrivalTime} + b_1 \cdot \text{dummy}(x) + \epsilon - \text{prevArrivalTime}$

Here I'd subtract prevArrivalTime, which would give me a measurement of how long it took to travel from the previous stop to the next.

Otherwise, I think I end up with something where prevArrivalTime would be constantly increasing:

  • prevArrivalTime_1 = 1676193057
  • prevArrivalTime_2 = 1676193112
  • prevArrivalTime_500 = 1677829121

while actualTravelTime would fluctuate:

  • actualTravelTime_1 = 55
  • actualTravelTime_2 = 68
  • actualTravelTime_500 = 58

My original question was how to implement the formula above in R, but as the comments pointed out, it was not clear what I wanted to achieve. I hope this restructured version explains my goal better.

OLGJ
  • 3
    Why not subtract first and then model? – user2974951 Mar 02 '23 at 13:01
  • 1
    Subtract how, do you mean? – OLGJ Mar 02 '23 at 13:04
  • 1
    $y2=y_t-y_{t-1}$ and then model $y2$. – user2974951 Mar 02 '23 at 13:06
  • Let me check if it makes sense – OLGJ Mar 02 '23 at 13:09
  • You can also see the offset argument in lm, see https://stats.stackexchange.com/q/292574/60613 – Firebug Mar 02 '23 at 13:13
  • What does "Is it possible to achieve this without fitting the model first?" mean? You have to fit a model to estimate the coefficients! – Sycorax Mar 02 '23 at 13:33
  • I edited my question with a clarification of what I wanted to achieve. – OLGJ Mar 02 '23 at 14:36
  • 1
    @OLGJ your explanation is helpful in understanding your problem a bit better, but it is still not clear what you mean by 'subtracting' some variable from the model. – Sextus Empiricus Mar 02 '23 at 15:30
  • 1
    As it's written right now, the question appears to be an XY Problem. Your research question wants to know something about bus arrivals and you have some data about the buses (X), and you've decided to answer it using a regression that subtracts some variable (Y), but how these two ideas X & Y are related is very unclear. I suggest writing, in plain language, what you know, what you want to know, and where you are stuck. Code & notation can come later -- right now, they seem to be distracting & creating unclarity. – Sycorax Mar 02 '23 at 15:32
  • If I'm not subtracting the prevArrivalTime from my model, I would end up with an estimate along the lines of $\hat{y} = b_0 \cdot 1676193057 + b_1 \cdot \text{dummy}(x) + \epsilon$ for day 1. For day 2 it would be roughly the same, but "1676193057" would be a strictly larger number. For day 15 it would be an even larger number. However, the actualTravelTime would (probably) be roughly the same. That is why I wanted to subtract the prevArrivalTime. @SextusEmpiricus – OLGJ Mar 02 '23 at 15:56
  • 1
    @OLGJ I have no idea what these numbers like 1676193057 mean and why you are talking about day 1, day 2, day 15, etc. You haven't explained your problem. So with your last comment I do not understand what you mean by "That is why I wanted to subtract the prevArrivalTime." – Sextus Empiricus Mar 02 '23 at 16:33
  • 1
    I suspect that you are doing something very simple (and you want to predict something like differences in time), but your approach here might be difficult and complicated to understand (because instead of differences in time you are modeling with absolute times). This relates to the XY problem mentioned by Sycorax. Your question becomes difficult to understand because you ask for Y without explaining X, but the question about Y (the concept behind it) is difficult to understand (what sort of subtraction you want to do is unclear). – Sextus Empiricus Mar 02 '23 at 16:37
  • I understand, I'm sorry for the confusion I might be causing. I will try to think it through and see if I can make another edit tomorrow for it to be comprehensible for others @SextusEmpiricus.

    What I meant by 1676193057 is that it's a Unix timestamp (12 February 2023, 09:10:57) which encodes the actual time at which the bus arrived at the previous stop, as the number of seconds that have passed since 1970-01-01. As hours, days or weeks progress in my observations, more seconds will have passed.

    – OLGJ Mar 02 '23 at 21:17
  • I have now edited my question and I hope it's more clear @SextusEmpiricus – OLGJ Mar 03 '23 at 07:58
  • I believe that what you are doing is the wrong approach, but at least it is clear now. So I voted to reopen. – Sextus Empiricus Mar 03 '23 at 09:10
  • Ok, if you have the time I'd appreciate to hear why it's the wrong approach @SextusEmpiricus – OLGJ Mar 03 '23 at 11:13
  • @OLGJ If the arrival time is larger, what influence would this have on travel time? Why do you include arrival time in the linear model for estimating travel time? If last week the travel time was 60 seconds, why would it matter that the next week the arrival time is 600 thousand seconds bigger, and what would its influence on travel time be? Or is the bus travelling faster/slower every day? – Sextus Empiricus Mar 03 '23 at 11:27
  • @SextusEmpiricus 1) I don't quite understand your question. 2) I include it because that is how I can create an estimate of how long it will take for the bus to travel. Arrival time at B - Arrival time at A = time it took to travel from A to B. 3) Yes, the bus will not travel between two stops at a constant speed. There are various scenarios that can impact the time it takes to travel between two stops, for instance if it gets caught at a red light. – OLGJ Mar 03 '23 at 12:51
  • @OLGJ "There are various scenarios that can impact the time" but why does previous arrival time have an impact on travel time? – Sextus Empiricus Mar 03 '23 at 13:02
  • My thought was that depending on when it arrived at the last stop, it might impact when it arrives at the next stop. For instance arriving at 00:15 a Monday at a certain stop coincides with when traffic is minimal, whereas arriving at 08:05 at the same stop means there's lots of traffic and rush hour, which could impact the travel time. @SextusEmpiricus – OLGJ Mar 03 '23 at 13:13
  • Is your variable 'previous arrival time' the time of the day or the time passed since 1970-01-01? And why is it a linear relationship? If, for a particular piece of travel, the bus takes 60 seconds on Monday 07:05 and 61 seconds on Monday 08:05, then you would project this to 70 seconds on Monday 17:05 and a day later, Tuesday 07:05, the bus takes 84 seconds? – Sextus Empiricus Mar 03 '23 at 13:17
  • It is the time passed since 1970-01-01. The observation at this second will be 1 second less than the observation in 1 second, and 2 seconds less than the observation in 2 seconds. Don't understand the second part, would it be better to open a chat instead? @SextusEmpiricus – OLGJ Mar 03 '23 at 13:39

2 Answers

According to your comments, it seems that you want to model a variable that has a periodic time dependency. If you model this as a linear function of time, then subtracting the time still leaves you with a linear function.

The model

$$y = a + bt + \epsilon - t$$

is equivalent to either of the linear models

$$y = a + b^\prime t + \epsilon \quad \text{with $b^\prime = b-1$}$$

or

$$y^\prime = a + bt + \epsilon \quad \text{with $y^\prime = y+t$}$$
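The equivalence can be checked numerically. The following is a small sketch on simulated data (the numbers and the names `t_val`, `fit_prime` and `fit_shift` are made up for illustration): regressing $y$ on $t$ estimates $b^\prime$, regressing $y + t$ on $t$ estimates $b$, and the two fits differ only in the slope, by exactly 1.

```r
set.seed(1)                          # simulated data, for illustration only
t_val <- 1:50
y <- 3 + 0.5 * t_val + rnorm(50)

fit_prime <- lm(y ~ t_val)           # y      = a + b' t + eps
fit_shift <- lm(I(y + t_val) ~ t_val)  # y + t  = a + b  t + eps

coef(fit_prime)
coef(fit_shift)

# Same intercept, slopes differ by exactly 1 (b = b' + 1), identical residuals
slope_diff <- unname(coef(fit_shift)["t_val"] - coef(fit_prime)["t_val"])
same_resid <- isTRUE(all.equal(resid(fit_prime), resid(fit_shift)))
```

So adding or subtracting the time variable on either side changes how the slope is labelled, not what the model can fit.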


  • What you probably want to do instead is to include some periodic variables, such as Fourier terms.

  • Alternatively, use the time variable modulo the length of a day, so that only the time of day remains. E.g. time stamps like t = 1.7 days, t = 2.7 days, t = 3.7 days will all become t = 0.7.

    The image below shows an example with a time variable on the x-axis that runs from t=0 to t=7. On the left we have the variable $y$ plotted as a function of $t$. On the right we have the variable $y$ plotted as a function of $t$ modulo $1$ (the colour coding of the points is kept the same).

    [Figure: example plots of $y$ against $t$ (left) and against $t$ modulo $1$ (right)]
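Both suggestions can be sketched in R. The example below uses simulated data with a hypothetical daily cycle, with time measured in days from 0 to 7 as in the figure; the amplitude, noise level and variable names are assumptions made for illustration, not values from the question.

```r
set.seed(42)                                   # simulated data, illustration only
t_days <- sort(runif(200, min = 0, max = 7))   # time in days, t = 0 to 7
y <- 60 + 20 * sin(2 * pi * t_days) + rnorm(200, sd = 5)  # hypothetical daily cycle

# Option 1: time of day via modulo, so t = 1.7, 2.7, 3.7 all map to 0.7
tod <- t_days %% 1

# Option 2: first-order Fourier terms with a period of one day
fit_fourier <- lm(y ~ sin(2 * pi * t_days) + cos(2 * pi * t_days))

# For comparison: a straight line in raw time cannot capture the daily pattern
fit_linear <- lm(y ~ t_days)

r2_fourier <- summary(fit_fourier)$r.squared
r2_linear  <- summary(fit_linear)$r.squared
c(fourier = r2_fourier, linear = r2_linear)
```

With Unix timestamps in seconds, the corresponding time of day would be something like `prevArrivalTime %% 86400`, and the Fourier terms would use `2 * pi * prevArrivalTime / 86400` as their argument (in UTC; a local time zone would need an extra shift).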

  • The first part actually addressed my concerns. Thank you for that and your patience. I will expand into different ways to encode the hour of the recordings, but I wanted to create this as a simple first model. Thank you. – OLGJ Mar 03 '23 at 14:32
  • However, If I may ask; how would I practically rewrite my linear model in this case? Even if it doesn't matter. – OLGJ Mar 03 '23 at 15:01

Using offset (here hp plays the role of the lagged variable):

> lm(mpg~1,data=mtcars,offset=hp)

Coefficients: (Intercept)
-126.6

Using I():

> lm(I(mpg-hp)~1,data=mtcars)

Coefficients: (Intercept)
-126.6

Subtracting first, then modelling:

> mtcars$mpg_hp=mtcars$mpg-mtcars$hp
> lm(mpg_hp~1,data=mtcars)

Coefficients: (Intercept)
-126.6
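As a quick check that these routes really coincide, the sketch below refits the three variants on the built-in `mtcars` data, adds a fourth that puts `offset()` inside the formula, and compares the intercepts. The object names (`fit_offset`, `cars2`, and so on) are only illustrative, and the subtraction is done on a copy so that `mtcars` itself is not modified.

```r
# Offset passed as an argument: the model is mpg = b0 + hp + eps
fit_offset  <- lm(mpg ~ 1, data = mtcars, offset = hp)

# Offset written inside the formula (same model)
fit_formula <- lm(mpg ~ 1 + offset(hp), data = mtcars)

# Subtraction inside the formula with I()
fit_inline  <- lm(I(mpg - hp) ~ 1, data = mtcars)

# Subtraction done beforehand, on a copy of the data
cars2 <- transform(mtcars, mpg_hp = mpg - hp)
fit_presub  <- lm(mpg_hp ~ 1, data = cars2)

intercepts <- sapply(
  list(offset = fit_offset, formula = fit_formula,
       inline = fit_inline, presub = fit_presub),
  function(f) unname(coef(f)[1])
)
intercepts                      # all four equal mean(mpg - hp)
```

One practical difference: with `offset`, the fitted values and predictions stay on the scale of the original response (`mpg`), whereas the other two variants model the difference directly.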

user2974951