Create new data frame column from past values

Question

Say I have a dataframe with the following structure:

dt   dtPr   id    val
99    98     a     10
98    97     a     9
97    96     a     8
99    98     b     20
98    97     b     19
97    96     b     18

What is the most efficient way to create another data frame column that shows "prior value" based on the two dates? Prior value should equal the val where dtPr = dt for a given id. I could do this easily in SQL but I"m unsure about the most efficient approach in R.

Example output:

dt   dtPr   id    val   valPr
99    98     a     10    9
98    97     a     9     8
97    96     a     8     NULL
99    98     b     20    19
98    97     b     19    18
97    96     b     18    NULL

Code to generate sample data frame:

a <- c(99,98,97,99,98,97)
b <- c(98,97,96,98,97,96)
c <- c("a","a","a","b","b","b")
d <- c(10,9,8,20,19,18)
e <- data.frame(dt = a, dtPr = b, id = c, val = d)

IRTFM · Answer 1 · 2014-09-08T20:43:48.900

2

Assuming that the data is sorted in whatever order you need, and recognizing that it should be NA rather than NULL which is used in a dataframe column:

e$valPr  <- with( e , ave(val, id, FUN=function(x) c(tail(x,-1),NA) ) )

The ave function operates on a vector, breaks the vector into segments defined by the second argument, "id" in this case, and returns a vector of the same length as the original. It can be used to create group sums, group averages, or as in this case group-specific vectors constructed by shifting or other operations involving indexing or ordering as long as the result is the same length as the input within each group.

edited Sep 08 '14 at 20:43

answered Sep 08 '14 at 20:32

IRTFM

251,731
20
347
472

the way how `ave` works looks the same idea as `ddply`. vote up from `stats` – B.Mr.W. Sep 08 '14 at 20:53
@B.Mr.W., `ddply` is an external package, while `ave` is base R. So the opposite is true. Also, I doubt that BondedDust was trying to mimic `ddply`, so your comment is essentially pointless – David Arenburg Sep 08 '14 at 20:59
@B.Mr.W. The 'plyr'-package functions had performance issues, which I think is the reason hadley developed the 'dplyr'-package. I didn't find the formula interface that plyr functions provided to be much of an improvement over the equivalent base functions, but I think his 'dplyr' work is going to be very useful to the R world in the future. – IRTFM Sep 08 '14 at 21:14

David Arenburg · Accepted Answer · 2014-09-08T20:56:11.050

2

Using data.table will be probably the most efficient way, as it updates/creates column by reference (without creating copies)

library(data.table)
setDT(e)[, valPr := c(val[-1], NA), by = id]
e
#    dt dtPr id val valPr
# 1: 99   98  a  10     9
# 2: 98   97  a   9     8
# 3: 97   96  a   8    NA  
# 4: 99   98  b  20    19
# 5: 98   97  b  19    18
# 6: 97   96  b  18    NA

Although the question is a bit unclear. Are there any cases in your real data when dt and dtPr are not consecutive by row?

If you really meant a recursive join, i.e. dates are not always in consecutive rows, you could create an index using match and then extract those values

setDT(e)[, Indx := match(dtPr, dt), by = id][, valPr := val[Indx], by = id]
e
#    dt dtPr id val Indx valPr
# 1: 99   98  a  10    2     9
# 2: 98   97  a   9    3     8
# 3: 97   96  a   8   NA    NA
# 4: 99   98  b  20    2    19
# 5: 98   97  b  19    3    18
# 6: 97   96  b  18   NA    NA

edited Sep 08 '14 at 20:56

answered Sep 08 '14 at 20:33

David Arenburg

89,637
17
130
188

Since the input was numeric, wouldn't it have been more appropriate to return a numeric vector? – IRTFM Sep 08 '14 at 20:48
@BondedDust, maybe, but his desired output contained `NULL` so I thought `""` is more similar to it than `NA`, but maybe I'm wrong – David Arenburg Sep 08 '14 at 20:52
1

Think sometimes people need to be protected from their own folly. – IRTFM Sep 08 '14 at 21:09
@JeffreyKramer, please provide some feedback, as your question is a bit vogue – David Arenburg Sep 09 '14 at 08:33
This works really well. Thank you for the data table suggestion. I'm using it more and more lately. I ended up just doing some sorting before applying the table. This was helpful for some other operations that are more complicated than the example. Thanks!! – Jeffrey Kramer Sep 10 '14 at 14:04

score 0 · Answer 3 · answered Sep 08 '14 at 21:04

This works, but function blurf calls the global variable e ...

blurf=function(x){   temp=as.data.frame(t(x))   
    result=as.numeric(e$val[as.character(e$id)==as.character(temp$id) & e$dt==temp$dtPr]) 
    if (identical(result, numeric(0))) result=NA   
    result 
}

e$valPr=apply(e,1,blurf)

Create new data frame column from past values

3 Answers3

Linked