2

Say I have a dataframe with the following structure:

dt   dtPr   id    val
99    98     a     10
98    97     a     9
97    96     a     8
99    98     b     20
98    97     b     19
97    96     b     18

What is the most efficient way to create another data frame column that shows "prior value" based on the two dates? Prior value should equal the val where dtPr = dt for a given id. I could do this easily in SQL but I"m unsure about the most efficient approach in R.

Example output:

dt   dtPr   id    val   valPr
99    98     a     10    9
98    97     a     9     8
97    96     a     8     NULL
99    98     b     20    19
98    97     b     19    18
97    96     b     18    NULL

Code to generate sample data frame:

a <- c(99,98,97,99,98,97)
b <- c(98,97,96,98,97,96)
c <- c("a","a","a","b","b","b")
d <- c(10,9,8,20,19,18)
e <- data.frame(dt = a, dtPr = b, id = c, val = d)
Jeffrey Kramer
  • 1,305
  • 6
  • 25
  • 43

3 Answers3

2

Assuming that the data is sorted in whatever order you need, and recognizing that it should be NA rather than NULL which is used in a dataframe column:

e$valPr  <- with( e , ave(val, id, FUN=function(x) c(tail(x,-1),NA) ) )

The ave function operates on a vector, breaks the vector into segments defined by the second argument, "id" in this case, and returns a vector of the same length as the original. It can be used to create group sums, group averages, or as in this case group-specific vectors constructed by shifting or other operations involving indexing or ordering as long as the result is the same length as the input within each group.

IRTFM
  • 251,731
  • 20
  • 347
  • 472
  • the way how `ave` works looks the same idea as `ddply`. vote up from `stats` – B.Mr.W. Sep 08 '14 at 20:53
  • @B.Mr.W., `ddply` is an external package, while `ave` is base R. So the opposite is true. Also, I doubt that BondedDust was trying to mimic `ddply`, so your comment is essentially pointless – David Arenburg Sep 08 '14 at 20:59
  • @B.Mr.W. The 'plyr'-package functions had performance issues, which I think is the reason hadley developed the 'dplyr'-package. I didn't find the formula interface that plyr functions provided to be much of an improvement over the equivalent base functions, but I think his 'dplyr' work is going to be very useful to the R world in the future. – IRTFM Sep 08 '14 at 21:14
2

Using data.table will be probably the most efficient way, as it updates/creates column by reference (without creating copies)

library(data.table)
setDT(e)[, valPr := c(val[-1], NA), by = id]
e
#    dt dtPr id val valPr
# 1: 99   98  a  10     9
# 2: 98   97  a   9     8
# 3: 97   96  a   8    NA  
# 4: 99   98  b  20    19
# 5: 98   97  b  19    18
# 6: 97   96  b  18    NA

Although the question is a bit unclear. Are there any cases in your real data when dt and dtPr are not consecutive by row?


If you really meant a recursive join, i.e. dates are not always in consecutive rows, you could create an index using match and then extract those values

setDT(e)[, Indx := match(dtPr, dt), by = id][, valPr := val[Indx], by = id]
e
#    dt dtPr id val Indx valPr
# 1: 99   98  a  10    2     9
# 2: 98   97  a   9    3     8
# 3: 97   96  a   8   NA    NA
# 4: 99   98  b  20    2    19
# 5: 98   97  b  19    3    18
# 6: 97   96  b  18   NA    NA
David Arenburg
  • 89,637
  • 17
  • 130
  • 188
  • Since the input was numeric, wouldn't it have been more appropriate to return a numeric vector? – IRTFM Sep 08 '14 at 20:48
  • @BondedDust, maybe, but his desired output contained `NULL` so I thought `""` is more similar to it than `NA`, but maybe I'm wrong – David Arenburg Sep 08 '14 at 20:52
  • 1
    Think sometimes people need to be protected from their own folly. – IRTFM Sep 08 '14 at 21:09
  • @JeffreyKramer, please provide some feedback, as your question is a bit vogue – David Arenburg Sep 09 '14 at 08:33
  • This works really well. Thank you for the data table suggestion. I'm using it more and more lately. I ended up just doing some sorting before applying the table. This was helpful for some other operations that are more complicated than the example. Thanks!! – Jeffrey Kramer Sep 10 '14 at 14:04
0

This works, but function blurf calls the global variable e ...

blurf=function(x){   temp=as.data.frame(t(x))   
    result=as.numeric(e$val[as.character(e$id)==as.character(temp$id) & e$dt==temp$dtPr]) 
    if (identical(result, numeric(0))) result=NA   
    result 
}

e$valPr=apply(e,1,blurf)
emudrak
  • 617
  • 7
  • 20