2

I would like to select the first few rows for each factor level (group) in a data.table.

library(data.table)
SOURCE <- data.table(NAME = rep(paste0("NAME", 1:3), each = 5), VALUE = sample(c(TRUE, FALSE), 5 * 3, TRUE))
> SOURCE
     NAME VALUE
 1: NAME1  TRUE
 2: NAME1  TRUE
 3: NAME1  TRUE
 4: NAME1 FALSE
 5: NAME1 FALSE
 6: NAME2  TRUE
 7: NAME2 FALSE
 8: NAME2  TRUE
 9: NAME2  TRUE
10: NAME2  TRUE
11: NAME3  TRUE
12: NAME3 FALSE
13: NAME3 FALSE
14: NAME3  TRUE
15: NAME3  TRUE

For instance, here I'd like to select the first 3 rows for each NAME, so I would end up with rows 1-3, 6-8 and 11-13. Any idea how to do that?

I tried this, but it only returns rows for the first NAME:

> SOURCE[1:3, VALUE, by=NAME]
    NAME VALUE
1: NAME1  TRUE
2: NAME1  TRUE
3: NAME1  TRUE
Frank
ChiseledAbs

2 Answers

4

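Another option is to subset by row index. Within each group, .I holds the row numbers of that group in the full table, so the inner call collects the first three row numbers per NAME (in column V1) and the outer call uses them to subset SOURCE.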

SOURCE[SOURCE[, .I[1:3], by = NAME]$V1]
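
One caveat worth sketching: `.I[1:3]` assumes every group has at least three rows. The toy table `DT` below is hypothetical (it is not from the question) and has a group with only two rows, to show how `1:3` pads with `NA` while `head(.I, 3)` does not.

```r
library(data.table)

# Hypothetical toy data: group "b" has only two rows
DT <- data.table(NAME = c("a", "a", "a", "a", "b", "b"), VALUE = 1:6)

# .I[1:3] pads the short group with NA, which turns into an all-NA row
padded <- DT[DT[, .I[1:3], by = NAME]$V1]

# head(.I, 3) returns at most three indices per group, so nothing is padded
trimmed <- DT[DT[, head(.I, 3), by = NAME]$V1]
```

With the data in the question every NAME has five rows, so both forms should give the same result there.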
akrun
3

This looks like it should do it. It is basically the same as @hrbrmstr's answer in the comments, but it doesn't use the head function:

library(data.table)
set.seed(1)
SOURCE=data.table(NAME=rep(paste0("NAME", as.character(1:3)), each=5), VALUE=sample(c(TRUE,FALSE), 5*3, TRUE) )

SOURCE[,.SD[1:3], by=NAME]
    NAME VALUE
1: NAME1  TRUE
2: NAME1  TRUE
3: NAME1 FALSE
4: NAME2 FALSE
5: NAME2 FALSE
6: NAME2 FALSE
7: NAME3  TRUE
8: NAME3  TRUE
9: NAME3 FALSE
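
For comparison, here is a minimal sketch that puts the variants side by side. The `head(.SD, 3)` form is an assumption about what the comment-based answer looked like; the variable names `via_sd`, `via_head` and `via_I` are only illustrative.

```r
library(data.table)

set.seed(1)
SOURCE <- data.table(NAME = rep(paste0("NAME", 1:3), each = 5),
                     VALUE = sample(c(TRUE, FALSE), 5 * 3, TRUE))

# Subset .SD by position within each group
via_sd <- SOURCE[, .SD[1:3], by = NAME]

# Same idea with head(); assumed to be the form suggested in the comments
via_head <- SOURCE[, head(.SD, 3), by = NAME]

# Row-index form: collect row numbers per group, then subset once
via_I <- SOURCE[SOURCE[, .I[1:3], by = NAME]$V1]

# All three should agree when every group has at least three rows
all.equal(via_sd, via_head)
all.equal(via_sd, via_I)
```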
Mike H.
  • 1
    For what it's worth, optimization is planned for `.SD[int_vec]` but not for `head(.SD, n)`, looks like https://github.com/Rdatatable/data.table/issues/735 – Frank May 29 '16 at 13:03