Create duplicate rows based on conditions in R

Question

I have a data.table that looks like this

dt <- data.table(ID=c("A","A","B","B"),Amount1=c(100,200,300,400),
                 Amount2=c(1500,1500,2400,2400),Dupl=c(1,0,1,0))

   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A     200    1500    0
3:  B     300    2400    1
4:  B     400    2400    0

I need to duplicate each row that has a 1 in the Dupl column and replace the Amount1 value with the Amount2 value in that duplicated row. Besides that I need to give that duplicated row the value 2 in Dupl. This means it should look like this:

   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A    1500    1500    2
3:  A     200    1500    0
4:  B     300    2400    1
5:  B    2400    2400    2
6:  B     400    2400    0

Any help is much appreciated! Kind regards,

Tim

Are there any other rules we need to know? Could there be four rows for a given ID? Does the "Dupl" column have values other than "1" and "2"? — A5C1D2H2I1M1N2O1R2T1, Mar 10 '15 at 10:50

zx8754 · Answer 1 · 2021-01-05T21:32:59.827

11

Using dplyr

library("data.table")
library("dplyr")

#data
dt <- data.table(ID = c("A", "A", "B", "B"),
                 Amount1 = c(100, 200, 300, 400),
                 Amount2 = c(1500, 1500, 2400, 2400),
                 Dupl = c(1, 0, 1, 0))
#result
rbind(dt,
      dt %>% 
        filter(Dupl == 1) %>% 
        mutate(Dupl = 2,
               Amount1 = Amount2))

#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A     200    1500    0
# 3:  B     300    2400    1
# 4:  B     400    2400    0
# 5:  A    1500    1500    2
# 6:  B    2400    2400    2

edited Jan 05 '21 at 21:32

answered Mar 10 '15 at 10:57

zx8754

46,390
10
104
180

1

I always try to avoid plyr since I work with large datasets, but this worked as well. Many thanks. – Tim_Utrecht Mar 10 '15 at 11:40

score 11 · Accepted Answer · answered Mar 10 '15 at 10:58

11

You could try

rbind(dt,dt[Dupl==1][,c('Amount1', 'Dupl') := list(Amount2, 2)])

answered Mar 10 '15 at 10:58

akrun

789,025
32
460
575

@DavidArenburg Thanks, I was also thinking in terms of `rleid` using your template code, but found to be a bit difficult. – akrun Mar 10 '15 at 11:04
How do you generalize this? For example my data.table has a column defining groups. I want to duplicate each whole group N times, where N is calculated as a function of some column. – skan Apr 22 '21 at 01:14
@skan can you post as a question so that it becomes more easier to understand – akrun Apr 22 '21 at 16:14

James · Answer 3 · 2015-03-10T11:20:45.250

You can rbind a copy of the sub-setted data with the correct transformations done:

rbind(dt,copy(dt[Dupl==1])[,Amount1:=Amount2][,Dupl:=Dupl+1])
   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A     200    1500    0
3:  B     300    2400    1
4:  B     400    2400    0
5:  A    1500    1500    2
6:  B    2400    2400    2

Alternatively, you can get the duplicates by sub-setting, and then transform the duplicated rows using an intermediate step. This keeps the duplicated row next to the original as in the example in the question:

x <- dt[rep(seq(dt[,Dupl]),times=dt[,Dupl==1]+1)]
x[duplicated(x),c("Amount1","Dupl"):=list(Amount2,Dupl+1)]
x
   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A    1500    1500    2
3:  A     200    1500    0
4:  B     300    2400    1
5:  B    2400    2400    2
6:  B     400    2400    0

You can do the first step in the second solution just by `x – David Arenburg Mar 10 '15 at 11:51 — David Arenburg, Mar 10 '15 at 11:51

score 3 · Answer 4 · answered Mar 10 '15 at 10:48

This seems to do what you are asking for. Can probably be refined a bit...

library(splitstackshape)
expandRows(dt, dt$Dupl+1, count.is.col = FALSE)[
  Dupl != 0, Dupl := cumsum(Dupl), by = ID][
    , Amount1 := ifelse(Dupl > 1, Amount2[-1], Amount1)][]
#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A    1500    1500    2
# 3:  A     200    1500    0
# 4:  B     300    2400    1
# 5:  B    2400    2400    2
# 6:  B     400    2400    0

score 1 · Answer 5 · edited Nov 09 '18 at 00:11

Using dplyr's left_join to do the duplication work. Perhaps not elegant, but should be easy to understand.

library(data.table)
library(dplyr)

joiner <- data.frame(Dupl = 1, helper_col= 1:2)

dt <- left_join(dt, joiner) %>%
  mutate(Dupl = ifelse(helper_col == 2 & !is.na(helper_col), 2, Dupl)) %>%
  select(-helper_col) %>%
  mutate(Amount1 = ifelse(Dupl == 2, Amount2, Amount1))

> dt
  ID Amount1 Amount2 Dupl
1  A     100    1500    1
2  A    1500    1500    2
3  A     200    1500    0
4  B     300    2400    1
5  B    2400    2400    2
6  B     400    2400    0

score 1 · Answer 6 · answered May 23 '20 at 23:37

Biased here, but I think this dplyr solution is elegant, and it's also pretty scalable, especially as long as Dupl is always <= 2. Essentially, it takes advantage of tidyr::uncount, which says, 'Based on a given column's value (x), repeat each row x times, thereby elongating the df.' Once we've lengthened the df, we can just use dplyr::mutate_at to replace cells if they're the same value as their lag.

library(tidyverse)
dt %>%
    uncount(Dupl + 1) %>%
    mutate_at(vars(Amount1),
              ~case_when(. == lag(.) ~ Amount2, TRUE ~.)) %>%
    mutate_at(vars(Dupl),
              ~case_when(. == lag(.) ~ 2, TRUE ~.))

#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A    1500    1500    2
# 3:  A     200    1500    0
# 4:  B     300    2400    1
# 5:  B    2400    2400    2
# 6:  B     400    2400    0

Create duplicate rows based on conditions in R

6 Answers6

Linked