R: Updating a data frame with another data frame

Question

Let's say our initial data frame looks like this:

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,NA,NA,NA),C=c(1,2,3,NA,NA,NA))

> df1
  Index A  B  C
1     1 1  1  1
2     2 2  2  2
3     3 3  3  3
4     4 4 NA NA
5     5 5 NA NA
6     6 6 NA NA

Another data frame contains new information for col B and C

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5))

> df2
  Index B C
1     4 4 5
2     5 4 5
3     6 4 5

How can you update the missing values in df1 so it looks like this:

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

My attempt:

library(dplyr)

> full_join(df1,df2)
Joining by: c("Index", "B", "C")
  Index  A  B  C
1     1  1  1  1
2     2  2  2  2
3     3  3  3  3
4     4  4 NA NA
5     5  5 NA NA
6     6  6 NA NA
7     4 NA  4  5
8     5 NA  4  5
9     6 NA  4  5

Which as you can see has created duplicate rows for the 4,5,6 index instead of replacing the NA values.

Any help would be greatly appreciated!

thelatemail · Accepted Answer · 2016-05-19T03:07:05.440

6

merge then aggregate:

aggregate(. ~ Index, data=merge(df1, df2, all=TRUE), na.omit, na.action=na.pass )

#  Index B C A
#1     1 1 1 1
#2     2 2 2 2
#3     3 3 3 3
#4     4 4 5 4
#5     5 4 5 5
#6     6 4 5 6

Or in dplyr speak:

df1 %>% 
    full_join(df2) %>%
    group_by(Index) %>%
    summarise_each(funs(na.omit))

#Joining by: c("Index", "B", "C")
#Source: local data frame [6 x 4]
#
#  Index     A     B     C
#  (dbl) (int) (dbl) (dbl)
#1     1     1     1     1
#2     2     2     2     2
#3     3     3     3     3
#4     4     4     4     5
#5     5     5     4     5
#6     6     6     4     5

edited May 19 '16 at 03:07

answered May 19 '16 at 02:55

thelatemail

85,757
12
122
177

Thanks, it worked great! I really need to learn more about the aggregate function. Is there also a way to update entries that aren't necessarily NA? – Zyferion May 19 '16 at 06:38

score 5 · Answer 2 · answered May 19 '16 at 02:18

5

We can use join from data.table. Convert the 'data.frame' to 'data.table' (setDT(df1), join on with 'df1' using "Index" and assign (:=), the values in 'B' and 'C' with 'i.B' and 'i.C'.

library(data.table)
setDT(df1)[df2, c('B', 'C') := .(i.B, i.C), on = "Index"]
df1
#   Index A B C
#1:     1 1 1 1
#2:     2 2 2 2
#3:     3 3 3 3
#4:     4 4 4 5
#5:     5 5 4 5
#6:     6 6 4 5

answered May 19 '16 at 02:18

akrun

789,025
32
460
575

Hi akrun, I couldn't get this solution to work for me – Zyferion May 19 '16 at 02:37
@Zyferion My solution was based on the example you provided. It works for me with data.table 1.9.6. Also, as your comment was not very informative (i.e. not providing any error message or anything), I can't comment. – akrun May 19 '16 at 02:38
Error in `[.data.table`(setDT(df1), df2, `:=`(c("B", "C"), .(i.B, i.C)), : unused argument (on = "Index") – Zyferion May 19 '16 at 02:45
4

@Zyferion It could be because you are using an older version of `data.table`. Please update to new version – akrun May 19 '16 at 02:45

score 3 · Answer 3 · answered May 20 '16 at 08:04

For those interested, I've extended this problem to:

- handle updating a data frame with another data frame with new columns

- replace any existing entries regardless if they're NA or not.

Heres the solution I found using the aggregate function from @thelatemail :)

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,3,3,3),C=c(1,2,3,3,3,3))

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5),D=c(6,6,6),E=c(7,7,7))

df3 = full_join(df1,df2)

# Create a function na.omit.last 
na.omit.last = function(x){
  x <- na.omit(x)
  x <- last(x)
}

# For the columns not in df1 
dfA = aggregate(. ~ Index, df3, na.omit,na.action = na.pass)
dfA = dfA[,-(1:ncol(df1))] 
dfA = data.frame(lapply(dfA,as.numeric))

dfB = aggregate(. ~ Index, df3[,1:ncol(df1)], na.omit.last, na.action = na.pass)

# If there are more columns in df2 append dfA
if (ncol(df2) > ncol(df1)) {
  df3 = cbind(dfB,dfA)
}  else {
    df3 = dfB
  }

print(df3)

score 2 · Answer 4 · answered May 19 '16 at 01:59

2

Not sure what the general case or conditions would be, but this works for this instance without dplyr

df3 <- as.matrix(df1)
df3[which(is.na(df3))] <- as.matrix(df2)
df3 <- as.data.frame(df3)
df3

  A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 5
5 5 4 5
6 6 4 5

answered May 19 '16 at 01:59

TomNash

2,979
2
18
48

thanks this could be useful - i've updated the question and added an index column though – Zyferion May 19 '16 at 02:16

score 0 · Answer 5 · answered May 11 '22 at 16:54

0

As of dplyr >= 1.0.0 you can use rows_update:

library(dplyr)

df1 %>% 
  rows_update(df2, by = "Index")

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

answered May 11 '22 at 16:54

LMc

8,572
3
21
35

R: Updating a data frame with another data frame

5 Answers5

For those interested, I've extended this problem to:

Linked

Related