1

I have two dataframes df1 and df2.

np.random.seed(0)
df1= pd.DataFrame({'key': ['A', 'B', 'C', 'D'],'id': ['2', '23', '234', '2345'], '2021': np.random.randn(4)})
df2= pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'id': ['23', '2345', '67', '45'],'2022': np.random.randn(4)})

  key    id      2021
0   A     2  1.764052
1   B    23  0.400157
2   C   234  0.978738
3   D  2345  2.240893

  key    id      2022
0   B    23  1.867558
1   D  2345 -0.977278
2   E    67  0.950088
3   F    45 -0.151357

I want to have unique keys. If key found already just update the key else insert new row. I am not sure if I have to use merge/concat/join. Can anyone give insight on this please?

Note:I have used full outer join, it returns duplicate columns. Have edited the input dataframes after posting the question.

Thanks!

Poongodi
  • 55
  • 5

3 Answers3

3

I think you need create index from key and then join in concat:

df = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1).reset_index()
print (df)
  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
jezrael
  • 729,927
  • 78
  • 1,141
  • 1,090
3

You can do it using merge function:

df = df1.merge(df2, on='key', how='outer')

df
   key     2021    2022
0   A   1.764052    NaN
1   B   0.400157    1.867558
2   C   0.978738    NaN
3   D   2.240893    -0.977278
4   E   NaN         0.950088
5   F   NaN        -0.151357
Andrea Ierardi
  • 391
  • 1
  • 9
0

Given your description, it looks like you want combine_first. It will merge the two datasets by replacing the duplicates in order.

df2.set_index('key'). combine_first(df1.set_index('key')).reset_index()

Output:

  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
mozway
  • 81,317
  • 8
  • 19
  • 49