Pandas - merge dataframe to keep all values on left and 'insert' values from right if 'no key on left' else 'update' existing 'key' in left

Question

I have two dataframes df1 and df2.

import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'id': ['2', '23', '234', '2345'], '2021': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'id': ['23', '2345', '67', '45'], '2022': np.random.randn(4)})

  key    id      2021
0   A     2  1.764052
1   B    23  0.400157
2   C   234  0.978738
3   D  2345  2.240893

  key    id      2022
0   B    23  1.867558
1   D  2345 -0.977278
2   E    67  0.950088
3   F    45 -0.151357

I want the result to have unique keys: if a key from df2 already exists in df1, update that row; otherwise insert a new row. I am not sure whether I should use merge, concat, or join. Can anyone give some insight on this, please?

Note: I have tried a full outer join, but it returns duplicate columns. I edited the input dataframes after posting the question.

Thanks!

Can you change the dataframes to show why it is not possible to use `df1.merge(df2, on='key', how='outer')`? Which columns are duplicated? — jezrael, Apr 28 '22 at 08:15
So you need `df1.merge(df2, on=['key', 'id'], how='outer')`? — jezrael, Apr 28 '22 at 09:24
Sorry, jezrael. I edited the dataframes after first posting. Columns with the same names were present in both dataframes, so the output contained duplicate columns with '_x' and '_y' suffixes. — Poongodi, Apr 28 '22 at 09:24
I have manually deleted the '_x' columns and renamed the '_y' columns to their original names [the id values don't matter]. — Poongodi, Apr 28 '22 at 09:27
Is it possible to accept more than one answer as the solution? Both solutions received so far are correct. I want to close the question. — Poongodi, Apr 28 '22 at 09:29
I already reopened it, so I cannot close it. Only one solution can be accepted. — jezrael, Apr 28 '22 at 09:31

jezrael · Answer 1 · 2022-04-28T08:15:06.050

3

I think you need to create an index from key and then join with concat:

df = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1).reset_index()
print(df)
  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
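
A note for readers using the edited dataframes from the question, which also carry an id column: the output above matches frames without that column. A minimal sketch, assuming id always agrees with key (as it does in the sample data), is to put both columns in the index before concatenating. The names `dup` and `df` are just illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'id': ['2', '23', '234', '2345'],
                    '2021': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'id': ['23', '2345', '67', '45'],
                    '2022': np.random.randn(4)})

# Indexing on 'key' alone leaves 'id' as a data column in both frames,
# so the concatenated result contains two columns named 'id'.
dup = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1)
print(list(dup.columns))

# Assumption: 'id' is determined by 'key'. Indexing on both columns
# aligns rows on the (key, id) pair and keeps a single 'id' column.
df = pd.concat(
    [df1.set_index(['key', 'id']), df2.set_index(['key', 'id'])],
    axis=1,
).reset_index()
print(df)
```

If the same key could appear with different id values in the two frames, this would produce separate rows for each (key, id) pair, so it only fits data where the pair is consistent.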

edited Apr 28 '22 at 08:15

answered Apr 28 '22 at 08:12

You beat me to the first answer every day! :) – el_oso Apr 28 '22 at 08:14

score 3 · Accepted Answer · answered Apr 28 '22 at 08:16

You can do it using the merge function:

df = df1.merge(df2, on='key', how='outer')

df
  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
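
With the edited dataframes from the question, both frames share an id column, so merging on key alone yields the '_x' / '_y' columns mentioned in the comments. A small sketch of the difference, following the `on=['key', 'id']` suggestion from the comments and assuming id is consistent for each key:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'id': ['2', '23', '234', '2345'],
                    '2021': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'id': ['23', '2345', '67', '45'],
                    '2022': np.random.randn(4)})

# Merging on 'key' only: the shared non-key column 'id' is kept twice,
# with the default suffixes '_x' (left) and '_y' (right).
dup = df1.merge(df2, on='key', how='outer')
print(list(dup.columns))

# Merging on both shared columns: 'id' is part of the join key, so it
# appears once and no suffixes are added.
df = df1.merge(df2, on=['key', 'id'], how='outer')
print(df)
```

Rows that exist in only one frame get NaN in the other frame's year column, which is the 'insert' behaviour asked for; rows present in both are combined into one.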

score 0 · Answer 3 · answered Apr 28 '22 at 09:40

Given your description, it looks like you want combine_first. It keeps the non-null values of the dataframe it is called on and fills in missing values, rows, and columns from the other dataframe.

df2.set_index('key').combine_first(df1.set_index('key')).reset_index()

Output:

  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
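
To make the 'update, else insert' behaviour visible, here is a rough sketch using the question's data with one hypothetical change: the id for key B in df2 is set to '99' (not in the original data) so that the two frames disagree. The names `df2_changed` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'id': ['2', '23', '234', '2345'],
                    '2021': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'id': ['23', '2345', '67', '45'],
                    '2022': np.random.randn(4)})

# Hypothetical change for illustration: B's id differs in the right frame.
df2_changed = df2.assign(id=['99', '2345', '67', '45'])

# The calling frame (df2_changed) takes priority wherever it has a value;
# df1 only fills the gaps: rows A and C, and the '2021' column.
result = (
    df2_changed.set_index('key')
    .combine_first(df1.set_index('key'))
    .reset_index()
)
print(result)
```

Here B ends up with the id from df2_changed (an update), while E and F appear as new rows (an insert). Swapping the two frames would make df1's values win instead.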
