Only copy one key-column into merged DataFrame

Question

Consider the following DataFrames:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3], 'b': list('abcd')})
df2 = pd.DataFrame({'c': list('abcd'), 'd': 'Alex'})

In this instance, df1['b'] and df2['c'] are the key columns. So when merging:

df1.merge(df2, left_on='b', right_on='c')
   a  b  c     d
0  0  a  a  Alex
1  1  b  b  Alex
2  2  c  c  Alex
3  3  d  d  Alex

I end up with both key columns in the resultant DataFrame when I only need one. I've been using:

df1.merge(df2, left_on='b', right_on='c').drop('c', axis='columns')

Is there a way to only keep one key column?

For more information on various facets and functionality of the merge, join, and concat API, please take a look at [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101). — cs95, Dec 15 '18 at 13:13

sacuL · Accepted Answer · 2018-11-08T21:00:07.010

One way is to set b and c as the index of your frames respectively, and use join followed by reset_index:

df1.set_index('b').join(df2.set_index('c')).reset_index()

   b  a     d
0  a  0  Alex
1  b  1  Alex
2  c  2  Alex
3  d  3  Alex
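One difference from the merge in the question is worth keeping in mind: merge defaults to an inner join, while join defaults to a left join. With the frames above every key matches, so the results agree. The sketch below uses hypothetical frames (`left`, `right`) in which some keys are missing on the right to show where they diverge. The `rename_axis` call is a precaution, on the assumption that the index name may not survive an inner join of two differently named indexes.

```python
import pandas as pd

# Hypothetical frames: 'right' deliberately lacks the keys 'c' and 'd'.
left = pd.DataFrame({'a': [0, 1, 2, 3], 'b': list('abcd')})
right = pd.DataFrame({'c': list('ab'), 'd': 'Alex'})

# merge defaults to how='inner': only rows with a matching key survive.
merged = left.merge(right, left_on='b', right_on='c').drop(columns='c')

# join defaults to how='left': every row of 'left' is kept, with NaN in 'd'
# where no match exists.
joined_left = left.set_index('b').join(right.set_index('c')).reset_index()

# Passing how='inner' keeps the same rows as the merge.
joined_inner = (
    left.set_index('b')
    .join(right.set_index('c'), how='inner')
    .rename_axis('b')  # precaution: restore the key name before reset_index
    .reset_index()
)

print(len(merged), len(joined_left), len(joined_inner))
```

If unmatched rows should be dropped, as the question's merge does, passing how='inner' to join is probably the safer choice.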

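The set_index/join approach is faster than the merge/drop method on large DataFrames, mostly because drop is slow. @Bill's rename method is faster than my suggestion, and @W-B's (BENY's) assign method and @piRSquared's map method easily outpace the others. Times below are average seconds per call over 10 runs: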

import timeit

df1 = pd.concat((df1 for _ in range(1000)))
df2 = pd.concat((df2 for _ in range(1000)))

# Default arguments bind the enlarged frames at definition time,
# so timeit can call each function with no arguments.
def index_method(df1=df1, df2=df2):
    return df1.set_index('b').join(df2.set_index('c')).reset_index()

def merge_method(df1=df1, df2=df2):
    return df1.merge(df2, left_on='b', right_on='c').drop('c', axis='columns')

def rename_method(df1=df1, df2=df2):
    return df1.rename({'b': 'c'}, axis=1).merge(df2)

def index_method2(df1=df1, df2=df2):
    return df1.join(df2.set_index('c'), on='b')

def assign_method(df1=df1, df2=df2):
    return df1.set_index('b').assign(c=df2.set_index('c').d).reset_index()

def map_method(df1=df1, df2=df2):
    return df1.assign(d=df1.b.map(dict(df2.values)))

>>> timeit.timeit(index_method, number=10) / 10
0.7853091600998596
>>> timeit.timeit(merge_method, number=10) / 10
1.1696729859002517
>>> timeit.timeit(rename_method, number=10) / 10
0.4291436871004407
>>> timeit.timeit(index_method2, number=10) / 10
0.5037374985004135
>>> timeit.timeit(assign_method, number=10) / 10
0.0038641377999738325
>>> timeit.timeit(map_method, number=10) / 10
0.006620216699957382
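One caveat when reading these numbers: concatenating each frame with itself duplicates every key, so the merge- and join-based methods emit one row per matching pair, while the map-based method keeps one row per row of df1. The methods are therefore not doing the same amount of work. A minimal sketch, assuming a smaller repeat count (`n = 50`, not the 1000 used above) and hypothetical names `big1`, `big2`, and `lookup`:

```python
import pandas as pd

n = 50  # assumption: smaller than the 1000 used above, to keep the sketch quick

base1 = pd.DataFrame({'a': [0, 1, 2, 3], 'b': list('abcd')})
base2 = pd.DataFrame({'c': list('abcd'), 'd': 'Alex'})
big1 = pd.concat([base1] * n)
big2 = pd.concat([base2] * n)

# Each key appears n times in both frames, so merge emits n * n rows per key.
merged = big1.merge(big2, left_on='b', right_on='c').drop('c', axis='columns')

# map looks each key up once per row of big1, so the row count is unchanged.
# For a two-column frame this builds the same mapping as dict(big2.values).
lookup = dict(zip(big2['c'], big2['d']))
mapped = big1.assign(d=big1['b'].map(lookup))

print(merged.shape)  # (10000, 3)
print(mapped.shape)  # (200, 3)
```

With unique keys on at least one side, all methods return the same number of rows, which would likely give a fairer comparison.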

`df1.join(df2.set_index('c'), on='b')` – piRSquared, Nov 08 '18 at 20:51
Would you like to test the speed of my method? – BENY, Nov 08 '18 at 20:55
@W-B, I just did, it's ***far*** faster! – sacuL, Nov 08 '18 at 20:58

score 7 · Answer 2 · answered Nov 08 '18 at 20:42

Another way is to give b and c the same name, at least for the merge operation.

df1.rename({'b': 'c'}, axis=1).merge(df2)
   a  c     d
0  0  a  Alex
1  1  b  Alex
2  2  c  Alex
3  3  d  Alex
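If the result should keep df1's key name b rather than c, the same idea can be applied to the right-hand frame instead. This is a small variation on the approach above, not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3], 'b': list('abcd')})
df2 = pd.DataFrame({'c': list('abcd'), 'd': 'Alex'})

# Rename the right-hand key so both frames share the name 'b'.
# rename returns a new frame, so df2 itself is left unchanged.
result = df1.merge(df2.rename(columns={'c': 'b'}), on='b')

print(result.columns.tolist())  # ['a', 'b', 'd']
```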

Bill

score 5 · Answer 3 · answered Nov 08 '18 at 20:48

Or use a single set_index together with the left_index=True and right_on parameters:

df1.set_index('b').merge(df2, left_index=True, right_on='c')

Output:

   a  c     d
0  0  a  Alex
1  1  b  Alex
2  2  c  Alex
3  3  d  Alex

Scott Boston

piRSquared · Answer 4 · score 4 · 2018-11-08T20:55:41.233

`map`

An obnoxious (not recommended) method that I felt compelled to post because I had accidentally posted a duplicate of someone else's answer.

df1.assign(d=df1.b.map(dict(df2.values)))

   a  b     d
0  0  a  Alex
1  1  b  Alex
2  2  c  Alex
3  3  d  Alex

edited Nov 08 '18 at 20:55

answered Nov 08 '18 at 20:49

Wait, why not use map in this case of bringing only one column? – ALollz Nov 08 '18 at 21:11

Because it isn't generalized. It's very specific to this toy problem. If we truly were bringing over one column, then I'd agree. – piRSquared Nov 08 '18 at 21:12
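The point about generality can be illustrated with a hypothetical wider right-hand frame (`wide`, with an extra column `e` that is not in the original question). The `dict(df2.values)` trick relies on the frame having exactly two columns. A merge-based approach does not:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3], 'b': list('abcd')})
# Hypothetical wider right-hand frame with a second value column 'e'.
wide = pd.DataFrame({'c': list('abcd'), 'd': 'Alex', 'e': [10, 20, 30, 40]})

# dict() needs (key, value) pairs; three-element rows cannot be unpacked that way.
try:
    dict(wide.values)
    map_worked = True
except ValueError:
    map_worked = False

# A merge-based approach carries every non-key column across in one call.
merged = df1.merge(wide.rename(columns={'c': 'b'}), on='b')

print(map_worked)               # False
print(merged.columns.tolist())  # ['a', 'b', 'd', 'e']
```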

score 4 · Answer 5 · answered Nov 08 '18 at 20:55

After set_index you can assign the value directly (the new column is named c here; pass d=... instead to keep the original name):

df1.set_index('b').assign(c=df2.set_index('c').d).reset_index()
Out[233]: 
   b  a     c
0  a  0  Alex
1  b  1  Alex
2  c  2  Alex
3  d  3  Alex
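Assigning a Series this way aligns on index labels rather than on row position, so it still works when the two frames list their keys in a different order, provided the keys are unique. A minimal sketch with a hypothetical right-hand frame (`right`) whose rows are reversed and whose values are distinct, so the alignment is visible:

```python
import pandas as pd

left = pd.DataFrame({'a': [0, 1, 2, 3], 'b': list('abcd')})
# Hypothetical frame: keys in reverse order, with distinct values per key.
right = pd.DataFrame({'c': list('dcba'), 'd': ['D', 'C', 'B', 'A']})

# The Series indexed by 'c' is matched to the frame indexed by 'b' label by label.
result = left.set_index('b').assign(d=right.set_index('c')['d']).reset_index()

print(result['d'].tolist())  # ['A', 'B', 'C', 'D']
```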

BENY
