How to keep index when using pandas merge

Question

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?

In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3}, 
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3}, 
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')

EDIT: Switched to example code that can be easily reproduced

if you merge on a specific column, it is not clear which indices to use (in case they are both different). — bonobo, Aug 23 '18 at 20:56
It is pretty clear if you do a left or right merge for example. — Marses, Jul 16 '21 at 11:26

score 236 · Accepted Answer · edited Aug 27 '20 at 17:58

236

In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.

edited Aug 27 '20 at 17:58

Asclepius

49,954
14
144
128

answered Aug 16 '12 at 07:53

Wouter Overmeire

59,036
9
61
42

11

Very clever. a.merge(b, how="left").set_index(a.index) also works, but it seems less robust (since the first part of it loses the index values to a before it resets them.) – DanB Aug 16 '12 at 18:01
19

For this particular case, those are equivalent. But for many merge operations, the resulting frame has not the same number of rows than of the original `a` frame. reset_index moves the index to a regular column and set_index from this column after merge also takes care when rows of a are duplicated/removed due to the merge operation. – Wouter Overmeire Aug 16 '12 at 19:35
I didn't know why my solution would break down, but I figured it would. The reset_index() and then set_index() solution makes a lot of sense. Thanks. – DanB Aug 17 '12 at 04:33
2

@Wouter I'd love to know why a left merge will reindex by default. Where can I learn more? – Matthew Jun 08 '16 at 15:25
@Matthew see http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging – Wouter Overmeire Jun 09 '16 at 19:23
@DanB, I agree. The proposed soln will break when the index is not called `index`. But I do see your point, the `pandas.DataFrame.merge` could in principle rearrange rows (although my testing suggests that this does not happen). – Stephen McAteer Nov 10 '17 at 05:57
@DanB, on second thoughts, the default behavour of `pandas.DataFrame.merge` is `sort=False` so there shouldn't be an issue with your solution. – Stephen McAteer Nov 10 '17 at 06:07
9

Nice! To avoid explicitly specifying the index-name I use `a.reset_index().merge(b, how="left").set_index(a.index.names)`. – Truls Dec 08 '17 at 09:21
1

doesn't look very nice, though I guess that's how it should be done. I don't quite understand why they're dropping it in the first place, but there must be some reasons! – avloss Jun 17 '18 at 17:12
@WouterOvermeire Your elaboration in the comments about what `reset_index` does should be in the answer. I was going to try to incorporate the comment directly into the answer, but it didn't really work smoothly (especially the other details in the comment). So I think it would be better if you handle it. – jpmc26 Nov 08 '18 at 19:20
7

Pandas badly thought API strikes again. – Henry Henrinson Jul 25 '19 at 21:31
1

I still think pandas should keep the index for you (or at least give you the option to choose). There are many other "default" cases where indexes get duplicated; during a pd.concat for example. I don't think the current working is optimal. – Marses Aug 07 '19 at 12:30

Matthew Son · Answer 2 · 2019-11-14T18:03:04.673

16

You can make a copy of index on left dataframe and do merge.

a['copy_index'] = a.index
a.merge(b, how='left')

I found this simple method very useful while working with large dataframe and using pd.merge_asof() (or dd.merge_asof()).

This approach would be superior when resetting index is expensive (large dataframe).

edited Nov 14 '19 at 18:03

answered Jul 27 '19 at 21:12

Matthew Son

707
6
22

2

This is the best answer. There are many reasons why you would want to preserve your old indexes during a merge (and the accepted answer doesn't preserve indexes, it just resets them). It helps when you're trying to merge more than 2 dataframes, and so on... – Marses Aug 07 '19 at 12:24
upvoted but just be wary of a caveat, when using multi-index, your indices will be stored as a tuple in a single column called a[copy_index] – geekidharsh Nov 06 '19 at 06:19
What I am reading in the docs about `merge_asof` indicates it is not using the index to join, it is using the closes index to join. You also have to have your data sorted a certain way so the closest index joins properly. – bfmcneill Nov 12 '20 at 14:58
This is just a less elegant version of the `reset_index()` solution. @MartienLubberink is incorrect, as `reset_index()` stores the index as a column by default. – Migwell Apr 27 '22 at 04:12

score 10 · Answer 3 · edited Aug 27 '20 at 19:32

10

There is a non-pd.merge solution using Series.map and DataFrame.set_index.

In: a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2']))
In: a['col2']
Out:
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn't introduce a dummy index name for the index.

Note however that there is no DataFrame.map method, and so this approach is not for multiple columns.

edited Aug 27 '20 at 19:32

Asclepius

49,954
14
144
128

answered Sep 11 '17 at 17:33

Zero

66,763
15
141
151

2

This seems superior to the accepted answer as it will probably work better with edge cases like multi indexes. Can anyone comment on this? – BallpointBen Jan 17 '19 at 06:07
1

question, what if you need to assign multiple columns, would this approach work or is it limited to only 1 field? – Yuca Mar 11 '19 at 15:24
2

@Yuca: This possibly won't work with multiple columns, since when you subset multiple columns you end up with a `pd.Dataframe` and not a `pd.Series`. The `.map()` method is only defined for the `pd.Series`. This is to mean that: `a[['to_merge_on_1', 'to_merge_on_2']].map(...)` won't work. – Dataman Feb 13 '20 at 13:08
Brilliant. In my project we are using too many pandas tricks everywhere. This is very refreshing as it is straight forward and low level. Thank you! – F. Santiago Dec 06 '21 at 09:29

score 8 · Answer 4 · edited Aug 27 '20 at 19:34

8

df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This allows to preserve the index of df1

edited Aug 27 '20 at 19:34

Asclepius

49,954
14
144
128

answered Apr 26 '19 at 06:43

Supratik Majumdar

2,255
1
22
31

It seems to work, but when I use it with `on=list_of_cols]`, it contradicts the documentation: `If joining columns on columns, the DataFrame indexes *will be ignored*`. Is one of using indices vs. columns has precedence? – Itamar Katz Jan 22 '20 at 09:56
@Supratik Majumdar doesn't your suggestion assume the indexes of the dataframes already match? The OP has non-matching indexes and is merging/joining on columns. – James Apr 01 '22 at 17:05

lisrael1 · Answer 5 · 2022-04-04T20:03:33.983

2

another simple option is to rename the index to what was before:

a.merge(b, how="left").set_axis(a.index)

merge preserves the order at dataframe 'a', but just resets the index so it's safe to use set_axis

edited Apr 04 '22 at 20:03

answered May 04 '20 at 07:37

lisrael1

178
7

This didn't work for me – James Apr 01 '22 at 17:12

score 2 · Answer 6 · answered Sep 24 '21 at 07:49

2

Assuming that the resulting df has the same number of rows and order as your first df, you can do this:

c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index,inplace=True)

answered Sep 24 '21 at 07:49

Alicia

21
2

score 0 · Answer 7 · answered Jan 24 '20 at 19:11

Think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then removed the index name of Line Number so that it remains blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.

How to keep index when using pandas merge

7 Answers7

Linked