188

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?

In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3}, 
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3}, 
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')

EDIT: Switched to example code that can be easily reproduced

Zero
  • 66,763
  • 15
  • 141
  • 151
DanB
  • 3,306
  • 4
  • 20
  • 22

7 Answers7

236
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.

Asclepius
  • 49,954
  • 14
  • 144
  • 128
Wouter Overmeire
  • 59,036
  • 9
  • 61
  • 42
  • 11
    Very clever. a.merge(b, how="left").set_index(a.index) also works, but it seems less robust (since the first part of it loses the index values to a before it resets them.) – DanB Aug 16 '12 at 18:01
  • 19
    For this particular case, those are equivalent. But for many merge operations, the resulting frame has not the same number of rows than of the original `a` frame. reset_index moves the index to a regular column and set_index from this column after merge also takes care when rows of a are duplicated/removed due to the merge operation. – Wouter Overmeire Aug 16 '12 at 19:35
  • I didn't know why my solution would break down, but I figured it would. The reset_index() and then set_index() solution makes a lot of sense. Thanks. – DanB Aug 17 '12 at 04:33
  • 2
    @Wouter I'd love to know why a left merge will reindex by default. Where can I learn more? – Matthew Jun 08 '16 at 15:25
  • @Matthew see http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging – Wouter Overmeire Jun 09 '16 at 19:23
  • @DanB, I agree. The proposed soln will break when the index is not called `index`. But I do see your point, the `pandas.DataFrame.merge` could in principle rearrange rows (although my testing suggests that this does not happen). – Stephen McAteer Nov 10 '17 at 05:57
  • @DanB, on second thoughts, the default behavour of `pandas.DataFrame.merge` is `sort=False` so there shouldn't be an issue with your solution. – Stephen McAteer Nov 10 '17 at 06:07
  • 9
    Nice! To avoid explicitly specifying the index-name I use `a.reset_index().merge(b, how="left").set_index(a.index.names)`. – Truls Dec 08 '17 at 09:21
  • 1
    doesn't look very nice, though I guess that's how it should be done. I don't quite understand why they're dropping it in the first place, but there must be some reasons! – avloss Jun 17 '18 at 17:12
  • @WouterOvermeire Your elaboration in the comments about what `reset_index` does should be in the answer. I was going to try to incorporate the comment directly into the answer, but it didn't really work smoothly (especially the other details in the comment). So I think it would be better if you handle it. – jpmc26 Nov 08 '18 at 19:20
  • 7
    Pandas badly thought API strikes again. – Henry Henrinson Jul 25 '19 at 21:31
  • 1
    I still think pandas should keep the index for you (or at least give you the option to choose). There are many other "default" cases where indexes get duplicated; during a pd.concat for example. I don't think the current working is optimal. – Marses Aug 07 '19 at 12:30
16

You can make a copy of index on left dataframe and do merge.

a['copy_index'] = a.index
a.merge(b, how='left')

I found this simple method very useful while working with large dataframe and using pd.merge_asof() (or dd.merge_asof()).

This approach would be superior when resetting index is expensive (large dataframe).

Matthew Son
  • 707
  • 6
  • 22
  • 2
    This is the best answer. There are many reasons why you would want to preserve your old indexes during a merge (and the accepted answer doesn't preserve indexes, it just resets them). It helps when you're trying to merge more than 2 dataframes, and so on... – Marses Aug 07 '19 at 12:24
  • upvoted but just be wary of a caveat, when using multi-index, your indices will be stored as a tuple in a single column called a[copy_index] – geekidharsh Nov 06 '19 at 06:19
  • What I am reading in the docs about `merge_asof` indicates it is not using the index to join, it is using the closes index to join. You also have to have your data sorted a certain way so the closest index joins properly. – bfmcneill Nov 12 '20 at 14:58
  • This is just a less elegant version of the `reset_index()` solution. @MartienLubberink is incorrect, as `reset_index()` stores the index as a column by default. – Migwell Apr 27 '22 at 04:12
10

There is a non-pd.merge solution using Series.map and DataFrame.set_index.

In: a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2']))
In: a['col2']
Out:
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn't introduce a dummy index name for the index.

Note however that there is no DataFrame.map method, and so this approach is not for multiple columns.

Asclepius
  • 49,954
  • 14
  • 144
  • 128
Zero
  • 66,763
  • 15
  • 141
  • 151
  • 2
    This seems superior to the accepted answer as it will probably work better with edge cases like multi indexes. Can anyone comment on this? – BallpointBen Jan 17 '19 at 06:07
  • 1
    question, what if you need to assign multiple columns, would this approach work or is it limited to only 1 field? – Yuca Mar 11 '19 at 15:24
  • 2
    @Yuca: This possibly won't work with multiple columns, since when you subset multiple columns you end up with a `pd.Dataframe` and not a `pd.Series`. The `.map()` method is only defined for the `pd.Series`. This is to mean that: `a[['to_merge_on_1', 'to_merge_on_2']].map(...)` won't work. – Dataman Feb 13 '20 at 13:08
  • Brilliant. In my project we are using too many pandas tricks everywhere. This is very refreshing as it is straight forward and low level. Thank you! – F. Santiago Dec 06 '21 at 09:29
8
df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This allows to preserve the index of df1

Asclepius
  • 49,954
  • 14
  • 144
  • 128
Supratik Majumdar
  • 2,255
  • 1
  • 22
  • 31
  • It seems to work, but when I use it with `on=list_of_cols]`, it contradicts the documentation: `If joining columns on columns, the DataFrame indexes *will be ignored*`. Is one of using indices vs. columns has precedence? – Itamar Katz Jan 22 '20 at 09:56
  • @Supratik Majumdar doesn't your suggestion assume the indexes of the dataframes already match? The OP has non-matching indexes and is merging/joining on columns. – James Apr 01 '22 at 17:05
2

another simple option is to rename the index to what was before:

a.merge(b, how="left").set_axis(a.index)

merge preserves the order at dataframe 'a', but just resets the index so it's safe to use set_axis

lisrael1
  • 178
  • 7
2

Assuming that the resulting df has the same number of rows and order as your first df, you can do this:

c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index,inplace=True)
Alicia
  • 21
  • 2
0

Think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then removed the index name of Line Number so that it remains blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.