Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables based on both ShipNumber and TrackNumber. However, if I simply use merge in the following way:

tab1.merge(tab2, how='left', on=['ShipNumber', 'TrackNumber'])

then the values in both the ShipNumber and TrackNumber columns MUST match between the two tables for a row to be joined.
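
To make that concrete, here is a small runnable sketch with invented data (not my real tables) showing how the two-key merge leaves a row unmatched when only one of the keys agrees:

import pandas as pd

# Invented sample data shaped like my two tables.
tab1 = pd.DataFrame({'ShipNumber': ['S1', 'S2'],
                     'TrackNumber': ['T1', 'T2'],
                     'Quantity': [10, 20]})
tab2 = pd.DataFrame({'ShipNumber': ['S1', 'S2'],
                     'TrackNumber': ['T1', 'T9'],  # ShipNumber matches, TrackNumber does not
                     'AmountReceived': [5, 7]})

# Left merge on BOTH keys: AmountReceived is filled only where ShipNumber
# AND TrackNumber both match, so the S2 row comes back with NaN.
merged = tab1.merge(tab2, how='left', on=['ShipNumber', 'TrackNumber'])
print(merged)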

However, in my case, sometimes the ShipNumber values will match and sometimes the TrackNumber values will match; as long as one of the two matches for a row, I want the merge to happen.

In other words, if row 1's ShipNumber in tab 1 matches row 3's ShipNumber in tab 2, but the TrackNumber values for those two records do not match, I still want to match the two rows from the two tables.

So basically this is an either/or match condition (pseudocode):

if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge

I hope my question makes sense... Any help is really really appreciated!

As suggested, I looked into this post: Python pandas merge with OR logic. But I don't think it is quite the same issue: the OP in that post has a mapping file, so they can simply do two merges to solve it. I don't have a mapping file; rather, I have two dataframes with the same key columns (ShipNumber, TrackNumber).

cs95
alwaysaskingquestions

2 Answers

Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).

import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 4], 'B': [7, 8, 9, 5]})
df2 = pd.DataFrame({'A': [1, 5, 6, 4], 'B': [4, 1, 8, 5]})

df1         df2
   A  B        A  B
0  3  7     0  1  4
1  2  8     1  5  1
2  1  9     2  6  8
3  4  5     3  4  5

With these data frames we should see:

  • df1.loc[0] matches A on df2.loc[0]
  • df1.loc[1] matches B on df2.loc[2]
  • df1.loc[3] matches both A and B on df2.loc[3]

We'll use suffixes to keep track of what matched where:

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']

df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A), 
                df1.merge(df2, on='B', suffixes=suff_B)])

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.

dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
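
Applied to the original ShipNumber/TrackNumber tables, the same pattern would look roughly like this (the sample data and suffix names below are invented for illustration):

import pandas as pd

# Invented data shaped like the question's tables.
tab1 = pd.DataFrame({'ShipNumber': ['S1', 'S2', 'S3'],
                     'TrackNumber': ['T1', 'T2', 'T3'],
                     'Quantity': [10, 20, 30]})
tab2 = pd.DataFrame({'ShipNumber': ['S1', 'S9', 'S3'],
                     'TrackNumber': ['T8', 'T2', 'T3'],
                     'AmountReceived': [5, 6, 7]})

suff_ship = ['_ship_match_1', '_ship_match_2']
suff_track = ['_track_match_1', '_track_match_2']

# Merge once on each key and stack the results.
df = pd.concat([tab1.merge(tab2, on='ShipNumber', suffixes=suff_ship),
                tab1.merge(tab2, on='TrackNumber', suffixes=suff_track)])

# Rows where BOTH keys match appear twice (once per merge); drop the copy
# from the ShipNumber merge whose TrackNumbers also agree, keeping the other.
dups = df['TrackNumber' + suff_ship[0]] == df['TrackNumber' + suff_ship[1]]
df = df.loc[~dups]
print(df)
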
Max Segal
andrew_reece
  • I think you need drop_duplicates after concat to make sure you eliminate records that may have successful joins on both A and B. – Scott Boston Aug 24 '17 at 20:19
  • Good catch! I'll update with the edge case in my sample data. – andrew_reece Aug 24 '17 at 20:22
  • is concat the same thing as append in this case? – alwaysaskingquestions Aug 24 '17 at 22:29
  • Yes, you could also do `df1.merge(on='A').append(df1.merge(on='B'))`. I think `concat` is often faster. – andrew_reece Aug 24 '17 at 23:03
  • @andrew_reece can you explain why you only have one argument "on" for the merge in your code df1.merge(on='A').append(df1.merge(on='B'))? What does that mean? – alwaysaskingquestions Aug 25 '17 at 19:48
  • It's the equivalent of the `if ... then merge` pseudocode in your original post. If you're new to the idea, Pandas docs on [Merge, Join, and Concatenate](https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging) are comprehensive and readable. – andrew_reece Aug 25 '17 at 19:52

I would suggest this alternative way of doing the merge, which seems easier to me.

table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)

You can add the same column to table2 as well if needed and then use it in left_on or right_on, based on your requirement.
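
A minimal sketch of what that could look like with the coalesced key added to both tables (invented data, and it assumes rows are missing one of the two keys rather than carrying a mismatching value):

import pandas as pd

# Invented data where ShipNumber is sometimes missing.
table1 = pd.DataFrame({'ShipNumber': ['S1', None],
                       'TrackNumber': ['T1', 'T2'],
                       'Quantity': [10, 20]})
table2 = pd.DataFrame({'ShipNumber': [None, 'S1'],
                       'TrackNumber': ['T2', 'T9'],
                       'AmountReceived': [5, 7]})

# Coalesce the two keys into one merge key in both tables
# (fillna does the same job as the apply/lambda above).
table1['id_to_be_merged'] = table1['ShipNumber'].fillna(table1['TrackNumber'])
table2['id_to_be_merged'] = table2['ShipNumber'].fillna(table2['TrackNumber'])

merged = table1.merge(table2, how='left', on='id_to_be_merged', suffixes=('_1', '_2'))
print(merged)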

pratpor
  • `apply` and `lambda` are not for beginners, so without an explanation the answer is not as helpful as it could be. You could also make your code a bit more readable. Does it beat the method above in performance? – MikeMajara Jun 07 '19 at 12:23
  • I don't think this solves the problem. I think most of the rows for the question asker include both a valid ship number and a valid track number. Also, the question asker isn't looking to match table 1 ship numbers to table 2 track numbers or vice versa. It has to be `(table_1.ShipNumber == table2.ShipNumber) or (table_1.TrackNumber == table2.TrackNumber)`. Matching `table_1.ShipNumber == table_2.TrackNumber` would be a problem. – JDenman6 Sep 22 '20 at 14:11