Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables based on both ShipNumber and TrackNumber. However, if I simply use merge in the following way:

tab1.merge(tab2, how='left', on=['ShipNumber', 'TrackNumber'])

then the values in both the ShipNumber and TrackNumber columns MUST match between the two tables for a row to be joined.
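
To make that concrete, here is a small runnable sketch with invented data (not my real tables) showing how the two-key merge leaves a row unmatched when only one of the keys agrees:

import pandas as pd

# Invented sample data shaped like my two tables.
tab1 = pd.DataFrame({'ShipNumber': ['S1', 'S2'],
                     'TrackNumber': ['T1', 'T2'],
                     'Quantity': [10, 20]})
tab2 = pd.DataFrame({'ShipNumber': ['S1', 'S2'],
                     'TrackNumber': ['T1', 'T9'],  # ShipNumber matches, TrackNumber does not
                     'AmountReceived': [5, 7]})

# Left merge on BOTH keys: AmountReceived is filled only where ShipNumber
# AND TrackNumber both match, so the S2 row comes back with NaN.
merged = tab1.merge(tab2, how='left', on=['ShipNumber', 'TrackNumber'])
print(merged)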

However, in my case, sometimes the ShipNumber values will match and sometimes the TrackNumber values will match; as long as one of the two matches for a row, I want the merge to happen.

In other words, if row 1's ShipNumber in tab 1 matches row 3's ShipNumber in tab 2, but the TrackNumber values for those two records do not match, I still want to match the two rows from the two tables.

So basically this is an either/or match condition (pseudocode):

if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge

I hope my question makes sense... Any help is really really appreciated!

As suggested, I looked into this post: Python pandas merge with OR logic. But I don't think it is quite the same issue: the OP in that post has a mapping file, so they can simply do two merges to solve it. I don't have a mapping file; rather, I have two dataframes with the same key columns (ShipNumber, TrackNumber).

cs95
alwaysaskingquestions

2 Answers

Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).

import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 4], 'B': [7, 8, 9, 5]})
df2 = pd.DataFrame({'A': [1, 5, 6, 4], 'B': [4, 1, 8, 5]})

df1         df2
   A  B        A  B
0  3  7     0  1  4
1  2  8     1  5  1
2  1  9     2  6  8
3  4  5     3  4  5

With these data frames we should see:

  • df1.loc[0] matches A on df2.loc[0]
  • df1.loc[1] matches B on df2.loc[2]
  • df1.loc[3] matches both A and B on df2.loc[3]

We'll use suffixes to keep track of what matched where:

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']

df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A), 
                df1.merge(df2, on='B', suffixes=suff_B)])

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.

dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
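
Applied to the original ShipNumber/TrackNumber tables, the same pattern would look roughly like this (the sample data and suffix names below are invented for illustration):

import pandas as pd

# Invented data shaped like the question's tables.
tab1 = pd.DataFrame({'ShipNumber': ['S1', 'S2', 'S3'],
                     'TrackNumber': ['T1', 'T2', 'T3'],
                     'Quantity': [10, 20, 30]})
tab2 = pd.DataFrame({'ShipNumber': ['S1', 'S9', 'S3'],
                     'TrackNumber': ['T8', 'T2', 'T3'],
                     'AmountReceived': [5, 6, 7]})

suff_ship = ['_ship_match_1', '_ship_match_2']
suff_track = ['_track_match_1', '_track_match_2']

# Merge once on each key and stack the results.
df = pd.concat([tab1.merge(tab2, on='ShipNumber', suffixes=suff_ship),
                tab1.merge(tab2, on='TrackNumber', suffixes=suff_track)])

# Rows where BOTH keys match appear twice (once per merge); drop the copy
# from the ShipNumber merge whose TrackNumbers also agree, keeping the other.
dups = df['TrackNumber' + suff_ship[0]] == df['TrackNumber' + suff_ship[1]]
df = df.loc[~dups]
print(df)
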
Max Segal
andrew_reece
  • I think you need drop_duplicates after concat to make sure you eliminate records that may have successful joins on both A and B. – Scott Boston Aug 24 '17 at 20:19
  • Good catch! I'll update with the edge case in my sample data. – andrew_reece Aug 24 '17 at 20:22
  • is concat the same thing as append in this case? – alwaysaskingquestions Aug 24 '17 at 22:29
  • Yes, you could also do `df1.merge(on='A').append(df1.merge(on='B'))`. I think `concat` is often faster. – andrew_reece Aug 24 '17 at 23:03
  • @andrew_reece can you explain why you only have one argument "on" for the merge in your code df1.merge(on='A').append(df1.merge(on='B'))? What does that mean? – alwaysaskingquestions Aug 25 '17 at 19:48
  • It's the equivalent of the `if ... then merge` pseudocode in your original post. If you're new to the idea, Pandas docs on [Merge, Join, and Concatenate](https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging) are comprehensive and readable. – andrew_reece Aug 25 '17 at 19:52

I would suggest this alternative way of doing the merge, which seems easier to me.

table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)

You can add the same column to table2 as well if needed and then use it in left_on or right_on, based on your requirement.
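
A minimal sketch of what that could look like with the coalesced key added to both tables (invented data, and it assumes rows are missing one of the two keys rather than carrying a mismatching value):

import pandas as pd

# Invented data where ShipNumber is sometimes missing.
table1 = pd.DataFrame({'ShipNumber': ['S1', None],
                       'TrackNumber': ['T1', 'T2'],
                       'Quantity': [10, 20]})
table2 = pd.DataFrame({'ShipNumber': [None, 'S1'],
                       'TrackNumber': ['T2', 'T9'],
                       'AmountReceived': [5, 7]})

# Coalesce the two keys into one merge key in both tables
# (fillna does the same job as the apply/lambda above).
table1['id_to_be_merged'] = table1['ShipNumber'].fillna(table1['TrackNumber'])
table2['id_to_be_merged'] = table2['ShipNumber'].fillna(table2['TrackNumber'])

merged = table1.merge(table2, how='left', on='id_to_be_merged', suffixes=('_1', '_2'))
print(merged)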

pratpor
  • `apply` and `lambda` are not for beginners, so without an explanation the answer is not as helpful as it could be. You could also make your code a bit more readable. Does it beat the method above in performance? – MikeMajara Jun 07 '19 at 12:23
  • I don't think this solves the problem. I think most of the rows for the question asker include both a valid ship number and a valid track number. Also, the question asker isn't looking to match table 1 ship numbers to table 2 track numbers or vice versa. It has to be `(table_1.ShipNumber == table2.ShipNumber) or (table_1.TrackNumber == table2.TrackNumber)`. Matching `table_1.ShipNumber == table_2.TrackNumber` would be a problem. – JDenman6 Sep 22 '20 at 14:11