
How do I remove one pandas DataFrame from another, just like set subtraction:

a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]

Now we have two pandas DataFrames; how do we remove df2 from df1?

In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
   a  b
0  1  2
1  3  4
2  5  6


In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
   a  b
0  1  2
1  5  6

Then we expect df1-df2 result will be:

In [14]: df
Out[14]:
   a  b
0  3  4

How to do it?

Thank you.

176coding
  • Possible duplicate of [set difference for pandas](http://stackoverflow.com/questions/18180763/set-difference-for-pandas) – AKS May 19 '16 at 04:25
  • @176coding Please timeit our answers on your real datasets - it's interesting to me which is fastest. – knagaev May 19 '16 at 09:45

10 Answers


Solution

Use pd.concat followed by drop_duplicates(keep=False)

pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

It looks like

   a  b
1  3  4

Explanation

pd.concat stacks the DataFrames by appending one right after the other. Any overlap between them is then caught by the drop_duplicates method. However, drop_duplicates by default keeps the first occurrence and removes every other one. In this case, we want every duplicate removed, which is exactly what the keep=False parameter does.

A special note on the repeated df2: with only one copy of df2, any row of df2 that is not in df1 would not be considered a duplicate and would remain, so the single-df2 version only works when df2 is a subset of df1. By concatenating df2 twice, every one of its rows is guaranteed to be a duplicate and will subsequently be removed.
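As a sanity check, the whole recipe can be run end-to-end on the question's data (a minimal sketch):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# df2 is concatenated twice, so every row of df2 is guaranteed to be duplicated
result = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
print(result)
#    a  b
# 1  3  4
```

Note that the surviving row keeps its original df1 index label.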

piRSquared
  • thx, it works, and we can use `pd.concat(df1,df2).drop_duplicates(keep=False)` or `df1.append(df2).drop_duplicates(keep=False)` – 176coding May 19 '16 at 06:42
  • @176coding hopefully this answers your question. If not, let me know what remains unanswered and i'll do my best to address it. – piRSquared May 19 '16 at 06:56
  • @piRSquared Your answer isn't correct - you made a symmetric difference, not a (simple) difference. – knagaev May 19 '16 at 08:09
  • This does not work. If the df you concatenate has additional records that are NOT in the checked df, then they will be added to it... – clg4 Sep 01 '20 at 14:38
  • The primary dataframe is `df1`. I'm concatenating 3 dataframes, `df1` once and `df2` twice. Because `df2` is concatenated twice, by definition, everything in it will be duplicated. Therefore, dropping duplicates will leave NOTHING that was in `df2`. – piRSquared Sep 01 '20 at 17:20
  • Further, the only issue that this does not address is if there are existing duplicates in the initial dataframe. This assumes there are no duplicates in the initial dataframe. – piRSquared Sep 01 '20 at 17:24
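The caveat from the last comment can be demonstrated directly; `df1_dup` below is a hypothetical variant of the question's `df1` with a genuinely repeated row:

```python
import pandas as pd

# hypothetical df1 containing a legitimate duplicate row (3, 4)
df1_dup = pd.DataFrame([[1, 2], [3, 4], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# (3, 4) now duplicates itself, so keep=False wrongly removes it as well
result = pd.concat([df1_dup, df2, df2]).drop_duplicates(keep=False)
print(result.empty)  # True: every row was dropped
```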

You can use .duplicated, which has the benefit of being fairly expressive (note that this variant compares index labels, not row values):

%%timeit
combined = pd.concat([df1, df2])
combined[~combined.index.duplicated(keep=False)]

1000 loops, best of 3: 875 µs per loop

For comparison:

%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']

100 loops, best of 3: 4.57 ms per loop


%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

1000 loops, best of 3: 987 µs per loop


%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]

1000 loops, best of 3: 546 µs per loop

In sum, the np.array comparison is fastest; the .tolist() calls aren't needed there.

Stefan
  • Be careful: this only works if the subtracted DataFrame only contains data that is included in the first one. But I do like this answer. – Florian Fasmeyer Apr 28 '21 at 06:16

A set-logic approach: turn the rows of df1 and df2 into sets of tuples, then use set subtraction to build the new DataFrame.

idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)

pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

   a  b
0  3  4
piRSquared

To get a DataFrame with all records that are in DF1 but not in DF2:

DF=DF1[~DF1.isin(DF2)].dropna(how = 'all')
Pallavi Kalambe

My shot with a merge of df1 and df2 from the question.

Using the `indicator` parameter:

In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]: 
   a  b
1  3  4
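To make the mechanics visible, here is the same merge split into two steps (a sketch on the question's data); indicator=True adds a `_merge` column valued 'left_only', 'right_only' or 'both':

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# the '_merge' column records where each row came from:
# rows present in both frames are tagged 'both'; df1-only rows 'left_only'
merged = pd.merge(df1, df2, on=['a', 'b'], how='left', indicator=True)

# keep only the df1 rows that never matched a df2 row
result = df1.loc[merged['_merge'] == 'left_only']
```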
knagaev
  • An explanation of what is happening would make this a richer answer. You mentioned that the key to this method's success is the 'indicator' parameter, and setting that to True will add location information to each row, which your solution uses in the final step to filter, keeping only rows that appear only in the left data frame (indicator == 'left_only'). – Dannid Oct 30 '17 at 17:58

A masking approach

df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

   a  b
1  3  4
piRSquared

I think the first tolist() needs to be removed (leaving the bare .values attribute), but the second one kept:

df1[df1.apply(lambda x: x.values not in df2.values.tolist(), axis=1)]
Armali

The easiest option is to use indexes.

  1. Append df1 and df2 and reset the index:

     df = pd.concat([df1, df2])
     df.reset_index(drop=True, inplace=True)

  2. Find the indexes of the rows that came from df2, then drop them:

     indexes_df2 = df.index[(df["a"].isin(df2["a"])) & (df["b"].isin(df2["b"]))]
     result_data = df.drop(indexes_df2)

Hope this helps new readers, even though the question was posted a while ago :)

Akhan

Solution if df1 contains duplicates + keeps the index.

A modified version of piRSquared's answer to keep the duplicates in df1 that do not appear in df2, while maintaining the index.

df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]

If your dataframes are big, you may want to store the result of

pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)

in a variable before the df1.apply call.
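A quick check of this variant, using a hypothetical df1 that contains a duplicated row (3, 4) not present in df2:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# compute the set difference once, then test each df1 row against it
diff = pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
result = df1[df1.apply(lambda x: (x == diff).all(1).any(), axis=1)]
print(result)
#    a  b
# 1  3  4
# 2  3  4
```

Both copies of (3, 4) survive, and the original index labels are preserved.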

RiveN

This solution works when your df_to_drop is a subset of the main DataFrame data, taken from it so the index labels match (drop removes rows by index label, not by value).

data_clean = data.drop(df_to_drop.index)
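A sketch of the intended use on the question's data; here df_to_drop is carved out of data itself, so its index labels are guaranteed to exist in data:

```python
import pandas as pd

data = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])

# a subset selected from `data`, so it shares index labels with `data`
df_to_drop = data[data['a'].isin([1, 5])]

# drop removes rows by index label
data_clean = data.drop(df_to_drop.index)
print(data_clean)
#    a  b
# 1  3  4
```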