
How do I remove one pandas DataFrame from another, just like set subtraction:

a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]

Now we have two pandas DataFrames; how do we remove df2 from df1?

In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
   a  b
0  1  2
1  3  4
2  5  6


In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
   a  b
0  1  2
1  5  6

Then we expect df1-df2 result will be:

In [14]: df
Out[14]:
   a  b
0  3  4

How to do it?

Thank you.

176coding
  • Possible duplicate of [set difference for pandas](http://stackoverflow.com/questions/18180763/set-difference-for-pandas) – AKS May 19 '16 at 04:25
  • @176coding Please timeit our answers on your real datasets - it's interesting to me which is fastest. – knagaev May 19 '16 at 09:45

10 Answers


Solution

Use pd.concat followed by drop_duplicates(keep=False)

pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

It looks like

   a  b
1  3  4

Explanation

pd.concat stacks the DataFrames by appending one right after the other. Any overlap between them is then caught by the drop_duplicates method. However, drop_duplicates by default keeps the first occurrence and removes every other one. In this case, we want every duplicate removed, which is exactly what the keep=False parameter does.

A special note on the repeated df2: with only one copy of df2, any row of df2 that is not in df1 would not be considered a duplicate and would remain, so the single-df2 version only works when df2 is a subset of df1. By concatenating df2 twice, every one of its rows is guaranteed to be a duplicate and will subsequently be removed.
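As a sanity check, the whole recipe can be run end-to-end on the question's data (a minimal sketch):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# df2 is concatenated twice, so every row of df2 is guaranteed to be duplicated
result = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
print(result)
#    a  b
# 1  3  4
```

Note that the surviving row keeps its original df1 index label.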

piRSquared
  • thx, it works, and we can use `pd.concat(df1,df2).drop_duplicates(keep=False)` or `df1.append(df2).drop_duplicates(keep=False)` – 176coding May 19 '16 at 06:42
  • @176coding hopefully this answers your question. If not, let me know what remains unanswered and i'll do my best to address it. – piRSquared May 19 '16 at 06:56
  • @piRSquared Your answer isn't correct - you made a symmetric difference, not a (simple) difference. – knagaev May 19 '16 at 08:09
  • This does not work. If the df you concatenate has additional records that are NOT in the checked df, then they will be added to it... – clg4 Sep 01 '20 at 14:38
  • The primary dataframe is `df1`. I'm concatenating 3 dataframes, `df1` once and `df2` twice. Because `df2` is concatenated twice, by definition, everything in it will be duplicated. Therefore, dropping duplicates will leave NOTHING that was in `df2`. – piRSquared Sep 01 '20 at 17:20
  • Further, the only issue that this does not address is if there are existing duplicates in the initial dataframe. This assumes there are no duplicates in the initial dataframe. – piRSquared Sep 01 '20 at 17:24
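The caveat from the last comment can be demonstrated directly; `df1_dup` below is a hypothetical variant of the question's `df1` with a genuinely repeated row:

```python
import pandas as pd

# hypothetical df1 containing a legitimate duplicate row (3, 4)
df1_dup = pd.DataFrame([[1, 2], [3, 4], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# (3, 4) now duplicates itself, so keep=False wrongly removes it as well
result = pd.concat([df1_dup, df2, df2]).drop_duplicates(keep=False)
print(result.empty)  # True: every row was dropped
```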

You can use .duplicated, which has the benefit of being fairly expressive (note that this variant compares index labels, not row values):

%%timeit
combined = pd.concat([df1, df2])
combined[~combined.index.duplicated(keep=False)]

1000 loops, best of 3: 875 µs per loop

For comparison:

%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']

100 loops, best of 3: 4.57 ms per loop


%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

1000 loops, best of 3: 987 µs per loop


%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]

1000 loops, best of 3: 546 µs per loop

In sum, the np.array comparison is fastest; the .tolist() calls aren't needed there.

Stefan
  • Be careful: this only works if the subtracted DataFrame only contains data that is included in the first one. But I do like this answer. – Florian Fasmeyer Apr 28 '21 at 06:16

A set-logic approach: turn the rows of df1 and df2 into sets of tuples, then use set subtraction to build the new DataFrame.

idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)

pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

   a  b
0  3  4
piRSquared

To get a DataFrame with all records that are in DF1 but not in DF2:

DF=DF1[~DF1.isin(DF2)].dropna(how = 'all')
Pallavi Kalambe

My shot with a merge of df1 and df2 from the question.

Using the `indicator` parameter:

In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]: 
   a  b
1  3  4
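To make the mechanics visible, here is the same merge split into two steps (a sketch on the question's data); indicator=True adds a `_merge` column valued 'left_only', 'right_only' or 'both':

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# the '_merge' column records where each row came from:
# rows present in both frames are tagged 'both'; df1-only rows 'left_only'
merged = pd.merge(df1, df2, on=['a', 'b'], how='left', indicator=True)

# keep only the df1 rows that never matched a df2 row
result = df1.loc[merged['_merge'] == 'left_only']
```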
knagaev
  • An explanation of what is happening would make this a richer answer. You mentioned that the key to this method's success is the 'indicator' parameter, and setting that to True will add location information to each row, which your solution uses in the final step to filter, keeping only rows that appear only in the left data frame (indicator == 'left_only'). – Dannid Oct 30 '17 at 17:58

A masking approach

df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

   a  b
1  3  4
piRSquared

I think the first tolist() needs to be removed (leaving the bare .values attribute), but the second one kept:

df1[df1.apply(lambda x: x.values not in df2.values.tolist(), axis=1)]
Armali

The easiest option is to use indexes.

  1. Append df1 and df2 and reset the index:

     df = pd.concat([df1, df2])
     df.reset_index(drop=True, inplace=True)

  2. Find the indexes of the rows that came from df2, then drop them:

     indexes_df2 = df.index[(df["a"].isin(df2["a"])) & (df["b"].isin(df2["b"]))]
     result_data = df.drop(indexes_df2)

Hope this helps new readers, even though the question was posted a while ago :)

Akhan

Solution if df1 contains duplicates + keeps the index.

A modified version of piRSquared's answer to keep the duplicates in df1 that do not appear in df2, while maintaining the index.

df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]

If your dataframes are big, you may want to store the result of

pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)

in a variable before the df1.apply call.
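A quick check of this variant, using a hypothetical df1 that contains a duplicated row (3, 4) not present in df2:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# compute the set difference once, then test each df1 row against it
diff = pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
result = df1[df1.apply(lambda x: (x == diff).all(1).any(), axis=1)]
print(result)
#    a  b
# 1  3  4
# 2  3  4
```

Both copies of (3, 4) survive, and the original index labels are preserved.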

RiveN

This solution works when your df_to_drop is a subset of the main DataFrame data, taken from it so the index labels match (drop removes rows by index label, not by value).

data_clean = data.drop(df_to_drop.index)
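A sketch of the intended use on the question's data; here df_to_drop is carved out of data itself, so its index labels are guaranteed to exist in data:

```python
import pandas as pd

data = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])

# a subset selected from `data`, so it shares index labels with `data`
df_to_drop = data[data['a'].isin([1, 5])]

# drop removes rows by index label
data_clean = data.drop(df_to_drop.index)
print(data_clean)
#    a  b
# 1  3  4
```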