
Say my dataframe is:

df = pandas.DataFrame([[[1,0]],[[0,0]],[[1,0]]])

which yields:

        0
0  [1, 0]
1  [0, 0]
2  [1, 0]

I want to drop duplicates and keep only the elements [1, 0] and [0, 0]. If I write:

df.drop_duplicates()

I get the following error: TypeError: unhashable type: 'list'

How can I call drop_duplicates()?

More generally:

df = pandas.DataFrame([[[1,0],"a"],[[0,0],"b"],[[1,0],"c"]], columns=["list", "letter"])

And I want to call df["list"].drop_duplicates(), so that drop_duplicates applies to a Series and not to a DataFrame. How can I do that?

user

4 Answers


You can use the numpy.unique() function:

>>> import numpy as np
>>> df = pandas.DataFrame([[[1,0]],[[0,0]],[[1,0]]])
>>> pandas.DataFrame(np.unique(df), columns=df.columns)
        0
0  [0, 0]
1  [1, 0]

If you want to preserve the order, check out: numpy.unique with order preserved
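As a sketch of that order-preserving variant (assuming the lists are first tuplized so they sort and compare cleanly; this is an illustration, not the linked answer's exact code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[[1, 0]], [[0, 0]], [[1, 0]]])

# np.unique returns a sorted result; return_index additionally gives the
# position of each value's first occurrence, which lets us restore the
# original first-appearance order afterwards.
arr = df[0].map(tuple).to_numpy()
vals, first_idx = np.unique(arr, return_index=True)
ordered = [list(t) for t in vals[np.argsort(first_idx)]]

result = pd.DataFrame({0: ordered})
```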

Mazdak

drop_duplicates

Call drop_duplicates on tuplized data:

df[0].apply(tuple).drop_duplicates().apply(list).to_frame()

        0
0  [1, 0]
1  [0, 0]

collections.OrderedDict

However, I'd much prefer something that doesn't involve apply...

from collections import OrderedDict
pd.Series(map(
    list, OrderedDict.fromkeys(map(tuple, df[0].tolist()))
)).to_frame()

Or,

pd.Series(
    list(k) for k in OrderedDict.fromkeys(map(tuple, df[0].tolist()))
).to_frame()

        0
0  [1, 0]
1  [0, 0]
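The same tuplizing idea extends to the asker's two-column frame: build a boolean mask on the tuplized "list" column and use it to filter whole rows (a sketch using map, under the same assumptions as above):

```python
import pandas as pd

df = pd.DataFrame(
    [[[1, 0], "a"], [[0, 0], "b"], [[1, 0], "c"]],
    columns=["list", "letter"],
)

# Tuples are hashable, so duplicated() works on the tuplized column;
# negating the mask keeps the first occurrence of each list, along
# with the matching "letter" values.
result = df[~df["list"].map(tuple).duplicated()]
```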
cs95
  • Why would you prefer something that doesn't involve apply? The code looks much more readable with apply. – wordsforthewise Jan 18 '20 at 17:08
  • @wordsforthewise the answer to that question is long but it is here: https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code – cs95 Jan 18 '20 at 19:05

Here is one way: turn your Series of lists into separate columns, and keep only the non-duplicated rows:

df[~df[0].apply(pandas.Series).duplicated()]

        0
0  [1, 0]
1  [0, 0]

Explanation:

df[0].apply(pandas.Series) returns:

   0  1
0  1  0
1  0  0
2  1  0

From which you can find duplicates:

>>> df[0].apply(pandas.Series).duplicated()
0    False
1    False
2     True
dtype: bool

And finally, index df with the negation of that mask.

sacuL

I tried the other answers, but they didn't solve my case (a large dataframe with multiple list columns).

I solved it this way:

df = df[~df.astype(str).duplicated()]
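For instance, on a hypothetical frame with two list columns (the column names a and b are invented for illustration):

```python
import pandas as pd

# Hypothetical frame with two list columns, mirroring the
# multiple-list-column case described above.
df = pd.DataFrame({
    "a": [[1, 0], [0, 0], [1, 0]],
    "b": [[2], [3], [2]],
})

# astype(str) renders every cell as a string, so whole rows can be
# compared for duplicates without hashing the lists themselves.
deduped = df[~df.astype(str).duplicated()]
```

Note that rows are compared by their string representations, so any two values that print identically are treated as duplicates.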
Andreas