
I have a pandas DataFrame with string columns and float columns. I would like to use drop_duplicates to remove duplicate rows, but some of the duplicates are not exactly identical because of slight differences in the low decimal places of the floats. How can I drop duplicates with reduced precision?

Example:

import pandas as pd
df = pd.DataFrame.from_dict({'text': ['aaa','aaa','aaa','bb'], 'result': [1.000001,1.000000,2,2]})
df
     result text
0  1.000001  aaa
1  1.000000  aaa
2  2.000000  aaa
3  2.000000   bb

I would like to get:

df_out = pd.DataFrame.from_dict({'text': ['aaa','aaa','bb'], 'result': [1.000001,2,2]})
df_out
     result text
0  1.000001  aaa
1  2.000000  aaa
2  2.000000   bb
Make42
  • Binning is an overcomplicated solution for this problem, but I'll share a link anyway: https://chrisalbon.com/python/pandas_binning_data.html – Joe Frambach May 29 '17 at 14:51
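
For completeness, a minimal sketch of what the binning approach from that comment could look like; the bin edges here are hand-picked assumptions for this toy data, not part of the original comment:

import pandas as pd

df = pd.DataFrame.from_dict({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                             'result': [1.000001, 1.000000, 2, 2]})

# Map each float to a coarse interval, then drop rows whose
# (binned result, text) pair has already been seen.
bins = pd.cut(df['result'], bins=[0, 1.5, 2.5])
df[~df.assign(result=bins).duplicated()]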

3 Answers


Round them:

df.loc[df.round().drop_duplicates().index]

     result text
0  1.000001  aaa
2  2.000000  aaa
3  2.000000   bb
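
Note that round() defaults to zero decimal places. If you only want to collapse tiny differences, pass a decimals argument; the value 5 below is an assumed tolerance, not from the original answer:

# Round to 5 decimals for the comparison only; .loc then keeps
# the original, unrounded rows that survive drop_duplicates.
df.loc[df.round(5).drop_duplicates().index]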
Steven G

You can use the round function with a given precision to round your DataFrame.

DataFrame.round(decimals=0, *args, **kwargs)

Round a DataFrame to a variable number of decimal places.

For example, you can round to two decimal places like this:

df = df.round(2)

You can also apply it to specific columns only, for example:

df = df.round({'result': 2})

After the rounding you can use drop_duplicates, as sketched below.
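
Putting the two steps together, a minimal sketch; note this keeps the rounded values, so to keep the originals, reuse the surviving index as in the answer above:

import pandas as pd

df = pd.DataFrame.from_dict({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                             'result': [1.000001, 1.000000, 2, 2]})

# Round only the float column for comparison, then drop duplicate rows.
df_out = df.round({'result': 2}).drop_duplicates()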

omri_saadon

Use numpy.trunc to truncate to the precision you are looking for, and pandas duplicated to find which rows to keep.

import numpy as np

df[~df.assign(result=np.trunc(df.result.values * 100)).duplicated()]
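
The multiplier sets the precision (100 keeps two decimal places), and because duplicated is evaluated on a truncated copy, the surviving rows keep their original values. On the example DataFrame above this yields:

     result text
0  1.000001  aaa
2  2.000000  aaa
3  2.000000   bb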
piRSquared