
I have a pandas DataFrame with string columns and float columns. I would like to use drop_duplicates to remove duplicate rows, but some of the duplicates are not exactly identical because of slight differences in the low decimal places of the floats. How can I drop duplicates with reduced precision?

Example:

import pandas as pd
df = pd.DataFrame.from_dict({'text': ['aaa','aaa','aaa','bb'], 'result': [1.000001,1.000000,2,2]})
df
     result text
0  1.000001  aaa
1  1.000000  aaa
2  2.000000  aaa
3  2.000000   bb

I would like to get:

df_out = pd.DataFrame.from_dict({'text': ['aaa','aaa','bb'], 'result': [1.000001,2,2]})
df_out
     result text
0  1.000001  aaa
1  2.000000  aaa
2  2.000000   bb
Make42
  • Binning is an overcomplicated solution for this problem, but I'll share a link anyway: https://chrisalbon.com/python/pandas_binning_data.html – Joe Frambach May 29 '17 at 14:51
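
For completeness, a minimal sketch of what the binning approach from that comment could look like; the bin edges here are hand-picked assumptions for this toy data, not part of the original comment:

import pandas as pd

df = pd.DataFrame.from_dict({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                             'result': [1.000001, 1.000000, 2, 2]})

# Map each float to a coarse interval, then drop rows whose
# (binned result, text) pair has already been seen.
bins = pd.cut(df['result'], bins=[0, 1.5, 2.5])
df[~df.assign(result=bins).duplicated()]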

3 Answers


Round them:

df.loc[df.round().drop_duplicates().index]

     result text
0  1.000001  aaa
2  2.000000  aaa
3  2.000000   bb
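
Note that round() defaults to zero decimal places. If you only want to collapse tiny differences, pass a decimals argument; the value 5 below is an assumed tolerance, not from the original answer:

# Round to 5 decimals for the comparison only; .loc then keeps
# the original, unrounded rows that survive drop_duplicates.
df.loc[df.round(5).drop_duplicates().index]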
Steven G

You can use the round function with a given precision to round your DataFrame.

DataFrame.round(decimals=0, *args, **kwargs)

Round a DataFrame to a variable number of decimal places.

For example, you can round to two decimal places like this:

df = df.round(2)

You can also apply it to specific columns only, for example:

df = df.round({'result': 2})

After the rounding you can use drop_duplicates, as sketched below.
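
Putting the two steps together, a minimal sketch; note this keeps the rounded values, so to keep the originals, reuse the surviving index as in the answer above:

import pandas as pd

df = pd.DataFrame.from_dict({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                             'result': [1.000001, 1.000000, 2, 2]})

# Round only the float column for comparison, then drop duplicate rows.
df_out = df.round({'result': 2}).drop_duplicates()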

omri_saadon

Use numpy.trunc to truncate to the precision you are looking for, and pandas duplicated to find which rows to keep.

import numpy as np

df[~df.assign(result=np.trunc(df.result.values * 100)).duplicated()]
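
The multiplier sets the precision (100 keeps two decimal places), and because duplicated is evaluated on a truncated copy, the surviving rows keep their original values. On the example DataFrame above this yields:

     result text
0  1.000001  aaa
2  2.000000  aaa
3  2.000000   bb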
piRSquared