
Perhaps I'm being really dumb - I hope I am and that this is a quick fix.

Example of the CSV I'm working with:

32ED87BDB5FDC5E9CBA88547376818D4:24230577
C22B315C040AE6E0EFEE3518D830362B:8012567
2D20D252A479F485CDF5E171D93985BF:3993346
8846F7EAEE8FB117AD06BDD830B7586C:3861493
2D7F1A5A61D3A96FB5159B5EEF17ADC6:3184337
...
..
.

My Python code is:

import sys
import pandas as pd

df = pd.read_csv(sys.argv[1],delimiter=":",index_col=False,header=0, usecols=['hash'], chunksize=1000000).drop_duplicates(keep='First')

print(df)

I'm using a "test.csv" with only a handful of rows, but the original CSV has over 653 million rows and is about 22 GB in size. That's why I have the whole 'chunksize' bit.
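
Conceptually, all I want is what a small-file version would do: read the file, keep only the hash column, and drop duplicates. Something like this rough sketch (the names=["hash", "count"] part is just my guess at sensible column names, since the raw file doesn't actually have a header row):

import sys
import pandas as pd

# Small-file version of what I want: read, keep the hash column, dedupe.
df = pd.read_csv(
    sys.argv[1],
    delimiter=":",
    header=None,              # the raw file has no header row
    names=["hash", "count"],  # assumed column names, just for this sketch
    usecols=["hash"],
)
unique_hashes = df.drop_duplicates(keep="first")
print(unique_hashes)

The problem is that this only works if the whole file fits in memory, which mine doesn't - hence the chunksize attempt above.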

Anyway... my actual issue is that when it gets to print(df), I get the following error:

AttributeError: 'TextFileReader' object has no attribute 'drop_duplicates'
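
The error seems to say that df isn't actually a DataFrame at that point. A quick check I ran on the small test file (just a probe, not anything I've confirmed in the docs):

import pandas as pd

# Same kind of call as above, but only to see what read_csv hands back
# when chunksize is set.
reader = pd.read_csv("test.csv", delimiter=":", header=None, chunksize=1000000)
print(type(reader))  # shows a TextFileReader (exact module path depends on the pandas version)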

Nobody has mentioned this on any of the sites I've reviewed. What's up with that?!
