I have a y.csv file. The file size is 10 MB and it contains data from Jan 2020 to May 2020.
I also have a separate file with detailed data for each month, e.g. data-2020-01.csv; each monthly file is around 1 GB.
I split y.csv by month and then process each month's rows by loading the corresponding monthly file. This becomes too slow when I process a large number of months, e.g. 24 months.
I would like to process the data faster. I have access to an AWS m6i.8xlarge instance, which has 32 vCPUs and 128 GB of memory.
I'm new to multiprocessing, so can someone guide me here?
This is my current code:
import pandas as pd

periods = [(2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]

y = pd.read_csv("y.csv", index_col=0, parse_dates=True).fillna(0)  # Filesize: ~10 MB


def process(_month_df, _index):
    # Find the row in the monthly file whose timestamp is closest to the timestamp from y
    idx = _month_df.index[_month_df.index.get_loc(_index, method='nearest')]
    for _, value in _month_df.loc[idx:].itertuples():
        up_delta = 200
        down_delta = 200

        up_value = value + up_delta
        down_value = value - down_delta

        if value > up_value:
            y.loc[_index, "result"] = 1
            return

        if value < down_value:
            y.loc[_index, "result"] = 0
            return


for x in periods:
    filename = "data-" + str(x[0]) + "-" + str(x[1]).zfill(2)  # data-2020-01
    filtered_y = y[(y.index.month == x[1]) & (y.index.year == x[0])]  # Only get the current month's records
    month_df = pd.read_csv(f'{filename}.csv', index_col=0, parse_dates=True)  # Filesize: ~1 GB (data-2020-01.csv)

    for index, row in filtered_y.iterrows():
        process(month_df, index)
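
The kind of structure I have in mind is a minimal, untested sketch along these lines: handle each month in its own worker process with multiprocessing.Pool, and have workers return {timestamp: result} dicts that the parent merges back into y, since worker processes don't share the parent's memory. The process_month helper, the pool size, and the task tuples are just illustrative; the per-row logic is copied unchanged from the code above.

import pandas as pd
from multiprocessing import Pool

periods = [(2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]

y = pd.read_csv("y.csv", index_col=0, parse_dates=True).fillna(0)


def process_month(args):
    """Load one monthly file and return {timestamp: result} for that month,
    instead of writing into the global y (workers don't share parent memory)."""
    year, month, timestamps = args
    filename = f"data-{year}-{month:02d}.csv"
    month_df = pd.read_csv(filename, index_col=0, parse_dates=True)

    results = {}
    for ts in timestamps:
        # Same per-row logic as in the code above
        idx = month_df.index[month_df.index.get_loc(ts, method='nearest')]
        for _, value in month_df.loc[idx:].itertuples():
            up_value = value + 200
            down_value = value - 200
            if value > up_value:
                results[ts] = 1
                break
            if value < down_value:
                results[ts] = 0
                break
    return results


if __name__ == "__main__":
    # Build one task per month: (year, month, timestamps from y in that month)
    tasks = [
        (year, month, list(y[(y.index.year == year) & (y.index.month == month)].index))
        for year, month in periods
    ]

    # One worker per month, capped to keep memory in check
    # (each worker loads a ~1 GB CSV into memory).
    with Pool(processes=min(len(tasks), 8)) as pool:
        for month_results in pool.map(process_month, tasks):
            for ts, result in month_results.items():
                y.loc[ts, "result"] = result

Is something like this the right direction? The if __name__ == "__main__": guard is there because spawn-based platforms re-import the module in each worker, and I'm unsure how to pick the pool size given that each worker loads a ~1 GB CSV.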