While running some anomaly detection functions (like ABOD or SVM) I often got SIGKILL errors, so after reading various posts on here I decided to optimise my code. I have already downcast the dataframe values to int16, which helped a lot, but I am interested in other ways of reducing memory usage.
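For context, the downcasting I already did looks roughly like this (the helper name is just illustrative):

    import pandas as pd

    def downcast_ints(df: pd.DataFrame) -> pd.DataFrame:
        # downcast every numeric column to the smallest integer type
        # that fits (for my data everything ends up as int16)
        for col in df.select_dtypes(include='number').columns:
            df[col] = pd.to_numeric(df[col], downcast='integer')
        return df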
I found this statement, with a promising number of upvotes, in the comments on this post:
Actually calling gc.collect() yourself at the end of a loop can help avoid fragmenting memory, which in turn helps keep performance up. I've seen this make a significant difference (~20% runtime IIRC)
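As I understand the comment, the intended pattern is something like this (the loop and process() are just placeholders, not my actual code):

    import gc

    for item in items:
        process(item)   # whatever work the loop iteration does
        gc.collect()    # force a collection at the end of each iteration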
The only thing is, I don't really understand where exactly it makes sense for me to use it. I have plenty of functions with loops, e.g.:
    import gzip
    import re
    from glob import iglob
    from itertools import islice

    import numpy as np
    import pandas as pd

    def getLogsData(path, slice=58134):
        """
        Opens & reads log .gz files in path's directories and subdirectories
        and saves the data in a dataframe.
        :param path: path to the folder with logs
        :param slice: the number of files opened in the folder
        :return: df: pandas dataframe with logs data
        """
        # initialise logs df
        df = pd.DataFrame(columns=['timestamp', 'id', 'billed_duration',
                                   'max_memory_size_used', 'init_duration'])
        # initialise rows counter
        counter = 0
        # collect the paths of all .gz files (recursive=True walks subdirectories)
        file_list = [f for f in iglob(path, recursive=True) if f.endswith('.gz')]
        # slice to reduce memory
        file_list = list(islice(file_list, slice))
        for file in file_list:
            # context manager so the file handle is closed after reading
            with gzip.open(file, 'rb') as f:
                file_content = f.read().decode('utf-8')
            for line in file_content.splitlines():
                if re.search('REPORT', line):
                    tokens = line.split()
                    timestamp = tokens[0]
                    id = tokens[3]
                    billed_duration = tokens[9]
                    max_memory_size_used = tokens[18]
                    try:
                        init_duration = tokens[22]
                    except IndexError:
                        init_duration = np.nan
                    df.loc[counter] = [timestamp, id, billed_duration,
                                       max_memory_size_used, init_duration]
                    counter += 1
        return df
BUT: when I tried adding gc.collect() at different places in this function (after the first loop, after the second loop, ...) and timed it with time.perf_counter(), the results came out very differently. Normally it takes about 12 minutes to load the whole df (slice=58134); with gc.collect() I stopped the process after 20 minutes with no result. If I set slice to around 500 or so, gc.collect() does indeed make the runtime 5-10 seconds faster. So I am a bit confused about how to use gc.collect() and where it makes the most sense.
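For reference, this is roughly how I timed each variant (the path pattern is illustrative; for the different runs I placed gc.collect() inside getLogsData, at the end of the outer file loop or the inner line loop):

    import time

    start = time.perf_counter()
    df = getLogsData('logs/**', slice=500)   # small slice for testing
    print(f'loaded {len(df)} rows in {time.perf_counter() - start:.2f} s')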