
While running some anomaly detection functions (like ABOD or SVM) I got a SIGKILL error pretty often, so after reading various posts on here I decided to optimise my code. I already downcast the dataframe values to int16, which helped a lot, but I am interested in other ways of reducing memory usage.
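For reference, the downcasting looks roughly like this (the column names and values are just stand-ins for my logs dataframe; astype('int16') assumes every value fits into the int16 range):

import pandas as pd

# dummy frame standing in for my logs dataframe
df = pd.DataFrame({'billed_duration': [300, 500],
                   'max_memory_size_used': [128, 256]})

# downcast to int16; assumes every value fits into [-32768, 32767]
for col in ['billed_duration', 'max_memory_size_used']:
    df[col] = df[col].astype('int16')

# alternatively, let pandas pick the smallest safe integer type
df['billed_duration'] = pd.to_numeric(df['billed_duration'], downcast='integer')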

I found the following statement, with a promising number of upvotes, in the comments to this post:

Actually calling gc.collect() yourself at the end of a loop can help avoid fragmenting memory, which in turn helps keep performance up. I've seen this make a significant difference (~20% runtime IIRC)
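If I understand the comment correctly, the suggested pattern is something like this (process() and chunks are just placeholders for the per-iteration work):

import gc

def process(chunk):
    # placeholder for the actual per-chunk work
    return sum(chunk)

chunks = [list(range(1000)) for _ in range(10)]   # dummy data

for chunk in chunks:
    process(chunk)
    gc.collect()   # force a full collection at the end of each iteration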

The only thing is, I don't really understand where exactly it makes sense for me to use it. I have plenty of functions with loops, e.g.:

import gzip
import re
from glob import iglob
from itertools import islice

import numpy as np
import pandas as pd


def getLogsData(path, slice=58134):
    """
    Opens and reads .gz log files in path's directories and subdirectories
    and collects the data in a dataframe.

    :param path: glob pattern for the folder with logs
    :param slice: the maximum number of files to open
    :return: df: pandas dataframe with logs data
    """
    # initialise logs df
    df = pd.DataFrame(columns=['timestamp', 'id', 'billed_duration',
                               'max_memory_size_used', 'init_duration'])

    # initialise rows counter
    counter = 0

    # collect absolute paths of all .gz files
    file_list = [f for f in iglob(path, recursive=True) if f.endswith('.gz')]

    # take only the first `slice` files to reduce memory
    file_list = list(islice(file_list, slice))

    for file in file_list:
        # context manager closes the file handle after reading
        with gzip.open(file, 'rt', encoding='utf-8') as f:
            file_content = f.read()

        for line in file_content.splitlines():
            if re.search('REPORT', line):
                tokens = line.split()

                timestamp = tokens[0]
                log_id = tokens[3]
                billed_duration = tokens[9]
                max_memory_size_used = tokens[18]

                try:
                    init_duration = tokens[22]
                except IndexError:
                    init_duration = np.nan

                df.loc[counter] = [timestamp, log_id, billed_duration,
                                   max_memory_size_used, init_duration]
                counter += 1

    return df
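For completeness, I call it like this (the glob pattern is only an example; my real path differs):

df = getLogsData('logs/**', slice=58134)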

BUT: when I tried adding gc.collect() at different places in this function (after the first loop, after the second loop, ...) and timing it with time.perf_counter(), the results came out very different. Usually it takes about 12 minutes to load the whole df (slice=58134); with gc.collect() I stopped the process after 20 minutes with no result. If I set slice to around 500 or so, it does indeed make the runtime 5-10 seconds faster. So I am a bit confused about how to use gc.collect() and where it makes the most sense to use it.
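For reference, this is roughly how I timed each variant (slice=500 as an example; the glob pattern is a placeholder):

import time

start = time.perf_counter()
df = getLogsData('logs/**', slice=500)
print(f'getLogsData took {time.perf_counter() - start:.2f} s')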
