I got the following warning

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use newframe = frame.copy()

when I tried to append multiple dataframes like

df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    df1 = df1.append(df, ignore_index=True)

where

  df['id'] = file

seems to cause the warning. I wonder if anyone can explain how copy() can avoid or reduce the fragmentation problem, or suggest other solutions to avoid the issue.

Thanks,


I tried to create a test script to reproduce the problem, but I don't see the PerformanceWarning with a testing dataset (random integers). The same code keeps producing the warning when reading in the real dataset. It looks like something in the real dataset triggers the issue.

import pandas as pd
import numpy as np
import os
import glob
rows = 35000
cols = 1900
def gen_data(rows, cols, num_files):
    # Create the pickle files once, but always rebuild the list of paths
    # so the function also works when ./data already exists.
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        if not os.path.isfile(file):
            pd.DataFrame(
                np.random.randint(1, 1_000, (rows, cols))
            ).to_pickle(file)
        files.append(file)
    return files

# Keep only one of the two lines below: the first runs the testing
# dataset, the second runs the real dataset.
files = gen_data(rows, cols, 10)  # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle')  # real dataset, gets the performance warning

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file

    dfs.append(df)

dfs = pd.concat(dfs, ignore_index=True)
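For reference, the fragmentation the warning talks about can be observed through pandas' internal block manager. This is only a sketch: `_mgr` is a private attribute and its shape may differ between pandas versions.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100, 2)))

# Inserting many columns one at a time adds a new internal block each time,
# which is exactly what the PerformanceWarning complains about.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demo
    for i in range(150):
        df[f"col{i}"] = i

n_before = len(df._mgr.blocks)  # many blocks: fragmented
df = df.copy()                  # copy() consolidates the blocks
n_after = len(df._mgr.blocks)   # far fewer blocks after the copy
print(n_before, n_after)
```

This is why the warning suggests `newframe = frame.copy()`: the copy rewrites the data into consolidated, contiguous blocks.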
Chung-Kan Huang
    When reassigning, invoke copy on your frame. – ifly6 Jul 07 '21 at 20:54
  • 2
    This should probably be something like `df1 = pd.concat([pd.read(file).assign(id=file) for file in files])` – Henry Ecker Jul 07 '21 at 20:56
  • 1
    A simple python list of dataframes is lighter weight than appended dataframes. As long as you can afford to hold both the dataframes in the list and the final concatentated dataframe in memory at the same time, @HenryEcker has a good solution. – tdelaney Jul 07 '21 at 20:59
  • @ifly6, I tried using a list and concat, but they do not seem to fix the fragmentation issue. I am curious what you meant by invoking copy when reassigning. Do you mind providing an example? Thanks. – Chung-Kan Huang Jul 08 '21 at 17:45
  • 3
    `df1 = df1.append(df, ignore_index=True).copy()` – ifly6 Jul 08 '21 at 17:46
  • Thanks @ifly6, I tried but I still got the same warning. PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()` df['id'] = file – Chung-Kan Huang Jul 08 '21 at 17:55
  • @ifly6, I suspect that when I create a new column for each df I make the data more "scattered" and therefore more fragmented. After append or concat with the fragmented dataframe, I will start to suffer performance problems if I continue to use the resulting df. However, I might be able to resolve this by making a copy to defragment. – Chung-Kan Huang Jul 08 '21 at 18:01

2 Answers


`append` is not an efficient method for this operation; `concat` is more appropriate in this situation.

Replace

df1 = df1.append(df, ignore_index=True)

with

df1 = pd.concat((df1, df), axis=0, ignore_index=True)

Details about the differences are in this question: Pandas DataFrame concat vs append
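As a sketch of why this matters: repeatedly growing a frame inside a loop copies the accumulated data on every iteration, while collecting the pieces in a plain list and concatenating once copies everything a single time. The frames below are made-up stand-ins for frames read from files.

```python
import pandas as pd

# Stand-ins for frames read from files; any iterable of DataFrames works.
frames = [pd.DataFrame({"a": range(3)}).assign(id=f"file_{i}") for i in range(4)]

# One concat at the end instead of growing df1 inside the loop.
df1 = pd.concat(frames, axis=0, ignore_index=True)
print(df1.shape)  # (12, 2)
```

The resulting frame is also fully consolidated, so no PerformanceWarning is triggered by later operations on it.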

Polkaguy6000

This is a problem with a recent pandas update. Check this issue from pandas-dev. It was resolved in pandas version 1.3.1 (reference PR).

bruno-uy