
I have about 8000 text files that contain CSV data like this:

CustomerID,Gender,Day,SaleAmount
18,Male,Monday,71.55
24,Female,Monday,219.66
112,Male,Friday,150.44

My code loops through all the files and appends each one to final.txt:

import shutil

# `files` is the list of paths to the 8000 input files
with open('final.txt', 'wb') as outfile:
    for filename in files:
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

Now the problem is that each file has its own header, i.e.

+------------+--------+-----+------------+
| CustomerID | Gender | Day | SaleAmount |
+------------+--------+-----+------------+

As a result, my final content looks like this:

+------------+--------+--------+------------+
| CustomerID | Gender |  Day   | SaleAmount |
+------------+--------+--------+------------+
| 18         | Male   | Monday | 71.55      |
| 24         | Female | Monday | 219.66     |
| 112        | Male   | Friday | 150.44     |
| CustomerID | Gender | Day    | SaleAmount |
| 28         | Male   | Monday | 7.55       |
| 34         | Female | Monday | 19.66      |
| 12         | Female | Friday | 150.44     |
| CustomerID | Gender | Day    | SaleAmount |
| 28         | Male   | Monday | 7.55       |
| 34         | Female | Monday | 19.66      |
| 12         | Female | Friday | 150.44     |
+------------+--------+--------+------------+

Is there a way to merge all 8000 text files into one, keeping just a single header, using shutil.copyfileobj?

I've tried using pd.read_csv, but copyfileobj is twice as fast. Is there any faster way to do this?

EDIT: I am reading directly from text files, not DataFrames.

Pirate X
    Possible duplicate of [Concatenating multiple csv files into a single csv with the same header - Python](https://stackoverflow.com/questions/44791212/concatenating-multiple-csv-files-into-a-single-csv-with-the-same-header-python) with `shutil` – Trenton McKinney Aug 26 '19 at 06:20

2 Answers


Use this method, which block-copies every file but throws away the header line on all but the first:

import glob
import shutil

def copy_csv(pattern, outname):
    allFiles = sorted(glob.glob(pattern))  # glob lacks reliable ordering, so impose your own if output order matters
    with open(outname, 'wb') as outfile:
        for i, fname in enumerate(allFiles):
            with open(fname, 'rb') as infile:
                if i != 0:
                    infile.readline()  # throw away the header line on all but the first file
                # block-copy the rest of the file from input to output without parsing it
                shutil.copyfileobj(infile, outfile)
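
A minimal usage sketch, assuming the 8000 inputs are `.txt` files in the current directory (both the pattern and the output name here are placeholders, not anything fixed by the question):

copy_csv('*.txt', 'final.txt')  # assumed pattern and output name; adjust to your layout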
Alternatively, using pandas with pathlib:
import pandas as pd
from pathlib import Path

# ** looks in all subdirectories of the current working directory
files = Path.cwd().glob('**/*.csv')

# or point at a specific directory:
# files = Path('c:/path_to_files').glob('**/*.csv')

df = pd.concat([pd.read_csv(file) for file in files])

df.reset_index(drop=True, inplace=True)  # if you want; drop=True keeps the old index out of the output

df.to_csv('new.csv', index=False)
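
Note that this route parses every file into a DataFrame before writing anything back out, which is why it tends to be slower than the raw byte copy above; it pays off mainly when you also need to clean, type-convert, or deduplicate the data along the way.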
Trenton McKinney
  • I have tried looping with `pd.concat`, but it becomes time-consuming at 8000 files. I am looking for an approach that is faster than `copyfileobj`. – Pirate X Aug 26 '19 at 06:17