I am currently using the code below to import 6,000 CSV files (with headers) and export them into a single CSV file (with a single header row).

import pandas as pd
import glob

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.

I was given a one-line Terminal (shell) script that does the same thing (but with no headers). That script takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing the data into a DataFrame and just concatenating the CSV files, but I cannot figure out how to do it.

Thanks.

mattblack

4 Answers


If you don't need the CSV in memory and are just copying from input to output, it will be a lot cheaper to avoid parsing at all and to copy without building anything up in memory:

import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj handles copying the data efficiently, dramatically reducing the Python-level work since nothing is parsed or reserialized.

This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
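If you want to check that assumption before merging, here is a minimal sketch (not part of the original answer) that compares each file's header line against the first file's before doing the copy; it assumes allFiles from the code above:

first_header = None
for fname in allFiles:
    with open(fname, 'rb') as f:
        header = f.readline()  # read only the first line of each file
    if first_header is None:
        first_header = header  # remember the first file's header
    elif header != first_header:
        raise ValueError(fname + " has a different header than the first file")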

ShadowRanger
  • @ShadowRanger, could you please also share some method to split a large csv into multiple files, keeping the header in each of the smaller files? – vikrant rana Dec 07 '18 at 12:48
  • @vikrantrana: That's a completely different question, not really suitable to answering in the comments, and not appropriate to answering the OP's question. Assuming one of the [many questions on this topic](https://stackoverflow.com/search?q=python+split+csv) doesn't cover it, feel free to ask your own question on that topic. It's going to need a lot more details to answer though (e.g. are you splitting up by row counts, byte counts, etc.), and the `csv` module will be necessary (because you'll need it to properly separate rows). – ShadowRanger Dec 07 '18 at 14:20
  • Thanks. I will raise a separate question for this and share a link. There are solutions available, but I am looking for an optimized one. I like the one you shared above. – vikrant rana Dec 07 '18 at 14:26
  • @vikrantrana: A warning: in the general case, you can't optimize your suggested pattern nearly as well as you could in this case; `csv` parsing will be necessary (because while headers can usually be treated as not containing embedded newlines, arbitrary CSV data can't be assumed to behave that way), and that parsing is a lot more expensive than the mere raw byte copies this question allows; you can push a lot of it to the C layer (`csv.reader` & `csv.writer` are implemented in C), but the (relatively) expensive parsing work must still be done. – ShadowRanger Dec 07 '18 at 14:30
  • Follow-up because this is kind of important: you *cannot* rely on the `glob` module to return files in a specific order, so as written, this code (and all other Python+`glob` module answers) will not reliably read a directory containing `a.csv` and `b.csv` in alphabetical (or any other useful) order; it'll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. So assuming the contents of the resulting CSV should appear in some reliable order, you will want to explicitly sort the result of the `glob.glob` call. – ShadowRanger Dec 20 '19 at 18:47
  • I've updated the answer to impose a simple ("natural") sort on the results to get reproducible behavior. For more info on when this `glob` module behavior has caused problems, see [A Code Glitch May Have Caused Errors In More Than 100 Published Studies](https://www.vice.com/en_us/article/zmjwda/a-code-glitch-may-have-caused-errors-in-more-than-100-published-studies). – ShadowRanger Dec 20 '19 at 18:48
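Following up on that ordering discussion: if your filenames embed numbers (e.g. `file2.csv` vs. `file10.csv`), a plain lexicographic `allFiles.sort()` will put `file10.csv` before `file2.csv`. A minimal sketch of a numeric-aware ("natural") sort key, not part of the original answer:

import re

def natural_key(name):
    # Split the name into alternating non-digit and digit runs so that the
    # numeric parts compare as integers: "file2.csv" sorts before "file10.csv".
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]

allFiles.sort(key=natural_key)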

Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:

cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 
Peter Leimbigler

You don't need pandas for this; the simple csv module would work fine.

import csv

df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'wb') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename) as fin:
            reader = csv.reader(fin)
            headers = reader.next()
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
Alexander
  • A couple issues with this: 1) You opened the input files in text mode, the output in binary, which wouldn't work at all on Py3, and it's wrong on a Windows box even on Py2 (where you'd convert `\r\n` line endings in the inputs to `\n` line endings in the output). Sadly, it's not possible to make it fully portable without a lot of effort or third party modules (because the `csv` module requires binary I/O on Py2, and text I/O with `newline=''` on Py3). 2) (Minor) If nothing else, `headers = reader.next()` could be changed to `headers = next(reader)` to make it work on 2.6-3.x, not just 2.x. – ShadowRanger Dec 07 '18 at 14:40
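For reference, here is a sketch of the same approach adjusted for Python 3 along the lines of the comment above (text-mode files opened with newline='', and next(reader) instead of reader.next()); allFiles is assumed to be defined as in the question:

import csv

df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)  # consume the header row of every file
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.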

Here's a simpler approach using pandas (though I am not sure how it will help with RAM usage):

import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_)
    list_.append(df)

# Concatenate once at the end instead of inside the loop.
stockstats_data = pd.concat(list_, axis=0)
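Note that this only builds the merged DataFrame in memory. To get the single CSV file the question asks for, you would still have to write it back out, for example (the output filename here is just a placeholder):

stockstats_data.to_csv('merged.csv', index=False)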
markroxor