
I have about 600,000 text files in the "./data" directory. Each file contains a single line. I want to merge them into one file, where each line is enclosed in single quotes (').

I wrote a Python script like the following:

#!/usr/bin/env python3

from glob import glob

def main():
    files = glob("data/*")  # about 600k one-line files
    for f in files:
        with open(f) as f2:
            # quote the file's single line and write it to stdout
            print("'" + f2.read() + "'")

if __name__ == "__main__":
    main()

Saving this as merge.py, I can get a merged file with the command

./merge.py > merged.txt
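
As an aside: each one-line file presumably ends with a newline, and print adds another, so in the output above the closing quote lands on the next line. A variant that strips the file's own newline before quoting (just a sketch, same behavior otherwise) would be:

#!/usr/bin/env python3

from glob import glob

def main():
    for f in glob("data/*"):
        with open(f) as f2:
            line = f2.read().rstrip("\n")  # drop the file's trailing newline
            print("'" + line + "'")

if __name__ == "__main__":
    main()

This is beside the performance question, though.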

At first, as an efficiency test, I ran the code with for f in files replaced by for f in files[:10000]. It finished in a few seconds, so I assumed that running it over all the files (i.e. with the original for f in files line) would finish in several minutes. I changed the line back and ran it, but it had not finished even after 15 minutes. Puzzled, I opened another terminal and ran

while true; do date; wc -l merged.txt; sleep 300; done

According to the output of this command, my script was processing only about 20k files per 5 minutes (far fewer than I expected), and it got slower as the process went on.

My script just repeatedly opens a file, writes a line to standard output, and closes the file. To my understanding, it should make no difference whether this happens near the beginning of the loop or after hundreds of thousands of files have already been processed.
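
One way to check would be to time the loop in batches and see whether later batches really take longer, for example with a sketch like this (the batch size of 10,000 is arbitrary):

#!/usr/bin/env python3

import sys
import time
from glob import glob

def main():
    files = glob("data/*")
    start = time.monotonic()
    for i, f in enumerate(files, 1):
        with open(f) as f2:
            print("'" + f2.read() + "'")
        if i % 10000 == 0:
            # report cumulative progress to stderr so it does not
            # mix with the merged output on stdout
            print(f"{i} files, {time.monotonic() - start:.1f}s", file=sys.stderr)

if __name__ == "__main__":
    main()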

Is there any reason the process would get slower?
