
I have a large (21 GB) file which I want to read into memory and then pass to a subroutine which processes the data transparently to me. I am on Python 2.6.6 on CentOS 6.5, so upgrading the operating system or Python is not an option. Currently, I am using:

f = open(image_filename, "rb")
image_file_contents = f.read()
f.close()
transparent_subroutine(image_file_contents)

which is slow (about 15 minutes). Before I start reading the file, I know how big it is, because I call os.stat(image_filename).st_size, so I could pre-allocate some memory if that made sense.
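
Something like this is what I have in mind for the pre-allocation (a rough sketch; read_preallocated is just an illustrative name, and it assumes the subroutine will accept a bytearray in place of a str):

import os

def read_preallocated(path, chunk_size=64 * 1024 * 1024):
    # Size the buffer once from os.stat(), then fill it in large
    # chunks instead of building one giant string incrementally.
    size = os.stat(path).st_size
    buf = bytearray(size)
    f = open(path, "rb")
    try:
        offset = 0
        while offset < size:
            chunk = f.read(min(chunk_size, size - offset))
            if not chunk:
                break  # file shrank while we were reading
            buf[offset:offset + len(chunk)] = chunk
            offset += len(chunk)
    finally:
        f.close()
    return buf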

Thank you

Jeff Silverman
  • Use `mmap`. https://docs.python.org/3/library/mmap.html – Dietrich Epp Sep 09 '14 at 23:03
  • A larger buffer may help: `open(image_filename, 'rb', 64*1024*1024)` (see the sketch after these comments) – tdelaney Sep 09 '14 at 23:11
  • How do you plan on accessing the data? Random access? Read a block, process, repeat? Or do you actually need the entire file mapped in memory? – xavier Sep 10 '14 at 00:29
  • I don't know how the data is accessed. It is the input to the OpenStack program Glance, which uses it to create a volume. I haven't tried changing the buffer size; that's clever. – Jeff Silverman Sep 12 '14 at 04:27
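
A minimal sketch of tdelaney's buffer-size suggestion (timed_read is an illustrative name; the 64 MB figure comes from the comment, and whether a larger stdio buffer helps a single large read will vary by system):

import time

def timed_read(path, buffering=-1):
    # buffering=-1 uses the interpreter default; pass e.g.
    # 64*1024*1024 to request a larger stdio buffer.
    start = time.time()
    f = open(path, 'rb', buffering)
    try:
        data = f.read()
    finally:
        f.close()
    print '%d bytes in %.1f s' % (len(data), time.time() - start)
    return data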

2 Answers


Using a generator to stream the file in chunks:

def generator(file_location):
    # Yield the file in 8 KB chunks so the whole 21 GB never
    # has to sit in memory at once.
    with open(file_location, 'rb') as entry:
        for chunk in iter(lambda: entry.read(1024 * 8), b''):
            yield chunk

go_to_streaming = generator(file_location)
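
This only helps if the downstream code can consume the data incrementally rather than needing one big string. A minimal usage sketch (handle is a hypothetical per-chunk processing function; the question does not say what the subroutine accepts):

for chunk in generator(image_filename):
    handle(chunk)  # hypothetical per-chunk processing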
sandes

To follow Dietrich's suggestion, I measured that this mmap technique is about 20% faster than one big read for a 1.7 GB input file:

import mmap
from zlib import adler32 as compute_crc

def checksum_mmap(fn, n_chunk=1024**2):
    # Checksum the file through a read-only, private memory map.
    crc = 0
    with open(fn, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ, flags=mmap.MAP_PRIVATE)
        while True:
            buf = mm.read(n_chunk)
            if not buf:
                break
            crc = compute_crc(buf, crc)
        mm.close()
    return crc
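
On a 64-bit system the full mapping of even a 21 GB file fits in the address space, and prot=mmap.PROT_READ with flags=mmap.MAP_PRIVATE keeps the mapping read-only, so nothing is ever written back to the image file.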
FDS