
I have a large (21 GB) file which I want to read into memory and then pass to a subroutine which processes the data transparently to me. I am on Python 2.6.6 on CentOS 6.5, so upgrading the operating system or Python is not an option. Currently, I am using:

f = open(image_filename, "rb")
image_file_contents = f.read()
f.close()
transparent_subroutine(image_file_contents)

which is slow (about 15 minutes). Before I start reading the file, I know how big it is, because I call os.stat(image_filename).st_size, so I could pre-allocate some memory if that made sense.
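
Something like this is what I have in mind for the pre-allocation (a rough sketch; read_preallocated is just an illustrative name, and it assumes the subroutine will accept a bytearray in place of a str):

import os

def read_preallocated(path, chunk_size=64 * 1024 * 1024):
    # Size the buffer once from os.stat(), then fill it in large
    # chunks instead of building one giant string incrementally.
    size = os.stat(path).st_size
    buf = bytearray(size)
    f = open(path, "rb")
    try:
        offset = 0
        while offset < size:
            chunk = f.read(min(chunk_size, size - offset))
            if not chunk:
                break  # file shrank while we were reading
            buf[offset:offset + len(chunk)] = chunk
            offset += len(chunk)
    finally:
        f.close()
    return buf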

Thank you

Jeff Silverman
  • Use `mmap`. https://docs.python.org/3/library/mmap.html – Dietrich Epp Sep 09 '14 at 23:03
  • A larger buffer may help: `open(image_filename, 'rb', 64*1024*1024)` (see the sketch after these comments) – tdelaney Sep 09 '14 at 23:11
  • How do you plan on accessing the data? Random access? Read a block, process, repeat? Or do you actually need the entire file mapped in memory? – xavier Sep 10 '14 at 00:29
  • I don't know how the data is accessed. It is the input to the OpenStack program Glance, which uses it to create a volume. I haven't tried changing the buffer size; that's clever. – Jeff Silverman Sep 12 '14 at 04:27
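
A minimal sketch of tdelaney's buffer-size suggestion (timed_read is an illustrative name; the 64 MB figure comes from the comment, and whether a larger stdio buffer helps a single large read will vary by system):

import time

def timed_read(path, buffering=-1):
    # buffering=-1 uses the interpreter default; pass e.g.
    # 64*1024*1024 to request a larger stdio buffer.
    start = time.time()
    f = open(path, 'rb', buffering)
    try:
        data = f.read()
    finally:
        f.close()
    print '%d bytes in %.1f s' % (len(data), time.time() - start)
    return data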

2 Answers


Using a generator to stream the file in chunks:

def generator(file_location):
    # Yield the file in 8 KB chunks so the whole 21 GB never
    # has to sit in memory at once.
    with open(file_location, 'rb') as entry:
        for chunk in iter(lambda: entry.read(1024 * 8), b''):
            yield chunk

go_to_streaming = generator(file_location)
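
This only helps if the downstream code can consume the data incrementally rather than needing one big string. A minimal usage sketch (handle is a hypothetical per-chunk processing function; the question does not say what the subroutine accepts):

for chunk in generator(image_filename):
    handle(chunk)  # hypothetical per-chunk processing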
sandes

To follow Dietrich's suggestion, I measured that this mmap technique is about 20% faster than one big read for a 1.7 GB input file:

import mmap
from zlib import adler32 as compute_crc

def checksum_mmap(fn, n_chunk=1024**2):
    # Checksum the file through a read-only, private memory map.
    crc = 0
    with open(fn, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ, flags=mmap.MAP_PRIVATE)
        while True:
            buf = mm.read(n_chunk)
            if not buf:
                break
            crc = compute_crc(buf, crc)
        mm.close()
    return crc
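
On a 64-bit system the full mapping of even a 21 GB file fits in the address space, and prot=mmap.PROT_READ with flags=mmap.MAP_PRIVATE keeps the mapping read-only, so nothing is ever written back to the image file.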
FDS