My research group focuses on molecular dynamics, which can generate gigabytes of data in a single trajectory that must then be analyzed.
Several of the problems we're concerned with involve correlations across the data set, which means we need to hold large amounts of data in memory and analyze them together, rather than processing the file sequentially.
What I'd like to know is: what are the most efficient strategies for handling I/O of large data sets in scripts? We normally use Python-based scripts because they make coding the file I/O much less painful than C or Fortran, but when tens or hundreds of millions of lines need to be processed, it's not so clear what the best approach is. Should we consider writing the file-input part of the code in C, or is another strategy more useful? For example, will simply preloading the entire array into memory be better than a series of sequential reads of "chunks" (on the order of megabytes)?
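To make the two options concrete, here is a minimal sketch of what we mean, assuming a plain-text trajectory of whitespace-separated numbers; the filename and chunk size are just placeholders:

```python
import numpy as np

FILENAME = "trajectory.dat"   # hypothetical plain-text trajectory, one record per line

# Option 1: preload the whole file into a single array.
# Simple and gives random access, but needs RAM comparable to the data set.
def load_all(filename=FILENAME):
    return np.loadtxt(filename)

# Option 2: sequential reads of megabyte-scale "chunks".
# Peak memory stays bounded, but each pass only sees one block at a time.
def iter_chunks(filename=FILENAME, rows_per_chunk=100_000):
    chunk = []
    with open(filename) as fh:
        for line in fh:
            chunk.append([float(v) for v in line.split()])
            if len(chunk) == rows_per_chunk:
                yield np.asarray(chunk)
                chunk = []
    if chunk:                      # emit any trailing partial chunk
        yield np.asarray(chunk)
```

The chunked reader only helps if the analysis can be phrased as a single pass over the data; for the correlation work described below it would have to be combined with something that restores random access.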
Some additional notes:
We are primarily looking for scripting tools for post-processing, rather than "on-line" tools—hence the use of Python.
As stated above, we're doing MD simulations. One topic of interest is diffusion calculations, for which we need to obtain the Einstein diffusion coefficient: $$D = \frac{1}{6} \lim_{\Delta t \rightarrow \infty} \left< \left( {\bf x}(t + \Delta t) - {\bf x}(t) \right)^2 \right>$$ This means we really need to load all of the data into memory before beginning the calculation—all of the chunks of data (records of individual times) will interact with one another.
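For concreteness, here is a minimal sketch of that calculation, assuming the trajectory is already in memory as a NumPy array `positions` of shape (n_frames, n_atoms, 3) with unwrapped coordinates and a frame spacing `dt`; all the names here are placeholders:

```python
import numpy as np

def msd(positions, max_lag):
    """Mean-squared displacement for lag times 1..max_lag,
    averaged over all time origins and all atoms."""
    out = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        # Every time origin is paired with a frame `lag` steps later.
        disp = positions[lag:] - positions[:-lag]
        out[lag - 1] = np.mean(np.sum(disp**2, axis=-1))
    return out

def diffusion_coefficient(positions, dt, max_lag):
    """Einstein relation: D is one sixth of the slope of MSD vs. lag time."""
    lags = np.arange(1, max_lag + 1) * dt
    slope, _ = np.polyfit(lags, msd(positions, max_lag), 1)
    return slope / 6.0
```

Every value of the lag pairs each frame with a frame much later in the file, which is why a purely sequential, forget-as-you-go read doesn't fit this analysis.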
Consider using mmap in your main code. Many modern operating systems give similar performance between mmap and a regular read, with less complication. (Also, yes, mmap in Python provides a portable interface to the Windows and UNIX memory maps). – Aron Ahmadia Apr 06 '12 at 08:45
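To illustrate the comment, a minimal sketch of the memory-mapping idea using numpy.memmap rather than the raw mmap module, assuming the trajectory has first been written out once as a flat binary array of float64; the filename and dimensions are placeholders:

```python
import numpy as np

N_FRAMES, N_ATOMS = 1_000_000, 1000    # hypothetical trajectory dimensions

# The file is mapped, not read: the OS pages data in on demand,
# so the array can be far larger than physical RAM.
positions = np.memmap("trajectory.f64", dtype=np.float64, mode="r",
                      shape=(N_FRAMES, N_ATOMS, 3))

# Random access across time origins works exactly as with an in-memory array,
# which is what the correlation/MSD analysis needs.
disp = positions[1000] - positions[0]
```

The one-time cost is converting the text trajectory to a binary layout, but after that the analysis code never pays for parsing again.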