My research group focuses on molecular dynamics, which can generate gigabytes of data in a single trajectory that must then be analyzed.
Several of the problems we're concerned with involve correlations across the data set, which means we need to hold large amounts of data in memory and analyze them together, rather than processing the file sequentially.
What I'd like to know is: what are the most efficient strategies for handling I/O of large data sets in scripts? We normally use Python-based scripts because they make coding the file I/O much less painful than C or Fortran, but when tens or hundreds of millions of lines need to be processed, it's not so clear what the best approach is. Should we consider writing the file-input part of the code in C, or is another strategy more useful? For example, will simply preloading the entire array into memory be better than a series of sequential reads of "chunks" (on the order of megabytes)?
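To make the two options concrete, here is a minimal sketch of what we mean, assuming a plain-text trajectory of whitespace-separated numbers; the filename and chunk size are just placeholders:

```python
import numpy as np

FILENAME = "trajectory.dat"   # hypothetical plain-text trajectory, one record per line

# Option 1: preload the whole file into a single array.
# Simple and gives random access, but needs RAM comparable to the data set.
def load_all(filename=FILENAME):
    return np.loadtxt(filename)

# Option 2: sequential reads of megabyte-scale "chunks".
# Peak memory stays bounded, but each pass only sees one block at a time.
def iter_chunks(filename=FILENAME, rows_per_chunk=100_000):
    chunk = []
    with open(filename) as fh:
        for line in fh:
            chunk.append([float(v) for v in line.split()])
            if len(chunk) == rows_per_chunk:
                yield np.asarray(chunk)
                chunk = []
    if chunk:                      # emit any trailing partial chunk
        yield np.asarray(chunk)
```

The chunked reader only helps if the analysis can be phrased as a single pass over the data; for the correlation work described below it would have to be combined with something that restores random access.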
Some additional notes:
We are primarily looking for scripting tools for post-processing, rather than "on-line" tools—hence the use of Python.
As stated above, we're doing MD simulations. One topic of interest is diffusion calculations, for which we need to obtain the Einstein diffusion coefficient: $$D = \frac{1}{6} \lim_{\Delta t \rightarrow \infty} \left< \left( {\bf x}(t + \Delta t) - {\bf x}(t) \right)^2 \right>$$ This means we really need to load all of the data into memory before beginning the calculation—all of the chunks of data (records of individual times) will interact with one another.
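For concreteness, here is a minimal sketch of that calculation, assuming the trajectory is already in memory as a NumPy array `positions` of shape (n_frames, n_atoms, 3) with unwrapped coordinates and a frame spacing `dt`; all the names here are placeholders:

```python
import numpy as np

def msd(positions, max_lag):
    """Mean-squared displacement for lag times 1..max_lag,
    averaged over all time origins and all atoms."""
    out = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        # Every time origin is paired with a frame `lag` steps later.
        disp = positions[lag:] - positions[:-lag]
        out[lag - 1] = np.mean(np.sum(disp**2, axis=-1))
    return out

def diffusion_coefficient(positions, dt, max_lag):
    """Einstein relation: D is one sixth of the slope of MSD vs. lag time."""
    lags = np.arange(1, max_lag + 1) * dt
    slope, _ = np.polyfit(lags, msd(positions, max_lag), 1)
    return slope / 6.0
```

Every value of the lag pairs each frame with a frame much later in the file, which is why a purely sequential, forget-as-you-go read doesn't fit this analysis.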
Consider using mmap in your main code. Many modern operating systems give similar performance between mmap and a regular read, with less complication. (Also, yes, mmap in Python provides a portable interface to the Windows and UNIX memory maps). – Aron Ahmadia Apr 06 '12 at 08:45
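To illustrate the comment, a minimal sketch of the memory-mapping idea using numpy.memmap rather than the raw mmap module, assuming the trajectory has first been written out once as a flat binary array of float64; the filename and dimensions are placeholders:

```python
import numpy as np

N_FRAMES, N_ATOMS = 1_000_000, 1000    # hypothetical trajectory dimensions

# The file is mapped, not read: the OS pages data in on demand,
# so the array can be far larger than physical RAM.
positions = np.memmap("trajectory.f64", dtype=np.float64, mode="r",
                      shape=(N_FRAMES, N_ATOMS, 3))

# Random access across time origins works exactly as with an in-memory array,
# which is what the correlation/MSD analysis needs.
disp = positions[1000] - positions[0]
```

The one-time cost is converting the text trajectory to a binary layout, but after that the analysis code never pays for parsing again.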