3

The following code lazily prints the contents of a text file line by line, with each print stopping at '\n':

with open('eggs.txt', 'rb') as file:
    for line in file:
        print line

Is there any way to lazily print the contents of a text file, with each print stopping at ', '?

(or any other character/string)

I am asking this because I am trying to read a file that contains a single 2.9 GB line whose values are separated by commas.

P.S. My question is different from this one: Read large text files in Python, line by line without loading it into memory. I am asking how to stop at characters other than newlines ('\n').

RetroCode
  • @grael That's not relevant at all. – taleinat Aug 25 '16 at 08:36
  • Does the `split()` function not do the job just as well? – Vaibhav Bajaj Aug 25 '16 at 08:37
  • @TamasHegedus it's lazy because it doesn't load the whole text file into memory at once; rather, it loads a small fragment of it (the one you are currently printing) at a time. That way, if the file is too big, you can still access its contents without running out of RAM. – RetroCode Aug 25 '16 at 08:38
  • @VaibhavBajaj that would not be lazy, would it? – RetroCode Aug 25 '16 at 08:38
  • Possible duplicate of [Read large text files in Python, line by line without loading it in to memory](http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory) – DhruvPathak Aug 25 '16 at 08:40
  • @RetroCode [This answer](http://stackoverflow.com/a/9770397/3398583) might be useful – muddyfish Aug 25 '16 at 08:40
  • @DhruvPathak The question specifically asks how to do this stopping at characters other than newlines. – taleinat Aug 25 '16 at 08:41
  • Is the file a single huge line? – Vaibhav Bajaj Aug 25 '16 at 08:43
  • @VaibhavBajaj Exactly. It is a 2.9 GB long line separated by commas. – RetroCode Aug 25 '16 at 08:45
  • @RetroCode please check this one: http://stackoverflow.com/questions/39140348/python-lazy-loading/39141071#39141071 it uses `yield` to avoid storing values in memory. – turkus Aug 25 '16 at 09:31

5 Answers

3

I don't think there is a built-in way to achieve this. You will have to use `file.read(block_size)` to read the file block by block, split each block at commas, and rejoin strings that go across block boundaries manually.

Note that you still might run out of memory if you don't encounter a comma for a long time. (The same problem applies to reading a file line by line, when encountering a very long line.)

Here's an example implementation:

def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            # end of file reached
            break
        block_fragments = iter(block.split(sep))
        # the first fragment of this block completes the last fragment
        # carried over from the previous block
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment
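
For example, the generator could be consumed like this (a hypothetical usage sketch, not part of the original answer, assuming the eggs.txt file from the question):

# Hypothetical usage of split_file(); only one block plus the current
# fragment is held in memory at any time.
with open('eggs.txt') as f:
    for field in split_file(f, sep=','):
        print(field)
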
Sven Marnach
  • In terms of speed, do you think it would be better if I preprocessed the file like so: `g = open(file, "w").next().replace(",", "/n"); g2 = open(file, "w").write(g); g = None` and then lazy loaded it the normal way? – RetroCode Aug 25 '16 at 09:39
  • What I did was load the file into memory momentarily, replace commas with '\n', and then set the file variable to None to free the memory, because if the file remains in memory then I experience slow execution time (thrashing) when doing further operations. – RetroCode Aug 25 '16 at 09:45
  • @RetroCode Loading the whole file into memory is what you wanted to avoid. I don't think doing this will improve performance, no. (Side note: to unbind a name, use `del name` instead of assigning `None`.) – Sven Marnach Aug 25 '16 at 09:50
2

Using buffered reading from the file (Python 3):

buffer_size = 2**12
delimiter = ','

with open(filename, 'r') as f:
    # remember the characters after the last delimiter in the previously processed chunk
    remaining = ""

    while True:
        # read the next chunk of characters from the file
        chunk = f.read(buffer_size)

        # end the loop if the end of the file has been reached
        if not chunk:
            break

        # add the remaining characters from the previous chunk,
        # split according to the delimiter, and keep the remaining
        # characters after the last delimiter separately
        *lines, remaining = (remaining + chunk).split(delimiter)

        # print the parts up to each delimiter one by one
        for line in lines:
            print(line, end=delimiter)

    # print the characters after the last delimiter in the file
    if remaining:
        print(remaining, end='')

Note that the way this is currently written, it will just print the original file's contents exactly as they were. This is easily changed, though, e.g. by changing the `end=delimiter` parameter passed to the `print()` function in the loop.
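
If the parts are needed for further processing rather than printed, the same buffered approach could also be wrapped in a generator. The sketch below is only an illustration under the same assumptions; fields_of is a made-up name, not something from the original answer:

def fields_of(filename, delimiter=',', buffer_size=2**12):
    # Lazily yield each delimiter-separated field of the file,
    # reading at most buffer_size characters per chunk.
    with open(filename, 'r') as f:
        remaining = ""
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            # split the carried-over text plus the new chunk; keep the
            # trailing piece for the next iteration
            *fields, remaining = (remaining + chunk).split(delimiter)
            for field in fields:
                yield field
        if remaining:
            yield remaining

It could then be consumed lazily, e.g. with `for field in fields_of('eggs.txt'): ...`.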

taleinat
  • `f.read()` is buffered anyway, unless you disable it, so no need to do it again. – dhke Aug 25 '16 at 09:03
  • @dhke Reading character by character using `f.read(1)` will be terribly slow due to Python function call overhead, so you definitely should read larger buffers at a time. This will also reduce the number of times you need to call `str.split()`. – Sven Marnach Aug 25 '16 at 09:06
  • @dhke `text = f.read()` will read the entire file contents into memory, and so will `f.read().split(',')`. The buffering you mention is at a lower level; the code using `f.read()` needs to be written carefully to take advantage of that, and it is not clear how to do so to achieve what was asked in the question. – taleinat Aug 25 '16 at 09:06
  • @SvenMarnach That is another problem, yes, but as implemented here, data is buffered twice. – dhke Aug 25 '16 at 09:13
  • One minor issue with this approach (which also affects my own code) is that it won't print the last item if it is the empty string. – Sven Marnach Aug 25 '16 at 09:13
  • @dhke Might be the case. With OS buffering and libc buffering, it might even be buffered three times. How do you think that's a problem? Can you offer a better solution? – Sven Marnach Aug 25 '16 at 09:15
  • @SvenMarnach `open(filename, 'rb', 0)` gets rid of at least one level of buffering. Remember that we also copy around memory contents otherwise. And yes `read(1)` is stupidly slow. – dhke Aug 25 '16 at 09:27
  • @dhke The libc usually knows what block size to ask the OS for, so I'd leave it alone. I don't think using unbuffered mode actually results in less copying or less memory being used by the libc, it just results in the libc hiding user-visible effects of buffering, which aren't a problem in this case. – Sven Marnach Aug 25 '16 at 09:46
  • In terms of speed, do you think it would be better if I preprocessed the file like so: `g = open(file, "w").next().replace(",", "/n"); g2 = open(file, "w").write(g); g = None` and then lazy loaded it the normal way? What I did was load the file into memory momentarily, replace commas with '\n', and then set the file variable to None to free the memory, because if the file remains in memory I experience slow execution time (thrashing) when doing further operations. – RetroCode Aug 25 '16 at 09:48
  • @SvenMarnach There's no libc buffering; CPython directly calls `open()`, `read()` and `write()`. The default buffer size is determined at compile time and defaults to 8192. And how you would hide user-visible effects of internal buffering, e.g. for non-seekable streams, is something I would really like to know. – dhke Aug 25 '16 at 11:09
  • @dhke Ah, right, they changed that when they rewrote the IO layer for Python 3. In Python 2, reading from a file calls `fread()` if I remember correctly. Anyway, I don't think buffering causes any ill effects here, and I wouldn't expect any measurable difference from using unbuffered mode, but it probably doesn't hurt either. – Sven Marnach Aug 25 '16 at 18:53
  • @SvenMarnach That must have been a while ago. No `fread()` in [Python 2.7.12](https://github.com/python/cpython/blob/be1540e9644c22ce934399b1163dc4216a802a4d/Modules/_io/fileio.c#L647), and it has been like that since at least 2009. – dhke Aug 26 '16 at 09:22
  • @dhke Oh, I haven't looked at that code in a while, but if I'm not mistaken, your link is pointing at the back-ported `io` library that is not used by the built-in `open()` function. By default, file object reads end up [here](https://github.com/python/cpython/blob/2.7/Objects/fileobject.c#L2830), and indeed call `fread()`. Anyway, as stated several times before, the buffering issue is completely irrelevant here. Do you think buffering causes any kind of problem here? – Sven Marnach Aug 26 '16 at 11:24
1

The following answer can be considered lazy, since it reads the file one character at a time:

def commaBreak(filename):
    word = ""
    with open(filename) as f:
        while True:
            char = f.read(1)
            if not char:
                print "End of file"
                yield word
                break
            elif char == ',':
                yield word
                word = ""
            else:
                word += char

You may choose to do something like this with a larger number of characters, e.g. 1000, read at a time.
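
A hypothetical usage sketch (not part of the original answer), assuming the eggs.txt file from the question:

# Consume the generator lazily; only the current word is kept in memory.
for word in commaBreak('eggs.txt'):
    print(word)
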

Vaibhav Bajaj
  • This still loads the whole file into memory, namely into the list `wordList`. – Sven Marnach Aug 25 '16 at 08:59
  • @SvenMarnach, it would load one character at a time until memory is full, right? – Vaibhav Bajaj Aug 25 '16 at 09:03
  • It's a minor nit. You should put this code into a generator function and yield each word you found instead of appending them to a list, so a consumer is able to consume the bits iteratively, without the need of loading all of them into memory at once. The major problem with this approach is that it will be quite slow. – Sven Marnach Aug 25 '16 at 09:09
  • @SvenMarnach Is this any better? – Vaibhav Bajaj Aug 26 '16 at 20:40
  • Yep, that's what I meant. You usually wouldn't want such a generator to print anything, just to yield (and you have an inconsistency since the last word doesn't get printed). This code still has the problem that it is rather slow, but it is correct and simple. – Sven Marnach Aug 26 '16 at 20:59
-1
with open('eggs.txt', 'rb') as file:
    for line in file:
        str_line = str(line)
        words = str_line.split(', ')
        for word in words:
            print(word)

I'm not completely sure I know what you are asking; is something like this what you mean?

Drew Davis
  • This will not work, as he can't read a 2.9 GB long line separated by commas with `for line in file`. Please follow the comments. – Vaibhav Bajaj Aug 25 '16 at 08:53
-1

It yields one character from the file at a time, which means the whole file is never loaded into memory.

def lazy_read():
    try:
        with open('eggs.txt', 'rb') as file:
            item = file.read(1)
            while item:
                if ',' == item:
                    raise StopIteration
                yield item
                item = file.read(1)
    except StopIteration:
        pass

print ''.join(lazy_read())
turkus
  • What is `exit()`? Why do you iterate over `line` if it's only a single character anyway? Also, your indentation is broken. – Sven Marnach Aug 25 '16 at 09:17