
I have a 1.4GB file and I'm trying to iterate over every line. I tried the normal approach, and this happened:

counter = 0
with open("myfile.txt") as infile:
    for line in infile:
        counter += 1
        if target in line:
            print line
print counter

658785

OK, everything looks good, but then I realized that the count is way lower than what it should be, so I wrote this instead:

counter = 0
text_file = open("myfile.txt")
while True:
    line = text_file.readline()
    if not line: break
    counter += 1
print counter

Same number of rows, but I know for a fact that this file has over 20 million rows. Does anyone know what I'm doing wrong?

EDIT: It seems people are skeptical about whether I'm reading the right file, how I'm verifying the lines, etc.

So just a simple example if I run this:

counter = 0
total_lines = 0
while True:
    line = text_file.readline()
    if not line: break
    total_lines += 1
    if target in line:
        print line.split("|")[0].strip(), counter, total_lines
        counter += 1

This is my output:

HAIRY MOOSE 0 4722388
HAIRY MOOSE 1 4722389
HAIRY MOOSE 2 4722390
....
....
IN *HAIRY MOOSES CLEANING 45 12244264
IN *HAIRY MOOSES OF TU 46 12244265
IN *HAIRY MOOSES OF TULSA 47 12244266

but if I read it the other way, it finishes before a single match is found.

Stupid.Fat.Cat
2 Answers


There is nothing intrinsically wrong with iterating over a file the way you are doing it. I have opened and walked much bigger files the same way; Python imposes no size limit on files.

Possible explanations:

  • Your file is not newline-delimited, or whatever you're using to count the lines delimits them differently than Python does.
  • Some kind of concurrent modification of the file. Are you editing the file earlier in your own code? Is another program editing the file while you read it? Is there any chance the file changed between your ground-truth check and runtime?

Ways of checking:

  • Check the last line(s) manually. Print out the lines (or perhaps just the last one). If it matches the last line of your text file, the issue is probably not that the resource is being modified.

  • Try manually splitting. Use read() instead of readline() and split the result yourself. If len(myfile.read().split('\n')) doesn't give the right count but len(myfile.read().split('\r')) does, it's probably a delimiter problem. Perhaps the two counts even sum to the number you're looking for?

  • Check line lengths until you find an erroneous one. Are the lines as long as they should be? On the independent, ground-truth tool you trust, generate a count of line lengths. Then walk the file in Python and validate that each line is as long as it should be. If my math holds, it cannot simultaneously be the case that (1) each line is the correct length, (2) both tools see the same total number of characters (validate that too), and (3) they disagree on the number of lines. Take the first line that is not the right length and examine it manually; whatever is confusing Python (again, I'd expect a delimiter issue) should show up there.

  • Change your read mode, depending on file type. From the Python documentation for open():

The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don't treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.

  • Double-check your ground truth, and... are you sure you need to do this? I hope you've already validated your ground truth (I would rarely be "positive" about my own independent line count). Is it a numerical file? Maybe you can use numpy.loadtxt. Can you use pandas? Is it a database? If it's a matrix in a .mat file, scipy.io.loadmat might be useful. Huge data is rarely randomly formatted, and for most useful formats someone has already done a good job, so manually parsing a long file might not be your best bet. In which case, some mysteries are more fun not to solve.
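The delimiter check above can be sketched like this. This is a minimal example on a made-up in-memory byte string, not the asker's actual 1.4GB file (for a file that big you would read and count in chunks rather than all at once):

```python
# Minimal sketch: count each line-ending convention in the raw bytes.
# `raw` is a hypothetical sample standing in for the real file's contents.
raw = b"row1\rrow2\rrow3\r\nrow4\n"

crlf = raw.count(b"\r\n")        # Windows endings
lf = raw.count(b"\n") - crlf     # bare LF (Unix)
cr = raw.count(b"\r") - crlf     # bare CR (classic Mac OS)

# Total logical lines regardless of convention
total = crlf + lf + cr
print(crlf, lf, cr, total)
```

If `cr` dominates, the file uses old-Mac-style '\r' endings, which a plain `for line in f` in Python 2 will not split on; universal-newline mode handles that case.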
en_Knight

Different OSes have different notions of what constitutes a 'line':

Windows:

line\r\n

Unix and OS X

line\n

Python has support for universal newlines, which handles all of these line endings. Try that.


Demo:

from __future__ import print_function

import os

# Python 2 demo: write n lines with each OS's line ending,
# then count the lines under each read mode.
les = {'Unix': chr(10), 'MacOS': chr(13), 'Windows': ''.join([chr(13), chr(10)])}

n = 10000
for osn, le in les.items():
    fn = '{} {} lines.txt'.format(osn, n)
    print(fn)
    with open(fn, 'wb') as f:
        for x in range(n):
            f.write('line {}{}'.format(x, le))

    for mode in ('r', 'rb', 'rU'):
        with open(fn, mode) as f:
            print("{:10d} lines with {}".format(sum(1 for _ in f), mode))

    os.remove(fn)

On Unix, this prints:

Windows 10000 lines.txt
     10000 lines with r
     10000 lines with rb
     10000 lines with rU
Unix 10000 lines.txt
     10000 lines with r
     10000 lines with rb
     10000 lines with rU
MacOS 10000 lines.txt
         1 lines with r
         1 lines with rb
     10000 lines with rU

On Windows, if you write the file in text mode (with open(fn, 'w') as f:), it will double the newlines, though.
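For what it's worth, if you're on Python 3 rather than Python 2, the 'rU' mode is deprecated (and removed in 3.11); the default text mode already applies universal newlines, so all three conventions count the same. A rough Python 3 sketch using temporary files (the file contents here are illustrative, not from the question):

```python
import os
import tempfile

endings = {"Unix": "\n", "MacOS": "\r", "Windows": "\r\n"}
counts = {}

for name, le in endings.items():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:            # binary write keeps endings verbatim
        for i in range(3):
            f.write("line {}{}".format(i, le).encode("ascii"))
    with open(path) as f:                  # Python 3 default: universal newlines
        counts[name] = sum(1 for _ in f)
    os.remove(path)

print(counts)  # every convention yields 3 lines
```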

dawg