
I have a 1.4GB file and I'm trying to iterate over every line. I tried the normal approach, and this happened:

counter = 0
with open("myfile.txt") as infile:
    for line in infile:
        counter += 1
        if target in line:
            print line
print counter

658785

OK, everything looks good, but then I realized that the count is way lower than what it should be, so I wrote this instead:

counter = 0
text_file = open("myfile.txt")
while True:
    line = text_file.readline()
    if not line: break
    counter += 1
print counter

Same number of rows, but I know for a fact that this file has over 20 million rows. Does anyone know what I'm doing wrong?

EDIT: It seems people are skeptical about whether I'm reading the right file, how I'm verifying the lines, etc.

So just a simple example if I run this:

counter = 0
total_lines = 0
while True:
    line = text_file.readline()
    if not line: break
    total_lines += 1
    if target in line:
        print line.split("|")[0].strip(), counter, total_lines
        counter += 1

This is my output:

HAIRY MOOSE 0 4722388
HAIRY MOOSE 1 4722389
HAIRY MOOSE 2 4722390
....
....
IN *HAIRY MOOSES CLEANING 45 12244264
IN *HAIRY MOOSES OF TU 46 12244265
IN *HAIRY MOOSES OF TULSA 47 12244266

but if I read it the other way, it finishes before a single match is found.

Stupid.Fat.Cat
2 Answers


There is nothing intrinsically wrong with iterating over a file the way you are doing it. I have opened and walked much bigger files the same way; Python imposes no size limit on files.

Possible explanations:

  • Your file is not newline-delimited, or whatever you're using to count the lines delimits them differently than Python does.
  • Some kind of concurrent modification of the file. Are you editing the file earlier in your own code? Is another program editing the file while you read it? Is there any chance the file changed between your ground-truth check and runtime?

Ways of checking:

  • Check the last line(s) manually. Print out the lines (or perhaps just the last one). If it matches the last line of your text file, the issue is probably not that the resource is being modified.

  • Try manually splitting. Use read() instead of readline() and split the result yourself. If len(myfile.read().split('\n')) doesn't give the right count but len(myfile.read().split('\r')) does, it's probably a delimiter problem. Perhaps the two counts even sum to the number you're looking for?

  • Check line lengths until you find an erroneous one. Are the lines as long as they should be? On the independent, ground-truth tool you trust, generate a count of line lengths. Then walk the file in Python and validate that each line is as long as it should be. If my math holds, it cannot simultaneously be the case that (1) each line is the correct length, (2) both tools see the same total number of characters (validate that too), and (3) they disagree on the number of lines. Take the first line that is not the right length and examine it manually; whatever is confusing Python (again, I'd expect a delimiter issue) should show up there.

  • Change your read mode, depending on file type. From the Python documentation for open():

The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don't treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.

  • Double-check your ground truth, and... are you sure you need to do this? I hope you've already validated your ground truth (I would rarely be "positive" about my own independent line count). Is it a numerical file? Maybe you can use numpy.loadtxt. Can you use pandas? Is it a database? If it's a matrix in a .mat file, scipy.io.loadmat might be useful. Huge data is rarely randomly formatted, and for most useful formats someone has already done a good job, so manually parsing a long file might not be your best bet. In which case, some mysteries are more fun not to solve.
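The delimiter check above can be sketched like this. This is a minimal example on a made-up in-memory byte string, not the asker's actual 1.4GB file (for a file that big you would read and count in chunks rather than all at once):

```python
# Minimal sketch: count each line-ending convention in the raw bytes.
# `raw` is a hypothetical sample standing in for the real file's contents.
raw = b"row1\rrow2\rrow3\r\nrow4\n"

crlf = raw.count(b"\r\n")        # Windows endings
lf = raw.count(b"\n") - crlf     # bare LF (Unix)
cr = raw.count(b"\r") - crlf     # bare CR (classic Mac OS)

# Total logical lines regardless of convention
total = crlf + lf + cr
print(crlf, lf, cr, total)
```

If `cr` dominates, the file uses old-Mac-style '\r' endings, which a plain `for line in f` in Python 2 will not split on; universal-newline mode handles that case.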
en_Knight

Different OSes have different notions of what constitutes a 'line':

Windows:

line\r\n

Unix and OS X

line\n

Python has support for universal newlines, which handles all of these line endings. Try that.


Demo:

from __future__ import print_function

import os

# Python 2 demo: write n lines with each OS's line ending,
# then count the lines under each read mode.
les = {'Unix': chr(10), 'MacOS': chr(13), 'Windows': ''.join([chr(13), chr(10)])}

n = 10000
for osn, le in les.items():
    fn = '{} {} lines.txt'.format(osn, n)
    print(fn)
    with open(fn, 'wb') as f:
        for x in range(n):
            f.write('line {}{}'.format(x, le))

    for mode in ('r', 'rb', 'rU'):
        with open(fn, mode) as f:
            print("{:10d} lines with {}".format(sum(1 for _ in f), mode))

    os.remove(fn)

On Unix, this prints:

Windows 10000 lines.txt
     10000 lines with r
     10000 lines with rb
     10000 lines with rU
Unix 10000 lines.txt
     10000 lines with r
     10000 lines with rb
     10000 lines with rU
MacOS 10000 lines.txt
         1 lines with r
         1 lines with rb
     10000 lines with rU

On Windows, if you write the file in text mode (with open(fn, 'w') as f:), it will double the newlines, though.
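For what it's worth, if you're on Python 3 rather than Python 2, the 'rU' mode is deprecated (and removed in 3.11); the default text mode already applies universal newlines, so all three conventions count the same. A rough Python 3 sketch using temporary files (the file contents here are illustrative, not from the question):

```python
import os
import tempfile

endings = {"Unix": "\n", "MacOS": "\r", "Windows": "\r\n"}
counts = {}

for name, le in endings.items():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:            # binary write keeps endings verbatim
        for i in range(3):
            f.write("line {}{}".format(i, le).encode("ascii"))
    with open(path) as f:                  # Python 3 default: universal newlines
        counts[name] = sum(1 for _ in f)
    os.remove(path)

print(counts)  # every convention yields 3 lines
```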

dawg