41

I noticed that if I iterate over a file that I opened, it is much faster to iterate over it without "read"-ing it.

i.e.

l = open('file','r')
for line in l:
    pass  # (or code)

is much faster than

l = open('file','r')
for line in l.read():  # or l.readlines()
    pass  # (or code)

The second loop takes around 1.5x as long (I used timeit on the exact same file, and the results were 0.442 vs. 0.660) and, as far as I can tell, gives the same result.

So - when should I ever use the .read() or .readlines()?

Since I always need to iterate over the file I'm reading, and after learning the hard way how painfully slow the .read() can be on large data - I can't seem to imagine ever using it again.

codeforester
Maverick Meerkat
  • Please clarify. Is the `timeit` measurement for `read`, or for `readlines`? I'd expect the `read` loop to take longer because it returns a single string, so iterating over it would go character-by-character. If your file has on average 100 characters per line, then the code in the `for line in l.read()` loop will execute a hundred times as often as the code in the `for line in l:` loop. – Kevin Jun 29 '16 at 16:43
  • It's also for readlines(). Surprisingly there's almost no time difference between read() and readlines()... – Maverick Meerkat Jun 29 '16 at 18:07
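
To make the point in the comment above concrete, here is a minimal sketch of what each loop actually iterates over. It uses an in-memory `io.StringIO` with made-up contents as a stand-in for a real file; that substitution is an assumption for illustration only.

```python
import io

text = "first line\nsecond line\n"  # made-up file contents

# Iterating over the file object yields one line per iteration.
lines_from_iteration = [line for line in io.StringIO(text)]

# readlines() builds the whole list of lines in memory first.
lines_from_readlines = io.StringIO(text).readlines()

# read() returns one string, so a for-loop over it yields single characters.
items_from_read = [ch for ch in io.StringIO(text).read()]

print(len(lines_from_iteration))  # number of lines
print(len(lines_from_readlines))  # same number of lines
print(len(items_from_read))       # number of characters, not lines
```

So a loop over `read()` does not give the same result as a loop over the file or over `readlines()`, even if the timings happen to look similar.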

6 Answers

40

The short answer to your question is that each of these three methods of reading a file has different use cases. As noted above, f.read() reads the file as a single string, and so allows relatively easy file-wide manipulations, such as a file-wide regex search or substitution.
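
As a rough sketch of the file-wide regex case, the example below searches for blocks that span several lines, which a line-by-line loop cannot match directly. The `BEGIN`/`END` markers and the `io.StringIO` stand-in for an opened file are assumptions made up for illustration.

```python
import io
import re

# Stand-in for open('file', 'r'); the contents are invented for this example.
f = io.StringIO("BEGIN\nalpha\nbeta\nEND\nBEGIN\ngamma\nEND\n")

contents = f.read()  # the whole file as one string

# re.DOTALL lets "." match newlines, so one match can cover several lines.
blocks = re.findall(r"BEGIN\n(.*?)END\n", contents, flags=re.DOTALL)
print(blocks)
```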

f.readline() reads a single line of the file, allowing the user to parse a single line without necessarily reading the entire file. Using f.readline() also allows easier application of logic in reading the file than a complete line by line iteration, such as when a file changes format partway through.
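
A minimal sketch of the "format changes partway through" case might look like the following. The file layout (a count line, that many comma-separated records, then free text) is a hypothetical format chosen only to illustrate the idea, and `io.StringIO` stands in for a real file.

```python
import io

# Hypothetical layout: a count line, that many data lines, then free text.
f = io.StringIO("2\n10,20\n30,40\nnotes: anything goes here\n")

count = int(f.readline())  # the first line says how many records follow

records = []
for _ in range(count):
    a, b = f.readline().strip().split(",")
    records.append((int(a), int(b)))

trailer = f.read()  # whatever is left, read in one go
print(records, repr(trailer))
```

Here each `readline()` call decides how the next line is parsed, which is awkward to express as a single uniform `for line in f:` loop.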

Using the syntax for line in f: allows the user to iterate over the file line by line as noted in the question.

(As noted in the other answer, this documentation is a very good read):

https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

Note: It was previously claimed that f.readline() could be used to skip a line during a for loop iteration. However, this doesn't work in Python 2.7, and is perhaps a questionable practice, so this claim has been removed.

LeopardShark
Checkmate
  • Mixing `readline` with a `for` loop over the file doesn't actually work; `readline` doesn't understand the `next` implementation's buffering. If you want to skip a line in a `for` loop, you should call `next` on the file. – user2357112 Jun 29 '16 at 16:55
  • I just tested that with python 3.4. readline() appears to move the looping buffer forward. Let me check python 2 really quick – Checkmate Jun 29 '16 at 17:00
  • Ah, you are right for python 2.7. I'll edit my answer. Thanks, that's good to know! – Checkmate Jun 29 '16 at 17:02
  • can you give an example where one would actually use the read()? The only one I can think of is if you store a password in a file and you would want to read it - then using the .read() would be just a bit faster than the for l in file code. But for any normal size file...? – Maverick Meerkat Jun 29 '16 at 18:15
  • Added. Does that example help clear up your question? I can give a more grounded example if needed. – Checkmate Jun 29 '16 at 18:24
  • Oh, also if you're using regular expressions to search over the string of the file, then it could be a bit faster than iterating over the file. – Maverick Meerkat Jun 29 '16 at 18:24
  • Sure, no problem! It's a good question; let me know if I should add anything else. – Checkmate Jun 29 '16 at 18:31
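
The suggestion in the comments to call `next` on the file when skipping a line inside a `for` loop can be sketched as follows. The rule "skip the line after a `#` line" and the `io.StringIO` stand-in are assumptions invented for the example.

```python
import io

f = io.StringIO("header\nrow 1\n# skip the line after this one\nignored\nrow 2\n")

kept = []
for line in f:
    if line.startswith("#"):
        next(f, None)  # advance the same iterator past the following line
        continue
    kept.append(line.rstrip("\n"))
print(kept)
```

Passing `None` as the default to `next` avoids a `StopIteration` if the marker happens to be the last line.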
2

Hope this helps!

https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects

When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory
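
When the whole file might not fit in memory, one option is to pass an explicit size to read() and process the file piece by piece. The helper name `iter_chunks` and the 4096 default below are hypothetical choices for this sketch, and `io.StringIO` stands in for a large file.

```python
import io

def iter_chunks(f, size=4096):
    """Yield successive pieces of at most `size` characters until EOF."""
    while True:
        chunk = f.read(size)
        if not chunk:  # read() returns an empty string at end of file
            break
        yield chunk

f = io.StringIO("x" * 10000)  # stand-in for a large file
chunk_lengths = [len(c) for c in iter_chunks(f, size=4096)]
print(chunk_lengths)
```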

Sorry for all the edits!

For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:

for line in f:
    print line,

This is the first line of the file.
Second line of the file
Rudi
  • That's not an accurate description of the API for either C or Python. – user2357112 Jun 29 '16 at 16:53
  • I figured I wouldn't explain it very well, that's why I pulled the rest of my answer straight from the documentation. – Rudi Jun 29 '16 at 16:54
  • C does not default to reading files line by line. There isn't even a standard function for reading files line by line at all in C; `getline` is a POSIX extension. Also, the loop over `f.read()` does not read the entire file on each iteration, and it does not iterate over the lines. – user2357112 Jun 29 '16 at 17:01
  • I wasn't referring to getline, rather fscanf. – Rudi Jun 29 '16 at 17:21
  • It did last year when I took CS108, not sure when it changed, but I'll be sure to look into it. – Rudi Jun 29 '16 at 17:27
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115994/discussion-between-rudi-and-user2357112). – Rudi Jun 29 '16 at 17:35
0

Note that readline() is not comparable to looping over all the lines in a for-loop: it reads one line per call, and each call carries overhead, as others have already pointed out.

I ran timeit on two otherwise identical snippets, one using a for-loop over the file and the other using readlines(). You can see my snippet below:

from timeit import timeit


def test_read_file_1():
    # Build the full list of lines with readlines(), then loop over it.
    with open('ml/README.md', 'r') as f:
        for line in f.readlines():
            print(line)


def test_read_file_2():
    # Loop over the file object directly, one line at a time.
    with open('ml/README.md', 'r') as f:
        for line in f:
            print(line)


def test_time_read_file():
    duration_1 = timeit(test_read_file_1, number=1000000)
    duration_2 = timeit(test_read_file_2, number=1000000)

    print('duration using readlines():', duration_1)
    print('duration using for-loop:', duration_2)

And the results:

duration using readlines(): 78.826229238
duration using for-loop: 69.487692794

The bottom line, I would say, is that the for-loop is faster, but when either would work, I'd rather use readlines().

Shayan Amani
0

readlines() is more convenient than for line in file when you know that the data you are interested in starts at, for example, the 2nd line. You can simply write readlines()[1:].

A typical use case is a tab- or comma-separated value file whose first line is a header (when you don't want to use an additional module for tsv or csv files).
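
A small sketch of the header-skipping case, next to an alternative that avoids building the full list. The tab-separated sample data is made up, and `io.StringIO` stands in for an opened file.

```python
import io

# Stand-in for a small tab-separated file with a header row.
data = "name\tage\nAda\t36\nLinus\t28\n"

# readlines() loads every line, then the slice drops the header.
rows_a = [line.rstrip("\n").split("\t")
          for line in io.StringIO(data).readlines()[1:]]

# Alternative: consume the header with next(), then iterate lazily.
f = io.StringIO(data)
next(f, None)
rows_b = [line.rstrip("\n").split("\t") for line in f]

print(rows_a == rows_b)
```

Both give the same rows; the slice is shorter to write, while the `next()` version does not hold the whole file in a list first.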

Fibo Kowalsky
0
# The difference between file.read(), file.readline(), and file.readlines()
file = open('samplefile', 'r')
single_string = file.read()      # Reads the whole file into a single string
                                 # (\n characters are included)
file.seek(0)                     # Rewind: read() left the cursor at the end of the file
line = file.readline()           # Reads the line at the current cursor position
                                 # as a string and moves to the next line
list_strings = file.readlines()  # Reads the remaining lines into a list of strings
file.close()
-4

Eesssketit

That was a brilliant answer. Something good to know is that whenever you use the readline() function, it reads a line and moves past it, so the next call will not return that line again. You can return to an earlier position by using the seek() function. To go back to the start of the file, simply call f.seek(0).

Similarly, the function f.tell() will let you know which position you are at.
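
A minimal sketch of using tell() and seek() together to re-read a line. It uses an in-memory `io.StringIO` as a stand-in for a real file, which is an assumption for illustration.

```python
import io

f = io.StringIO("one\ntwo\n")

start = f.tell()          # position before reading anything
first = f.readline()      # consumes "one\n"
after_first = f.tell()    # the position has moved forward

f.seek(start)             # go back to where we started
again = f.readline()      # the same line can now be read again
print(first == again, after_first > start)
```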

cezar
Danny