List comprehensions are considerably faster than a normal for loop. The reason usually given is that a list comprehension does not need to call append, which is understandable. I have also read in various places that list comprehensions are faster than apply, and I have experienced that as well. But I cannot understand what internal mechanism makes them so much faster than apply.

I know this has something to do with vectorization in numpy, which underlies pandas dataframes. But what makes list comprehensions better than apply is not clear to me, since a list comprehension contains an explicit for loop, whereas with apply we do not write any for loop (and I assume vectorization takes place there as well).
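To make the three approaches concrete, here is a minimal sketch on a small made-up Series (the names `s`, `f`, `via_apply`, `via_listcomp` and `via_numpy` are illustrative, not from the dataset above). It relies on the point made in the comments that apply calls the Python function once per element rather than vectorizing, and it uses `np.where` as one example of an operation that really is vectorized. It only checks that the three give the same result; it does not measure speed.

```python
import numpy as np
import pandas as pd

# Small illustrative Series (assumption: stands in for a real column).
s = pd.Series(range(5, 25))

def f(x):
    # Same rule as in the question: double multiples of 5.
    return x * 2 if x % 5 == 0 else x

# apply: f is called once per element (per the comments, no vectorization).
via_apply = s.apply(f)

# List comprehension: also one Python call per element, written explicitly.
via_listcomp = pd.Series([f(x) for x in s], index=s.index)

# Vectorized: the condition and arithmetic run on whole arrays inside numpy.
arr = s.to_numpy()
via_numpy = pd.Series(np.where(arr % 5 == 0, arr * 2, arr), index=s.index)

assert via_apply.tolist() == via_listcomp.tolist() == via_numpy.tolist()
```

Wrapping each of the three in `%timeit` on a larger Series would show how they compare on a given machine.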

Edit: adding code. This runs on the Titanic dataset, where a title is extracted from the name: https://www.kaggle.com/c/titanic/data

%timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
                                         ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else\
                                                ('Master' if 'Master' in x else 'None'))))

%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]

Result: 782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit2: While preparing a simple example for SO, I was surprised to find that for the code below the results reverse:

import pandas as pd
import timeit
df_test = pd.DataFrame()
tlist = []
tlist2 = []
for i in range (0,5000000):
  tlist.append(i)
  tlist2.append(i+5)
df_test['A'] = tlist
df_test['B'] = tlist2

display(df_test.head(5))


%timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
display(df_test.head(5))
%timeit df_test['C'] = [ x*2 if x%5==0 else x for x in df_test['B']]

display(df_test.head(5))

1 loop, best of 3: 2.14 s per loop

1 loop, best of 3: 2.24 s per loop

Edit3: Some commenters suggested that apply is essentially a for loop. That does not match what I see: my first explicit for loop ran so slowly that I had to stop it manually after 3-4 minutes, and it never completed in that time. The version below uses itertuples:

for row in df_test.itertuples():
  x = row.B
  if x%5==0:
    df_test.at[row.Index,'B'] = x*2

Running the above code takes around 23 seconds, but apply takes only 1.8 seconds. So what is the difference between this explicit loop over itertuples and apply?
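One thing worth separating out when comparing these is that the itertuples loop above does more work per row than apply does: it builds a tuple for every row and writes back with `.at` one cell at a time. The sketch below, on a small made-up frame (`df_small`, `loop_df`, `collected_df` and `apply_df` are hypothetical names), shows a loop that iterates over the single column and assigns the results once, next to the original loop and apply. Whether this accounts for the timing gap is an assumption that would need to be checked with `%timeit`; the sketch only confirms that all three produce the same column.

```python
import pandas as pd

# Small illustrative frame with the same shape as df_test in the question.
df_small = pd.DataFrame({"A": range(10), "B": range(5, 15)})

# 1) The loop from the question: one tuple per row, one .at write per match.
loop_df = df_small.copy()
for row in loop_df.itertuples():
    x = row.B
    if x % 5 == 0:
        loop_df.at[row.Index, "B"] = x * 2

# 2) A plain loop over just the column, collecting results and assigning once.
values = []
for x in df_small["B"]:
    values.append(x * 2 if x % 5 == 0 else x)
collected_df = df_small.copy()
collected_df["B"] = values

# 3) apply with the same rule.
apply_df = df_small.copy()
apply_df["B"] = df_small["B"].apply(lambda x: x * 2 if x % 5 == 0 else x)

assert loop_df["B"].tolist() == collected_df["B"].tolist() == apply_df["B"].tolist()
```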

Tushar Seth
  • isn't `apply` essentially a `for` loop? – Quang Hoang Aug 12 '19 at 21:11
  • you need to show some code with your benchmark – Chris_Rands Aug 12 '19 at 21:13
  • Here's an interesting and [related SO question and answer](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care) that I have bookmarked – G. Anderson Aug 12 '19 at 21:25
  • @G.Anderson, thanks for the link, but it says that apply is slower, not why – Tushar Seth Aug 12 '19 at 21:35
  • can you give an example where this is the case? – AndrewH Aug 12 '19 at 21:49
  • `.apply` is basically a for-loop. It does not use vectorization. And note, list comprehensions are only marginally faster than for loops, and both can be made essentially equally performant if you cache the `.append` method resolution, which is practically what a list comprehension does (note it still uses append) – juanpa.arrivillaga Aug 12 '19 at 22:53
  • @TusharSeth it does, you just need to look for it. It is essentially a slow wrapper around a for loop with a lot of overhead which usually isn't required for most simple operations. – cs95 Aug 15 '19 at 07:32
  • I am referring to this SO post: https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code It says never to use apply because it is much slower, but it does not explain why, which is why I asked this separate question – Tushar Seth Aug 19 '19 at 09:19
  • @juanpa.arrivillaga, I added to the question that apply does not seem to be a for loop, since a for loop is much slower than apply, so it has to be something else – Tushar Seth Aug 28 '19 at 09:13
  • @QuangHoang, as explained to juanpa, I added to the question that apply does not seem to be a for loop, since a for loop is much slower than apply, so it has to be something else – Tushar Seth Aug 28 '19 at 09:18
  • @TusharSeth because the loop you are using is the *slowest possible way*. **Never** use `x = df_test.loc[i,'B']`, try it with `df.itertuples()`. It **is a loop**. You can [check the source code yourself](https://stackoverflow.com/questions/38938318/why-apply-sometimes-isnt-faster-than-for-loop-in-pandas-dataframe/38938507#38938507) – juanpa.arrivillaga Aug 28 '19 at 16:28
  • @juanpa.arrivillaga +1 for that link. But I have a doubt: `for row in df_test.itertuples(): x = row.B if x%5==0: print(row.B)` This code using itertuples is also very slow. Apologies if I am missing something, but I really need to understand how the loop inside apply can be faster than this explicit for loop – Tushar Seth Aug 28 '19 at 19:06
  • @juanpa.arrivillaga, I updated the itertuples code. It takes around 23 seconds, but apply finishes in about 1 second, so my doubt is what differs in the implementations they use – Tushar Seth Aug 29 '19 at 09:56
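The remark in the comments about caching the `.append` method resolution can be sketched as follows. This is a minimal illustration on a plain Python list (the names `data`, `plain`, `cached` and `comp` are made up for the example); it shows what "caching" means and that the three forms build the same list, not how much time each takes.

```python
data = list(range(10))

# Ordinary loop: the attribute lookup plain.append happens on every iteration.
plain = []
for x in data:
    plain.append(x * 2 if x % 5 == 0 else x)

# Cached lookup: resolve the bound method once, then call it directly.
cached = []
append = cached.append
for x in data:
    append(x * 2 if x % 5 == 0 else x)

# List comprehension: the same per-element work, with the appending handled
# by the interpreter instead of an explicit method call.
comp = [x * 2 if x % 5 == 0 else x for x in data]

assert plain == cached == comp
```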

0 Answers