So this may seem like an odd question, but I have a pandas DataFrame with addresses in it, that I want to geocode so I can get the latitude and longitude.

I have code that works using .apply() thanks to this very helpful thread (new column with coordinates using geopy pandas), but my problem is that all of the open APIs have strict limits on the number of requests they allow per second and per day.

I haven't been able to find any way to throttle my code to match the limits of the APIs. My DF has 25K rows, but I've only been able to successfully geocode if I create a subset of it with up to 5 rows.

I don't have a lot of experience with python and pandas, but in SAS the DATA steps iterate one row at a time, so I could have a sleep command that would throttle the requests. What would be the best way to implement something like that with python/pandas?

EDIT: Based on the answers so far, I wanted to confirm that my code would change from:

df_small['city_coord'] = df_small['Address'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))

to:

df_small = df_clean[:5]
def f(x, delay=1):
# run your code    
sleep(delay)
return geolocator.geocode(x)

df_small['city_coord'] = df_small['Address'].apply(f).apply(lambda x: (x.latitude, x.longitude))

1 Answer

To iterate with a delay, you can use df.iterrows() and time.sleep():

from time import sleep

for index, row in df.iterrows():  # iterrows() yields (index, row) pairs
    # run your code
    sleep(1) # how many seconds to wait

Or you can just put time.sleep() within the apply function itself (as @RafaelC suggests in the comments):

def f(x, delay=1):
    # run your code
    sleep(delay)

df['Address'].apply(f)  # apply to the column so f runs once per row
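As a rough sketch of the same idea, the delay can be moved into a reusable wrapper so the geocoding function itself stays untouched. The names `throttled` and `to_coords` are made up for this illustration, and the one-second delay is an assumption; check the limit of whichever API you use. `to_coords` also guards against the geocoder returning None for an address it cannot resolve, which would otherwise break the lambda in the question.

```python
import time
from functools import wraps


def throttled(func, min_delay=1.0):
    """Wrap func so that consecutive calls start at least min_delay seconds apart."""
    # Mutable cell holding the start time of the previous call.
    last_call = [float("-inf")]

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Sleep only for whatever part of the delay has not already elapsed.
        wait = min_delay - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper


def to_coords(location):
    """Return (latitude, longitude), or None when the geocoder found nothing."""
    if location is None:
        return None
    return (location.latitude, location.longitude)
```

With the question's objects this would be used as `geocode = throttled(geolocator.geocode, min_delay=1)` followed by `df['Address'].apply(geocode).apply(to_coords)`. Recent geopy versions also ship a similar helper, `geopy.extra.rate_limiter.RateLimiter`, which may be preferable to rolling your own.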
ASGM
  • Why not put the `sleep` inside the function that is argument for `apply`? – rafaelc Apr 09 '18 at 16:50
  • Could you take a look at my recent edit and let me know if I am on the right track? – Michael Melillo Apr 09 '18 at 17:30
  • @MichaelMelillo you're still sub-setting the data before running the code, while the code should presumably work on the full dataframe (also the indents are wrong, but I imagine that's just a posting error). – ASGM Apr 09 '18 at 22:56
  • Thanks. The subset is because there are actually 2 limitations, one is per second and another is per day. So this code solves one problem, but I will have to determine another solution to allow for the full data set to be completed. But using your solution, I was able to geocode a much larger population. Thank you for the answer. – Michael Melillo Apr 10 '18 at 02:55
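For the per-day limit mentioned in the last comment, one possible approach (a sketch only, not something from the answer above) is to keep a dictionary of finished results and geocode only the addresses still missing, stopping once the day's quota is used. The function name `geocode_batch` and its parameters are hypothetical, and the actual daily limit depends on the API.

```python
def geocode_batch(addresses, results, geocode, daily_limit):
    """Geocode addresses missing from results, making at most daily_limit requests.

    results is a dict mapping address -> geocoder output. It is updated in
    place, and the number of requests made in this run is returned.
    """
    requests_made = 0
    for address in addresses:
        if address in results:
            continue  # already geocoded in an earlier run, or a duplicate
        if requests_made >= daily_limit:
            break  # quota used up; the remaining addresses wait for the next run
        results[address] = geocode(address)
        requests_made += 1
    return requests_made
```

The idea would be to save `results` between runs (for example with pickle), call the function once per day with `df['Address']` and a throttled geocoder, and build the column at the end with `df['Address'].map(results)`.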