
I'm challenged with a rather simple task: from my SQL query I received three different columns, one for hours, one for minutes and one for seconds. I want them combined into a single time value.

My approach was to apply the dt.time function:

# Import relevant libraries
import datetime as dt
from timeit import timeit
import pandas as pd
import numpy as np

# Create an example Dataframe
rng = np.random.default_rng()
test = pd.DataFrame({"hours": rng.integers(0,24,1000000)
                    , "minutes" : rng.integers(0, 60, 1000000)
                    , "seconds":  rng.integers(0, 60, 1000000)
                    })

# Create my time function
test["time"] =  test.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis = 1)

The result is ridiculously slow in my real-world scenario, clocking in at more than 6 minutes for approximately 4 million rows.

Ravi
MichaelA

2 Answers


apply seems to be rather slow; faster results are achieved with the built-in map function:

import datetime as dt
from timeit import timeit
import pandas as pd
import numpy as np

rng = np.random.default_rng()
test = pd.DataFrame({"hours": rng.integers(0,24,1000000)
                    , "minutes" : rng.integers(0, 60, 1000000)
                    , "seconds":  rng.integers(0, 60, 1000000)
                    })

%%timeit
test["time"] =  test.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis = 1)

%%timeit
test["time_map"] = list(map(lambda hours, minutes, seconds: dt.time(hours, minutes, seconds)
                            , test.hours
                            , test.minutes
                            , test.seconds)
                        )

The first version takes just short of 16 seconds; the second approach takes only half a second. That is more than 30 times faster.

This solution works well for me and also applies to similar tasks.
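As a sanity check (a small sketch, not part of the original timing runs), the two approaches can be verified to produce identical values on a reduced sample. The explicit int() conversions are an extra precaution, since datetime.time only accepts NumPy integer types on newer Python versions:

```python
import datetime as dt

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "hours": rng.integers(0, 24, n),
    "minutes": rng.integers(0, 60, n),
    "seconds": rng.integers(0, 60, n),
})

# Row-wise apply (slow) versus map over the three columns (fast);
# both should build the exact same datetime.time objects.
via_apply = df.apply(
    lambda x: dt.time(int(x.hours), int(x.minutes), int(x.seconds)), axis=1
)
via_map = pd.Series(
    map(lambda h, m, s: dt.time(int(h), int(m), int(s)),
        df.hours, df.minutes, df.seconds)
)

assert via_apply.equals(via_map)
```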

I created this self-answered Q&A because I didn't find a fast solution and it took me some time to arrive at a better version. It is a specific application of this question:

MichaelA
    I would **avoid** using `datetime` objects in Pandas since you can't add/subtract times in that format. Not to mention the operation you mentioned is **not** vectorized by any means. `test["time_map"] = pd.to_timedelta(test.hours*3600+test.minutes*60+test.seconds, unit='s')` took a mere 20 ms, which is also a 20x improvement over your suggested solution. – Quang Hoang Dec 21 '20 at 16:07

A very hackish approach is to work with time deltas, which are fast to manipulate. This works because your hours + minutes + seconds always stay in the correct range (if you had, let's say, 48-hour values, this would fail).

q = (
    pd.to_timedelta(test.hours, unit="hours")
    + pd.to_timedelta(test.minutes, unit="minutes")
    + pd.to_timedelta(test.seconds, unit="seconds")
)
time_only = q.add(dt.datetime(2020,1,1)).apply(lambda x: x.time())
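A minimal sketch of that caveat, using a hypothetical out-of-range value: a 48-hour entry does not raise an error, it silently wraps around to midnight because the delta spills over into the next calendar day:

```python
import datetime as dt

import pandas as pd

# Hypothetical data: the second row holds an out-of-range 48-hour value.
df = pd.DataFrame({"hours": [1, 48], "minutes": [2, 0], "seconds": [3, 0]})

q = (
    pd.to_timedelta(df.hours, unit="hours")
    + pd.to_timedelta(df.minutes, unit="minutes")
    + pd.to_timedelta(df.seconds, unit="seconds")
)
times = q.add(dt.datetime(2020, 1, 1)).apply(lambda x: x.time())

# The 48-hour row lands two days past the anchor date, so .time()
# comes back as 00:00:00 rather than anything resembling "48 hours".
```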

I would also recommend modifying your SQL query to cast the columns to a time format instead of doing this on the Python side. That would be, in my opinion, the correct approach and should be much faster. Doing this transformation on the Python side feels like a workaround for a bad SQL query.
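As a rough illustration of the SQL-side approach, here is a sketch against an in-memory SQLite database. The table name, column names and sample rows are all made up, and the exact function for building a time value varies by engine (e.g. MAKETIME in MySQL, make_time in PostgreSQL); SQLite's built-in printf() serves as a stand-in here:

```python
import sqlite3

import pandas as pd

# Hypothetical schema and data, purely for demonstration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (hours INTEGER, minutes INTEGER, seconds INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [(9, 30, 15), (23, 59, 59)])

# Build the zero-padded HH:MM:SS string database-side instead of per row in Python.
query = "SELECT printf('%02d:%02d:%02d', hours, minutes, seconds) AS time FROM events"
df = pd.read_sql(query, con)
```

The per-row formatting now happens in the database; Python only receives ready-made strings.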

Steven G