Fast date-based replacement of rows in Pandas

Question

I am looking for the fastest way to replace rows in Pandas based on the index. I want to set all rows within a given date range of the index (a DatetimeIndex) to np.nan.

I tested various types of selection, but the bottleneck is evidently the assignment itself, i.e. setting the rows to a value (np.nan in my case).

Naively, I want to do this:

df['2017-01-01':'2018-01-01'] = np.nan

I also tried and benchmarked various other methods, such as

df.loc['2017-01-01':'2018-01-01'] = np.nan

and also creating a mask from NumPy datetimes to speed it up:

df['DateTime'] = df.index

st = pd.to_datetime('2017-01-01', format='%Y-%m-%d').to_datetime64()
en = pd.to_datetime('2018-01-01', format='%Y-%m-%d').to_datetime64()

ge_start = df['DateTime'] >= st
le_end = df['DateTime'] <= en
mask = ge_start & le_end

and then

df[mask] = np.nan
# or, since where() returns a new DataFrame instead of modifying df in place:
df = df.where(~mask)

None of these helped much. My DataFrame (which unfortunately I cannot share) has a shape of roughly (200, 1500000), so it is fairly big, and the operation takes on the order of seconds of CPU time, which is far too slow in my opinion.

Would appreciate any ideas!

Edit: after going through "Modifying a subset of rows in a pandas dataframe" and "Why dataframe.values is very slow", and unifying the datatypes for the operation, the problem is solved with a roughly 20x speedup.
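
For illustration, here is a minimal sketch of what unifying the datatypes before the assignment might look like. The small frame, its column names `a` and `b`, and the choice of float64 are assumptions made up for the example, since the real data is not shared. The likely reason it helps is that integer columns cannot hold NaN and have to be upcast during the assignment, and mixed dtypes are stored separately internally; the actual speedup will depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real DataFrame (not shared in the question).
idx = pd.date_range("2016-01-01", "2019-01-01", freq="D")
df = pd.DataFrame(
    {
        "a": np.arange(len(idx)),                   # integer column
        "b": np.arange(len(idx), dtype="float32"),  # float32 column
    },
    index=idx,
)

# Unify every column to a single float dtype up front, so that assigning
# NaN later does not have to upcast or handle several dtypes separately.
df = df.astype("float64")

# Label-based slice on the sorted DatetimeIndex; both endpoints are inclusive.
df.loc["2017-01-01":"2018-01-01"] = np.nan
```

Timing the assignment before and after the `astype` call (for example with `%timeit` on a copy of the frame) shows whether the dtype unification makes a difference for a particular dataset.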
