I have a stock EOD dataset on which I'm performing groupby and transformations, with one row per stock per day. In total it's a bit more than 1M rows and 5K groups. Here is an example with only the relevant series:
            bloomberg_ticker       close
date
1999-11-01            DIN US   10.803848
1999-11-01            TXT US   28.428824
1999-11-01            LNG US    1.820000
1999-11-01            AXL US   11.960504
1999-11-01            CLH US    0.139396
...                      ...         ...
2021-06-18            RCM US   21.500000
2021-06-18            ACA US   56.070000
2021-06-18           FLXN US    9.160000
2021-06-18           FICO US  502.350000
2021-06-18           1COV GR   51.780000
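For reference, a made-up frame with the same layout (a non-unique date index, one row per stock per day) could be built roughly like this; the tickers and prices are random placeholders, and the real data has ~5K tickers rather than 50:

import numpy as np
import pandas as pd

# Made-up stand-in for the real EOD data: a repeated (non-unique) date index,
# a bloomberg_ticker column and the price columns used in the snippets below.
rng = np.random.default_rng(0)
dates = pd.bdate_range("1999-11-01", "2021-06-18")
tickers = [f"TCK{i:04d} US" for i in range(50)]
prices = rng.lognormal(mean=3.0, sigma=1.0, size=len(dates) * len(tickers))
data = pd.DataFrame(
    {
        "bloomberg_ticker": np.tile(tickers, len(dates)),
        "close": prices,
        "adjusted_close": prices,  # placeholder; in the real data this differs from close
    },
    index=dates.repeat(len(tickers)),
)
data.index.name = "date"
df = data.copy()  # df carries the same date index as data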
One of the transformations is an adjusted moving average (the rolling mean divided by today's value):
for w in windows:
    # one adjusted-moving-average column per window
    df[f"foo{w}"] = (
        data.groupby("bloomberg_ticker")
        .close.transform(lambda x: x.rolling(w).mean() / x)
        .astype(dtype="float32")
    )
This, of course, generates some NaNs in the first few rows of each group, depending on the window in use (I don't want to use min_periods). It runs pretty quickly, a few seconds for all 1M+ rows, so far so good-ish.
When I try to clear the NaNs and then add the new column, it takes ages. The same goes for join and merge(how="left"):
for w in windows:
    s = (
        data.groupby("bloomberg_ticker")
        .adjusted_close.transform(lambda x: x.rolling(w).mean() / x)
        .dropna()
        .astype(dtype="float32")
        .rename(f"bar{w}")
    )
    df.join(s)  # fails because it tries to allocate 189GB of RAM (why??)
    df.merge(s, how="left", left_index=True, right_index=True)  # takes ages
Both s and df use the same index. The only difference I can see between the two approaches is that in the second one s doesn't have a value for every row, since some rows were dropped by dropna().
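A quick sanity check of that claim looks roughly like this (just a sketch, using the df and s from the snippet above):

# Index diagnostics (sketch): are the labels identical, are they unique,
# and how many rows did dropna() remove?
print(df.index.equals(s.index))
print(df.index.is_unique, s.index.is_unique)
print(len(df), len(s))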
The solution I'm using at the moment is to keep the NaNs during the transformations and run dropna on a subset of the columns once they're all done, which is also pretty quick.
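Roughly like this (a sketch; the column names are just the f"bar{w}" ones generated above):

# Current workaround (sketch): keep the NaNs while transforming, then drop
# them once at the end, looking only at the generated columns.
df = df.dropna(subset=[f"bar{w}" for w in windows])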
I've been looking online to understand why this happens, but haven't figured it out yet. I'm sure there's room for improvement, as my pandas skills are still weak.