I have a stock EOD dataset on which I'm performing groupby and transformations, with one row per stock per day. In total it's a bit more than 1M rows and 5K groups. Here is an example with only the relevant series:
            bloomberg_ticker       close
date
1999-11-01            DIN US   10.803848
1999-11-01            TXT US   28.428824
1999-11-01            LNG US    1.820000
1999-11-01            AXL US   11.960504
1999-11-01            CLH US    0.139396
...                      ...         ...
2021-06-18            RCM US   21.500000
2021-06-18            ACA US   56.070000
2021-06-18           FLXN US    9.160000
2021-06-18           FICO US  502.350000
2021-06-18           1COV GR   51.780000
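For reference, a made-up frame with the same layout (a non-unique date index, one row per stock per day) could be built roughly like this; the tickers and prices are random placeholders, and the real data has ~5K tickers rather than 50:

import numpy as np
import pandas as pd

# Made-up stand-in for the real EOD data: a repeated (non-unique) date index,
# a bloomberg_ticker column and the price columns used in the snippets below.
rng = np.random.default_rng(0)
dates = pd.bdate_range("1999-11-01", "2021-06-18")
tickers = [f"TCK{i:04d} US" for i in range(50)]
prices = rng.lognormal(mean=3.0, sigma=1.0, size=len(dates) * len(tickers))
data = pd.DataFrame(
    {
        "bloomberg_ticker": np.tile(tickers, len(dates)),
        "close": prices,
        "adjusted_close": prices,  # placeholder; in the real data this differs from close
    },
    index=dates.repeat(len(tickers)),
)
data.index.name = "date"
df = data.copy()  # df carries the same date index as data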
One of the transformations is an adjusted moving average (the rolling mean divided by today's value):
for w in windows:
    # one adjusted-moving-average column per window
    df[f"foo{w}"] = (
        data.groupby("bloomberg_ticker")
        .close.transform(lambda x: x.rolling(w).mean() / x)
        .astype(dtype="float32")
    )
This, of course, generates some NaNs in the first few rows of each group, depending on the window in use (I don't want to use min_periods). It runs pretty quickly, a few seconds for all 1M+ rows, so far so good-ish.
When I try to clear the NaNs and then add the new column, it takes ages. The same goes for join and merge(how="left"):
for w in windows:
    s = (
        data.groupby("bloomberg_ticker")
        .adjusted_close.transform(lambda x: x.rolling(w).mean() / x)
        .dropna()
        .astype(dtype="float32")
        .rename(f"bar{w}")
    )
    df.join(s)  # fails because it tries to allocate 189GB of RAM (why??)
    df.merge(s, how="left", left_index=True, right_index=True)  # takes ages
Both s and df use the same index. The only difference I can see between the two approaches is that in the second one s doesn't have a value for every row, since some rows were dropped by dropna().
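A quick sanity check of that claim looks roughly like this (just a sketch, using the df and s from the snippet above):

# Index diagnostics (sketch): are the labels identical, are they unique,
# and how many rows did dropna() remove?
print(df.index.equals(s.index))
print(df.index.is_unique, s.index.is_unique)
print(len(df), len(s))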
The solution I'm using at the moment is to keep the NaNs during the transformations and run dropna on a subset of the columns once they're all done, which is also pretty quick.
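Roughly like this (a sketch; the column names are just the f"bar{w}" ones generated above):

# Current workaround (sketch): keep the NaNs while transforming, then drop
# them once at the end, looking only at the generated columns.
df = df.dropna(subset=[f"bar{w}" for w in windows])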
I've been looking online to understand why this happens, but haven't figured it out yet. I'm sure there's room for improvement, as my pandas skills are still weak.