I am collecting some data, which I am manipulating (via csv) as a (large-ish) dataframe.

Some of the information for the dataframe is only available to me at a certain time. To get around this, I write the raw data as such:

id   col2   col3   time_sensitive_data
id1  data2  data3  0                    (when most data is available)
id1  0      0      time_sens_data

Then, when I analyse the data, I need to propagate this `time_sensitive_data` value to every row of the dataframe that shares the same 'id'. At the moment I do this:

ids = data['id'].unique()
for id in ids:
    current = data.loc[data['id'] == id]
    current_time_sensitive_info = current['time_sensitive_data'].max()
    data.loc[(data['id'] == id), 'time_sensitive_data'] = current_time_sensitive_info

This solution works, but it is painfully slow (more than 10 minutes). Is there a faster way to achieve this result?

tripleee
James_yf
    Difficult to say without easily reproducible data (https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), but a `groupby` with `transform` would probably be faster. – coffeinjunky Aug 08 '21 at 15:59
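
A minimal sketch of the `groupby` + `transform` idea from the comment above, not a tested answer against the real data. The toy frame and its values are made up to mirror the layout in the question, and the sketch assumes (as the loop in the question does) that the placeholder `0` is always smaller than the real time-sensitive value, so the per-id maximum picks out the real one.

```python
import pandas as pd

# Hypothetical toy data mirroring the layout in the question (values are made up).
data = pd.DataFrame({
    "id": ["id1", "id1", "id2", "id2", "id2"],
    "col2": [5, 0, 7, 8, 0],
    "col3": [6, 0, 9, 10, 0],
    "time_sensitive_data": [0, 42, 0, 0, 17],
})

# For each row, compute the maximum of time_sensitive_data within its id group.
# transform returns a Series aligned to the original index, so it can be
# assigned straight back without an explicit Python loop over the ids.
data["time_sensitive_data"] = (
    data.groupby("id")["time_sensitive_data"].transform("max")
)

print(data)
```

This replaces the per-id boolean masking with a single grouped pass, which is why it would probably be faster on a large frame. If the placeholder could ever exceed a real value, `max` would no longer be the right aggregation.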

0 Answers