
I have a dataframe maint_comp4 which has the following values-

comp4   datetime_maint       machineID
    1   2020-06-01 06:00:00     24
    1   2020-06-16 06:00:00     4
    1   2020-06-16 06:00:00     15
    1   2020-06-16 06:00:00     25
    1   2020-07-01 06:00:00     1

And another dataframe tel_times with following values-

    machineID   datetime_tel
0      1    2021-01-01 06:00:00
1      1    2021-01-01 07:00:00
2      1    2021-01-01 08:00:00
3      1    2021-01-01 09:00:00
4      1    2021-01-01 10:00:00

I am merging these two dataframes and then applying a condition to keep only the rows where datetime_tel is greater than datetime_maint; after that I take the difference of the two datetime columns to find the number of days between them.

maint_tel_comp4 = tel_times.merge(maint_comp4, on='machineID', how='left')
# .copy() avoids a SettingWithCopyWarning when the new column is assigned below
component4 = maint_tel_comp4[maint_tel_comp4['datetime_tel'].gt(maint_tel_comp4['datetime_maint'])].copy()
component4['sincelastComp4'] = (component4['datetime_tel'] - component4['datetime_maint']).dt.days
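To make the setup concrete, here is a self-contained sketch of the same merge-filter-diff pipeline on a hypothetical miniature of the two frames (only the machineID 30 rows from the example below), using the column names from the question:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question
maint_comp4 = pd.DataFrame({
    'comp4': [1, 1],
    'datetime_maint': pd.to_datetime(['2020-08-15 06:00:00', '2021-01-16 06:00:00']),
    'machineID': [30, 30],
})
tel_times = pd.DataFrame({
    'machineID': [30],
    'datetime_tel': pd.to_datetime(['2021-01-31 06:00:00']),
})

maint_tel_comp4 = tel_times.merge(maint_comp4, on='machineID', how='left')
mask = maint_tel_comp4['datetime_tel'].gt(maint_tel_comp4['datetime_maint'])
# .copy() avoids SettingWithCopyWarning when adding the new column below
component4 = maint_tel_comp4[mask].copy()
component4['sincelastComp4'] = (component4['datetime_tel'] - component4['datetime_maint']).dt.days
print(component4['sincelastComp4'].tolist())  # [169, 15] -- both candidates survive
```

This reproduces the issue: every past maintenance date passes the greater-than filter, so one telemetry row fans out into several merged rows.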

The problem is that after applying the greater-than condition, I get multiple rows per telemetry timestamp, but I want only the row with the smallest day difference.

For example, this is the query for a particular date and machineID:

component4[(component4['machineID']==30) & (component4['datetime_tel']=='2021-01-31 06:00:00')]

It gives the following output:

           machineID    datetime_tel    comp4   datetime_maint  sincelastComp4
1978014    30   2021-01-31 06:00:00     1   2020-08-15 06:00:00   169
1978015    30   2021-01-31 06:00:00     1   2021-01-16 06:00:00   15

Now this output has two rows, but I want only the one row whose sincelastComp4 value is 15 (the minimum). Similarly, for other dates there are multiple rows, and I want only the one with the minimum sincelastComp4 value.

I want the minimum because sincelastComp4 indicates the number of days that have passed since component 4 was last repaired, so it should reflect the most recent repair.

Edit 1: As per the answer suggested in the comments, I tried using sort_values with drop_duplicates, but it gives output for only one machineID and not for all.

component4 = component4.sort_values('sincelastComp4').drop_duplicates('datetime_tel', keep='last')

Output-

       machineID    datetime_tel    comp4   datetime_maint  sincelastComp4
1554168 24  2021-01-01 16:00:00    1    2020-06-01 06:00:00 214
1554159 24  2021-01-01 15:00:00    1    2020-06-01 06:00:00 214
1554150 24  2021-01-01 14:00:00    1    2020-06-01 06:00:00 214
1554141 24  2021-01-01 13:00:00    1    2020-06-01 06:00:00 214
1554132 24  2021-01-01 12:00:00    1    2020-06-01 06:00:00 214
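Two things may explain this output (a sketch with hypothetical data, not the real frame): drop_duplicates keyed only on datetime_tel collapses rows across machines that share a telemetry timestamp, so the subset should include machineID; and keep='last' after an ascending sort keeps the largest sincelastComp4, whereas keep='first' keeps the minimum:

```python
import pandas as pd

# Hypothetical frame: two machines sharing one telemetry timestamp,
# each with two candidate maintenance gaps
component4 = pd.DataFrame({
    'machineID': [24, 24, 30, 30],
    'datetime_tel': pd.to_datetime(['2021-01-01 12:00:00'] * 4),
    'sincelastComp4': [214, 30, 169, 15],
})

# Keep the smallest sincelastComp4 per (machineID, datetime_tel) pair:
# ascending sort + keep='first' retains the minimum of each group
result = (component4
          .sort_values('sincelastComp4')
          .drop_duplicates(subset=['machineID', 'datetime_tel'], keep='first'))
print(result.sort_values('machineID')['sincelastComp4'].tolist())  # [30, 15]
```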
  • https://stackoverflow.com/questions/51459651/drop-duplicates-keeping-the-row-with-the-highest-value-in-another-column Sounds useful for you. – Yanirmr May 17 '22 at 10:57
  • https://stackoverflow.com/questions/12497402/remove-duplicates-by-columns-a-keeping-the-row-with-the-highest-value-in-column Similar problem here. – Yanirmr May 17 '22 at 10:58
  • @Yanirmr Thanks, it worked. Can you please upvote the question? – PeakyBlinder May 17 '22 at 11:23

1 Answer


Assume your dataframe component4 looks like this after the merge (I'll call it df):

df
# 
#                 Date_x     Val_x  id              Date_y     Val_y
# 0  2015-02-27 00:00:00 -0.981776   1 2015-02-24 00:00:00 -0.644511
# 1  2015-02-27 00:00:00 -0.981776   1 2015-02-24 00:01:00 -0.259097
# 2  2015-02-27 00:00:00 -0.981776   1 2015-02-24 00:02:00 -1.613186
# 3  2015-02-27 00:01:00  0.026641   1 2015-02-24 00:00:00 -0.644511
# 4  2015-02-27 00:01:00  0.026641   1 2015-02-24 00:01:00 -0.259097
# 5  2015-02-27 00:01:00  0.026641   1 2015-02-24 00:02:00 -1.613186
# 6  2015-02-27 00:02:00 -0.136561   1 2015-02-24 00:00:00 -0.644511
# 7  2015-02-27 00:02:00 -0.136561   1 2015-02-24 00:01:00 -0.259097
# 8  2015-02-27 00:02:00 -0.136561   1 2015-02-24 00:02:00 -1.613186
# 9  2015-02-27 00:03:00  0.664752   2 2015-02-24 00:03:00  1.647569
# 10 2015-02-27 00:03:00  0.664752   2 2015-02-24 00:04:00  1.962239
# 11 2015-02-27 00:04:00 -1.303589   2 2015-02-24 00:03:00  1.647569
# 12 2015-02-27 00:04:00 -1.303589   2 2015-02-24 00:04:00  1.962239

One can add a new column diff holding the difference between the two timestamps:

df['diff'] = df['Date_x']-df['Date_y']

Then you can find the minimum diff in each group using transform

df[df['diff'] == df.groupby('id')['diff'].transform('min')]
Out[125]: 
                Date_x     Val_x  id              Date_y     Val_y            diff
2  2015-02-27 00:00:00 -0.981776   1 2015-02-24 00:02:00 -1.613186 2 days 23:58:00
10 2015-02-27 00:03:00  0.664752   2 2015-02-24 00:04:00  1.962239 2 days 23:59:00
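Applied to the question's own column names, the same transform pattern might look like this (a sketch with hypothetical data; grouping on both machineID and datetime_tel so each telemetry timestamp keeps only its nearest prior maintenance):

```python
import pandas as pd

# Hypothetical merged frame in the question's shape
component4 = pd.DataFrame({
    'machineID': [30, 30, 24, 24],
    'datetime_tel': pd.to_datetime(['2021-01-31 06:00:00'] * 2
                                   + ['2021-01-01 12:00:00'] * 2),
    'datetime_maint': pd.to_datetime(['2020-08-15 06:00:00', '2021-01-16 06:00:00',
                                      '2020-06-01 06:00:00', '2020-12-01 06:00:00']),
})
component4['sincelastComp4'] = (component4['datetime_tel']
                                - component4['datetime_maint']).dt.days

# transform('min') broadcasts each group's minimum back onto every row,
# so the equality comparison keeps only the minimum-gap row per group
grp = component4.groupby(['machineID', 'datetime_tel'])['sincelastComp4']
nearest = component4[component4['sincelastComp4'] == grp.transform('min')]
print(sorted(nearest['sincelastComp4'].tolist()))  # [15, 31]
```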

(See also: Get the row(s) which have the max value in groups using groupby.)
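As an aside, pandas has pd.merge_asof, which is built for exactly this "most recent earlier event" lookup and skips the full many-to-many merge entirely. A hedged sketch on hypothetical data (note that direction='backward' matches on datetime_maint <= datetime_tel, i.e. it also admits exact-time matches, unlike the strict greater-than filter in the question):

```python
import pandas as pd

# Hypothetical inputs in the question's shape
tel_times = pd.DataFrame({
    'machineID': [30, 24],
    'datetime_tel': pd.to_datetime(['2021-01-31 06:00:00', '2021-01-01 12:00:00']),
})
maint_comp4 = pd.DataFrame({
    'machineID': [30, 30, 24],
    'datetime_maint': pd.to_datetime(['2020-08-15 06:00:00', '2021-01-16 06:00:00',
                                      '2020-06-01 06:00:00']),
})

# merge_asof requires both frames to be sorted on their time keys
nearest = pd.merge_asof(
    tel_times.sort_values('datetime_tel'),
    maint_comp4.sort_values('datetime_maint'),
    left_on='datetime_tel', right_on='datetime_maint',
    by='machineID',            # match within each machine
    direction='backward',      # most recent maintenance at or before each reading
)
nearest['sincelastComp4'] = (nearest['datetime_tel']
                             - nearest['datetime_maint']).dt.days
print(sorted(nearest['sincelastComp4'].tolist()))  # [15, 214]
```

Each telemetry row comes back exactly once, already paired with its latest prior maintenance, so no dedupe step is needed afterwards.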
