Efficient max selection in pandas dataframe with selection condition

Question

I have a pandas dataframe with columns (among others) user_id and start_time. I want to efficiently and readably find all rows associated with each user's max start_time.

For example, if this were my data:

user_id   start_time   A    B    C
1         37           a    b    c
1         45           d    e    f
1         45           g    h    i
2         58           j    k    l
2         17           m    n    o
2         58           p    q    r
3          2           s    t    u

then I would expect to find

user_id   start_time   A    B    C
1         45           d    e    f
1         45           g    h    i
2         58           j    k    l
2         58           p    q    r
3          2           s    t    u

I've been coming up with solutions a bit like Conditional selection of data in a pandas DataFrame, but that finds the user_id with the latest start time, not the selection of the table for per-user max start_time's.

Of course, it's easy if I iterated the DataFrame by hand, but that is inefficient.

Thanks for any pointers.

For convenience to future readers, generate the dataframe thus:

columns = ['user_id', 'start_time', 'A', 'B', 'C']
LoL = [
    [1, 37, 'a', 'b', 'c'],
    [1, 45, 'd', 'e', 'f'],
    [1, 45, 'g', 'h', 'i'],
    [2, 58, 'j', 'k', 'l'],
    [2, 17, 'm', 'n', 'o'],
    [2, 58, 'p', 'q', 'r'],
    [3, 2, 's', 't', 'u']]
pd.DataFrame = (LoL, columns=columns)

I'm curious to know how a groupby() and merge() approach would perform on your large frame vs the other answers below. Could you try it? g = df.groupby('user_id', as_index=False)['start_time'].max() df.merge(g) — Bob Haffner, Mar 09 '15 at 15:34

EdChum · Accepted Answer · 2015-03-09T12:19:08.737

You can groupby the user_id column and then call apply and pass a lambda which filters the results where the start time is equal to the max value, we want to generate a boolean index from this. We can then call reset_index but due to the way the groupby was filtered we will get an error with duplicate columns so we have to drop this duplicate column:

In [66]:

gp = df.groupby('user_id')
gp.apply(lambda x: x[x['start_time'] == x['start_time'].max()]).reset_index(drop=True)

Out[66]:
   user_id  start_time  A  B  C
0        1          45  d  e  f
1        1          45  g  g  i
2        2          58  j  k  l
3        2          58  p  q  r
4        3           2  s  t  u

if we didn't call reset_index you get duplicated values:

In [67]:

gp.apply(lambda x: x[x['start_time'] == x['start_time'].max()])
Out[67]:
           user_id  start_time  A  B  C
user_id                                
1       1        1          45  d  e  f
        2        1          45  g  g  i
2       3        2          58  j  k  l
        5        2          58  p  q  r
3       6        3           2  s  t  u

The inner boolean condition produces a boolean mask on the multi-index, this is then needed to pass to the lambda to produce the above:

In [68]:

gp.apply(lambda x: x['start_time'] == x['start_time'].max())
Out[68]:
user_id   
1        0    False
         1     True
         2     True
2        3     True
         4    False
         5     True
3        6     True
Name: start_time, dtype: bool

On a much larger data set, this runs in 790 ms compared to 1345 ms for ajcr's and Primer's solutions. (They all appear correct to me.) — jma, Mar 09 '15 at 12:49

Primer · Answer 2 · 2015-03-09T12:43:10.967

1

Alternatively you can do this:

df[df.start_time == df.groupby('user_id')['start_time'].transform(max)]

Which yields:

   user_id  start_time  A  B  C
1        1          45  d  e  f
2        1          45  g  h  i
3        2          58  j  k  l
5        2          58  p  q  r
6        3           2  s  t  u

edited Mar 09 '15 at 12:43

answered Mar 09 '15 at 12:22

Primer

9,564
4
39
51

Efficient max selection in pandas dataframe with selection condition

2 Answers2

Linked