Pandas GroupBy and select rows with the minimum value in a specific column

Question

I am grouping my dataset by column A and then would like to take the minimum value in column B and the corresponding value in column C.

data = pd.DataFrame({'A': [1, 2], 'B':[ 2, 4], 'C':[10, 4]})
data  
    A   B   C
0   1   4   3
1   1   5   4
2   1   2   10
3   2   7   2
4   2   4   4
5   2   6   6

and I would like to get :

    A   B   C
0   1   2   10
1   2   4   4

For the moment I am grouping by A, and creating a value that indicates me the rows I will keep in my dataset:

a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]

I am sure that there is a much more straight forward way to do this. I have seen many answers here that use multi-indexing but I would like to do this without adding multi-index to my dataframe. Thank you for your help.

score 55 · Accepted Answer · answered Feb 01 '19 at 00:03

55

I feel like you're overthinking this. Just use groupby and idxmin:

df.loc[df.groupby('A').B.idxmin()]

   A  B   C
2  1  2  10
4  2  4   4

df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)

   A  B   C
0  1  2  10
1  2  4   4

answered Feb 01 '19 at 00:03

cs95

330,695
80
606
657

I am trying this solution but with pandas 1.0.0 I am getting the error: ``Passing list-likes to .loc or [] with any missing labels is no longer supported ``. @cs95 do you have any suggestions how to fix that? – Eve Edomenko Aug 18 '20 at 10:25
1

@cs95 this results in one row per A, what if there are multiple rows with the minimum value for each of the values in A. something like all the students with minimum marks in Science. – A-dude Aug 05 '21 at 11:22

score 7 · Answer 2 · edited Dec 15 '19 at 10:51

7

Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:

df.loc[df.groupby('A')['B val'].idxmin()]

edited Dec 15 '19 at 10:51

Nazim Kerimbekov

4,497
8
31
54

answered Dec 15 '19 at 10:50

Juho

81
1
2

krassowski · Answer 3 · 2022-02-15T16:42:16.460

The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:

data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)

This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).

This approach has additional benefits:

it can be easily expanded to select n rows with smallest values in specific column

it can break ties by providing another column (as a list) to .sort_values(), e.g.:

data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)

As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:

df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)

Nice answer. I would complement saying that I did in this way and seems that works the same way: `data.sort_values('B').groupby('A').head(1)` — igorkf, May 06 '22 at 13:46

score 2 · Answer 4 · edited Sep 23 '21 at 23:31

I found an answer a little bit more wordy, but a lot more efficient:

This is the example dataset:

data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data

Out:
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6

First we will get the min values on a Series from a groupby operation:

min_value = data.groupby('A').B.min()
min_value

Out:
A
1    2
2    4
Name: B, dtype: int64

Then, we merge this series result on the original data frame

data = data.merge(min_value, on='A',suffixes=('', '_min'))
data

Out:
   A  B   C  B_min
0  1  4   3      2
1  1  5   4      2
2  1  2  10      2
3  2  7   2      4
4  2  4   4      4
5  2  6   6      4

Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.

data = data[data.B==data.B_min].drop('B_min', axis=1)
data

Out:
   A  B   C
2  1  2  10
4  2  4   4

I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.

very nice solution and easy to follow. – Niccola Tartaglia May 20 '22 at 15:23 — Niccola Tartaglia, May 20 '22 at 15:23

Pandas GroupBy and select rows with the minimum value in a specific column

4 Answers4

Linked

Related