38

I am grouping my dataset by column A and then would like to take the minimum value in column B and the corresponding value in column C.

data = pd.DataFrame({'A': [1, 2], 'B':[ 2, 4], 'C':[10, 4]})
data  
    A   B   C
0   1   4   3
1   1   5   4
2   1   2   10
3   2   7   2
4   2   4   4
5   2   6   6  

and I would like to get :

    A   B   C
0   1   2   10
1   2   4   4

For the moment I am grouping by A, and creating a value that indicates me the rows I will keep in my dataset:

a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]

I am sure that there is a much more straight forward way to do this. I have seen many answers here that use multi-indexing but I would like to do this without adding multi-index to my dataframe. Thank you for your help.

cs95
  • 330,695
  • 80
  • 606
  • 657
Wendy
  • 541
  • 1
  • 4
  • 8

4 Answers4

55

I feel like you're overthinking this. Just use groupby and idxmin:

df.loc[df.groupby('A').B.idxmin()]

   A  B   C
2  1  2  10
4  2  4   4

df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)

   A  B   C
0  1  2  10
1  2  4   4
cs95
  • 330,695
  • 80
  • 606
  • 657
  • I am trying this solution but with pandas 1.0.0 I am getting the error: ``Passing list-likes to .loc or [] with any missing labels is no longer supported ``. @cs95 do you have any suggestions how to fix that? – Eve Edomenko Aug 18 '20 at 10:25
  • 1
    @cs95 this results in one row per A, what if there are multiple rows with the minimum value for each of the values in A. something like all the students with minimum marks in Science. – A-dude Aug 05 '21 at 11:22
7

Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:

df.loc[df.groupby('A')['B val'].idxmin()]
Nazim Kerimbekov
  • 4,497
  • 8
  • 31
  • 54
Juho
  • 81
  • 1
  • 2
4

The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:

data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)

This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).

This approach has additional benefits:

  • it can be easily expanded to select n rows with smallest values in specific column
  • it can break ties by providing another column (as a list) to .sort_values(), e.g.:
    data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
    

As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:

df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
krassowski
  • 10,402
  • 3
  • 47
  • 75
  • 1
    Nice answer. I would complement saying that I did in this way and seems that works the same way: `data.sort_values('B').groupby('A').head(1)` – igorkf May 06 '22 at 13:46
2

I found an answer a little bit more wordy, but a lot more efficient:

This is the example dataset:

data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data

Out:
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6 

First we will get the min values on a Series from a groupby operation:

min_value = data.groupby('A').B.min()
min_value

Out:
A
1    2
2    4
Name: B, dtype: int64

Then, we merge this series result on the original data frame

data = data.merge(min_value, on='A',suffixes=('', '_min'))
data

Out:
   A  B   C  B_min
0  1  4   3      2
1  1  5   4      2
2  1  2  10      2
3  2  7   2      4
4  2  4   4      4
5  2  6   6      4

Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.

data = data[data.B==data.B_min].drop('B_min', axis=1)
data

Out:
   A  B   C
2  1  2  10
4  2  4   4

I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.

Dharman
  • 26,923
  • 21
  • 73
  • 125