86

I want to pass the numpy percentile() function through pandas' agg() function as I do below with various other numpy statistics functions.

Right now I have a dataframe that looks like this:

AGGREGATE   MY_COLUMN
A           10
A           12
B           5
B           9
A           84
B           22

And my code looks like this:

grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])

The above code works, but I want to do something like

column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])

I.e., specify various percentiles to return from agg().

How should this be done?

Alex Waygood
  • 4,796
  • 3
  • 14
  • 41
slizb
  • 4,812
  • 3
  • 23
  • 22
  • 6
    This is probably a newer aspect of Pandas but look at http://stackoverflow.com/questions/19894939/calculate-arbitrary-percentile-on-pandas-groupby. TLDR: df.groupby('C').quantile(.95) – sachinruk Aug 25 '16 at 23:50

13 Answers13

128

Perhaps not super efficient, but one way would be to create a function yourself:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

Then include this in your agg:

In [11]: column.agg([np.sum, np.mean, np.std, np.median,
                     np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
           sum       mean        std  median          var  amin  amax  percentile_50  percentile_95
AGGREGATE
A          106  35.333333  42.158431      12  1777.333333    10    84             12           76.8
B           36  12.000000   8.888194       9    79.000000     5    22             12           76.8

Note sure this is how it should be done though...

CDspace
  • 2,611
  • 17
  • 32
  • 36
Andy Hayden
  • 328,850
  • 93
  • 598
  • 514
  • 2
    This had multiple issues for me, see [my answer below](https://stackoverflow.com/a/54593214/4629950). – Thomas Feb 08 '19 at 13:14
56

You can have agg() use a custom function to be executed on specified column:

# 50th Percentile
def q50(x):
    return x.quantile(0.5)

# 90th Percentile
def q90(x):
    return x.quantile(0.9)

my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})
Jaroslav Bezděk
  • 4,527
  • 4
  • 23
  • 38
Arun Nalpet
  • 920
  • 1
  • 7
  • 16
29

Being more specific, if you just want to aggregate your pandas groupby results using the percentile function, the python lambda function offers a pretty neat solution. Using the question's notation, aggregating by the percentile 95, should be:

dataframe.groupby('AGGREGATE').agg(lambda x: np.percentile(x['COL'], q = 95))

You can also assign this function to a variable and use it in conjunction with other aggregation functions.

prl900
  • 3,741
  • 3
  • 31
  • 40
  • I'm getting the error TypeError: Must provide 'func' or tuples of '(column, aggfunc). Any idea what might be happening? – Dumb ML Aug 19 '20 at 12:12
  • Although this looks pretty but def. efficient if you work with big data – Areza Mar 30 '21 at 09:02
17

Try this for the 50% and 95% percentile:

column.describe(percentiles=[0.5, 0.95])
Hugolmn
  • 1,430
  • 1
  • 5
  • 19
scottlittle
  • 15,865
  • 5
  • 45
  • 66
16

I really like the solution Andy Hayden gave, however, this had multiple issues for me:

  • If the dataframe has multiple columns, it aggregated over the columns instead of over the rows?
  • For me, the row names were percentile_0.5 (dot instead of underscore). Not sure what caused this, probably that I am using Python 3.
  • Need to import numpy as well instead of staying in pandas (I know, numpy is imported implicitely in pandas...)

Here is an updated version that fixes these issues:

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:2.0f}'.format(n*100)
    return percentile_
Thomas
  • 3,933
  • 4
  • 32
  • 60
15

I believe the idiomatic way to do this in pandas is:

df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
jvans
  • 2,505
  • 2
  • 20
  • 22
10

For situations where all you need is a subset of the describe (typically the most common needed statistics) you can just index the returned pandas series without needing any extra functions.

For example, I commonly find myself just needing to present the 25th, median, 75th and count. This can be done in just one line like so:

columns.agg('describe')[['25%', '50%', '75%', 'count']]

For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use case, there is no need for extra functions.

Maksim
  • 636
  • 1
  • 7
  • 21
7

More efficient solution with pandas.Series.quantile method:

df.groupby("AGGREGATE").agg(("YOUR_COL_NAME", lambda x: x.quantile(0.5))

With several percentile values

percentiles = [0.5, 0.9, 0.99]
quantile_funcs = [(p, lambda x: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)
Antiez
  • 439
  • 5
  • 8
5
df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])

by default describe function give us mean, count, std, min, max, and with percentiles array you can choose the needed percentiles.

user2161065
  • 1,806
  • 1
  • 18
  • 18
Agredalopez
  • 51
  • 1
  • 1
4

Just to throw a more general solution into the ring. Assume you have a DF with just one column to group:

df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)), 
                    columns=['My_KEY', 'MY_COL1'])

One can aggregate and calcualte basically any descriptive metric with a list of anonymous (lambda) functions like:

df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )

However, if you have multiple columns to aggregate, you have to call a non anonymous function or call the columns explicitly:

df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)), 
                    columns=['My_KEY', 'MY_COL1', 'MY_COL2'])

# non-anonymous function
def percentil25 (x): 
    return np.percentile(x, q=25)

# type 1: call for both columns 
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ]  )

# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
                             'MY_COL2': np.size})
magraf
  • 400
  • 3
  • 8
4

You can also perhaps use lambda to achieve the same. Some thing like below piece of code :

        agg(
            lambda x: [
                np.min(a=x), 
                np.percentile(q=25,a=x), 
                np.median(a=x), 
                np.percentile(q=75,a=x), 
                np.max(a=x)
    ]
)
0

Multiple function can be called as below:

import pandas as pd

import numpy as np

import random

C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']

A = [ random.randint(0,100) for i in range(10) ]

B = [ random.randint(0,100) for i in range(10) ]

df = pd.DataFrame({ 'field_A': A, 'field_B': B, 'field_C': C })

print(df)

d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)

I was unable to call median in this, but able to work other functions.

Sangam Belose
  • 3,761
  • 8
  • 27
  • 43
Fakira
  • 9
  • 2
  • 1
    this calls all of them, but selects a few. this is bad for performance, which is the reason why you would use `agg` over describe. – Sebastian Wozny May 10 '19 at 18:15
  • @SebastianWozny may be you can update your comment on which solution do you recommend when dealing with big data – Areza Mar 30 '21 at 09:03
0

This can provide some customization:

list_statistics = ['count','mean','min',lambda x: np.percentile(x,q=25),'max',lambda x: np.percentile(x,q=27)]
cols_to_rename = {'<lambda_0>':'P25','<lambda_1>':'P75'}
df_out.groupby('Country').agg(list_statistics).rename(columns=cols_to_rename)
Jose Rondon
  • 192
  • 1
  • 3
  • 9