56

I have the following table. I want to calculate a weighted average for each date, based on the formula below. I can do this with standard, conventional code, but assuming this data is in a pandas dataframe, is there an easier way to achieve it than through iteration?

Date        ID      wt      value   w_avg
01/01/2012  100     0.50    60      0.791666667
01/01/2012  101     0.75    80
01/01/2012  102     1.00    100
01/02/2012  201     0.50    100     0.722222222
01/02/2012  202     1.00    80

01/01/2012 w_avg = 0.5 * (60 / sum(60,80,100)) + 0.75 * (80 / sum(60,80,100)) + 1.0 * (100 / sum(60,80,100))

01/02/2012 w_avg = 0.5 * (100 / sum(100,80)) + 1.0 * (80 / sum(100,80))
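
Working through the numbers: for the first group, sum(60,80,100) = 240, so w_avg = 0.5 * (60/240) + 0.75 * (80/240) + 1.0 * (100/240) = 0.125 + 0.25 + 0.416667 ≈ 0.791667. For the second group, sum(100,80) = 180, so w_avg = 0.5 * (100/180) + 1.0 * (80/180) ≈ 0.277778 + 0.444444 ≈ 0.722222.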

Rahul Agarwal
mike01010
    Remark that in your example the 'value' column actually represents the weights, and the 'wt' column the values to be averaged... – kadee Oct 10 '15 at 13:09

6 Answers

53

Let's first create the example pandas dataframe:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')

In [4]: df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)

Then, the average of 'wt' weighted by 'value' and grouped by the index is obtained as:

In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value))
Out[5]: 
Date
01/01/2012    0.791667
01/02/2012    0.722222
dtype: float64

Alternatively, one can also define a function:

In [5]: def grouped_weighted_avg(values, weights, by):
   ...:     return (values * weights).groupby(by).sum() / weights.groupby(by).sum()

In [6]: grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)
Out[6]: 
Date
01/01/2012    0.791667
01/02/2012    0.722222
dtype: float64
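
If you also want this written back onto every row as a w_avg column (as in the question's table), assigning the grouped result should broadcast it over the repeated dates, since assignment aligns on the index. A minimal sketch, assuming the df and grouped_weighted_avg defined above:

In [7]: df['w_avg'] = grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)  # aligns on the Date index, so each row gets its group's value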
kadee
  • I like this one a lot better (due to readability); are there any significant performance differences between this and Andy Hayden's solution? – erb Oct 19 '15 at 11:47
  • Is it possible that in this line: In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value)) x.wt and x.value should be switched? – prooffreader Feb 09 '16 at 15:49
  • @prooffreader: As I commented [above](http://stackoverflow.com/questions/26205922/calculate-weighted-average-using-a-pandas-dataframe/33054358#comment53928794_26205922): in the example given by the asker, the 'value' column actually represents the weights, and the 'wt' column the values to be averaged. – kadee Apr 12 '16 at 10:02
  • This method is significantly slower than the accepted answer when working with large dataframes. – dwitvliet Dec 08 '20 at 08:20
  • @dwitvliet How large is "large"? I was dealing with census block groups data at daily frequencies. The dataframe has `Cx365` rows where C is the number of census block groups. With roughly 600,000 rows of simulated data, kadee's method actually performed two times faster than Andy's answer. I assume you were referring to "large dataframe" with a good number of _by_groups_? – llinfeng May 03 '21 at 13:43
31

I think I would do this with two groupbys.

First to calculate the "weighted average":

In [11]: g = df.groupby('Date')

In [12]: df.value / g.value.transform("sum") * df.wt
Out[12]:
0    0.125000
1    0.250000
2    0.416667
3    0.277778
4    0.444444
dtype: float64

If you set this as a column, you can groupby over it:

In [13]: df['wa'] = df.value / g.value.transform("sum") * df.wt

Now the per-group sum of this column gives the desired result:

In [14]: g.wa.sum()
Out[14]:
Date
01/01/2012    0.791667
01/02/2012    0.722222
Name: wa, dtype: float64

or potentially:

In [15]: g.wa.transform("sum")
Out[15]:
0    0.791667
1    0.791667
2    0.791667
3    0.722222
4    0.722222
Name: wa, dtype: float64
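
The transform form is also the one to use if you want to write the result back onto the original frame, because it returns a Series aligned to df's index (see the comments below). A small sketch, assuming df and g from above:

In [16]: df['w_avg'] = g.wa.transform("sum")  # aligned to the original index, so it can be assigned without NaNs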
Andy Hayden
  • Note: I'm not 100% sure how I feel about reusing g when mutating df; provided you're not mutating the groupby key, I think it's neat... potentially this is controversial?! IMO pandastic. – Andy Hayden Oct 05 '14 at 19:59
  • I was able to accomplish this by doing something similar, but instead of transform, I just used groupby(..).sum(). Are there any benefits to using transform? – mike01010 Oct 05 '14 at 20:25
  • @AndyHayden the DataFrameGroupBy object *would* reflect a mutated object, but in this case you are not mutating, so no big deal. – Jeff Oct 05 '14 at 20:44
  • When I try to insert this into the same data frame, the values are all NaN. I think it is because the aggregation is on Date, but the dataframe is indexed on Date and ID, so doing this does not work: df['w_avg'] = g.wa.sum(). How do I resolve this? – mike01010 Oct 05 '14 at 22:06
  • @mike01010 transform spreads the result across the group, useful if you are using it later. The inserting with NaNs is exactly that - that's why transform is needed (it matches the original index). – Andy Hayden Oct 05 '14 at 22:14
  • I found this answer difficult to implement, so I wrote a function that does it for you: https://stackoverflow.com/a/65204336/2801037 – dwitvliet Dec 08 '20 at 18:10
14

I feel the following is an elegant solution to this problem (from "Pandas DataFrame aggregate function using multiple columns"):

grouped = df.groupby('Date')

def wavg(group):
    d = group['value']
    w = group['wt']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)
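
As a variation (not part of the original answer), the value and weight column names can be passed as extra arguments, since GroupBy.apply forwards additional arguments to the function; a hedged sketch with hypothetical parameter names:

def wavg(group, value_col='value', weight_col='wt'):
    d = group[value_col]
    w = group[weight_col]
    return (d * w).sum() / w.sum()

grouped.apply(wavg)                 # uses the default column names
grouped.apply(wavg, 'value', 'wt')  # extra arguments are forwarded to wavg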
Anish Sugathan
10

I saved the table in a .csv file:

import numpy as np
import pandas as pd

df = pd.read_csv('book1.csv')

grouped = df.groupby('Date')
g_wavg = lambda x: np.average(x.wt, weights=x.value)
grouped.apply(g_wavg)
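
One possible refinement (an assumption about the CSV, not part of the original answer): if the Date column should be real datetimes rather than strings, it can be parsed while reading:

df = pd.read_csv('book1.csv', parse_dates=['Date'])  # parse the Date column as datetime64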
Martin Tournoij
user15051990
7

If speed is an important factor for you, vectorizing is critical. Thus, based on the answer by Andy Hayden, here is a solution using only Pandas native functions:

def weighted_mean(df, values, weights, groupby):
    df = df.copy()
    grouped = df.groupby(groupby)
    df['weighted_average'] = df[values] / grouped[weights].transform('sum') * df[weights]
    return grouped['weighted_average'].sum(min_count=1) #min_count is required for Grouper objects
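
Applied to the question's example, this would be called roughly as follows (a usage sketch, assuming a frame with Date, wt and value columns, and remembering that value holds the weights here):

weighted_mean(df, values='wt', weights='value', groupby='Date')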

In comparison, using a custom lambda function is less code, but slower:

import numpy as np
def weighted_mean_by_lambda(df, values, weights, groupby):
    return df.groupby(groupby).apply(lambda x: np.average(x[values], weights=x[weights]))

Speed test:

import time
import numpy as np
import pandas as pd

n = 100000000

df = pd.DataFrame({
    'values': np.random.uniform(0, 1, size=n), 
    'weights': np.random.randint(0, 5, size=n),
    'groupby': np.random.randint(0, 10000, size=n), 
})

time1 = time.time()
weighted_mean(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean`:', time.time() - time1)

time2 = time.time()
weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean_by_lambda`:', time.time() - time2)

Speed test output:

Time for `weighted_mean`: 3.4519572257995605
Time for `weighted_mean_by_lambda`: 11.41335940361023
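
As a quick sanity check (not in the original answer), the two implementations can be compared to confirm they agree up to floating-point error, for example with pandas' testing helper:

fast = weighted_mean(df, 'values', 'weights', 'groupby')
slow = weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
pd.testing.assert_series_equal(fast, slow, check_names=False)  # raises if the results differ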
dwitvliet
0

Another option for an old question (using @kadee's sample data):

(df.drop(columns='ID')
  .assign(wt = lambda df: df.prod(1)) # product of weight and value
  .groupby('Date').sum()
  .pipe(lambda df: df.wt / df.value) # weighted average computation
) 
Date
01/01/2012    0.791667
01/02/2012    0.722222

This should be much faster if the computation is done without the use of anonymous functions:

temp = df.drop(columns='ID')
temp = temp.assign(wt = temp.prod(1)).groupby('Date').sum()
temp.wt / temp.value

Date
01/01/2012    0.791667
01/02/2012    0.722222
dtype: float64
sammywemmy