86

Suppose I have a column like so:

a   b  
1   5   
1   7
2   3
1   3
2   5

I want to sum up the values for b where a = 1, for example. This would give me 5 + 7 + 3 = 15.

How do I do this in pandas?

Alex Riley
  • 152,205
  • 43
  • 245
  • 225
adijo
  • 1,380
  • 1
  • 13
  • 19

2 Answers2

150

The essential idea here is to select the data you want to sum, and then sum them. This selection of data can be done in several different ways, a few of which are shown below.

Boolean indexing

Arguably the most common way to select the values is to use Boolean indexing.

With this method, you find out where column 'a' is equal to 1 and then sum the corresponding rows of column 'b'. You can use loc to handle the indexing of rows and columns:

>>> df.loc[df['a'] == 1, 'b'].sum()
15

The Boolean indexing can be extended to other columns. For example if df also contained a column 'c' and we wanted to sum the rows in 'b' where 'a' was 1 and 'c' was 2, we'd write:

df.loc[(df['a'] == 1) & (df['c'] == 2), 'b'].sum()

Query

Another way to select the data is to use query to filter the rows you're interested in, select column 'b' and then sum:

>>> df.query("a == 1")['b'].sum()
15

Again, the method can be extended to make more complicated selections of the data:

df.query("a == 1 and c == 2")['b'].sum()

Note this is a little more concise than the Boolean indexing approach.

Groupby

The alternative approach is to use groupby to split the DataFrame into parts according to the value in column 'a'. You can then sum each part and pull out the value that the 1s added up to:

>>> df.groupby('a')['b'].sum()[1]
15

This approach is likely to be slower than using Boolean indexing, but it is useful if you want check the sums for other values in column a:

>>> df.groupby('a')['b'].sum()
a
1    15
2     8
Alex Riley
  • 152,205
  • 43
  • 245
  • 225
  • What is the `.b` part? How are you indexing the columns like that? – adijo Jan 30 '15 at 13:09
  • 1
    You can access column `a` of the DataFrame by writing `df['a']` or `df.a`. The second one is convenient but doesn't work well if you have a column name with multiple words like "person id" or if the column name coincides with a DataFrame method like "where" or "sum". – Alex Riley Jan 30 '15 at 13:11
  • Ok thanks, also how do I convert all the values in the column to an `int` thereby enabling me to sum them? I was using `map` to convert all of them to `ints` but I think there might be an inbuilt function in pandas to do this more efficiently. – adijo Jan 30 '15 at 13:16
  • 2
    You could write `df['a'] = df['a'].astype(int)` or `df['a'] = df['a'].convert_objects(convert_numeric=True)` to do that. – Alex Riley Jan 30 '15 at 13:18
  • @ajcr Suppose the OP's example contains an additional column `c`. Would it be possible to sum `b` conditionally for both `a` and `c`? E.g. `df.loc[df['a'] == 1 and df['c'] == 2, 'b'].sum()`. – LucSpan May 08 '17 at 11:11
  • 4
    @LucSpan: absolutely, you can write `df.loc[(df['a'] == 1) & (df['c'] == 2), 'b'].sum()` to sum the values. – Alex Riley May 08 '17 at 11:53
  • @AlexRiley I wish I could give your response 100 likes. Thanks for the helpful response. – mOna Oct 29 '20 at 23:12
5

You can also do this without using groupby or loc. By simply including the condition in code. Let the name of dataframe be df. Then you can try :

df[df['a']==1]['b'].sum()

or you can also try :

sum(df[df['a']==1]['b'])

Another way could be to use the numpy library of python :

import numpy as np
print(np.where(df['a']==1, df['b'],0).sum())
Community
  • 1
  • 1
Himanshu Poddar
  • 3,438
  • 5
  • 34
  • 67