254

I have a pandas dataframe in the following format:

df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], list('AAABBBBABCBDDD'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3','col4','col5']

df:

   col1 col2 col3     col4 col5
0   1.1    A  1.1    x/y/z    1
1   1.1    A  1.7      x/y    3
2   1.1    A  2.5  x/y/z/n    3
3   2.6    B  2.6      x/u    2
4   2.5    B  3.3        x    4
5   3.4    B  3.8    x/u/v    2
6   2.6    B    4    x/y/z    5
7   2.6    A  4.2        x    3
8   3.4    B  4.3  x/u/v/b    6
9   3.4    C  4.5        -    3
10  2.6    B  4.6      x/y    5
11  1.1    D  4.7    x/y/z    1
12  1.1    D  4.7        x    1
13  3.3    D  4.8  x/u/v/w    1

Now I want to group this by two columns, like the following:

df.groupby(['col5','col2']).reset_index()

Output:

             index col1 col2 col3     col4 col5
col5 col2                                      
1    A    0      0  1.1    A  1.1    x/y/z    1
     D    0     11  1.1    D  4.7    x/y/z    1
          1     12  1.1    D  4.7        x    1
          2     13  3.3    D  4.8  x/u/v/w    1
2    B    0      3  2.6    B  2.6      x/u    2
          1      5  3.4    B  3.8    x/u/v    2
3    A    0      1  1.1    A  1.7      x/y    3
          1      2  1.1    A  2.5  x/y/z/n    3
          2      7  2.6    A  4.2        x    3
     C    0      9  3.4    C  4.5        -    3
4    B    0      4  2.5    B  3.3        x    4
5    B    0      6  2.6    B    4    x/y/z    5
          1     10  2.6    B  4.6      x/y    5
6    B    0      8  3.4    B  4.3  x/u/v/b    6

I want to get the count of rows in each group, like the following. Expected output:

col5 col2 count
1    A      1
     D      3
2    B      2
etc...

How do I get my expected output? And how do I find the largest count for each 'col2' value?

Nilani Algiriyage
  • A very similar question just came up yesterday.. see [here](http://stackoverflow.com/questions/17666075/python-pandas-groupby-result/17666287#17666287). – bdiamante Jul 16 '13 at 14:29
  • Note on performance, including alternatives: [Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series](https://stackoverflow.com/questions/50328246/pandas-groupby-size-vs-series-value-counts-vs-collections-counter-with-multiple) – jpp Jun 25 '18 at 14:01
  • btw: How do you produce the GroubBy-output? – buhtz May 26 '21 at 08:43
  • @buhtz says get vaccinated: your question is not clear? – Nilani Algiriyage Jun 01 '21 at 09:51

7 Answers

217

You are looking for size:

In [11]: df.groupby(['col5', 'col2']).size()
Out[11]:
col5  col2
1     A       1
      D       3
2     B       2
3     A       3
      C       1
4     B       1
5     B       2
6     B       1
dtype: int64

To get the same answer as waitingkuo (for the "second question"), but slightly cleaner, group by the level:

In [12]: df.groupby(['col5', 'col2']).size().groupby(level=1).max()
Out[12]:
col2
A       3
B       2
C       1
D       3
dtype: int64
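
If you also want to know which col5 value gives that maximum for each col2, one possible sketch (not from the original answer; counts is just an illustrative name) is to take idxmax of the grouped sizes. It returns the (col5, col2) index label with the largest count, breaking ties by first occurrence:

counts = df.groupby(['col5', 'col2']).size()
counts.groupby(level=1).idxmax()   # index label (col5, col2) of the largest count per col2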
Andy Hayden
  • I don't know why I forgot this :O. Anyway, what about my second question? Find the largest count for each "col2" value and get the corresponding "col5" value? – Nilani Algiriyage Jul 16 '13 at 14:40
165

Following @Andy's answer, you can do the following to solve your second question:

In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]: 
      0
col2   
A     3
B     2
C     1
D     3
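
As a small sketch (not part of the original answer), you can give the size column a proper name while resetting the index, which avoids the unnamed 0 column:

df.groupby(['col5', 'col2']).size().reset_index(name='count').groupby('col2')['count'].max()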
waitingkuo
43

Idiomatic solution that uses only a single groupby

(df.groupby(['col5', 'col2']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   .drop_duplicates(subset='col2'))

  col5 col2  count
0    3    A      3
1    1    D      3
2    5    B      2
6    3    C      1

Explanation

The result of the groupby size method is a Series with col5 and col2 in the index. From here, you could use another groupby to find the maximum count for each value in col2, but that is not necessary (a sketch of that alternative is shown below). You can simply sort all the values in descending order and then keep only the rows with the first occurrence of each col2 value using the drop_duplicates method.
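
For illustration, here is a sketch of that alternative (sizes is just an illustrative name, and the 'count' label comes from reset_index(name='count')):

# second groupby: pick, for each col2, the row with the largest count
sizes = df.groupby(['col5', 'col2']).size().reset_index(name='count')
sizes.loc[sizes.groupby('col2')['count'].idxmax()]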

cs95
Ted Petrou
  • There is no param called `name` in `reset_index()` in the current version of pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html – mmBs Nov 25 '18 at 21:27
  • https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.reset_index.html – Ted Petrou Nov 26 '18 at 11:20
  • Ok, my bad. I used it when working with `DataFrame` not `Series`. Thanks for the link. – mmBs Nov 26 '18 at 11:42
  • Note: `.drop_duplicates(subset='col2'))` is only to answer the second question: `And I want to find the largest count for each 'col2' value?` and, thus, will remove duplicates with smaller count values. – momo Mar 17 '21 at 14:16
30

Insert data into a pandas dataframe and provide the column names:

import pandas as pd
df = pd.DataFrame([['A','C','A','B','C','A','B','B','A','A'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = [['Alphabet','Words']]
print(df)   #printing dataframe.

This is our printed data:

[image: printed dataframe]

To group the dataframe in pandas and count, you need to provide one more column that counts the grouping; let's call that column "COUNTER".

Like this:

df['COUNTER'] = 1       # initially, set that counter to 1
group_data = df.groupby(['Alphabet','Words'])['COUNTER'].sum()  # sum the COUNTER column within each group
print(group_data)

OUTPUT:

[image: grouped sums]

Vaibhav Mule
The Gr8 Adakron
  • How can I get the Alphabet column (e.g. A) to repeat below and not leave gaps in the first column? – blissweb Jan 06 '18 at 06:54
  • How to access the value of each group, which is the sum based on the alphabet and word? – Rahul Goyal Mar 05 '20 at 02:52
  • the column names should be assigned with a flat list: `df.columns = ['Alphabet','Words']` – Navid Aug 10 '20 at 08:20
  • to not get the gaps in the Alphabet column just add .reset_index() to the end, so `group_data = df.groupby(['Alphabet','Words'])['COUNTER'].sum().reset_index()` – JCM Sep 01 '20 at 18:12
5

If you want to add a new column (say 'count_column') containing the group counts to the dataframe:

df['count_column'] = df.groupby(['col5','col2'])['col5'].transform('count')

(I picked 'col5' as it contains no NaN values)
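
A possible variant (just a sketch, not from the original answer) is transform('size'), which counts rows rather than non-NA values, so the choice of column does not matter:

# 'size' counts every row in the group, regardless of NaN values
df['count_column'] = df.groupby(['col5', 'col2'])['col5'].transform('size')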

Tom
1

Since pandas 1.1.0, you can use value_counts on a DataFrame:

out = df[['col5','col2']].value_counts().sort_index()

Output:

col5  col2
1     A       1
      D       3
2     B       2
3     A       3
      C       1
4     B       1
5     B       2
6     B       1
dtype: int64
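
As a side note (not part of the original answer), value_counts sorts by count in descending order by default; sort_index above only restores the index order of the other answers. For the second part of the question, one possible sketch is to take the maximum per col2 level:

# largest count for each col2 value; 'col2' is the name of an index level here
df[['col5', 'col2']].value_counts().groupby(level='col2').max()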
-4

You can just use the built-in count function after the groupby:

df.groupby(['col5','col2']).count()
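
Note (not part of the original answer) that count returns the number of non-NA values in every remaining column, so it produces a DataFrame and only matches size when there are no missing values. To get a single Series, select one column first; a sketch:

# non-NA count of col1 per (col5, col2) group; equals .size() here because
# this frame has no missing values
df.groupby(['col5', 'col2'])['col1'].count()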