
I am using Pandas as a database substitute, as I work with multiple databases (Oracle, SQL Server, etc.) and cannot express a sequence of commands as a single SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, etc., etc.

In SQL, counting the number of distinct clients per YEARMONTH would be:

SELECT YEARMONTH, count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in Pandas?

Peter Mortensen
Adriano Almeida
  • I have done table.groupby(['YEARMONTH'])['CLIENTCODE'].unique() and came with two series indexed by YEARMONTH and with all the unique values. How to count the amount of values on each series? – Adriano Almeida Mar 14 '13 at 14:04
  • For some, `value_counts` might be the answer you are looking for: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html – sachinruk May 16 '17 at 06:30

10 Answers


I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3
LondonRob
Dan Allan

Here is another, much simpler method. Say your DataFrame is named daat and the column name is YEARMONTH:

daat.YEARMONTH.value_counts()
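Note that `value_counts` counts rows per YEARMONTH, which matches the distinct-client count only if each client appears at most once per month. A minimal sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data with the question's columns.
daat = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201301, 201302, 201302],
    'CLIENTCODE': [1, 1, 2, 1, 2],
})

# value_counts counts rows per YEARMONTH (3 rows for 201301)...
row_counts = daat.YEARMONTH.value_counts()

# ...whereas the distinct-client count for 201301 is 2:
distinct_clients = daat.groupby('YEARMONTH')['CLIENTCODE'].nunique()
```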
Peter Mortensen
StatguyUser

Interestingly enough, len(unique()) is very often a few times (3x-15x) faster than nunique().
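For reference, here is the `len(unique())` variant spelled out inside a groupby; it produces the same numbers as `nunique()` (sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201302, 201302, 201302],
    'CLIENTCODE': [1, 2, 1, 2, 2],
})

# The len(unique()) variant, applied per group via a lambda
via_len = df.groupby('YEARMONTH')['CLIENTCODE'].apply(lambda s: len(s.unique()))

# The built-in equivalent
via_nunique = df.groupby('YEARMONTH')['CLIENTCODE'].nunique()

assert via_len.equals(via_nunique)
```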

jezrael
Roman Kh
  • You mean this? `.CLIENTCODE.apply(lambda x: len(x.unique()))`, from [here](http://stackoverflow.com/a/17926436/4015990) – chen Dec 03 '15 at 00:24
  • @user32185 you'd have to drop it into an `apply` call with a lambda. For instance, `df.groupby('YEARMONTH')['CLIENTCODE'].apply(lambda x: x.unique().shape[0])`. – 3novak Apr 17 '18 at 18:59
  • Syntax isn't completely clear, I used `len(df['column'].unique())` no need for lambda function – mlh351 Sep 24 '18 at 20:27
  • I got `TypeError: object of type 'method' has no len()` from `Chen's` comment, `3novak's` worked for me. – Jason Goal Oct 22 '18 at 20:44

I also use nunique, and it is very helpful when you need other aggregate functions such as 'min', 'max', 'count' or 'mean' as well:

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count
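Note that `transform` returns one value per original row (useful for adding the group statistic as a new column), whereas `nunique` on the groupby collapses to one row per group. A small sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201302],
    'CLIENTCODE': [1, 2, 2],
})

# transform broadcasts the distinct count back onto every row of df:
df['N_CLIENTS'] = df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique')

# The collapsed per-group result, for comparison:
per_group = df.groupby('YEARMONTH')['CLIENTCODE'].nunique()
```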
Gangaraju

Distinct count of a column along with aggregations on other columns

To get the number of distinct values for any column (CLIENTCODE in your case), we can use nunique. We can pass a dictionary to the agg function, along with aggregations on other columns:

grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
                                      'other_col_1': ['sum', 'count']})

# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]

# if you wish to reset the index
grp_df.reset_index(inplace=True)
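A runnable sketch of this pattern (the `other_col_1` column is hypothetical sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201302],
    'CLIENTCODE': [1, 2, 2],
    'other_col_1': [10, 20, 30],
})

grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
                                      'other_col_1': ['sum', 'count']})

# Flatten the multi-level columns into names like 'CLIENTCODE_nunique'
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]
grp_df.reset_index(inplace=True)
```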
Vivek Payasi
  • I think this answer is the best since it is closer to the way you would use the count distinct in SQL. If you use the most recent syntax for Pandas agg you can even skip the flatten step. ```grp_df = df.groupby('YEARMONTH').agg(CLIENTCODE_UNIQ_CNT = ('CLIENTCODE', 'nunique'), other_col_1_sum = ('other_col_1', 'sum'), other_col_1_cnt = ('other_col_1', 'count'))``` – CheradenineZK Nov 08 '21 at 11:13
  • Oh nice, I wasn't aware of this new syntax. Thanks for commenting :) – Vivek Payasi Nov 09 '21 at 12:40

Using crosstab will return more information than groupby + nunique:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]:
CLIENTCODE  1  2  3
YEARMONTH
201301      2  1  0
201302      1  2  1

After a little bit of modification, it yields the result:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]:
YEARMONTH
201301    2
201302    3
dtype: int64
Peter Mortensen
BENY

Here is an approach to count distinct over multiple columns. Let's create some data:

import pandas as pd

data = {'CLIENT_CODE': [1, 1, 2, 1, 2, 2, 3],
        'YEAR_MONTH': [201301, 201301, 201301, 201302, 201302, 201302, 201302],
        'PRODUCT_CODE': [100, 150, 220, 400, 50, 80, 100]
       }
table = pd.DataFrame(data)
table

CLIENT_CODE YEAR_MONTH  PRODUCT_CODE
0   1       201301      100
1   1       201301      150
2   2       201301      220
3   1       201302      400
4   2       201302      50
5   2       201302      80
6   3       201302      100

Now, list the columns of interest and use groupby in a slightly modified syntax:

columns = ['YEAR_MONTH', 'PRODUCT_CODE']
table[columns].groupby(table['CLIENT_CODE']).nunique()

We obtain:

             YEAR_MONTH  PRODUCT_CODE
CLIENT_CODE
1                     2             3
2                     2             3
3                     1             1
Peter Mortensen
Ramon

Create a pivot table and use the nunique series function:

import pandas as pd

ID = [123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
          'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id': ID, 'domain': domain})
fp = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)
print(fp)

Output:

               id
domain
facebook.com   1
google.com     1
twitter.com    2
vk.com         3
Peter Mortensen
Golden Lion
  • But the sample data does not match the question (YEARMONTH, CLIENTCODE, and SIZE). [The accepted answer](https://stackoverflow.com/questions/15411158/pandas-countdistinct-equivalent/15411596#15411596) and most of the other answers do. This answer (in its current state) would be a better match for question *[Count unique values with Pandas per groups](https://stackoverflow.com/questions/38309729/)*. – Peter Mortensen Jul 24 '21 at 14:00
  • pivot table does the aggregation using a function. – Golden Lion Jul 25 '21 at 12:56

With newer Pandas versions (0.25 or later, which support named aggregation), it is easy to get the result as a DataFrame:

unique_count = df.groupby(['YEARMONTH'], as_index=False).agg(uniq_CLIENTCODE=('CLIENTCODE', pd.Series.nunique))
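A runnable sketch of that named-aggregation pattern (the sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201302, 201302, 201302],
    'CLIENTCODE': [1, 1, 1, 2, 3],
})

# as_index=False keeps YEARMONTH as a regular column in the result
unique_count = df.groupby(['YEARMONTH'], as_index=False).agg(
    uniq_CLIENTCODE=('CLIENTCODE', 'nunique'))
```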
Peter Mortensen
Wickkiey

You can now also use dplyr syntax in Python (via the datar package) to do it:

>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>>
>>> data = tibble(
...     CLIENT_CODE=[1,1,2,1,2,2,3],
...     YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>>
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
   YEAR_MONTH       n
      <int64> <int64>
0      201301       2
1      201302       3
Peter Mortensen
Panwen Wang