33

I have a Postgres table created with the following statement. This table is filled by as dump of data from another service.

CREATE TABLE data_table (
    date date DEFAULT NULL,
    dimension1 varchar(64) DEFAULT NULL,
    dimension2 varchar(128) DEFAULT NULL
) TABLESPACE pg_default;

One of the steps in a ETL I'm building is extracting the unique values of dimension1 and inserting them in another intermediary table. However, during some tests I found out that the 2 commands below do not return the same results. I would expect for both to return the same sum. The first command returns more results compared with the second (1466 rows vs. 1504.

-- command 1
SELECT DISTINCT count(dimension1)
FROM data_table;

-- command 2    
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;

Any obvious explanations for this? Alternatively to an explanation, is there any suggestion of any check on the data I should do?

EDIT: The following queries both return 1504 (same as the "simple" DISTINCT)

SELECT count(*)
FROM data_table WHERE dimension1 IS NOT NULL;

SELECT count(dimension1)
FROM data_table;

Thank you!

dmg
  • 3,904
  • 1
  • 17
  • 21
tekneee
  • 472
  • 1
  • 5
  • 10
  • Weird. Side note: Logically, `distinct` and `group by` should accomplish the same thing so it's redundant to have them both in play. – n8. Jun 13 '18 at 21:48
  • @n8. thank you for pointing that out. I will have more fields (outside the DISTINCT ON expression which will require me to have the ORDER BY > From the docs: Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first – tekneee Jun 13 '18 at 21:56
  • Here is something interesting: https://medium.com/statuscode/the-many-faces-of-distinct-in-postgresql-c52490de5954 – n8. Jun 13 '18 at 22:06
  • I don't use PostgreSQL anymore and when I did it wasn't often, so I didn't know that there is a difference between DISTINCT and DISTINCT ON. – n8. Jun 13 '18 at 22:08

4 Answers4

57

DISTINCT and DISTINCT ON have completely different semantics.

First the theory

DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.

For example, assume a table R with the following contents:

#table r;
a | b 
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a

(6 rows)

SELECT distinct * from R will result:

# select distinct * from r;
 a | b 
---+---
 1 | a
 3 | d
 2 | e
 2 | b
 3 | c
(5 rows)

Note that distinct applies to the entire list of projected attributes: thus

select distinct * from R

is semantically equivalent to

select distinct a,b from R

You cannot issue

select a, distinct b From R

DISTINCT must follow SELECT. It applies to the entire tuple, not to an attribute of the result.

DISTINCT ON is a postgresql addition to the language. It is similar, but not identical, to group by.

Its syntax is:

 SELECT DISTINCT ON (attributeList) <rest as any query>

For example:

 SELECT DISTINCT ON (a) * from R

It semantics can be described as follows. Compute the as usual--without the DISTINCT ON (a)---but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to group by). Now, do the projection using the first tuple in each group and ignore the other tuples.

Example:

select * from r order by a;
     a | b 
    ---+---
     1 | a
     2 | e
     2 | b
     3 | c
     3 | d
    (5 rows)

Then for every different value of a (in this case, 1, 2 and 3), take the first tuple. Which is the same as:

 SELECT DISTINCT on (a) * from r;
  a | b 
 ---+---
 1 | a
 2 | b
 3 | c
 (3 rows)

Some DBMS (most notably sqlite) will allow you to run this query:

 SELECT a,b from R group by a;

And this give you a similar result.

Postgresql will allow this query, if and only if there is a functional dependency from a to b. In other words, this query will be valid if for any instance of the relation R, there is only one unique tuple for every value or a (thus selecting the first tuple is deterministic: there is only one tuple).

For instance, if the primary key of R is a, then a->b and:

SELECT a,b FROM R group by a

is identical to:

  SELECT DISTINCT on (a) a, b from r;

Now, back to your problem:

First query:

SELECT DISTINCT count(dimension1)
FROM data_table;

computes the count of dimension1 (number of tuples in data_table that where dimension1 is not null). This query returns one tuple, which is always unique (hence DISTINCT is redundant).

Query 2:

SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;

This is query in a query. Let me rewrite it for clarity:

WITH tmp_table AS (
   SELECT DISTINCT ON (dimension1) 
     dimension1 FROM data_table
     GROUP by dimension1) 
SELECT count(*) from tmp_table

Let us compute first tmp_table. As I mentioned above, let us first ignore the DISTINCT ON and do the rest of the query. This is a group by by dimension1. Hence this part of the query will result in one tuple per different value of dimension1.

Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is unique already (due to the group by). Hence this makes the DISTINCT ON superflouos (it does nothing). The final count is simply a count of all the tuples in the group by.

As you can see, there is an equivalence in the following query (it applies to any relation with an attribute a):

SELECT (DISTINCT ON a) a
FROM R

and

SELECT a FROM R group by a

and

SELECT DISTINCT a FROM R

Warning

Using DISTINCT ON results in a query might be non-deterministic for a given instance of the database. In other words, the query might return different results for the same tables.

One interesting aspect

Distinct ON emulates a bad behaviour of sqlite in a much cleaner way. Assume that R has two attributes a and b:

SELECT a, b FROM R group by a

is an illegal statement in SQL. Yet, it runs on sqlite. It simply takes a random value of b from any of the tuples in the group of same values of a. In Postgresql this statement is illegal. Instead, you must use DISTINCT ON and write:

SELECT DISTINCT ON (a) a,b from R

Corollary

DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.

dmg
  • 3,904
  • 1
  • 17
  • 21
  • Thank you for the explanation. Definitely helpful for understanding these differences. I read through the docs, apparently not well enough. – tekneee Jun 13 '18 at 22:25
  • Hello @dmg, again, I see you updated your answer with a much more detailed explanation, I really appreciate your taking your time. – tekneee Jun 14 '18 at 19:44
  • What is good way to learn like this, where you know what's going on behind scenes at least theoretically. Most sites are mere syntax reference and provide no background information on this all ties together behind scenes. @dmg – Muhammad Umer May 17 '19 at 16:33
3

The first query gives the number of not null values of dimension1, while the second one returns the number of distinct values of the column. These numbers obviously are not equal if the column contains duplicates or nulls.

The word DISTINCT in

SELECT DISTINCT count(dimension1)
FROM data_table;

makes no sense, as the query returns a single row. Maybe you wanted

SELECT count(DISTINCT dimension1)
FROM data_table;

which returns the number of distinct not null values of dimension1. Note, that it is not the same as

SELECT count(*)
FROM (
    SELECT DISTINCT ON (dimension1) dimension1
    FROM data_table
    -- GROUP BY dimension1 -- redundant
    ) AS tmp_table;

The last query yields the number of all (null or not) distinct values of the column.

klin
  • 99,138
  • 12
  • 177
  • 203
  • damn.... moving the DISTINCT from outside the count to inside (your block of code `SELECT count(DISTINCT dimension1) FROM data_table;` shows it. Thank you! – tekneee Jun 13 '18 at 22:16
2

To learn and understand what happens by visual example.
Here's a bit of SQL to execute on a PostgreSQL:

DROP TABLE IF EXISTS test_table;
CREATE TABLE test_table (
    id int NOT NULL primary key,
    col1 varchar(64) DEFAULT NULL
);

INSERT INTO test_table (id, col1) VALUES 
(1,'foo'), (2,'foo'), (3,'bar'), (4,null);

select count(*) as total1 from test_table;
-- returns: 4
-- Because the table has 4 records.

select distinct count(*) as total2 from test_table;
-- returns: 4
-- The count(*) is just one value. Making 1 total unique can only result in 1 total. 
-- So the distinct is useless here.

select col1, count(*) as total3 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- Since there are 3 unique col1 values. NULL's are included.

select distinct col1, count(*) as total4 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- The result is already grouped, and therefor already unique. 
-- So again, the distinct does nothing extra here.

select count(distinct col1) as total5 from test_table;
-- returns 2
-- NULL's aren't counted in a count by value. So only 'foo' & 'bar' are counted

select distinct on (col1) id, col1 from test_table order by col1 asc, id desc;
-- returns 3 rows: (2,'a'),(3,'b'),(4,NULL)
-- So it gets the records with the maximum id per unique col1
-- Note that the "order by" matters here. Changing that DESC to ASC would get the minumum id.

select count(*) as total6 from (select distinct on (col1) id, col1 from test_table order by col1 asc, id desc) as q;
-- returns 3. 
-- After seeing the previous query, what else would one expect?

select distinct col1 from test_table order by col1;
-- returns 3 unique values : ('bar'),('foo'),(null)

select distinct id, col1 from test_table order by col1;
-- returns all records. 
-- Because id is the primary key and therefore makes each returned row unique
LukStorms
  • 28,156
  • 5
  • 30
  • 44
  • Thank you @LukStorms, extremely useful stuff! I'm in the middle of translating a bunch of aggregations, originally written for MySQL, to Postgres and it the more examples I can get the better. – tekneee Jun 14 '18 at 19:46
  • @tekneee Nice to hear that it helps. But I think it's mostly that `DISTINCT ON(column)` that's not standard SQL and specific to PostgreSQL. The rest of those queries would behave the same on MySQL or other database types. – LukStorms Jun 14 '18 at 19:54
  • Sure @LukStorms. Curiously, this also arose form the need to replace a `Group By` from MySQL which on its hand also behaves in a non-standard way because it doesn't require an aggregation - and when this happens, it gets only ONE row of values, even if the aggregation resulted in the last step of aggregation returning more rows. (just a curiousity) – tekneee Jun 16 '18 at 20:12
  • @tekneee If you're talking about GROUP BY in MySql. I remember that at least older versions of MySql are peculiar with the GROUP BY compared to other database types. In any other database something like `select foo, bar, count(*) from test group by foo` wouldn't be allowed. Because you'd get the error that 'bar' isn't in the GROUP BY. But in MySql it's often allowed. Which can sometimes lead to unexpected results. – LukStorms Jun 16 '18 at 20:37
  • @tekneee they explained it though : [here](https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html). But you can actually disable that behaviour. F.e. check [this old SO post](https://stackoverflow.com/questions/23921117/) – LukStorms Jun 16 '18 at 21:15
  • Thanks. I know of that strange behaviour. The issue I had is that the script I'm "translating" is already working as is in MySQL and, migrating to Postgres means it needs to behave the same way. All the answers and contributions given surely helped get there. – tekneee Jun 17 '18 at 22:01
  • @tekneee Well, I suppose one way is to find which ported sql's fail on that in PostgreSQL. And then add the missing columns to the GROUP BY. Not that it would garantee the exact same results... sounds like a tedious job. – LukStorms Jun 17 '18 at 22:16
0

Try

SELECT count(dimension1a)
FROM (SELECT DISTINCT ON (dimension1) dimension1a
FROM data_table
ORDER BY dimension1) AS tmp_table;

DISTINCT ON appears to be synonymous with GROUP BY.

n8.
  • 1,701
  • 3
  • 15
  • 36
  • 1
    Thank you, but the answer from @klin actually solve it. The `DISTINCT` outside the count wasn't doing anything. – tekneee Jun 13 '18 at 22:17