30

The ANSI SQL standard defines (chapter 6.5, set function specification) the following behaviour for aggregate functions on empty result sets:

COUNT(...) = 0
AVG(...) = NULL
MIN(...) = NULL
MAX(...) = NULL
SUM(...) = NULL

Returning NULL for AVG, MIN and MAX makes perfect sense, since the average, minimum and maximum of an empty set is undefined.

The last one, however, bothers me: Mathematically, the SUM of an empty set is well-defined: 0. Using 0, the neutral element of addition, as the base case makes everything consistent:

SUM({})        = 0    = 0
SUM({5})       = 5    = 0 + 5
SUM({5, 3})    = 8    = 0 + 5 + 3
SUM({5, NULL}) = NULL = 0 + 5 + NULL

Defining SUM({}) as null basically makes "no rows" a special case that does not fit in with the others:

SUM({})     = NULL  = NULL
SUM({5})    = 5    != NULL + 5 (= NULL)
SUM({5, 3}) = 8    != NULL + 5 + 3 (= NULL)

Is there some obvious advantage of the choice that was made (SUM being NULL) that I have missed?

A-K
  • 7,244
  • 3
  • 32
  • 50
Heinzi
  • 3,145
  • 2
  • 29
  • 42

3 Answers3

20

I'm afraid that the reason is simply that the rules were set in an adhoc fashion (like quite many other "features" of the ISO SQL standard) at a time when SQL aggregations and their connection with mathematics were less understood than they are now (*).

It's just one of the extremely many inconsistencies in the SQL language. They make the language harder to teach, harder to learn, harder to understand, harder to use, harder to whatever you want, but that's just the way things are. The rules cannot be changed "cold" and "just like that", for obvious reasons of backward compatibility (If the ISO committee publishes a final version of the standard, and vendors then set out to implement that standard, then those vendors will not appreciate it very much if in a subsequent version, the rules are changed such that existing (compliant) implementations of the former version of the standard "automatically fail to comply" the new version ...)

(*) It is now better understood that aggregations over an empty set behave more consistently if they systematically return the identity value (= what you call the 'neutral element') of the underlying binary operator at hand. That underlying binary operator for COUNT and SUM is addition, and its identity value is zero. For MIN and MAX, that identity value is the highest and lowest value of the type at hand, respectively, if the concerned types are finite. Cases like averaging, harmonic means, medians, etc. are extremely intricate and exotic in this respect, though.

Erwin Smout
  • 1,746
  • 10
  • 12
  • I think null makes sense over an empty set with min and max. You might say an identity value there really is unknown, but the sum of no values is 0 for the same reason that n * 0 is always 0. But min and max are different. I don't think the result is properly defined running across no records. – Chris Travers Oct 05 '12 at 12:14
  • Also avg() over a null set makes sense as a null because 0/0 is not properly defined in this context. – Chris Travers Oct 05 '12 at 12:16
  • 5
    MIN and MAX are not so different. Take an underlying binary operator LOWESTOF(x,y) and HIGHESTOF(x,y) respectively. These binary operators do have an identity value. Because in both cases (if the involved type is finite), there exists indeed some value z such that forall x : LOWESTOF(z,x)=x and forall y : HIGHESTOF (y,z)=y. (The identity value is not the same for both cases, but it does exist for both cases.) I agree that the results look extremely counterintuitive at first glance, but there is no denying the mathematical reality. – Erwin Smout Oct 05 '12 at 12:53
  • @Erwin: I agree on all your points, except that the identity of some operations, like HIGHEST() many not be an element of the datatype, like for Reals where the identity would be the -Infinity (and +Infinity for LOWEST()) – ypercubeᵀᴹ Oct 05 '12 at 13:17
  • And we may have objections about the expected result of SELECT MAX(x) FROM t WHERE x >= 5. Should that be the -Infinity or 5 (for an empty set)? – ypercubeᵀᴹ Oct 05 '12 at 14:36
  • @ypercube : could you rephrase that first comment ? Either an identity value exists for some binary operator on some type, and then that identity value is a member of the type, or else it just doesn't exist, and then obviously no value can sensibly be returned from an aggregation over an empty set that involves that underlying binary operator. One example is HARMONICMEAN (parallel wiring of resistors), whose identity value is positive infinity, which is a problem for all types that do not include infinity as a value. For HIGHEST and LOWEST, an identity value always exists within the type. – Erwin Smout Oct 05 '12 at 21:52
  • @ypercube : Does 'WHERE x >= 5' cause the creation of a specialized type INTGE5, whose lowest value happens to be the value 5 ? Before you answer, ponder whether 'WHERE LN(x) >= 0' would then also cause the creation of a specialized type NUMBER_GE_1. Or the specialized type that would have to be created for 'WHERE EXP(-x*x) > 0.5'. – Erwin Smout Oct 05 '12 at 21:57
  • 1
    @SQL kiwi. Are you forgetting about static type checking ? If expressions like SUM() are handled by the static type checker as if they always return an integer, then obviously it should be impossible for the SUM() invocation to sometimes return something that is not an integer (e.g. an empty relation). – Erwin Smout Oct 05 '12 at 22:14
  • The space available in these comments is way too short to discuss all that in the level of detail it deserves. Is there some place else where we can take these discussions to ? – Erwin Smout Oct 05 '12 at 22:30
  • I've typed some stuff in that second link you gave ... – Erwin Smout Oct 05 '12 at 23:34
  • @ErwinSmout I do think that min and max are different. I think it is entirely reasonable to say that the maximum value in an empty set is undefined. However the sum of numbers in an empty set is pretty clearly defined (0 * any finite value = 0). I think we are arguing different things. I am arguing that NULL is inappropriate where the result is defined, but per perfectly appropriate where a result is undefined. For example 'Infinity'::float8 * 0 (in Pg it is NaN) should probably be NULL as should 0 / 0 (division by 0 exception). – Chris Travers Oct 07 '12 at 02:01
3

In a pragmatic sense the existing result of NULL is useful. Consider the following table and statements:

C1 C2
-- --
 1  3 
 2 -1 
 3 -2 

SELECT SUM(C2) FROM T1 WHERE C1 > 9;

SELECT SUM(C2) FROM T1 WHERE C1 < 9;

The first statement returns NULL and the second returns zero. If an empty set returned zero for SUM we would need another means to distinguish a true sum of zero from an empty set, perhaps using count. If we indeed want zero for the empty set then a simple COALESCE will furnish that requirement.

SELECT COALESCE(SUM(C2),0) FROM T1 WHERE C1 > 9;
Leigh Riffel
  • 23,854
  • 16
  • 78
  • 152
  • 1
    as a result., SUM(union of set1 and set2) <> SUM(set1) + SUM(set2), because any number + NULL = NULL. Does it make sense to you? – A-K Oct 05 '12 at 14:15
  • 2
    @Leigh: Using COALESCE() like this will not distinguish the (0) sum of an empty set from the (NULL) sum (say the table had a (10, NULL) row. – ypercubeᵀᴹ Oct 05 '12 at 14:26
  • Besides, we still cannot distinguish SUM(empty set) from SUM(set of one or more NULLs). Do we need to distinguish at all? – A-K Oct 05 '12 at 15:14
  • @AlexKuznetsov - We can distinguish a sum of an empty set from a sum of a set that contains one or more nulls as long as at least one row contains a value. You are correct that if the set contains only NULLs then we can't distinguish the NULL set from this set of all NULL values. My point wasn't that it is useful in every case, merely that it can be useful. If I SUM a column and get back zero I know without having to check that there is at least one not NULL row being used to show me the result. – Leigh Riffel Oct 05 '12 at 15:30
  • @ypercude - You are absolutely correct. My point was that the current behavior of SUM does distinguish an empty set from a set that contains values (even if some are null). It is simpler to use COALESCE when the distinction is not required than to use something like DECODE(count(c2),0,NULL,sum(c2)) when it is. – Leigh Riffel Oct 05 '12 at 17:07
  • @AlexKuznetsov : sorry to say, but what about SUM ( { 1 3 } ) + SUM ( { 1 4 } ) == SUM ( { 1 3 } UNION { 1 4 } ) ? Remember that relational union is supposed to eliminate duplicates !!! – Erwin Smout Oct 05 '12 at 22:21
-1

The main difference I can see is with regard to the datatype. COUNT has a well defined returntype: A whole number. All the others depend on the type of the column/expression they are looking at. Their return type must be compatible with all members of the set (think float, currency, decimal, bcd, timespan, ...). Since there is no set you cannot imply a return type, thus NULL is your best option.

Note: In most cases you could imply a return type from the column type you are looking at, but you can do SUMs not only on columns but on all kinds of things. Implying a return type might get very difficult if not impossible under certain circumstances, especially when you think about possible expansions of the standard (dynamic types come to mind).

TToni
  • 559
  • 3
  • 9
  • 5
    Why can we not imply a return type in a SUM(column) expression? Don't we have empty tables - and there all the columns have defined types? Why should it be any different for an empty result set? – ypercubeᵀᴹ Oct 05 '12 at 12:04
  • 5
    You get it wrong where you say "since there is NO SET". There is a set. The set of all possible values of the declared type of the involved columns or expression. That declared type exists even if the table you're looking at is empty. Even empty tables still have a heading. And that declared type is exactly your "implied return type". – Erwin Smout Oct 05 '12 at 13:09
  • Did the both of you actually read my note? Yes, it would work for column-based SUMs as of now. But as soon as you encounter a variable datatype-column (not in SQL Server - yet), you are out of luck. – TToni Oct 05 '12 at 14:12
  • 2
    How will you define the sum in that case? What will the result of 24 + 56.07 + '2012-10-05' + 'Red' be? I mean there is no pint in worrying how SUM() will behave when we have a problem defining addition. – ypercubeᵀᴹ Oct 05 '12 at 14:28
  • @ypercube: In that case it's obviously a runtime error (unless there are sums defined for that types. I have seen stranger things in the software world). If you have 24+56.07 you can use some C-like rules to get 80.07 as result, but what is the type if you have no value at all to play with? – TToni Oct 05 '12 at 15:48
  • 1
    @TToni : "especially when you think about possible expansions of the standard" is not the context that the OP was referring to. the OP was very clearly referring to the current version of the standard, which does not include any sort of notion of "dynamic types" or some such. (Oh, and I only commented, but didn't downvote. Apart from that tiny slip I took issue with, nothing in your answer was wrong enough to warrant a downvote. IMO.) – Erwin Smout Oct 05 '12 at 22:08