How can I write a windowing query which sums a column to create discrete buckets?

Question

I have a table which includes a column of decimal values, such as this:

id value size
-- ----- ----
 1   100  .02
 2    99  .38
 3    98  .13
 4    97  .35
 5    96  .15
 6    95  .57
 7    94  .25
 8    93  .15

What I need to accomplish is a little difficult to describe, so please bear with me. I am trying to create an aggregate value over the size column, taken in descending order of value, which increments by 1 each time the running sum of size would exceed 1. The result would look something like this:

id value size bucket
-- ----- ---- ------
 1   100  .02      1
 2    99  .38      1
 3    98  .13      1
 4    97  .35      1
 5    96  .15      2
 6    95  .57      2
 7    94  .25      2
 8    93  .15      3

My naive first attempt was to keep a running SUM and then CEILING that value; however, it doesn't handle the case where a record's size ends up contributing to the total of two separate buckets. The example below might clarify this:

id value size crude_sum crude_bucket distinct_sum bucket
-- ----- ---- --------- ------------ ------------ ------
 1   100  .02       .02            1          .02      1
 2    99  .38       .40            1          .40      1
 3    98  .13       .53            1          .53      1
 4    97  .35       .88            1          .88      1
 5    96  .15      1.03            2          .15      2
 6    95  .57      1.60            2          .72      2
 7    94  .25      1.85            2          .97      2
 8    93  .15      2.00            2          .15      3

As you can see, if I were to simply use CEILING on crude_sum, record #8 would be assigned to bucket 2. This is caused by the size of records #5 and #8 being split across two buckets. Instead, the ideal solution is to reset the sum each time it would pass 1, which then increments the bucket column and begins a new SUM operation starting at the size value of the current record. Because the order of the records is important to this operation, I've included the value column, which is intended to be sorted in descending order.
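As an illustration only, the two behaviours can be modelled outside the database. The sketch below is Python rather than T-SQL, the function names `crude_buckets` and `reset_buckets` are invented for this example, and it assumes a new bucket starts as soon as the next row would push the running total above 1, which is what the expected output above implies.

```python
from decimal import Decimal
from math import ceil

# Sizes from the example table, already sorted by value descending.
SIZES = [Decimal(s) for s in (".02", ".38", ".13", ".35", ".15", ".57", ".25", ".15")]

def crude_buckets(sizes):
    """Running SUM followed by CEILING: a row can straddle two buckets."""
    total = Decimal(0)
    result = []
    for size in sizes:
        total += size
        result.append(ceil(total))
    return result

def reset_buckets(sizes, capacity=Decimal(1)):
    """Assumed rule: open a new bucket when the next row would not fit."""
    bucket, total = 1, Decimal(0)
    result = []
    for size in sizes:
        if total + size > capacity:
            bucket += 1
            total = Decimal(0)
        total += size
        result.append(bucket)
    return result

print(crude_buckets(SIZES))  # [1, 1, 1, 1, 2, 2, 2, 2]  (crude_bucket column)
print(reset_buckets(SIZES))  # [1, 1, 1, 1, 2, 2, 2, 3]  (bucket column)
```

The two lists differ only for record #8, which is the row the CEILING approach misplaces.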

My initial attempts have involved making multiple passes over the data, once to perform the SUM operation, once more to CEILING that, etc. Here is an example of what I did to create the crude_sum column:

SELECT
  id,
  value,
  size,
  -- running total over this row and every row with a higher value
  (SELECT TOP 1 SUM(size) FROM table t2 WHERE t2.value>=t1.value) as crude_sum
FROM
  table t1

This was used in an UPDATE operation to store the value in a table to work with later.

Edit: I'd like to take another stab at explaining this, so here goes. Imagine each record is a physical item. That item has a value associated with it, and a physical size less than one. I have a series of buckets with a volume capacity of exactly 1, and I need to determine how many of these buckets I will need and which bucket each item goes in according to the value of the item, sorted from highest to lowest.

A physical item cannot exist in two places at once, so it must be in one bucket or the other. This is why I can't do a running total + CEILING solution, because that would allow records to contribute their size to two buckets.

You should add your SQL to make it clear what your initial attempt included. — mdahlman, Jun 24 '13 at 19:52
Are you going to be aggregating data according to the bucket you're computing, or is the bucket number the final answer you're looking for? — Jon Seigel, Jun 24 '13 at 19:58
Okay, this is a bit of semantics, but what you're asking for is a window function, not an aggregate. Because you want to reset the running total in the middle, that makes it much more difficult. Have you attempted a CLR-based solution? — Jon Seigel, Jun 24 '13 at 20:15
@JonSeigel I'm afraid CLR isn't available, I'm in a corporate environment with very tight restrictions on the server. I've been fighting for months just to get proper BULK INSERT permissions, even. — Zikes, Jun 24 '13 at 20:22
I think the best option here is to use a cursor, or do it in a client-side application. Do you have to do this computation on a lot of data? — Jon Seigel, Jun 24 '13 at 20:32
@JonSeigel About 15m records, but if no better solution exists then I guess that's what I'll have to do. — Zikes, Jun 24 '13 at 20:35
Ack. I'd probably go with a client-side app since that will support better streaming-in of records as opposed to a cursor loop which fetches one row at a time. I think as long as all the updates are done in batches, it should perform reasonably well. — Jon Seigel, Jun 24 '13 at 20:39
The SQL standard defines a width_bucket() window function if I'm not mistaken. Does SQL Server have that? — , Jun 24 '13 at 22:55
As the others have already mentioned, the requirement of bucketing on distinct_sum complicates things. Aaron Bertrand has a great summary of your options on SQL Server for this kind of windowing work. I have used the "quirky update" method to calculate distinct_sum, which you can see here on SQL Fiddle, but this is unreliable. — Nick Chammas, Jun 25 '13 at 17:26
@a_horse_with_no_name: No, it doesn't; I don't think that would help here anyway, as the number of buckets depends on the running sum (which is the part that's causing the whole mess in the first place). — Jon Seigel, Jun 25 '13 at 17:30
@JonSeigel We should note that the problem of placing X items into a minimal number of buckets cannot be solved efficiently using the row-by-row approach of the SQL language. E.g., items of size 0.7, 0.8 and 0.3 need 2 buckets, but if processed in order of id they'll need 3 buckets. — Stoleg, Jun 25 '13 at 19:51
@Stoleg It's not necessarily a problem of minimal buckets, as that would require re-ordering rows to squeeze smaller size items into leftover space in some buckets. It's based on the value of each item, so that the highest value item(s) are placed in the first bucket, then the second bucket has the next highest value item(s), etc. Leftover space in buckets is not an issue that needs to be addressed, except to prevent items from taking up space in multiple buckets. — Zikes, Jun 25 '13 at 20:30
@Zikes I see that we are asked just to put items into buckets one by one, sequentially. A recursive CTE is the right solution, so I won't pursue a way of calculating the bucket number any further. Minimizing the number of buckets used is a much harder problem. — Stoleg, Jun 25 '13 at 20:43
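The 0.7, 0.8, 0.3 example from the comments can be checked with a small sketch. It is Python for illustration only, and both function names are hypothetical. The brute-force helper simply tries every ordering, so it is only usable on tiny inputs and is not a suggestion for the real 15m-row table.

```python
from decimal import Decimal
from itertools import permutations

def sequential_bucket_count(sizes, capacity=Decimal(1)):
    """Fill buckets in the given order; open a new one when an item does not fit."""
    count, total = 1, Decimal(0)
    for size in sizes:
        if total + size > capacity:
            count += 1
            total = Decimal(0)
        total += size
    return count

def minimal_bucket_count(sizes):
    """Brute force: best sequential result over every ordering (tiny inputs only)."""
    return min(sequential_bucket_count(order) for order in permutations(sizes))

items = [Decimal("0.7"), Decimal("0.8"), Decimal("0.3")]
print(sequential_bucket_count(items))  # 3: each item opens its own bucket
print(minimal_bucket_count(items))     # 2: 0.7 and 0.3 share a bucket
```

The question only asks for the sequential count, which is why a single ordered pass is enough.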

score 9 · Accepted Answer · answered Jun 25 '13 at 00:50

I am not sure what type of performance you are looking for, but if CLR or an external app is not an option, a cursor is all that is left. On my aged laptop I get through 1,000,000 rows in about 100 seconds using the following solution. The nice thing about it is that it scales linearly, so I would be looking at a little over 20 minutes to run through the entire thing (15 million rows at roughly 100 seconds per million is about 25 minutes). With a decent server you will be faster, but not by an order of magnitude, so it would still take several minutes to complete. If this is a one-off process, you can probably afford the slowness. If you need to run this regularly, as a report or similar, you might want to store the values in the same table and update them as new rows get added, e.g. in a trigger.

Anyway, here is the code:

IF OBJECT_ID('dbo.MyTable') IS NOT NULL DROP TABLE dbo.MyTable;

CREATE TABLE dbo.MyTable(
 Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3) DEFAULT ABS(CHECKSUM(NEWID())%100)/100.0
);


MERGE dbo.MyTable T
USING (SELECT TOP(1000000) 1 X FROM sys.system_internals_partition_columns A,sys.system_internals_partition_columns B,sys.system_internals_partition_columns C,sys.system_internals_partition_columns D)X
ON(1=0)
WHEN NOT MATCHED THEN
INSERT DEFAULT VALUES;

--SELECT * FROM dbo.MyTable

DECLARE @st DATETIME2 = SYSUTCDATETIME();
DECLARE cur CURSOR FAST_FORWARD FOR
  SELECT Id,v FROM dbo.MyTable
  ORDER BY Id;

DECLARE @id INT;
DECLARE @v NUMERIC(5,3);
DECLARE @running_total NUMERIC(6,3) = 0;
DECLARE @bucket INT = 1;

CREATE TABLE #t(
 id INT PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3),
 bucket INT,
 running_total NUMERIC(6,3)
);

OPEN cur;
WHILE(1=1)
BEGIN
  FETCH NEXT FROM cur INTO @id,@v;
  IF(@@FETCH_STATUS <> 0) BREAK;
  IF(@running_total + @v > 1)
  BEGIN
    SET @running_total = 0;
    SET @bucket += 1;
  END;
  SET @running_total += @v;
  INSERT INTO #t(id,v,bucket,running_total)
  VALUES(@id,@v,@bucket, @running_total);
END;
CLOSE cur;
DEALLOCATE cur;
SELECT DATEDIFF(SECOND,@st,SYSUTCDATETIME());
SELECT * FROM #t;

GO 
DROP TABLE #t;

It drops and recreates the table MyTable, fills it with 1000000 rows and then goes to work.

The cursor copies each row into a temp table while running the calculations. At the end the select returns the calculated results. You might be a little faster if you don't copy the data around but do an in-place update instead.
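For anyone going the client-side route mentioned in the comments instead, the cursor loop translates almost line for line. The following is only a sketch in Python, not part of the tested solution above: `assign_buckets` and `batches` are hypothetical names, the rows are assumed to arrive already ordered, and how they are read from or written back to the server is left out entirely.

```python
from decimal import Decimal
from itertools import islice

def assign_buckets(rows, capacity=Decimal(1)):
    """Yield (id, size, bucket, running_total) in one pass, like the cursor loop."""
    bucket, running_total = 1, Decimal(0)
    for row_id, size in rows:
        if running_total + size > capacity:
            running_total = Decimal(0)
            bucket += 1
        running_total += size
        yield row_id, size, bucket, running_total

def batches(iterable, batch_size):
    """Group results so a client could write them back in chunks."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Sample rows taken from the question, ordered by value descending.
rows = [(1, Decimal(".02")), (2, Decimal(".38")), (3, Decimal(".13")),
        (4, Decimal(".35")), (5, Decimal(".15")), (6, Decimal(".57")),
        (7, Decimal(".25")), (8, Decimal(".15"))]

for batch in batches(assign_buckets(rows), batch_size=3):
    print(batch)
```

Because both functions are generators, only one batch is held in memory at a time, which is the streaming behaviour the comments were after.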

If you have the option to upgrade to SQL Server 2012, you can look at the new window-spool-supported moving window aggregates, which should give you better performance.

On a side note, with standard T-SQL you can do more bad stuff to a server than with an assembly installed with permission_set=safe, so I would keep working on removing that barrier. You have a good use case here where CLR really would help you.

I accepted this one due to how easy it was to implement, and how easily I can change and debug it later as the need arises. @NickChammas's answer is also correct and probably runs more efficiently, so I guess it's a matter of preference for anyone else coming up against a similar issue. — Zikes, Jun 25 '13 at 21:08

Nick Chammas · Answer 2 · 2013-06-26T05:55:43.543

Absent the new windowing functions in SQL Server 2012, complex windowing can be accomplished with the use of recursive CTEs. I wonder how well this will perform against millions of rows.

The following solution covers all the cases you described. You can see it in action here on SQL Fiddle.

-- schema setup
CREATE TABLE raw_data (
    id    INT PRIMARY KEY
  , value INT NOT NULL
  , size  DECIMAL(8,2) NOT NULL
);

INSERT INTO raw_data 
    (id, value, size)
VALUES 
   ( 1,   100,  .02) -- new bucket here
 , ( 2,    99,  .99) -- and here
 , ( 3,    98,  .99) -- and here
 , ( 4,    97,  .03)
 , ( 5,    97,  .04)
 , ( 6,    97,  .05)
 , ( 7,    97,  .40)
 , ( 8,    96,  .70) -- and here
;

Now take a deep breath. There are two key CTEs here, each preceded by a brief comment. The rest are just "cleanup" CTEs, for example, to pull the right rows after we've ranked them.

-- calculate the distinct sizes recursively
WITH distinct_size AS (
  SELECT
      id
    , size
    , 0 as level
  FROM raw_data

  UNION ALL

  SELECT 
      base.id
    , CAST(base.size + tower.size AS DECIMAL(8,2)) AS distinct_size
    , tower.level + 1 as level
  FROM 
                raw_data AS base
    INNER JOIN  distinct_size AS tower
      ON base.id = tower.id + 1
  WHERE base.size + tower.size <= 1
)
, ranked_sum AS (
  SELECT 
      id
    , size AS distinct_size
    , level
    , RANK() OVER (PARTITION BY id ORDER BY level DESC) as rank
  FROM distinct_size  
)
, top_level_sum AS (
  SELECT
      id
    , distinct_size
    , level
    , rank
  FROM ranked_sum
  WHERE rank = 1
)
-- every level reset to 0 means we started a new bucket
, bucket AS (
  SELECT
      base.id
    , COUNT(base.id) AS bucket
  FROM 
               top_level_sum base
    INNER JOIN top_level_sum tower
      ON base.id >= tower.id
  WHERE tower.level = 0
  GROUP BY base.id
)
-- join the bucket info back to the original data set
SELECT
    rd.id
  , rd.value
  , rd.size
  , tls.distinct_size
  , b.bucket
FROM 
             raw_data rd
  INNER JOIN top_level_sum tls
    ON rd.id = tls.id
  INNER JOIN bucket   b
    ON rd.id = b.id
ORDER BY
  rd.id
;

This solution assumes that id is a gapless sequence. If not, you will need to generate your own gapless sequence by adding an additional CTE at the beginning that numbers the rows with ROW_NUMBER() according to the desired order (e.g. ROW_NUMBER() OVER (ORDER BY value DESC)).

Frankly, this is quite verbose.

This solution does not seem to address the case where a row might contribute its size to multiple buckets. A rolling sum is easy enough, but I need that sum to reset each time it reaches 1. See the last example table in my question and compare crude_sum with distinct_sum and their associated bucket columns to see what I mean. — Zikes, Jun 25 '13 at 15:48
@Zikes - I have addressed this case with my updated solution. — Nick Chammas, Jun 25 '13 at 18:37
That looks like it should work now. I'll work on integrating it into my database to test it out. — Zikes, Jun 25 '13 at 20:36
@Zikes - Just curious, how do the various solutions posted here perform against your large data set? I'm guessing Andriy's is the fastest. — Nick Chammas, Jun 26 '13 at 05:34

SQLFox · Answer 3 · 2013-06-24T20:46:50.570

This feels like a silly solution, and it probably won't scale well, so test carefully if you use it. Since the main problem comes from the "space" left in the bucket, I first had to create a filler record to union into the data.

with bar as (
select
  id
  ,value
  ,size
  from foo
union all
select
  f.id
  ,value = null
  ,size = 1 - sum(f2.size) % 1
  from foo f
  inner join foo f2
    on f2.id < f.id
  group by f.id
    ,f.value
    ,f.size
  having cast(sum(f2.size) as int) <> cast(sum(f2.size) + f.size as int)
)
select
  f.id
  ,f.value
  ,f.size
  ,bucket = cast(sum(b.size) as int) + 1
  from foo f
  inner join bar b
    on b.id <= f.id
  group by f.id
    ,f.value
    ,f.size

http://sqlfiddle.com/#!3/72ad4/14/0

+1 I think this has potential if appropriate indexes are there. — Jon Seigel, Jun 24 '13 at 21:02

score 3 · Answer 4 · edited Apr 13 '17 at 12:42

The following is another recursive CTE solution, although I'd say it's more straightforward than @Nick's suggestion. It is actually closer to @Sebastian's cursor, only I used running differences instead of running totals. (At first I even thought that @Nick's answer was going to be along the lines of what I am suggesting here, and it was only after learning that his was in fact a very different query that I decided to offer mine.)

WITH rec AS (
  SELECT TOP 1
    id,
    value,
    size,
    bucket        = 1,
    room_left     = CAST(1.0 - size AS decimal(5,2))
  FROM atable
  ORDER BY value DESC
  UNION ALL
  SELECT
    t.id,
    t.value,
    t.size,
    bucket        = r.bucket + x.is_new_bucket,
    room_left     = CAST(CASE x.is_new_bucket WHEN 1 THEN 1.0 ELSE r.room_left END - t.size AS decimal(5,2))
  FROM atable t
  INNER JOIN rec r ON r.value = t.value + 1
  CROSS APPLY (
    SELECT CAST(CASE WHEN t.size > r.room_left THEN 1 ELSE 0 END AS bit)
  ) x (is_new_bucket)
)
SELECT
  id,
  value,
  size,
  bucket
FROM rec
ORDER BY value DESC
;

Note: this query assumes that the value column consists of unique values with no gaps. If that is not the case, you'll need to introduce a calculated ranking column based on the descending order of value and use it in the recursive CTE instead of value to join the recursive part with the anchor.

A SQL Fiddle demo for this query can be found here.
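To illustrate the note about non-unique or gapped values, here is a rough model of the same room_left logic driven by a calculated rank instead of value. It is Python rather than T-SQL and purely a sketch: `buckets_by_room_left` is a made-up name, and `enumerate` over a sorted list stands in for something like ROW_NUMBER() OVER (ORDER BY value DESC). Rows with equal values keep their input order here, whereas the database would need an explicit tie-breaker.

```python
from decimal import Decimal

def buckets_by_room_left(rows, capacity=Decimal(1)):
    """rows: iterable of (id, value, size). Returns {id: (rank, bucket)}."""
    # Gapless 1..n rank in descending order of value, whatever the values are.
    ranked = enumerate(sorted(rows, key=lambda r: r[1], reverse=True), start=1)
    result = {}
    bucket, room_left = 1, capacity
    for rank, (row_id, _value, size) in ranked:
        is_new_bucket = size > room_left
        if is_new_bucket:
            bucket += 1
            room_left = capacity
        room_left -= size
        result[row_id] = (rank, bucket)
    return result

# Hypothetical rows: ids and values are neither ordered nor gapless.
sample = [(10, 50, Decimal(".6")), (11, 90, Decimal(".5")),
          (12, 70, Decimal(".3")), (13, 20, Decimal(".3"))]
print(buckets_by_room_left(sample))
```

Each step depends only on the previous rank's bucket and room_left, which is the property the recursive member relies on when it joins one row to the next.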

This is much shorter than what I wrote. Nice work. Is there any reason you count down the room left in the bucket rather than count up? — Nick Chammas, Jun 26 '13 at 05:55
Yes, there is, not sure if it makes much sense for the version I ended up posting here, though. Anyway, the reason was that it seemed easier/more natural to compare a single value with a single value (size with room_left) as opposed to comparing a single value with an expression (1 with running_size + size). I didn't use an is_new_bucket flag at first but several CASE WHEN t.size > r.room_left ... instead ("several" because I also was calculating (and returning) the total size, but then thought against it for the sake of simplicity), so I thought it'd be more elegant that way. — Andriy M, Jun 26 '13 at 06:43
