Understanding window functions to deduplicate records while retaining true changes

Question

I am close to figuring this out, but I've hit a wall. I'm trying to understand a post by Aaron Bertrand and apply it to a situation I've inherited: a changes table that is heavily duplicated because of an earlier design error. The sample data set is identical in concept to my real data set, except that SortOrder would usually be a datetime value rather than an integer. The code I've tried is here:

; with main as (
select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID, SortOrder) as "Rank"
, row_number() over (partition by ID, val order by ID, sortorder) as "s_rank" 
from 
(values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4), (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), (3, 'A', 1), (3, 'A', 1), (3, 'A', 2) ) 
        as x("ID", "VAL", "SortOrder") 
group by id, val, SortOrder
--order by ID, "SortOrder"
)
, cte_rest as (
select *
from main
where "s_rank" > 1
)
select *
from main left join cte_rest rest on main.id = rest.id and main.s_rank > 1 and main.SortOrder = rest.SortOrder
--where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank)

order by main.ID, main.SortOrder

The results are almost valid; however, the last row highlights a situation that I haven't been able to account for: the date changes, the value doesn't. I want this record to be excluded because it's not a true value change.

╔════╦═════╦═══════════╦══════╦════════╦══════╦══════╦═══════════╦══════╦════════╗
║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║  ID  ║ VAL  ║ SortOrder ║ Rank ║ s_rank ║
╠════╬═════╬═══════════╬══════╬════════╬══════╬══════╬═══════════╬══════╬════════╣
║  1 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ C   ║         3 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         4 ║    1 ║      2 ║ 1    ║ B    ║ 4         ║ 1    ║ 2      ║
║  1 ║ A   ║         5 ║    1 ║      2 ║ 1    ║ A    ║ 5         ║ 1    ║ 2      ║
║  2 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ A   ║         3 ║    1 ║      2 ║ 2    ║ A    ║ 3         ║ 1    ║ 2      ║
║  3 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  3 ║ A   ║         2 ║    1 ║      2 ║ 3    ║ A    ║ 2         ║ 1    ║ 2      ║
╚════╩═════╩═══════════╩══════╩════════╩══════╩══════╩═══════════╩══════╩════════╝

A colleague of mine suggested this code, and while I can follow how it arrives at the result, I don't understand why the first code sample doesn't work. It feels to me like this would require a lot of extra parsing, and with a large data set I'd be worried about the performance impact.


WITH cte1
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   (VALUES
                        ( 1, 'A', 1 ),
                        ( 1, 'A', 1 ),
                        ( 1, 'B', 2 ),
                        ( 1, 'C', 3 ),
                        ( 1, 'B', 4 ),
                        ( 1, 'A', 5 ),
                        ( 1, 'A', 5 ),
                        ( 2, 'A', 1 ),
                        ( 2, 'B', 2 ),
                        ( 2, 'A', 3 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 2 )) AS x("id", "val", "sortorder")),
     ctedropped
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   cte1
         WHERE  [cte1].[rankall] > 1)
     SELECT [cte1].[id]
          , [cte1].[val]
          , [cte1].[sortorder]
     FROM   cte1
     WHERE  NOT EXISTS
     (
         SELECT *
         FROM   [ctedropped]
         WHERE  [cte1].[id] = [ctedropped].[id] AND 
                [cte1].[val] = [ctedropped].[val] AND 
                [cte1].[rankall] = [ctedropped].[rankall]
     )
     ORDER BY [cte1].[id]
            , [cte1].[sortorder];

Aaron's post is here for reference. — Sean, Oct 16 '19 at 17:19

kevinnwhat · Accepted Answer · 2019-10-17T14:08:57.893

It's not clear whether your dataset and expected outcome are the same as in the question you referenced. I think you're looking to identify the latest time an id was updated to a value different from the previous one. In that case, you can try the following:

create table #test (
id int,
val varchar(1),
v_date datetime
)

insert into #test values (1,'A',getdate())
insert into #test values (1,'B',dateadd(mi,5,getdate()))
insert into #test values (1,'C',dateadd(mi,10,getdate()))
insert into #test values (2,'A',getdate())
insert into #test values (2,'B',dateadd(mi,15,getdate()))
insert into #test values (2,'B',dateadd(mi,20,getdate()))
insert into #test values (3,'A',getdate())
insert into #test values (3,'A',dateadd(mi,21,getdate()))
insert into #test values (3,'B',dateadd(mi,25,getdate()))
insert into #test values (3,'C',dateadd(mi,30,getdate()))
insert into #test values (4,'B',dateadd(mi,35,getdate()))
insert into #test values (4,'B',dateadd(mi,36,getdate()))
insert into #test values (4,'B',dateadd(mi,37,getdate()))
insert into #test values (5,'Z',dateadd(mi,-10,getdate()))

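;with t1 as (
   -- number each id's rows from oldest to newest
   select id,
          val,
          v_date,
          row_number() over(partition by id order by v_date asc) as rn
     from #test
), t2 as (
   -- pair each row with the one just before it (rn - 1) and keep it only
   -- when the value differs, or when there is no earlier row (tt.val is null)
   select t.id,
          t.val,
          t.v_date,
          row_number() over(partition by t.id order by t.v_date desc) as rn
     from t1 t
     left join t1 tt
       on t.id = tt.id
      and t.rn - 1 = tt.rn
    where t.val <> tt.val or tt.val is null
)
-- rn = 1 marks the most recent change for each id
select *
  from t2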

db fiddle
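
For comparison, here is a minimal sketch of the same "compare with the previous row" idea using LAG instead of a self-join. It is not taken from the referenced post. It assumes SQL Server 2012 or later (where LAG is available) and that Val is never NULL. The CTE names src and flagged and the column PrevVal are made up for illustration. It runs against the sample rows from the question.

```sql
-- Sample rows copied from the question; SortOrder stands in for the datetime column.
WITH src AS (
    SELECT ID, Val, SortOrder
    FROM (VALUES (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4),
                 (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3),
                 (3, 'A', 1), (3, 'A', 1), (3, 'A', 2)) AS x(ID, Val, SortOrder)
),
flagged AS (
    -- PrevVal is the value on the row immediately before this one for the same ID
    -- (hypothetical column name; NULL on the first row of each ID).
    SELECT ID, Val, SortOrder,
           LAG(Val) OVER (PARTITION BY ID ORDER BY SortOrder) AS PrevVal
    FROM src
)
-- Keep a row only when it is the first for its ID or its value differs from the
-- previous row. Assumption: Val is never NULL, so PrevVal IS NULL means "no earlier row".
SELECT ID, Val, SortOrder
FROM flagged
WHERE PrevVal IS NULL OR PrevVal <> Val
ORDER BY ID, SortOrder;
```

Exact duplicates drop out here without a separate step: whichever copy sorts second sees the same value in PrevVal and is filtered. A value that reverts, such as B at SortOrder 4 for ID 1, is kept because its previous row holds a different value. On the sample data this should leave A, B, C, B, A for ID 1, A, B, A for ID 2, and a single A for ID 3.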

Hey Kevin, it's definitely a different situation than the original post which is why I made a separate question :) -- my goal is to get each unique change, but to allow for values to revert to prior values. AKA: A, A, A, B, A should be A, B, A at the end of the day. — Sean, Oct 17 '19 at 13:38
In that case you just remove the where clause at the end, where t2.rn = 1. The rn column in the final select is ordered by v_date desc, meaning the latest change will start with 1, you can change that to asc if you wanted to start from earliest change. — kevinnwhat, Oct 17 '19 at 14:09
