I am close to figuring this out but I'm just stuck at a wall. I'm attempting to understand a post by Aaron Betrand and apply it to a situation I've encountered where I have a changes table that's heavily duplicated due to prior design error I'm inheriting. The sample data set is identical in concept to my real data set, except SortOrder would usually be a datetime value and not an integer. The code I've tried is here:
; with main as (
select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID, SortOrder) as "Rank"
, row_number() over (partition by ID, val order by ID, sortorder) as "s_rank"
from
(values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4), (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), (3, 'A', 1), (3, 'A', 1), (3, 'A', 2) )
as x("ID", "VAL", "SortOrder")
group by id, val, SortOrder
--order by ID, "SortOrder"
)
, cte_rest as (
select *
from main
where "s_rank" > 1
)
select *
from main left join cte_rest rest on main.id = rest.id and main.s_rank > 1 and main.SortOrder = rest.SortOrder
--where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank)
order by main.ID, main.SortOrder
The results are almost valid; however, the last row highlights a situation that I haven't been able to account for: the date changes, the value doesn't. I want this record to be excluded because it's not a true value change.
╔════╦═════╦═══════════╦══════╦════════╦══════╦══════╦═══════════╦══════╦════════╗
║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║
╠════╬═════╬═══════════╬══════╬════════╬══════╬══════╬═══════════╬══════╬════════╣
║ 1 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 1 ║ B ║ 2 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 1 ║ C ║ 3 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 1 ║ B ║ 4 ║ 1 ║ 2 ║ 1 ║ B ║ 4 ║ 1 ║ 2 ║
║ 1 ║ A ║ 5 ║ 1 ║ 2 ║ 1 ║ A ║ 5 ║ 1 ║ 2 ║
║ 2 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 2 ║ B ║ 2 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 2 ║ A ║ 3 ║ 1 ║ 2 ║ 2 ║ A ║ 3 ║ 1 ║ 2 ║
║ 3 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 3 ║ A ║ 2 ║ 1 ║ 2 ║ 3 ║ A ║ 2 ║ 1 ║ 2 ║
╚════╩═════╩═══════════╩══════╩════════╩══════╩══════╩═══════════╩══════╩════════╝
A colleague of mine suggested this code, and while I can follow how it arrives, I don't understand why the first code sample doesn't work. It feels to me like this would require a lot of extra parsing, and with a large data set I'd be worried about performance impacts.
WITH cte1
AS (SELECT [id]
, [val]
, [sortorder]
, ROW_NUMBER() OVER(PARTITION BY [id]
, [val]
, [sortorder]
ORDER BY [id]
, [sortorder]) AS "rankall"
FROM (VALUES
( 1, 'A', 1 ),
( 1, 'A', 1 ),
( 1, 'B', 2 ),
( 1, 'C', 3 ),
( 1, 'B', 4 ),
( 1, 'A', 5 ),
( 1, 'A', 5 ),
( 2, 'A', 1 ),
( 2, 'B', 2 ),
( 2, 'A', 3 ),
( 3, 'A', 1 ),
( 3, 'A', 1 ),
( 3, 'A', 2 )) AS x("id", "val", "sortorder")),
ctedropped
AS (SELECT [id]
, [val]
, [sortorder]
, ROW_NUMBER() OVER(PARTITION BY [id]
, [val]
, [sortorder]
ORDER BY [id]
, [sortorder]) AS "rankall"
FROM cte1
WHERE [cte1].[rankall] > 1)
SELECT [cte1].[id]
, [cte1].[val]
, [cte1].[sortorder]
FROM cte1
WHERE NOT EXISTS
(
SELECT *
FROM [ctedropped]
WHERE [cte1].[id] = [ctedropped].[id] AND
[cte1].[val] = [ctedropped].[val] AND
[cte1].[rankall] = [ctedropped].[rankall]
)
ORDER BY [cte1].[id]
, [cte1].[sortorder];