Finding partial and exact duplicate from a SQL table

Question

I am trying to read duplicates from a table. There are some partial duplicates based on values of Col1 and Col2 and there are some full duplicates based on Col1, Col2 and Col3 as in below table.

Col1   Col2   Col3
1      John   100
1      John   200
2      Tom    150
3      Bob    100
3      Bob    100
4      Sam    500

I want to capture partial and exact duplicates in two separate outputs and ignore the non-repeated rows like 2 and 4 e.g.

Partial Duplicates

Col1   Col2   Col3
1      John   100
1      John   200

Full Duplicate

Col1   Col2   Col3
3      Bob    100
3      Bob    100

What is the best way to achieve this with SQL?

I tried using the self join with spark-sql but getting error: -

val source_df = sql("select col1, col2, col3 from sample_table")
source_df.as("df1").join(inter_df.as("df2"), $"df1.Col3" === $"df2.Col3" and $"df1.Col2" === $"df2.Col2" and $"df1.Col1" === $"df2.Col1").select($"df1.Col1",$"df1.Col2",$"df1.Col3",$"df2.Col3").show()

Error

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange hashpartitioning(Col3#1957, 200)

`SELECT Col1, Col2, first(Col3) as Col3 from sample_table group BY Col1, Col2` — jxc, Jul 05 '20 at 03:56
Please provide a definition for partial and exact duplicate. — Tim Biegeleisen, Jul 05 '20 at 04:09
I removed the inaccurate database tags. Please tag only with the database you are really using. — Gordon Linoff, Jul 05 '20 at 13:00

score 1 · Accepted Answer · answered Jul 05 '20 at 04:08

For partial duplicates:

SELECT * 
FROM tbl
WHERE EXISTS (
    SELECT * 
    FROM tbl t2 
    WHERE tbl.col1 = t2.col1 AND tbl.col2 = t2.col2 AND tbl.col3 <> t2.col3
)

Returns:

col1    col2    col3
1       John    100
1       John    200

For full duplicates, add a unique identifier per combination of col1, col2, and col3, and look for cases where there's another record with the same col1, col2, and col3, but a different unique identifier:

;WITH cte AS (
SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1, col2, col3) AS uniqueid, col1, col2, col3 
FROM tbl
)
SELECT col1, col2, col3 
FROM cte
WHERE EXISTS (
    SELECT * 
    FROM cte t2 
    WHERE cte.col1 = t2.col1 AND cte.col2 = t2.col2 AND cte.col3 = t2.col3 AND cte.uniqueid <> t2.uniqueid
)

Returns:

col1    col2    col3
3       Bob     100
3       Bob     100

http://sqlfiddle.com/#!18/f1d78/2

CREATE TABLE tbl (col1 INT, col2 VARCHAR(5), col3 INT)
INSERT INTO tbl VALUES 
(1, 'John', 100), 
(1, 'John', 200), 
(2, 'Tom', 150), 
(3, 'Bob', 100), 
(3, 'Bob', 100), 
(4, 'Sam', 500)

score 0 · Answer 2 · answered Jul 05 '20 at 04:14

partial duplicates - use exists. Here is the demo.

select
    *
from myTable m1
where exists (
  select
    *
  from myTable m2
  where m1.Col1 = m2.Col1
  and m1.Col2 = m2.Col2
  and m1.Col3 <> m2.Col3
)

output:

----------------------
 Col1    Col2    Col3
----------------------
  1      John    100
  1      John    200
----------------------

full duplicates - you can use count(*) as window function. Here is the demo.

with cte as
( 
  select
    Col1,
    Col2,
    Col3,
    count(*) over (partition by Col1, Col2, Col3) as rn
  from myTable 
) 

select
  Col1,
  Col2,
  Col3
from cte    
where rn > 1

output:

----------------------
 Col1   Col2    Col3
----------------------
  3     Bob     100
  3     Bob     100
----------------------

Finding partial and exact duplicate from a SQL table

2 Answers2

Linked