0

I am trying to read duplicates from a table. There are some partial duplicates based on values of Col1 and Col2 and there are some full duplicates based on Col1, Col2 and Col3 as in below table.

Col1   Col2   Col3
1      John   100
1      John   200
2      Tom    150
3      Bob    100
3      Bob    100
4      Sam    500

I want to capture partial and exact duplicates in two separate outputs and ignore the non-repeated rows like 2 and 4 e.g.

Partial Duplicates

Col1   Col2   Col3
1      John   100
1      John   200

Full Duplicate

Col1   Col2   Col3
3      Bob    100
3      Bob    100

What is the best way to achieve this with SQL?

I tried using the self join with spark-sql but getting error: -

val source_df = sql("select col1, col2, col3 from sample_table")
source_df.as("df1").join(inter_df.as("df2"), $"df1.Col3" === $"df2.Col3" and $"df1.Col2" === $"df2.Col2" and $"df1.Col1" === $"df2.Col1").select($"df1.Col1",$"df1.Col2",$"df1.Col3",$"df2.Col3").show()

Error

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange hashpartitioning(Col3#1957, 200)

Gordon Linoff
  • 1,198,228
  • 53
  • 572
  • 709

2 Answers2

1

For partial duplicates:

SELECT * 
FROM tbl
WHERE EXISTS (
    SELECT * 
    FROM tbl t2 
    WHERE tbl.col1 = t2.col1 AND tbl.col2 = t2.col2 AND tbl.col3 <> t2.col3
)

Returns:

col1    col2    col3
1       John    100
1       John    200

For full duplicates, add a unique identifier per combination of col1, col2, and col3, and look for cases where there's another record with the same col1, col2, and col3, but a different unique identifier:

;WITH cte AS (
SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1, col2, col3) AS uniqueid, col1, col2, col3 
FROM tbl
)
SELECT col1, col2, col3 
FROM cte
WHERE EXISTS (
    SELECT * 
    FROM cte t2 
    WHERE cte.col1 = t2.col1 AND cte.col2 = t2.col2 AND cte.col3 = t2.col3 AND cte.uniqueid <> t2.uniqueid
)

Returns:

col1    col2    col3
3       Bob     100
3       Bob     100

http://sqlfiddle.com/#!18/f1d78/2

CREATE TABLE tbl (col1 INT, col2 VARCHAR(5), col3 INT)
INSERT INTO tbl VALUES 
(1, 'John', 100), 
(1, 'John', 200), 
(2, 'Tom', 150), 
(3, 'Bob', 100), 
(3, 'Bob', 100), 
(4, 'Sam', 500)
Max Szczurek
  • 4,199
  • 2
  • 19
  • 29
0
  1. partial duplicates - use exists. Here is the demo.
select
    *
from myTable m1
where exists (
  select
    *
  from myTable m2
  where m1.Col1 = m2.Col1
  and m1.Col2 = m2.Col2
  and m1.Col3 <> m2.Col3
)

output:

----------------------
 Col1    Col2    Col3
----------------------
  1      John    100
  1      John    200
----------------------
  1. full duplicates - you can use count(*) as window function. Here is the demo.
with cte as
( 
  select
    Col1,
    Col2,
    Col3,
    count(*) over (partition by Col1, Col2, Col3) as rn
  from myTable 
) 

select
  Col1,
  Col2,
  Col3
from cte    
where rn > 1  

output:

----------------------
 Col1   Col2    Col3
----------------------
  3     Bob     100
  3     Bob     100
----------------------
zealous
  • 7,088
  • 3
  • 11
  • 35