I am trying to read duplicates from a table. There are some partial duplicates based on values of Col1 and Col2 and there are some full duplicates based on Col1, Col2 and Col3 as in below table.
Col1 Col2 Col3
1 John 100
1 John 200
2 Tom 150
3 Bob 100
3 Bob 100
4 Sam 500
I want to capture partial and exact duplicates in two separate outputs and ignore the non-repeated rows like 2 and 4 e.g.
Partial Duplicates
Col1 Col2 Col3
1 John 100
1 John 200
Full Duplicate
Col1 Col2 Col3
3 Bob 100
3 Bob 100
What is the best way to achieve this with SQL?
I tried using the self join with spark-sql but getting error: -
val source_df = sql("select col1, col2, col3 from sample_table")
source_df.as("df1").join(inter_df.as("df2"), $"df1.Col3" === $"df2.Col3" and $"df1.Col2" === $"df2.Col2" and $"df1.Col1" === $"df2.Col1").select($"df1.Col1",$"df1.Col2",$"df1.Col3",$"df2.Col3").show()
Error
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange hashpartitioning(Col3#1957, 200)