
I have a table loaded into a DataFrame, and I tried deduplicating it with groupBy on the primary keys.

df_remitInsert = spark.sql("SELECT * FROM trac_analytics.mainremitdata")
df_remitInsert_filter = (df_remitInsert
    .groupBy("LoanID_Serv", "LoanNumber", "Month")
    .count()
    .filter("count > 1")
    .drop("count"))

where "LoanID_Serv", "LoanNumber", and "Month" are my primary keys.

I want to get back the entire data from df_remitInsert, deduplicated with respect to those primary keys.
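
The problem with the groupBy approach is that the aggregation only returns the key columns of the groups that contain duplicates; every other column is lost. A minimal sketch on toy data shows this (the sample values and the Amount column are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data with one duplicated key triple (illustrative values only)
df = spark.createDataFrame(
    [("A1", "L001", "2023-01", 100.0),
     ("A1", "L001", "2023-01", 100.0),   # duplicate of the row above
     ("B2", "L002", "2023-02", 250.0)],
    ["LoanID_Serv", "LoanNumber", "Month", "Amount"],
)

dupes = (df.groupBy("LoanID_Serv", "LoanNumber", "Month")
           .count()
           .filter("count > 1")
           .drop("count"))
dupes.show()
# +-----------+----------+-------+
# |LoanID_Serv|LoanNumber|  Month|
# +-----------+----------+-------+
# |         A1|      L001|2023-01|
# +-----------+----------+-------+
# Only the key columns survive; Amount is gone.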


1 Answer


You can use the dropDuplicates method, which keeps one row per combination of the given columns while preserving all the other columns.

df_remitInsert_filter = df_remitInsert.dropDuplicates(['LoanID_Serv', 'LoanNumber', 'Month'])
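
Note that dropDuplicates keeps one arbitrary row from each duplicate group. If you need a deterministic pick instead, for example the row with the largest value in some column, a window function with row_number does it. A sketch, assuming a hypothetical Amount column as the tie-breaker:

from pyspark.sql import Window, functions as F

# Keep exactly one row per key, choosing the highest Amount
# ("Amount" is an assumed column, not from the original table).
w = (Window.partitionBy("LoanID_Serv", "LoanNumber", "Month")
           .orderBy(F.col("Amount").desc()))
df_remitInsert_filter = (df_remitInsert
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))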