I have a sample Spark dataframe, built from the following rows and columns:
df = [("1","5563","John","Smith","2020-12-15","M"),
("1","5563","John","Smith","2020-12-18","M"),
("2","5568","Jane","King","2020-12-15","F"),
("3","5574","Ernest","Goldberg","2021-10-12","M"),
("5","31","Joe","Hanson","2022-03-16","M"),
("1","5563","John","Smith","2021-01-02","M"),
("2","5568","Jane","King","2021-01-25","F")
]
columns = ['bldg_num','person_id','first_name','last_name','intake_date','gender']
What I would like to happen: for each person_id, check the difference between consecutive intake_dates. If the datediff is less than 5 days, keep only the later record; but if the datediff between records for the same id is 5 or more days, keep both records.
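To make the rule concrete, here is the same logic in plain Python (ignoring Spark for a moment), run against the sample rows above. The function name keep_rows and the threshold parameter are just my own names for illustration:

```python
from datetime import date

# Sample rows from above: (bldg_num, person_id, first_name, last_name, intake_date, gender)
rows = [
    ("1", "5563", "John", "Smith", "2020-12-15", "M"),
    ("1", "5563", "John", "Smith", "2020-12-18", "M"),
    ("2", "5568", "Jane", "King", "2020-12-15", "F"),
    ("3", "5574", "Ernest", "Goldberg", "2021-10-12", "M"),
    ("5", "31", "Joe", "Hanson", "2022-03-16", "M"),
    ("1", "5563", "John", "Smith", "2021-01-02", "M"),
    ("2", "5568", "Jane", "King", "2021-01-25", "F"),
]

def keep_rows(rows, threshold=5):
    # Group rows per person_id (index 1) and sort each group by intake_date.
    by_person = {}
    for r in rows:
        by_person.setdefault(r[1], []).append(r)
    kept = []
    for group in by_person.values():
        group.sort(key=lambda r: r[4])  # ISO dates sort correctly as strings
        for i, r in enumerate(group):
            if i + 1 < len(group):
                d1 = date.fromisoformat(r[4])
                d2 = date.fromisoformat(group[i + 1][4])
                if (d2 - d1).days < threshold:
                    # A later intake for the same person is within 5 days:
                    # drop this earlier row, keep the later one.
                    continue
            kept.append(r)
    return kept
```

Rows with a single intake_date (id=3 and id=5) fall through the inner check and are always kept, which is the behavior I want.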
Initially, I tried the following in SQL:
SELECT *
FROM df
WHERE EXISTS (
    SELECT *
    FROM df df_v2
    WHERE df.person_id = df_v2.person_id
      AND DATEDIFF(df.intake_date, df_v2.intake_date) >= 5
)
However, the above query only keeps rows where the datediff is 5 or more. There are ids that may have only one intake_date, and those still need to be kept.
There are links here that partially meet what I need, but so far none is close enough:
- remove duplicates in list, but keep one copy
- Finding partial and exact duplicate from a SQL table
I also thought of using the concept of returning/recurring customers (Calculate recurring customer), but I am lost in writing the code. From the df above, I am expecting to get the following output:
df2 = [
("1","5563","John","Smith","2020-12-18","M"),
("2","5568","Jane","King","2020-12-15","F"),
("3","5574","Ernest","Goldberg","2021-10-12","M"),
("5","31","Joe","Hanson","2022-03-16","M"),
("1","5563","John","Smith","2021-01-02","M"),
("2","5568","Jane","King","2021-01-25","F")
]
The first row will be dropped because it is less than 5 days before that person's next intake_date, while the id=3 and id=5 rows (single intakes) are still kept. I am thinking the only way to do it is to write a UDF, but I am still new to that concept. A SQL alternative is fine too, if it is possible.
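For what it is worth, the shape of the SQL alternative I had in mind uses LEAD() over a window per person_id instead of EXISTS. This is an untested sketch (the LEAD and DATEDIFF names assume Spark SQL, where DATEDIFF(end, start) returns end minus start in days), and I am not sure it handles chains of several close dates the way I want:

SELECT bldg_num, person_id, first_name, last_name, intake_date, gender
FROM (
    SELECT *,
           LEAD(intake_date) OVER (PARTITION BY person_id ORDER BY intake_date) AS next_date
    FROM df
)
WHERE next_date IS NULL
   OR DATEDIFF(next_date, intake_date) >= 5

The idea is that a row is dropped only when the same person has another intake within the next 5 days; rows with a single intake_date get next_date = NULL and are kept.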
Let me know if my question is confusing; I will try to rephrase to make it clearer. Thank you.