I have a sample Spark dataframe, built from the following rows and columns:
df = [("1","5563","John","Smith","2020-12-15","M"),
("1","5563","John","Smith","2020-12-18","M"),
("2","5568","Jane","King","2020-12-15","F"),
("3","5574","Ernest","Goldberg","2021-10-12","M"),
("5","31","Joe","Hanson","2022-03-16","M"),
("1","5563","John","Smith","2021-01-02","M"),
("2","5568","Jane","King","2021-01-25","F")
]
columns = ['bldg_num','person_id','first_name','last_name','intake_date','gender']
What I would like to happen: for each person_id, check the difference between consecutive intake_dates. If the datediff is less than 5 days, keep only the later record; but if the datediff between records for the same id is 5 or more days, keep both records.
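To make the rule concrete, here is the same logic in plain Python (ignoring Spark for a moment), run against the sample rows above. The function name keep_rows and the threshold parameter are just my own names for illustration:

```python
from datetime import date

# Sample rows from above: (bldg_num, person_id, first_name, last_name, intake_date, gender)
rows = [
    ("1", "5563", "John", "Smith", "2020-12-15", "M"),
    ("1", "5563", "John", "Smith", "2020-12-18", "M"),
    ("2", "5568", "Jane", "King", "2020-12-15", "F"),
    ("3", "5574", "Ernest", "Goldberg", "2021-10-12", "M"),
    ("5", "31", "Joe", "Hanson", "2022-03-16", "M"),
    ("1", "5563", "John", "Smith", "2021-01-02", "M"),
    ("2", "5568", "Jane", "King", "2021-01-25", "F"),
]

def keep_rows(rows, threshold=5):
    # Group rows per person_id (index 1) and sort each group by intake_date.
    by_person = {}
    for r in rows:
        by_person.setdefault(r[1], []).append(r)
    kept = []
    for group in by_person.values():
        group.sort(key=lambda r: r[4])  # ISO dates sort correctly as strings
        for i, r in enumerate(group):
            if i + 1 < len(group):
                d1 = date.fromisoformat(r[4])
                d2 = date.fromisoformat(group[i + 1][4])
                if (d2 - d1).days < threshold:
                    # A later intake for the same person is within 5 days:
                    # drop this earlier row, keep the later one.
                    continue
            kept.append(r)
    return kept
```

Rows with a single intake_date (id=3 and id=5) fall through the inner check and are always kept, which is the behavior I want.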
Initially, I tried the following in SQL:
SELECT *
FROM df
WHERE EXISTS (
    SELECT *
    FROM df df_v2
    WHERE df.person_id = df_v2.person_id
      AND DATEDIFF(df.intake_date, df_v2.intake_date) >= 5
)
However, the above query only keeps rows where the datediff is 5 or more. There are ids that may have only one intake_date, and those still need to be kept.
There are links here that partially meet what I need, but so far none is close enough:
- remove duplicates in list, but keep one copy
- Finding partial and exact duplicate from a SQL table
I also thought of using the concept of returning/recurring customers (Calculate recurring customer), but I am lost in writing the code. From the df above, I am expecting to get the following output:
df2 = [
("1","5563","John","Smith","2020-12-18","M"),
("2","5568","Jane","King","2020-12-15","F"),
("3","5574","Ernest","Goldberg","2021-10-12","M"),
("5","31","Joe","Hanson","2022-03-16","M"),
("1","5563","John","Smith","2021-01-02","M"),
("2","5568","Jane","King","2021-01-25","F")
]
The first row will be dropped because it is less than 5 days before that person's next intake_date, while the id=3 and id=5 rows (single intakes) are still kept. I am thinking the only way to do it is to write a UDF, but I am still new to that concept. A SQL alternative is fine too, if it is possible.
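For what it is worth, the shape of the SQL alternative I had in mind uses LEAD() over a window per person_id instead of EXISTS. This is an untested sketch (the LEAD and DATEDIFF names assume Spark SQL, where DATEDIFF(end, start) returns end minus start in days), and I am not sure it handles chains of several close dates the way I want:

SELECT bldg_num, person_id, first_name, last_name, intake_date, gender
FROM (
    SELECT *,
           LEAD(intake_date) OVER (PARTITION BY person_id ORDER BY intake_date) AS next_date
    FROM df
)
WHERE next_date IS NULL
   OR DATEDIFF(next_date, intake_date) >= 5

The idea is that a row is dropped only when the same person has another intake within the next 5 days; rows with a single intake_date get next_date = NULL and are kept.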
Let me know if my question is confusing; I will try to rephrase to make it clearer. Thank you.