I am working on a binary classification problem:
there is a website where a user can perform two types of actions:
- Non-target actions
- Target actions
I have a large dataset with the columns
utm_source, utm_medium, utm_keyword, ..., target
There is one row per session.
utm_source, utm_medium, utm_keyword and the others are parameters of the session.
target = 1 if the user performed at least one target action during the session,
target = 0 otherwise.
My task is, given the parameters of a session, to predict whether the user will perform at least one target action during that session. I have to achieve ROC-AUC > 0.65, if that makes sense.
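To make the setup concrete, here is a minimal sketch of how I encode the session parameters and measure the metric (the file name sessions.csv and the logistic regression model are just placeholders, and the real feature list is longer):

```python
# Minimal sketch of the setup; "sessions.csv" and the model choice are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("sessions.csv")
features = ["utm_source", "utm_medium", "utm_keyword"]   # subset of the real columns
X = df[features].fillna("missing")                       # treat absent parameters as a category
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# ROC-AUC is computed from predicted probabilities, not hard class labels.
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```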
The dataset contains:
- 1732218 rows total
- 50314 (2.9%) rows with target = 1
But there are many sessions with identical parameters (they have different session_id values in the raw data, but naturally I have dropped session_id). So if I remove duplicates from the dataset (a sketch of how I count them follows below), it will contain:
- 398240 rows total
- 24205 (6.1%) rows with target = 1
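For reference, the counts above come from calls like these (the file name is again a placeholder):

```python
import pandas as pd

df = pd.read_csv("sessions.csv")          # session_id already dropped
print(len(df), df["target"].sum())        # total rows and rows with target = 1

dedup = df.drop_duplicates()              # rows with identical parameters collapse into one
print(len(dedup), dedup["target"].sum())  # the same counts after deduplication
```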
The question is: should I remove these duplicates, and if so, when?
My current approach is:
- The original dataset with duplicate rows represents the natural distribution of the data, so I have to test my model on a part of the original dataset.
- I can train my model on an explicitly balanced dataset, and duplicate removal can be part of this balancing (see the sketch after this list).
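In code, this "split first, deduplicate only the training part" workflow would look roughly like this (the file name is a placeholder, and the actual balancing and model fitting are omitted):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sessions.csv")

# The test set is carved out of the original data, so it keeps the natural
# distribution of sessions, duplicates included.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# Duplicate removal as one step of balancing, applied to the training part only.
train_df = train_df.drop_duplicates()

# Further balancing (undersampling, class_weight, etc.) would also go here;
# the model is then fit on train_df and ROC-AUC is measured on the untouched test_df.
print(len(train_df), len(test_df))
```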
But there are people (on the course where I am studying now) who removed all duplicates before the train-test split, and these people have successfully defended this task and received the certificate...
So what approach is right? Links to ML literature are welcome.