
I am working on a binary classification problem:

There is a website where a user can perform two types of actions:

  1. Non-target actions
  2. Target actions

I have a large dataset with columns:

utm_source, utm_medium, utm_keyword, ... target

There is one row per session.

utm_source, utm_medium, utm_keyword, and the others are parameters of the session.
target = 1 if the user performed at least 1 target action during the session
target = 0 otherwise

My task is, given the parameters of a session, to predict whether the user will perform at least one target action during that session. I have to achieve ROC-AUC > 0.65, if that makes sense.

The dataset contains

1732218 rows total
50314 (2.9%) rows with target = 1

But there are many sessions with identical parameters (and different session_id values in the raw data; naturally, I have dropped session_id). So if I remove duplicates from the dataset, it will contain

398240 rows total
24205 (6.1%) rows with target = 1

The question is: should I remove these duplicates, and if so, when?

My current approach is:

  1. The original dataset with duplicate rows represents the natural distribution of the data, so I have to test my model on a part of the original dataset.
  2. I can train my model on an explicitly balanced dataset, and duplicate removal can be part of this balancing (see the sketch after this list).
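
Concretely, here is what I mean (a minimal sketch with pandas and scikit-learn; `df`, the split fraction, and the column names are placeholders, not my actual code):

```python
from sklearn.model_selection import train_test_split

# Split FIRST, so the held-out test set keeps the natural distribution
# of sessions, duplicates included. `df` is assumed to be a DataFrame
# with the utm_* feature columns and a binary `target` column.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# Deduplicate (and, if desired, further rebalance) the training part only.
train_df = train_df.drop_duplicates()
```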

But there are people (on the course where I am studying now) who removed all duplicates before the train-test split, and these people successfully defended this task and received their certificates...

So which approach is right? Links to ML literature are welcome.

    "who have removed all duplicates before train-test split - and these people have successfully defended this task and achieved certificate..." doing so before the train-test split shouldn't be defensible IMHO, many ostensibly benign pre-processing steps can introduce substantial biases if applied in this way. What if all of the duplicates are "hard to classify" cases - if you remove them from the test data you immediately have an optimistically biased AUROC estimate. – Dikran Marsupial Apr 09 '23 at 11:16
  • Many classifiers, e.g. SVM, allow you to weight the training patterns individually in the loss function. If computational expense is an issue, delete the duplicates from the training set only, but set the weight of the remaining example to be the number of copies in the original dataset. This assumes the duplicates all have the same target label. – Dikran Marsupial Apr 09 '23 at 11:19
    "I have to achieve ROC-AUC > 0.65," ask your instructors why this is an appropriate performance criterion for the practical application. – Dikran Marsupial Apr 09 '23 at 11:21
  • This is a frequently asked question. See here https://stats.stackexchange.com/questions/602110/how-to-treat-duplicates-while-dealing-with-real-data/602140#602140 or here https://stats.stackexchange.com/a/428196/164936 for example. If your records have different session IDs, it's likely they are not erroneous duplicates, in which case you should keep them as it is just how your population of interest is. On the other hand, if they are erroneous duplicates (e.g. records erroneously duplicated due to some bug), you should remove them. – J-J-J Jan 09 '24 at 08:26
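
Following up on the weighting suggestion in the comments: a minimal sketch of how that could look, assuming scikit-learn (with a logistic regression in place of an SVM for speed; `train_df`/`test_df` come from the split sketched in the question, where `train_df` is the raw training split before any drop_duplicates step, since the groupby below performs the deduplication itself; all names are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Collapse identical training rows and record each row's multiplicity.
# Grouping includes `target`, so duplicates that disagree on the label
# stay separate (the comment assumes they all share the same label).
grouped = (
    train_df
    .groupby(list(train_df.columns), as_index=False, dropna=False)
    .size()  # adds a "size" column holding the number of copies
)

X_train = grouped.drop(columns=["target", "size"])
y_train = grouped["target"]
weights = grouped["size"]

# The utm_* parameters are categorical, so one-hot encode them
# before the linear model.
model = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train, clf__sample_weight=weights)

# Evaluate on the untouched test set, which keeps the natural
# distribution of sessions, duplicates included.
probs = model.predict_proba(test_df.drop(columns=["target"]))[:, 1]
print("ROC-AUC:", roc_auc_score(test_df["target"], probs))
```

With the weight equal to the number of copies, the weighted training loss is exactly what the model would see on the full duplicated training set, just computed over far fewer rows.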

0 Answers