I have been looking at some public fraud-detection datasets, where the data contains a 0/1 column for whether the transaction is fraudulent and then a large set of numerical and categorical features about the transaction. Of course in a dataset like this the number of fraud instances is tiny compared to the number of valid transactions; I have not checked exactly, but perhaps 5% of the transactions or fewer are fraudulent. The dataset has, say, 500,000 records.
My plan was to run a penalized logistic regression over this data and get some results. The dataset is large and won't fit in memory, so I can set up TensorFlow to stream the data in batches. Since the fraud cases are so rare, I can also track additional metrics, like the number of times a 1 is predicted as 0 (false negatives), to keep attention on detecting the fraudulent transactions.
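For concreteness, here is roughly what I have in mind. This is only a minimal sketch: `NUM_FEATURES` is a placeholder, and `train_ds` stands in for a tf.data pipeline that streams (features, label) batches off disk.

```python
import tensorflow as tf

NUM_FEATURES = 30  # placeholder; the real dataset has its own feature count

# One dense unit with a sigmoid is logistic regression; the L2
# kernel_regularizer makes it a penalized (ridge) logistic regression.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(1, activation="sigmoid",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[
        tf.keras.metrics.FalseNegatives(name="fn"),  # frauds predicted as valid
        tf.keras.metrics.Recall(name="recall"),      # fraction of frauds caught
        tf.keras.metrics.AUC(curve="PR", name="auprc"),
    ],
)

# train_ds is assumed to yield (features, label) batches streamed from disk,
# e.g. built with tf.data.experimental.make_csv_dataset, so the full 500,000
# rows never have to sit in memory at once.
# model.fit(train_ds, epochs=5)
```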
But my real question is this: why does it make sense to keep all of the zeroes in the data? I am looking for some mathematical reasoning here. I suppose I could randomly remove half of the valid transactions, shrink the dataset so it trains faster, and probably not lose much accuracy.
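If I went that route, dropping a random fraction of the valid rows could be done as a streaming filter on the same pipeline. Again just a sketch, assuming `full_ds` yields unbatched (features, label) pairs with a scalar 0/1 label:

```python
import tensorflow as tf

def downsample_valid(ds, keep_frac):
    """Keep every fraud (label 1); keep each valid row with probability keep_frac."""
    def keep_row(features, label):
        is_fraud = tf.equal(tf.cast(label, tf.int32), 1)
        return tf.logical_or(is_fraud, tf.random.uniform([]) < keep_frac)
    return ds.filter(keep_row)

# e.g. drop roughly half of the valid transactions before batching:
# train_ds = downsample_valid(full_ds, keep_frac=0.5).batch(1024)
```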
I could of course experiment with this directly: generate a series of datasets with progressively more of the 0 (valid) rows chopped out, and then see when the loss or the false negatives start to grow. That is an empirical result I can try, as in the sketch below. But I was wondering whether there is more mathematical reasoning behind keeping or discarding many of the valid (0) rows?
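In code the experiment would look something like this. It reuses `downsample_valid` from above; `build_model()` is a hypothetical helper wrapping the compile step in the first sketch (same three metrics), and `full_train_ds` / `val_ds` stand in for the real pipelines, with `val_ds` kept at its original class balance so the runs stay comparable.

```python
# Train one model per keep-fraction and watch where the validation loss and
# false negatives start to degrade. build_model(), downsample_valid(),
# full_train_ds, val_ds and NUM_FEATURES are the assumed pieces named above.
results = {}
for keep_frac in [1.0, 0.5, 0.25, 0.1, 0.05]:
    model = build_model(NUM_FEATURES)
    train_ds = downsample_valid(full_train_ds, keep_frac).batch(1024)
    model.fit(train_ds, epochs=5, verbose=0)
    # evaluate() returns [loss, fn, recall, auprc] in the compiled metric order
    loss, fn, recall, auprc = model.evaluate(val_ds.batch(1024), verbose=0)
    results[keep_frac] = {"val_loss": loss, "false_negatives": fn,
                          "recall": recall, "auprc": auprc}

for keep_frac, metrics in results.items():
    print(keep_frac, metrics)
```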