
So I have been looking at some public fraud-detection datasets, where each record has a 0/1 column indicating whether the transaction is fraudulent plus a large set of numerical and categorical features describing the transaction. Of course, in a dataset like this the number of fraud cases is very small compared to the number of valid transactions; I have not checked exactly, but perhaps 5% of the transactions or fewer are fraudulent. The dataset has, say, 500,000 records.

My plan is to run a penalized logistic regression over this data and get some results. The dataset is large and won't fit in memory, so I can set up TensorFlow to feed the data in batches. I can also track additional metrics, such as the number of times a 1 is predicted as a 0 (missed fraud), to keep the focus on detecting the fraudulent transactions.
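
For concreteness, the setup I have in mind looks roughly like the sketch below. The file path, the `is_fraud` label column, the feature count, and the penalty strength are all placeholders, and I am assuming the selected feature columns are numeric.

```python
import tensorflow as tf

BATCH_SIZE = 1024
N_FEATURES = 30  # placeholder for the actual number of features

# Stream the CSV in batches instead of loading everything into memory.
dataset = tf.data.experimental.make_csv_dataset(
    "transactions.csv",     # placeholder path
    batch_size=BATCH_SIZE,
    label_name="is_fraud",  # placeholder label column
    num_epochs=1,
    shuffle=True,
)

def pack_features(features, label):
    # Stack the per-column tensors into one feature matrix
    # (assumes the selected feature columns are numeric).
    cols = [tf.cast(v, tf.float32) for v in features.values()]
    return tf.stack(cols, axis=1), label

dataset = dataset.map(pack_features)

# Penalized logistic regression = a single sigmoid unit with an L2 penalty.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FEATURES,)),
    tf.keras.layers.Dense(
        1,
        activation="sigmoid",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
    ),
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[
        # How often a 1 (fraud) is predicted as a 0 (valid).
        tf.keras.metrics.FalseNegatives(name="missed_fraud"),
        tf.keras.metrics.Recall(name="recall"),
    ],
)

model.fit(dataset, epochs=5)
```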

But my real question is this: why does it make sense to keep all of the zeroes in the data? I am looking for some mathematical reasoning here. I suppose I could randomly remove half of the valid transactions, shrink the dataset so it trains faster, and probably not lose much accuracy.

I could of course test this empirically: generate a series of datasets with progressively more of the 0 (valid) rows removed, and see at what point the loss or the false-negative count starts to grow. That is an experiment I can run. But I was wondering whether there is a more mathematical argument for keeping or discarding many of the valid (0) rows.
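
The experiment would look roughly like the sketch below. I am writing it with pandas/scikit-learn on a subsample that fits in memory rather than the full TensorFlow pipeline, and the file and column names are placeholders; the idea is just to keep all the fraud rows, drop a growing share of the valid rows, and score every run against the same representative hold-out set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.model_selection import train_test_split

# Placeholder file and column names; features assumed numeric here.
df = pd.read_csv("transactions_sample.csv")
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]

# Hold out a representative test set once, so every run is scored
# against the true ~95/5 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pos_idx = y_tr[y_tr == 1].index
neg_idx = y_tr[y_tr == 0].index
rng = np.random.default_rng(0)

for keep_frac in [1.0, 0.5, 0.25, 0.10, 0.05]:
    # Keep every fraud row; keep only a fraction of the valid rows.
    kept_neg = rng.choice(neg_idx, size=int(keep_frac * len(neg_idx)),
                          replace=False)
    idx = pos_idx.append(pd.Index(kept_neg))

    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(X_tr.loc[idx], y_tr.loc[idx])

    p = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, p > 0.5).ravel()
    print(f"keep {keep_frac:.0%} of valid rows: "
          f"log loss = {log_loss(y_te, p):.4f}, false negatives = {fn}")
```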

krishnab
  • Because you are trying to estimate the probability that a transaction is fraudulent, and the baseline for that probability is "perhaps like 5% of the transactions". If you take out a lot of the 0 values, your baseline probability won't be 5% any more, so your estimates won't be accurate. – jbowman Dec 14 '19 at 17:00
  • @jbowman I see. So it is like a scaling argument. As I shrink the total size of the set, I will just throw off the denominator for the proportion p(fraud) = #fraud events / # total events. And so would removing those events then correspondingly inflate the coefficient estimates, since the fraud events would become more likely in the reduced dataset? – krishnab Dec 14 '19 at 17:03
  • Exactly so. Imagine if you deleted 95% of the zeroes, so the balance was about 50-50, and your model had absolutely no explanatory power (ridiculous, I admit)... then your estimated probability of fraud would be about 50%, even though the actual probability is about 5%. – jbowman Dec 14 '19 at 17:06
  • Okay excellent. yeah, that makes a lot more sense. Thanks for answering the question. I think I was getting caught up in the numerical estimation side of this--and the time it takes to run through the data. But the argument you make is very nice and intuitive. – krishnab Dec 14 '19 at 17:08

1 Answer


In a perfect world, you would be able to fit a half-million observations in memory and run the neural network on everything. This has the advantage of using a representative data set and letting the neural network learn from reality instead of fiction.

In this imperfect world, you might be right that you do not lose much performance. If you are able to build a useful model, that counts for a lot.

If you do reduce the number of non-fraud training cases to make the data fit in memory, use representative data when you evaluate your model. If you have a $95/5$ class imbalance in reality, you should have about that kind of class imbalance when you evaluate your model on unseen data. If the model trained on undersampled data only performs well in the fictional setting where the classes are roughly balanced, there is limited reason to expect good performance in production.

Our Demetri Pananos discusses here what to do if you fiddle with the class ratio, and the King paper discussed here is related. Depending on your task, you may be more or less interested in the fraud probabilities themselves (I would expect them to matter for this task). Downsampling the majority class to fit the data into memory gives the model an incorrect prior probability of fraud, which leads to incorrect posterior probabilities of fraud, though the Pananos answer I linked discusses a possible remedy. That is, depending on what you value in the predictions, you might find that a model trained on downsampled data does not perform as well as you thought.
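
If you do train on downsampled negatives, the usual remedy along those lines is to shift the predicted log-odds back by the sampling offset, a prior correction in the spirit of the King paper. A rough sketch, with made-up variable names:

```python
import numpy as np

def correct_downsampled_probs(p_sampled, true_rate, sampled_rate):
    """Adjust probabilities from a model trained on downsampled negatives.

    p_sampled    : P(fraud) predicted by the model fit on rebalanced data
    true_rate    : fraction of fraud in the real population (e.g. 0.05)
    sampled_rate : fraction of fraud in the training data after dropping
                   valid rows (e.g. 0.50)

    Under random undersampling of the majority class, only the intercept of
    a logistic regression is biased, by
    log[(1 - true_rate)/true_rate * sampled_rate/(1 - sampled_rate)];
    subtracting that offset from the log-odds restores the true base rate.
    """
    log_odds = np.log(p_sampled / (1 - p_sampled))
    offset = np.log((1 - true_rate) / true_rate
                    * sampled_rate / (1 - sampled_rate))
    return 1 / (1 + np.exp(-(log_odds - offset)))

# Example: a model trained on 50/50 data that outputs 0.60 corresponds to
# roughly a 7% fraud probability once the true 5% base rate is restored.
print(correct_downsampled_probs(0.60, true_rate=0.05, sampled_rate=0.50))
```

This only patches the probability scale, of course; the point above about evaluating on representative data still applies.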

Dave