
I have an interesting medical ML problem where the cost of failure, and the tradeoff between precision and recall, is much higher than in typical ML systems I have worked with before.

The precision of one class needs to be so high that we actually measure it in false positives per day, which should be 0.01 or less. In precision terms that is one error in five million, i.e. 99.99998%.

To have an expected value of one failure in the test data, we would need 100 days of data. Does that also mean we need roughly 3,000 days of data (30x that figure) to say anything with confidence about the precision metric?
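
For concreteness, here is the back-of-the-envelope calculation I have been using to sanity-check that 30x figure. It is only a sketch, and it assumes false positives arrive as a Poisson process, which I have not verified; the function name is just mine for illustration.

```python
import math

def zero_failure_days(target_rate_per_day: float, alpha: float) -> float:
    """Days of failure-free operation needed so that the (1 - alpha) upper
    confidence bound on the daily false-positive rate is below the target.
    With zero observed events over T days, that bound is -ln(alpha) / T."""
    return -math.log(alpha) / target_rate_per_day

for alpha in (0.05, 0.01, 0.001):
    days = zero_failure_days(0.01, alpha)
    print(f"{(1 - alpha) * 100:.1f}% confidence: ~{days:,.0f} failure-free days")

# 95.0% confidence: ~300 failure-free days
# 99.0% confidence: ~461 failure-free days
# 99.9% confidence: ~691 failure-free days
```

Under that Poisson assumption the zero-failure demonstration time is a few hundred days rather than 3,000, but it still dwarfs the training data, and any observed failure pushes the requirement up further.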

3,000 days of data is 600x the amount of training data we currently have.

However, the training data focuses on edge cases; it does not look like a typical day. A typical day is mostly easy cases, with maybe one interesting case per day. In that sense, we already have well over 3,000 days' worth of data in terms of expected interesting cases.

What are some best practices for gaining confidence to deploy the model, short of collecting the raw amount of real-world data needed to definitively demonstrate the metric?

Estimating how many interesting cases occur per day and extrapolating what the 3,000-day metric would be seems like the way to go, but I can't convince the owners of it. Are there any similar problems in the literature to point to and learn from?
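
Concretely, the extrapolation I have in mind looks something like the sketch below. All of the numbers, strata, and names are made up for illustration; the idea is just to reweight per-stratum false-positive rates from an edge-case-heavy test set by an estimate of the real-world case mix.

```python
# Assumed case mix on a typical day (hypothetical numbers).
est_daily_volume = {"easy": 5_000, "interesting": 1}

# False positives observed and cases evaluated per stratum in the
# edge-case-heavy test set (also hypothetical numbers).
test_results = {
    "easy":        {"false_positives": 0, "cases": 20_000},
    "interesting": {"false_positives": 2, "cases": 4_000},
}

# Project false positives per deployed day by weighting each stratum's
# observed false-positive rate by its estimated daily volume.
projected_fp_per_day = sum(
    est_daily_volume[stratum]
    * test_results[stratum]["false_positives"]
    / test_results[stratum]["cases"]
    for stratum in est_daily_volume
)
print(f"Projected false positives per day: {projected_fp_per_day:.4f}")
# Projected false positives per day: 0.0005
```

The obvious weak points are the estimate of the daily case mix and the uncertainty on the reweighted rate, which is exactly where the pushback from the owners comes in.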

Also, having a test set that is orders of magnitude bigger than the training set feels weird. I'm not sure it's necessarily wrong per se, but it is definitely expensive and counterintuitively out of proportion with a typical real-world ML train/test split.

