
For binary classification, I have a training data set that gets divided into calibration and validation sets. Some data is used to train the classifier and some data (truth data) is used to test the precision/recall. To calculate an F-score, I think you would theoretically only need one test case for each class, but that calls into question how reliable an F-score is given how many test cases are used. Is there a statistical metric that describes the "power" of an F-score based on the amount (sample size?) of the validation data set? Intuitively, it makes sense that an F-score calculated with more truth data is a more reliable description of the classifier's precision/recall, but after searching I can't find a metric that actually describes this.
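To make the intuition concrete, here is a minimal sketch (not a proposed answer) of what I mean by the F-score becoming more stable with more truth data. It assumes a hypothetical classifier that mislabels roughly 20% of cases; the names `simulate` and `f1` are just illustrative, and the F1 is computed by hand from precision and recall:

```python
import numpy as np

rng = np.random.default_rng(0)

def f1(y_true, y_pred):
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN),
    # F1 = harmonic mean of precision and recall.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def simulate(n):
    # Hypothetical truth data and predictions from a classifier
    # that gets about 20% of cases wrong.
    y_true = rng.integers(0, 2, size=n)
    flip = rng.random(n) < 0.2
    y_pred = np.where(flip, 1 - y_true, y_true)
    return y_true, y_pred

# Spread of F1 across many validation sets of the same size:
for n in (10, 100, 1000):
    scores = [f1(*simulate(n)) for _ in range(2000)]
    print(f"n={n:5d}  mean F1={np.mean(scores):.3f}  std={np.std(scores):.3f}")
```

Running this, the standard deviation of the F-score shrinks as the validation set grows, which is exactly the effect I would like to quantify with a proper statistical metric rather than a simulation.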

RyanG
