I've built a machine learning model on some training data to classify first names as male or female. The accuracy on the training data is x%. With the known accuracy, I then predict on a new dataset of n samples, and the model predicts that m names are male. How can I calculate the confidence interval for the number of male names in the dataset?

E.g. 600 of the names in a dataset of 1000 names were predicted as male. The true answer, with 95% confidence, lies within the interval 575-625 male names in the dataset.

  • Welcome to Cross Validated! "Bias" has a specific meaning in statistics, and it is not clear that you mean this (I lean toward no). Could you please say exactly what you mean when you use this term? – Dave Jul 10 '23 at 15:27
  • Thanks! Let me remove it, I think it just adds confusion – Gonzalez Vit Jul 10 '23 at 15:40
  • Are you looking for a confidence interval or a prediction interval? There is a difference. Also, note that accuracy is problematic as an evaluation measure. – Stephan Kolassa Jul 10 '23 at 15:48
  • When you write "I find that $m$ names are male", do you mean that you predict that $m$ names are male, or that the new data set actually has $m$ names that are male? – jbowman Jul 10 '23 at 15:48
  • @StephanKolassa, I think I'm looking for the prediction interval, e.g. "The model predicted 600 of the names were male, with 95% confidence that the true answer lies between 575 and 625" – Gonzalez Vit Jul 10 '23 at 16:02
  • @jbowman, predicted rather than actual. I'll clarify. – Gonzalez Vit Jul 10 '23 at 16:03
  • For less-than-perfect accuracy, I'm not sure you can infer a CI. Suppose your classifier achieves 50% accuracy on a balanced dataset. That might be because it calls everybody male, or because it calls everybody female, or because it calls male/female at a 50/50 rate but gets it wrong half the time. On a new balanced dataset of 1000 individuals, these three classifiers will predict m=1000, m=0, or m=500, respectively. If the predicted m value can be literally anything for fixed true m, I struggle to see how it can be informative in such a case. – Nuclear Hoagie Jul 10 '23 at 16:26
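
The point in the last comment can be checked with a small simulation. This is only a sketch: the dataset is synthetic (500 male and 500 female labels), the three classifiers are the hypothetical ones described in the comment, and the helper `summarize` and the seed are my own choices for illustration.

```python
import random


def summarize(true_labels, predicted_labels):
    """Return (accuracy, predicted male count) for one set of predictions."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    predicted_male = sum(p == "M" for p in predicted_labels)
    return accuracy, predicted_male


rng = random.Random(0)  # fixed seed so the coin-flip classifier is reproducible
n = 1000
# Balanced synthetic dataset: the true number of male names is 500.
true_labels = ["M"] * (n // 2) + ["F"] * (n // 2)

# Three hypothetical classifiers that all sit at (roughly) 50% accuracy.
classifiers = {
    "always male": ["M"] * n,
    "always female": ["F"] * n,
    "coin flip": [rng.choice("MF") for _ in range(n)],
}

results = {
    name: summarize(true_labels, preds) for name, preds in classifiers.items()
}
for name, (acc, m) in results.items():
    print(f"{name:14s} accuracy={acc:.3f} predicted m={m}")
```

The first two classifiers have exactly 50% accuracy while predicting m=1000 and m=0; the coin-flip classifier lands near 50% accuracy and near m=500. The true count is 500 in all three cases, so accuracy alone does not pin down how the predicted m relates to the true m.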
