
I would like to know whether it's possible to use a confusion matrix to measure the performance of a classification tool outside the realm of machine learning or statistical models.

For example, suppose I had a small script that scanned files for a virus. If I were to run it over n files and collect the results into a confusion matrix, would metrics like accuracy, recall, and precision remain valid and applicable?
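To make the setup concrete, here is roughly how I imagine tabulating the results (`scan_file` and the labelled file set are hypothetical placeholders, not real code I have):

```python
# Hypothetical sketch: tally a virus scanner's verdicts against known
# ground truth. scan_file() and the labelled files are placeholders.
from collections import Counter

def confusion_counts(files, truth, scan_file):
    """Return TP/FP/FN/TN counts for a binary detector over labelled files."""
    counts = Counter()
    for f in files:
        predicted = scan_file(f)   # True if the scanner flags the file
        actual = truth[f]          # True if the file really is infected
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif actual:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts
```

From those four counts I would then compute accuracy, precision, and recall in the usual way.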

Dave

2 Answers


Sure you can. Those metrics are older than machine learning. For example, ROC curves, calculated from the TPR and FPR, were developed during World War II for judging how well radar operators could distinguish signal from noise. Metrics calculated from the confusion matrix are also commonly used in medicine for judging the performance of diagnostic tests. For example, below is a table from a report giving such results for a rapid COVID test (not advertising it; it was the first such result that I found online). As you can see, it has many of the "machine learning" metrics.

[Table: reported performance metrics (sensitivity, specificity, etc.) for a rapid COVID test]

They are also used in many other scenarios involving information retrieval, signal detection, classification, and so on.
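To make the connection concrete, the arithmetic is identical no matter where the predictions come from; a minimal sketch with counts invented purely for illustration:

```python
# Confusion-matrix metrics from raw counts; whether the predictions came
# from a model, a diagnostic test, or a virus scanner is irrelevant here.
# These counts are invented purely for illustration.
TP, FP, FN, TN = 90, 5, 10, 895

sensitivity = TP / (TP + FN)                # a.k.a. recall / true positive rate
specificity = TN / (TN + FP)                # true negative rate
ppv = TP / (TP + FP)                        # positive predictive value / precision
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}  "
      f"PPV={ppv:.3f}  accuracy={accuracy:.3f}")
```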

Tim

YES AND NO

FIRST THE YES

In fact, there was (and still is, I suppose) an area of artificial intelligence called expert systems, which were, more or less, decision trees designed by subject matter experts (doctors, scientists, etc.). Given some data, the expert system would go down the decision tree and arrive at a prediction. You might even think of your own decision-making as working like this: you take in information, evaluate it, and then say, “Hi, doggie,” instead of, “Good morning, Mrs. Johnson,” since your eyes tell you that you see a dog instead of your neighbor.
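A minimal sketch of the idea (the rules and fields are invented, not taken from any real expert system):

```python
# Toy "expert system": hand-written rules standing in for a decision tree
# authored by a subject matter expert. No learning happens anywhere.
def classify(observation):
    if observation["legs"] == 4 and observation["barks"]:
        return "dog"
    if observation["legs"] == 2 and observation["says_good_morning"]:
        return "neighbor"
    return "unknown"

# Its predictions can be scored against ground truth exactly like a
# machine-learning model's, e.g. with a confusion matrix.
print(classify({"legs": 4, "barks": True}))  # -> dog
```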

Machine learning might be the dominant approach to artificial intelligence these days, but evaluating the predictions does not really depend on how you made those predictions.

NOW THE NO

Even in machine learning, metrics like accuracy, precision, and recall are threshold-based, discontinuous, improper scoring rules (arguably not even scoring rules). The problems with these metrics tend to be most noticeable in settings with class imbalance, which is the main setting where this topic arises in Cross Validated questions, but the problems are present with balanced classes, too. Briefly, the probabilities returned by many machine learning models allow us a much more nuanced evaluation.
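As a sketch of what that nuance buys you (all numbers here are invented for illustration): two classifiers can have identical accuracy at a 0.5 threshold while differing wildly in the quality of their probabilities, which a proper scoring rule such as the Brier score detects.

```python
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0])  # invented true labels

# Two hypothetical models with identical hard predictions at a 0.5
# threshold: A is confident and well calibrated, B barely crosses the line.
p_a = np.array([0.95, 0.90, 0.85, 0.10, 0.05, 0.15])
p_b = np.array([0.55, 0.60, 0.51, 0.45, 0.49, 0.40])

for name, p in [("A", p_a), ("B", p_b)]:
    accuracy = np.mean((p >= 0.5) == y)  # threshold-based, discontinuous
    brier = np.mean((p - y) ** 2)        # proper scoring rule
    print(f"model {name}: accuracy={accuracy:.2f}, Brier score={brier:.3f}")
```

Both models score accuracy 1.0 here, but the Brier score separates them immediately.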

This answer by our Stephan Kolassa is a good place to start reading about this notion, and it links to other good material (particularly Frank Harrell’s blog).

Dave