
My dataset is time-series sensor data, and the anomaly ratio is between 5% and 6%.

1. For evaluating time-series anomaly detection, which is better: precision/recall/F1 or ROC-AUC?

While studying this issue empirically, I found that some papers use precision/recall/F1 and others use ROC-AUC.

Considering that positive samples (anomalies) are much rarer than negative samples (normal points), which is better?

I'm confused about this.

2. If I use precision/recall/F1, should I check precision/recall/F1 only for the positive class?

I think that because positive samples are sparse, it is not appropriate to check precision/recall/F1 only for the positive class.

Thus, should I check precision/recall/F1 for both the positive and the negative class?

If so, can I report macro-averaged precision/recall/F1 in my paper?

(See the image below: the output of classification_report from the sklearn library.)

[image: sklearn classification_report output with per-class rows and a macro avg row]
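To make the macro-average question concrete, here is a small sketch of what `classification_report` computes on imbalanced labels. The label and prediction vectors below are made up purely for illustration (roughly 6% positives, to mimic the stated anomaly ratio); they are not from the questioner's data.

```python
# Hypothetical illustration: per-class and macro-averaged precision/recall/F1
# on an imbalanced binary labeling (~6% positives).
from sklearn.metrics import classification_report

y_true = [0] * 94 + [1] * 6                      # ~6% anomalies
y_pred = [0] * 90 + [1] * 4 + [0] * 3 + [1] * 3  # an arbitrary prediction vector

# output_dict=True exposes the same numbers the printed report shows,
# including the "macro avg" row (unweighted mean over the two classes)
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
print(report["1"]["f1-score"])          # F1 for the rare positive class
print(report["macro avg"]["f1-score"])  # unweighted mean of the two per-class F1s
```

Note how the macro average sits between the (high) negative-class F1 and the (low) positive-class F1: reporting it alone can hide how poorly the rare class is detected, so reporting the per-class rows alongside any average is usually more informative.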

  • Setting aside the issues with these measures (covered in Kolassa’s answer and comments under it), I don’t understand how you’re using F1 or AUC in this evaluation. Are you treating the problem as a supervised problem where a signal is either normal or anomalous? If so, why do you view its evaluation as any different from evaluating any other machine learning problem with a binary outcome? // If something happens five or six percent of the time, is it really such an anomaly? – Dave Mar 06 '24 at 14:42

1 Answer


Do not use accuracy to evaluate a classifier; see:

  • Why is accuracy not the best measure for assessing classification models?
  • Is accuracy an improper scoring rule in a binary classification setting?
  • Classification probability threshold

The same problems apply to sensitivity, specificity, F1, and indeed to all evaluation metrics that rely on hard classifications.
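A minimal sketch of why accuracy (and anything else built on hard 0/1 classifications) misleads at a ~6% anomaly rate: a classifier that never flags anything still looks excellent by accuracy while catching zero anomalies. The data below are made up for illustration.

```python
# Trivial "always normal" classifier on ~6% anomalies:
# high accuracy, zero anomalies detected.
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 94 + [1] * 6   # ~6% anomalies
y_pred = [0] * 100            # never predicts an anomaly

acc = accuracy_score(y_true, y_pred)                  # 0.94
rec = recall_score(y_true, y_pred, zero_division=0)   # 0.0 — misses every anomaly
print(acc, rec)
```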

Instead, produce probabilistic classifications and evaluate them using proper scoring rules, such as the Brier score or the log loss. Note that the AUC is a semi-proper scoring rule, so if the choice is strictly between it and the improper rules above, use the AUC. Better still: use a genuinely proper scoring rule. See the tag wiki for more information and pointers to the literature.
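As a hedged sketch of what this looks like in practice (the probabilistic scores below are synthetic, not from any real detector): score each point with a probability of being anomalous, then evaluate those probabilities directly with proper scoring rules, which sklearn exposes as `brier_score_loss` and `log_loss`.

```python
# Evaluating probabilistic anomaly scores with proper scoring rules,
# rather than thresholding them into hard labels first.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.06).astype(int)  # ~6% anomalies

# Made-up probabilistic scores: anomalies tend to receive higher probabilities.
p = np.clip(0.05 + 0.5 * y_true + 0.1 * rng.standard_normal(1000), 0.001, 0.999)

brier = brier_score_loss(y_true, p)  # proper scoring rule; lower is better
ll = log_loss(y_true, p)             # proper scoring rule; lower is better
auc = roc_auc_score(y_true, p)       # semi-proper; depends only on the ranking
print(brier, ll, auc)
```

Because the Brier score and log loss are computed on the probabilities themselves, they need no classification threshold and remain meaningful at a 5–6% anomaly rate.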

Stephan Kolassa