
Let's say I let a deep learning model classify a single object multiple times, under varying circumstances. Ideally it should predict the same class every time, but in reality its class predictions may vary.

So given a sequence of class predictions for the single object, I'd like to quantify how consistent the sequence is. To be clear, this is not about comparing predictions against some ground truth. This is about consistency within the prediction sequence itself.

  • For instance, a perfectly consistent prediction sequence like class_a, class_a, class_a, class_a should get a perfect score.
  • A less consistent sequence like class_a, class_b, class_a, class_c should get a lower score.
  • And a completely inconsistent sequence like class_a, class_b, class_c, class_d should get the lowest score possible.

The goal is to find out which objects we may need to keep training the classification model on. If the model's predictions for a certain object are not very consistent, we might need to add that object to a dataset for further training.

Preferably the measure works for any number of possible classes and also takes prediction confidences into account. The sequence class_a (0.9), class_b (0.9), class_a (0.9), class_c (0.9) should get a lower score than class_a (0.9), class_b (0.2), class_a (0.8), class_c (0.3), as it's no good when the predictions are inconsistent with high confidence.

Please note that the model's prediction confidences are not probabilities. A prediction from the model contains just 1 class with 1 confidence, not a set of classes with confidences adding up to 1 in total. (That would of course be an interesting case too.)
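For concreteness, here is a rough sketch of the kind of score I have in mind. It's purely illustrative - the function name and the entropy-based weighting are just my own ad-hoc choices, not anything standard. It pools confidence mass per class and ignores the order of the predictions:

```python
import math
from collections import defaultdict

def consistency_score(predictions):
    """Score a sequence of (class_label, confidence) predictions.

    Returns 1.0 for a perfectly consistent sequence and 0.0 when every
    prediction is a different class. Confidences act as weights, so
    confident disagreements hurt the score more than hesitant ones.
    """
    # Pool the confidence mass per predicted class.
    mass = defaultdict(float)
    for cls, conf in predictions:
        mass[cls] += conf
    if len(predictions) < 2 or len(mass) == 1:
        return 1.0
    total = sum(mass.values())
    shares = [m / total for m in mass.values()]
    # Entropy of the confidence-weighted class distribution, normalized
    # by the worst case: every prediction being a distinct class.
    entropy = -sum(s * math.log(s) for s in shares if s > 0)
    return 1.0 - entropy / math.log(len(predictions))

print(consistency_score([("a", 0.9), ("a", 0.9), ("a", 0.9), ("a", 0.9)]))  # 1.0
print(consistency_score([("a", 0.9), ("b", 0.2), ("a", 0.8), ("c", 0.3)]))  # ~0.50
print(consistency_score([("a", 0.9), ("b", 0.9), ("a", 0.9), ("c", 0.9)]))  # 0.25
print(consistency_score([("a", 0.9), ("b", 0.9), ("c", 0.9), ("d", 0.9)]))  # 0.0
```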

I could build something myself (like the sketch above), but is there a standard sklearn or scipy (or similar) function for this? Thanks in advance!

A comment on this question suggests Spearman's correlation coefficient or Kendall's correlation coefficient. I'll look into those as well.

wouterio

1 Answer


One fundamental issue here is that small changes to the model can and will cause hard flips in predicted classes. If you compare predicted class membership probabilities to some threshold, then one prediction might be just below that threshold, and the next one just above it - so while the predicted probability of belonging to a particular class has moved only barely, the hard classification will flip to a different class. I would therefore strongly recommend that you move to probabilistic classifications and separate the decision aspect out. See Reduce Classification Probability Threshold.

Now, at any time $t$, you have a prediction $p_{tk}$ for the probability of the object belonging to class $k$. The predictions should sum to 1 at any point, but that is not overly important here.

One simple way to assess the "persistence" of probabilistic predictions would be to fit an autoregressive model to the $(p_{tk})_t$ series for each $k$ - specifically one of order 1, an AR(1) model, possibly with an intercept. Then check the AR coefficient: if it is close to 1, your predictions at time $t$ are close to those at $t-1$, and your series is more persistent than if the coefficient were farther from 1.
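A minimal sketch of this, assuming statsmodels is available (the probability series below is made up):

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Made-up series of predicted probabilities for one class k over time.
p_tk = np.array([0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89,
                 0.91, 0.90, 0.88, 0.93, 0.91, 0.89, 0.90])

# AR(1) with an intercept: p_t = c + phi * p_{t-1} + noise.
result = AutoReg(p_tk, lags=1, trend="c").fit()
intercept, phi = result.params
print(f"AR(1) coefficient phi = {phi:.3f}")  # near 1 => persistent predictions
```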

This even allows you to look at the persistence of predictions for "secondary" classes: maybe your model stays quite certain that your object is a dog, but the secondary probabilities that it might be a cat or a spaceship may be jumping all over the place. That, too, may be helpful to know.

ARIMA models in general assume an unbounded space, whereas your values are bounded within $[0,1]$. You could first transform your probabilities to the entire real line and see how this changes matters.
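One common choice for such a transform is the logit; a quick sketch (the clipping epsilon is arbitrary, just to keep the logit finite if a probability is exactly 0 or 1):

```python
import numpy as np
from scipy.special import logit

p_tk = np.array([0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89])
eps = 1e-6  # clip so logit never sees exactly 0 or 1
z_tk = logit(np.clip(p_tk, eps, 1 - eps))
# Then fit the AR(1) model to z_tk instead of p_tk.
```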

Also, since your predictions sum to 1, they are actually a time series on a simplex. I am not aware of any specific work on ARIMA modeling on the simplex, except for some people doing compositional time series forecasting, but forecasting is not what you are trying to do.

Stephan Kolassa
  • Hi Stephan, thanks for your answer. A few notes:
    • We get the classes and confidences from a deep learning model. So unfortunately we won't be able to move to probabilistic classifications...
    • The confidences we get from the deep learning model are not related to probabilities. A prediction from the model also contains just 1 class and 1 confidence, not a set of classes with confidences adding up to 1 in total (as one might expect with probabilistic outputs).

    Sorry I didn't make that clear in my question. I'll add it soon.

    – wouterio Mar 27 '24 at 15:25
  • Thank you. My recommendation would be not to worry about the consistency of sequences of hard classifications, but to think about how to change your underlying model to one that outputs probabilities, because that will be much more enlightening; see the initial link in my answer. Yes, I understand that you will probably not find that helpful. – Stephan Kolassa Mar 27 '24 at 15:33