I am trying to compare two methods and am looking for an appropriate metric. Both methods infer stop-locations for a person, each using a different type of data. A stop-location is represented as a time interval. For example, consider the following scenario:
Method A:
09:00-09:30
09:40-10:00
Method B:
09:00-09:14
09:15-09:29
09:38-10:00
Ground truth:
09:01-09:30
09:38-10:00
In the above example, the ground truth is that the person visited two stop-locations for 29 minutes and 22 minutes. The first method is a little off around the start/stop-times. The second method is better at getting start/stop-times correct, but splits the first stop-location into two pieces.
I am interested in comparing Method A to Method B to determine which is better at inferring the ground truth.
One possibility would be to simply compute, for each method, the fraction of minutes where it predicts a stop-location where there is none (false positives) and the fraction of minutes where it predicts no stop-location where there is one (false negatives). These could then be combined into a single metric, for example the F1-score.
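A minimal sketch of this minute-level comparison on the example above. It assumes each interval is half-open (the start minute is included, the end minute is not) and that one-minute resolution is enough; the helper names `to_minutes`, `covered_minutes` and `minute_scores` are made up for illustration.

```python
def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def covered_minutes(intervals):
    """Return the set of minutes covered by 'HH:MM-HH:MM' intervals.

    Assumption: intervals are half-open, [start, end).
    """
    covered = set()
    for interval in intervals:
        start, end = interval.split("-")
        covered.update(range(to_minutes(start), to_minutes(end)))
    return covered


def minute_scores(predicted, truth):
    """Minute-level precision, recall and F1 of predicted vs. true stops."""
    pred = covered_minutes(predicted)
    true = covered_minutes(truth)
    tp = len(pred & true)  # minutes correctly marked as a stop
    fp = len(pred - true)  # predicted stop, but no stop in the ground truth
    fn = len(true - pred)  # stop in the ground truth, but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


method_a = ["09:00-09:30", "09:40-10:00"]
method_b = ["09:00-09:14", "09:15-09:29", "09:38-10:00"]
ground_truth = ["09:01-09:30", "09:38-10:00"]

scores_a = minute_scores(method_a, ground_truth)
scores_b = minute_scores(method_b, ground_truth)
print("Method A:", scores_a)
print("Method B:", scores_b)
```

Under these assumptions both methods end up with 49 true-positive minutes, 1 false-positive minute and 2 false-negative minutes, so their F1-scores are identical. The minute-level score therefore does not distinguish slightly shifted boundaries (Method A) from splitting one stop into two (Method B).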
Is there a better way to approach this?