In his blog article *Is Medicine Mesmerized by Machine Learning?*, Frank Harrell shows a calibration curve (below) and describes it as quite poor.
I follow the logic: a claimed probability of $0.20$ corresponds to an actual probability of about $0.08$–$0.10$. Claiming roughly double the true probability sure seems like poor performance.
What are good ways to summarize this in a single number? My first thought is to integrate to find the area between the calibration curve and the "perfectly calibrated" diagonal. Are there other measures that work better or give different insights?
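To make the area idea concrete, here is a minimal sketch of what I have in mind, assuming the calibration curve is available as paired arrays of claimed and actual probabilities (the values below are made up for illustration, not read off Harrell's plot). It numerically integrates the absolute gap between the curve and the $45°$ line with the trapezoidal rule:

```python
import numpy as np

# Hypothetical calibration curve: claimed (predicted) probabilities
# and the corresponding actual (observed) probabilities.
predicted = np.array([0.05, 0.10, 0.20, 0.40, 0.60, 0.80])
observed = np.array([0.02, 0.04, 0.09, 0.30, 0.55, 0.78])

# Absolute miscalibration at each point: distance from the
# "perfectly calibrated" line y = x.
gap = np.abs(observed - predicted)

# Trapezoidal rule over the claimed-probability axis, giving the
# area between the calibration curve and the diagonal.
area = float(np.sum((gap[1:] + gap[:-1]) / 2 * np.diff(predicted)))
print(area)  # → 0.05375
```

A zero area would mean perfect calibration over the plotted range; larger values mean worse. One design choice to flag: integrating over the claimed-probability axis with equal weight ignores how many cases actually fall at each predicted probability, so a density-weighted average of the gap may be a more meaningful single number.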
