I'll expand on my comment here. This answer was partially adapted from this one.
Let $Y$ represent a class label and $\hat Y$ represent a class prediction. In the binary case, let the two possible values for $Y$ and $\hat Y$ be $0$ and $1$, which represent the classes. Next, suppose that the confusion matrix for $Y$ and $\hat Y$ is:
| | $\hat Y = 0$ | $\hat Y = 1$ |
|---|---|---|
| $Y = 0$ | 10 | 20 |
| $Y = 1$ | 30 | 40 |
Next, let us normalize the confusion matrix so that the sum of all of its elements is $1$. Currently, the elements sum to $10 + 20 + 30 + 40 = 100$, which is our normalization factor. Dividing each element of the confusion matrix by this factor, we get the following normalized confusion matrix:
| | $\hat Y = 0$ | $\hat Y = 1$ |
|---|---|---|
| $Y = 0$ | $\frac{1}{10}$ | $\frac{2}{10}$ |
| $Y = 1$ | $\frac{3}{10}$ | $\frac{4}{10}$ |
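The normalization step can be sketched in a couple of lines (using numpy; the variable names are mine):

```python
import numpy as np

# Confusion matrix from above: rows are true labels Y, columns are predictions Y_hat.
cm = np.array([[10, 20],
               [30, 40]])

# Divide by the total count so that all elements sum to 1.
joint = cm / cm.sum()
print(joint)  # [[0.1 0.2]
              #  [0.3 0.4]]
```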
With this formulation of the confusion matrix, we can interpret it as an estimate of the joint probability mass function (PMF) of $Y$ and $\hat Y$. This interpretation allows us to measure how often a certain class is predicted. For example, suppose we wanted to compute $P(\hat{Y} = 1)$, which is how often the classifier predicts class $1$. Using the law of total probability, we can compute this as $P(\hat{Y} = 1) = P(\hat{Y} = 1,Y = 1) + P(\hat{Y} = 1,Y = 0) = \frac{4}{10} + \frac{2}{10} = \frac{6}{10}$. Therefore, the classifier predicts class $1$ about $60\%$ of the time. Note, however, that these are all estimates.
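To make the marginalization concrete, here is a small sketch that computes $P(\hat Y = 1)$ by summing the corresponding column of the normalized matrix (again using numpy; the variable names are mine):

```python
import numpy as np

# Normalized confusion matrix (estimated joint PMF of Y and Y_hat).
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

# Law of total probability: sum over Y in the column where Y_hat = 1.
p_pred_1 = joint[:, 1].sum()
print(p_pred_1)  # 0.6
```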