3

I was wondering about the motivation behind the following definition of expected loss:

$$E[L] = \sum_{k} \sum_{j} \int_{R_{j}} L_{kj} p(x, C_{k})dx$$

where $L_{kj}$ is the loss matrix, in which $j$ is the predicted class and $k$ the true class, $R_{j}$ is the decision region corresponding to class $j$, and $x$ is an input vector. For the sake of concreteness, let's assume we have only two regions $R_{1}$ and $R_{2}$ and that elements contained in $R_{1}$ and $R_{2}$ belong to class $C_{1}$ and $C_{2}$, respectively. For example, an element $x_{i}$ in region $R_{2}$ contributes the term:

$$L_{12}p(x_{i}, C_{1}) + L_{22}p(x_{i}, C_{2})$$

but $L_{22}$ is presumably $0$, because predicting class $C_{2}$ when the true class is $C_{2}$ is exactly what we want, so no loss should be incurred.

I understand that we want to minimize $E[L]$, so every time we predict the class incorrectly we increase $E[L]$ according to $L_{kj}$. But why do we multiply by the term $p(x_{i}, C_{1})$ in this example, or by $p(x, C_{k})$ in general?

Simplifying, for every assignment of $x$ to class $j$, we want to minimize:

$$\sum_{k} L_{kj}p(C_{k}|x)$$

but the question remains: why $p(C_{k}|x)$? I can see that the expectation requires a probability, but I can't see why we choose the probability of the true class given $x$.
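To make the decision rule concrete, here is a minimal sketch of picking the class $j$ that minimizes $\sum_{k} L_{kj}\,p(C_{k}|x)$. The loss matrix and the posterior values are made up for illustration; the point is that the posterior of the true class is exactly what weights each entry of the loss matrix, so an asymmetric loss can flip the decision even when one posterior dominates.

```python
import numpy as np

# Hypothetical 2-class loss matrix L[k, j]: true class k, predicted class j.
# Zero on the diagonal (correct predictions cost nothing); misclassifying
# a true C_2 sample is assumed to be 5x worse than the reverse.
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])

def bayes_decision(posterior):
    """Pick the class j minimizing sum_k L[k, j] * p(C_k | x)."""
    expected_loss_per_class = posterior @ L  # entry j = sum_k p(C_k|x) L[k, j]
    return int(np.argmin(expected_loss_per_class))

# Made-up posterior p(C_k | x) for a single input x
posterior = np.array([0.7, 0.3])
print(bayes_decision(posterior))  # → 1, i.e. predict C_2 even though p(C_1|x) = 0.7
```

Predicting $C_1$ here would cost $5 \times 0.3 = 1.5$ in expectation, while predicting $C_2$ costs only $1 \times 0.7 = 0.7$, so the asymmetric loss overrides the larger posterior.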

Regards

r_31415

2 Answers

5

As you said, $L_{kk} = 0$ because you want to minimize the probability of misclassifying the sample. You then want to minimize the probability mass (with $p$ as the measure) of $\Omega - R_{k}$. The decision boundaries of your classifier define the $R_{k}$. I hope that is clear.

Now, why $p(x, C_{j})$? First, note that $p(x) = \sum_{i} p(x, C_{i})$. Second, assume that the $R_{i}$ form a partition of the space (i.e. $R_{i} \cap R_{j} = \emptyset$ if $i \neq j$ and $\cup_{k} R_{k} = \Omega$). Third, recall that $L_{kk} = 0$.

The expected loss incurred on samples whose true class is $C_{k}$, i.e. the cost of assigning such a sample to each of the possible regions, is

$$\sum_{j}\int_{R_{j}}L_{kj}\,p(x, C_{k})\,dx$$

where the joint density $p(x, C_{k})$ weights each $x$ by how likely it is to occur together with true class $C_{k}$. Summing this over $k$ gives your first expression.
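The decomposition can be checked numerically on a discretized toy problem. The densities below are made up for illustration: two Gaussian classes centered at $-1$ and $+1$ with equal priors and a 0/1 loss, so $E[L]$ should come out close to the misclassification probability $\Phi(-1) \approx 0.16$ for the boundary at $x = 0$.

```python
import numpy as np

# Toy discretized 1-D example: joint[k] approximates p(x, C_k) for two
# hypothetical Gaussian classes at -1 and +1 with equal priors of 0.5.
x = np.linspace(-3.0, 3.0, 601)
dx = x[1] - x[0]
joint = np.stack([
    0.5 * np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2.0 * np.pi),
    0.5 * np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2.0 * np.pi),
])

# 0/1 loss: every misclassification costs 1, correct predictions cost 0
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Decision regions: each grid point goes to the class with the larger joint,
# so region[i] = j means x_i lies in R_j.
region = np.argmax(joint, axis=0)

# Per-true-class losses: sum_j  integral_{R_j} L[k, j] p(x, C_k) dx
per_class = [
    sum(L[k, j] * joint[k, region == j].sum() * dx for j in range(2))
    for k in range(2)
]

# Summing over k gives E[L]; with 0/1 loss this is the misclassification
# probability, roughly 0.16 for the decision boundary at x = 0.
EL = sum(per_class)
print(EL)
```

Swapping in an asymmetric loss matrix moves the loss-minimizing boundary away from the point where the joints cross, which is precisely what weighting by $p(x, C_{k})$ buys you.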

jpmuc
1

$p(C_k|x)$ is your model in this case, i.e. the quantity you are estimating. If it were not part of the loss, how would you optimize it?

bayerj
  • Sure, but why take $C_{k}$ as opposed to $C_{j}$? – r_31415 Feb 25 '13 at 19:48
  • The probabilities are normalized (they sum to one), thus it cancels out: pushing one probability up pushes all the others down. – bayerj Feb 26 '13 at 17:17