E.T. Jaynes, in "Probability Theory: The Logic of Science", writes the following to motivate the derivation of the entropy $H$:
Suppose the robot perceives two alternatives, to which it assigns probabilities $p_1$ and $q := 1 - p_1$. Then the 'amount of uncertainty' represented by this distribution is $H_2(p_1, q)$. But now the robot learns that the second alternative really consists of two possibilities, and it assigns probabilities $p_2, p_3$ to them, satisfying $p_2 + p_3 = q$. What is now the robot's full uncertainty $H_3(p_1, p_2, p_3)$ as to all three possibilities? Well, the process of choosing one of the three can be broken down into two steps. First, decide whether the first possibility is or is not true; the uncertainty removed by this decision is the original $H_2(p_1, q)$. Then, with probability $q$, the robot encounters an additional uncertainty as to events $2, 3$, leading to
$$H_3(p_1, p_2, p_3) = H_2(p_1, q) + q H_2\left(\frac{p_2}{q}, \frac{p_3}{q} \right)$$
as the condition that we shall obtain the same net uncertainty for either method of calculation.
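To make sure I am reading the condition correctly, here is a quick numerical sanity check, assuming the Shannon form $H(p_1, \dots, p_n) = -\sum_i p_i \log_2 p_i$ (the function Jaynes eventually derives from this condition); the specific numbers are just an illustrative choice:

```python
import math

def H(*ps):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# One alternative with probability p1 = 1/2; the other (q = 1/2)
# splits into two equally likely sub-possibilities p2 = p3 = 1/4.
p1, p2, p3 = 0.5, 0.25, 0.25
q = p2 + p3

lhs = H(p1, p2, p3)                      # full three-way uncertainty
rhs = H(p1, q) + q * H(p2 / q, p3 / q)   # two-step decomposition
print(lhs, rhs)                          # both come out to 1.5 bits
assert math.isclose(lhs, rhs)
```

So the algebra is consistent; my difficulty is with the intuition behind the condition, not with verifying it.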
I don't find this reasoning compelling, especially the way $q$ appears as a weight in front of the second $H_2$. Perhaps there is a better way to put what Jaynes meant to say, through an explicit example of a coin flip where the $T$ outcome turns out to be governed by another coin flip?