I'll assume you understand backpropagation with a sigmoidal activation function. If not, please tell me and I'll edit this answer.
First, let's look at both the sigmoid and the softmax functions for a given output $ x_i $, where $ x $ is the vector containing all the outputs of the layer.
The sigmoid is: $ f_i(x) = \frac{1}{1 + e^{-x_i}} $, while the softmax is $ g_i(x) = \frac{e^{x_i}}{\sum_{j=1}^N e^{x_j}} $
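As a side note, here is a minimal NumPy sketch of both functions (the function names are mine, and subtracting the maximum inside the softmax is just a common numerical-stability trick; it does not change the result):

```python
import numpy as np

def sigmoid(x):
    # Element-wise sigmoid: f_i(x) = 1 / (1 + exp(-x_i))
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Softmax over the whole vector: g_i(x) = exp(x_i) / sum_j exp(x_j)
    # Subtracting max(x) keeps the exponentials from overflowing
    # without changing the ratios.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(sigmoid(x))        # each entry is treated independently
print(softmax(x))        # entries depend on each other and sum to 1
print(softmax(x).sum())  # 1.0
```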
Then the sigmoid derivative is $ \frac{\partial f_i(x)}{\partial x_j} = \left\{ \begin{matrix} f_i(x) (1 - f_i(x)) & i = j \\ 0 & i \neq j \end{matrix} \right. $
while the softmax derivative is $ \frac{\partial g_i(x)}{\partial x_j} = \left\{ \begin{matrix} \frac{e^{x_i} \left( \sum_{k=1, k \neq i}^N e^{x_k} \right)}{\left(\sum_{k=1}^N e^{x_k} \right)^2} = g_i(x)\,(1 - g_i(x)) & i = j \\ - \frac{e^{x_i + x_j}}{\left(\sum_{k=1}^N e^{x_k} \right)^2} = -g_i(x)\,g_j(x) & i \neq j \end{matrix} \right. $
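Both cases collapse into the single expression $ \frac{\partial g_i(x)}{\partial x_j} = g_i(x)\,(\delta_{ij} - g_j(x)) $, i.e. the Jacobian is $ \operatorname{diag}(g) - g g^T $. Here is a small sketch (helper names are mine) that builds the Jacobian this way and checks it against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = dg_i/dx_j = g_i * (delta_ij - g_j)
    g = softmax(x)
    return np.diag(g) - np.outer(g, g)

x = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(x)

# Central finite-difference check of the analytic Jacobian.
eps = 1e-6
J_num = np.zeros_like(J)
for j in range(len(x)):
    d = np.zeros_like(x)
    d[j] = eps
    J_num[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))  # True
```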
As you can see, for $ i = j $ we have $ \frac{\partial g_i(x)}{\partial x_i} > 0 $, so the weights related to that output receive a positive reinforcement, just as with the sigmoid.
But for all the other weights ($ i \neq j $) we have $ \frac{\partial g_i(x)}{\partial x_j} < 0 $, so they receive a negative reinforcement, unlike with the sigmoid, whose cross-derivatives are zero and therefore leave them unchanged.
Depending on your problem, this may make convergence faster, at a higher computational expense (computing the softmax is slower than computing the sigmoid). A purely illustrative demo of this reinforcement pattern is sketched below.
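In the sketch, the logits stand in for the weights feeding each output, and the step sizes and values are arbitrary: nudging the logits in the direction that increases output $ i = 0 $ pushes every competing logit down with softmax, while the sigmoid leaves them alone.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.2, 0.1, -0.3])
lr, i = 0.5, 0  # gradient-ascent step on output i

# Softmax: dg_i/dx_j = g_i * (delta_ij - g_j), positive at j = i,
# negative everywhere else.
g = softmax(x)
grad_soft = g[i] * ((np.arange(len(x)) == i) - g)

# Sigmoid: df_i/dx_j is nonzero only at j = i.
s = sigmoid(x)
grad_sig = np.zeros_like(x)
grad_sig[i] = s[i] * (1.0 - s[i])

print(grad_soft)           # [+, -, -]
print(grad_sig)            # [+, 0, 0]
print(x + lr * grad_soft)  # competing logits are pushed down
print(x + lr * grad_sig)   # competing logits stay put
```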
Finally, the predicted label is the one with the maximum output value (thus the name softmax). With that in mind, the output vector $ o $ can be expressed as
$ o_i = \left\{ \begin{matrix} 1\;, & \; i = \text{argmax}[g(x)] \\
0\;, & \; \text{otherwise} \end{matrix} \right. $
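A minimal sketch of that prediction step, reusing the softmax helper from above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 3.0, 0.5])
g = softmax(x)

# One-hot output vector o: 1 at the argmax of g(x), 0 elsewhere.
o = np.zeros_like(g)
o[np.argmax(g)] = 1.0
print(o)  # [0. 1. 0.]
```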