Let me derive backpropagation for a fairly general case. I believe this derivation exists somewhere, but it was not easy for me to find, so I am posting it here.
The notation I follow is based on the CS229 lecture notes.
Definitions.
Let $J$ be a loss function of a neural network to minimize. Let $x\in \mathbb{R}^{d_0}$ be a single sample (input). Define its deep activations in a cascaded way as follows:
$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$
and $a^{(l)} = g^{(l)}(z^{(l)})$ for $l=1, \dots, L$ where $L$ is the number of layers.
Here, $W^{(l)} \in \mathbb{R}^{d_{l} \times d_{l-1}} $ and $b^{(l)} \in \mathbb{R}^{d_l} $ are parameters of the neural network model that we need to optimize. $a^{(l)} \in \mathbb{R}^{d_l} $ is the $l$-th layer activation output, and $g^{(l)} : \mathbb{R}^{d_l} \to \mathbb{R}^{d_l} $ is the corresponding activation function (such as ReLU). The zeroth layer activation is set as the input $a^{(0)} = x$.
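As a concrete illustration, here is a minimal NumPy sketch of this forward pass. (The ReLU hidden activation, the identity final activation, and the names `Ws`, `bs` are my own choices for the example, not part of the setup above.)

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, Ws, bs):
    """Cascaded forward pass.

    Ws[l-1] is W^(l) with shape (d_l, d_{l-1}); bs[l-1] is b^(l) with shape (d_l,).
    Assumes ReLU hidden layers and an identity final layer (my assumption).
    Returns the lists of pre-activations z^(1..L) and activations a^(0..L).
    """
    a = x                                   # a^(0) = x
    zs, activations = [], [a]
    L = len(Ws)
    for l in range(1, L + 1):
        z = Ws[l - 1] @ a + bs[l - 1]       # z^(l) = W^(l) a^(l-1) + b^(l)
        a = z if l == L else relu(z)        # a^(l) = g^(l)(z^(l))
        zs.append(z)
        activations.append(a)
    return zs, activations
```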
Convention.
Except for the final layer $L$, the activation function is the same for all hidden layers $l < L$: $g^{(l)} = g$. Also, we assume $g$ is applied pointwise (elementwise), as with ReLU or the sigmoid.
Another convention is that the loss function $J$ is defined such that
$$
\nabla_{z^{(L)}} J = z^{(L)} - y
$$
where $y$ is the target that our neural network model needs to predict. Note that the gradient $\nabla_{z^{(L)}} J$ is a column vector
$$
\nabla_{z^{(L)}} J =
\begin{bmatrix}
\frac{\partial J}{\partial z^{(L)}_1} \\
\vdots \\
\frac{\partial J}{\partial z^{(L)}_{d_L}}
\end{bmatrix}
\in \mathbb{R}^{d_L}.
$$
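For example, this convention holds when the final activation is the identity, $a^{(L)} = z^{(L)}$, and $J$ is the half squared error (my choice of concrete example):
$$
J = \tfrac{1}{2}\, \| z^{(L)} - y \|^2
\quad \Longrightarrow \quad
\frac{\partial J}{\partial z^{(L)}_i} = z^{(L)}_i - y_i,
\quad \text{i.e.} \quad
\nabla_{z^{(L)}} J = z^{(L)} - y.
$$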
Motivation.
Backpropagation is simply a sequence of chain-rule applications. Observe first that $\frac{\partial J}{\partial W^{(L)}_{ij}}$ decomposes as
$$
\frac{\partial J}{\partial W^{(L)}_{ij}}
=
\sum_k
\frac{\partial J}{\partial z^{(L)}_k}
\frac{\partial z^{(L)}_k}{\partial W^{(L)}_{ij}}
=
\frac{\partial J}{\partial z^{(L)}_i}
\frac{\partial z^{(L)}_i}{\partial W^{(L)}_{ij}}.
$$
(The terms with $k \neq i$ vanish because $z^{(L)}_k$ does not depend on $W^{(L)}_{ij}$ when $k \neq i$.) The above is computable: we know $\frac{\partial J}{\partial z^{(L)}_i}$ from the loss convention, and $\frac{\partial z^{(L)}_i}{\partial W^{(L)}_{ij}} = a^{(L-1)}_j$ is a forward activation output.
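As a sanity check, this identity can be verified numerically on a tiny network. The sketch below (my own example: one hidden ReLU layer, identity output, half-squared-error loss, arbitrary layer sizes) compares $\frac{\partial J}{\partial z^{(L)}_i}\frac{\partial z^{(L)}_i}{\partial W^{(L)}_{ij}} = (z^{(L)} - y)_i\, a^{(L-1)}_j$ against a central finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 3, 2                        # layer sizes (arbitrary for the example)
W1, b1 = rng.normal(size=(d1, d0)), rng.normal(size=d1)
W2, b2 = rng.normal(size=(d2, d1)), rng.normal(size=d2)
x, y = rng.normal(size=d0), rng.normal(size=d2)

def loss(W2_):
    a1 = np.maximum(0.0, W1 @ x + b1)       # a^(1)
    z2 = W2_ @ a1 + b2                      # z^(2) = z^(L)
    return 0.5 * np.sum((z2 - y) ** 2)      # J with nabla_{z^(L)} J = z^(L) - y

a1 = np.maximum(0.0, W1 @ x + b1)
z2 = W2 @ a1 + b2
i, j, eps = 1, 2, 1e-6
analytic = (z2 - y)[i] * a1[j]              # dJ/dz^(L)_i * a^(L-1)_j

W2p, W2m = W2.copy(), W2.copy()
W2p[i, j] += eps
W2m[i, j] -= eps
numeric = (loss(W2p) - loss(W2m)) / (2 * eps)
print(analytic, numeric)                    # the two should agree to several decimals
```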
Derivation.
We generalize this principle to early layers one by one. In particular,
$$
\frac{\partial J}{\partial W^{(l)}_{ij}}
=
\frac{\partial J}{\partial z^{(l)}_i}
\frac{\partial z^{(l)}_i}{\partial W^{(l)}_{ij}}
= \frac{\partial J}{\partial z^{(l)}_i} a^{(l-1)}_j.
\tag{1a}
$$
This holds because
$ z^{(l)}_i = \sum_j W^{(l)}_{ij} a^{(l-1)}_j + b^{(l)}_i $. Similarly,
$$
\frac{\partial J}{\partial b^{(l)}_{i}}
= \frac{\partial J}{\partial z^{(l)}_i}.
\tag{1b}
$$
Now, it suffices to obtain $\frac{\partial J}{\partial z^{(l)}_i}$, which is obtained by applying the chain rule in the backward direction (hence the name backpropagation); in particular, we have
$$
\frac{\partial J}{\partial z^{(l)}_i}
= \sum_k
\frac{\partial J}{\partial z^{(l+1)}_k}
\frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_i}
\tag{2},
$$
which is
$$
\frac{\partial J}{\partial z^{(l)}_i}
= \sum_k
\frac{\partial J}{\partial z^{(l+1)}_k}
\frac{\partial z^{(l+1)}_k}{\partial a^{(l)}_i}
\frac{\partial a^{(l)}_i}{\partial z^{(l)}_i}
=
\sum_k
\frac{\partial J}{\partial z^{(l+1)}_k}
W^{(l+1)}_{ki} g'(z^{(l)}_i)
$$
as $\frac{\partial z^{(l+1)}_k}{\partial a^{(l)}_i} = W^{(l+1)}_{ki}$ and $\frac{\partial a^{(l)}_i}{\partial z^{(l)}_i} = g'(z^{(l)}_i)$. (Here $g'$ is the derivative of $g$.)
In the matrix form, the above equations (1a), (1b), and (2) can be stated as follows:
$$
\nabla_{W^{(l)}} J = \nabla_{z^{(l)}} J \cdot a^{(l-1)T},
\tag{1a-m}
$$
$$
\nabla_{b^{(l)}} J = \nabla_{z^{(l)}} J,
\tag{1b-m}
$$
$$
\nabla_{z^{(l)}} J = (W^{(l+1)T} \cdot \nabla_{z^{(l+1)}} J) \odot g'(z^{(l)})
\tag{2-m}
$$
Here, $\cdot$ denotes matrix multiplication, $\odot$ is the Hadamard (elementwise) product, $g'$ is applied elementwise, and $\nabla_{W^{(l)}} J$ is the matrix that contains $\frac{\partial J}{\partial W^{(l)}_{ij}}$ as its $i$-th row, $j$-th column entry.
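Putting (1a-m), (1b-m), and (2-m) together, a full backward pass can be written in a few lines. The sketch below is mine and assumes ReLU hidden layers, an identity final layer, and a loss satisfying $\nabla_{z^{(L)}} J = z^{(L)} - y$ as above; `Ws[l-1]` and `bs[l-1]` stand for $W^{(l)}$ and $b^{(l)}$.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)            # g'(z), applied elementwise

def backward(x, y, Ws, bs):
    """Returns dWs, dbs with dWs[l-1] = nabla_{W^(l)} J and dbs[l-1] = nabla_{b^(l)} J."""
    # Forward pass, caching z^(l) and a^(l).
    a, zs, activations = x, [], [x]
    L = len(Ws)
    for l in range(1, L + 1):
        z = Ws[l - 1] @ a + bs[l - 1]
        a = z if l == L else relu(z)
        zs.append(z)
        activations.append(a)

    # Backward pass.
    dWs, dbs = [None] * L, [None] * L
    delta = zs[-1] - y                         # nabla_{z^(L)} J = z^(L) - y (loss convention)
    for l in range(L, 0, -1):
        dWs[l - 1] = np.outer(delta, activations[l - 1])          # (1a-m): nabla_{z^(l)} J . a^(l-1)^T
        dbs[l - 1] = delta                                         # (1b-m)
        if l > 1:
            delta = (Ws[l - 1].T @ delta) * relu_grad(zs[l - 2])   # (2-m)
    return dWs, dbs
```

Note that the outer product in (1a-m) places $\frac{\partial J}{\partial z^{(l)}_i} a^{(l-1)}_j$ at row $i$, column $j$, matching (1a).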