
I am trying to really internalize how backpropagation works. I made up networks of increasing complexity and wrote down the formulas for them. However, I have some difficulties with the matrix notation. I hope someone can help me.

My network has 2 input, 3 hidden, and 2 output neurons.

[Figure: diagram of the 2-3-2 network]

I wrote up the matrices:

[Figure: the weight matrices]

The loss is MSE: $ L = \frac {1}{2} \sum (\hat{y} - y_{\text{true}})^2$

The derivative of the loss with respect to the weight matrix $W^{(3)}$ should have the same dimensions as $W^{(3)}$, so that each entry can be updated with (stochastic) gradient descent.

$\frac {\partial L}{\partial W^{(3)}} = \frac {\partial L}{\partial a^{(3)}} \frac {\partial a^{(3)}}{\partial z^{(3)}} \frac {\partial z^{(3)}}{\partial W^{(3)}} = \left[(a^{(3)} - y) \odot a^{(3)} \odot (1 - a^{(3)})\right] {a^{(2)}}^T$

First question: is it correct to transpose $a^{(2)}$, since otherwise the dimensions would not work out?
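To make this concrete, here is a minimal NumPy shape check of that expression (random placeholder values, everything treated as a column vector):

```python
import numpy as np

rng = np.random.default_rng(0)

a2 = rng.random((3, 1))            # hidden activations, 3x1
a3 = rng.random((2, 1))            # output activations (sigmoid), 2x1
y  = rng.random((2, 1))            # targets, 2x1

delta3 = (a3 - y) * a3 * (1 - a3)  # element-wise Hadamard products, 2x1
dW3 = delta3 @ a2.T                # (2x1)(1x3) -> 2x3, same shape as W^(3)

print(dW3.shape)                   # (2, 3)
```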

Now for the second weight matrix, where I cannot figure out what is wrong with the dimensions:

$\frac {\partial L}{\partial W^{(2)}} = \frac {\partial L}{\partial a^{(3)}} \frac {\partial a^{(3)}}{\partial z^{(3)}} \frac {\partial z^{(3)}}{\partial a^{(2)}} \frac {\partial a^{(2)}}{\partial z^{(2)}} \frac {\partial z^{(2)}}{\partial W^{(2)}} = \left[(a^{(3)} - y) \odot a^{(3)} \odot (1 - a^{(3)})\right] W^{(3)} (1,1,1)^T {a^{(1)}}^T$

I get $2\times 1$, $2\times 3$, $3\times 1$, $1\times 2$, so the dimensions do not match up.

I wrote just $(1,1,1)$ for $\frac {\partial a^{(2)}}{\partial z^{(2)}}$, assuming that the $z^{(2)} = (z_1, z_2, z_3)$ are all greater than 0 (ReLU derivative).
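To show the problem numerically (again placeholder values and column vectors, with $(1,1,1)^T$ standing in for the ReLU derivative), the factors in the order I wrote them cannot even be multiplied:

```python
import numpy as np

rng = np.random.default_rng(0)

a1 = rng.random((2, 1))                  # input, 2x1
a3 = rng.random((2, 1))                  # output, 2x1
y  = rng.random((2, 1))
W3 = rng.random((2, 3))                  # maps 3 hidden units to 2 outputs

delta3 = (a3 - y) * a3 * (1 - a3)        # 2x1
relu_grad = np.ones((3, 1))              # (1,1,1)^T, assuming all z^(2) > 0

# 2x1 times 2x3: the shapes do not align, so the product as written fails
try:
    bad = delta3 @ W3 @ relu_grad @ a1.T
except ValueError as e:
    print("shape mismatch:", e)
```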


2 Answers


For your first question, yes, transposing $a^{(2)}$ will do the job, since for each entry of the matrix $w_{ij}^{(3)}$ the derivative includes the multiplier $a_{i}^{(2)}$. So, $a_{1}^{(2)}$ will be in the first column, $a_{2}^{(2)}$ will be in the second column, and so on. This is directly achieved by multiplying $\delta$ (the first three terms) by ${a^{(2)}}^T$. Note that in your indexing, $w_{ij}$ denotes the $j$-th row and $i$-th column.

The second one is a bit tricky. First of all, you're using denominator layout, so the derivative of a vector of size $m$ with respect to a vector of size $n$ has size $n\times m$. (Numerator layout is the more common convention.)

It's all about Layouts

Let's say we have a scalar loss $L$ and two vectors $a, z$ of dimensions $m$ and $n$ respectively. In $\frac{\partial L}{\partial z}=\frac{\partial L}{\partial a}\frac{\partial a}{\partial z}$, according to denominator layout, the first factor is an $m\times 1$ vector and the second is an $n\times m$ matrix, so the dimensions mismatch. If it were numerator layout, we'd have $1\times m$ times $m\times n$ and get a $1\times n$ gradient, still consistent with the layout definition.

This is why you should append to the left as you move forward in denominator layout (because changing the layout transposes the matrix product: $(AB)^T=B^TA^T$): $$\underbrace{\frac{\partial L}{\partial z}}_{n\times 1}=\underbrace{\frac{\partial a}{\partial z}}_{n\times m}\underbrace{\frac{\partial L}{\partial a}}_{m\times 1}$$
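A quick numerical illustration of this transposition rule, with placeholder Jacobians of sizes $m=2$, $n=3$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 3

dL_da_num = rng.random((1, m))      # dL/da in numerator layout, 1xm
da_dz_num = rng.random((m, n))      # da/dz in numerator layout, mxn

# numerator layout: chain by appending to the right
dL_dz_num = dL_da_num @ da_dz_num   # 1xn

# denominator layout stores the transposes, so the chain appends to the left
dL_da_den = dL_da_num.T             # mx1
da_dz_den = da_dz_num.T             # nxm
dL_dz_den = da_dz_den @ dL_da_den   # nx1

# (AB)^T = B^T A^T: the two layouts agree up to a transpose
print(np.allclose(dL_dz_den, dL_dz_num.T))   # True
```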

So, $$\frac {\partial L}{\partial W_{ij}^{(2)}} = \underbrace{\frac {\partial z^{(2)}}{\partial W_{ij}^{(2)}}}_{1\times 3} \underbrace{\frac {\partial a^{(2)}}{\partial z^{(2)}}}_{3\times 3} \underbrace{\frac {\partial z^{(3)}}{\partial a^{(2)}}}_{3\times 2} \underbrace{\frac {\partial L}{\partial z^{(3)}}}_{2\times 1}$$

And everything matches; a full numerical sketch of the backward pass follows after the list below. I've changed two more things:

  • Some of these calculations can be merged and optimised using element-wise multiplications, e.g. the term $\frac {\partial a^{(2)}}{\partial z^{(2)}}$ produces a 3x3 output, but it's a diagonal matrix. This is actually what you've done while calculating gradients in the last layer.
  • I've used $W_{ij}$ because it's easier to reason about a single entry. $\frac{\partial z}{\partial W}$ is a 3D tensor, since the numerator is a vector and the denominator is a matrix. After finding the expressions for each $W_{ij}$ and placing them into the gradient matrix one by one according to denominator layout, you can take out common multiplications and write the final formula.
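To tie everything together, here is a minimal NumPy sketch of the full backward pass for the 2-3-2 network in the question (assuming, as the question does, a ReLU hidden layer, a sigmoid output layer, and the MSE loss with the $\frac{1}{2}$ factor), together with a finite-difference check on one entry of $W^{(2)}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2-3-2 network: x -> ReLU hidden -> sigmoid output, MSE loss with 1/2 factor
W2, b2 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
W3, b3 = rng.standard_normal((2, 3)), rng.standard_normal((2, 1))
x, y = rng.standard_normal((2, 1)), rng.random((2, 1))

def forward(W2, W3):
    a1 = x
    z2 = W2 @ a1 + b2
    a2 = np.maximum(z2, 0)                 # ReLU
    z3 = W3 @ a2 + b3
    a3 = sigmoid(z3)
    loss = 0.5 * np.sum((a3 - y) ** 2)
    return a1, z2, a2, a3, loss

a1, z2, a2, a3, loss = forward(W2, W3)

# last layer: delta3 = dL/dz3, gradient is delta3 a2^T  (2x3, same shape as W3)
delta3 = (a3 - y) * a3 * (1 - a3)
dW3 = delta3 @ a2.T

# hidden layer, denominator layout (append to the left):
# delta2 = (W3^T delta3) * relu'(z2), gradient is delta2 a1^T  (3x2, same shape as W2)
delta2 = (W3.T @ delta3) * (z2 > 0)
dW2 = delta2 @ a1.T

# finite-difference check on one entry of W2
i, j, eps = 1, 0, 1e-6
W2p = W2.copy(); W2p[i, j] += eps
W2m = W2.copy(); W2m[i, j] -= eps
num = (forward(W2p, W3)[-1] - forward(W2m, W3)[-1]) / (2 * eps)
print(dW2[i, j], num)   # the two values should agree closely
```

The hidden-layer $\delta^{(2)}$ is exactly the left-appended product above, collapsed into element-wise operations.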
gunes

Please let me derive this for a fairly general case. I think the derivation is out there somewhere, but it was not easy for me to find, so I am writing it up here.

The notation I follow is based on a CS229 note.

Definitions. Let $J$ be a loss function of a neural network to minimize. Let $x\in \mathbb{R}^{d_0}$ be a single sample (input). Define its deep activations in a cascaded way as follows: $$ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} $$ and $a^{(l)} = g^{(l)}(z^{(l)})$ for $l=1, \dots, L$ where $L$ is the number of layers. Here, $W^{(l)} \in \mathbb{R}^{d_{l} \times d_{l-1}} $ and $b^{(l)} \in \mathbb{R}^{d_l} $ are parameters of the neural network model that we need to optimize. $a^{(l)} \in \mathbb{R}^{d_l} $ is the $l$-th layer activation output, and $g^{(l)} : \mathbb{R}^{d_l} \to \mathbb{R}^{d_l} $ is the corresponding activation function (such as ReLU). The zeroth layer activation is set as the input $a^{(0)} = x$.
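For concreteness, a minimal NumPy sketch of this forward recursion (placeholder layer sizes; ReLU used for every $g^{(l)}$ here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0)

# placeholder layer sizes d_0, ..., d_L and parameters W^(l), b^(l)
dims = [2, 3, 3, 2]                      # d_0=2, d_1=3, d_2=3, d_3=2
Ws = [rng.standard_normal((dims[l], dims[l - 1])) for l in range(1, len(dims))]
bs = [rng.standard_normal((dims[l], 1)) for l in range(1, len(dims))]

x = rng.standard_normal((dims[0], 1))    # single sample, a^(0) = x

# forward pass: z^(l) = W^(l) a^(l-1) + b^(l),  a^(l) = g(z^(l))
a = x
for W, b in zip(Ws, bs):
    z = W @ a + b
    a = relu(z)
print(a.shape)                           # (d_L, 1)
```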

Convention. Except for the final layer $L$, the activation function is the same for all hidden layers $l < L$: $g^{(l)} = g$. Also, we assume $g$ acts pointwise (element-wise), such as ReLU or sigmoid.

Another convention is that the loss function $J$ is defined such that $$ \nabla_{z^{(L)}} J = z^{(L)} - y $$ where $y$ is the target that our neural network model needs to predict. (This holds, for instance, for a linear output layer with the squared-error loss $J = \frac{1}{2}\|z^{(L)} - y\|^2$.) Note that the gradient $\nabla_{z^{(L)}} J$ is a column vector $$ \nabla_{z^{(L)}} J = \begin{bmatrix} \frac{\partial J}{\partial z^{(L)}_1} \\ \vdots \\ \frac{\partial J}{\partial z^{(L)}_{d_L}} \end{bmatrix} \in \mathbb{R}^{d_L}. $$

Motivation. Backpropagation is simply a combination of chain rule applications. Observe first that $\frac{\partial J}{\partial W^{(L)}_{ij}}$ decomposes into $$ \frac{\partial J}{\partial W^{(L)}_{ij}} = \sum_k \frac{\partial J}{\partial z^{(L)}_k} \frac{\partial z^{(L)}_k}{\partial W^{(L)}_{ij}} = \frac{\partial J}{\partial z^{(L)}_i} \frac{\partial z^{(L)}_i}{\partial W^{(L)}_{ij}}. $$ (Note that the summand $\frac{\partial z^{(L)}_k}{\partial W^{(L)}_{ij}}$ vanishes for $k \neq i$, since $z_k^{(L)}$ does not depend on $W^{(L)}_{ij}$ in that case.) The above is computable since we know $\frac{\partial J}{\partial z^{(L)}_i}$, and $\frac{\partial z^{(L)}_i}{\partial W^{(L)}_{ij}} = a^{(L-1)}_j$ is a forward activation output.

Derivation. We generalize this principle to early layers one by one. In particular, $$ \frac{\partial J}{\partial W^{(l)}_{ij}} = \frac{\partial J}{\partial z^{(l)}_i} \frac{\partial z^{(l)}_i}{\partial W^{(l)}_{ij}} = \frac{\partial J}{\partial z^{(l)}_i} a^{(l-1)}_j. \tag{1a} $$ as $ z^{(l)}_i = \sum_j W^{(l)}_{ij} a^{(l-1)}_j $. Similarly, $$ \frac{\partial J}{\partial b^{(l)}_{i}} = \frac{\partial J}{\partial z^{(l)}_i}. \tag{1b} $$

Now, it suffices to obtain $\frac{\partial J}{\partial z^{(l)}_i}$, which is obtained by applying the chain rule in the backward direction (hence the name backpropagation); in particular, we have $$ \frac{\partial J}{\partial z^{(l)}_i} = \sum_k \frac{\partial J}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_i} \tag{2}, $$ which is $$ \frac{\partial J}{\partial z^{(l)}_i} = \sum_k \frac{\partial J}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial a^{(l)}_i} \frac{\partial a^{(l)}_i}{\partial z^{(l)}_i} = \sum_k \frac{\partial J}{\partial z^{(l+1)}_k} W^{(l+1)}_{ki} g'(z^{(l)}_i) $$ as $\frac{\partial z^{(l+1)}_k}{\partial a^{(l)}_i} = W^{(l+1)}_{ki}$ and $\frac{\partial a^{(l)}_i}{\partial z^{(l)}_i} = g'(z^{(l)}_i)$. (Here $g'$ is the derivative of $g$.)

In matrix form, the above equations (1a), (1b), and (2) can be stated as follows: $$ \nabla_{W^{(l)}} J = \nabla_{z^{(l)}} J \cdot a^{(l-1)T}, \tag{1a-m} $$ $$ \nabla_{b^{(l)}} J = \nabla_{z^{(l)}} J, \tag{1b-m} $$ $$ \nabla_{z^{(l)}} J = (W^{(l+1)T} \cdot \nabla_{z^{(l+1)}} J) \odot g'(z^{(l)}) \tag{2-m} $$ Here, $\cdot$ is matrix multiplication, $\odot$ is the Hadamard (element-wise) product, and $\nabla_{W^{(l)}} J$ is the matrix that contains $\frac{\partial J}{\partial W^{(l)}_{ij}}$ as its $i$-th row, $j$-th column element.
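Putting (1a-m), (1b-m), and (2-m) together, here is a minimal NumPy sketch of the backward pass for a generic fully connected network (placeholder layer sizes; ReLU hidden layers and a linear output layer with the squared-error loss, so that $\nabla_{z^{(L)}} J = z^{(L)} - y$ as in the convention above), verified with a finite-difference check:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0)

def relu_grad(z):
    return (z > 0).astype(float)

# placeholder network: ReLU hidden layers, linear output,
# J = 0.5 * ||z^(L) - y||^2  so that  grad_{z^(L)} J = z^(L) - y
dims = [2, 3, 3, 2]
L = len(dims) - 1
Ws = [None] + [rng.standard_normal((dims[l], dims[l - 1])) for l in range(1, L + 1)]
bs = [None] + [rng.standard_normal((dims[l], 1)) for l in range(1, L + 1)]
x, y = rng.standard_normal((dims[0], 1)), rng.standard_normal((dims[-1], 1))

def forward(Ws, bs):
    a, zs, acts = x, [None], [x]
    for l in range(1, L + 1):
        z = Ws[l] @ a + bs[l]
        a = z if l == L else relu(z)     # linear output layer
        zs.append(z); acts.append(a)
    J = 0.5 * np.sum((zs[L] - y) ** 2)
    return zs, acts, J

zs, acts, J = forward(Ws, bs)

# backward pass: (2-m), (1a-m), (1b-m)
dW, db = [None] * (L + 1), [None] * (L + 1)
dz = zs[L] - y                                       # grad_{z^(L)} J
for l in range(L, 0, -1):
    dW[l] = dz @ acts[l - 1].T                       # (1a-m)
    db[l] = dz                                       # (1b-m)
    if l > 1:
        dz = (Ws[l].T @ dz) * relu_grad(zs[l - 1])   # (2-m)

# finite-difference check on one entry of W^(1)
i, j, eps = 0, 1, 1e-6
Wp = [w if w is None else w.copy() for w in Ws]; Wp[1][i, j] += eps
Wm = [w if w is None else w.copy() for w in Ws]; Wm[1][i, j] -= eps
num = (forward(Wp, bs)[-1] - forward(Wm, bs)[-1]) / (2 * eps)
print(dW[1][i, j], num)    # the two values should agree closely
```

The backward loop is just (2-m) applied repeatedly, with (1a-m) and (1b-m) reading off the parameter gradients at each layer.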

le4m
  • Hi, could you please help me out here? https://stats.stackexchange.com/questions/638260/simple-matrix-calculus-but-i-am-struggling-to-understand @le4m, I believe this is a similar question, thanks! – wrek Feb 01 '24 at 05:09