Consider what matrix multiplication is, and observe the pattern of indices carefully:
$$D_{ij} = \sum_{k}W_{ik} X_{kj}$$
$$\frac{\partial D_{ij}}{\partial W_{ik}} = X_{kj}$$
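As a quick sanity check (a sketch assuming NumPy, with shapes and indices chosen arbitrarily for illustration), perturbing a single weight $W_{ik}$ and measuring the change in $D_{ij}$ by finite differences reproduces $X_{kj}$:

```python
import numpy as np

# Numerical check of dD_ij/dW_ik = X_kj; the shapes and the probed indices
# below are arbitrary choices, not anything fixed by the derivation.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # W is 3x4
X = rng.standard_normal((4, 5))   # X is 4x5, so D = W @ X is 3x5

i, j, k = 1, 2, 3                 # pick one entry D_ij and one weight W_ik to probe
eps = 1e-6

W_plus = W.copy()
W_plus[i, k] += eps               # perturb the single weight W_ik
numeric = ((W_plus @ X)[i, j] - (W @ X)[i, j]) / eps

print(numeric, X[k, j])           # the two numbers agree to roughly 1e-6
```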
For a loss function $L$ as described earlier, the chain rule gives
$$\frac{\partial L}{\partial W_{ik}} =
\sum_j \frac{\partial L}{\partial D_{ij}} \frac{\partial D_{ij}}{\partial W_{ik}} =
\sum_j \frac{\partial L}{\partial D_{ij}} X_{kj} =
\sum_j \frac{\partial L}{\partial D_{ij}} X_{jk}^T
$$
Note that $\partial D_{i'j}/\partial W_{ik} = 0$ for $i'\ne i$: changing $W_{ik}$ only affects row $i$ of $D$. The full chain-rule sum over all entries of $D$ therefore collapses to the terms with $i' = i$, leaving only the sum over $j$.
Since we rewrote $X_{kj}$ as $X^T_{jk}$, the shared inner index $j$ lines up exactly as in matrix multiplication, giving the compact matrix form
$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial D} X^T
$$
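Here is a minimal NumPy sketch of this backward pass. To keep it self-contained, the loss is assumed to be the sum of the entries of $D$, so the upstream gradient $\partial L/\partial D$ is a matrix of ones; any other upstream gradient plugs into the same formula.

```python
import numpy as np

# Backward pass for D = W @ X. The loss is *assumed* here to be L = sum(D),
# so dL/dD is a matrix of ones; this is only to make the example runnable.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))

dL_dD = np.ones((3, 5))           # upstream gradient, same shape as D
dL_dW = dL_dD @ X.T               # the result derived above: dL/dW = (dL/dD) X^T

assert dL_dW.shape == W.shape     # denominator layout: gradient has W's shape

# Finite-difference check of one entry of dL/dW.
i, k = 2, 1
eps = 1e-6
W_plus = W.copy()
W_plus[i, k] += eps
numeric = ((W_plus @ X).sum() - (W @ X).sum()) / eps
print(numeric, dL_dW[i, k])       # should agree to roughly 1e-6
```

The `assert` line makes the shape bookkeeping explicit: $\partial L / \partial W$ comes out with the same shape as $W$, which is also the point of the layout convention discussed below.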
When $X$ is a single column vector $x$ (so $D$ and $\partial L / \partial D$ are column vectors as well), this matrix of partial derivatives $\partial L / \partial W$ can also be computed as the outer product of vectors: $(\partial L / \partial D) \otimes x$.
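For instance (a sketch assuming NumPy, with an arbitrary 3-by-4 example), the matrix-product and outer-product forms produce the same gradient when $X$ is a single column:

```python
import numpy as np

# Vector special case: X is a single column x, so D = W @ x is a column too.
# The names and shapes here are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)        # a single input column, length 4
dL_dD = rng.standard_normal(3)    # upstream gradient, one entry per output

via_matmul = dL_dD[:, None] @ x[None, :]   # (dL/dD) x^T written as a matrix product
via_outer = np.outer(dL_dD, x)             # the same 3x4 matrix as an outer product

print(np.allclose(via_matmul, via_outer))  # True
```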
If you really understand the chain rule and are careful with your indexing, then you should be able to reason through every step of the gradient calculation.
We need to be careful about which matrix calculus layout convention we use: here "denominator layout" is used, meaning $\partial L / \partial W$ has the same shape as $W$ and $\partial L / \partial D$ has the same shape as $D$ (a column vector in the vector case just described).