
In the first derivation of dL/dW, I use the rule for the derivative of a scalar with respect to a matrix and then apply the chain rule.

\begin{gather*} Y\ =\ XW\ +\ B\\ X=\begin{bmatrix} x_{0} & x_{1} & x_{2} \end{bmatrix} ,\ Y=\begin{bmatrix} y_{0} & y_{1} \end{bmatrix} ,\ W=\begin{bmatrix} w_{00} & w_{01}\\ w_{10} & w_{11}\\ w_{20} & w_{21} \end{bmatrix} ,\ B=\begin{bmatrix} b_{0} & b_{1} \end{bmatrix} \end{gather*}

When it comes to derivatives with respect to a vector, the rules I found assume column vectors. Are the rules the same for row vectors? (In numerator layout, dY/dL is a column vector. However, the rules don't say that Y has to be a column vector; instead they say "If the numerator y is of size m and the denominator x of size n".)

\begin{gather*}
\left(\frac{\partial L}{\partial W}\right)^{T} =\begin{bmatrix} \frac{\partial L}{\partial w_{00}} & \frac{\partial L}{\partial w_{01}}\\ \frac{\partial L}{\partial w_{10}} & \frac{\partial L}{\partial w_{11}}\\ \frac{\partial L}{\partial w_{20}} & \frac{\partial L}{\partial w_{21}} \end{bmatrix} =\begin{bmatrix} \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{00}} & \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{01}}\\ \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{10}} & \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{11}}\\ \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{20}} & \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{21}} \end{bmatrix}\\ \\
\text{Focus on one term:}\\
y_{0} = w_{00} x_{0} +w_{10} x_{1} +w_{20} x_{2} +b_{0}\\
y_{1} = w_{01} x_{0} +w_{11} x_{1} +w_{21} x_{2} +b_{1}\\ \\
\frac{\partial Y}{\partial w_{00}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{00}}\\ \frac{\partial y_{1}}{\partial w_{00}} \end{bmatrix} =\begin{bmatrix} x_{0}\\ 0 \end{bmatrix}\\
\color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial w_{00}} = \begin{bmatrix} \color{red}{\frac{\partial L}{\partial y_{0}}} & \color{red}{\frac{\partial L}{\partial y_{1}}} \end{bmatrix}\begin{bmatrix} x_{0}\\ 0 \end{bmatrix} =\color{red}{\frac{\partial L}{\partial y_{0}}} \ x_{0} + \color{red}{\frac{\partial L}{\partial y_{1}}} \cdot 0 =\color{red}{\frac{\partial L}{\partial y_{0}}} \ x_{0}\\ \\
\frac{\partial Y}{\partial w_{10}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{10}}\\ \frac{\partial y_{1}}{\partial w_{10}} \end{bmatrix} =\begin{bmatrix} x_{1}\\ 0 \end{bmatrix} ,\ \frac{\partial Y}{\partial w_{01}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{01}}\\ \frac{\partial y_{1}}{\partial w_{01}} \end{bmatrix} =\begin{bmatrix} 0\\ x_{0} \end{bmatrix} ,\ \frac{\partial Y}{\partial w_{11}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{11}}\\ \frac{\partial y_{1}}{\partial w_{11}} \end{bmatrix} =\begin{bmatrix} 0\\ x_{1} \end{bmatrix} ,\\
\frac{\partial Y}{\partial w_{20}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{20}}\\ \frac{\partial y_{1}}{\partial w_{20}} \end{bmatrix} =\begin{bmatrix} x_{2}\\ 0 \end{bmatrix} ,\ \frac{\partial Y}{\partial w_{21}} = \begin{bmatrix} \frac{\partial y_{0}}{\partial w_{21}}\\ \frac{\partial y_{1}}{\partial w_{21}} \end{bmatrix} =\begin{bmatrix} 0\\ x_{2} \end{bmatrix}\\ \\
\text{Finally:}\\
\left(\frac{\partial L}{\partial W}\right)^{T} =\begin{bmatrix} \frac{\partial L}{\partial y_{0}} \ x_{0} & \frac{\partial L}{\partial y_{1}} \ x_{0}\\ \frac{\partial L}{\partial y_{0}} \ x_{1} & \frac{\partial L}{\partial y_{1}} \ x_{1}\\ \frac{\partial L}{\partial y_{0}} \ x_{2} & \frac{\partial L}{\partial y_{1}} \ x_{2} \end{bmatrix} =\begin{bmatrix} x_{0}\\ x_{1}\\ x_{2} \end{bmatrix}\begin{bmatrix} \frac{\partial L}{\partial y_{0}} & \frac{\partial L}{\partial y_{1}} \end{bmatrix} = X^{T}\color{red}{\frac{\partial L}{\partial Y}}
\end{gather*}
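
As a sanity check on the result above, here is a minimal NumPy sketch that compares $X^{T}\,\partial L/\partial Y$ against central finite differences. The loss `L = sum(Y**2)` and the random values are assumptions made only for illustration; any differentiable scalar loss would do.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 3))   # row vector, as in the setup above
W = rng.normal(size=(3, 2))
B = rng.normal(size=(1, 2))

def loss(W):
    # Hypothetical scalar loss, chosen only so that dL/dY is easy to write down.
    Y = X @ W + B
    return float(np.sum(Y ** 2))

# Analytic gradient laid out in the shape of W: X^T (dL/dY), with dL/dY = 2Y here.
Y = X @ W + B
dL_dY = 2.0 * Y                 # shape (1, 2)
analytic = X.T @ dL_dY          # shape (3, 2)

# Central finite differences, perturbing one entry of W at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Entry $(i, j)$ of `numeric` approximates $\partial L/\partial w_{ij}$, so agreement with `analytic` supports the $3\times 2$ form $X^{T}\,\partial L/\partial Y$.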

Is dY/dW, the derivative of a vector with respect to a matrix, a third degree tensor? Am I allowed to do the following derivation? (writing a 3d tensor as a vector of 2d matrices)

\begin{gather*} \frac{\partial L}{\partial W} =\ \color{red}{\frac{\partial L}{\partial Y}}\frac{\partial Y}{\partial W} =\begin{bmatrix} \color{red}{\frac{\partial L}{\partial y_{0}}} & \color{red}{\frac{\partial L}{\partial y_{1}}} \end{bmatrix}\begin{bmatrix} \frac{\partial y_{0}}{\partial W}\\ \frac{\partial y_{1}}{\partial W} \end{bmatrix} =\color{red}{\frac{\partial L}{\partial y_{0}}}\frac{\partial y_{0}}{\partial W} +\color{red}{\frac{\partial L}{\partial y_{1}}}\frac{\partial y_{1}}{\partial W}\\ =\ \color{red}{\frac{\partial L}{\partial y_{0}}}\begin{bmatrix} \frac{\partial y_{0}}{\partial w_{00}} & \frac{\partial y_{0}}{\partial w_{01}}\\ \frac{\partial y_{0}}{\partial w_{10}} & \frac{\partial y_{0}}{\partial w_{11}}\\ \frac{\partial y_{0}}{\partial w_{20}} & \frac{\partial y_{0}}{\partial w_{21}} \end{bmatrix}^{T} +\color{red}{\frac{\partial L}{\partial y_{1}}}\begin{bmatrix} \frac{\partial y_{1}}{\partial w_{00}} & \frac{\partial y_{1}}{\partial w_{01}}\\ \frac{\partial y_{1}}{\partial w_{10}} & \frac{\partial y_{1}}{\partial w_{11}}\\ \frac{\partial y_{1}}{\partial w_{20}} & \frac{\partial y_{1}}{\partial w_{21}} \end{bmatrix}^{T}\\ =\ \color{red}{\frac{\partial L}{\partial y_{0}}}\begin{bmatrix} x_{0} & 0\\ x_{1} & 0\\ x_{2} & 0 \end{bmatrix}^{T} +\color{red}{\frac{\partial L}{\partial y_{1}}}\begin{bmatrix} 0 & x_{0}\\ 0 & x_{1}\\ 0 & x_{2} \end{bmatrix}^{T} =\begin{bmatrix} \color{red}{\frac{\partial L}{\partial y_{0}}} x_{0} & \color{red}{\frac{\partial L}{\partial y_{1}}} x_{0}\\ \color{red}{\frac{\partial L}{\partial y_{0}}} x_{1} & \color{red}{\frac{\partial L}{\partial y_{1}}} x_{1}\\ \color{red}{\frac{\partial L}{\partial y_{0}}} x_{2} & \color{red}{\frac{\partial L}{\partial y_{1}}} x_{2} \end{bmatrix}^{T}\\ =\ \left( X^{T}\color{red}{\frac{\partial L}{\partial Y}}\right)^{T} \end{gather*}

Edit: Found a similar question, but the final answer is different.

BPDev
  • It helps to consider the dimensions. Here, $W$ has $3\times 2$ components and therefore is in $\mathbb R^6,$ and $Y$ has $2$ components, and therefore is in $\mathbb R^2.$ The map $W\to XW+B=Y$ therefore is a function from $\mathbb R^6$ to $\mathbb R^2.$ By definition, the derivative can be represented as a $6\times 2$ matrix. Therefore, any answer you give must act like such a matrix. In particular, it will have $12$ entries. – whuber May 29 '23 at 16:08
  • @whuber Given that ∂y0/∂W has dimensions 2 x 3, does ∂Y/∂W have dimensions 2 x 2 x 3 (or 2 x 3 x 2?) ? (12 entries, but isn't a matrix) – BPDev May 29 '23 at 17:59
  • You have many choices of how to express the derivative. What matters is that your notation needs to be interpretable as a pair of linear functions of the six components of $W.$ – whuber May 30 '23 at 14:44
  • This sounds like a purely mathematical question. It basically boils down to 'how to perform matrix calculus'. – Firebug May 30 '23 at 15:36
  • My post at https://stats.stackexchange.com/a/257616/919 provides basic definitions and an example of how to carry out this calculation rigorously and clearly (for a more complicated matrix function). – whuber May 31 '23 at 22:05

1 Answer

The choice of layout notation and the treatment of vector variables (i.e. row vs. column) is usually tricky and may yield inconsistent results, so it's hard to say that the rules are the same for both. For example, in some cases the factors in the chain rule are multiplied from right to left, as opposed to the left-to-right order we're accustomed to.

For your second question, the matrix multiplication is not valid because the left factor, $\partial L/\partial Y$, has dimension $1\times 2$, while the right factor, your stacked expansion of $\partial Y/\partial W$, has dimension $4\times 3$. But the expression

$$\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y_0}\frac{\partial y_0}{\partial W}+\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial W}$$

is correct by the chain rule of multivariate calculus. Therefore, the rest follows.
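
To make the sum concrete, the sketch below builds each $\partial y_k/\partial W$ as a matrix and adds them up, weighted by $\partial L/\partial y_k$. The numeric values of $X$ and $\partial L/\partial Y$ are made up for illustration, and laying each $\partial y_k/\partial W$ out in the shape of $W$ is an assumed convention; numerator layout would give the transpose.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0]])        # 1 x 3, illustrative values
dL_dY = np.array([[0.5, -1.5]])        # 1 x 2, illustrative values

def dyk_dW(k, n_out=2):
    # d y_k / d W in the shape of W: only column k is nonzero, and it equals X^T.
    G = np.zeros((X.shape[1], n_out))
    G[:, k] = X[0]
    return G

# Multivariate chain rule: dL/dW = sum_k (dL/dy_k) * (dy_k/dW).
dL_dW = sum(dL_dY[0, k] * dyk_dW(k) for k in range(2))

print(dL_dW)
print(np.allclose(dL_dW, X.T @ dL_dY))
```

Each term is a scalar times a matrix, so no product between a row vector and a stack of matrices is ever needed.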

utobi
gunes
  • ∂Y/∂W has dimensions 4 x 3? I am not sure where the 4 comes from. Since ∂y0/∂W has dimensions 2 x 3, I am wondering if ∂Y/∂W has dimensions 2 x 2 x 3. – BPDev May 28 '23 at 22:31
  • I meant the expansion of ∂Y/∂W you used, where the first row is ∂y0/∂W and the second row is ∂y1/∂W. Each of these elements has dimension 2 x 3. Concatenating them along the row dimension, as you did, gives 4 x 3. This can't be multiplied with the left-hand factor. On the other hand, ∂Y/∂W is actually a third degree tensor and does not play along well with the rest of the expressions, which is a rabbit hole in matrix calculus – gunes May 29 '23 at 09:54
  • I didn't realize I was concatenating dimensions. Is this an invalid way to write a third degree tensor (without a cube)? (I am thinking that multiplying 1 x 2 with 2 x 2 x 3 is somehow allowed and returns 1 x 2 x 3 or 2 x 3) – BPDev May 29 '23 at 18:09
  • If you write matrices on top of each other (or side by side), that is block matrix concatenation. I wouldn't think of it as a 3d tensor. The link you mentioned writes 3d tensors with additional brackets inside, which is different from yours (I don't think it's a standard way of writing 3d tensors, btw). Moreover, multiplication of a 2d and a 3d tensor is not very well defined here. – gunes May 29 '23 at 19:25
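
One way to sidestep the ambiguity discussed in these comments is to name the indices explicitly instead of relying on matrix-product notation. As a sketch under assumed conventions (the axis order `(k, i, j)` for $\partial y_k/\partial w_{ij}$ is an arbitrary choice, not a standard, and the numbers are illustrative), `np.einsum` can contract a length-2 vector with a $2\times 3\times 2$ array:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # illustrative input values
dL_dy = np.array([0.5, -1.5])          # illustrative upstream gradient

# T[k, i, j] = d y_k / d w_ij = x_i if j == k else 0   (shape 2 x 3 x 2)
T = np.einsum("i,kj->kij", x, np.eye(2))

# Contract over the output index k: (dL/dW)_ij = sum_k dL/dy_k * T[k, i, j]
dL_dW = np.einsum("k,kij->ij", dL_dy, T)

print(T.shape)
print(np.allclose(dL_dW, np.outer(x, dL_dy)))
```

Here the "product" of a vector with a third-order array is well defined only because the contracted index is stated; a different axis order would need a different index string.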