How to understand the relations in matrix multiplications in deep learning?

Question

In scaled dot-product attention, a "softmaxed" matrix (which has shape (sequence_length, sequence_length), I think?) is multiplied by the V matrix, as shown.

What does the second (purple) matmul actually scale, explained in algebraic terms?

Also, suppose I have a matrix a of shape (home_count, furniture_type_count) that stores the furniture I need for every home, and a matrix b of shape (store_count, furniture_type_count) that stores the price of every furniture type at every store. Then a*transpose(b) gives the total price I need to pay for each home if I buy all the furniture at each store, if I'm not wrong. But when building layers in deep learning, it gets very hard to understand what that multiplication actually does. Is there a good way to understand such operations? For example, how would you explain the purple matmul scale mechanism used in the attention mentioned above?

EDIT: By 'scale' in the picture I meant 'weight'.

Tim · Answer 1 · 2022-04-02T08:31:46.150

The second matrix multiplication doesn't scale anything. The attention mechanism, as described in the Attention Is All You Need paper, corresponds to equation (1) from the paper:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V $$

where the scaling is done in the "Scale" step, which divides by the scaling factor $\sqrt{d_k}$.
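
Written out row by row, equation (1) may be easier to read. This is only a sketch: $q_i$, $k_j$ and $v_j$ denote the $i$-th or $j$-th rows of $Q$, $K$ and $V$, and $a_{ij}$ is a name introduced here for the softmax output, not notation taken from the paper.

$$ \mathrm{Attention}(Q, K, V)_i = \sum_j a_{ij} \, v_j, \qquad a_{ij} = \frac{\exp\big(q_i \cdot k_j / \sqrt{d_k}\big)}{\sum_l \exp\big(q_i \cdot k_l / \sqrt{d_k}\big)} $$

Since $\sum_j a_{ij} = 1$ for every $i$, each output row is a weighted average of the value rows, with the weights running over sequence positions $j$.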

As for the matrix multiplications, the first multiplies the queries $Q$ by the keys $K$ to produce scores, which the penultimate step (softmax) turns into attention weights. The second multiplies those weights by the values $V$ to produce an attention-weighted result. There is also an optional masking step.
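
To make the shapes concrete, here is a minimal NumPy sketch of equation (1) without the optional mask. The function names, the random inputs and the sizes (`seq_len`, `d_k`, `d_v`) are assumptions chosen for illustration, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # first matmul + scale: (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1 over key positions
    return weights @ V, weights          # second matmul: (seq_len, d_v)

# Illustrative sizes and random data (assumptions, not from the paper).
rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 5
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape, out.shape)  # (4, 4) (4, 5)

# Row 0 of the output is a weighted sum of the rows of V,
# using row 0 of the attention weights.
row0 = sum(weights[0, j] * V[j] for j in range(seq_len))
print(np.allclose(out[0], row0))  # True
```

In this sketch the second matmul mixes rows of `V` across sequence positions: output position `i` uses the weights in row `i`, and the same weights are applied to every feature column of `V`.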

I don't understand your furniture example, as it doesn't seem to relate to a neural network using attention. The transformer is a model designed for natural language processing tasks, that is, for processing sequences (sentences, longer texts). Your example doesn't seem to have anything in common with sequence data: there is no temporal dimension in the data, so I can't see how it is relevant to the problem.

"the first multiplies query Q with keys K to produce attention weights" Sorry I wasn't clear, but by 'scaling' I actually meant 'weighting'. I wanted to ask along which dimension this weighting occurs. Does it apply different weights along the sequence, does it weight different features for the entire sequence, or both? — IKnowHowBitcoinWorks, Apr 04 '22 at 09:30
The furniture problem is basically the only way I know how to use matrix multiplication. But in deep learning, matrix multiplications tend to be used to build 'constraints' for different layers. I don't understand how those constraints could be applied with matrices, as in the second 'weighting' multiplication. Are there any topics that specifically cover this problem? The algebra I've learned does not cover it. — IKnowHowBitcoinWorks, Apr 04 '22 at 09:35
