7

Many researchers use neural networks to infer embedding vectors for words, users, or items. Word embeddings, e.g., word2vec, allow people to compute sums, averages, and differences over embeddings.

So does it make sense to multiply two embeddings element-wise? For instance, a 200-d user embedding and a 200-d movie embedding. The multiplication results in a new 200-d vector, which should be able to represent the interaction between the user and the movie. The new vector can then be used as input to any prediction model. Does it make sense?

Munichong
  • 2,095

3 Answers

2

Yes, it does. Here you can find an example of a network that uses multiplication, among other methods, for combining embeddings. As described in my answer there:

The element-wise product $u * v$ is basically an interaction term; it can capture similarities between values (big * big = bigger; small * small = smaller) or discrepancies (negative * positive = negative) (see example here).

So it is a perfectly reasonable way of combining embeddings, but often, as in the example above, people use several different methods in parallel, to produce different kinds of features for the model.
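For concreteness, a minimal sketch (illustrative only, not taken from the linked network) of combining a user embedding and a movie embedding with several operations in parallel:

```python
import numpy as np

# Illustrative 200-d embeddings; in practice these come from a trained model.
user_emb = np.random.randn(200).astype(np.float32)
movie_emb = np.random.randn(200).astype(np.float32)

features = np.concatenate([
    user_emb * movie_emb,   # element-wise product: the interaction term
    user_emb + movie_emb,   # sum
    user_emb - movie_emb,   # difference
])
# The resulting 600-d `features` vector can be fed to any downstream
# prediction model.
```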

Tim
  • 138,066
1

I have been working with word vectors for several weeks. I suspect that, in order to obtain a valid answer to something like "what is blood color?", the network will handle Vec(blood) * Vec(color) better than Vec(blood) + Vec(color) before calculating the cosine similarity with all the words in the database. Alas, I haven't tested it yet.

Some stop words should change the way we operate with vectors. For example, "I want non-American food" should be calculated as Vec(eat) + Vec(food) - Vec(american), as sketched below.
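A rough sketch of that kind of vector arithmetic with gensim (the model file name is hypothetical, and the tokens must match the model's vocabulary):

```python
from gensim.models import KeyedVectors

# Load a pretrained word2vec model (hypothetical file name).
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# "non-American food": add the concepts we keep, subtract the one we negate.
print(vectors.most_similar(positive=["eat", "food"], negative=["american"], topn=5))
```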

My main problem with word vectors is how slow it is when you have to calculate millions of cosine similarities with 300-dimensional vectors ... I haven't found a way to accelerate this.
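For reference, the bulk computation described above is usually written as a single matrix-vector product over a pre-normalised vocabulary matrix; a minimal numpy sketch, assuming `vocab_vectors` is an (n_words, 300) array:

```python
import numpy as np

def batched_cosine(query, vocab_vectors):
    # L2-normalise the vocabulary rows (this can be precomputed once)
    # and the query vector.
    vocab_norm = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    # One matrix-vector product yields the cosine with every word at once.
    return vocab_norm @ query_norm

# Example: scores = batched_cosine(vec_blood * vec_color, vocab_vectors)
#          top10  = np.argsort(-scores)[:10]
```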

  • I'm working on an idea at the moment that combines both approaches: to combine two embeddings A and B, I'm thinking of using 0.5(A*B + A + B); I'm hoping that the result will emphasize any similarities between the two embeddings (one of which is a word or phrase embedding and the other describes the context in which it is to be understood) without losing information about the parts of the two topics that are not included in their intersection. – occipita Feb 21 '23 at 12:35
1

Not only does it make sense, it is one of the key operations in one of the biggest breakthroughs in network design of recent years: the idea of "attention", as used by, e.g., Google Translate, ChatGPT and all other GPT-based applications, Stable Diffusion, and many other recent machine learning systems.

Attention is essentially a database lookup over the values that are currently being examined by a network. For instance, in the transformer architecture, the inputs to the attention layers are usually keys, queries, and values, all of which are word embeddings for words in the current input context or words that have recently been generated in the output. The attention layer then calculates dot products between each query and the keys (i.e., component-wise multiplications followed by summing to a single scalar), normalizes them with a softmax so that they sum to 1, and then uses those as weights to add up all the values. This produces an output embedding that is most similar to the values associated with the keys that are most similar to the queries, but contains a little of all of the values mixed in.
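A minimal sketch of the single-head scaled dot-product attention described above (shapes and names are illustrative and not tied to any particular library):

```python
import numpy as np

def attention(queries, keys, values):
    # queries: (n_q, d); keys: (n_k, d); values: (n_k, d_v)
    d = queries.shape[-1]
    # Dot product of each query with every key, scaled by sqrt(d).
    scores = queries @ keys.T / np.sqrt(d)
    # Softmax so that each query's weights over the keys sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values: dominated by the best-matching keys,
    # with a little of every value mixed in.
    return weights @ values

q, k, v = np.random.randn(4, 64), np.random.randn(10, 64), np.random.randn(10, 64)
out = attention(q, k, v)   # shape (4, 64)
```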

This has turned out to be extremely useful, allowing networks to be built that work on much larger contexts than would otherwise be possible because they are able to select only the parts of the context that are currently relevant for the output they are building. This means that tasks that previously required recurrent networks to accumulate information over a large context can now be addressed by feed-forward networks that are much simpler to train.

For more information, the paper Attention Is All You Need (Vaswani et al.) is considered the best starting point.

occipita
  • 111
  • 2