Self-attention is not available as a Keras layer at the moment. The attention layers you can find in the tensorflow.keras docs are two:
- AdditiveAttention() layers, implementing Bahdanau attention,
- Attention() layers, implementing Luong attention.
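For context, here is a minimal sketch of how these two built-in layers are typically called for cross-attention between a query sequence and a value sequence (the shapes and the toy model below are just illustrative, not part of the original answer):

```python
import tensorflow as tf

query = tf.keras.Input(shape=(10, 64))   # (batch, query_len, dim)
value = tf.keras.Input(shape=(20, 64))   # (batch, value_len, dim)

# Both layers take a list [query, value]; key defaults to value.
luong = tf.keras.layers.Attention()([query, value])             # dot-product (Luong)
bahdanau = tf.keras.layers.AdditiveAttention()([query, value])  # additive (Bahdanau)

model = tf.keras.Model([query, value], [luong, bahdanau])
model.summary()
```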
For self-attention, you need to write your own custom layer.
I suggest you take a look at this TensorFlow tutorial on how to implement Transformers from scratch. The Transformer is the model that popularized the concept of self-attention, and by studying it you can figure out a more general implementation. In particular, check the Multi-Head Attention section, where they develop a custom MultiHeadAttention() layer: that is where all the attention-related action happens. Study how the Q, K, V tensors are used there to compute the attention formula.
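To make the idea concrete, below is a minimal sketch of such a custom layer (single-head, no masking; the class name and shapes are my own, not from the tutorial). It computes the scaled dot-product attention formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, with Q, K, V all derived from the same input:

```python
import tensorflow as tf

class SelfAttention(tf.keras.layers.Layer):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    def __init__(self, d_model, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        # Learned projections for queries, keys and values.
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

    def call(self, x):
        # Self-attention: Q, K and V are all projections of the same input x.
        q = self.wq(x)                                    # (batch, seq_len, d_model)
        k = self.wk(x)
        v = self.wv(x)
        scores = tf.matmul(q, k, transpose_b=True)        # (batch, seq_len, seq_len)
        scores /= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        weights = tf.nn.softmax(scores, axis=-1)          # attention weights
        return tf.matmul(weights, v)                      # (batch, seq_len, d_model)

# Usage with illustrative shapes:
x = tf.random.normal((2, 10, 64))
out = SelfAttention(d_model=64)(x)   # -> (2, 10, 64)
```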
It won't be easy, but it's certainly a super interesting exercise. If you make some cool model out of it, please share the link to a GitHub repository. Good luck!
EDIT:
There is a trick you can use: since self-attention is of the multiplicative (dot-product) kind, you can use an Attention() layer and feed it the same tensor twice (as Q and V, and indirectly as K too, since the key defaults to the value).
You can't build the model with the Sequential API; you need the Functional one. So you'd get something like:

attention = Attention(use_scale=True)([X, X])

where X is the tensor on which you want to compute self-attention.
Note the use_scale=True argument: it scales the attention scores, analogous to the scaling in the original Transformer paper. Its purpose is to prevent vanishing gradients (which occur in extreme regions of the softmax). The only difference is that in this layer the scaling parameter is learned instead of being a fixed scalar. It defaults to False.
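Putting the trick together, a minimal Functional-API sketch might look like this (the vocabulary size, embedding dimension and the classification head are placeholders I chose for illustration):

```python
import tensorflow as tf
from tensorflow.keras.layers import (Input, Embedding, Attention,
                                     GlobalAveragePooling1D, Dense)

inputs = Input(shape=(100,))                          # token ids
x = Embedding(input_dim=10000, output_dim=64)(inputs)
x = Attention(use_scale=True)([x, x])                 # same tensor as query and value -> self-attention
x = GlobalAveragePooling1D()(x)
outputs = Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
```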
EDIT:
Attention layers are also available via the maximal library. It is built on top of TensorFlow and lets you implement Transformer-based layers. In this case:
import maximal as ml
from maximal.layers import Attention