
I have a list of stock price sequences with 20 timesteps each. That's a 2D array of shape (total_seq, 20). I can reshape it into (total_seq, 20, 1) for concatenation with other features.

I also have a news title with 10 words for each timestep, so I have a 3D array of shape (total_seq, 20, 10) of news tokens from Tokenizer.texts_to_sequences() and sequence.pad_sequences().
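
For concreteness, the data preparation looks roughly like this (a minimal sketch; total_seq and the title strings are placeholders for illustration):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

total_seq = 100                                     # example value

## Stock prices: (total_seq, 20) -> (total_seq, 20, 1)
prices = np.random.rand(total_seq, 20)
prices = prices.reshape(total_seq, 20, 1)

## News titles: one string per timestep, tokenized and padded to 10 words
titles = ['market rallies on earnings news'] * (total_seq * 20)  # placeholder titles
tokenizer = Tokenizer()
tokenizer.fit_on_texts(titles)
tokens = tokenizer.texts_to_sequences(titles)       # list of word-index lists
news = pad_sequences(tokens, maxlen=10)             # (total_seq * 20, 10)
news = news.reshape(total_seq, 20, 10)              # (total_seq, 20, 10)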

I want to concatenate the news embedding to the stock price and make predictions.

My idea is that the news embedding should return a tensor of shape (total_seq, 20, embed_size) so that I can concatenate it with the stock prices of shape (total_seq, 20, 1) and then feed the result into LSTM layers.

To do that, I would have to convert the news tokens of shape (total_seq, 20, 10) into shape (total_seq, 20, 10, embed_size) using the Embedding() layer.

But in Keras, the Embedding() layer takes a 2D tensor, not a 3D tensor. How do I get around this problem?

Assume that Embedding() accepted a 3D tensor. Then, after getting a 4D tensor as output, I would remove the 3rd dimension by using an LSTM to return only the last word's embedding, so the output of shape (total_seq, 20, 10, embed_size) would be reduced to (total_seq, 20, embed_size).

But then I would run into another problem: LSTM accepts a 3D tensor, not a 4D one.

So how do I get around Embedding and LSTM not accepting my inputs?

offchan
    It's a bit funky how to do this, but I'm writing an example for you! – Jan van der Vegt Aug 11 '17 at 13:09
  • @JanvanderVegt Please check my question again. I added more details! Thanks. – offchan Aug 11 '17 at 13:26
  • 1
    Instead of what you wanted I got to an output after the embeddings of (total_seq, 20, 10 * embed_size + 1) by concatenating the embeddings for each of the words and the stock price, would that solve your problem? – Jan van der Vegt Aug 11 '17 at 13:42
  • It would be better if I can somehow summarize the news into one small embedding for each time step. – offchan Aug 11 '17 at 17:32
  • 2
    That's easy, add a TimeDistrbuted(Dense(dim)) that takes the concatenated words. You could use a convolution or RNN here on the words but with only 10 words I think Dense might be better. I updated my example, I hope this is what you were looking for. – Jan van der Vegt Aug 11 '17 at 19:16
  • Jan van der Vegt, Great demo for concatenate... I have a similar scenario but I don't use Embedding layer, I use a deep autoencoder to compress the input from 200 dimensions to 20 dimensions and then concatenate the results and feed into LSTM for further analysis. I know it's doable based on the code above, but my python skill is not good enough to do the trick. Please help me to figure this out with a sample code. Thanks. – Canal fishing Aug 22 '18 at 17:17

1 Answer


I'm not entirely sure this is the cleanest solution, but I stitched everything together. Each of the 10 word positions gets its own input, but that shouldn't be too much of a problem. The idea is to make one Embedding layer and use it multiple times. First we will generate some data:

import numpy as np

n_samples = 1000
time_series_length = 50
news_words = 10
news_embedding_dim = 16
word_cardinality = 50

## Dummy stock prices: (n_samples, time_series_length, 1)
x_time_series = np.random.rand(n_samples, time_series_length, 1)

## Dummy word indices: (n_samples, time_series_length, news_words)
x_news_words = np.random.choice(np.arange(word_cardinality), replace=True,
                                size=(n_samples, time_series_length, news_words))

## Split into one 2D array per word position, matching the separate inputs below
x_news_words = [x_news_words[:, :, i] for i in range(news_words)]

## Dummy binary targets
y = np.random.randint(2, size=(n_samples,))

Now we will define the layers:

from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

## Input for the normal time series
time_series_input = Input(shape=(time_series_length, 1), name='time_series')

## Every word position gets its own input
news_word_inputs = [Input(shape=(time_series_length,), name='news_word_' + str(i + 1))
                    for i in range(news_words)]

## Shared embedding layer, reused for every word position
news_word_embedding = Embedding(word_cardinality, news_embedding_dim,
                                input_length=time_series_length)

## Apply the shared embedding to every word position
news_words_embeddings = [news_word_embedding(inp) for inp in news_word_inputs]

## Concatenate the time series input and the embedding outputs along the feature axis
concatenated_inputs = concatenate([time_series_input] + news_words_embeddings, axis=-1)

## Feed into an LSTM
lstm = LSTM(16)(concatenated_inputs)

## Output, in this case a single binary classification
output = Dense(1, activation='sigmoid')(lstm)
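
For completeness, the model can be built and compiled like this (a quick sketch; binary cross-entropy matches the sigmoid output above, but any suitable loss and optimizer will do):

model = Model(inputs=[time_series_input] + news_word_inputs, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])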

After compiling the model we can just fit it like this:

model.fit([x_time_series] + x_news_words, y)
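
(If the shapes don't line up, model.summary() is a quick way to inspect the output shape of every layer.)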

EDIT:

Following up on what you mentioned in the comments, you can add a Dense layer that summarizes the news into one small vector per timestep and concatenate that with your time series (stock prices):

from keras.layers import TimeDistributed

combined_news_embedding = 16  # size of the summarized news vector per timestep (pick as you like)

## Summarize the news: concatenate all word embeddings per timestep...
news_words_concat = concatenate(news_words_embeddings, axis=-1)

## ...and compress them with a Dense layer applied at every timestep
news_words_transformation = TimeDistributed(Dense(combined_news_embedding))(news_words_concat)

## New concatenation with the time series
concatenated_inputs = concatenate([time_series_input, news_words_transformation], axis=-1)
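
The LSTM and output layers from before stay the same; they just consume this new, much narrower concatenated_inputs (reusing the names defined above):

lstm = LSTM(16)(concatenated_inputs)
output = Dense(1, activation='sigmoid')(lstm)
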
Jan van der Vegt