(I am proceeding as shown in "How to extract data/labels back from TensorFlow dataset".)
In TensorFlow, I run into a problem when preparing the data. I first build sequences, which is a list of lists:
sequences = tokenizer.texts_to_sequences(bible_text)
##-->[[5, 1, 914, 32, 1352, 1, 214, 2, 1, 111],
## [2, 1, 111, 31, 252, 2091, 2, 1874, 2, 547, 31, 38, 1, 196, 3, 1, 899, 2, 1, 298, 3, 32, 878, 38, 1, 196, 3, 1, 266],
## ...]
sequences = pad_sequences(sequences, padding='post')
input_sequences, target_sequences = sequences[:,:-1], sequences[:,1:]
input_sequences = tf.keras.utils.to_categorical(input_sequences, num_classes=vocab_size)
target_sequences = tf.keras.utils.to_categorical(target_sequences, num_classes=vocab_size)
##-->[[[0. 1. 0. ... 0. 0. 0.]
## [0. 0. 0. ... 0. 0. 0.]
## [0. 0. 0. ... 0. 0. 0.]
## ...
## [1. 0. 0. ... 0. 0. 0.]
## [1. 0. 0. ... 0. 0. 0.]
## [1. 0. 0. ... 0. 0. 0.]]
##
## [[0. 1. 0. ... 0. 0. 0.]
## [0. 0. 0. ... 0. 0. 0.]
## [0. 0. 0. ... 0. 0. 0.]
## ...
## [1. 0. 0. ... 0. 0. 0.]
## [1. 0. 0. ... 0. 0. 0.]
## [1. 0. 0. ... 0. 0. 0.]]
##[...]
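To make the slicing step concrete, here is a minimal sketch of the post-padding and the one-token shift on toy data. It uses plain Python lists as a stand-in for pad_sequences and the NumPy slicing above; the helper name pad_post and the toy token ids are hypothetical and chosen only for illustration.

```python
def pad_post(seqs, value=0):
    """Right-pad every sequence with `value` up to the longest length.

    Hypothetical stand-in for pad_sequences(..., padding='post').
    """
    max_len = max(len(s) for s in seqs)
    return [list(s) + [value] * (max_len - len(s)) for s in seqs]


# Toy token ids, not the real tokenizer output.
toy_sequences = [[5, 1, 914], [2, 1, 111, 31, 252]]
padded = pad_post(toy_sequences)

# Same idea as sequences[:, :-1] and sequences[:, 1:]:
# the target is the input shifted left by one token.
toy_inputs = [row[:-1] for row in padded]
toy_targets = [row[1:] for row in padded]

for inp, tgt in zip(toy_inputs, toy_targets):
    print(inp, "->", tgt)
```

Each input row and its target row are one token shorter than the padded sequence they come from.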
I turn them into a dataset:
dataset = tf.data.Dataset.from_tensor_slices((input_sequences, target_sequences))
Then I split it into a validation and a training dataset:
# Build validation dataset
validation_dataset = dataset.take(int(len_val))
validation_dataset = (
    validation_dataset
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))
# Build training dataset
train_dataset = dataset.skip(int(len_val))
train_dataset = (
    train_dataset
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))
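For a sense of scale, the size of the one-hot tensors flowing through this pipeline can be estimated from the shapes and dtype that printing train_dataset reports, (64, 89, 10891) and tf.float32. This is only a back-of-the-envelope sketch: BUFFER_SIZE is not given here, so the value below is a made-up placeholder, and the shuffle buffer is assumed to hold up to BUFFER_SIZE unbatched examples because shuffle is applied before padded_batch.

```python
# Shapes and dtype as reported when printing train_dataset.
BATCH = 64
TIME_STEPS = 89
VOCAB = 10891
BYTES_PER_FLOAT32 = 4

# One example is an (input, target) pair of one-hot matrices.
bytes_per_tensor = TIME_STEPS * VOCAB * BYTES_PER_FLOAT32
bytes_per_example = 2 * bytes_per_tensor
bytes_per_batch = BATCH * bytes_per_example

# ASSUMPTION: the real BUFFER_SIZE is not shown; 10_000 is only a placeholder.
ASSUMED_BUFFER_SIZE = 10_000
bytes_in_shuffle_buffer = ASSUMED_BUFFER_SIZE * bytes_per_example

MIB = 1024 ** 2
GIB = 1024 ** 3
print(f"per example:    {bytes_per_example / MIB:.1f} MiB")
print(f"per batch:      {bytes_per_batch / MIB:.1f} MiB")
print(f"shuffle buffer: {bytes_in_shuffle_buffer / GIB:.1f} GiB (placeholder size)")
```

If the estimate for the shuffle buffer is large compared with available memory, filling it before the first batch is produced could take a long time, so it may be worth checking the actual BUFFER_SIZE against these numbers.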
The problem occurs when I try to access one batch of data, say with take(1), and assign it to two variables to split it into inputs and labels, as here:
c1,c2=train_dataset.take(1)
print("Check")
The program stalls and never reaches print("Check"). But if I print train_dataset, I get
<PrefetchDataset shapes: ((64, 89, 10891), (64, 89, 10891)), types: (tf.float32, tf.float32)>
and for train_dataset.take(1) I get
<TakeDataset shapes: ((64, 89, 10891), (64, 89, 10891)), types: (tf.float32, tf.float32)>
So it seems to me that the datasets do "have two elements" and that I should be able to "resplit" them into inputs and labels. What am I doing wrong?
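One thing that can be checked independently of the stall is how Python tuple unpacking treats an iterable: it iterates over the right-hand side, so an iterable that yields a single (inputs, labels) pair provides one item, not two. The sketch below uses a plain generator as a hypothetical stand-in for train_dataset.take(1), with no TensorFlow involved; it assumes a tf.data dataset behaves like any other Python iterable in this respect, and it does not by itself explain why the program hangs.

```python
def fake_take_one():
    """Hypothetical stand-in for train_dataset.take(1): yields ONE (inputs, labels) pair."""
    yield ("inputs_batch", "labels_batch")


# Unpacking iterates over the right-hand side, which produces a single item
# (the pair itself), so asking for two targets raises ValueError.
try:
    c1, c2 = fake_take_one()
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)
print("direct unpacking:", unpack_error)

# Pulling the single element out first, then unpacking the pair, works.
c1, c2 = next(iter(fake_take_one()))
print(c1, c2)

# Equivalent loop form: each iteration yields one (inputs, labels) pair.
for c1, c2 in fake_take_one():
    print(c1, c2)
```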