(I am proceeding as shown in "How to extract data/labels back from TensorFlow dataset".)

I have run into a problem in TensorFlow while preparing my data. I first build sequences, which is a list of lists:

sequences = tokenizer.texts_to_sequences(bible_text)

##-->[[5, 1, 914, 32, 1352, 1, 214, 2, 1, 111],
## [2, 1, 111, 31, 252, 2091, 2, 1874, 2, 547, 31, 38, 1, 196, 3, 1, 899, 2, 1, 298, 3, 32, 878, 38, 1, 196, 3, 1, 266],
##  ...]

sequences=pad_sequences(sequences, padding='post')
input_sequences, target_sequences = sequences[:,:-1], sequences[:,1:]
input_sequences = tf.keras.utils.to_categorical(input_sequences, num_classes=vocab_size)
target_sequences = tf.keras.utils.to_categorical(target_sequences, num_classes=vocab_size)

##-->[[[0. 1. 0. ... 0. 0. 0.]
##  [0. 0. 0. ... 0. 0. 0.]
##  [0. 0. 0. ... 0. 0. 0.]
##  ...
##  [1. 0. 0. ... 0. 0. 0.]
##  [1. 0. 0. ... 0. 0. 0.]
##  [1. 0. 0. ... 0. 0. 0.]]
##
## [[0. 1. 0. ... 0. 0. 0.]
##  [0. 0. 0. ... 0. 0. 0.]
##  [0. 0. 0. ... 0. 0. 0.]
##  ...
##  [1. 0. 0. ... 0. 0. 0.]
##  [1. 0. 0. ... 0. 0. 0.]
##  [1. 0. 0. ... 0. 0. 0.]]
##[...]
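
To make the shapes above easier to follow, here is a minimal sketch of the same shift-by-one split and one-hot encoding on a toy array. The names toy_sequences and toy_vocab_size are made up for illustration, and np.eye is used as a stand-in for tf.keras.utils.to_categorical so the snippet runs with NumPy alone; it is not the code I actually ran.

```python
import numpy as np

# Toy padded batch: 2 sequences of length 4. With padding='post',
# pad_sequences fills the tail with 0, so 0 acts as the padding id here.
toy_sequences = np.array([[5, 1, 3, 0],
                          [2, 4, 0, 0]])
toy_vocab_size = 6  # hypothetical vocabulary size

# Inputs drop the last token, targets drop the first one,
# so targets[:, t] is the token that follows inputs[:, t].
toy_inputs = toy_sequences[:, :-1]
toy_targets = toy_sequences[:, 1:]

# Indexing an identity matrix with integer ids one-hot encodes them
# (stand-in for to_categorical).
toy_inputs_1h = np.eye(toy_vocab_size, dtype=np.float32)[toy_inputs]
toy_targets_1h = np.eye(toy_vocab_size, dtype=np.float32)[toy_targets]

print(toy_inputs_1h.shape, toy_targets_1h.shape)
```

Each array gains a trailing axis of size toy_vocab_size, which is why the real arrays end up with shape (number of sequences, 89, 10891).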

I turn them into a dataset

dataset= tf.data.Dataset.from_tensor_slices((input_sequences, target_sequences))

And I split between a validation and a training dataset

# Build validation dataset
validation_dataset = dataset.take(int(len_val))
validation_dataset = (
    validation_dataset
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

# Build training dataset
train_dataset = dataset.skip(int(len_val))
train_dataset = (
    train_dataset
    .shuffle(BUFFER_SIZE)
    .padded_batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

The problem occurs when I try to access a batch of data, say with take(1), and unpack it into two variables to separate inputs from labels, as here:

c1,c2=train_dataset.take(1)
print("Check")

The program stalls and never reaches print("Check"). But if I print train_dataset, I get

<PrefetchDataset shapes: ((64, 89, 10891), (64, 89, 10891)), types: (tf.float32, tf.float32)>

and if I print train_dataset.take(1), I get

<TakeDataset shapes: ((64, 89, 10891), (64, 89, 10891)), types: (tf.float32, tf.float32)>

So it seems to me that the datasets do "have two elements" and that I should be able to split each batch back into inputs and labels. What am I doing wrong?
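
For reference, a minimal sketch of how a single batch can be pulled out of a dataset with this structure, assuming TensorFlow 2.x with eager execution. The array sizes and the name demo_dataset are invented for illustration and are far smaller than the real (64, 89, 10891) batches. The point it illustrates is that take(1) returns a dataset containing one element, and that element is the (inputs, labels) pair. Unpacking the dataset directly with c1, c2 = ... therefore iterates over its elements rather than over the pair inside it. With the real data, the wait may also come from the shuffle buffer being filled with large one-hot tensors before the first batch is produced, but that is an assumption, not something measured.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in arrays (hypothetical sizes): 8 examples, 5 time steps, 3 classes.
inputs = np.zeros((8, 5, 3), dtype=np.float32)
targets = np.ones((8, 5, 3), dtype=np.float32)

demo_dataset = (
    tf.data.Dataset.from_tensor_slices((inputs, targets))
    .batch(4, drop_remainder=True)
)

# take(1) is still a dataset; it holds ONE element, the (inputs, labels) pair.
one_batch = demo_dataset.take(1)

# Option 1: iterate over the dataset and unpack each element.
for c1, c2 in one_batch:
    print(c1.shape, c2.shape)

# Option 2: fetch the first element explicitly, then unpack it.
c1, c2 = next(iter(one_batch))
print(c1.shape, c2.shape)
```

Either form gives one tensor of inputs and one tensor of labels per batch.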

kiriloff
