
Is there an easy way to get the entire set of elements in a tf.data.Dataset? i.e., I want to set the batch size of the Dataset to the size of my dataset without explicitly passing it the number of elements. This would be useful for a validation dataset, where I want to measure accuracy on the entire dataset in one go. I'm surprised there isn't a method to get the size of a tf.data.Dataset.

Milad
    You can also use `tf.metrics.accuracy` and run `sess.run(update_op)` on each batch of the validation data. At the end, calling `sess.run(accuracy)` should give you the total accuracy. – Olivier Moindrot Jan 07 '18 at 15:12
    I am getting convinced it is a waste of time to use the TensorFlow APIs and estimators. I spent so much time learning them, and then you face one limitation after another, like the one you have mentioned. I would just create my own dataset and batch generator. – Miladiouss Apr 11 '18 at 22:30

6 Answers


In TensorFlow 2.0

You can enumerate the dataset using `as_numpy_iterator`:

for element in Xtrain.as_numpy_iterator():
  print(element)
Abhishek S

In short, there is no good way to get the size/length; tf.data.Dataset is built for pipelines of data, so it has an iterator structure (in my understanding, and according to my read of the Dataset ops code). From the programmer's guide:

A tf.data.Iterator provides the main way to extract elements from a dataset. The operation returned by Iterator.get_next() yields the next element of a Dataset when executed, and typically acts as the interface between input pipeline code and your model.

And, by their nature, iterators do not have a convenient notion of size/length; see here: Getting number of elements in an iterator in Python
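
That said, if you really need the count, one simple (if costly) workaround is to exhaust the dataset once and count the elements. A one-line sketch, assuming eager execution (TF 2.x), where a Dataset is directly iterable:

# Counts elements by iterating the entire dataset once (eager mode / TF 2.x).
num_elements = sum(1 for _ in dataset)  # `dataset` is your tf.data.Dataset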

More generally though, why does this problem arise? If you are calling batch, you are also getting a tf.data.Dataset, so whatever you are running on a batch you should be able to run on the whole dataset; it will iterate through all the elements and calculate validation accuracy. Put differently, I don't think you actually need the size/length to do what you want to do.
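
For example, here is a minimal sketch of that idea in TF 2.x eager style; `model` and `val_ds` are hypothetical names for your model and an unbatched validation dataset of (features, labels) pairs:

import tensorflow as tf

correct = 0
total = 0
# Iterate the whole validation set batch by batch; no dataset size needed up front.
for features, labels in val_ds.batch(128):
    preds = tf.argmax(model(features), axis=-1)
    correct += int(tf.reduce_sum(tf.cast(preds == tf.cast(labels, preds.dtype), tf.int32)))
    total += int(tf.shape(labels)[0])

print('validation accuracy:', correct / total)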

muskrat
    My code accepts training and validation tfrecords files and turns them into two tf.Datasets with a single iterator that can be initialised to both Datasets (similar to [examples](https://www.tensorflow.org/programmers_guide/datasets) in TF's documentation). The number of epochs and batch sizes for training data is in my control and I can easily apply .batch() and .repeat() method on the training dataset. However, for the validation data I want to create a single batch containing all the samples but I don't necessarily know how many samples are in the tfrecord file. – Milad Jan 06 '18 at 15:01
    I see; thanks for the explanation. What I was trying to say is that, when you call `.batch()`, it returns an object of the same type as your dataset. Thus whatever you are calling on a batch you should be able to call on the dataset itself (just without the call to batch). – muskrat Jan 06 '18 at 21:21

The tf.data API creates a tensor called 'tensors/component' (with an appropriate prefix/suffix where applicable) after you create the dataset instance. You can evaluate that tensor by name and use its first dimension as the dataset size / batch size.

#Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import tensorflow as tf
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8,7)
%matplotlib inline


from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")

Xtrain = mnist.train.images[mnist.train.labels < 2]
ytrain = mnist.train.labels[mnist.train.labels < 2]

print(Xtrain.shape)
#(11623, 784)
print(ytrain.shape)
#(11623,)  

#Data parameters
num_inputs = 28
num_classes = 2
num_steps=28

# create the training dataset
Xtrain = tf.data.Dataset.from_tensor_slices(Xtrain).map(lambda x: tf.reshape(x,(num_steps, num_inputs)))
# apply a one-hot transformation to each label for use in the neural network
ytrain = tf.data.Dataset.from_tensor_slices(ytrain).map(lambda z: tf.one_hot(z, num_classes))
# zip the x and y training data together and batch and Prefetch data for faster consumption
train_dataset = tf.data.Dataset.zip((Xtrain, ytrain)).batch(128).prefetch(128)

iterator = tf.data.Iterator.from_structure(train_dataset.output_types,train_dataset.output_shapes)
X, y = iterator.get_next()

training_init_op = iterator.make_initializer(train_dataset)

def get_tensors(graph=tf.get_default_graph()):
    return [t for op in graph.get_operations() for t in op.values()]

get_tensors()
#<tf.Tensor 'tensors_1/component_0:0' shape=(11623,) dtype=uint8>,
#<tf.Tensor 'batch_size:0' shape=() dtype=int64>,
#<tf.Tensor 'drop_remainder:0' shape=() dtype=bool>,
#<tf.Tensor 'buffer_size:0' shape=() dtype=int64>,
#<tf.Tensor 'IteratorV2:0' shape=() dtype=resource>,
#<tf.Tensor 'IteratorToStringHandle:0' shape=() dtype=string>,
#<tf.Tensor 'IteratorGetNext:0' shape=(?, 28, 28) dtype=float32>,
#<tf.Tensor 'IteratorGetNext:1' shape=(?, 2) dtype=float32>,
#<tf.Tensor 'TensorSliceDataset:0' shape=() dtype=variant>,
#<tf.Tensor 'MapDataset:0' shape=() dtype=variant>,
#<tf.Tensor 'TensorSliceDataset_1:0' shape=() dtype=variant>,
#<tf.Tensor 'MapDataset_1:0' shape=() dtype=variant>,
#<tf.Tensor 'ZipDataset:0' shape=() dtype=variant>,
#<tf.Tensor 'BatchDatasetV2:0' shape=() dtype=variant>,
#<tf.Tensor 'PrefetchDataset:0' shape=() dtype=variant>]

sess = tf.InteractiveSession()
print('Size of Xtrain: %d' % tf.get_default_graph().get_tensor_by_name('tensors/component_0:0').eval().shape[0])
#Size of Xtrain: 11623
ARAT

Not sure if this still works in the latest versions of TensorFlow, but if it is absolutely needed, a hacky solution is to request a batch that is bigger than the dataset size. You don't need to know how big the dataset is; just request a batch size that is larger.
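
A minimal sketch of the trick in TF 2.x eager style (the toy dataset and the concrete oversized batch value are just for illustration):

import tensorflow as tf

ds = tf.data.Dataset.range(10)  # stand-in for your validation dataset
# Ask for far more elements than exist; the single (partial) batch that comes
# back then contains the entire dataset.
everything = next(iter(ds.batch(1_000_000_000)))
print(everything.numpy())  # [0 1 2 3 4 5 6 7 8 9]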

Milad

TensorFlow's get_single_element() is finally around, and it does exactly this: return all of the elements of the dataset in a single call.

This avoids the need to generate and use an iterator via .map() or iter() (which could be costly for big datasets).

get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. We need to pass all the members of the dataset batched into a single element.

This can be used to get features as a tensor-array, or features and labels as a tuple or dictionary (of tensor-arrays) depending upon how the original dataset was created.

Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.
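
A rough sketch, assuming TF 2.6+ where get_single_element() is available as a Dataset method (in earlier 2.x versions the same functionality exists as tf.data.experimental.get_single_element(ds)); the toy features/labels are just for illustration:

import tensorflow as tf

# Toy dataset of (feature, label) pairs; any dataset with known cardinality works.
features = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = tf.constant([0, 1, 0])
ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Batch everything into one element, then pull that single element out.
# ds.cardinality() works here because the length is known; otherwise you would
# need another way to choose a batch size covering the whole dataset.
all_features, all_labels = ds.batch(ds.cardinality()).get_single_element()
print(all_features.shape)  # (3, 2)
print(all_labels.shape)    # (3,)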

manisar

Adding to John's answer:

import numpy as np

# Collect one component (here, e.g., the labels) from every element of the dataset.
total = []
for element in val_ds.as_numpy_iterator():
    total.append(element[1])

all_total = np.concatenate(total)
print(all_total)
Farmaker