tf.data is a high-level API provided by TensorFlow for building input pipelines. Its core data structure is the Dataset, which represents a potentially large set of elements.
Here is the definition of Dataset given by tensorflow.org:
A Dataset can be used to represent an input pipeline as a collection of elements (nested structures of tensors) and a "logical plan" of transformations that act on those elements.
To summarize, a Dataset is a data pipeline on which we can perform some preprocessing. The core questions for any pipeline are how data is imported and how it is consumed; the following sections explain both, along with some useful APIs for preprocessing data.
A Dataset can be built from several sources, including CSV files, NumPy arrays, and tensors.
tf.data provides a convenient API, make_csv_dataset, for reading records from one or more CSV files.
Suppose the CSV file looks like this:
We can build a dataset from this CSV as follows:
dataset = tf.contrib.data.make_csv_dataset(CSV_PATH, batch_size=2)
Here batch_size specifies how many records are read in each batch.
We can use an iterator to inspect what this dataset contains:
batch = dataset.make_one_shot_iterator().get_next()
The result is
tf.Tensor(['how' 'I'], shape=(2,), dtype=string)
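To make this concrete, here is a minimal, self-contained sketch. The CSV contents below are hypothetical (the original file is not shown), and in recent TensorFlow versions make_csv_dataset lives under tf.data.experimental rather than tf.contrib.data:

```python
import os
import tempfile
import tensorflow as tf

# A hypothetical two-column CSV with a header row, written to a temp file.
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(csv_path, "w") as f:
    f.write("a,b\nhow,are\nI,am\nyou,fine\nok,thanks\n")

# shuffle=False and num_epochs=1 make the output deterministic for inspection.
dataset = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=2, shuffle=False, num_epochs=1)

# Each element is an OrderedDict mapping column name -> batched tensor.
batch = next(iter(dataset))
print(batch["a"])  # a string tensor of shape (2,)
```

Note that each batch is a dictionary keyed by column name, so batch["a"] holds the first two values of column a.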
make_csv_dataset takes the first row as the header by default. If the CSV file has no header row, like this:
we can set:
dataset2 = tf.contrib.data.make_csv_dataset(CSV_PATH, batch_size=2, header=False, column_names=['a','b','c','d'])
dataset2 should then contain the same values as the dataset built earlier.
We can also create a dataset from tensors; the relevant API is tf.data.Dataset.from_tensor_slices().
dataset2 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([10, 5]))
In fact, the input to this API does not have to be a tensor; NumPy arrays are also accepted.
dataset3 = tf.data.Dataset.from_tensor_slices(np.random.sample((10, 5)))
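The key behavior of from_tensor_slices is that it slices its input along the first dimension. A short sketch of what that means for the (10, 5) array above:

```python
import numpy as np
import tensorflow as tf

# from_tensor_slices slices along the first dimension: a (10, 5) array
# becomes a dataset of 10 elements, each a tensor of shape (5,).
data = np.random.sample((10, 5))
dataset3 = tf.data.Dataset.from_tensor_slices(data)

for element in dataset3.take(2):
    print(element.shape)  # (5,)
```

So a batch dimension in the input turns into the element dimension of the dataset.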
The only way to retrieve data from a dataset is through an iterator, which lets us loop over the dataset and get back the data we want. There are basically two kinds of iterators: the one-shot iterator and the initializable iterator.
An example of the one-shot iterator can be found in the first part, where we showed how to import CSV files.
Compared to the one-shot iterator, the initializable iterator allows the data to be changed after the dataset has been built. Note that this does not work in eager execution mode. Here is the example:
# using a placeholder
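A minimal sketch of the placeholder pattern follows. Since placeholders and initializable iterators are graph-mode features, this uses the tf.compat.v1 namespace available in current TensorFlow; the shapes and data here are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # initializable iterators need graph mode

# Feed the data through a placeholder so it can be swapped at initialization time.
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 2])
dataset = tf.compat.v1.data.Dataset.from_tensor_slices(x)
iterator = tf.compat.v1.data.make_initializable_iterator(dataset)
next_element = iterator.get_next()

with tf.compat.v1.Session() as sess:
    # Bind concrete data to the placeholder, then pull elements one by one.
    sess.run(iterator.initializer, feed_dict={x: np.random.sample((4, 2))})
    first = sess.run(next_element)
    print(first.shape)  # (2,)
```

Re-running iterator.initializer with a different feed_dict swaps in new data without rebuilding the pipeline.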
tf.data provides several tools for data preprocessing, such as batch and shuffle.
dataset.batch(BATCH_SIZE) groups consecutive elements of the dataset into batches, outputting BATCH_SIZE elements at a time.
<tf.Tensor: id=102, shape=(2,), dtype=int64, numpy=array([1, 2])>
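A small runnable sketch that reproduces an output of this shape, assuming a simple dataset of the integers 1 through 6:

```python
import tensorflow as tf

BATCH_SIZE = 2
dataset = tf.data.Dataset.range(1, 7)  # elements 1..6, dtype int64
batched = dataset.batch(BATCH_SIZE)

# The first batch groups the first two consecutive elements.
for batch in batched.take(1):
    print(batch)  # tf.Tensor([1 2], shape=(2,), dtype=int64)
```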
When preparing training data, one important step is shuffling it to mitigate overfitting to the ordering of examples; tf.data offers a convenient API for this, dataset.shuffle(buffer_size).
BATCH_SIZE = 2
<tf.Tensor: id=115, shape=(2,), dtype=int64, numpy=array([2, 3])>
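The shuffle step can be sketched as follows, again assuming a dataset of the integers 1 through 6. The buffer_size choice is the key parameter: shuffle samples from a buffer of that many elements, so a buffer at least as large as the dataset gives a full shuffle:

```python
import tensorflow as tf

BATCH_SIZE = 2
dataset = tf.data.Dataset.range(1, 7)  # elements 1..6

# buffer_size=6 covers the whole dataset, so the order is fully randomized.
shuffled = dataset.shuffle(buffer_size=6).batch(BATCH_SIZE)

for batch in shuffled.take(1):
    print(batch)  # e.g. a random pair such as [2 3]
```

Every pass over the dataset reshuffles, so each epoch sees the examples in a new order.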