
Why every TensorFlow developer should know about TFRecord!

Why waste time maintaining your datasets and their labels in separate files, and reading them at different times, when you can do it all in one place? This native TensorFlow file format lets you shuffle, batch, and split datasets with its own functions. Cool, right?!

After a few days with TensorFlow, every beginner meets this crazy awesome file format called TFRecords. Most batch operations aren't done directly on images; instead, the images (as numpy arrays) and labels (as a list of strings) are first converted into a single TFRecords file. It has always been a beginner's nightmare to understand the purpose of this conversion and the real benefit it brings to the workflow. So here I am, making it easier to understand with examples from simple to complex.

WHAT IS TFRECORD?

As per TensorFlow's documentation:

“… approach is to convert whatever data you have into a supported format. This approach makes it easier to mix and match data sets and network architectures. The recommended format for TensorFlow is a TFRecords file containing tf.train.Example protocol buffers (which contain Features as a field).“

So the easier way to maintain a scalable architecture and a standard input format is to convert your data into a TFRecords file.
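To see what that quote means in practice, here is a tiny sketch that builds one tf.train.Example by hand and prints it, so you can see the Features field the documentation is talking about (the label value and image bytes here are made up for illustration):

import tensorflow as tf

# A hand-built Example: one int64 "label" feature and one bytes "image" feature.
example = tf.train.Example(features=tf.train.Features(feature={
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'raw image bytes'])),
}))

# Printing shows the protobuf text format: a Features message holding a feature map.
print(example)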

Let me explain it in beginner's terms.

When you are working with an image dataset, what is the first thing you do? Split it into train, test, and validation sets, right? You will also shuffle it so that the distribution isn't biased by parameters like the date the images were collected.

Isn't it a tedious job to build the folder structure and then maintain the shuffle?

What if everything were in a single file, and we could use that file to shuffle dynamically and to change the train:test:validation ratio over the whole dataset at any time? That sounds like half the workload removed, right? The beginner's nightmare of maintaining the different splits is no more. This is exactly what TFRecords gives you.

Let's see the difference in code: naive vs. TFRecord.

Naive

import glob
import random

# Loading the location of all files - image dataset
# Considering our image dataset has apple or orange
# The images are named as apple01.jpg, apple02.jpg, ..., orange01.jpg, ... etc.
images = glob.glob('data/*.jpg')

# Shuffling the dataset to remove the bias - if present
random.shuffle(images)

# Creating labels. Consider apple = 0 and orange = 1
labels = [0 if 'apple' in image else 1 for image in images]
data = list(zip(images, labels))

# Ratio: 60% train, 40% test
data_size = len(data)
split_size = int(0.6 * data_size)

# Splitting the dataset
training_images, training_labels = zip(*data[:split_size])
testing_images, testing_labels = zip(*data[split_size:])

TFRecord

Follow these five steps and you end up with a single TFRecords file that holds all of your data:

  1. Open the TFRecords file with tf.python_io.TFRecordWriter and start writing.
  2. Before writing, convert the image data and the label into the proper datatypes (bytes, int64, float).
  3. Wrap the converted values in tf.train.Feature.
  4. Create an Example protocol buffer with tf.train.Example and pass the features into it.
  5. Serialize the Example with SerializeToString() and write it out.
import tensorflow as tf
import numpy as np
import glob
from PIL import Image

# Converting the values into features
# _int64 is used for numeric values
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# _bytes is used for string/char values
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

tfrecord_filename = 'something.tfrecords'

# Initiating the writer and creating the tfrecords file.
writer = tf.python_io.TFRecordWriter(tfrecord_filename)

# Loading the location of all files - image dataset
# Considering our image dataset has apple or orange
# The images are named as apple01.jpg, apple02.jpg, ..., orange01.jpg, ... etc.
images = glob.glob('data/*.jpg')
for image in images:
    img = Image.open(image)
    img = np.array(img.resize((32, 32)))
    label = 0 if 'apple' in image else 1
    feature = {'label': _int64_feature(label),
               'image': _bytes_feature(img.tostring())}

    # Create an example protocol buffer
    example = tf.train.Example(features=tf.train.Features(feature=feature))

    # Writing the serialized example.
    writer.write(example.SerializeToString())

writer.close()
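To sanity-check the write, you can read the record count straight back out of the file. A minimal sketch using the TF 1.x record iterator ('something.tfrecords' is the filename from the script above):

import tensorflow as tf

# Iterate over the serialized Examples stored in the file and count them.
record_count = sum(1 for _ in tf.python_io.tf_record_iterator('something.tfrecords'))
print('Wrote %d examples' % record_count)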

If you look closely at the process involved, it's very simple:

Data -> FeatureSet -> Example -> Serialized Example -> TFRecord

So to read it back, the process is reversed:

TFRecord -> Serialized Example -> Example -> FeatureSet -> Data

Reading from TFRecord

import tensorflow as tf
import glob

# Queue up every tfrecords file found and read serialized Examples from it.
reader = tf.TFRecordReader()
filenames = glob.glob('*.tfrecords')
filename_queue = tf.train.string_input_producer(filenames)
_, serialized_example = reader.read(filename_queue)

# The feature description must match what was written into the file.
feature_set = {'image': tf.FixedLenFeature([], tf.string),
               'label': tf.FixedLenFeature([], tf.int64)}

features = tf.parse_single_example(serialized_example, features=feature_set)

# Decode the raw bytes back into a 32x32 RGB image (matching the writer).
image = tf.decode_raw(features['image'], tf.uint8)
image = tf.reshape(image, [32, 32, 3])
label = features['label']

with tf.Session() as sess:
    # string_input_producer needs queue runners running to produce data.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run([image, label]))
    coord.request_stop()
    coord.join(threads)

You can also shuffle and batch the parsed examples using tf.train.shuffle_batch().
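Here is a minimal sketch of how that might plug into the reading pipeline above; the batch_size, capacity, and min_after_dequeue values are arbitrary choices for illustration, not requirements:

# Batch and shuffle examples coming out of the reading pipeline above.
# min_after_dequeue sets the size of the shuffling buffer; capacity must be larger.
images_batch, labels_batch = tf.train.shuffle_batch(
    [image, label],
    batch_size=16,
    capacity=512,
    min_after_dequeue=128)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    imgs, lbls = sess.run([images_batch, labels_batch])
    print(imgs.shape, lbls.shape)  # (16, 32, 32, 3) (16,)
    coord.request_stop()
    coord.join(threads)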
