Training a computer vision model requires opening many images from the training dataset. A common but problematic approach is to save each image as a separate file. Each open operation then triggers mechanical seek movements in a hard disk drive, and repeating this over hundreds of epochs slows down reading and can wear out the drive.
The correct solution is to have all images in a single large "container" file.
An ideal container file format is the tarball, which usually has the extension
.tar.gz. The .gz extension means that the .tar container file is compressed
with the gzip algorithm. Because gzip is a streaming algorithm, reading
sequentially from a gzipped byte stream is as convenient as reading from the
uncompressed byte stream. The .tar extension names the container format, which
consists of a sequence of files or directories, each with a descriptive header
followed by the file content. A directory has only a header and no content.
This format allows us to read image files sequentially from a .tar or .tar.gz
file without causing frequent mechanical movements in hard drives.
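To make the sequential access pattern concrete, here is a minimal Python sketch using the standard tarfile module. It builds a tiny tarball in memory and reads it back in stream mode, which never seeks backward. The helper names make_tarball and read_sequentially and the fake file contents are assumptions made for illustration; they are not part of the tools described in this document.

```python
import io
import tarfile


def make_tarball(files):
    """Build a small .tar.gz in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)  # the per-file header
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))  # header, then content
    return buf.getvalue()


def read_sequentially(tarball_bytes):
    """Yield (name, content) pairs in archive order."""
    # Mode "r|gz" treats the input as a forward-only gzip stream.
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode, read each member before advancing.
                yield member.name, tar.extractfile(member).read()


# Hypothetical file names and contents, for demonstration only.
demo = make_tarball({
    "mnist_png/training/0/1.png": b"fake-png-0",
    "mnist_png/training/1/2.png": b"fake-png-1",
})
entries = list(read_sequentially(demo))
for name, content in entries:
    print(name, len(content))
```

The entries come back in exactly the order they were written, which is why the order of files inside the tarball matters for training.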
On macOS, we must use GNU tar instead of the default bsdtar.
It is critical in deep learning that the images in each minibatch, or any run of consecutive images read from a tarball, carry different labels. This property is known as shuffling. An unshuffled data sequence can prevent the deep learning training algorithm from converging.
However, most tarballs contain unshuffled data because people often use the
tar czf command to create tarballs. For example, to archive a directory
mnist_png into mnist_png.tar.gz, we run the following command.
tar czf mnist_png.tar.gz mnist_png
We can download this tarball from this GitHub repo.
We can use the tar tvf command to read the content sequentially and print the
image file names.
$ tar tvf mnist_png.tar.gz | head -n 10
drwxr-x--- 0 myleott myleott 0 Dec 10 2015 mnist_png/
drwxr-x--- 0 myleott myleott 0 Dec 10 2015 mnist_png/testing/
drwxr-x--- 0 myleott myleott 0 Dec 10 2015 mnist_png/testing/2/
-rw-r----- 0 myleott myleott 272 Dec 10 2015 mnist_png/testing/2/995.png
-rw-r----- 0 myleott myleott 261 Dec 10 2015 mnist_png/testing/2/8450.png
-rw-r----- 0 myleott myleott 282 Dec 10 2015 mnist_png/testing/2/5629.png
-rw-r----- 0 myleott myleott 280 Dec 10 2015 mnist_png/testing/2/9480.png
-rw-r----- 0 myleott myleott 248 Dec 10 2015 mnist_png/testing/2/2058.png
-rw-r----- 0 myleott myleott 259 Dec 10 2015 mnist_png/testing/2/7762.png
-rw-r----- 0 myleott myleott 243 Dec 10 2015 mnist_png/testing/2/9847.png
The result reveals that the tar czf command archives a folder recursively, so
it puts images in the same folder together. Most image datasets published as
tarballs, including the well-known ImageNet dataset, use the base directory to
name the label of the images in it. As a result, tar czf generates tarballs
that break the shuffling property.
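As a rough way to quantify this, the following Python sketch derives each label from the base directory and measures how often two consecutive files share a label. The function names label_of and same_label_fraction are hypothetical, and the path mnist_png/testing/7/12.png in the interleaved list is made up for illustration.

```python
import posixpath


def label_of(path):
    """Return the base directory name, which serves as the label."""
    return posixpath.basename(posixpath.dirname(path))


def same_label_fraction(paths):
    """Fraction of consecutive pairs that share a label (0.0 is ideal)."""
    labels = [label_of(p) for p in paths]
    pairs = list(zip(labels, labels[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Order as printed by tar tvf above: every file has label "2".
unshuffled = [
    "mnist_png/testing/2/995.png",
    "mnist_png/testing/2/8450.png",
    "mnist_png/testing/2/5629.png",
]
# A hypothetical interleaved order: neighbors have different labels.
interleaved = [
    "mnist_png/testing/2/995.png",
    "mnist_png/testing/7/12.png",
    "mnist_png/testing/2/8450.png",
]
print(same_label_fraction(unshuffled))   # 1.0
print(same_label_fraction(interleaved))  # 0.0
```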
We want to generate a tarball where consecutive images have different labels,
that is, different base directories. We wrote two programs to achieve this goal
using a divide-and-merge strategy: tarball_divide and tarball_merge.
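The idea behind divide-and-merge can be sketched in a few lines of Python that operate on file paths only. This is an illustration under assumptions, not the implementation of tarball_divide or tarball_merge: the function names divide and merge are hypothetical, and round-robin interleaving is one possible merge order chosen here because the source does not say how tarball_merge orders its output.

```python
import posixpath
from itertools import zip_longest


def divide(paths):
    """Group paths by base directory (label), preserving order."""
    groups = {}
    for p in paths:
        label = posixpath.basename(posixpath.dirname(p))
        groups.setdefault(label, []).append(p)
    return groups


def merge(groups):
    """Interleave the groups round-robin (an assumed merge order)."""
    missing = object()  # sentinel for groups that ran out of items
    merged = []
    for row in zip_longest(*groups.values(), fillvalue=missing):
        merged.extend(p for p in row if p is not missing)
    return merged


# Hypothetical file names, grouped by label as tar czf would store them.
paths = [
    "training/0/a.png",
    "training/0/b.png",
    "training/1/c.png",
    "training/1/d.png",
    "training/2/e.png",
]
shuffled = merge(divide(paths))
print(shuffled)
# ['training/0/a.png', 'training/1/c.png', 'training/2/e.png',
#  'training/0/b.png', 'training/1/d.png']
```

With round-robin merging, neighbors differ in label as long as more than one group still has items; a randomized merge would be an equally reasonable choice.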
To install them, we need the Go compiler. With it installed, we run the following command.
go get github.com/wangkuiyi/gotorch/tool/...
We can then find the executable files in $GOPATH/bin.
To split the unshuffled mnist_png.tar.gz into two shuffled tarballs,
mnist_png_training_shuffled.tar.gz and mnist_png_testing_shuffled.tar.gz,
let us take the following steps.
- Split mnist_png.tar.gz into mnist_png_training.tar.gz and
  mnist_png_testing.tar.gz, which are unshuffled.

    tar xzf mnist_png.tar.gz  # Generates two directories: mnist_png/training and mnist_png/testing
    cd mnist_png
    tar czf mnist_png_training.tar.gz training
    tar czf mnist_png_testing.tar.gz testing

- Divide mnist_png_training.tar.gz into ten small tarballs; each contains the
  images in one base directory.

    tarball_divide mnist_png_training.tar.gz

  This generates the files [0-9].tar.gz.

- Merge these files into a new shuffled tarball
  mnist_png_training_shuffled.tar.gz.

    tarball_merge -out=mnist_png_training_shuffled.tar.gz [0-9].tar.gz

  To check that the generated tarball is valid, run the file command, which
  reports whether it recognizes the file as gzip-compressed data. To also
  verify the checksum, run gzip -t.

    file mnist_png_training_shuffled.tar.gz
    gzip -t mnist_png_training_shuffled.tar.gz

  To count the PNG images in these tarballs, run the tar tvf command.

    tar tvf mnist_png_training.tar.gz | grep '\.png$' | wc -l
    tar tvf mnist_png_training_shuffled.tar.gz | grep '\.png$' | wc -l

  Both commands should print 60000.

- Clear the intermediate files and divide-and-merge the testing dataset.

    rm [0-9].tar.gz
    tarball_divide mnist_png_testing.tar.gz
    tarball_merge -out=mnist_png_testing_shuffled.tar.gz [0-9].tar.gz
    tar tvf mnist_png_testing_shuffled.tar.gz | grep '\.png$' | wc -l
    tar tvf mnist_png_testing.tar.gz | grep '\.png$' | wc -l

  The last two commands should both print 10000.
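The counting check can also be done from Python, which may be convenient inside a data-loading script. The sketch below is an assumed equivalent of the tar tvf pipeline, not part of the tools above; count_png is a hypothetical helper name, and the demo runs on a tiny throwaway tarball rather than the MNIST files.

```python
import io
import os
import tarfile
import tempfile


def count_png(tarball_path):
    """Count regular files ending in .png, reading the tarball as a stream."""
    # Mode "r|*" reads forward only and detects the compression automatically.
    with tarfile.open(tarball_path, mode="r|*") as tar:
        return sum(1 for m in tar if m.isfile() and m.name.endswith(".png"))


# Self-contained demo with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(path, mode="w:gz") as tar:
        for name in ["0/a.png", "1/b.png", "README.txt"]:
            data = b"x"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    demo_count = count_png(path)

print(demo_count)  # 2
```

Calling count_png on the unshuffled tarball and on its shuffled counterpart should return the same number, since divide-and-merge only reorders the files.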