My original implementation for extracting a batch of data for training is, to say the least, not the most efficient, and it greatly affects training time. Below I detail the changes I have made to preprocess the images and accelerate training. I'd like to thank Francis Dutil for discussing his approach with me and providing his code for inspiration.
The motivation behind preprocessing
In the inefficient version, each batch is loaded by opening every image file as a .jpg and converting it to a numpy array. This is done repeatedly for every batch, for both training and validation.
Two bottlenecks stand out here: first, decoding each image into a numpy array, and second, dealing with many individual files. The preprocessing therefore aims to alleviate both.
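To make the bottleneck concrete, here is a minimal sketch of the kind of per-batch loading being replaced. The function name and file paths are illustrative, not the original code; it assumes Pillow is available for decoding.

```python
import numpy as np
from PIL import Image

def load_batch_slow(filenames):
    # Open and decode every .jpg on each call -- decoding plus
    # per-file I/O is what makes this path slow.
    batch = []
    for fname in filenames:
        img = Image.open(fname)
        batch.append(np.asarray(img))
    return np.stack(batch)
```

Every call pays the full JPEG-decode cost again, which is exactly what the preprocessing below avoids.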
The issue of memory
One issue preventing the whole dataset from being put into a single tensor is its size. It is not practical to build one huge file and handle it the way we do a smaller dataset like MNIST. This is where the memory limitations of the GPU come into play: ideally, we could load the full dataset onto the GPU once and train on it directly.
To deal with memory, the preprocessed dataset will be saved across multiple files for later use. To keep randomness during training, the user can then randomly select a file and then a random batch within that file.
At the time of this writing, I opted for sets of 15000 images, which results in files of about 1.5 GB each.
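The chunked saving and two-level random selection described above can be sketched as follows. The file-naming scheme (`chunk_0.npy`, `chunk_1.npy`, ...) and function names are my own assumptions, not the original implementation.

```python
import numpy as np

def save_chunks(images, chunk_size, prefix="chunk"):
    # Split the full array into several files so each one is small
    # enough to load into memory (or onto the GPU) at once.
    for i, start in enumerate(range(0, len(images), chunk_size)):
        np.save("%s_%d.npy" % (prefix, i), images[start:start + chunk_size])

def random_batch(rng, n_chunks, batch_size, prefix="chunk"):
    # Keep randomness by first picking a chunk at random,
    # then sampling a batch within that chunk.
    data = np.load("%s_%d.npy" % (prefix, rng.randint(n_chunks)))
    idx = rng.choice(len(data), size=batch_size, replace=False)
    return data[idx]
```

Since each chunk is itself shuffled at preprocessing time, sampling a chunk and then a batch within it is a reasonable approximation of sampling uniformly from the whole dataset.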
Format of dataset
Images will be saved uncorrupted (i.e., without the middle part chopped off), since removing it is easily done on the numpy arrays once loaded; this also provides more flexibility. The axes of each image will be shuffled from 64x64x3 to 3x64x64 to match the input layout expected by Theano. I will also take the opportunity to remove the greyscale images.
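The axis shuffle and greyscale filtering can be sketched as below. It assumes images arrive as HxWxC uint8 arrays (64x64x3 for colour) and that greyscale images show up as 2-D arrays; the function name is illustrative.

```python
import numpy as np

def to_theano_layout(images):
    out = []
    for img in images:
        # Drop greyscale images, which lack the three colour channels.
        if img.ndim != 3 or img.shape[2] != 3:
            continue
        # Move channels first: 64x64x3 -> 3x64x64, matching Theano's
        # (channels, height, width) convention for conv inputs.
        out.append(img.transpose(2, 0, 1))
    return np.stack(out)
```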
The captions, as well as the middle parts, will be packaged in a similar way so that they correspond with their image counterparts. This is done in anticipation of later work.
As a comparison, below are some loading times for the components mentioned above, measured on my powerful MacBook Air under the previous and new implementations.
- loading a batch size of 1 — 2.441 sec
- loading a batch size of 1 — 0.003 sec
- loading a batch size of 128 — 2.51 sec
- loading a batch size of 128 — 0.01 sec
- loading a batch size of 10000 — 21.1 sec
- loading a batch size of 10000 — 3.68 sec
I leave it to the reader to figure out which one is the old/new :).
For more details on the implementation, refer to this link to my GitHub repository.