In this last post on the course project for my Deep Learning class, I share my final thoughts and outline some work that could be done in the future to improve the results.
The goal of this project was to explore different methods of reconstructing downsampled images from the MSCOCO dataset whose middle region had been cropped out. In addition, we were to explore whether including the captions in our model helped at all.
Over the series of posts on this project, I showed that Generative Adversarial Networks pre-trained on uncorrupted images, combined with the perceptual and contextual losses proposed by Yeh et al. for image reconstruction, were successful up to a point. Furthermore, with the model and methodology I used, incorporating the captions did not improve performance and even seemed to decrease it. A possible explanation is that the caption embedding was not standard enough and was treated as noise by the generator.
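To make the two loss terms concrete, here is a minimal NumPy sketch of the objective in the spirit of Yeh et al.: a contextual term that compares the generated image with the corrupted one only on the known pixels, plus a weighted perceptual term based on the discriminator's score. The function names, the binary (unweighted) mask, the 64x64 image with a 32x32 hole, and the value of `lam` are all assumptions made for illustration, not the exact setup from my earlier posts.

```python
import numpy as np


def center_mask(height, width, hole):
    """Binary mask: 1 on known pixels, 0 on the cropped-out centre square."""
    mask = np.ones((height, width, 1), dtype=np.float32)
    top = (height - hole) // 2
    left = (width - hole) // 2
    mask[top:top + hole, left:left + hole, :] = 0.0
    return mask


def inpainting_loss(generated, corrupted, mask, d_score, lam=0.1):
    """Contextual + lam * perceptual loss for one image.

    generated : output of the generator G(z), shape (H, W, C)
    corrupted : image with the centre removed, shape (H, W, C)
    mask      : 1 where pixels are known, 0 inside the hole
    d_score   : discriminator's probability that G(z) is real, in (0, 1)
    lam       : weight of the perceptual term (hypothetical default)
    """
    # Contextual term: L1 distance restricted to the known pixels.
    contextual = np.sum(np.abs(mask * (generated - corrupted)))
    # Perceptual term: low when the discriminator believes G(z) is real.
    perceptual = np.log(1.0 - d_score + 1e-8)
    return contextual + lam * perceptual


# Assumed sizes for illustration: 64x64 RGB image with a 32x32 hole.
mask = center_mask(64, 64, 32)
image = np.random.rand(64, 64, 3).astype(np.float32)
corrupted = image * mask
fake = np.random.rand(64, 64, 3).astype(np.float32)
print(inpainting_loss(fake, corrupted, mask, d_score=0.5))
```

In the actual reconstruction this loss would be minimized with respect to the noise vector z by backpropagating through a fixed generator and discriminator, which needs an autodiff framework rather than plain NumPy.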
I think the same task applied to datasets such as images of faces or birds is simpler, which may explain the successful work of others. I believe the dataset we are using has so much variance (in terms of how different the images are) that it is very hard for our models to fit an appropriate distribution that could be used for reconstruction.
The method and model I used could be extended further to possibly produce better results. Some of the things I think could have been explored further include:
- Increasing the size of the noise vector in the generator
- Pre-processing the captions to “concentrate” the embedding
I think increasing the size of the noise vector might have helped, judging by the results with the caption embedding. The images generated from noise and caption, without reconstruction, seemed to be sharper than the ones generated without the caption, which suggests that a larger input to the generator might help.
Also, since the embedding does not seem to help, exploring different embeddings might prove successful. Another option is to remove words from the captions and keep only the most important ones. For example, the sentence “a man holding a bunch of bananas” could be pre-processed into “man, bananas”. The embedding might then be less polluted by the other words.
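As a rough starting point for that kind of caption pre-processing, the sketch below simply drops words from a small, hand-picked stop-word list. The list and the function name `key_words` are assumptions made for illustration. This filter is cruder than the example above: it keeps "holding" and "bunch", so getting down to only "man, bananas" would need something stronger, such as keeping only the nouns that matter.

```python
import re

# Hypothetical, hand-picked stop-word list; a real one would be much longer.
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "with",
    "and", "is", "are", "to", "by", "for",
}


def key_words(caption):
    """Lower-case the caption, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", caption.lower())
    return [word for word in words if word not in STOP_WORDS]


print(key_words("a man holding a bunch of bananas"))
# ['man', 'holding', 'bunch', 'bananas']
```

The filtered word list would then be fed to the embedding in place of the full caption.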
I have to admit I found this project very interesting. It allowed me to get familiar with the family of generative models and, in particular, GANs, which are very promising. It was also interesting to follow the progression of the project. I enjoyed starting from a basic model and expanding it over the past couple of months, making adjustments based on the results I observed and researching possible alternative methods and implementations along the way.