In this series of posts I will detail how I incorporated the image captions into my model in order to perform image inpainting.
In this part, I cover the approach I used and the implementation I will be going for. The next part (Part 2) will include results, and I will also expand on some elements I previously mentioned that could help performance!
This post relates to the class project for my Deep Learning class. For more information regarding this project, or for all other related posts, please follow this link. For the summary/plan of my project, refer to this post.
To perform the image inpainting, in addition to the images themselves, we have access to captions which describe the full image before it was corrupted (before the black square was put in the middle). Here are some examples of captions taken from the training data:
- ‘A small artificial bird inside a bowl on a table.’
- ‘a blue and black bird is sitting in a bowl on a table’
- ‘A black bowl on a wooden table with a ceramic bird placed inside it’
- ‘A small bird sitting bowl on a table.’
- ‘There is a small bird inside of a bowl’
- ‘A sheep is eating grass with a sign in the background.’
- ‘Four white goats standing and eating in a green field.’
- ‘A cute white goat eating grass near the street.’
- ‘A lone goat doing some lawn work at a big business.’
- ‘A goat grazing on lush green grass near a road.’
One thing we can notice is that the captions are full sentences of different lengths, and there can be multiple captions per image (captions 1-5 all describe one image, and 6-10 another).
Ideally, to help our model generate true images that match the captions, we would like it to understand what the captions mean and infer properties of the true image behind the corruption. With this understanding of the caption, we could add it during training as an input to the generator (maybe even to the discriminator). Conceptually, this could allow the generator to understand what it should be generating. The discriminator may also need the caption as an input to properly evaluate whether an image is real or fake.
Let’s explore two possible approaches for letting our model get information from the captions.
Bag of words
These sentences can be decomposed into simply the appearance of words. Conceptually, this can be seen as a binary variable for each possible word, indicating whether it is present in the sentence or not. Under this approach, we would build one large vector over our entire vocabulary, with one entry per word indicating whether it appears. It can also be seen as a dictionary mapping the words present to their counts.
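As a sketch, the count-vector variant can be built in a few lines. The tiny vocabulary below is illustrative only, not the one from the dataset:

```python
from collections import Counter

# toy vocabulary; a real one would cover every word seen in training captions
vocab = ['a', 'goat', 'grazing', 'on', 'lush', 'green', 'grass', 'near', 'road']

def bag_of_words(caption):
    """Count vector over the vocabulary; out-of-vocabulary words are ignored."""
    counts = Counter(caption.lower().split())
    return [counts[word] for word in vocab]

vector = bag_of_words('A goat grazing on lush green grass near a road')
print(vector)  # [2, 1, 1, 1, 1, 1, 1, 1, 1]
```

Note how the order of the words is lost: any reordering of the caption produces the same vector.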
The advantage of this approach is its simplicity, and the fact that the order in which words appear does not affect the representation. The model would therefore simply know that a goat appears in captions 7-10.
The major disadvantage of this approach, however, is that it doesn't consider the similarity of words! If we compare caption 6 with captions 7-10, as humans we understand that a goat is very similar to a sheep, but with a bag of words, a sheep is as different from a goat as a plane, or even grass.
Word embeddings
Another approach is to consider embeddings of words. This approach assigns a real-valued vector to each word, with the constraint that similar words should have similar vectors. How similar two words are is quite subjective, and how to train and create such word embeddings is a whole research field in itself. The goal here is not to train a model to create appropriate embeddings, but rather to use existing word embeddings.
The main disadvantage of this approach is that word embeddings need to be trained: a model needs to figure out the links between different words in order to assign similar words similar vectors.
The advantage of this approach is that it would consider similar words like sheep and goat as close. The other major advantage is that powerful models, such as Word2Vec, have already been trained to generate word embedding matrices. These pre-trained embeddings can be used directly to map words to real-valued vectors. Furthermore, arithmetic operations can be performed on these vectors.
The classic word2vec example is the operation ‘king’ - ‘man’ + ‘woman’. A good embedding should give you a vector that is close to the one for ‘queen’!
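The arithmetic can be illustrated with tiny made-up 2-dimensional vectors (real word2vec vectors have hundreds of dimensions, and the result is only close to, not exactly equal to, the ‘queen’ vector):

```python
import numpy as np

# made-up 2-d embeddings, for illustration only
embeddings = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king':  np.array([2.0, 0.0]),
    'queen': np.array([2.0, 1.0]),
}

result = embeddings['king'] - embeddings['man'] + embeddings['woman']

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# the word whose vector is closest (by cosine similarity) to the result
nearest = max(embeddings, key=lambda w: cosine(embeddings[w], result))
print(nearest)  # queen
```

With the pre-trained gensim model introduced below, the same idea is exposed directly through `model.most_similar(positive=['king', 'woman'], negative=['man'])`.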
Pre-trained word embeddings
One of the popular pre-trained word2vec models is the one provided by Google (more info and download here), which was trained on more than 3 billion words from Google News. It covers a vocabulary of 3 million words and can easily be used in Python with the gensim package. Basically, the model we are using is simply a very large matrix containing, for each of the 3 million words in the vocabulary, a real-valued vector of size 300.
Using the Python package, we can load the pre-trained model/matrix, which will later allow us to generate the vectors of size 300. One small adjustment to consider is that Google’s model is lowercase only and has a limited vocabulary (even though it’s pretty large); therefore, captions will be lowercased, and words that are not in the vocabulary will be ignored (e.g. the word ‘a’, or any punctuation).
Example of how to load and use the model:

```python
import gensim

# load the pre-trained word2vec model
path = './path_to_model/GoogleNews-vectors-negative300.bin.gz'
model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)

# to get the embeddings for a list of words, simply index the model with it
vectors = model['test sentence'.split()]
print(vectors.shape)  # should print (2, 300)
```
Embedding the captions
The model and gensim package can be used to convert words to vectors, but how do we deal with multiple words in a sentence? What about multiple captions?
An approach that comes to mind for a sentence is to simply take the average of its word embeddings as the embedding of the sentence! This is a simple approach that will be used as a starting point. Much time could be spent finding the best way of combining word embeddings into sentence embeddings, but I think this is appropriate for the task at hand. For example, with the sample code provided above, we would average over axis 0 to recover the embedding of the full sentence.
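A minimal sketch of this averaging, using random vectors as stand-ins for the real word2vec output:

```python
import numpy as np

# stand-ins for word2vec output: one 300-dimensional vector per word,
# i.e. the shape returned by model['test sentence'.split()]
rng = np.random.default_rng(0)
word_vectors = rng.standard_normal((2, 300))

# the sentence embedding is the mean over axis 0 (the word axis)
sentence_embedding = word_vectors.mean(axis=0)
print(sentence_embedding.shape)  # (300,)
```

Whatever the number of words, the sentence embedding always has the same fixed size of 300, which is exactly what a neural network input needs.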
The same kind of method can be applied to deal with the multiple captions: we simply average the sentence embeddings (each of which is already an average of its own word embeddings) over the different captions to get one caption embedding. I will therefore have the following levels of embedding:
- Word embedding
- Sentence embedding
- Caption embedding
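The full hierarchy can be sketched as nested averages. The random vectors below again stand in for the real word2vec lookups, and the dimension of 300 matches the pre-trained model:

```python
import numpy as np

EMB_DIM = 300  # dimensionality of the pre-trained embeddings
rng = np.random.default_rng(0)

def sentence_embedding(words):
    """Average the word vectors of one caption (random stand-ins here)."""
    word_vectors = rng.standard_normal((len(words), EMB_DIM))
    return word_vectors.mean(axis=0)

# several captions describing the same image
captions = [
    'a goat grazing on lush green grass near a road'.split(),
    'a sheep is eating grass with a sign in the background'.split(),
]

# caption embedding: average of the per-sentence embeddings
caption_embedding = np.mean([sentence_embedding(c) for c in captions], axis=0)
print(caption_embedding.shape)  # (300,)
```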
My model will then receive the caption embedding (which at this point is only a 300-element vector) as an input and will be trained just like any other neural network. With a minibatch of training examples, each example will have its own 300-element vector. Of course, this will not be the only input to my model; the previous inputs will be concatenated with the caption embedding.
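The minibatch concatenation can be sketched as follows; the batch size and the size of the other input features are hypothetical placeholders, not the actual values of my model:

```python
import numpy as np

batch_size = 16
rng = np.random.default_rng(0)

# hypothetical other generator inputs: one feature vector per example
# (the 512 here is an assumed size, for illustration only)
image_features = rng.standard_normal((batch_size, 512))
caption_embeddings = rng.standard_normal((batch_size, 300))

# each example's caption embedding is concatenated with its other inputs
generator_input = np.concatenate([image_features, caption_embeddings], axis=1)
print(generator_input.shape)  # (16, 812)
```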
Results and more details on actual implementation will be in Part 2 :)!