Automatic image captioning generates textual descriptions from images. This is a challenging artificial intelligence problem, but deep learning methods have recently achieved good results on it. This project is inspired by that problem: the goal is to implement a language model that generates movie titles from poster images. The core component is a language model that learns the likelihood of a token given the preceding sequence of tokens in a title; this information is then used to generate plausible movie titles.
Train an image-captioning model on movie titles and the visual features of the poster images, and use it to generate new titles.
Build an RNN model that uses pre-trained subword embeddings and a feature fusion layer that concatenates each subword embedding with the visual representation of the poster image, then train the RNN on the movie titles and the visual vectors.
The token vectors are generated with English pre-trained embeddings from BPEmb, which uses Byte-Pair Encoding (BPE) to generate tokens from a string of characters. BPE is an unsupervised subword segmentation method that iteratively merges the most frequent token pair in a string into a new token (https://nlp.h-its.org/bpemb) [1]. Subword embeddings capture morphological information by splitting words into smaller units. Encoding rare and unknown words as sequences of subword units is based on the intuition that various word classes are translatable via units smaller than words; the technique was originally introduced to improve translation of unknown words. The segmentation produced by BPE is often satisfactory, but it does not constitute a linguistic morphological analysis [2].
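For illustration, the snippet below uses the bpemb Python package to segment a title into subwords and map them to embedding indices. The vocabulary size and embedding dimension shown are arbitrary choices for the sketch, not values taken from this project.

```python
from bpemb import BPEmb

# English BPEmb model; vs (vocabulary size) and dim are illustrative choices.
# The pre-trained files are downloaded automatically on first use.
bpemb_en = BPEmb(lang="en", vs=10000, dim=100)

title = "Conan the Pirate"
print(bpemb_en.encode(title))      # subword tokens (BPEmb lowercases its input)
print(bpemb_en.encode_ids(title))  # corresponding indices into the embedding matrix
print(bpemb_en.vectors.shape)      # (10000, 100) pre-trained embedding matrix
```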
The image encoder is the PyTorch version of the ResNet residual network model, pre-trained on the ImageNet classification dataset, as proposed in "Deep Residual Learning for Image Recognition" [3]. The available versions of the model contain 18, 34, 50, 101, and 152 layers respectively [6]. The image representations were generated by feeding the poster images into the model and copying the output of its final pooling layer, which yields a 512-dimensional vector representation of each image.
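A minimal sketch of this feature extraction with torchvision follows. It assumes ResNet-18, since the 512-dimensional output matches the 18- and 34-layer variants rather than the deeper ones; the preprocessing values are the standard ImageNet ones, and the function name and file path are illustrative.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Pre-trained ResNet-18 with the classification head removed,
# keeping everything up to the global average-pooling layer.
resnet = models.resnet18(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def poster_to_vector(path):
    image = Image.open(path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)   # (1, 3, 224, 224)
    with torch.no_grad():
        features = feature_extractor(batch)  # (1, 512, 1, 1)
    return features.flatten()                # 512-dimensional vector
```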
The movie posters are obtained from the IMDB website. The dataset contains the IMDB id, IMDB link, title, IMDB score, genre, and a link to download each movie poster. Each poster belongs to at least one genre and can have at most 3 genre labels assigned to it. https://www.kaggle.com/neha1703/movie-genre-from-its-poster [7].
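For illustration, the dataset metadata can be loaded as follows; the file name and column names mirror the description above but are assumptions, so they may need adjusting to the actual CSV header.

```python
import pandas as pd

# File and column names follow the dataset description above;
# inspect the downloaded CSV and adjust if the header differs.
movies = pd.read_csv("MovieGenre.csv", encoding="latin-1")
print(movies.columns.tolist())

# Keep only rows that have both a title and a poster link.
movies = movies.dropna(subset=["Title", "Poster"])
```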
The dataset used for the titles was the WikiPlots corpus, a collection of 112,936 story plots extracted from English-language Wikipedia [4]. The dataset was created by searching for articles that contain a sub-header with words like "plot" or "plot summary", and thus includes summaries of movies, books, TV episodes, video games, etc. A character-level multi-layer RNN model trained on it consistently came up with varied and plausible titles, such as "Pirates: A Fight Dance Story", "Cannibal Spy II" and "Conan the Pirate" [5].
The RNN model uses an embedding layer to vectorize the movie titles and concatenates each token embedding vector with the corresponding image vector. The model takes as input the movie titles, encoded to index sequences with the BPEmb model, together with the image vectors. The token indices are passed through the embedding layer to produce the token vector representations; each batch of token vectors is then padded before being concatenated with the image vector. The model was evaluated against a held-out validation set during training.
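The sketch below shows one way to realize this fusion architecture in PyTorch. The class name, the choice of an LSTM cell, and the hidden size are illustrative assumptions; the report does not specify the exact RNN cell or hyperparameters.

```python
import torch
import torch.nn as nn

class TitleGenerator(nn.Module):
    """Fusion model sketch: each subword embedding is concatenated
    with the poster's image vector before entering the RNN."""

    def __init__(self, pretrained_embeddings, image_dim=512, hidden_dim=256):
        super().__init__()
        vocab_size, embed_dim = pretrained_embeddings.shape
        # Initialize the embedding layer from the pre-trained BPEmb vectors.
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(pretrained_embeddings, dtype=torch.float),
            freeze=True)
        self.rnn = nn.LSTM(embed_dim + image_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, image_vecs):
        # token_ids: (batch, seq_len), image_vecs: (batch, image_dim)
        embedded = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        seq_len = embedded.size(1)
        # Tile the image vector across time steps and fuse by concatenation.
        tiled = image_vecs.unsqueeze(1).expand(-1, seq_len, -1)
        fused = torch.cat([embedded, tiled], dim=2)
        hidden, _ = self.rnn(fused)
        return self.out(hidden)                       # next-token logits
```

With the BPEmb vectors loaded as in the earlier snippet, the model can be constructed as `model = TitleGenerator(bpemb_en.vectors, image_dim=512)` and trained with a cross-entropy loss over the next-token predictions.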
[1] BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
[2] Neural Machine Translation of Rare Words with Subword Units
[3] Deep Residual Learning for Image Recognition
[4] WikiPlots corpus
[5] Story titles, invented by neural network
[7] Movie Genre from its Poster: Predicting the Genre of the movie by analysing its poster