key: cord-020912-tbq7okmj
authors: Batra, Vishwash; Haldar, Aparajita; He, Yulan; Ferhatosmanoglu, Hakan; Vogiatzis, George; Guha, Tanaya
title: Variational Recurrent Sequence-to-Sequence Retrieval for Stepwise Illustration
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_4
doc_id: 20912
cord_uid: tbq7okmj

We address and formalise the task of sequence-to-sequence (seq2seq) cross-modal retrieval. Given a sequence of text passages as query, the goal is to retrieve a sequence of images that best describes and aligns with the query. This new task extends traditional cross-modal retrieval, where each image-text pair is treated independently, ignoring broader context. We propose a novel variational recurrent seq2seq (VRSS) retrieval model for this seq2seq task. Unlike most cross-modal methods, we generate an image vector corresponding to the latent topic obtained from combining the text semantics and context. This synthetic image embedding point, associated with every text embedding point, can then be employed for either image generation or image retrieval as desired. We evaluate the model for the application of stepwise illustration of recipes, where a sequence of relevant images is retrieved to best match the steps described in the text. To this end, we build and release a new Stepwise Recipe dataset for research purposes, containing 10K recipes (sequences of image-text pairs) with a total of 67K image-text pairs. To our knowledge, it is the first publicly available dataset to offer rich semantic descriptions in a focused category such as food or recipes. Our model is shown to outperform several competitive and relevant baselines in the experiments. We also provide qualitative analysis of how semantically meaningful the results produced by our model are, through human evaluation and comparison with relevant existing methods.

There is growing interest in cross-modal analytics and search in multimodal data repositories. A fundamental problem is to associate images with some corresponding descriptive text. Such associations often rely on semantic understanding, beyond traditional similarity search or image labelling, to provide human-like visual understanding of the text and reflect abstract ideas in the image.

Fig. 1. Stepwise Recipe illustration example showing a few text recipe instruction steps alongside one full sequence of recipe images. Note that retrieval of an accurate illustration of Step 4, for example, depends on previously acquired context information.

Cross-modal retrieval systems must return outputs of one modality from a data repository, while a different modality is used as the input query. The multimodal repository usually consists of paired objects from two modalities, but may be labelled or unlabelled. Classical approaches to compare data across modalities include canonical correlation analysis [12], partial least squares regression [28], and their numerous variants. More recently, various deep learning models have been developed to learn shared embedding spaces from paired image-text data, either unsupervised, or supervised using image class labels. The deep models popularly used include deep belief networks [23], correspondence autoencoders [9], deep metric learning [13], and convolutional neural networks (CNNs) [33]. With all these models, it is expected that by learning from pairwise-aligned data, the common representation space will capture semantic similarities across modalities.
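To make the shared embedding space idea concrete, the sketch below shows a generic pairwise cross-modal retriever: text and image features are projected into a joint space and candidates are ranked by cosine similarity. It is a minimal illustration rather than any of the cited systems; the class name, feature dimensions, and linear projections are assumed for the example.

```python
# Minimal sketch of pairwise cross-modal retrieval in a shared embedding space.
# Dimensions and projection layers are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceRetriever(nn.Module):
    def __init__(self, text_dim=300, image_dim=2048, joint_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)    # maps text features
        self.image_proj = nn.Linear(image_dim, joint_dim)  # maps image features

    def embed_text(self, text_feats):
        return F.normalize(self.text_proj(text_feats), dim=-1)

    def embed_image(self, image_feats):
        return F.normalize(self.image_proj(image_feats), dim=-1)

    def rank_images(self, text_feats, image_feats):
        # cosine similarity of one text query (shape 1 x text_dim)
        # against all candidate images, highest-scoring first
        sims = self.embed_image(image_feats) @ self.embed_text(text_feats).T
        return sims.squeeze(-1).argsort(descending=True)
```

Training such a model typically pulls matching image-text pairs together and pushes mismatched pairs apart in the joint space.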
Most such systems, however, do not consider sequences of related data in the query or result. In traditional image retrieval using text queries, for example, each image-text pair is considered in isolation, ignoring any broader 'context'. A context-aware image-from-text retrieval model must look at pairwise associations and also consider sequential relationships. Such sequence-to-sequence (seq2seq) cross-modal retrieval is possible when contextual information and semantic meaning are both encoded and used to inform the retrieval step. For stepwise recipe illustration, an effective retrieval system must identify and align a set of relevant images corresponding to each step of a given text sequence of recipe instructions. More generally, for the task of automatic story picturing, a series of suitable images must be chosen to illustrate the events and abstract concepts found in a sequential text taken from a story. An example of the instruction steps and illustrations of a recipe taken from our new Stepwise Recipe dataset is shown in Fig. 1.

In this paper, we present a variational recurrent learning model to enable seq2seq retrieval, called the Variational Recurrent Sequence-to-Sequence (VRSS) model. VRSS produces a joint representation of the image-text repository, where the semantic associations are grounded in context by making use of the sequential nature of the data. Stepwise query results are then obtained by searching this representation space. More concretely, we incorporate the global context information encoded in the entire text sequence (through the attention mechanism) into a variational autoencoder (VAE) at each time step, which converts the input text into an image representation in the image embedding space. To capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior of the distribution of the topic given the text input follows the distribution conditional on the latent topic from the previous time step. By doing so, our model can naturally capture sequential semantic structure. Our main contributions can be summarised as follows:

- We formalise the task of sequence-to-sequence (seq2seq) retrieval for stepwise illustration of text.
- We propose a new variational recurrent seq2seq (VRSS) retrieval model for seq2seq retrieval, which employs temporally-dependent latent variables to capture the sequential semantic structure of text-image sequences.
- We release a new Stepwise Recipe dataset (10K recipes, 67K total image-text pairs) for research purposes, and show that VRSS outperforms several cross-modal retrieval alternatives on this dataset, using various performance metrics.

Our work is related to: cross-modal retrieval, story picturing, variational recurrent neural networks, and cooking recipe datasets.

Cross-Modal Retrieval. A number of pairwise-based methods over the years have attempted to address the cross-modal retrieval problem in different ways, such as metric learning [26] and deep neural networks [32]. For instance, an alignment model [16] was devised that learns inter-modal correspondences using the MS-COCO [19] and Flickr-30k [25] datasets. Other work [18] proposed unifying joint image-text embedding models with multimodal neural language models, using an encoder-decoder pipeline. A later method [8] used hard negatives to improve its ranking loss function, which yielded significant gains in retrieval performance. Such systems focus only on isolated image retrieval when given a text query, and do not address the seq2seq retrieval problem that we study here.
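As an illustration of the kind of ranking objective used by these joint-embedding models, the following is a minimal sketch of a max-margin triplet loss with in-batch hard negatives, in the spirit of the hard-negative idea in [8]; the function name, margin value, and the single text-to-image direction are simplifying assumptions, not the exact formulation of that work.

```python
import torch

def hard_negative_triplet_loss(text_emb, image_emb, margin=0.2):
    """Max-margin ranking loss over a batch of L2-normalised embeddings,
    keeping only the hardest in-batch negative per text query.
    Assumes text_emb[k] and image_emb[k] are a matching pair."""
    sims = text_emb @ image_emb.T              # (B, B) text-to-image similarities
    pos = sims.diag().unsqueeze(1)             # matching pairs on the diagonal
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    # cost of ranking a non-matching image above the matching one
    cost = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)
    return cost.max(dim=1).values.mean()       # hardest negative per query
```

A symmetric image-to-text term is usually added in practice.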
In a slight variation [2], the goal was to retrieve an image-text multimodal unit when given a text query. For this, the authors proposed a gated neural architecture to create an embedding space from the query texts and query images along with the multimodal units that form the retrieval result set, and then performed semantic matching in this space. The training minimised a structured hinge loss, and there was no sequential nature to the data used.

Story Picturing. An early story picturing system [15] retrieved landscape and art images to illustrate ten short stories based on key terms in the stories and image descriptions, as well as a similarity linking of images. The idea was pursued further with a system [11] for helping people with limited literacy to read, which split a sentence into three categories and then retrieved a set of explanatory pictorial icons for each category. To our knowledge, an application [17] that ranks and retrieves image sequences based on longer text paragraphs as queries was the first to extend the pairwise image-text relationship to matching image sequences with longer paragraphs. It employed a structural ranking support vector machine with latent variables and used a custom-built Disneyland dataset, consisting of blog posts with associated images, as the parallel corpus from which to learn joint embeddings. We follow a similar approach, creating our parallel corpus from sequential stepwise cooking recipes rather than unstructured blog posts, and design an entirely new seq2seq model to learn our embeddings.

The Visual Storytelling Dataset (VIST) [14] was built with a motivation similar to our own, but for generating text descriptions of image sequences rather than the other way around. Relying on human annotators to generate captions, VIST contains sequential image-text pairs with a focus on abstract visual concepts, temporal event relations, and storytelling. In our work, we produce a similar sequenced dataset in a simple, automated manner. A recent joint sequence-to-sequence model [20] learned a common image-text semantic space and generated paragraphs to describe photo streams. This bidirectional attention recurrent neural network was evaluated on both of the above datasets. Despite being unsuitable for our inverse problem, VIST has also been used for retrieving images when given text, in work related to ours. In an approach called Coherent Neural Story Illustration (CNSI), an encoder-decoder network [27] was built to first encode sentences using a hierarchical two-level sentence-story gated recurrent unit (GRU), and then sequentially decode them into a corresponding sequence of illustrative images. A previously proposed coherence model [24] was used to explicitly model co-references between sentences.

Variational Recurrent Neural Networks. Our model is partly inspired by the variational recurrent neural network (VRNN) [6], which introduces latent random variables into the hidden state of an RNN by combining it with a variational autoencoder (VAE). It was shown that, using high-level latent random variables, the VRNN can model the variability observed in structured sequential data such as natural speech and handwriting. The VRNN has recently been applied to other sequential modelling tasks such as machine translation [31]. Our proposed VRSS model introduces temporally-dependent latent variables to capture the sequential semantic structure of text/image sequences.
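To make the idea of temporally-dependent latent variables concrete, the sketch below contrasts a learned prior p(z_t | z_{t-1}) with the fixed N(0, I) prior of a standard VAE. The layer sizes and network shapes are illustrative assumptions, not the architecture of [6] or of our model.

```python
import torch
import torch.nn as nn

class TemporalPrior(nn.Module):
    """Prior p(z_t | z_{t-1}) as a diagonal Gaussian whose parameters are
    produced by small networks of the previous latent (illustrative sketch)."""
    def __init__(self, latent_dim=128, hidden_dim=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z_prev):
        h = self.shared(z_prev)
        # contrast with a standard VAE, whose prior is a fixed N(0, I)
        return self.mu(h), self.logvar(h)
```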
Different from existing approaches, we take into account the global context information encoded in the entire query sequence. We use a VAE for cross-modal generation by converting the text into a representation in the image embedding space, instead of using it to reconstruct the text input. Finally, we use a max-margin hinge loss to enforce similarity between text and paired image representations.

Cooking Recipe Datasets. The first attempt at automatic classification of food images was the Food-101 dataset [3], with 101K images across 101 categories. Since then, the newer Recipe1M dataset [29] gained wide attention; it paired each recipe with several images to build a collection of 13M food images for 1M recipes. Recent work [4] proposed a cross-modal retrieval model that aligns Recipe1M images and recipes in a shared representation space. As this dataset does not offer any sequential data for stepwise illustration, the association is between images of the final dish and the corresponding entire recipe text. Our Stepwise Recipe dataset, by comparison, provides an image for each instruction step, resulting in a sequence of image-text pairs for each recipe. The authors of [5] released a dataset of sequenced image-text pairs in the cooking domain, with a focus on text generation conditioned on images. RecipeQA [34] is another popular dataset, used for multimodal comprehension and reasoning, with 36K questions about 20K recipes and illustrative images for each step of the recipes. Recent work [1] used it to analyse image-text coherence relations, thereby producing a human-annotated corpus with coherence labels to characterise different inferential relationships. The RecipeQA dataset reveals associations between image-text pairs much like our Stepwise Recipe dataset, and we therefore utilise it to augment our own dataset.

We construct the Stepwise Recipe dataset, composed of illustrated, step-by-step recipes from three websites¹. Recipes were automatically web-scraped and cleaned of HTML tags. Information about the data and scripts will be made available on GitHub². The construction of such an image-text parallel corpus has several challenges, as highlighted in previous work [17]. The text is often unstructured, without information about the canonical association between image-text pairs. Each image is semantically associated with some portion of the text in the same recipe, and we assume that the images chosen by the author to augment the text are semantically meaningful. We thus perform text segmentation to divide the recipe text and associate each segment with a single image. We perform text-based filtering [30] to ensure text quality: (1) descriptions should have a high unique-word ratio covering various part-of-speech tags, so descriptions with a high noun ratio are discarded; (2) descriptions with high repetition of tokens are discarded; and (3) some predefined boiler-plate prefix/suffix sequences are removed. Our constructed dataset consists of about 2K recipes with 44K associated images. Furthermore, we augment our parallel corpus using similarly filtered RecipeQA data [34], which contains images for each step of the recipes in addition to visual question answering data. The final dataset contains over 10K recipes in total and 67K images.
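The text-based filtering described above can be approximated with simple heuristics along the following lines; the thresholds, tokeniser, and boiler-plate prefixes are illustrative assumptions rather than the exact values and rules used to build the dataset.

```python
# Sketch of the text-quality filters: unique-word ratio, noun ratio,
# token repetition, and boiler-plate prefix removal (illustrative values).
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

BOILERPLATE_PREFIXES = ("in this step,", "note:")  # hypothetical examples

def keep_description(text, min_unique_ratio=0.5, max_noun_ratio=0.8):
    tokens = nltk.word_tokenize(text.lower())
    if not tokens:
        return False
    # discard low lexical diversity / high token repetition
    if len(set(tokens)) / len(tokens) < min_unique_ratio:
        return False
    # discard descriptions dominated by nouns (Penn Treebank NN* tags)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    if sum(tag.startswith("NN") for tag in tags) / len(tags) > max_noun_ratio:
        return False
    return True

def strip_boilerplate(text):
    for prefix in BOILERPLATE_PREFIXES:
        if text.lower().startswith(prefix):
            return text[len(prefix):].lstrip()
    return text
```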
The seq2seq retrieval task is formalised as follows: given a sequence of text passages x = {x_1, x_2, ..., x_T}, retrieve a sequence of images i = {i_1, i_2, ..., i_T} (from a data repository) which best describes the semantic meanings of the text passages, i.e.,

p(i | x) = ∏_{t=1}^{T} p(i_t | i_{<t}, x_{≤t}).

We address the seq2seq retrieval problem by considering three aspects: (1) encoding the contextual information of text passages; (2) capturing the semantics of the images retrieved so far (in a story/recipe); and (3) learning the relatedness between each text passage and its corresponding image. It is natural to use RNNs to encode a sequence of text passages. Here, we encode a text sequence using a bi-directional GRU (bi-GRU). Given a text passage, we use an attention mechanism to capture the contextual information of the whole recipe. We map the text embedding into a latent topic z_t by using a VAE. In order to capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior of the distribution of the topic given the text input follows a distribution conditional on the latent topic z_{t-1} from the previous step. We decode the corresponding image vector i_t conditional on the latent topic, learn the relatedness between text and image with a multi-layer perceptron, and obtain a synthetic image embedding point generated from its associated text embedding point. Our proposed Variational Recurrent Seq2seq (VRSS) model is illustrated in Fig. 2. Below, we describe each of the main components of the VRSS model.

Text Encoder. We use a bi-GRU to learn the hidden representations of a text passage (e.g. one recipe instruction) in the forward and backward directions. The two learned hidden states are then concatenated to form the text segment representation. To encode a sequence of such text passages (e.g. one recipe), a hierarchical bi-GRU is used, which first encodes each text segment and subsequently combines them.

Image Encoder. To generate the vector representation of an image, we use the pre-trained modified ResNet50 CNN [22]. In our experiments, this model produced a well-distributed feature space when trained on the limited domain, namely food-related images. This was verified using t-SNE visualisations [21], which showed less clustering in the generated embedding space compared to embeddings obtained from models pre-trained on ImageNet [7].

To capture global context, we feed the bi-GRU encodings into a top-level bi-GRU. Assuming the hidden state output of each text passage x_l in the global context is h^c_l, we use an attention mechanism to capture its similarity with the hidden state output of the t-th text passage h_t as

α_l = exp(h_t · h^c_l) / Σ_{l'=1}^{L} exp(h_t · h^c_{l'}).

The context vector is encoded as the combination of the L text passages weighted by the attentions as c_t = Σ_{l=1}^{L} α_l h^c_l. This ensures that any given text passage is influenced more by others that are semantically similar.

At the t-th step of the text sequence, the bi-GRU output h_t for text x_t is combined with the context c_t and fed into a VAE to generate the latent topic z_t. Two prior networks f^μ_θ and f^Σ_θ define the prior distribution of z_t conditional on the previous z_{t-1}:

p(z_t | z_{t-1}) = N(z_t; f^μ_θ(z_{t-1}), f^Σ_θ(z_{t-1})).

We also define two inference networks f^μ_φ and f^Σ_φ, which are functions of h_t, c_t, and z_{t-1}:

q(z_t | x_{≤t}, z_{t-1}) = N(z_t; f^μ_φ(h_t, c_t, z_{t-1}), f^Σ_φ(h_t, c_t, z_{t-1})).

Unlike the typical VAE setup, where the text input x_t is reconstructed by generation networks, here we generate the corresponding image vector i_t. To generate the image vector conditional on z_t, the generation networks are defined which are also conditional on z_{t-1}, giving p(i_t | z_t, z_{t-1}). The generation loss for image i_t is then:

E_{q(z_t | x_{≤t}, z_{t-1})}[log p(i_t | z_t, z_{t-1})] − KL(q(z_t | x_{≤t}, z_{t-1}) || p(z_t | z_{t-1})).
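For concreteness, the following is a minimal sketch of one VRSS-style time step, assuming precomputed bi-GRU outputs h_t, context vectors c_t, the previous latent z_prev, and target image vectors img_t. The layer sizes, the MSE term standing in for the Gaussian log-likelihood log p(i_t | z_t, z_{t-1}), and all names are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VRSSStep(nn.Module):
    """One time step: temporally-dependent prior, posterior over z_t,
    and decoding of a synthetic image vector (illustrative sketch)."""
    def __init__(self, hid=512, latent=128, img_dim=2048):
        super().__init__()
        self.prior_net = nn.Sequential(nn.Linear(latent, hid), nn.Tanh())
        self.prior_mu, self.prior_logvar = nn.Linear(hid, latent), nn.Linear(hid, latent)
        self.post_net = nn.Sequential(nn.Linear(2 * hid + latent, hid), nn.Tanh())
        self.post_mu, self.post_logvar = nn.Linear(hid, latent), nn.Linear(hid, latent)
        self.decoder = nn.Sequential(nn.Linear(2 * latent, hid), nn.ReLU(),
                                     nn.Linear(hid, img_dim))  # image-vector generator

    def forward(self, h_t, c_t, z_prev, img_t):
        # prior p(z_t | z_{t-1}) and posterior q(z_t | x_<=t, z_{t-1})
        hp = self.prior_net(z_prev)
        p_mu, p_logvar = self.prior_mu(hp), self.prior_logvar(hp)
        hq = self.post_net(torch.cat([h_t, c_t, z_prev], dim=-1))
        q_mu, q_logvar = self.post_mu(hq), self.post_logvar(hq)
        z_t = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()  # reparameterise
        img_pred = self.decoder(torch.cat([z_t, z_prev], dim=-1))     # synthetic image vector
        # Gaussian KL(q || p), plus an MSE reconstruction standing in for -log p(i_t | .)
        kl = 0.5 * (p_logvar - q_logvar
                    + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1).sum(-1)
        recon = F.mse_loss(img_pred, img_t, reduction="none").sum(-1)
        return img_pred, z_t, (recon + kl).mean()
```

In training, such a per-step term would be summed over the sequence and combined with the max-margin hinge loss between text and paired image representations mentioned earlier.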