key: cord-020888-ov2lzus4
title: Learning to Rank Images with Cross-Modal Graph Convolutions
authors: Formal, Thibault; Clinchant, Stéphane; Renders, Jean-Michel; Lee, Sooyeol; Cho, Geun Hee
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_39
sha:
doc_id: 20888
cord_uid: ov2lzus4

We are interested in the problem of cross-modal retrieval for web image search, where the goal is to retrieve images relevant to a text query. While most current approaches to cross-modal retrieval revolve around learning how to represent text and images in a shared latent space, we take a different direction: we propose to generalize the cross-modal relevance feedback mechanism, a simple yet effective unsupervised method that relies on standard information retrieval heuristics and the choice of a few hyper-parameters. We show that it can be cast as a supervised representation learning problem on graphs, using graph convolutions operating jointly over text and image features, namely cross-modal graph convolutions. The proposed architecture directly learns how to combine image and text features for the ranking task, while taking into account the context given by all the other elements in the set of images to be (re-)ranked. We validate our approach on two datasets: a public dataset from a MediaEval challenge, and a small sample of proprietary image search query logs, referred to as WebQ. Our experiments demonstrate that our model improves over standard baselines.

This paper considers the typical image search scenario, where a user enters a text query and the system returns a set of ranked images. More specifically, we are interested in re-ranking a subset of candidate images retrieved from the whole image collection by an efficient base ranker, following standard multi-stage ranking architectures in search engines [36]. Directly including visual features in the ranking process is not straightforward due to the semantic gap between text and images: this is why the problem was initially addressed with standard text-based retrieval, relying for instance on text crawled from the image's webpage (e.g. surrounding text, title of the page, etc.). Since this text is generally noisy and hardly describes the image semantics, many techniques have since been developed to exploit visual information and thereby improve the quality of the results. For instance, some works have focused on building similarity measures by fusing mono-modal similarities, using either simple combination rules or more complex propagation mechanisms in similarity graphs. More recently, techniques have emerged from the computer vision community in which text and images are embedded in the same latent space (a.k.a. joint embedding), making it possible to match text queries directly to images. The latter are currently considered state-of-the-art for the cross-modal retrieval task. However, they are generally evaluated on artificial retrieval scenarios (e.g. on the MSCOCO dataset [34]), and rarely considered in a re-ranking scenario, where mechanisms like pseudo-relevance feedback (PRF) [31] are highly effective. We propose to revisit the problem of cross-modal retrieval in the context of re-ranking. Our first contribution is a general formulation of a differentiable architecture, drawing inspiration from cross-modal retrieval, learning to rank, neural information retrieval and graph neural networks.
Compared to joint embedding approaches, we tackle the problem from a different angle: instead of learning new (joint) embeddings, we focus on designing a model that learns to combine information from different modalities. Finally, we validate our approach on two datasets, using simple instances of our general formulation, and show that the approach is not only able to reproduce PRF, but actually outperforms it.

Cross-Modal Retrieval. In the literature, two main lines of work can be distinguished regarding cross-modal retrieval: the first focuses on designing effective cross-modal similarity measures (e.g. [2, 10]), while the second seeks to learn how to map images and text into a shared latent space (e.g. [15, 18, 19, 54]). The first set of approaches simply combines different mono-media similarity signals, relying either on simple aggregation rules or on unsupervised cross-modal PRF mechanisms that depend on the choice of a few, but critical, hyper-parameters [2, 10, 11, 45]. As discussed in the next section, the latter can be formulated as a two-step PRF propagation process in a graph, where nodes represent multi-modal objects and edges encode their visual similarities. This was later extended to more general propagation processes based on random walks [28].

Alternatively, joint embedding techniques aim at learning a mapping between textual and visual representations [15, 18, 19, 23, 52-55, 61]. Canonical Correlation Analysis (CCA) [17] and its deep variants [5, 27, 58], as well as bi-directional ranking losses [8, 9, 52, 53, 55, 61] (or triplet losses), ensure that, in the new latent space, an image and its corresponding text are correlated or close enough w.r.t. the other images and pieces of text in the training collection. Other objective functions rely on metric learning losses [35], machine translation-based measures [44] or even adversarial losses [51]. These approaches suffer from several limitations [61]: they are sensitive to the triplet sampling strategy as well as to the choice of appropriate margins in the ranking losses. Moreover, constituting a training set that ensures good learning and generalization is not an easy task: the text associated with an image should describe its visual content (e.g. "a man speaking in front of a camera in a park"), and nothing else (e.g. "the President of the US, the 10th of March", "John Doe", "joy and happiness"). Building a universal training collection of paired (image, text) instances, where the text faithfully describes the content of the image in terms of elementary objects and their relationships, would be too expensive and time-consuming in practice. Consequently, image search engines rely on such pairs crawled from the Web, where the link between image and text (e.g. image caption, surrounding sentences, etc.) is tenuous and noisy. To circumvent this problem, query logs could be used but, unfortunately (and this is our second argument regarding the limitations), real queries are never expressed in the same way as the ones considered when evaluating joint embedding methods (e.g. the artificial retrieval setting on the MSCOCO [34] or Flickr-30K [43] datasets, where the query is the full canonical textual description of the image). In practice, queries are characterised by very large intent gaps: they do not really describe the content of the image but, most of the time, contain only a few words, and are far from expressing the true visual needs.
What does it mean to impose close representations for all images representing "Paris" (e.g. "the Eiffel Tower", "Louvre Museum"), even though they can be associated with the same textual unit?

Neural Information Retrieval. Neural networks, such as RankNet and LambdaRank, have been used intensively in IR to address the learning to rank task [7]. More recently, there has been a growing interest in designing effective IR models with neural networks [1, 12, 13, 20, 25, 26, 37, 38, 41, 56], by learning the features useful for the ranking task directly from text. Standard strategies focus on learning a global ranking function that considers each query-document pair in isolation, and thus tend to ignore the difference in feature distributions across queries [4]. Hence, some recent works have focused on designing models that exploit the context induced by the re-ranking paradigm, either by explicitly designing differentiable PRF models [32, 40], or by encoding the ranking context (the set of elements to re-rank) using RNNs [4] or attention mechanisms [42, 62]. Consequently, the score for a document takes into account all the other documents in the candidate list. Because of their resemblance to structured problems, these approaches could benefit from the recent body of work on graph neural networks, which operate on graphs by learning how to propagate information to neighboring nodes.

Graph Neural Networks. Graph Neural Networks (GNNs) are extensions of neural networks that deal with structured data encoded as a graph. Recently, Graph Convolutional Networks (GCNs) [30] have been proposed for semi-supervised classification of nodes in a graph. Each layer of a GCN can generally be decomposed as follows: (i) node features are first transformed (e.g. by a linear mapping); (ii) node features are convolved, meaning that for each node, a differentiable, permutation-invariant operation (e.g. sum, mean, or max) over its neighbouring node features is computed, before applying some non-linearity; (iii) finally, we obtain a new representation for each node in the graph, which is then fed to the next layer. Many extensions of GCNs have been proposed (e.g. GraphSAGE [21], Graph Attention Networks [50], Graph Isomorphism Networks [57]), some of them directly tackling the recommendation task (e.g. PinSAGE [59]). But to the best of our knowledge, there is no prior work on using graph convolutions for the (re-)ranking task.

Our goal is to extend and generalize simple yet effective unsupervised approaches that have been proposed for the task [2, 3, 10, 11, 45], and that can be seen as an extension of pseudo-relevance feedback methods to multi-modal objects. Let d ∈ D denote a document to re-rank, composed of text and image. We denote by s_V(., .) a normalized similarity measure between two images, and by s_T(q, d) the textual relevance score of document d w.r.t. query q. The cross-modal similarity score is given by:

    s_{CM}(q, d) = \sum_{d_j \in NN_K^T(q)} s_T(q, d_j) \, s_V(d_j, d)    (1)

where NN_K^T(q) denotes the set of K most relevant documents w.r.t. q, based on text, i.e. on s_T(q, .). The model can be understood very simply: similarly to PRF methods in standard information retrieval, the goal is to boost images that are visually similar to the top images (from a text point of view), i.e. images that are likely to be relevant to the query but were initially badly ranked (which is likely to happen in the web scenario, where text is crawled from the source page and can be very noisy).
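For concreteness, the following minimal NumPy sketch (our own illustration, not code from the paper) computes the score of Eq. (1) for all candidates of a single query. It assumes the textual scores s_T(q, .) and L2-normalized visual embeddings are given; the function name and the use of a plain dot product as s_V are our assumptions.

```python
import numpy as np

def cross_modal_prf_scores(text_scores, visual_emb, k):
    """Unsupervised cross-modal PRF score (Eq. 1) for one query.

    text_scores: (N,) array, s_T(q, d_i) for the N candidates.
    visual_emb:  (N, d) array of L2-normalized image embeddings, so that
                 the dot product plays the role of the visual similarity s_V.
    k:           number of text-based pseudo-relevant neighbors NN_K^T(q).
    """
    # indices of the K candidates with the highest textual relevance
    top_k = np.argsort(-text_scores)[:k]
    # s_CM(q, d_i) = sum_{j in NN_K^T(q)} s_T(q, d_j) * s_V(d_j, d_i)
    sims_to_topk = visual_emb @ visual_emb[top_k].T   # (N, k)
    return sims_to_topk @ text_scores[top_k]          # (N,)

# toy usage: 5 candidates with 4-dimensional visual embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
scores = cross_modal_prf_scores(rng.random(5), emb, k=2)
```

As discussed in the experimental section, this unsupervised score is typically re-combined with the initial textual relevance through a simple mixture before producing the final ranking.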
Despite showing good empirical results, cross-modal similarities are fully unsupervised, and lack any dynamic behaviour, such as the ability to adapt to different queries. Moreover, they rely on a single relevance score s_T(q, .), while it could actually be beneficial to learn how to use a larger set of features, such as the ones employed in learning to rank models. In [3], the authors drew a parallel between the cross-modal similarity from Eq. (1) and random walks in graphs: it can be seen as a kind of multi-modal label propagation in a graph. This motivates us to tackle the task using graph convolutions. We therefore represent each query q ∈ Q as a graph G_q, as follows:

- The set of nodes is the set of candidate documents d_i to be re-ranked for this query: typically from a few to hundreds of documents, depending on the query.
- Each node i is described by a set of n learning to rank features x_{q,d_i} ∈ R^n; v_i ∈ R^d denotes the (normalized) visual embedding of document d_i.
- As we do not have an explicit graph structure, we consider the edges given by a k-nearest neighbor graph, based on a similarity between the embeddings v_i.
- We denote by N_i the neighborhood of node i, i.e. the set of nodes j such that there exists an edge from j to i.
- We consider edge weights, given by a similarity function between the visual features of the edge's two extremity nodes.

Our goal is to learn how to propagate features in the above graph. Generalizing convolution operations to graphs can be expressed as a message passing scheme [16]:

    h_i^{(l+1)} = \gamma\big( h_i^{(l)}, \square_{j \in N_i} \, \phi(h_i^{(l)}, h_j^{(l)}, e_{j,i}) \big)    (2)

where γ and φ denote differentiable functions, e.g. MLPs (Multi Layer Perceptrons), and □ is a differentiable, permutation-invariant aggregation (e.g. the sum). By choosing φ(h_i^{(l)}, h_j^{(l)}, e_{j,i}) = e_{j,i} h_j^{(l)} and the sum as aggregation, we obtain:

    h_i^{(l+1)} = \gamma\big( h_i^{(l)}, \sum_{j \in N_i} e_{j,i} \, h_j^{(l)} \big)    (3)

This graph convolution can be reduced to the cross-modal similarity in Eq. (1). Indeed, assuming that h_j^{(0)} = s_T(q, d_j) for d_j ∈ NN_K^T(q) (and 0 otherwise), that e_{j,i} = s_V(d_j, d_i), that γ simply returns the aggregated messages, and that N_i := N is the whole set of candidates to re-rank, then:

    h_i^{(1)} = \sum_{d_j \in NN_K^T(q)} s_T(q, d_j) \, s_V(d_j, d_i) = s_{CM}(q, d_i)

In other words, one layer defined with Eq. (3) includes the standard cross-modal relevance feedback as a special case. Equation (3) is more general, and can easily be used as a building block in a differentiable ranking architecture. In the following, we derive a simple convolution layer from Eq. (3), and introduce the complete architecture, called DCMM (Differentiable Cross-Modal Model), summarized in Fig. 1. The learning to rank features x_{q,d_i} are first encoded with an MLP(.; θ) with ReLU activations, in order to obtain node features h_i^{(0)}. Then the network splits into two branches:

- The first branch simply projects each h_i^{(0)} linearly to a score s_text(q, d_i), which acts as a pure text-based score.
- The second branch is built upon one or several layer(s) of cross-modal convolution, simply defined as:

    h_i^{(l+1)} = \sigma\big( W^{(l)} \sum_{j \in N_i} g(v_i, v_j) \, h_j^{(l)} \big)    (4)

For the edge function g, we consider two cases: the cosine similarity g_cos, defining the first model (referred to as DCMM-cos), and a simple similarity measure parametrized by a learned vector a operating on the pair of visual embeddings (v_i, v_j), defining the second model (referred to as DCMM-edge). After the convolution(s), the final embedding h_i^{(L)} of each node is projected to a real-valued score s_conv(q, d_i), using either a linear layer or an MLP with parameters ω. Finally, the two scores are combined to obtain the final ranking score:

    s(q, d_i) = s_text(q, d_i) + s_conv(q, d_i)    (5)
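To show how the pieces fit together, here is a compact PyTorch sketch of a DCMM-cos-style forward pass for one query graph (a single convolution layer, dense adjacency). It is a simplified reading of Fig. 1 and Eqs. (4)-(5) under our own naming and implementation choices; the paper itself uses PyTorch Geometric message passing, so this is an illustrative sketch rather than the authors' code.

```python
import torch
import torch.nn as nn

class DCMMCosSketch(nn.Module):
    """Simplified DCMM-cos for a single query graph (dense implementation)."""

    def __init__(self, n_features, hidden_dim, k_neighbors):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden_dim), nn.ReLU())
        self.text_score = nn.Linear(hidden_dim, 1)     # first branch: pure text-based score
        self.conv = nn.Linear(hidden_dim, hidden_dim)  # W^(l) of the cross-modal convolution
        self.conv_score = nn.Linear(hidden_dim, 1)     # projection of h^(L) to s_conv
        self.k = k_neighbors

    def forward(self, x, v):
        # x: (N, n_features) learning-to-rank features, v: (N, d) L2-normalized visual embeddings
        h = self.encoder(x)                             # h^(0)
        sim = v @ v.t()                                 # cosine similarities g_cos(v_i, v_j)
        # keep only the k strongest visual neighbors of each node (k-NN graph)
        topk = sim.topk(self.k, dim=-1)
        adj = torch.zeros_like(sim).scatter_(-1, topk.indices, topk.values)
        adj.fill_diagonal_(0.0)                         # drop the trivial self-similarity
        h_conv = torch.relu(self.conv(adj @ h))         # one cross-modal convolution layer
        # final score: text-only branch + convolution branch
        return self.text_score(h).squeeze(-1) + self.conv_score(h_conv).squeeze(-1)

# toy usage: 12 candidates, 4 LTR features, 16-dimensional visual embeddings
model = DCMMCosSketch(n_features=4, hidden_dim=8, k_neighbors=5)
x = torch.randn(12, 4)
v = torch.nn.functional.normalize(torch.randn(12, 16), dim=-1)
scores = model(x, v)                                    # (12,) one relevance score per candidate
```

In the paper, one to several such layers are stacked, and the fixed cosine can be replaced by the learned edge function of DCMM-edge.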
The model is trained using backpropagation and any standard learning to rank loss: pointwise, pairwise or listwise. It is worth remarking that, by extending PRF mechanisms to cross-modal re-ranking, our model is actually closer to the listwise context-based models introduced in Sect. 2 than to current state-of-the-art cross-modal retrieval models. It is listwise by design: an example in a batch is not a single image in isolation, but the set of all candidate images for a given query, encoded as a graph, which we aim to re-rank together in one shot. In our experiments, we used the pairwise BPR loss [46], which gave the best results. Consider a graph (i.e. the set of candidate documents for query q) in the batch, and all the feasible pairs of documents D_q^{+,-} for this query (by feasible, we mean all the pairs that can be made from one positive and one negative example in the graph). The loss is then defined as:

    L_q = - \sum_{(d^+, d^-) \in D_q^{+,-}} \log \sigma\big( s(q, d^+) - s(q, d^-) \big)

Note that, contrary to previous works on listwise context modeling, we consider a set of objects to re-rank, and not a sequence (for instance, in [4] an RNN encoder is learned for re-ranking). In other words, we discard the rank information of the first ranker in the re-ranking process: we claim that the role of the first retriever is to be recall-oriented, not precision-oriented. Thus, using the initial order might be too strong a prior, and add noisy information. Moreover, in the case of implicit feedback (clicks used as weak relevance signals), using rank information raises the issue of biased learning to rank (sensitivity to position and trust biases). It is also worth emphasizing that, contrary to most works on graph convolution models, our graph structure is somewhat implicit: while edges between nodes generally indicate an explicit relationship (for instance, a connection between two users in a social network), in our case an edge represents the visual similarity between two nodes.
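The listwise batching and the BPR objective above can be sketched as follows; this is our own minimal PyTorch version, where `scores` are the model outputs s(q, .) for one query graph and `labels` are the binary relevance judgments (the function name and masking trick are ours).

```python
import torch

def bpr_loss_per_query(scores, labels):
    """BPR loss over all feasible (positive, negative) pairs of one query graph.

    scores: (N,) tensor, s(q, d_i) for the N candidates of the query.
    labels: (N,) binary tensor, 1 for relevant candidates, 0 otherwise.
    """
    pos = scores[labels == 1]                    # scores of positive documents
    neg = scores[labels == 0]                    # scores of negative documents
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())              # no feasible pair for this query
    # pairwise differences s(q, d+) - s(q, d-) for every feasible pair
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)   # (n_pos, n_neg)
    return -torch.nn.functional.logsigmoid(diff).sum()

# toy usage: one query graph with 6 candidates, 2 of them relevant
scores = torch.randn(6, requires_grad=True)
labels = torch.tensor([1, 0, 0, 1, 0, 0])
loss = bpr_loss_per_query(scores, labels)
loss.backward()
```

In a real training loop, this per-query loss would be averaged over the query graphs in the batch.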
In the following, we introduce the two datasets used to validate our approach (a public dataset from a MediaEval challenge, and an annotated set of queries sampled from the image search logs of Naver, the biggest commercial search engine in Korea), as well as our experimental strategy. We emphasize that we restrict ourselves to two relatively small datasets and few features as input for the models. Even though the formulation from Eq. (3) is very general, our claim is that a simple model, i.e. one containing a few hundred to a few thousand parameters, should be able to reproduce the PRF mechanisms introduced in Sect. 3. When adapting the approach to larger datasets, the model capacity can be adjusted accordingly, in order to capture more complex relevance patterns. Note that we did not consider the standard datasets generally used to train joint embeddings, such as MSCOCO [34] or Flickr30k [43], because their retrieval scenario is rather artificial compared to web search: there are no explicit queries, and a text is only relevant to a single image. Furthermore, we tried to obtain the Clickture [24] dataset without success, and therefore cannot report on it.

MediaEval. We first conduct experiments on the dataset from the "MediaEval17, Retrieving Diverse Social Images Task" challenge. While this challenge also had a focus on diversity aspects, we solely consider the standard relevance ranking task. The dataset is composed of a ranked list of images (up to 300) for each query, retrieved from Flickr using its default ranking algorithm. The queries are general-purpose queries (e.g. q = autumn color), and each image has been annotated by expert annotators (binary label, i.e. relevant or not). The goal is to refine the results from the base ranking. The training set contains 110 queries (33340 images), while the test set contains 84 queries (24986 images). While we could consider any number of learning to rank features as input for our model, we choose to restrict ourselves to a very narrow set of weak relevance signals, in order to remain comparable to the unsupervised counterpart and to ensure that the gain does not come from the addition of richer features. Hence, we rely solely on four relevance scores, namely tf-idf, BM25, Dirichlet-smoothed LM [60] and the DESM score [39], computed between the query and each image's text component (the concatenation of the image title and tags). We use an Inception-ResNet model [48] pre-trained on ImageNet to obtain the image embeddings (d = 1536).

WebQ. In order to validate our approach on a real-world dataset, we sampled a set of 1000 queries from the image search logs of Naver. All images appearing in the top-50 candidates for these queries within a period of two weeks have been labeled by three annotators in terms of relevance to the query (binary label). Because of different query characteristics (in terms of frequency, difficulty, etc.), and given that new images are continuously added to or removed from the index, the number of images per query in our sample is variable (from around ten to a few hundred). Note that, while we actually have access to a much larger amount of click logs, we chose to restrict the experiments to this small sample in order to keep the evaluation simple. Our goal here is to show that we are able to learn and reproduce PRF mechanisms without relying on large amounts of data. Moreover, in this setting, it is easier to understand the model's behaviour, as we avoid dealing with click noise and position bias. After removing queries without relevant images (according to majority voting among the three annotators), our sample includes 952 queries and 43064 images, indexed through various text fields (title of the page, image caption, etc.). We select seven such fields that might contain relevant pieces of information, and for each of them compute two simple relevance features w.r.t. query q: BM25 and DESM [39] (using embeddings trained on a large query corpus from an earlier period). We also add an additional feature, a mixture of the two above, computed on the concatenation of all the fields. Image embeddings (d = 2048) are obtained using a ResNet-152 model [22] pre-trained on ImageNet.

Given the limited number of queries in both collections, we conducted 5-fold cross-validation by randomly splitting the queries into five folds. The model is trained on four folds (with one fold kept for validation, as we use early stopping on nDCG) and evaluated on the remaining one; this procedure is repeated five times. The average validation nDCG is then used to select the best model configuration. Note that for the MediaEval dataset we have access to a separate test set, so we modify the evaluation methodology slightly: we perform the above 5-fold cross-validation on the training set without using a validation fold (hence, we do not use early stopping, and the number of epochs becomes a hyper-parameter to tune). Once the best model has been selected with the above strategy, we re-train it on the full training set, and report the final performance on the test set. We report nDCG, MAP, P@20, and nDCG@20 for both datasets. We train the models using stochastic gradient descent with the Adam optimizer [29].
We tune the batch size (i.e. the number of graphs per batch). For MediaEval, we also tune the number of epochs ∈ {50, 100, 200, 300, 500}, while for WebQ, we set it to 500 and use early stopping with a patience of 80. All node features are query-level normalized (mean-std normalization). The models are implemented using PyTorch and PyTorch Geometric [14] for the message passing components.

In order to be fair, we want to compare methods with somewhat similar feature sets. Obviously, for the supervised methods, results could be improved by either adding richer or more features, or by increasing the models' capacity. For both datasets, we compare our DCMM model to the following baselines:

- A learning to rank model based only on textual features (LTR).
- The cross-modal similarity introduced in Sect. 3.1 [2, 3, 10, 11, 45] (CM).
- The above LTR model with the cross-modal similarity as an additional input feature (LTR+CM), to verify that it is actually beneficial to learn the cross-modal propagation in DCMM in an end-to-end manner.

For the cross-modal similarity, we use as a proxy for s_T(q, .) a simple mixture of a term-based relevance score (Dirichlet-smoothed LM for MediaEval and BM25 for WebQ) and the DESM score, computed on the concatenation of all text fields. From our experiments, we observe that it is actually beneficial to recombine the cross-modal similarity with the initial relevance s_T(q, .), using a simple mixture. Hence, three parameters are tuned (the two mixture parameters, and the number of neighbors for the query), following the evaluation methodology introduced in Sect. 4.2. The LTR models are standard MLPs: they correspond to the upper part of the architecture in Fig. 1 (text branch), and are tuned following the same strategy. We do not compare our models with joint embedding approaches on these datasets, for the reasons mentioned in Sect. 2, but also because our initial experiments on MediaEval gave poor results. For the sake of illustration, on MediaEval, 64% of the queries have no lemmas in common with training queries (35% for WebQ): given the relatively small size of these datasets, such models cannot generalize to unseen queries. This illustrates an "extreme" example of the generalization issues of joint embedding techniques, especially on tail queries. Meanwhile, as our model is fed with learning to rank features, especially term-based relevance scores like BM25, it could be less sensitive to generalization issues, for instance on new named entities. However, we want to emphasize that the two approaches are not antagonistic, but can actually be complementary. As our model can be seen as an extension of listwise learning to rank for bi-modal objects (if edges are removed, the model reduces to a standard MLP-based learning to rank model), it can take matching scores from joint embedding models as additional input node features. Since our model is an extension of PRF, we actually see the two approaches as operating at different stages of ranking.
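To make the unsupervised CM baseline and its three tuned hyper-parameters explicit, here is a small NumPy sketch in our own notation: alpha mixes the term-based score with DESM to form the proxy s_T(q, .), the cross-modal score of Eq. (1) is computed with K text-based neighbors, and beta re-combines it with s_T(q, .). The parameter names and the linear form of both mixtures are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def cm_baseline_scores(term_scores, desm_scores, visual_emb, alpha, beta, k):
    """Unsupervised cross-modal (CM) baseline with mixture re-combination.

    term_scores: (N,) term-based relevance (e.g. BM25 or Dirichlet-smoothed LM).
    desm_scores: (N,) DESM relevance scores.
    visual_emb:  (N, d) L2-normalized image embeddings (dot product acts as s_V).
    alpha, beta: mixture weights; k: number of text-based neighbors.
    """
    s_t = alpha * term_scores + (1.0 - alpha) * desm_scores    # proxy for s_T(q, .)
    top_k = np.argsort(-s_t)[:k]                               # NN_K^T(q)
    s_cm = (visual_emb @ visual_emb[top_k].T) @ s_t[top_k]     # Eq. (1)
    return beta * s_t + (1.0 - beta) * s_cm                    # re-combined ranking score

# toy usage with 8 candidates and 5-dimensional embeddings
rng = np.random.default_rng(1)
emb = rng.normal(size=(8, 5))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
final = cm_baseline_scores(rng.random(8), rng.random(8), emb, alpha=0.7, beta=0.5, k=3)
```

In the experiments, alpha, beta and k would be selected by cross-validation, as described above.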
Table 1 gathers the main results of our study. Unsurprisingly, going from a pure text ranker to a model using both media types improves the results by a large margin (all models are significantly better than the text-based LTR model, so we omit these significance tests from Table 1 for clarity). Moreover, the results indicate that combining the initial features with the unsupervised cross-modal similarity in an LTR model slightly improves results over the latter (though not significantly) on the MediaEval dataset, while it has no effect on WebQ: this is likely because the features are somewhat redundant in our setting, due to the way s_T(q, .) is computed for the cross-modal similarity; the same would not hold with a richer set of features for the LTR models. Furthermore, the DCMM-cos model outperforms all baselines, with larger margins on MediaEval than on WebQ; the only significant result (p-value < 0.05) is obtained for the MAP on MediaEval. Nevertheless, it shows that this simple architecture (the most straightforward extension of the cross-modal similarity introduced in Sect. 3.1), with a handful of parameters (see Table 1) and trained on small datasets, is able to reproduce PRF mechanisms. Interestingly, results tend to drop as we increase the number of layers (best results are obtained with a single convolution layer), no matter the number of neighbors chosen to define the visual graph. While this might be related to the relative simplicity of the model, it echoes common observations in PRF models (e.g. [3]): if we propagate too much, we also tend to diffuse information too much. We can also draw a parallel with over-smoothing in GNNs [33], which might be more critical for PRF, especially considering the simplicity of this model. The DCMM-edge model shows interesting results: on WebQ, we manage to improve results significantly w.r.t. the CM similarity, while on MediaEval, results are slightly worse than DCMM-cos (except for the MAP). This might be because images in MediaEval are more similar to the ones used to train the image signatures, compared to the (noisy) web images in WebQ; hence, learning a new metric between images has less impact. Interestingly, for both datasets, the best results are obtained with more than a single layer; we hypothesize that the edge function acts as a simple filter on edges, allowing information from useful nodes to be propagated across more layers. Note that the number of layers needed for the task is tied to how we define the input graph: the fewer neighbors we consider for each node, the more layers might be needed for each node to gather information from the useful ones. In Fig. 2, we observe that if the number of neighbors is too small (e.g. 3 or 5), the model needs more layers to improve performance. On the other hand, when considering too many neighbors (e.g. 20 or all), each node already has access to all the useful neighbors, so adding layers only reduces performance. We need to find the right balance between the number of neighbors and the number of convolution layers, so that the model can learn to propagate relevant signals (e.g. 10 neighbors and 3 layers for WebQ).

In this paper, we have proposed a reformulation of unsupervised cross-modal PRF mechanisms for image search as a differentiable architecture relying on graph convolutions. Compared to its unsupervised counterpart, our approach can integrate any set of features, while providing high flexibility in the design of the architecture. Experiments on two datasets showed that a simple model derived from our formulation achieves comparable or better performance than cross-modal PRF. There are many extensions and possible directions stemming from the relatively simple model we have studied.
Given enough training data (e.g. a large amount of click logs), we could for instance learn to dynamically filter the visual similarity by using an attention mechanism to choose which nodes to attend to, in the spirit of Graph Attention Networks [50] and the Transformer model [49], removing the need to set the number of neighbors in the input graph. Finally, our approach directly addresses the cross-modal retrieval task, but its application to the more general PRF problem in IR remains possible.

References

[1] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
[2] XRCE's Participation to ImageCLEF
[3] Unsupervised visual and textual information fusion in CBMIR using graph-based methods
[4] Learning a deep listwise context model for ranking refinement
[5] Deep canonical correlation analysis
[6] Revisiting approximate metric optimization in the age of deep neural networks
[7] From RankNet to LambdaRank to LambdaMART: an overview
[8] Cross-modal retrieval in the cooking context: learning semantic text-image embeddings
[9] AMC: attention guided multimodal correlation learning for image search
[10] Trans-media pseudo-relevance feedback methods in multimedia retrieval
[11] Unsupervised visual and textual information fusion in multimedia retrieval - a graph-based point of view
[12] Convolutional neural networks for soft-matching n-grams in ad-hoc search
[13] Modeling diverse relevance patterns in ad-hoc retrieval
[14] Fast graph representation learning with PyTorch Geometric
[15] DeViSE: a deep visual-semantic embedding model
[16] Neural message passing for quantum chemistry
[17] A multi-view embedding space for modeling internet images, tags, and their semantics
[18] Improving image-sentence embeddings using large weakly annotated photo collections
[19] Beyond instance-level image retrieval: leveraging captions to learn a global visual representation for semantic retrieval
[20] A deep relevance matching model for ad-hoc retrieval
[21] Inductive representation learning on large graphs
[22] Deep residual learning for image recognition
[23] Scalable deep multimodal learning for cross-modal retrieval
[24] Clickage: towards bridging semantic and intent gaps via mining click logs of search engines
[25] A position-aware deep model for relevance matching in information retrieval
[26] RE-PACRR: a context and density-aware neural information retrieval model
[27] Multi-view deep network for cross-view classification
[28] Multi-modal image retrieval with random walk on multi-layer graphs
[29] Adam: a method for stochastic optimization
[30] Semi-supervised classification with graph convolutional networks
[31] Relevance based language models
[32] NPRF: a neural pseudo relevance feedback framework for ad-hoc information retrieval
[33] Deeper insights into graph convolutional networks for semi-supervised learning
[34] Microsoft COCO: common objects in context
[35] Deep coupled metric learning for cross-modal matching
[36] Cascade ranking for operational e-commerce search
[37] An updated duet model for passage re-ranking
[38] Learning to match using local and distributed representations of text for web search
[39] A dual embedding space model for document ranking
[40] Task-oriented query reformulation with reinforcement learning
[41] DeepRank: a new deep architecture for relevance ranking in information retrieval
[42] Personalized context-aware re-ranking for e-commerce recommender systems
[43] Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models
[44] Cross-modal bidirectional translation via reinforcement learning
[45] NLE@MediaEval'17: combining cross-media similarity and embeddings for retrieving diverse social images
[46] BPR: Bayesian personalized ranking from implicit feedback
[47] Dropout: a simple way to prevent neural networks from overfitting
[48] Inception-v4, Inception-ResNet and the impact of residual connections on learning
[49] Attention is all you need
[50] Graph attention networks
[51] Adversarial cross-modal retrieval
[52] Learning two-branch neural networks for image-text matching tasks
[53] Learning deep structure-preserving image-text embeddings
[54] WSABIE: scaling up to large vocabulary image annotation
[55] Learning semantic structure-preserved embeddings for cross-modal retrieval
[56] End-to-end neural ad-hoc ranking with kernel pooling
[57] How powerful are graph neural networks?
[58] Deep correlation for matching images and text
[59] Graph convolutional neural networks for web-scale recommender systems
[60] A study of smoothing methods for language models applied to ad hoc information retrieval
[61] Deep cross-modal projection learning for image-text matching
[62] A domain generalization perspective on listwise context modeling