key: cord- -kgje qy authors: suominen, hanna; kelly, liadh; goeuriot, lorraine; krallinger, martin title: clef ehealth evaluation lab date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: kgje qy laypeople’s increasing difficulties to retrieve and digest valid and relevant information in their preferred language to make health-centred decisions has motivated clef ehealth to organize yearly labs since . these evaluation tasks on information extraction (ie), management, and information retrieval (ir) in – have been popular—as demonstrated by the large number of team registrations, submissions, papers, their included authors, and citations ( , , , , and , respectively, up to and including )—and achieved statistically significant improvements in the processing quality. in , clef ehealth is calling for participants to contribute to the following two tasks: the task on ie focuses on term coding for clinical textual data in spanish. the terms considered are extracted from clinical case records and they are mapped onto the spanish version of the international classification of diseases, the th revision, including also textual evidence spans for the clinical codes. the task is a novel extension of the most popular and established task in clef ehealth on chs. this ir task uses the representative web corpus used in the challenge, but now also spoken queries, as well as textual transcripts of these queries, are offered to the participants. the task is structured into a number of optional subtasks, covering ad-hoc search using the spoken queries, textual transcripts of the spoken queries, or provided automatic speech-to-text conversions of the spoken queries. in this paper we describe the evolution of clef ehealth and this year’s tasks. the substantial community interest in the tasks and their resources has led to clef ehealth maturing as a primary venue for all interdisciplinary actors of the ecosystem for producing, processing, and consuming electronic health information. substantial community interest in the tasks and their resources has led to clef ehealth maturing as a primary venue for all interdisciplinary actors of the ecosystem for producing, processing, and consuming electronic health information. keywords: ehealth · medical informatics · information extraction · information storage and retrieval · speech recognition improving the legibility of electronic health record (ehr) can contribute to patients' right to be informed about their health and health care. the requirement to ensure that patients can understand their own privacy-sensitive, official health information in their ehr are stipulated by policies and laws. for example, the declaration on the promotion of patients' rights in europe by world health organization (who) from obligates health care workers to communicate in a way appropriate to each patient's capacity for understanding and give each patient a legible written summary of these care guidelines. this patient education must capture the patient's health status, condition, diagnosis, and prognosis, together with the proposed and alternative treatment/non-treatment with risks, benefits, and progress. patients' better abilities to understand their own ehr empowers them to take part in the related health/care judgment, leading to their increased independence from health care providers, better health/care decisions, and decreased health care costs [ ] . 
improving patients' ability to digest this content could mean enriching the ehr-text with hyperlinks to term definitions, paraphrasing, care guidelines, and further supportive information on patientfriendly and reliable websites, and the enabling methods for such reading aids can also release health care workers' time from ehr-writing to, for example, longer patient-education discussions [ ] . information access conferences have organized evaluation labs on related electronic health (ehealth) information extraction (ie), information management (im), and information retrieval (ir) tasks for almost years. yet, with rare exception, they have targeted the health care experts' information needs only [ , , ] . such exception, the clef ehealth evaluation-lab and lab-workshop series has been organized every year since as part of the conference and labs of the evaluation forum (clef) [ , , [ ] [ ] [ ] , , ] . in , the inaugural scientific clef workshop took place, and from - this annual workshop has been supplemented with a lead-up evaluation lab, consisting of, on average, three shared tasks each year (fig. ) . although the tasks have been centered around the patients and their families' needs in accessing and understanding ehealth information, also automatic speech recognition (asr) and ie to aid clinicians in im were considered in - and in - , tasks on technology assisted reviews to support health scientists and health care policymakers' information access were organized. this paper presents first an overview of clef ehealth lab series from to and introduces its evaluation tasks. then, it concludes by presenting our vision for clef ehealth beyond . clef ehealth tasks offered yearly from have brought together researchers working on related information access topics, provided them with resources to work with and validate their outcomes, and accelerated pathways from scientific ideas to societal impact. in , , , , , , and as many as , , , , , , and teams have registered their expression of interest in the clef ehealth tasks, respectively, and the number of teams proceeding to the task submission stage has been , , , , , , and , respectively [ , , [ ] [ ] [ ] , ] . according to our analysis of the impact of clef ehealth labs up to [ ] , the submitting teams have achieved statistically significant improvements in the processing quality in at least out of the top- methods submitted to the following eight tasks: [ ] . clef ehealth lab workshop has resulted in papers and each year clef ehealth - evaluation labs have increased this number from to . in accordance with the clef ehealth mission to foster teamwork, the number of co-authors per paper has been from to (the mean and standard deviation of and , respectively). in about a quarter of the papers, this co-authoring collaboration has been international, and sometimes even intercontinental. this substantial community interest in the clef ehealth tasks and their resources has led to the evaluation campaign maturing and establishing its presence over the years. in , clef ehealth is one of the primary venues for all interdisciplinary actors of the ecosystem for producing, processing, and consuming ehealth information [ , , ] . its niche is addressing health information needs of laypeople-and not health care experts only-in retrieving and digesting valid and relevant ehealth information to make health-centered decisions. 
the clef ehealth task on ie, called codiesp supported by the spanish national plan for the advancement of language technology (plan tl), builds upon the five previous editions of the task in - [ , , , , ] that have already addressed the analysis of biomedical text in english, french, hungarian, italian, and german. this year, the codiesp task, will focus on the international classification of diseases, the th revision (icd ) coding for clinical case data in spanish using the spanish version of icd (cie ). the codiesp task will explore the automatic assignment of cie codes to clinical case documents in spanish, namely of two categories: procedure and diagnosis (known as 'procedimiento' and 'diagnostico' in spanish). the following three subtasks will be posed: ( ) codiesp diagnosis coding will consist of automatically assigning diagnosis codes to clinical cases in spanish. ( ) codiesp procedure coding will focus on assigning procedure codes to clinical cases in spanish. ( ) codiesp explainable artificial intelligence (ai) will evaluate the explainability/interpretability of the proposed systems, as well as their performance by requesting to return the text spans supporting the assignment of cie codes. the codiesp corpus used for this task consists of a total of , clinical cases that were manually annotated by clinical coding professionals with clinical procedure and diagnosis codes from the spanish version of icd together with the actual minimal text spans supporting the clinical codes. the codiesp corpus has around , sentences, and contains about , words and , clinical codes. code annotations will be released in a separate file together with the respective document code and the span of text that leads to the codification (the evidence). additional data resources including medical literature abstracts in spanish indexed with icd codes, linguistic resources, gazetteers, and a background set of medical texts in spanish will also be released to complement the codiesp corpus, together with annotation guidelines and details. for the codiesp diagnosis and procedure coding subtasks, participants will submit their coding predictions returning ranked results. for every document, a list of possible codes will be submitted, ordered by confidence or relevance. since these subtasks are designed to be ranking competitions, they will be evaluated on a standard ranking metric: mean average precision. for the codiesp explainable ai subtask, explainability of the systems will be considered, in addition to their performance on the test set. systems have to provide textual evidence from the clinical case documents that supports the code assignment and thus can be interpreted by humans. this automatically returned evidence will be evaluated against manually annotated text spans. true positive evidence texts are those that consist in a sub-match of the manual annotations. f will be used as the primary evaluation metric. the clef ehealth task on ir builds on the tasks that have run at clef ehealth since its inception in . this consumer health search (chs) task follows a standard ir shared challenge paradigm from the perspective that it provides participants with a test collection consisting of a set of documents and a set of topics to develop ir techniques for. runs submitted by participants are pooled, and manual relevance assessments conducted. performance measures are then returned to participants. in the clef ehealth chs task, similarly to , we used the clueweb b document collection [ , ] . 
this consisted of a collection of . million medically related web pages. given the scale of this document collection participants reported that it was difficult to store and manipulate the document collection. in response, the chs task introduced a new document collection, named clefehealth . this collection consists of over million medical webpages from selected domains acquired from the commoncrawl [ ] . given the positive feedback received for this document collection, it will be used again in the chs task. historically the clef ehealth ir task has released text queries representative of layperson information needs in various scenarios. in recent years, query variations issued by multiple laypeople for the same information need have been offered. in this year's task we extend this to spoken queries. these spoken queries are generated by individuals using the information needs derived for the challenge [ ] . we also provide textual transcripts of these spoken queries and asr translations. given the query variants for an information need, participants are challenged in the task with retrieving the relevant documents from the provided document collection. this is divided into a number of subtasks which can be completed using the spoken queries or their textual transcripts by hand or asr. similar to the chs tasks, subtasks explored this year are: adhoc/personalized search, query variations, and search intent with binary preference, mean reciprocal rank, normalized discounted cumulative gain@ - , and (understandability-biased) rank-biased precision as subtask-dependent evaluation measures. participants can submit multiple runs for each subtask. the general purpose of clef ehealth throughout the years, as its ie and ir tasks demonstrate, has been to assist laypeople in finding and understanding health information in order to make enlightened decisions. breaking language barriers has been our priority over the years, and this will continue in our multilingual tasks. text has been our major media of interest, but speech has been, and continues to be, included in tasks as a major new way of interacting with systems. each year of the labs has enabled the identification of difficulties and challenges in ie, im, and ir which have shaped our tasks. for example, popular ir tasks have considered multilingual, contextualized, and/or spoken queries and query variants. however, further exploration of query construction, aiming at a better understanding of chs are still needed. the task into the future will also further explore relevance dimensions, and work toward a better assessment of readability and reliability, as well as methods to take these dimensions into consideration. as lab organizers, our purpose is to increase the impact and the value of the resources, methods and the community built by clef ehealth. examining the quality and stability of the lab contributions will help the clef ehealth series to better understand where it should be improved and how. as future work, we intend continuing our analyses of the influence of the clef ehealth evaluation series from the perspectives of publications and data/software releases [ , , ] . 
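To make the ranking-based evaluation of the CodiEsp coding subtasks concrete, the sketch below computes mean average precision over per-document ranked code lists. It is illustrative only and not the official evaluation script: the function names, document identifiers and codes are made up, and the handling of ties or empty gold sets may differ in the actual task setup.

def average_precision(ranked_codes, gold_codes):
    """AP for one document: ranked_codes is the predicted code list ordered
    by confidence, gold_codes is the set of manually assigned codes."""
    if not gold_codes:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, code in enumerate(ranked_codes, start=1):
        if code in gold_codes:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold_codes)

def mean_average_precision(predictions, gold_standard):
    """predictions: dict doc_id -> ranked list of codes;
    gold_standard: dict doc_id -> set of gold codes."""
    ap_values = [average_precision(predictions.get(doc_id, []), gold)
                 for doc_id, gold in gold_standard.items()]
    return sum(ap_values) / len(ap_values)

# illustrative usage with invented document identifiers and codes
gold = {"doc1": {"code_a", "code_b"}, "doc2": {"code_c"}}
runs = {"doc1": ["code_a", "code_x", "code_b"], "doc2": ["code_c"]}
print(mean_average_precision(runs, gold))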
aspiring to unintended consequences of natural language processing: a review of recent developments in clinical and consumergenerated text processing advancing the state of the art in clinical natural language processing through shared tasks an analysis of evaluation campaigns in ad-hoc medical information retrieval: clef ehealth overview of the clef ehealth evaluation lab clef ehealth evaluation lab overview community challenges in biomedical text mining over years: success, failure and the future overview of the clef consumer health search task overview of the clef ehealth evaluation lab overview of the share/clef ehealth evaluation lab overview of the clef ehealth evaluation lab patient empowerment: the need to consider it as a measurable patient-reported outcome for chronic conditions working notes of conference and labs of the evaluation (clef) forum. ceur workshop proceedings clefehealth -the clef workshop on cross-language evaluation of methods, applications, and resources for ehealth document analysis working notes scholarly influence of the conference and labs of the evaluation forum ehealth initiative: review and bibliometric study of the to outcomes information retrieval evaluation in a changing world: lessons learned from years of clef overview of the clef ehealth evaluation lab overview of the share/clef ehealth evaluation lab the ir task at the clef ehealth evaluation lab : usercentred health information retrieval key: cord- -d oru w authors: leekha, maitree; goswami, mononito; jain, minni title: a multi-task approach to open domain suggestion mining using language model for text over-sampling date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: d oru w consumer reviews online may contain suggestions useful for improving commercial products and services. mining suggestions is challenging due to the absence of large labeled and balanced datasets. furthermore, most prior studies attempting to mine suggestions, have focused on a single domain such as hotel or travel only. in this work, we introduce a novel over-sampling technique to address the problem of class imbalance, and propose a multi-task deep learning approach for mining suggestions from multiple domains. experimental results on a publicly available dataset show that our over-sampling technique, coupled with the multi-task framework outperforms state-of-the-art open domain suggestion mining models in terms of the f- measure and auc. consumers often express their opinions towards products and services through online reviews and discussion forums. these reviews may include useful suggestions that can help companies better understand consumer needs and improve their products and services. however, manually mining suggestions amid vast numbers of non-suggestions can be cumbersome, and equated to finding needles in a haystack. therefore, designing systems that can automatically mine suggestions is essential. the recent semeval [ ] challenge on suggestion mining saw many researchers using different techniques to tackle the domain-specific task (in-domain suggestion mining). however, open-domain suggestion mining, which obviates the need for developing separate suggestion mining systems for different domains, is still an emerging research problem. we formally define the problem of open-domain suggestion mining as follows: building on the work of [ ] , we design a framework to detect suggestions from multiple domains. 
we formulate a multitask classification problem to identify both the domain and nature (suggestion or non-suggestion) of reviews. furthermore, we also propose a novel language model-based text over-sampling approach to address the class imbalance problem. we use the first publicly available and annotated dataset for suggestion mining from multiple domains created by [ ] . it comprises of reviews from four domains namely, hotel, electronics, travel and software. during pre-processing, we remove all urls (eg. https:// ...) and punctuation marks, convert the reviews to lower case and lemmatize them. we also pad the text with start s and end e symbols for over-sampling. one of the major challenges in mining suggestions is the imbalanced distribution of classes, i.e. the number of non-suggestions greatly outweigh the number of suggestions (refer table ). to this end, studies frequently utilize synthetic minority over-sampling technique (smote) [ ] to over-sample the minority class samples using the text embeddings as features. however, smote works in table . datasets and their sources used in our study [ ] . the class ratio column highlights the extent of class imbalance in the datasets. the travel datasets have lower inter-annotator agreement than the rest, indicating that they may contain confusing reviews which are hard to confidently classify as suggestions or non-suggestions. this also reflects in our classification results. the euclidean space and therefore does not allow an intuitive understanding and representation of the over-sampled data, which is essential for qualitative and error analysis of the classification models. we introduce a novel over-sampling technique, language model-based over-sampling technique (lmote), exclusively for text data and note comparable (and even slightly better sometimes) performance to smote. we use lmote to over-sample the number of suggestions before training our classification model. for each domain, lmote uses the following procedure to over-sample suggestions: find top η n-grams: from all reviews labelled as suggestions (positive samples), sample the top η = most frequently occurring n-grams (n = ). for example, the phrase "nice to be able to" occurred frequently in many domains. train a bilstm language model on the positive samples (suggestions). the bilstm model predicts the probability distribution of the next word (w t ) over the whole vocabulary (v ∪ e) based on the last n = words (w t− , . . . , w t− ), i.e., the model learns to predict the probability distribution n-grams: using the language model and a randomly chosen frequent -gram as the seed, we generate text by repeatedly predicting the most probable next word (w t ), until the end symbol e is predicted. table comprises of the most frequent -grams and their corresponding suggestions 'sampled' using lmote. in our study, we generate synthetic positive reviews till the number of suggestion and non-suggestion class samples becomes equal in the training set. seed ← random(n grams) : sample ← lmotegenerate(language model, seed) : s ← s ∪ sample : end while : return s algorithm summarizes the lmote over-sampling methodology. following is a brief description of the sub-procedures used in the algorithm: • lmotegenerate(language model, seed): the procedure takes as input the trained language model and a randomly chosen n-gram from the set of top η n-grams as seed, and starts generating a review till the end tag, e is produced. the procedure is repeated until we have a total of n suggestion reviews. 
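A minimal sketch of the LMOTE procedure summarized above follows. To keep it self-contained, it replaces the BiLSTM language model with a simple count-based next-word model over the last n words; the rest (top-η frequent n-gram seeds, greedy generation until the end symbol, looping until the desired number of synthetic suggestions is reached) follows the description above. All names and defaults (n = 5, η = 100, the length cap) are illustrative assumptions, not the authors' settings.

import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

def train_counts_lm(suggestions, n=5):
    """Count-based stand-in for the BiLSTM language model: maps the last n
    words to a distribution over the next word. suggestions is a list of
    token lists labelled as positive (suggestion) samples."""
    model = defaultdict(Counter)
    for tokens in suggestions:
        padded = [START] * n + tokens + [END]
        for i in range(n, len(padded)):
            model[tuple(padded[i - n:i])][padded[i]] += 1
    return model

def top_ngrams(suggestions, n=5, eta=100):
    """The eta most frequent n-grams among the positive samples, used as seeds."""
    counts = Counter()
    for tokens in suggestions:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [list(gram) for gram, _ in counts.most_common(eta)]

def lmote_generate(model, seed, n=5, max_len=60):
    """Greedily extend a frequent n-gram seed, predicting the most probable
    next word until the end symbol is produced (or a length cap is hit)."""
    tokens = list(seed)
    while tokens[-1] != END and len(tokens) < max_len:
        candidates = model.get(tuple(tokens[-n:]))
        if not candidates:
            break
        tokens.append(candidates.most_common(1)[0][0])
    return [t for t in tokens if t not in (START, END)]

def lmote_oversample(suggestions, target_count, n=5, eta=100):
    """Generate synthetic suggestions until the positive class reaches target_count."""
    model = train_counts_lm(suggestions, n)
    seeds = top_ngrams(suggestions, n, eta)
    synthetic = []
    while seeds and len(suggestions) + len(synthetic) < target_count:
        synthetic.append(lmote_generate(model, random.choice(seeds), n))
    return synthetic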
multi-task learning (mtl) has been successful in many applications of machine learning since sharing representations between auxiliary tasks allows models to generalize better on the primary task. figure b illustrates -dimensional umap [ ] visualization of text embeddings of suggestions, coloured by their domain. these embeddings are outputs of the penultimate layer (dense layer before the final softmax layer) of the single task (stl) ensemble baseline. it can be clearly seen that suggestions from different domains may have varying feature representations. therefore, we hypothesize that we can identify suggestions better by leveraging domain-specific information using mtl. therefore, in the mtl setting, given a review r i in the dataset, d, we aim to identify both the domain of the review, as well as its nature. we use an ensemble of three architectures namely, cnn [ ] to mirror the spatial perspective and preserve the n-gram representations; attention network to learn the most important features automatically; and a bilstm-based text rcnn [ ] model to capture the context of a text sequence (fig. ) . in the mtl setting, the ensemble has two output softmax layers, to predict the domain and nature of a review. the stl baselines on the contrary, only have a singe softmax layer to predict the nature of the review. we use elmo [ ] word embeddings trained on the dataset, as input to the models. we conducted experiments to assess the impact of over-sampling, the performance of lmote and the multi-task model. we used the same train-test split as provided in the dataset for our experiments. all comparisons have been made in terms of the f- score of the suggestion class for a fair comparison with prior work on representational learning for open domain suggestion mining [ ] (refer baseline in table ). for a more insightful evaluation, we also compute the area under receiver operating characteristic (roc) curves for all models used in this work. tables , over-sampling improves performance. to examine the impact of oversampling, we compared the performance of our ensemble classifier with and without over-sampling i.e. we compared results under the stl, stl + smote and stl + lmote columns. our results confirm that in general, over-sampling suggestions to obtain a balanced dataset improves the performance (f- score & auc) of our classifiers. we compared the performance of smote and lmote in the single task settings (stl + smote and stl + lmote ) and found that lmote performs comparably to smote (and even outperforms it in the electronics and software domains). lmote also has the added advantage of resulting in intelligible samples which can be used to qualitatively analyze and troubleshoot deep learning based systems. for instance, consider suggestions created by lmote in table . while the suggestions may not be grammatically correct, their constituent phrases are nevertheless semantically sensible. multi-task learning outperforms single-task learning. we compared the performance of our classifier in single and multi-task settings (stl + lmote and mtl + lmote ) and found that by multi-task learning improves the performance of our classifier. we qualitatively analysed the single and multi task models, and found many instances where by leveraging domain-specific information the multi task model was able to accurately identify suggestions. for instance, consider the following review: "bring a lan cable and charger for your laptop because house-keeping doesn't provide it." 
while the review appears to be an assertion (non-suggestion), by predicting its domain (hotel), the multitask model was able to accurately classify it as a suggestion. in this work, we proposed a multi-task learning framework for open domain suggestion mining along with a novel language model based over-sampling technique for text-lmote. our experiments revealed that multi-task learning combined with lmote over-sampling outperformed considered alternatives in terms of both the f -score of the suggestion class and auc. smote: synthetic minority over-sampling technique convolutional neural networks for sentence classification recurrent convolutional neural networks for text classification umap: uniform manifold approximation and projection for dimension reduction suggestion mining from text semeval- task : suggestion mining from online reviews and forums. in: semeval@naacl-hlt deep contextualized word representations key: cord- -pacy qx authors: muhammad, shamsuddeen hassan; brazdil, pavel; jorge, alípio title: incremental approach for automatic generation of domain-specific sentiment lexicon date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: pacy qx sentiment lexicon plays a vital role in lexicon-based sentiment analysis. the lexicon-based method is often preferred because it leads to more explainable answers in comparison with many machine learning-based methods. but, semantic orientation of a word depends on its domain. hence, a general-purpose sentiment lexicon may gives sub-optimal performance compare with a domain-specific lexicon. however, it is challenging to manually generate a domain-specific sentiment lexicon for each domain. still, it is impractical to generate complete sentiment lexicon for a domain from a single corpus. to this end, we propose an approach to automatically generate a domain-specific sentiment lexicon using a vector model enriched by weights. importantly, we propose an incremental approach for updating an existing lexicon to either the same domain or different domain (domain-adaptation). finally, we discuss how to incorporate sentiment lexicons information in neural models (word embedding) for better performance. sentiment lexicon is a dictionary of a lexical item with the corresponding semantic orientation. recently, with the issue of growing concern about interpretable and explainable artificial intelligence, domains that require high explainability in sentiment analysis task (eg., health domain and financial domain), lexicon-based sentiment analysis approaches are often preferred over machine-learning-based approaches [ , ] . however, sentiment lexicons are domain-dependent, a word may convey two different connotations in a different domain. for example, the word high may have a positive connotation in economics (e.g., he has a high salary), and negative connotation in medicine (e.g., he has a high blood pressure). therefore, general-purpose sentiment lexicon may not give the expected predictive accuracy across different domains. thus, a lexicon-based approach with domain-specific lexicons are used to achieve better performance [ , ] . although research has been carried out on corpus-based approaches for automatic generation of a domain-specific lexicon [ , , , , , , ] , existing approaches focused on creation of a lexicon from a single corpus [ ] . afterwards, one cannot automatically update the lexicon with a new corpus. 
there are many reasons one would want to update an existing lexicon: (i) the existing lexicon may not contain sufficient number of sentiment-bearing words (i.e., it is limited) and it needs to be extended with a corpus from the same domain with a source corpus; (ii) the language may have evolved (new words and meaning changes) and it is necessary to update the existing lexicon with a new corpus. the new corpus may not be large to enable generation of a new lexicon from scratch. thus, it is better to update the existing lexicon with the new corpus; and (iii) we need to update an existing lexicon to another domain (domainadaptation) with a corpus from different domain with the source corpus. to this end, this work proposes an incremental approach for the automatic generation of a domain-specific sentiment lexicon. we aim to investigate an incremental technique for automatically generating domain-specific sentiment lexicon from a corpus. specifically, we aim to answer the following three research questions: can we automatically generate a sentiment lexicon from a corpus and improves the existing approaches? rq : can we automatically update an existing sentiment lexicon given a new corpus from the same domain (i.e., to extend an existing lexicon to have more entries) or from a different domain (i.e., to adapt the existing lexicon to a new domain -domain adaptation)? rq : how can we enrich the existing sentiment lexicons using information obtained from neural models (word embedding)? to the best of our knowledge, no one attempted to design an approach for automatic construction of a sentiment lexicon in an incremental fashion. but, incremental approaches are common in the area of data streaming [ ] ; thus, our work could fill this gap and represent a novel contribution. the research plan is structured as follows: sect. . attempts to answer rq , sect. . attempts to answer rq , and sect. . attempts to answer rq . sattam et al. [ ] introduced a novel domain agnostic sentiment lexicon-generation approach from a review corpus annotated with star-ratings. we propose an extended approach that includes the use of weight vector. also, our approach includes verbs and nouns in the lexicon as studies show they contain sentiment [ , ] . the process includes the following four steps: (i) gathering data annotated with star-ratings; (ii) pre-processing the data; (iii) obtaining wordtag rating distribution, as shown in fig. from the corpus introduced in [ ] ; and (iv) generation of sentiment value for each word-tag pair using the equation: where f r w−t represents the frequency of word-tag pair and w is a weight vector. if the result is positive, the word is categorize as positive, otherwise it is negative. this basic approach of sentiment lexicon generation forms the basis of the incremental approach proposes in sect. . . we propose an incremental approach for sentiment lexicon expansion to either the same domain or different domain (domain-adaptation). to illustrate the approaches, assume we have a sentiment lexicon l i generated from a corpus c i (using the approach described in sect. . ). then, we receive a new batch of corpus c i+ (of the same or different domain with c i ). the incremental approach aims to generate an updated sentiment lexicon l i+ that would improve the accuracy of the lexicon l i . assume we receive c i+ and we want to update l i . assume we have the distributions of all the words in the previous corpus (c i ) saved. 
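Before turning to the incremental update, a small sketch of what these saved per-word distributions could look like and how a sentiment value might be derived from them. The exact equation and weight vector referred to above are not reproduced here: the five-point rating scale, the illustrative weights and all function names are assumptions, and the merge step only indicates how the saved "sufficient statistics" of an old corpus and a new batch could be combined, with an optional recency weight for the new batch.

from collections import Counter, defaultdict

# Illustrative weight vector for a 1-5 star scale: negative weights for low
# ratings, positive for high ones. The actual weights are a modelling choice.
WEIGHTS = {1: -2.0, 2: -1.0, 3: 0.0, 4: 1.0, 5: 2.0}

def collect_distributions(reviews):
    """reviews: iterable of (tagged_tokens, star_rating) pairs, e.g.
    ([("high", "ADJ"), ("salary", "NOUN")], 5).
    Returns {(word, tag): Counter({rating: frequency})} -- the statistics
    that are saved and later merged during incremental updates."""
    dist = defaultdict(Counter)
    for tagged_tokens, rating in reviews:
        for word, tag in tagged_tokens:
            dist[(word, tag)][rating] += 1
    return dist

def sentiment_value(rating_counts, weights=WEIGHTS):
    """Weighted combination of the rating frequencies of one word-tag pair;
    a positive result is treated as a positive entry, negative as negative."""
    total = sum(rating_counts.values())
    if total == 0:
        return 0.0
    return sum(weights[r] * f for r, f in rating_counts.items()) / total

def merge_distributions(old, new, new_weight=1.0):
    """Merge saved distributions with those of a new batch; new_weight can
    up-weight recent batches, e.g. for domain adaptation."""
    merged = defaultdict(Counter, {k: Counter(v) for k, v in old.items()})
    for key, counts in new.items():
        for rating, freq in counts.items():
            merged[key][rating] += new_weight * freq
    return merged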
a naive approach would involve generating distributions of all the words in the new batch (c i+ ) without creating a new lexicon from it. such a distribution represents the so-called "sufficient statistics" [ ] and we can construct lexicon from each set of the distributions. to update l i , the two sets of distributions (from c i and c i+ ) are first merged and updated lexicon (l i+ ) is generated using the approach described in sect. . . however, this approach may be inefficient since we update all the words in the existing lexicon. an enhanced and more efficient approach aims to update only subset of the words in l i whose orientation may have changed. this approach use l i to predict the user's sentiment rating scores on the new labelled corpus c i+ sentences. if the predicted rating scores are the same with the user's sentiment ratings, we can skip those sentences and only consider those sentences where the predicted rating is significantly different from the user's sentiment rating scores. we extract the words from these sentences (reviews), elaborate the corresponding distribution of sentiment values, merge the distribution with the corresponding subset in the l i and generate a new sentiment lexicon l i+ . assume we receive c i+ and we want to update l i to a new domain. firstly, we propose to detect if c i+ and c i are from different domain. to do this, we generate the distribution of c i+ and compare it with the distribution of c i . if the distributions of the two corpora differ significantly, it indicates a domain shift. alternatively, we can use l i to predict the user's sentiment rating scores on the new labelled corpus c i+ sentences. if the prediction accuracy is below some predefined threshold, we can conclude there is a domain shift. after detecting the domain shift, we merge the distribution using a similar approach discussed (in updating using the same corpus) and generate the lexicon. however, in this case, we give different weight to the two distributions by taking into consideration not only their size, but also recency. more recent batches will be given more weight than the previous ones. the idea of word embedding have been widely used for generation of sentiment lexicon because of their advantage for giving semantic representation of words [ ] . if two words appear in similar contexts, they will have similar embedding. we propose to use word embedding in the following way. suppose we have seed words with their sentiment values, and we encounter some word, say wx, for which we do not have a sentiment value (sval) yet. but if we have its embedding, we can look for the most similar embedding in the embedding space and retrieve the corresponding word, wy, retrieve its sval and use it as a sval of wx. as reported in [ ] , neural models performance can increase by including lexicon information. we aim to further study litreture and find how to exploit combination of an existing sentiment lexicon (more explainable) and neural models performance. we plan to evaluate our system and compare it with other five existing lexicons: sentiwords, splm, so-cal, bing liu's opinion lexicon, and sentiword-net [ ] . the evaluation task will be on three sentiment analysis tasks (movie review, polarity of tweets and hotel review). in these comparisons we will compare ( ) the precision of the predictions of sentiment values and ( ) runtime to carry out updates of the lexicon. we seek suggestions on how our proposal can be improved. 
more importantly, discussion on how to exploit combination of word embedding with sentiment lexicon. we also welcome comments. cognitive-inspired domain adaptation of sentiment lexicons sentiment lexicon generation constructing automatic domainspecific sentiment lexicon using knn search via terms discrimination vectors automatic construction of domain-specific sentiment lexicons for polarity classification inducing domain-specific sentiment lexicons from unlabeled corpora determining the level of clients' dissatisfaction from their commentaries lexicon-based methods for sentiment analysis automatic domain adaptation outperforms manual domain adaptation for predicting financial outcomes word embeddings for sentiment analysis: a comprehensive empirical survey sentiment lexicon construction with representation learning based on hierarchical sentiment supervision lexicon information in neural sentiment analysis: a multi-task learning approach explainable sentiment analysis with applications in medicine explainable artificial intelligence: a survey an overview of sentiment analysis approaches knowledge discovery from data streams on the negativity of negation acknowledgement. this project was partially financed by the portuguese funding agency, fct -fundação para a ciência e a tecnologia, through national funds, and co-funded by the feder. key: cord- -cbikq v authors: papadakos, panagiotis; kalipolitis, orfeas title: dualism in topical relevance date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: cbikq v there are several concepts whose interpretation and meaning is defined through their binary opposition with other opposite concepts. to this end, in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms for eventually producing an answer which provides a better overview of the related conceptual and information space. specifically, we sketch a method in which antonyms are used for producing dual queries, which can in turn be exploited for defining a multi-dimensional topical relevance based on the antonyms. we motivate this direction by providing examples and by conducting a preliminary evaluation that shows its importance to specific users. dualism denotes the state of two parts. the term was originally coined to denote co-eternal binary opposition and has been especially studied in philosophy. for example, there is duality in ethics (good -bad), in human beings (man -nietzsche'sübermensch or man -god) and in logic (true -false). in addition, dualism determines in a great extent our everyday lives (ugly -beautiful, happyunhappy, etc.), and our relations with other people (rich -poor, black -white, love -hate, etc.). none of these concepts can be understood without their dual concepts, since this duality and opposition generates their meaning and interpretation. dualism is also crucial in mathematics and physics (e.g., matterantimatter), and is the power behind our whole information society and our binary data. moving from philosophy, sciences and everyday life to information retrieval, we find a very vague situation. users of search engines are 'dictated' to provide a very concise and specific query that is extremely efficient for focalized search (e.g., looking for a specific hotel). on the other hand, studies show that % of user tasks are of exploratory nature [ ] . in such tasks users do not accurately know their information need and can not be satisfied by a single 'hit' [ ] . 
consequently, users spend a lot of time reformulating queries and investigating results, in order to construct a conceptual model regarding their information need. information needs that include non-monosemous terms can be considered such exploratory tasks. however, the simplicity of inserting terms in an empty text box and 'magically' return the most relevant object(s), will always be a desired feature. in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms (if they exist), for eventually producing an answer which provides a better overview of the related information and conceptual space. we sketch a method in which antonyms are used for producing dual queries, which in turn can be exploited for defining a multi-dimensional topical relevance. this approach can be applied on demand, helping users to be aware of the various opposing dimensions and aspects of their topic of interest. a preliminary evaluation shows the value of the approach for some exploratory tasks and users. to the best of our knowledge, the proposed direction is not covered by the existing literature. antonyms have been studied in fuzzy logic [ ] showing a relation with negates. in the ir domain, query expansion methods are based on synonyms and semantically related terms, but do not exploit antonyms explicitly, while in relevance and pseudo-relevance feedback techniques the antonyms are essentially penalized [ ] . results diversification can produce a kind of dual clusters, but this is neither guaranteed nor controlled [ ] . "capitalism and war". consider a user exploring the relationship between capitalism and war. the user submits to a wse (web search engine) the query "capitalism and war" and starts inspecting the results. the left part of fig. shows the top- results for this query from a popular wse. the results include articles about the connection of capitalism with war from research and academic domains, as well as from socialistic, communistic and theological sites. considering a different direction, the user might also be interested about how capitalism can support peace, the dual of war. the top- results for the query "capitalism and peace" are shown at the right side of fig. . they contain a wikipedia and a research article about the capitalist peace theory, and articles about the importance of capitalism for the prosperity of modern societies and its association to peace from policy research organizations. analogously, since socialism is the economic system that opposes capitalism, the user could be interested about how socialism may promote war or support peace, by inspecting the results of the queries "socialism and war" and "socialism and peace" respectively. the top- results for each of the above queries are shown in fig. . the results for the former query include the socialism and war pamphlet written by lenin, a collection of articles by the economist and philosopher friedrich hayek, a list of articles from two marxist domains, and a critical article for both left and right views from the foundation for economic education. for the latter query, the results include articles connecting socialism with peace, like a chapter from the encyclopedia of anti-revisionism, a wikipedia article about the theoretical magazine problems of peace and socialism, and an article from a site supporting a far left u.s. party. the above hits indicate interesting directions to the original information need of the user. 
we argue that users should get aware of these directions for a better exploration of the domain at hand, since they can provide a more comprehensive view of the information and conceptual space. furthermore, the exploration of these directions let available supportive or counter arguments of dual concepts to emerge, leading to better informed and responsible humans and citizens. "aloe". a comprehensive view of the various different directions can be beneficial also for reducing false-positive results. for example, consider a pregnant woman that was advised to take aloe vera by mouth to relieve digestive discomfort. to check if this is true, she submits to a wse the query "aloe vera indications". however, since aloe can stimulate uterine contractions, increasing the risk of miscarriage or premature birth, it is crucial to know also its contraindications. the proposed direction can alleviate this problem, because this information would be contained in the results of the query "aloe vera contraindications". one can imagine various ways for leveraging antonyms. we shall hereafter use t t to denote that the terms t and t are antonyms. building on the "capitalistic" example of the previous section, according to the online dictionary wordnet , socialism capitalism, and war peace. now, we can generate all possible queries, denoted by q, where non-monosemous terms of the original query are substituted by their dual ones, as expressed by their antonyms. for example, the query "capitalism and war" will generate three extra queries: "socialism and peace", "capitalism and peace" and "socialism and war". based on q we can now define two vector spaces. in the first case, the space has |q| dimensions, where each query is a dimension of the space. each document is placed in this space according to its relevenace to each query. in the second case we assume a space with only |q| dimensions. each dimension represents a pair of dual queries, where each query in the pair contains the antonyms of the other. we denote with q q , that the queries q and q are dual. for our running example, the first pair is ("capitalism and war","socialism and peace") and the second one is ("capitalism and peace","socialism and war"). each pair defines an axis, therefore the two pairs define a d space against which we can evaluate the "value" of each document. for each axis we can consider policies for composing the relevance scores of each document to each member of a dual query. generally, there are various criteria that can be considered for assessing the value of each document or set of documents. such criteria include the bias of documents to specific queries (e.g., the original user query), the purity to a specific query, the overview factor of a document regarding either a dual query or all queries, and the diversity of the returned set of documents with respect to these queries. in general, we need to define appropriate ranking methods, that will take into account the relevance of the documents to the available queries for different criteria. therefore, we will explore whether the existing multiplecriteria approaches described in [ , , , ] are appropriate for the problem at hand. regarding the process of finding the corresponding antonyms, we can use existing dictionaries like wordnet for nouns and adjectives or word-embedding antonym detection approaches like [ , ] . 
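As an illustration of the query-generation step, the sketch below derives the set Q of dual queries by substituting WordNet antonyms for query terms (the verbs and adverbs discussed next would need extra handling). It assumes NLTK with the WordNet data installed; whether a particular antonym pair (e.g. capitalism/socialism) is actually returned depends on the installed WordNet version.

from itertools import product
from nltk.corpus import wordnet as wn  # assumes nltk and its wordnet data are installed

def antonyms_of(term):
    """Collect WordNet antonyms of a single query term, if any exist."""
    found = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                found.add(ant.name().replace("_", " "))
    return found

def dual_queries(query_terms):
    """Generate the set Q of queries obtained by keeping each term or
    substituting one of its antonyms."""
    options = []
    for term in query_terms:
        ants = antonyms_of(term)
        options.append(([term] + sorted(ants)) if ants else [term])
    return sorted({" ".join(combo) for combo in product(*options)})

# e.g. dual_queries(["capitalism", "and", "war"]) is expected to include
# "capitalism and peace" and, if the installed WordNet lists the pair,
# "socialism and war" / "socialism and peace"; terms without antonyms
# (here "and") are left unchanged.
print(dual_queries(["capitalism", "and", "war"]))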
the case of verbs and adverbs is more complicated since they require a kind of grammatical and language analysis (i.e., exist not exist, lot total, a lot bit, etc). there are three categories of antonyms: (a) gradable, (b) relational and (c) complementary. we have gradable antonyms (e.g., hot cold) in cases where the definitions of the words lie on a continuous spectrum. we have relational antonyms (e.g., teacher student) in cases where the two meanings are opposite only within the context of their relationship. the rest are called complementary antonyms (e.g., day night). in general, the selection of the "right" antonyms raises various questions. in many cases more than one antonyms exist, so one should decide which one(s) to select. sometimes this can depend on the context, e.g., the antonym of "action" is "apathy", but in terms of physics or sociology the dual of "action" is "reaction". notice that the proposed approach can be exploited in any context where the aim is to retrieve semantically opposing entities, information, etc. as an example consider the argument web [ ] , where the approach could be used for retrieving contradicting arguments and providing support for each one of them. from a system's perspective, the approach can be realized in various levels and settings. in the setting of an ir system, it can be implemented by changing accordingly the query processor and the ranking module, while in a meta-search setting, by changing the query rewriting, the query forwarding and the ranking components. it could also be exploited in the query autocompletion layer. to start with, we have conducted a preliminary evaluation. we have specified information tasks which are shown in table , that can exploit the proposed approach. the tasks are of exploratory nature and were created using the task refinement steps described in [ ] . we have identified the following types of tasks: explore domain (ed), medical treatment (mt), explore product reviews (epr) and person qualities (pq). for each task we provide a description of the information need, a representative query and the relevant antonyms, which were manually selected from the list of the respective wordnet antonyms. we conducted our experiment over female and male users of various ages. for each task, they were given two lists of results. one contained the results of the query from a popular wse, and the other one was constructed by interleaving the results of the same wse for the dual queries of this task (i.e., first the top result of the original query, then the first result of its dual, etc.). the two kinds of lists were given in random order for each task. the users were asked to select the most preferred list and to provide a grade of preference taking values in { , , , , }, where means that the selected list was preferred much more than the other one. in the background, when users prefer the results of the dual approach, we change the sign of the score and make it negative. the users were not aware how the lists were constructed and were not guided in any way by the evaluator. in fig. we provide two graphs that describe the results of the evaluation. figure (a), shows the aggregated scores given by all users to each query, while fig. (b) shows the aggregated scores given by each participant to all queries. 
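For concreteness, the interleaved dual lists and the signed preference grades used in this evaluation could be assembled roughly as follows; the cut-off of ten results, the label names and the deduplication policy are assumptions rather than the exact protocol.

from itertools import zip_longest

def interleave_dual(original_results, dual_results, k=10):
    """Alternate the ranked results of the original query and of its dual
    query, skip duplicates, and keep the top k of the merged list."""
    merged, seen = [], set()
    for original_doc, dual_doc in zip_longest(original_results, dual_results):
        for doc in (original_doc, dual_doc):
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
                if len(merged) == k:
                    return merged
    return merged

def signed_grade(preferred, grade):
    """Preference grades stay positive when the plain list is preferred and
    are negated when the interleaved (dual) list is preferred."""
    return -grade if preferred == "dual" else grade

# illustrative aggregation over (task, user, preferred list, grade) tuples
judgements = [("q1", "u1", "dual", 3), ("q1", "u2", "plain", 2)]
per_query = {}
for query, _, preferred, grade in judgements:
    per_query[query] = per_query.get(query, 0) + signed_grade(preferred, grade)
print(per_query)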
regarding the first one the results are not the expected ones, although we hypothesize that the users mainly penalized the dual approach because of the 'irrelevant' results to the original query in terms of query tokens and not in terms of relevant information. for eleven of the queries there is a strong preference towards the non-dual approach. the epr type of queries belong to this category, showing that users are probably not interested for reviews with the opposite direction of what they are looking for. this is especially true for q , where the dual approach provided results about winter vacations and was the least preferred. for two of the tasks, the approaches are almost incomparable. both of these tasks belong to the mt group. there are also two queries, q and q , where the dual approach is better, especially in the last one. in their comments for these queries, users mention that the selected (i.e., dual) list "provides a more general picture" and "more relevant and interesting results, although contradicting". regarding the second graph we have the interesting result that the proposed approach appeals to specific users. it seems that nine users ( % of the participants) have an exploratory nature and generally prefer the dual approach (six of them strongly), while for four of them the two approaches are incomparable. the rest are better served with the non-dual approach. this is an interesting outcome, and in the future we plan to identify the types of users that prefer the dual approach. we have motivated with examples why it is worth investigating dualism for nonmonosemous terms in the context of exploratory search and we have shown its importance at least for some types of users and tasks. for the future, we plan to define the appropriate antonyms selection algorithms and relevance metrics, implement the proposed functionality in a meta-search setting, and conduct a large scale evaluation with real users over exploratory tasks, to identify in which queries the dual approach is beneficial and to what types of users. query expansion techniques for information retrieval: a survey implementing the argument web evaluating subtopic retrieval methods: clustering versus diversification of search results multidimensional relevance: a new aggregation criterion supporting exploratory search multidimensional relevance: prioritized aggregation in a personalized information retrieval setting on antonym and negate in fuzzy logic improving word embeddings for antonym detection using thesauri and sentiwordnet negotiating a multidimensional framework for relevance space creating exploratory tasks for a faceted search interface word embedding-based antonym detection using thesauri and distributional information understanding user goals in web search relevance: a review of the literature and a framework for thinking on the notion in information science. part ii: nature and manifestations of relevance key: cord- -f icyt authors: sharma, ujjwal; rudinac, stevan; worring, marcel; demmers, joris; van dolen, willemijn title: semantic path-based learning for review volume prediction date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: f icyt graphs offer a natural abstraction for modeling complex real-world systems where entities are represented as nodes and edges encode relations between them. in such networks, entities may share common or similar attributes and may be connected by paths through multiple attribute modalities. 
in this work, we present an approach that uses semantically meaningful, bimodal random walks on real-world heterogeneous networks to extract correlations between nodes and bring together nodes with shared or similar attributes. an attention-based mechanism is used to combine multiple attribute-specific representations in a late fusion setup. we focus on a real-world network formed by restaurants and their shared attributes and evaluate performance on predicting the number of reviews a restaurant receives, a strong proxy for popularity. our results demonstrate the rich expressiveness of such representations in predicting review volume and the ability of an attention-based model to selectively combine individual representations for maximum predictive power on the chosen downstream task. multimodal graphs have been extensively used in modeling real-world networks where entities interact and communicate with each other through multiple information pathways or modalities [ , , ] . each modality encodes a distinct view of the relation between nodes. for example, within a social network, users can be connected by their shared preference for a similar product or by their presence in the same geographic locale. each of these semantic contexts links the same user set with a distinct edge set. such networks have been extensively used for applications like semantic proximity search in existing interaction networks [ ] , augmenting semantic relations between entities [ ] , learning interactions in an unsupervised fashion [ ] and augmenting traditional matrix factorization-based collaborative filtering models for recommendation [ ] . each modality within a multimodal network encodes a different semantic relation and exhibits a distinct view of the network. while such views contain relations between nodes based on interactions within a single modality, observed outcomes in the real-world are often a complex combination of these interactions. therefore, it is essential to compose these complementary interactions meaningfully to build a better representation of the real world. in this work, we examine a multimodal approach that attempts to model the review-generation process as the end-product of complex interactions within a restaurant network. restaurants share a host of attributes with each other, each of which may be treated as a modality. for example, they may share the same neighborhood, the same operating hours, similar kind of cuisine, or the same 'look and feel'. furthermore, each of these attributes only uncovers a specific type of relation. for example, a view that only uses the location-modality will contain venues only connected by their colocation in a common geographical unit and will prioritize physical proximity over any other attribute. broadly, each of these views is characterized by a semantic context and encodes modality-specific relations between restaurants. these views, although informative, are complementary and only record associations within the same modality. while each of these views encodes a part of the interactions within the network, performance on a downstream task relies on a suitable combination of views pertinent to the task [ ] . in this work, we use metapaths as a semantic interface to specify which relations within a network may be relevant or meaningful and worth investigating. we generate bimodal low-dimensional embeddings for each of these metapaths. 
furthermore, we conjecture that their relevance on a downstream task varies with the nature of the task and that this task-specific modality relevance should be learned from data. in this work, -we propose a novel method that incorporates restaurants and their attributes into a multimodal graph and extracts multiple, bimodal low dimensional representations for restaurants based on available paths through shared visual, textual, geographical and categorical features. -we use an attention-based fusion mechanism for selectively combining representations extracted from multiple modalities. -we evaluate and contrast the performance of modality-specific representations and joint representations for predicting review volume. the principle challenge in working with multimodal data revolves around the task of extracting and assimilating information from multiple modalities to learn informative joint representations. in this section, we discuss prior work that leverages graph-based structures for extracting information from multiple modalities, focussing on the auto-captioning task that introduced such methods. we then examine prior work on network embeddings that aim to learn discriminative representations for nodes in a graph. graph-based learning techniques provide an elegant means for incorporating semantic similarities between multimedia documents. as such, they have been used for inference in large multimodal collections where a single modality may not carry sufficient information [ ] . initial work in this domain was structured around the task of captioning unseen images using correlations learned over multiple modalities (tag-propagation or auto-tagging). pan et al. use a graph-based model to discover correlations between image features and text for automatic image-captioning [ ] . urban et al. use an image-context graph consisting of captions, image features and images to retrieve relevant images for a textual query [ ] . stathopoulos et al. [ ] build upon [ ] to learn a similarity measure over words based on their co-occurrence on the web and use these similarities to introduce links between similar caption words. rudinac et al. augment the image-context graph with users as an additional modality and deploy it for generating visual-summaries of geographical regions [ ] . since we are interested in discovering multimodal similarities between restaurants, we use a graph layout similar to the one proposed by pan et al. [ ] for the image auto-captioning task but replace images with restaurants as central nodes. other nodes containing textual features, visual features and users are retained. we also add categorical information like cuisines as a separate modality, allowing them to serve as semantic anchors within the representation. graph representation learning aims to learn mappings that embed graph nodes in a low-dimensional compressed representation. the objective is to learn embeddings where geometric relationships in the compressed embedding space reflect structural relationships in the graph. traditional approaches generate these embeddings by finding the leading eigenvectors from the affinity matrix for representing nodes [ , ] . with the advent of deep learning, neural networks have become increasingly popular for learning such representations, jointly, from multiple modalities in an end-to-end pipeline [ , , , , ] . existing random walk-based embedding methods are extensions of the random walks with restarts (rwr) paradigm. 
traditional rwr-based techniques compute an affinity between two nodes in a graph by ascertaining the steady-state transition probability between them. they have been extensively used for the aforementioned auto-captioning tasks [ , , , ], tourism recommendation [ ] and web search as an integral part of the pagerank algorithm [ ]. deep learning-based approaches build upon the traditional paradigm by optimizing the co-occurrence statistics of nodes sampled from these walks. deepwalk [ ] uses nodes sampled from short truncated random walks as phrases to optimize a skip-gram objective similar to word2vec [ ]. similarly, node2vec augments this learning paradigm with second-order random walks parameterized by exploration parameters p and q, which control the trade-off between homophily and structural equivalence in the learnt representations [ ]. for a homogeneous network, random walk-based methods like deepwalk and node2vec assume that while the probabilities of transitioning from one node to another can be different, every transition still occurs between nodes of the same type. for heterogeneous graphs, this assumption may be fallacious: not all transitions occur between nodes of the same type and, consequently, they do not all carry the same semantic context. indeed, our initial experiments with the node2vec model suggest that it is not designed to handle highly multimodal graphs. clements et al. [ ] demonstrated that, in the context of content recommendation, the importance of modalities is strongly task-dependent, and treating all edges in heterogeneous graphs as equivalent can discard this information. metapath2vec [ ] remedies this by introducing unbiased walks over the network schema specified by a metapath [ ], allowing the network to learn the semantics specified by the metapath rather than those imposed purely by the topology of the graph. metapath-based approaches have been extended to a variety of other problems. hu et al. use an exhaustive list of semantically meaningful metapaths for extracting top-n recommendations with a neural co-attention network [ ]. shi et al. use metapath-specific representations in a traditional matrix factorization-based collaborative filtering mechanism [ ]. in this work, we perform random walks on sub-networks of a restaurant-attribute network containing restaurants and attribute modalities. these attribute modalities may contain images, text or categorical features. for each of these sub-networks, we perform random walks and use a variant of the heterogeneous skip-gram objective introduced in [ ] to generate low-dimensional bimodal embeddings. bimodal embeddings have several interesting properties. training relations between two modalities provides a degree of modularity, where modalities can be included in or held out from the prediction model without affecting the others. it also makes training inexpensive, as the number of nodes when considering only two modalities is far lower than in the entire graph. in this section, we begin by providing a formal introduction to the graph terminology that is frequently referenced in this paper. we then move on to detail our proposed method, illustrated in fig. . formally, a heterogeneous graph is denoted by g = (v, e, φ, σ), where v and e denote the node and edge sets respectively. every node and edge is typed by the mapping functions φ: v → a and σ: e → r, where a and r are the sets of node types and edge types respectively, such that |a| + |r| > 2.
for a heterogeneous graph g = (v, e, φ, σ), a network schema is a metagraph m_g = (a, r), where a is the set of node types in v and r is the set of edge types in e. a network schema enumerates the possible node types and edge types that can occur within a network. a metapath m(a_1, a_n) is a path on the network schema m_g consisting of a sequence of ordered edge transitions a_1 --r_1--> a_2 --r_2--> . . . --r_(n-1)--> a_n. (fig. caption: we use tripadvisor to collect information for restaurants in amsterdam, and each venue characteristic is embedded as a separate node within a multimodal graph; r nodes denote restaurants, i nodes denote images for a restaurant, d nodes are review documents, a nodes are categorical attributes for restaurants and l nodes are locations. bimodal random walks are used to extract pairwise correlations between nodes in separate modalities, which are embedded using a heterogeneous skip-gram objective. finally, an attention-based fusion model is used to combine multiple embeddings together to regress the review volume for restaurants.) let g = (v, e) be the heterogeneous graph with a set of nodes v and edges e. we assume the graph to be undirected, as linkages between venues and their attributes are inherently symmetric. below, we describe the node types used to construct the graph (cf. figs. and ). . images: for every venue image, we use the penultimate-layer output of a pre-trained network as a compressed low-dimensional representation of the image. since the number of available images for each venue may vary dramatically depending on its popularity, adding a node for every image can lead to an unreasonably large graph. to mitigate this issue, we cluster the image features for each restaurant using the k-means algorithm and use the cluster centers as representative image features for a restaurant, similar to zahálka et al. [ ]. we chose k = as a reasonable trade-off between the granularity of our representations and the tractability of generating embeddings for this modality. . reviews: the way patrons write about a restaurant and the usage of specialized terms can contain important information about a restaurant that may be missing from its categorical attributes. for example, usage of the indian cottage cheese 'paneer' can be found in similar cuisine types like nepali, surinamese, etc., and user reviews talking about dishes containing 'paneer' can be leveraged to infer that indian and nepali cuisines share some degree of similarity. to model such effects, we collect reviews for every restaurant. since individual reviews may not provide a comprehensive, unbiased picture of the restaurant, we chose not to treat them individually, but to consider them as a single document. we then use a distributed bag-of-words model from [ ] to generate low-dimensional representations of these documents for each restaurant. since the reviews of a restaurant can vary widely based on its popularity, we only consider the most recent reviews for each restaurant to prevent biases from document length getting into the model. . users: since tripadvisor does not record check-ins, we can only leverage explicit feedback from users who chose to leave a review. we add a node for each of the users who visited at least two restaurants in amsterdam and left a review. similar to [ , , ], we introduce two kinds of edges in our graph: . attribute edges: these are heterogeneous edges that connect a restaurant node to the nodes of its categorical attributes, image features, review features and users. in our graph, we instantiate them as undirected, unweighted edges.
. similarity edges: these are homogeneous edges between the feature nodes within a single modality. for image features, we use a radial basis function as a non-linear transformation of the euclidean distances between image feature vectors. for document vectors, we use cosine similarity to find restaurants with similar reviews. adding a weighted similarity edge between every node in the same modality would yield an extremely dense adjacency matrix. to avoid this, we only add similarity links between a node and its k nearest neighbors in each modality. by choosing the nearest k neighbors, we make our similarity threshold adaptive, allowing it to adjust to varying scales of distance in multiple modalities. metapaths can provide a modular and simple interface for injecting semantics into the network. since metapaths, in our case, are essentially paths over the modality set, they can be used to encode inter-modality correlations. in this work, we generate embeddings with two specific properties: . all metapaths are binary and only include transitions over modalities. since venues/restaurants are always a part of the metapath, we only include one other modality. . during optimization, we only track the short-range context by choosing a small window size. window size is the maximum distance between the input node and a predicted node in a walk. in our model, walks over the metapath only capture short-range semantic contexts and the choice of a larger window can be detrimental to generalization. for example, consider a random walk over the restaurant - cuisine - restaurant metapath. in the sampled nodes illustrated in the paper, restaurants and cuisines alternate along the walk. optimizing over a large context window can lead to mcdonald's (fast-food cuisine) and kediri (indonesian cuisine) being placed close in the embedding space. this is erroneous and does not capture the intended semantics, which should bring restaurants closer only if they share the exact attribute. we use the metapaths in table to perform unbiased random walks on the graph detailed in sect. . each of these metapaths enforces similarity based on certain semantics. we train separate embeddings using the heterogeneous skip-gram objective similar to [ ]. for every metapath, we maximize the probability of observing the heterogeneous context n_a(v) given the node v. in eq. ( ), a_m is the node type-set and v_m is the node-set for metapath m: arg max_θ Σ_{v∈v_m} Σ_{a∈a_m} Σ_{c_a∈n_a(v)} log p(c_a | v; θ). the original metapath2vec model [ ] uses multiple metapaths [ ] to learn separate embeddings, some of which perform better than the others. (fig. caption: attention-weighted modality fusion: metapath-specific embeddings are fed into a common attention mechanism that generates an attention vector; each modality is then reweighted with the attention vector and concatenated; this joint representation is then fed into a ridge regressor to predict the volume of ratings for each restaurant.) on the dblp bibliographic graph that consists of authors (a), papers (p) and venues (v), the performance of their recommended metapath 'a-p-v-p-a' was empirically better than the alternative metapath 'a-p-a' on the node classification task. at this point, it is important to recall that in our model, each metapath extracts a separate view of the same graph. these views may contain complementary information and it may be disadvantageous to only retain the best performing view. for an optimal representation, these complementary views should be fused.
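a rough stand-in for the optimization step above: treating each sampled walk as a sentence and each node as a token, an off-the-shelf skip-gram implementation approximates the objective (the paper's heterogeneous variant additionally normalizes the softmax per node type, which is not reproduced here). the hyperparameter values and the gensim 4.x api usage are assumptions made for illustration.

```python
from gensim.models import Word2Vec

# 'walks' is a list of node-name sequences sampled for one metapath,
# e.g. the output of the walk sketch shown earlier
def train_metapath_embedding(walks, dim=128, window=2, epochs=10):
    # sg=1 selects skip-gram with negative sampling; a small window keeps only
    # short-range context, matching the argument above about large windows
    model = Word2Vec(sentences=walks, vector_size=dim, window=window,
                     sg=1, negative=5, min_count=0, epochs=epochs, workers=4)
    return model.wv          # node-name -> embedding lookup

# one embedding table is trained per metapath and kept separate for fusion:
# embeddings = {name: train_metapath_embedding(w) for name, w in walks_by_metapath.items()}
```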
in this work, we employ an embedding-level attention mechanism similar to the attention mechanism introduced in [ ] that selectively combines embeddings based on their performance on a downstream task. assuming s to be the set of metapath-specific embeddings for metapaths m_1, m_2, . . . , m_n, following the approach outlined in fig. , we can denote it as s = {s_m1, s_m2, . . . , s_mn}. we then use a two-layer neural network f_att to learn an embedding-specific attention a_mn = f_att(s_mn) for metapath m_n. further, we perform a softmax transformation of the attention network outputs to an embedding-specific weight α_mn = exp(a_mn) / Σ_j exp(a_mj). finally, we concatenate the attention-weighted metapath-specific embeddings to generate a fused embedding e = [α_m1 · s_m1 ; α_m2 · s_m2 ; . . . ; α_mn · s_mn]. we evaluate the performance of the embedding fusion model on the task of predicting the volume (total count) of reviews received by a restaurant. we conjecture that the volume of reviews is an unbiased proxy for the general popularity and footfall of a restaurant and is more reliable than indicators like ranking or ratings, which may be biased by tripadvisor's promotion algorithms. we use the review volume collected from tripadvisor as the target variable and model this task as a regression problem. data collection. we use publicly-available data from tripadvisor for our experiments. to build the graph detailed in sect. , we collect data for , restaurants in amsterdam, the netherlands that are listed on tripadvisor. we additionally collect , user-contributed restaurant reviews made by , unique users, of which only , users visit more than restaurants in the city. we only retain these , users in our graph and drop the others. we also collect , user-contributed images for these restaurants. we construct the restaurant network by embedding venues and their attributes listed in table as nodes. bimodal embeddings. we train separate bimodal embeddings by optimizing the heterogeneous skip-gram objective from eq. ( ) using stochastic gradient descent and train embeddings for all metapaths enumerated in table . we use restaurant nodes as root nodes for the unbiased random walks and perform walks per root node, each with a walk length of . each embedding has a dimensionality of , uses a window-size of and is trained for epochs. embedding fusion models. we chose two fusion models in our experiments to analyze the efficacy of our embeddings: . simple concatenation model: a model that performs a simple concatenation of the individual metapath-specific embeddings detailed in sect. to exhibit the baseline performance on the tasks detailed in sect. . simple concatenation is a well-established additive fusion technique in multimodal deep learning [ , ]. . attention-weighted fusion model: the attention-based model described above, which reweights each metapath-specific embedding with its learned attention weight before concatenation. each of the models uses a ridge regression algorithm to estimate the predictive power of each metapath-specific embedding on the volume regression task. this regressor is jointly trained with the attention model in the attention-weighted model. all models are optimized using stochastic gradient descent with the adam optimizer [ ] with a learning rate of . . in table , we report the results from our experiments on the review-volume prediction task. we observe that metapaths with nodes containing categorical attributes perform significantly better than vector-based features. in particular, categorical attributes like cuisines, facilities, and price have a significantly higher coefficient of determination (r²) as compared to visual feature nodes.
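before turning to the remaining observations, the attention-weighted fusion model used in these experiments can be sketched as follows; the hidden size, learning rate, and the use of a single linear output trained with weight decay as a stand-in for the jointly trained ridge regressor are all illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    # embedding-level attention: one scalar score per metapath embedding,
    # softmax-normalised across metapaths, then weighted concatenation
    def __init__(self, n_metapaths, dim, hidden=64):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1))
        self.regressor = nn.Linear(n_metapaths * dim, 1)   # ridge-like head

    def forward(self, embs):                 # embs: (batch, n_metapaths, dim)
        scores = self.scorer(embs)           # (batch, n_metapaths, 1)
        alpha = torch.softmax(scores, dim=1)
        fused = (alpha * embs).flatten(1)    # attention-weighted concatenation
        return self.regressor(fused).squeeze(-1)

model = AttentionFusion(n_metapaths=6, dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
pred = model(torch.randn(32, 6, 128))        # predicted review volume
```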
it is interesting to observe here that nodes like locations, images, and textual reviews are far more numerous than categorical nodes and part of their decreased performance may be explained by the fact that our method of short walks may not be sufficiently expressive when the number of feature nodes is large. in addition, as mentioned in related work, we performed these experiments with the node vec model, but since it is not designed for heterogeneous multimodal graphs, it yielded performance scores far below the weakest single modality. a review of the fusion models indicates that taking all the metapaths together can improve performance significantly. the baseline simple concatenation fusion model, commonly used in literature, is considerably better than the best-performing metapath (venues -facilities -venues). the attention basedmodel builds significantly over the baseline performance and while it employs a similar concatenation scheme as the baseline concatenation model, the introduction of the attention module allows it to handle noisy and unreliable modalities. the significant increase in the predictive ability of the attention-based model can be attributed to the fact that while all modalities encode information, some of them may be less informative or reliable than others, and therefore contribute less to the performance of the model. our proposed fusion approach is, therefore, capable of handling weak or noisy modalities appropriately. in this work, we propose an alternative, modular framework for learning from multimodal graphs. we use metapaths as a means to specify semantic relations between nodes and each of our bimodal embeddings captures similarities between restaurant nodes on a single attribute. our attention-based model combines separately learned bimodal embeddings using a late-fusion setup for predicting the review volume of the restaurants. while each of the modalities can predict the volume of reviews to a certain extent, a more comprehensive picture is only built by combining complementary information from multiple modalities. we demonstrate the benefits of our fusion approach on the review volume prediction task and demonstrate that a fusion of complementary views provides the best way to learn from such networks. in future work, we will investigate how the technique generalises to other tasks and domains. 
mantis: system support for multimodal networks of in-situ sensors hyperlearn: a distributed approach for representation learning in datasets with many modalities interaction networks for learning about objects, relations and physics heterogeneous network embedding via deep architectures the task-dependent effect of tags and ratings on social media access metapath vec: scalable representation learning for heterogeneous networks m-hin: complex embeddings for heterogeneous information networks via metagraphs node vec: scalable feature learning for networks deep residual learning for image recognition leveraging meta-path based context for top-n recommendation with a neural co-attention model multimodal network embedding via attention based multi-view variational autoencoder adam: a method for stochastic gradient descent distributed representations of sentences and documents deep collaborative embedding for social image understanding how random walks can help tourism image labeling on a network: using social-network metadata for image classification distributed representations of words and phrases and their compositionality multimodal deep learning multi-source deep learning for human pose estimation the pagerank citation ranking: bringing order to the web gcap: graph-based automatic image captioning deepwalk: online learning of social representations the visual display of regulatory information and networks nonlinear dimensionality reduction by locally linear embedding generating visual summaries of geographic areas using community-contributed images imagenet large scale visual recognition challenge heterogeneous information network embedding for recommendation semantic relationships in multi-modal graphs for automatic image annotation pathsim: meta path-based top-k similarity search in heterogeneous information networks line: large-scale information network embedding study on optimal frequency design problem for multimodal network using probit-based user equilibrium assignment adaptive image retrieval using a graph model for semantic feature integration heterogeneous graph attention network network representation learning with rich text information interactive multimodal learning for venue recommendation metagraph vec: complex semantic path augmented heterogeneous network embedding key: cord- -yrocw j authors: agarwal, mansi; leekha, maitree; sawhney, ramit; ratn shah, rajiv; kumar yadav, rajesh; kumar vishwakarma, dinesh title: memis: multimodal emergency management information system date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: yrocw j the recent upsurge in the usage of social media and the multimedia data generated therein has attracted many researchers for analyzing and decoding the information to automate decision-making in several fields. this work focuses on one such application: disaster management in times of crises and calamities. the existing research on disaster damage analysis has primarily taken only unimodal information in the form of text or image into account. these unimodal systems, although useful, fail to model the relationship between the various modalities. different modalities often present supporting facts about the task, and therefore, learning them together can enhance performance. 
we present memis, a system that can be used in emergencies like disasters to identify and analyze the damage indicated by user-generated multimodal social media posts, thereby helping the disaster management groups in making informed decisions. our leave-one-disaster-out experiments on a multimodal dataset suggest that not only does fusing information in different media forms improves performance, but that our system can also generalize well to new disaster categories. further qualitative analysis reveals that the system is responsive and computationally efficient. the amount of data generated every day is colossal [ ] . it is produced in many different ways and many different media forms. analyzing and utilizing this data to drive the decision-making process in various fields intelligently has been the primary focus of the research community [ ] . disaster response management is one such area. natural calamities occur frequently, and in times of such crisis, if the large amount of data being generated across different platforms is harnessed m. agarwal and m. leekha-the authors contributed equally, and wish that they be regarded as joint first authors. rajiv ratn shah is partly supported by the infosys center for ai, iiit delhi. well, the relief groups will be able to make effective decisions that have the potential to enhance the response outcomes in the affected areas. to design an executable plan, disaster management and relief groups should combine information from different sources and in different forms. however, at present, the only primary source of information is the textual reports which describe the disaster's location, severity, etc. and may contain statistics of the number of victims, infrastructural loss, etc. motivated by the cause of humanitarian aid in times of crises and disasters, we propose a novel system that leverages both textual and visual cues from the mass of user-uploaded information on social media to identify damage and assess the level of damage incurred. in essence, we propose memis, a system that aims to pave the way to automate a vast multitude of problems ranging from automated emergency management, community rehabilitation via better planning from the cues and patterns observed in such data and improve the quality of such social media data to further the cause of immediate response, improving situational awareness and propagating actionable information. using a real-world dataset, crisismmd, created by alam et al. [ ] , which is the first publicly available dataset of its kind, we present the case for a novel multimodal system, and through our results report its efficiency, effectiveness, and generalizability. in this section, we briefly discuss the disaster detection techniques of the current literature, along with their strengths and weaknesses. we also highlight how our approach overcomes the issues present in the existing ones, thereby emphasizing the effectiveness of our system for disaster management. chaudhuri et al. [ ] examined the images from earthquake-hit urban environments by employing a simple cnn architecture. however, recent research has revealed that often fine-tuning pre-trained architectures for downstream tasks outperform simpler models trained from scratch [ ] . we build on this by employing transfer learning with several successful models from the imagenet [ ] , and observed significant improvements in the performance of our disaster detection and analysis models, in comparison to a simple cnn model. sreenivasulu et al. 
[ ] investigated microblog text messages for identifying those which were informative, and therefore, could be used for further damage assessment. they employed a convolutional neural network (cnn) for modeling the text classification problem, using the dataset curated by alam et al. [ ] . extending on their work on crisismmd, we experimented with several other state-of-the-art architectures and observed that adding recurrent layers improved the text modeling. although researchers in the past have designed and experimented with unimodal disaster assessment systems [ , ] , realizing that multimodal systems may outperform unimodal frameworks [ ] , the focus has now shifted to leveraging information in different media forms for disaster management [ ] . in addition to using several different media forms and feature extraction techniques, several researchers have also employed various methods to combine the information obtained from these modalities, to make a final decision [ ] . yang et al. [ ] developed a multimodal system-madis which leverages both text and image modalities, using hand-crafted features such as tf-idf vectors, and low-level color features. although their contribution was a step towards advancing damage assessment systems, the features used were relatively simple and weak, as opposed to the deep neural network models, where each layer captures complex information about the modality [ ] . therefore, we utilize the latent representation of text and image modalities, extracted from their respective deep learning models, as features to our system. another characteristic that is essential for a damage assessment system is generalizability. however, most of the work carried out so far did not discuss this practical perspective. furthermore, to the best of our knowledge, so far no work has been done on developing an end-to-end multimodal damage identification and assessment system. to this end, we propose memis, a multimodal system capable of extracting information from social media, and employs both images and text for identifying damage and its severity in real-time (refer sect. ). through extensive quantitative experimentation in the leave-one-disaster-out training setting and qualitative analysis, we report the system's efficiency, effectiveness, and generalizability. our results show how combining features from different modalities improves the system's performance over unimodal frameworks. in this section, we describe the different modules of our proposed system in greater detail. the architecture for the system is shown in fig. . the internal methodological details of the individual modules are in the next section. the tweet streaming module uses the twitter streaming api to scrap realtime tweets. as input to the api, the user can enter filtering rules based on the available information like hashtags, keywords, phrases, and location. the module outputs all the tweets that match these defined cases as soon as they are live on social media. multiple rules can be defined to extract tweets for several disasters at the same time. data from any social media platform can be used as input to the proposed framework. however, in this work, we consume disaster-related posts on twitter. furthermore, although the proposed system is explicitly for multimodal tweets having both images and text, we let the streaming module filter both unimodal and multimodal disaster tweets. we discuss in sect. . how our pipeline can be generalized to process unimodal tweets as well, making it more robust. 
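the filtering rules mentioned above can be thought of as simple predicates over the streamed posts; the sketch below shows one plausible way to express and apply them client-side, with field names and rule structure assumed for illustration (the actual twitter streaming api applies comparable rules server-side).

```python
def matches_rule(tweet, rule):
    # rule: {"keywords": [...], "hashtags": [...], "location": "..."}
    text = tweet.get("text", "").lower()
    if any(k.lower() in text for k in rule.get("keywords", [])):
        return True
    if any(h.lower() in text for h in rule.get("hashtags", [])):
        return True
    loc = rule.get("location")
    return bool(loc) and tweet.get("place", "").lower() == loc.lower()

# multiple rules can track several disasters at the same time
rules = [{"keywords": ["earthquake", "flood"], "hashtags": ["#hurricaneirma"]}]
stream = [{"id": 1, "text": "Major flood near the river", "place": "Houston"}]
kept = [t for t in stream if any(matches_rule(t, r) for r in rules)]
```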
a large proportion of the tweets obtained using the streaming module may be retweets that have already been processed by the system. therefore, to avoid overheads, we maintain a list of identifiers (ids) of all tweets that have been processed by the system. in case an incoming tweet is a retweet that has already been processed by the system before, we discard it. furthermore, some tweets may also have location or geographic information. this information is also stored to maintain a list of places where relief groups are already providing services currently. if a streamed geo-tagged tweet is from a location where the relief groups are already providing aid, the tweet is not processed further. a substantial number of tweets streamed from the social media platforms are likely to be irrelevant for disaster response and management. furthermore, different relief groups have varying criteria for what is relevant to them for responding to the situation. for instance, a particular relief group could be interested only in reaching out to the injured victims, while another provides resources for infrastructural damages. therefore, for them to make proper use of information from social media platforms, the relevant information must be filtered. we propose two sub-modules for filtering: (i) the first filters the informative tweets, i.e., the tweets that provide information relevant to a disaster, which could be useful to a relief group, (ii) the second filter is specific to the relief group, based on the type of damage response they provide. to demonstrate the system, in this work, we filter tweets that indicate infrastructural damage or physical damage in buildings and other structures. finally, once the relevant tweets have been filtered, we analyze them for the severity of the damage indicated. the system categorizes the severity of infrastructural damage into three levels: high, medium and low. based on the damage severity assessment by the system, the relief group can provide resources and services to a particular location. this information must further be updated in the database storing the information about all the places where the group is providing aid currently. furthermore, although not shown in the system diagram, we must also remove a location from the database once the relief group's activity is over, and it is no longer actively providing service there. this ensures that if there is an incoming request from that location after it was removed from the database, it can be entertained. in this section, we discuss the implementation details of the two main modules of the system for relevance filtering and severity analysis. we begin by describing the data pre-processing required for the multimodal tweets, followed by the deep learning-based models that we use for the modules. image pre-processing: the images are resized to × for the transfer learning model [ ] and then normalized in the range [ , ] across all channels (rgb). text pre-processing: all http urls, retweet headers of the form rt, punctuation marks, and twitter user handles specified as @username are removed. the tweets are then lemmatized and transformed into a stream of tokens that can be fed as input to the models used in the downstream modules. these tokens act as indices to an embedding matrix, which stores the vector representation for tokens corresponding to all the words maintained in the vocabulary. 
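a minimal sketch of the text pre-processing described above; the regular expressions and the omission of the lemmatisation step are simplifications rather than the system's exact implementation.

```python
import re

URL = re.compile(r"https?://\S+")
HANDLE = re.compile(r"@\w+")
RT_HEADER = re.compile(r"^rt\s+", flags=re.IGNORECASE)
PUNCT = re.compile(r"[^\w\s#]")

def clean_tweet(text):
    # strip retweet headers, urls, user handles and punctuation,
    # then lowercase and tokenise (lemmatisation omitted for brevity)
    text = RT_HEADER.sub("", text)
    text = URL.sub("", text)
    text = HANDLE.sub("", text)
    text = PUNCT.sub(" ", text)
    return text.lower().split()

clean_tweet("RT @user: Bridge collapsed near 5th Ave! https://t.co/x")
# -> ['bridge', 'collapsed', 'near', '5th', 'ave']
```

the resulting tokens are then looked up in the vocabulary to obtain row indices into the embedding matrix, as described above.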
in this work, we use dimensional fasttext word-embeddings [ ] , trained on the cri-sismmd dataset [ ] that has been used in this work. the system as a whole, however, is independent of the choice of vector representation. for the proposed pipeline, we use recurrent convolutional neural network (rcnn) [ ] as the text classification model. it adds a recurrent structure to the convolutional block, thereby capturing contextual information with long term dependencies and the phrases which play a vital role at the same time. furthermore, we use the inception-v model [ ] , pre-trained on the imagenet dataset [ ] for modelling the image modality. the same underlying architectures, for both text and image respectively, are used to filter the tweets that convey useful information regarding the presence of infrastructural damage in the relevance filtering modules, and the analysis of damage in the severity analysis module. therefore, we effectively have three models for each modality: first for filtering the informative tweets, then for those pertaining to the infrastructural damage (or any other category related to the relief group), and finally for assessing the severity of damage present. in this subsection, we describe how we combine the unimodal predictions from the text and image models for different modules. we also discuss in each case about how the system would treat a unimodal text or image only input tweet. gated approach for relevance filtering. for the two modules within relevance filtering, we use a simplistic approach of combining the outputs from the text and image models by using the or function (⊕). technically speaking, we conclude that the combined output is positive if at least one of the unimodal models predicts so. therefore, if a tweet is predicted as informative by either the text, or the image, or both the models, the system predicts the tweet as informative, and it is considered for further processing in the pipeline. similarly, if at least one of the text and the image modality predicts an informative tweet as containing infrastructural damage, the tweet undergoes severity analysis. this simple technique helps avoid missing any tweet that might have even the slightest hint of damage, in either or both the modalities. any false positive can also be easily handled in this approach. if, say, a non-informative tweet is predicted as informative in the first step at relevance filtering, it might still be the case that in the second step, the tweet is predicted as not containing any infrastructural damage. furthermore, in case a tweet is unimodal and has just the text or the image, then the system can take the default prediction of the missing modality as negative (or false for a boolean or function), which is the identity for the or operation. in that case, the prediction based on the available modality will guide the analysis (fig. ) . attention fusion for severity analysis. the availability of data from different media sources has encouraged researchers to explore and leverage the potential boost in performance by combining unimodal classifiers trained on individual modalities [ , ] . here, we use attention fusion to combine the feature interpretations from the text and image modalities for the severity analysis module [ , ] . the idea of attention fusion is to attend particular input features as compared to others while predicting the output class. 
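the or-gating described above reduces to a few lines of logic; the function and argument names here are illustrative, and the unimodal models are assumed to return none when their modality is absent.

```python
def gated_or(text_pred, image_pred):
    # a missing unimodal prediction (None) acts as False, the identity of OR,
    # so unimodal tweets are still handled by whichever model is available
    return bool(text_pred) or bool(image_pred)

def relevance_filter(text_info, img_info, text_infra, img_infra):
    # stage 1: is the tweet informative at all?
    if not gated_or(text_info, img_info):
        return False
    # stage 2: does it indicate infrastructural damage?
    return gated_or(text_infra, img_infra)

relevance_filter(True, None, False, True)   # image-only damage cue -> True
```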
the features, i.e., the outputs of the penultimate layer or the layer before the softmax, of the text and image models are concatenated. this is followed by a softmax layer to learn the attention weights for each feature dimension, i.e., the attention weight α_i for a feature x_i is given by α_i = exp(w_i · x) / Σ_j exp(w_j · x), where w_i denotes the i-th row of the weight matrix w learned by the model and x is the concatenated feature vector. therefore, the i-th input feature after applying the attention weights is α_i · x_i, where i, j ∈ 1, 2, . . . , p, and p is the total number of dimensions in the multimodal concatenated feature vector. this vector of attended features is then used to classify the given multimodal input. with this type of fusion, we can also analyze how the different modalities interact with each other by examining their attention weights. moving from the relevance filtering to the severity analysis module, we strengthen our fusion technique by using an attention mechanism. this is required since human resources are almost always scarce, and it is necessary to correctly assess the requirements at different locations based on the severity of the damage. as opposed to an or function, using attention we are able to combine the most important information as seen by the different modalities to jointly analyze the damage severity. in this case, the treatment of unimodal tweets is not as straightforward, since the final prediction using attention fusion occurs after concatenation of the latent feature vectors of the individual modalities. therefore, in case the text or image is missing, we use the unimodal model for the available modality. in other words, we use the attention mechanism only when both modalities are present to analyze damage severity; otherwise we use the unimodal models. recently, several datasets on crisis damage analysis have been released to foster research in the area [ ]. in this work, we have used the first multimodal, labeled, publicly available damage-related twitter dataset, crisismmd, created by alam et al. [ ]. it was collected by crawling the posts shared by users during seven natural disasters, which can be grouped into disaster categories, namely floods, hurricanes, wildfires and earthquakes. crisismmd introduces three hierarchical tasks: . informativeness. this initial task classifies each multimodal post as informative or non-informative. alam et al. define a multimodal post as informative if it serves to be useful in identifying areas where damage has occurred due to a disaster. it is therefore a binary classification problem, with the two classes being informative and non-informative. . infrastructural damage. the damage in an informative tweet may be of many different kinds [ , ]. crisismmd identifies several categories for the type of damage, namely infrastructure and utility damage, vehicle damage, affected individuals, missing or found people, other relevant information, and none. alam et al. [ ] also noted that the tweets which signify physical damage in structures, where people could be stuck, are especially beneficial for the rescue operation groups to provide aid. out of the above-listed categories, the tweets having infrastructure and utility damage are therefore identified in this task. this again is modelled as a classification problem with two classes: infrastructural and non-infrastructural damage. . damage severity analysis. this final task uses the text and image modalities together to analyze the severity of infrastructural damage in a tweet as high, medium, or low.
we add another label, no-damage, to support the pipeline framework that can handle false positives as well. specifically, if a tweet having no infrastructural damage is predicted as positive, it can be detected here as having no damage. this is modelled as a multi-class classification problem. the individual modules of the proposed pipeline essentially model the above three tasks of crisismmd. specifically, the two relevance filtering modules model the first and the second tasks, respectively, whereas the severity analysis module models the third task (table ) . to evaluate how well our system can generalize to new disaster categories, we train our models for all the three tasks in a leave-one-disaster-out (lodo) training paradigm. therefore, we train on disaster categories and evaluate the performance on the left-out disaster. to handle the class imbalance, we also used smote [ ] with the word embeddings of the training fold samples for linguistic baselines. we used adam optimizer with an initial learning rate of . , the values of β and β as . and . , respectively, and a batch size of to train our models. we use f -score as the metric to compare the model performance. all the models were trained on a geforce gtx ti gpu with a memory speed of gbps. to demonstrate the effectiveness of the proposed system for multimodal damage assessment on social media, we perform an ablation study, the results for which have been described below. design choices. we tried different statistical and deep learning techniques for modelling text-tf-idf features with svm, naive bayes (nb) and logistic regression (lr); and in the latter category, cnn [ ] , hierarchical attention model (hattn), bidirectional lstm (bilstm) and rcnn [ ] . as input to the deep learning models, we use -dimensional fasttext word embeddings [ ] trained on the dataset. by operating at the character n-gram level, fasttext tends to capture the morphological structure well. thus, helping the otherwise out of vocabulary words (such as hash-tags) to share semantically similar embeddings with its component words. as shown in table , the rcnn model performed the best on all three tasks of the relevance filtering and severity analysis modules. specifically, the average lodo f -scores of rcnn on the three tasks are . , . , and . , respectively. furthermore, the architecture considerably reduces the effect of noise in social media posts [ ] . for images, we fine-tuned the vgg- [ ] , resnet- [ ] and incep-tionv [ ] models, pre-trained on the imagenet dataset [ ] . we also trained a cnn model from scratch. experimental results in table reveal that incep-tionv performed the best, and the average f -score with lodo training for the three tasks are . , . , and . , respectively. the architecture employs multiple sized filters to get a thick rather than a deep architecture, as very deep networks are prone to over-fitting. such a design makes the network computationally less expensive, which is a prime concern for our system as we want to minimize latency to give quick service to the disaster relief groups. table highlights the results of an ablation study over the best linguistic and vision models, along with the results obtained when the predictions by these individual models are combined as discussed in sect. . . the results for all the modules demonstrate the effectiveness of multimodal damage assessment models. 
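a sketch of the leave-one-disaster-out protocol described above; the data layout, the wrapper functions and the choice of weighted f1 averaging are assumptions made for illustration rather than the paper's exact evaluation code.

```python
from sklearn.metrics import f1_score

DISASTERS = ["floods", "hurricanes", "wildfires", "earthquakes"]

def leave_one_disaster_out(data, train_fn, predict_fn):
    # data: {category: (X, y)}; train_fn(X, y) -> model; predict_fn(model, X) -> preds
    scores = {}
    for held_out in DISASTERS:
        X_tr, y_tr = [], []
        for cat in DISASTERS:
            if cat != held_out:
                X, y = data[cat]
                X_tr.extend(X)
                y_tr.extend(y)
        model = train_fn(X_tr, y_tr)          # e.g. apply SMOTE and fit here
        X_te, y_te = data[held_out]
        scores[held_out] = f1_score(y_te, predict_fn(model, X_te),
                                    average="weighted")
    return scores
```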
specifically, we observe that for each disaster category in the lodo training paradigm, the f -score for the multimodal model is always better than or compares with those of the text and image unimodal models. in this section, we analyze some specific samples to understand the shortcomings of using unimodal systems, and to demonstrate the effectiveness of our proposed multimodal system. table records these sample tweets along with their predictions as given by the different modules. in green are the correct predictions, whereas the incorrect ones are shown in red they have been discussed below in order: . the image in the first sample portrays the city landscape from the top, damaged by the calamity. due to the visual noise, the image does not give much information about the intensity of damage present, and therefore, the image model incorrectly predicts the tweet as mildly damaged. on the other hand, the text model can identify the severe damage indicated by phrases like 'hit hard'. combining the two predictions by using attention fusion, therefore, helps in overcoming the unimodal misclassifications. in this tweet, the text uses several keywords, such as 'damaged' and 'earthquake', which misleads the text model in predicting it as severely damaged. however, the image does not hold the same perspective. by combining the feature representations, attention fusion can correctly predict the tweet as having mild damage. the given tweet is informative and therefore, it is considered for damage analysis. however, the text classifier, despite the presence of words like 'killed' and 'destroyed', incorrectly classifies it to the non-infrastructural damage class. the image classifier correctly identifies the presence of damage, and therefore, the overall prediction for the tweet is infrastructural damage, which is correct. furthermore, both the text and image models are unable to identify the severity of damage present, but the proposed system can detect the presence of severe damage using attention fusion. the sample shows how the severity analysis module combines the text and visual cues by identifying and attending to more important features than others. this helps in modelling the dependency between the two modalities, even when both, individually give incorrect predictions. the image in the tweet shows some hurricane destroyed structures, depicting severe damage. however, the text talks about 'raising funds and rebuilding', which does not indicate severe damage. the multimodal system learns to attend the text features more and correctly classifies the sample as having no damage, even though both the individual models predicted incorrectly. furthermore, in this particular example, even by using the or function, the system could not correctly classify it as not having infrastructural damage. yet, the damage severity analysis module identifies this false positive and correctly classifies it. in this section, we discuss some of the practical and deployment aspects of our system, as well as some of its limitations. we simulate an experiment to analyze the computational efficiency of the individual modules in terms of the time they take to process a tweet, i.e., the latency. we are particularly interested in analyzing the relevance filtering and severity analysis modules. we developed a simulator program to act as the tweet streaming module that publishes tweets at different load rates (number of tweets in second) to be processed by the downstream modules. 
the modules also process the incoming tweets at the same rate. we calculate the average time for processing a tweet by a particular module as the total processing time divided by the total number of tweets used in the experiment. we used , multimodal tweets from crisismmd, streamed at varying rates. the performance of the two relevance filtering modules and the severity analysis module as we gradually increase the load rate is shown in the fig. . as a whole, including all the modules, we observed that on an average, the system can process tweets in minute. this experiment was done using an intel i - u cpu having gb ram. one can expect to see an improvement if a gpu is used over a cpu. generalization. the proposed system is also general and robust, especially in three aspects. firstly, the results of our lodo experiments indicate that the system can perform well in case it is used for analyzing new disasters, which were not used for training the system. this makes it suitable for real-world deployment where circumstance with new disaster categories cannot be foreseen. furthermore, we also saw how the two main modules of the system work seamlessly, even when one of the modalities is missing. this ensures that the system can utilize all the information that is available on the media platforms to analyze the disaster. finally, the second module in relevance filtering can be trained to suit the needs of several relief groups that target different types of damage, and therefore, the system is capable of being utilized for many different response activities. limitations. although the proposed system is robust and efficient, some limitations must be considered before it can be used in real-time. firstly, the system is contingent on the credibility i.e., the veracity of the content shared by users on social media platforms. it may so happen that false information is spread by some users to create panic amongst others [ ] . in this work, we have not evaluated the content for veracity, and therefore, it will not be able to differentiate such false news media. another aspect that is also critical to all systems that utilize data generated on social media is the socio-economic and geographic bias. specifically, the system will only be able to get information about the areas where people have access to social media, mostly the urban cities, whereas damage in the rural locations may go unnoticed since it did not appear on twitter or any other platform. one way to overcome this is to make use of aerial images, that can provide a top view of such locations as the rural lands. however, this again has a drawback as to utilize aerial images effectively, a bulk load of data would have to be gathered and processed. identifying damage and human casualties in real-time from social media posts is critical to providing prompt and suitable resources and medical attention, to save as many lives as possible. with millions of social media users continuously posting content, an opportunity is present to utilize this data and learn a damage recognition system. in this work, we propose memis, a novel multimodal emergency management information system for identifying and analyzing the level of damage severity in social media posts with the scope for betterment in disaster management and planning. the system leverages both textual and visual cues to automate the process of damage identification and assessment from social media data. 
our results show how the proposed multimodal system outperforms the state-of-the-art unimodal frameworks. we also report the system's responsiveness through extensive system analysis. the leave-one-disaster-out training setting proves the system is generic and can be deployed for any new unseen disaster. crisismmd: multimodal twitter datasets from natural disasters processing social media images by combining human and machine computing during crises crisisdps: crisis data processing services a twitter tale of three hurricanes: harvey, irma, and maria. arxiv multimodal vehicle detection: fusing d-lidar and color camera data enriching word vectors with subword information application of image analytics for disaster response in smart cities smote: synthetic minority over-sampling technique imagenet: a large-scale hierarchical image database how much data do we create every day? the mind-blowing stats everyone should read deep residual learning for image recognition an attention-based decision fusion scheme for multimedia information retrieval convolutional neural networks for sentence classification recurrent convolutional neural networks for text classification from chirps to whistles: discovering eventspecific informative content from twitter damage identification in social media posts using multimodal deep learning handcrafted vs. non-handcrafted features for computer vision classification a survey on transfer learning multimodal deep learning based on multiple correspondence analysis for disaster management a computationally efficient multimodal classification approach of disaster-related twitter images natural disasters detection in social media and satellite imagery: a survey multimodal analysis of user-generated multimedia content very deep convolutional networks for large-scale image recognition detecting informative tweets during disaster using deep neural networks rethinking the inception architecture for computer vision attention is all you need multimodal fusion of eeg and fmri for epilepsy detection madis: a multimedia-aided disaster information integration system for emergency management how transferable are features in deep neural networks? in: advances in neural information processing systems key: cord- -ch fg rp authors: grand, adrien; muir, robert; ferenczi, jim; lin, jimmy title: from maxscore to block-max wand: the story of how lucene significantly improved query evaluation performance date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: ch fg rp the latest major release of lucene (version ) in march incorporates block-max indexes and exploits the block-max variant of wand for query evaluation, which are innovations that originated from academia. this paper shares the story of how this came to be, which provides an interesting case study at the intersection of reproducibility and academic research achieving impact in the “real world”. we offer additional thoughts on the often idiosyncratic processes by which academic research makes its way into deployed solutions. we share the story of how an innovation that originated from academia-blockmax indexes and the corresponding block-max wand query evaluation algorithm of ding and suel [ ] -made its way into the open-source lucene search library. 
this represents not only a case study in widespread reproducibility, since every recent deployment of lucene has access to these features and thus their performance benefits can be easily measured, but also of academic research achieving significant impact. how did these innovations make their way from the "ivory tower" into the "real world"? we recount the sequence of events, including false starts, that finally led to the inclusion of block-max wand in the latest major version of lucene (version ), released in march . we see this paper as having two main contributions beyond providing a narrative of events: first, we report results of experiments that attempt to match the original conditions of ding and suel [ ] and present additional results on a number of standard academic ir test collections. these experiments characterize the performance of lucene's implementation and show the extent to which performance improvements are retained when moving from a research prototype to a production codebase. second, we offer a number of observations about the adoption of academic innovations, perhaps providing some insight into how academics might achieve greater real-world impact with their work. from its very beginnings in , lucene has mostly existed in a "parallel universe" from academic ir researchers. part of this can be attributed to its "target audience": developers who wish to build real-world search applications, as opposed to researchers who wish to write papers. academic ir researchers have a long history of building and sharing search engines, dating back to the mid s with cornell's smart system [ ] . the tradition continues to this day, with lemur/indri [ , ] and terrier [ , ] being the most successful examples of open-source academic search engines, still popular with many researchers today. until recently, there has been little exchange between lucene and these systems, other than a few academic workshops [ , ] . lucene has, for the longest time, been somewhat maligned in the academic ir community. for much of its existence, its default ranking model was a variant of tf-idf that was not only ad hoc, but demonstrably less effective than ranking models that were widely available in academic systems [ ] . okapi bm was not added to lucene until , more than a decade after it gained widespread adoption in the research community; the consensus had long emerged that it was more effective than tf-idf variants. this lag has contributed to the broad perception by researchers that lucene produces poor search results and is illsuited for information retrieval research. this negative perception of lucene, however, began to change a few years ago. in , an evaluation exercise known as the "open-source reproducibility challenge" [ ] benchmarked seven open-source search engines and demonstrated that lucene was quite competitive in terms of both effectiveness and efficiency. it was the fourth fastest system (of seven) in terms of query evaluation, beating all the systems that were better than it in terms of effectiveness. since then, there has been a resurgence of interest in adopting lucene for information retrieval research, including a number of workshops that brought together like-minded researchers over the past few years [ , ] . 
anserini [ , ] is an open-source toolkit built on lucene that was specifically designed to support replicable information retrieval research by providing many research-oriented features missing from lucene, such as out-of-the-box support for a variety of common test collections. the project aims to better align ir researchers and practitioners, as lucene has become the de facto platform used in industry to build production search solutions (typically via systems such as elasticsearch and solr). the experiments in this paper were conducted with anserini. from maxscore to block-max wand at berlin buzzwords in , stefan pohl gave a presentation about max-score [ ] to raise awareness about efficient retrieval techniques in the lucene community [ ] . the presentation was accompanied by a working prototype. this contribution was exciting but also challenging to integrate as it conflicted with some of the flexibility that lucene provides, requiring an index rewrite. there were ideas on how to address these issues, but they entailed a lot of effort, and so the issue remained stalled for about five years. five years is a long time and many changes occurred meanwhile. the switch from tf-idf to bm as lucene's default scoring function in created a natural upper bound on scores due to bm 's saturation effect, which made it possible to implement retrieval algorithms that reasoned about maximum scores without changes to lucene's index format. this led to an effort to implement a general-purpose wand [ ] , based on a previous implementation for booleanquery. lucene received support for wand at the end of (although it wasn't released until version . with block-max indexes). implementing wand introduced two new issues. first, the total hit count would no longer be accurate, since not all matches are visited. common analytics use cases depend on this count, and many search engines display this value in their interfaces (see additional discussion in sect. ). second, the fact that some lucene queries could produce negative scores became problematic, so lucene now requires positive scores. support for block-max indexes was the final feature that was implemented, based on the developers' reading of the paper by ding and suel [ ] , which required invasive changes to lucene's index format. note that the paper describes directly storing the maximum impact score per block, which fixes the scoring function at indexing time. to provide flexibility in being able to swap in different scoring functions, the lucene implementation stores all tf (term frequency) and dl (document length) pairs that might yield the maximum score. if we have one such pair (tf i , dl i ) then we can remove all other pairs (tf j , dl j ) where tf j ≤ tf i ∧ dl j ≥ dl l , since they are guaranteed to yield lower (or equal) scores-based on the assumption that scores increase monotonically with increasing tf and decreasing dl. this is implemented by accumulating all such pairs in a tree-like structure during the indexing process. these pairs are stored in skip lists, so the information is available to groups of , , , , . . . blocks, allowing query evaluation to skip over more than one block at a time. an interesting coda to this story is that academic researchers were exploring alternatives to per-block impact scores circa , for exactly the same reason (to allow the scoring model to be defined at search time). 
for example, macdonald and tonellotto [ ] showed how to derive tight approximate upper bounds for block-max wand, based on work that dates back to [ ] . similarly, the recently-released pisa research system stores flexible block-level metadata [ ] . unfortunately, the lucene developers were not aware of these developments during their implementation. the journey from maxscore to block-max wand concluded in march , with the rollout of all these features in the version . release of lucene. they are now the out-of-the-box defaults in the world's most popular search library. during the implementation of block-max wand, performance improvements were quantified in terms of lucene's internal benchmark suite, which showed a × to × improvement in query evaluation performance. as part of a formal reproducibility effort, we present experiments that attempt to match, to the extent practical, the original conditions described by ding and suel [ ] . according to the paper, experiments were conducted on the gov web collection, on a randomly-selected subset of queries from the trec and efficiency tracks, which we were able to obtain from the authors. for their experiments, the inverted index was completely loaded into main memory and query evaluation latency was measured to retrieval depth ten. our experiments were conducted with the anserini ir toolkit, comparing v . . , which depends on lucene . and uses an optimized exhaustive or query evaluation strategy [ ] , with v . . , which depends on lucene . and uses block-max wand. we used anserini's standard regression test settings on the different collections, as described on its homepage. results represent averages over three trials on a warm cache. while the indexes were not explicitly loaded into memory, lucene benefits from caching at the os level. all experiments were conducted using a single thread on an otherwise idle server with dual intel xeon e - v processors and tb ram running rhel (release . ). results are shown in table , where figures in the top three rows are copied from table in the original paper. it is interesting that ding and suel report a much larger increase in performance comparing exhaustive or to bmw ( × on trec and × on trec ) than under the comparable conditions in lucene (a more modest improvement of around ×). this is due to a more optimized implementation of exhaustive or in lucene, which, for example, implements block processing [ ] . interestingly, ding and suel report faster query evaluation in absolute terms, even on hardware that is much older: the differences include c++ vs. java, as well as the simplicity of a research prototype vs. the realities of a fully-featured search library. beyond implementation differences, lucene must additionally compute the upper bound scores per block from the stored (tf, dl) pairs on the fly. we also report performance evaluations on two other standard test collections frequently used in academic information retrieval: clueweb b and clueweb -b , with the same sets of queries. these results are shown in table , where we report figures for different values of retrieval depth k, also averaged over three trials. these numbers are consistent with fig. in ding and suel's paper: performance of exhaustive or drops modestly as depth k increases, but bmw performance degrades much more quickly. this is exactly as expected.
finally, we quantify the modest increase in indexing time due to the need to maintain (tf, dl) pairs in the inverted indexes, shown in table (averaged over three trials, using threads in all cases). these experiments used anserini's default regression settings on the respective collections, which build full positional indexes and also store the raw documents. the story of block-max wand in lucene provides a case study of how an innovation that originated in academia made its way into the world's most widely-used search library and achieved significant impact in the "real world" through hundreds of production deployments worldwide (if we consider the broader lucene ecosystem, which includes systems such as elasticsearch and solr). as there are very few such successful case studies (the other prominent one being the incorporation of bm in lucene), it is difficult to generalize these narratives into "lessons learned". however, here we attempt to offer a few observations about how academic research might achieve greater real-world impact. in short, block-max wand is in lucene because the developers learned about the paper by ding and suel and decided to reimplement it. this is somewhat stating the obvious, but this fateful decision highlights the idiosyncratic nature of technology adoption. we could imagine alternatives where the lucene developers had not come across the paper and developed a comparable solution in isolation, or they might have known about the paper and elected to take a different approach. in either case, the lucene solution would likely differ from block-max wand. this would be akin to convergent evolution in evolutionary biology, whereby different organisms independently evolve similar traits because they occupy similar environments. in such an "alternate reality", this paper would be comparing and contrasting different solutions to handling score outliers, not describing a reproducibility effort. to bring researchers and practitioners closer together, we recommend that the former be more proactive in "evangelizing" their innovations, and the latter be more diligent in consulting the literature. eight years passed from the publication of the original paper ( ) until the release of the lucene version that included block-max wand ( ). the entire course of innovation was actually much longer if we trace the origins back to maxscore ( ) and wand ( ). one obvious question is: why did it take so long? there are many explanations, the most salient of which is the difference between a research prototype and a fully-featured search library that is already widely deployed. this decomposes into two related issues: the technical and the social. from a technical perspective, supporting bmw required invasive changes to lucene's index format and a host of related changes in scoring functions; for example, scores could no longer be negative, and implementations could no longer access arbitrary fields (which was an api change). these had to be staged incrementally. concomitant with technical changes and backwards-compatibility constraints were a host of "social" changes, which required changing users' expectations about the behavior of the software. in short, bmw was not simply a drop-in replacement. for example, as discussed in sect. , the hit count was no longer accurate, which required workarounds for applications that depended on the value. because such major changes can be somewhat painful, they need to be justified by the potential benefits.
this means that only dramatic improvements really have any hope of adoption: multiple-fold, not marginal, performance gains. an interesting side effect is that entire generations of techniques might be skipped; in the case of lucene, the jump was directly from exhaustive or to bmw, leapfrogging intermediate innovations such as maxscore and wand. aiming to achieve real-world impact with academic research is a worthy goal, and we believe that this case study represents an endorsement of efforts to better align research prototypes with production systems, as exemplified by lucene-based projects like anserini. if academic researchers are able to look ahead "down the road" to see how their innovations might benefit end applications, the path from the "ivory tower" to the "real world" might become more smoothly paved. proceedings of the th annual international acm sigir conference on research and development in information retrieval (sigir ) lucene ir: developing information retrieval evaluation resources using lucene efficient query evaluation using a two-level retrieval process implementation of the smart information retrieval system. department of computer science tr space optimizations for total ranking faster top-k document retrieval using block-max indexes toward reproducible baselines: the open-source ir reproducibility challenge from puppy to maturity: experiences in developing terrier upper-bound approximations for dynamic pruning upper bound approximation for blockmaxwand pisa: performant indexes and search for academia combining the language model and inference network approaches to retrieval indri at trec : terabyte track terrier: a high performance and scalable information retrieval platform efficient scoring in lucene open source information retrieval: a report on the sigir workshop query evaluation: strategies and optimizations yet another comparison of lucene and indri performance anserini: enabling the use of lucene for information retrieval research anserini: reproducible ranking baselines using lucene sigir workshop report: open source information retrieval systems (osir ). in: sigir forum acknowledgments. this work was supported in part by the natural sciences and engineering research council (nserc) of canada. we'd like to thank craig macdonald, joel mackenzie, antonio mallia, and nicola tonellotto for helpful discussions on the intricacies of computing flexible per-block score bounds, and torsten suel for providing us with the original queries used in their evaluations. key: cord- -j eboa authors: kamphuis, chris; de vries, arjen p.; boytsov, leonid; lin, jimmy title: which bm do you mean? a large-scale reproducibility study of scoring variants date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: j eboa when researchers speak of bm , it is not entirely clear which variant they mean, since many tweaks to robertson et al.'s original formulation have been proposed. when practitioners speak of bm , they most likely refer to the implementation in the lucene open-source search library. does this ambiguity "matter"? we attempt to answer this question with a large-scale reproducibility study of bm , considering eight variants. experiments on three newswire collections show that there are no significant effectiveness differences between them, including lucene's often maligned approximation of document length.
as an added benefit, our empirical approach takes advantage of databases for rapid ir prototyping, which validates both the feasibility and methodological advantages claimed in previous work. bm [ ] is perhaps the most well-known scoring function for "bag of words" document retrieval. it is derived from the binary independence relevance model to include within-document term frequency information and document length normalization in the probabilistic framework for ir [ ] . although learning-to-rank approaches and neural ranking models are widely used today, they are typically deployed as part of a multi-stage reranking architecture, over candidate documents supplied by a simple term-matching method using traditional inverted indexes [ ] . often, this is accomplished using bm , and thus this decades-old scoring function remains a critical component of search applications today. as many researchers have previously observed, e.g., trotman et al. [ ] , the referent of bm is quite ambiguous. there are, in fact, many variants of the scoring function: beyond the original version proposed by robertson et al. [ ] , many variants exist that include small tweaks by subsequent researchers. also, researchers using different ir systems report (sometimes quite) different effectiveness measurements for their implementation of bm , even on the same test collections; consider for example the results reported in osirrc , the open-source ir replicability challenge at sigir [ ] . furthermore, bm is parameterized in terms of k and b (plus k , k in the original formulation), and researchers often neglect to include the parameter settings in their papers. our goal is a large-scale reproducibility study to explore the nuances of different variants of bm and their impact on retrieval effectiveness. we include in our study the specifics of the implementation of bm in the lucene open-source search library, a widely-deployed variant "in the real world". outside of a small number of commercial search engine companies, lucene-either stand-alone or via higher-level platforms such as solr and elasticsearch-has today become the de facto foundation for building search applications in industry. our approach enlists the aid of relational databases for rapid prototyping, an idea that goes back to the s and was more recently revived by mühleisen et al. [ ] . adding or revising scoring functions in any search engine requires custom code within some framework for postings traversal, making the exploration of many different scoring functions (as in our study) a tedious and error-prone process. as an alternative, it is possible to "export" the inverted index to a relational database and recast the document ranking problem into a database (specifically, sql) query. varying the scoring function, then, corresponds to varying the expression for calculating the score in the sql query, allowing us to explore different bm variants by expressing them declaratively (instead of programming imperatively). we view our work as having two contributions: -we conducted a large-scale reproducibility study of bm variants, focusing on the lucene implementation and variants described by trotman et al. [ ] . their findings are confirmed: effectiveness differences in ir experiments are unlikely to be the result of the choice of bm variant a system implemented. -from the methodological perspective, our work can be viewed as reproducing and validating the work of mühleisen et al. [ ] , the most recent advocate of using databases for rapid ir prototyping. 
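to make the idea of ranking as a sql query concrete, the following self-contained python sketch builds a toy in-memory index and ranks it with two interchangeable score expressions (a robertson-style idf and a variant that adds a constant one inside the log so the idf cannot go negative). the schema, statistics, parameter values, and queries are illustrative assumptions and do not reproduce the exact sql used by mühleisen et al. or in this study; swapping the bm variant amounts to swapping the score expression interpolated into the query.

    import math
    import sqlite3

    # hypothetical schema loosely mirroring an exported inverted index:
    #   terms(termid, term, df), docs(docid, len), postings(termid, docid, tf)
    # two interchangeable score expressions: robertson-style idf (can go
    # negative when df > n/2) and an idf with a constant one added inside
    # the log so it never goes negative.
    SCORES = {
        "robertson": """LOG((:n - t.df + 0.5) / (t.df + 0.5)) *
                        (p.tf * (:k1 + 1)) /
                        (p.tf + :k1 * (1 - :b + :b * d.len / :avgdl))""",
        "nonneg_idf": """LOG(1 + (:n - t.df + 0.5) / (t.df + 0.5)) *
                         (p.tf * (:k1 + 1)) /
                         (p.tf + :k1 * (1 - :b + :b * d.len / :avgdl))""",
    }

    QUERY = """
    SELECT p.docid, SUM({score}) AS score
    FROM postings p
    JOIN terms t ON t.termid = p.termid
    JOIN docs  d ON d.docid  = p.docid
    WHERE t.term IN ('neural', 'ranking')
    GROUP BY p.docid
    ORDER BY score DESC
    LIMIT 10
    """

    conn = sqlite3.connect(":memory:")
    # register a LOG function for sqlite; a full-featured column store
    # typically provides one natively.
    conn.create_function("LOG", 1, math.log)
    conn.executescript("""
    CREATE TABLE terms(termid INTEGER, term TEXT, df INTEGER);
    CREATE TABLE docs(docid INTEGER, len INTEGER);
    CREATE TABLE postings(termid INTEGER, docid INTEGER, tf INTEGER);
    INSERT INTO terms VALUES (1, 'neural', 2), (2, 'ranking', 1);
    INSERT INTO docs VALUES (1, 300), (2, 120);
    INSERT INTO postings VALUES (1, 1, 3), (1, 2, 1), (2, 2, 2);
    """)
    params = {"n": 2, "k1": 1.2, "b": 0.75, "avgdl": 210.0}
    for name, expr in SCORES.items():
        print(name, conn.execute(QUERY.format(score=expr), params).fetchall())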
since the idf component of the robertson et al. formulation is negative when df t > n/ , lucene adds a constant one before calculating the log value. second, the document length used in the scoring function is compressed (in a lossy manner) to a one byte value, denoted l dlossy . with only distinct document lengths, lucene can pre-compute the value of k · ( − b + b · (l dlossy /l avg )) for each possible length, resulting in fewer computations at query time. lucene (accurate) represents our attempt to measure the impact of lucene's lossy document length encoding. we implemented a variant that uses exact document lengths, but is otherwise identical to the lucene default. atire [ ] implements the idf component of bm as log (n/df t ), which also avoids negative values. the tf component is multiplied by k + to make it look more like the classic rsj weight; this has no effect on the resulting ranked list, as all scores are scaled linearly with this factor. bm l [ ] builds on the observation that bm penalizes longer documents too much compared to shorter ones. the idf component differs, to avoid negative values. the tf component is reformulated in terms of a length-normalized term count c td ; the c td component is further modified by adding a constant δ to it, boosting the score for longer documents. the authors report using δ = . for highest effectiveness. bm + [ ] encodes a general approach for dealing with the issue that ranking functions unfairly prefer shorter documents over longer ones. the proposal is to add a lower-bound bonus when a term appears at least one time in a document. the difference with bm l is that the constant δ is added to the tf component itself. the idf component is again changed to a variant that disallows negative values. bm -adpt [ ] is an approach that varies k per term (i.e., uses term-specific k values). in order to determine the optimal value for k , the method starts by identifying the probability of a term occurring at least once in a document as (df r + . )/(n + ). the probability of the term occurring one more time is then defined as (df r+ + . )/(df r + ). the information gain of a term occurring r + instead of r times is defined as g r q = log ((df r+ + . )/(df r + )) − log ((df tr + . )/(n + )), where df r is defined as follows: |{d ∈ d t : c td ≥ r − . }| if r > , df t if r = , and n if r = (c td is the same as in bm l). the information gain is calculated for r ∈ { , . . . , t }, until g r q > g r+ q . the optimal value for k is then determined by finding the value for k that minimizes the corresponding equation. essentially, this gives a value for k that maximizes information gain for that specific term; k and g q are then plugged into the bm -adpt formula. we found that the optimal value of k is actually not defined for about % of the terms. a unique optimal value for k only exists when r > while calculating g r q . for many terms, especially those with a low df , g r q > g r+ q occurs before r > . in these cases, picking different values for k has virtually no effect on retrieval effectiveness. for undefined values, we set k to . , the same as trotman et al. [ ] . tf l•δ•p ×idf [ ] models the non-linear gain of a term occurring multiple times in a document as + log ( + log (tf td )). to ensure that terms occurring at least once in a document get boosted, the approach adds a fixed component δ, following bm +. these parts are combined into the tf component using tf td /( − b + b · (l d /l avg )). the same idf component as in bm + is used. our experiments were conducted using anserini (v . .
) on java to create an initial index, and subsequently using relational databases for rapid prototyping, which we dub "olddog" after mühleisen et al. [ ] ; following that work we use monetdb as well. evaluations with lucene (default) and lucene (accurate) were performed directly in anserini; the latter was based on previously-released code that we updated and incorporated into anserini. the inverted index was exported from lucene to olddog, ensuring that all experiments share exactly the same document processing pipeline (tokenization, stemming, stopword removal, etc.). while exporting the inverted index, we precalculate all k values for bm adpt as suggested by lv and zhai [ ] . as an additional verification step, we implemented both lucene (default) and lucene (accurate) in olddog and compared results to the output from anserini. we were able to confirm that the results are the same, setting aside unavoidable differences related to floating point precision. all bm variants are then implemented in olddog as minor variations upon the original sql query provided in mühleisen et al. [ ] . the term-specific parameter optimization for the adpt variant was already calculated during the index extraction stage, allowing us to upload the optimal (t, k) pairs and directly use the term-specific k values in the sql query. the advantage of our experimental methodology is that we did not need to implement a single new ranking function from scratch. all the sql variants implemented for this paper can be found on github. the experiments use three trec newswire test collections: trec disks and , excluding congressional record, with topics and relevance judgments from the trec robust track (robust ); the new york times annotated corpus, with topics and relevance judgments from the trec common core track (core ); the trec washington post corpus, with topics and relevance judgments from the trec common core track (core ). following standard experimental practice, we assess ranked list output in terms of average precision (ap) and precision at rank (p@ ). the parameters shared by all models are set to k = . and b = . , anserini's defaults. the parameter δ is set to the value reported as best in the corresponding source publication. table presents the effectiveness scores for the implemented retrieval functions on all three test collections. all experiments were run on a linux desktop (fedora , kernel . . , selinux enabled) with cores (intel xeon cpu e - v @ . ghz) and gb of main memory; the monetdb . . server was compiled from source using the --enable-optimize flag. table presents the average retrieval time per query in milliseconds (without standard deviation for anserini, which does not report time per query). monetdb uses all cores for both inter- and intra-query parallelism, while anserini is single-threaded. the observed differences in effectiveness are very small and can be fully attributed to variations in the scoring function; our methodology fixes all other parts of the indexing pipeline (tag cleanup, tokenization, stopwords, etc.). both an anova and tukey's hsd show no significant differences between any of the variants, on all test collections. this confirms the findings of trotman et al. [ ] : across the ir literature, we find that differences due to more mundane settings (such as the choice of stopwords) are often larger than the differences we observe here. although we find no significant improvements over the original robertson et al.
[ ] formulation, it might still be worthwhile to use a variant of bm that avoids negative ranking scores. comparing lucene (default) and lucene (accurate), we find negligible differences in effectiveness. however, the differences in retrieval time are also negligible, which calls into question the motivation behind the original length approximation. currently, the similarity function and thus the document length encoding are defined at index time. storing exact document lengths would allow for different ranking functions to be swapped at query time more easily, as no information would be discarded at index time. accurate document lengths might additionally benefit downstream modules that depend on lucene. we therefore suggest that lucene might benefit from storing exact document lengths. in summary, this work describes a double reproducibility study-we methodologically validate the usefulness of databases for ir prototyping claimed by mühleisen et al. [ ] and performed a large-scale study of bm to confirm the findings of trotman et al. [ ] . returning to our original motivating question regarding the multitude of bm variants: "does it matter?", we conclude that the answer appears to be "no, it does not". effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures ceur workshop proceedings of the open-source ir replicability challenge (osirrc ) at sigir adaptive term frequency normalization for bm lower-bounding term frequency normalization when documents are very long old dogs are great at new tricks: column stores for ir prototyping the probabilistic relevance framework: bm and beyond okapi at trec- composition of tf normalizations: new insights on scoring functions for ad hoc ir towards an efficient and effective search engine improvements to bm and language models examined acknowledgements. this work is part of the research program commit data with project number . . , which is (partly) financed by the nwo. additional support was provided by the natural sciences and engineering research council (nserc) of canada. key: cord- - f p t authors: hofstätter, sebastian; zlabinger, markus; hanbury, allan title: neural-ir-explorer: a content-focused tool to explore neural re-ranking results date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: f p t in this paper we look beyond metrics-based evaluation of information retrieval systems, to explore the reasons behind ranking results. we present the content-focused neural-ir-explorer, which empowers users to browse through retrieval results and inspect the inner workings and fine-grained results of neural re-ranking models. the explorer includes a categorized overview of the available queries, as well as an individual query result view with various options to highlight semantic connections between query-document pairs. the neural-ir-explorer is available at: https://neural-ir-explorer.ec.tuwien.ac.at/. the prevalent evaluation of information retrieval systems, based on metrics that are averaged across a set of queries, distills a large variety of information into a single number. this approach makes it possible to compare models and configurations, however it also decouples the explanation from the evaluation. with the adoption of neural re-ranking models, where the scoring process is arguably more complex than traditional retrieval methods, the divide between result score and the reasoning behind it becomes even stronger. 
because neural models learn based on data, they are more likely to evade our intuition about how their components should behave. having a thorough understanding of neural reranking models is important for anybody who wants to analyze or deploy these models [ , ] . in this paper we present the neural-ir-explorer: a system to explore the output of neural re-ranking models. the explorer complements metrics-based evaluation by focusing on the content of queries and documents, and how the neural models relate them to each other. we enable users to efficiently browse the output of a batched retrieval run. we start with an overview page showing all evaluated queries. we cluster the queries using their term representations taken from the neural model. users can explore each query result in more detail: we show the internal partial scores and content of the returned documents with different highlighting modes to surface the inner workings of a neural re-ranking model. here, users can also select different query terms to individually highlight their connections to the terms in each document. in our demo we focus on the kernel-pooling models knrm [ ] and tk [ ] evaluated on the msmarco-passage [ ] collection. the kernel-pooling makes it easy to analyze temporary scoring results. finally, we discuss some of the insights we gained about the knrm model using the neural-ir-explorer. the neural-ir-explorer is available at: https://neural-ir-explorer.ec.tuwien.ac.at/. our work sits at the intersection of visual ir evaluation and the interpretability of neural networks with semantic word representations. the ir community mainly focused on tools to visualize result metrics over different configurations: claire allows users to select and evaluate a broad range of different settings [ ] ; aviator integrates basic metric visualization directly in the experimentation process [ ] ; and the retrieval tool provides a data-management platform for multimedia retrieval including differently scoped metric views [ ] . lipani et al. [ ] created a tool to inspect different pooling strategies, including an overview of the relevant result positions of retrieval runs. from a visualization point of view, term-by-term similarities are similar to attention, as both map a single value to a token. lee et al. [ ] created a visualization system for attention in a translation task. transformer-based models provide ample opportunity to visualize different aspects of the many attention layers used [ , ] . visualizing simpler word embeddings is possible via a neighborhood of terms [ ] . now we showcase the capabilities of the neural-ir-explorer (sect. . ) and how we already used it to gain novel insights (sect. . ). the explorer displays data created by a batched evaluation run of a neural re-ranking model. the back-end is written in python and uses flask as the web server; the front-end uses vue.js. the source code is available at: github.com/sebastian-hofstaetter/neural-ir-explorer. when users first visit our website, they are greeted with a short introduction to neural re-ranking and the selected neural model. we provide short explanations throughout the application, so that new users can effectively use our tool. we expect this tool's audience to be not only neural re-ranking experts, but anyone who is interested in ir. the central hub of the neural-ir-explorer is the query overview (fig. ) . we organize the queries by clustering them in visually separated cards. we collapse the cards to only show a couple of queries by default.
this is especially useful for collections with a large number of queries, such as the msmarco collection we use in this demo (the dev set contains over . queries). in the cluster header we display a manually assigned summary title, the median result of the queries, and the median difference to the initial bm ranking, as this is the basis for the re-ranking. each query is displayed with the rank of the first relevant document, the difference to bm , and the query text. the controls at the top allow sorting the queries and clusters, including a random option to discover new queries. users can expand all clusters or apply a term-prefix filter to search for specific words in the queries. once a user clicks on a query, they are redirected to the query result view (fig. ) . here, we offer an information-rich view of the top documents returned by the neural re-ranking model. each document is displayed in full with its rank, overall and kernel-specific scores. the header controls allow highlighting the connections between the query and document terms in two different ways. first, users can choose a minimum cosine similarity that a term pair must exceed to be colored, which is a simple way of exploring the semantic similarity of the word representations. secondly, for kernel-pooling models that we support, we offer a highlight mode much closer to how the neural model sees the document: based on the association of a term to a kernel. users can select one or more kernels and terms are highlighted based on their value after the kernel transformation. additionally, we enable users to select two documents and compare them side-by-side (fig. ) . users can highlight query-document connections as in the list view. additionally, we display the different kernel-scores in the middle, so that users can effectively investigate which kernels of the neural model have the deciding influence on the different scores for the two documents. we already found the neural-ir-explorer to be a useful tool to analyze the knrm neural model and understand its behaviors better. the knrm model includes a kernel for exact matches (cosine similarity of exactly ); however, judging from the displayed kernel scores, this kernel is not a deciding factor. most of the time the kernels for . & . (meaning quite close cosine similarities) are in fact the deciding factor for the overall score of the model. we assume this is due to the fact that every candidate document (retrieved via exact-match bm ) contains exact matches, and therefore exact matching is not a differentiating factor anymore, a specific property of the re-ranking task. additionally, the neural-ir-explorer also illuminates the pool bias [ ] of the msmarco ranking collection: the small number of judged documents per query makes the evaluation fragile. users can see how relevant unjudged documents are actually ranked higher than the relevant judged documents, wrongly decreasing the model's score. we presented the content-focused neural-ir-explorer to complement metric-based evaluation of retrieval models. the key contribution of the neural-ir-explorer is to empower users to efficiently explore retrieval results at different depths. the explorer is a first step to open the black boxes of neural re-ranking models, as it investigates neural network internals in the retrieval task setting. the seamless and instantly updated visualizations of the neural-ir-explorer offer a great foundation for future work, both on neural ranking models and on how we evaluate them.
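as an illustration of the kernel-based highlight mode described above, the following python sketch maps query-document cosine similarities through gaussian kernels in the style of knrm and highlights the document terms whose activation on the selected kernels passes a threshold. the kernel locations, the width, and the threshold are assumed values, not the exact configuration of the demonstrated models.

    import numpy as np

    # assumed kernel configuration in the style of knrm: one exact-match kernel
    # at mu=1.0 and soft-match kernels spaced over [-1, 1]; sigma is illustrative.
    MUS = np.array([1.0, 0.9, 0.7, 0.5, 0.3, 0.1, -0.1, -0.3, -0.5, -0.7, -0.9])
    SIGMA = 0.1

    def kernel_activations(sim_matrix):
        """sim_matrix: (num_query_terms, num_doc_terms) cosine similarities.
        returns (num_kernels, num_query_terms, num_doc_terms) activations."""
        s = sim_matrix[None, :, :]
        mu = MUS[:, None, None]
        return np.exp(-((s - mu) ** 2) / (2 * SIGMA ** 2))

    def highlight(sim_matrix, selected_kernels, threshold=0.5):
        """boolean mask over document terms: a term is highlighted if, for any
        query term, its activation on one of the selected kernels passes the
        threshold (mirroring the per-kernel highlight mode of the explorer)."""
        act = kernel_activations(sim_matrix)          # (K, Q, D)
        best = act[selected_kernels].max(axis=(0, 1)) # best over kernels and query terms
        return best >= threshold

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        sims = np.clip(rng.normal(0.3, 0.3, size=(3, 12)), -1.0, 1.0)
        print(highlight(sims, selected_kernels=[2, 3]))  # kernels around 0.7 and 0.5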
claire: a combinatorial visual analytics system for information retrieval evaluation ms marco: a human generated machine reading comprehension dataset visualizing and measuring the geometry of bert a progressive visual analytics tool for incremental experimental evaluation interactive analysis of word vector embeddings let's measure run time! extending the ir replicability infrastructure to include performance aspects on the effect of lowfrequency terms on neural-ir models interpretable & time-budgetconstrained contextualization for re-ranking retrieval: an online performance evaluation tool for information retrieval methods interactive visualization and manipulation of attention-based neural machine translation visual pool: a tool to visualize and interact with the pooling method the impact of fixedcost pooling strategies on test collection bias a multiscale visualization of attention in the transformer model end-to-end neural ad-hoc ranking with kernel pooling acknowledgements. this work has received funding from the european union's horizon research and innovation program under grant agreement no. . key: cord- -lof r authors: landin, alfonso; parapar, javier; barreiro, Álvaro title: novel and diverse recommendations by leveraging linear models with user and item embeddings date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: lof r nowadays, item recommendation is an increasing concern for many companies. users tend to be more reactive than proactive for solving information needs. recommendation accuracy became the most studied aspect of the quality of the suggestions. however, novel and diverse suggestions also contribute to user satisfaction. unfortunately, it is common to harm those two aspects when optimizing recommendation accuracy. in this paper, we present eer, a linear model for the top-n recommendation task, which takes advantage of user and item embeddings for improving novelty and diversity without harming accuracy. in recent years, the way users access services has shifted from a proactive approach, where the user actively looks for the information, to one where the users take a more passive role, and content is suggested to them. within this transformation, recommender systems have played a pivotal role, enabling an increase in user engagement and revenue. recommender systems are usually classified into three families [ ] . the first approach, content-based systems, use item metadata to produce recommendations [ ] . the second family, collaborative filtering, is composed of systems that exploit the past interactions of the users with the items to compute the recommendations [ , ] . these interactions can take several forms, such as ratings, clicks, purchases. finally, hybrid approaches combine both to generate suggestions. collaborative filtering (cf) systems can be divided into memory-based systems, that use the information about these interactions directly to compute the recommendations, and model-based systems, that build models from this information that are later used to make the recommendations. in this paper, we will present a cf model to address the top-n recommendation task [ ] . the objective of a top-n recommender is to produce a ranked list of items for each user. these systems can be evaluated using traditional ir metrics over the rankings [ , ] . in that evaluation approach, accuracy is usually the most important metric and has been the focus of previous research and competitions [ ] . 
nevertheless, other properties are also important, such as diversity and novelty [ , ] . diversity is the ability of the system to make recommendations that include items equitably from the whole catalog, which is usually desired by vendors [ , ] . on the other hand, novelty is the capacity of the system to produce unexpected recommendations. this characteristic is a proxy for serendipity, associated with higher user engagement and satisfaction [ ] . all these properties, accuracy, diversity and novelty, are linked to the extent that raising accuracy usually lowers the best achievable results in the other properties [ ] . in this paper, we propose a method to augment an existing linear recommendation model to make more diverse and novel recommendations, while maintaining similar accuracy results. we do so by making use of user and item embeddings that are able to capture non-linear relations thanks to the way they are obtained [ ] . experiments conducted on three datasets show that our proposal outperforms the original model in both novelty and diversity while maintaining similar levels of accuracy. with reproducibility in mind, we also make the software used for the experiments publicly available. in this section, we introduce fism, the existing recommendation method we augment in our proposal. after that, we introduce prefs vec, the user and item embedding model used to make this enhancement. fism is a state-of-the-art model-based recommender system proposed by kabbur et al. [ ] . this method learns a low rank factorization of an item-item similarity matrix, which is later used to compute the scores to make the predictions. this method is an evolution of a previous method, slim [ ] , that learns this matrix without factorizing it. factorizing the similarity matrix allows fism to overcome slim's limitation of not being able to learn a similarity other than zero for items that have never both been rated by at least one user. as a side effect of this factorization, it lowers the space complexity of the similarity model from quadratic in the number of items (a full item-item matrix) to linear in the number of items (two low-rank factors). it also drops the non-negativity constraint and the constraint that the diagonal of the similarity matrix has to contain zeroes. as a consequence of these changes, the optimization problem can be solved using regular gradient descent algorithms, instead of the coordinate gradient descent used by slim, leading to faster training times. embedding models allow transforming high-dimensional and sparse vector representations, such as classical one-hot and bag-of-words, into a space with much lower dimensionality. in particular, previous word embedding models, which produce fixed-length dense representations, have proven to be more effective in several nlp tasks [ , , ] . recently, prefs vec [ ] , a new embedding model for obtaining dense user and item representations based on an adaptation of the cbow model [ ] , has shown that these embeddings can be useful for the top-n recommendation task. when used with a memory-based recommender, they are more efficient than the classical representation [ ] . the results show that they can improve not only the accuracy of the results, but also their novelty and diversity. the versatility of this embedding model, in particular of the underlying neural model and the way it is trained, is also shown in [ ] . here the prediction capabilities of the neural model are used directly in a probabilistic recommender.
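a minimal numpy sketch of the factored item-item scoring idea behind fism as summarized above follows. the exclusion of the target item from its own score, the agreement normalization, and the bias terms of the original model are omitted for brevity, and all shapes and values are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, k = 5, 8, 3

    R = (rng.random((n_users, n_items)) > 0.7).astype(float)  # binary interactions
    P = rng.normal(scale=0.1, size=(n_items, k))              # item "history" factors
    Q = rng.normal(scale=0.1, size=(k, n_items))              # item "target" factors

    # low-rank item-item similarity and the resulting score matrix; user/item
    # biases and fism's length normalization are omitted in this sketch.
    S = P @ Q                 # (n_items, n_items)
    scores = R @ S            # (n_users, n_items)

    def top_n(user, n=3):
        # rank only items the user has not interacted with, as in top-n recommendation
        s = np.where(R[user] > 0, -np.inf, scores[user])
        return np.argsort(-s)[:n]

    print(top_n(0))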
in this section, we present our method to enhance diversity and novelty in recommendation, explaining how the model is trained and used to produce recommendations. first, we introduce how the product of user and item embeddings (based on prefs vec) can be used to make recommendations; this product is later used as part of the proposal. as representations of users and items in a space with much lower dimensionality, prefs vec embeddings can be viewed as latent vectors. however, there is no sense in multiplying item and user vectors directly, as they have different bases even when they have the same dimensions. this is a consequence of learning the item and user representations independently, of how prefs vec initializes the parameters of the model, and of how the training is performed. however, it is possible to make this product if we can compute a change of basis matrix t ∈ r d×d to transform the user embeddings into the item embeddings space. this way we can calculate an estimated ratings matrix r̂ using the simple matrix multiplication r̂ = e t f^⊤ , where e ∈ r |u |×d is the matrix of user embeddings, and f ∈ r |i|×d is the matrix of item embeddings, one embedding in each row. the transformation matrix t is learned by minimizing the regularized squared error between r and e t f^⊤ , where r is the ratings matrix and β e is the regularization hyperparameter. this problem can be solved using gradient descent algorithms. once the transformation matrix has been trained, recommendations can be produced by computing the estimated rating matrix r̂ as described in eq. . recommendations are made to each user by sorting the corresponding row and picking the top-n items not already rated by the user. we dubbed this recommender elp, short for embedding linear product, and we present its performance in table in the experiments section. we have seen that linear methods, like fism, can obtain good accuracy figures. on the other hand, as the results in table show, elp is able to provide good figures in novelty and diversity, thanks to the embedding model capturing non-linear relations between users and items. we propose to capture both properties by joining the models together in the eer model (embedding enhanced recommender). we choose the rmse variant of fism as it matches the loss used in elp. we also use a trainable scalar parameter α to join the models, as the scores obtained from each recommender need not be on the same scale. this results in an estimated ratings matrix that adds the fism scores, computed from the factored item-item similarity p q, to the elp scores e t f^⊤ scaled by α, where p ∈ r |i|×k and q ∈ r k×|i| are the low-rank factorization of the item-item similarity matrix. the parameters of the model, p , q, t and α, are learned by solving the joint regularized optimization problem resulting from this combination, using standard gradient descent algorithms. similar to the case of elp, once the parameters are learned, we make the recommendations by calculating the estimated ratings matrix using eq. , sorting each row and picking the top-n items not yet rated by the user corresponding to that row. in this section, we introduce the datasets used to perform our experiments, the evaluation protocol followed and the metrics used. after that, we present the results of our experiments. to evaluate our proposal, we conducted a series of experiments on several datasets from different domains: the movielens m dataset, a movie dataset; the book dataset librarything; and the beeradvocate dataset, consisting of beer reviews. table shows statistics of each collection.
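before turning to the experimental setup, here is a minimal numpy sketch of the scoring described in this section: the change-of-basis product used by elp and an α-weighted combination with the factored item-item term for eer. the combination below follows the description above but is only a sketch; the exact formulation in the original paper may differ in details, fism's normalization and bias terms are omitted, training of t, p, q and α by gradient descent is not shown, and all shapes and values are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    n_users, n_items, d, k = 6, 10, 4, 3

    R = (rng.random((n_users, n_items)) > 0.7).astype(float)  # ratings/interactions
    E = rng.normal(size=(n_users, d))   # user embeddings (e.g. prefs2vec-style)
    F = rng.normal(size=(n_items, d))   # item embeddings
    T = rng.normal(size=(d, d))         # learned change-of-basis matrix
    P = rng.normal(scale=0.1, size=(n_items, k))  # factored item-item similarity
    Q = rng.normal(scale=0.1, size=(k, n_items))
    alpha = 0.5                          # learned scale between the two models

    R_elp = E @ T @ F.T                          # elp: embeddings-only estimate
    R_eer = R @ (P @ Q) + alpha * R_elp          # eer: linear term plus scaled embedding term

    def recommend(user, n=5):
        s = np.where(R[user] > 0, -np.inf, R_eer[user])  # skip already-rated items
        return np.argsort(-s)[:n]

    print(recommend(0))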
in order to perform the experiments, the datasets were divided randomly into train and test sets. the training dataset consisted of % of the ratings of each user, with the remaining % forming the test dataset. we follow the testitems evaluation methodology [ ] to evaluate the performance. to assess the accuracy of the rankings, we use normalized discounted cumulative gain (ndcg), using the standard formulation as described in [ ] , with the ratings in the test set as graded relevance judgments. we considered only items with a rating of or more, on a point scale, to be relevant for evaluation purposes. we also measured the diversity of the recommendations using the complement of the gini index [ ] . finally, we use the mean self-information (msi) [ ] to assess the novelty of the recommendations. all the metrics are evaluated at cut-off because it has been shown to be more robust with respect to the sparsity and popularity biases than shallower cut-offs [ ] . we perform a wilcoxon test [ ] to assess the statistical significance of the improvements regarding ndcg@ and msi@ , with p < . . we cannot apply it to the gini index because we are using a paired test and gini is a global metric. results in table are annotated with their statistical significance. we performed a grid search over the hyperparameters of the original model and our proposal, tuning them to maximize ndcg@ . although we aim to increase diversity and novelty, we want the recommendations to be effective, which is why the tuning is done over accuracy. for the parameters of the prefs vec model, we took those that performed best in [ ] . for reproducibility's sake, values for the best hyperparameters for each collection can be consulted in table (table : best values of the hyperparameters β, β e and k for ndcg@ for fism and our proposals eer and elp on each collection). table shows the values of ndcg@ , gini@ and msi@ for fism, eer and elp. results show that eer outperforms the baseline (fism) on both novelty and diversity. it also surpasses it on accuracy on the movielens m and librarything datasets. in the case of diversity, we can see important improvements. elp, on the other hand, obtains the best diversity and novelty values, but this comes with a big reduction in accuracy. it is common in the field of recommender systems for methods with lower accuracy to have higher values in diversity and novelty. we believe that the ability of the embeddings to find nonlinear relationships contributes to the model novelty and diversity. this property of the model allows it, for example, to discover relationships between popular and not-so-popular items, leading to better diversity. moreover, the integration with the linear model allows eer to keep its advantage in terms of accuracy, clearly surpassing the use of embeddings in isolation (elp).
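as a concrete illustration of the per-query significance test used in the protocol above, the following python sketch runs scipy's paired wilcoxon signed-rank test on hypothetical per-query ndcg values; the numbers are synthetic placeholders, not the values behind the tables.

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(2)
    # hypothetical per-query ndcg values at the evaluation cut-off for the
    # baseline (fism) and for eer, over the same set of queries/users
    ndcg_fism = rng.uniform(0.2, 0.6, size=50)
    ndcg_eer = np.clip(ndcg_fism + rng.normal(0.02, 0.05, size=50), 0.0, 1.0)

    stat, p = wilcoxon(ndcg_eer, ndcg_fism)  # paired test over the same queries
    print(f"wilcoxon statistic={stat:.1f}, p-value={p:.4f}")
    # note: a per-query paired test like this cannot be applied to a global
    # metric such as the gini index, as discussed above.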
as future work, we plan to apply the same technique to other recommender systems, examining if it can be applied in general to enhance the recommendations, independently of the base algorithm chosen for the task. we also envision studying the effects that varying the value of α in eq. has on the recommendations. fab: content-based, collaborative recommendation precision-oriented evaluation of recommender systems the netflix prize performance of recommender algorithms on top-n recommendation tasks blockbuster culture's next rise or fall: the impact of recommender systems on sales diversity beyond accuracy: evaluating recommender systems by coverage and serendipity semantics-aware content-based recommender systems recommender systems handbook evaluating collaborative filtering recommender systems fism: factored item similarity models for top-n recommender systems advances in collaborative filtering when diversity met accuracy: a story of recommender systems prin: a probabilistic recommender with item priors and neural models being accurate is not enough: how accuracy metrics have hurt recommender systems efficient estimation of word representations in vector space distributed representations of words and phrases and their compositionality slim: sparse linear methods for top-n recommender systems a comprehensive survey of neighborhoodbased recommendation methods using score distributions to compare statistical significance tests for information retrieval evaluation glove: global vectors for word representation on the robustness and discriminative power of information retrieval metrics for top-n recommendation collaborative filtering embeddings for memory-based recommender systems item-based relevance modelling of recommendations for getting rid of long tail products. knowl.-based syst a theoretical analysis of ndcg ranking measures solving the apparent diversity-accuracy dilemma of recommender systems key: cord- -cq lbd l authors: almeida, tiago; matos, sérgio title: calling attention to passages for biomedical question answering date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: cq lbd l question answering can be described as retrieving relevant information for questions expressed in natural language, possibly also generating a natural language answer. this paper presents a pipeline for document and passage retrieval for biomedical question answering built around a new variant of the deeprank network model in which the recursive layer is replaced by a self-attention layer combined with a weighting mechanism. this adaptation halves the total number of parameters and makes the network more suited for identifying the relevant passages in each document. the overall retrieval system was evaluated on the bioasq tasks and , achieving similar retrieval performance when compared to more complex network architectures. question answering (qa) is a subfield of information retrieval (ir) that specializes in producing or retrieving a single answer for a natural language question. qa has received growing interest since users often look for a precise answer to a question instead of having to inspect full documents [ ] . similarly, biomedical question answering has also gained importance given the amount of information scattered over large specialized repositories such as medline. research on biomedical qa has been pushed forward by community efforts such as the bioasq challenge [ ] , originating a range of different approaches and systems. 
recent studies on the application of deep learning methods to ir have shown very good results. these neural models are commonly subdivided into two categories based on their architecture. representation-based models, such as the deep structured semantic model (dssm) [ ] or the convolutional latent semantic model (clsm) [ ] , learn semantic representations of texts and score each query-document pair based on the similarity of their representations. on the other hand, models such as the deep relevance matching model (drmm) [ ] or deeprank [ ] follow an interaction-based approach, in which matching signals between query and document are captured and used by the neural network to produce a ranking score. the impact of neural ir approaches is also noticeable in biomedical question answering, as shown by the results on the most recent bioasq challenges [ ] . the top performing team in the document and snippet retrieval sub-tasks in [ ] , for example, used a variation of the drmm [ ] to rank the documents recovered by the traditional bm [ ] . for the task, the same team extended their system with the inclusion of models based on bert [ ] and with joint training for document and snippet retrieval. the main contribution of this work is a new variant of the deeprank neural network architecture in which the recursive layer originally included in the final aggregation step is replaced by a self-attention layer followed by a weighting mechanism similar to the term gating layer of the drmm. this adaptation not only halves the total number of network parameters, thereby speeding up training, but is also more suited for identifying the relevant snippets in each document. the proposed model was evaluated on the bioasq dataset, as part of a document and passage (snippet) retrieval pipeline for biomedical question answering, achieving similar retrieval performance when compared to more complex network architectures. the full network configuration is publicly available at https://github.com/bioinformatics-ua/bioasq, together with code for replicating the results presented in this paper. this section presents the overall retrieval pipeline and describes the neural network architecture proposed in this work for the document ranking step. the retrieval system follows the pipeline presented in fig. , encompassing three major modules: fast retrieval, neural ranking and snippet extraction. the fast retrieval step is focused on minimizing the number of documents passed on to the computationally more demanding neural ranking module, while maintaining the highest possible recall. as in previous studies [ , ] , we adopted elasticsearch (es) with the bm ranking function as the retrieval mechanism. the documents returned by the first module are ranked by the neural network, which also directly provides to the following module the information needed for extracting relevant snippets. these modules are detailed in sects. . and . . the network follows a similar architecture to the original version of deeprank [ ] , as illustrated in fig. . particularly, we build upon the best reported configuration, which uses a cnn in the measurement network and the reciprocal function as the position indicator. the inputs to the network are the query, a set of document passages aggregated by each query term, and the absolute position of each passage.
for the remaining explanation, let us first define a query as a sequence of terms q = {u , u , ..., u q }, where u i is the i-th term of the query; a set of document passages aggregated by each query term as d(u i ) = {p , p , ..., p p }, where p j corresponds to the j-th passage with respect to the query term u i ; and a document passage as p j = {v , v , ..., v s }, where v k is the k-th term of the passage. we chose to aggregate the passages by their respective query term at the input level, since it simplifies the neural network flow and implementation. the detection network receives as input the query and the set of document passages and creates a similarity tensor (interaction matrix) s ∈ [− , ] q×s for each passage, where each entry s ij corresponds to the cosine similarity between the embeddings of the i-th query term and the j-th passage term. the measurement network step is the same as used in the original deeprank model. it takes as inputs the previously computed tensors s and the absolute position of each passage and applies a two-dimensional convolution followed by a global max pooling operation, to capture the local relevance present in each tensor s, as defined in eq. . at this point, the set of document passages for each query term is represented by their respective vectors h, i.e., each passage in d(u i ) is represented by a vector that encodes the local relevance captured by the m convolution kernels of size x × y, plus an additional feature corresponding to the position of the passage. the next step uses a self-attention layer [ ] to obtain an aggregated vector c ui over the passage vectors h pj for each query term u i , as defined in eq. . the weights a pj , which are computed by a feed forward network and converted to a probabilistic distribution using the softmax operation, represent the importance of each passage vector from the set d(u i ). the addition of this self-attention layer, instead of the recurrent layer present in the original architecture, allows using the attention weights, which are directly correlated with the local relevance of each passage, to identify important passages within documents. moreover, this layer has around a × m parameters, compared to up to three times more in the gru layer (approximately 3 × a × (a + m )), which in practice means reducing the overall number of network parameters to half. finally, the aggregation network combines the vectors c ui according to weights that reflect the importance of each individual query term u i . we chose to employ a weighting mechanism similar to the term gating layer in drmm [ ] , which uses the query term embedding to compute its importance, as defined in eq. . this option replaces the use of a trainable parameter for each vocabulary term, as in the original work, which is less suited for modelling a rich vocabulary as in the case of biomedical documents. the final aggregated vector c is then fed to a dense layer for computing the final ranking score. optimization. we used the pairwise hinge loss as the objective function to be minimized by the adadelta optimizer. from this perspective, the training data is viewed as a set of triples, (q, d + , d − ), composed of a query q, a positive document d + and a negative document d − . additionally, inspired by [ ] and as successfully demonstrated by [ ] , we adopted a similar negative sampling strategy, where a negative document can be drawn from the following sets: -partially irrelevant set: irrelevant documents that share some matching signals with the query.
more precisely, this corresponds to documents retrieved by the fast retrieval module that do not appear in the training data as positive examples; -completely irrelevant set: documents not in the positive training instances and not sharing any matching signal with the query. passage extraction is accomplished by looking at the attention weights of the neural ranking model. as described, the proposed neural ranking model includes two attention mechanisms. the first one computes a local passage attention with respect to each query term, a pi . the second is used to compute the importance of each query term, a u k . therefore, a global attention weight for each passage can be obtained from the product of these two terms, a g (k,i) = a u k × a pi , as shown in eq. . this section presents the system evaluation results. we used the training data from the bioasq b and b phase a challenges [ ] , containing and biomedical questions with the corresponding relevant documents, taken from the medline repository. the objective for a system is to retrieve the ten most relevant documents for each query, with the performance evaluated in terms of map@ on five test sets containing queries each. first, a study was conducted to investigate the performance of the proposed neural ranking model. after that, the full system was compared against the results of systems submitted to the bioasq and editions for the document retrieval task. finally, we investigated whether the attention given to each passage is indeed relevant. in the results, we compare two variants of deeprank: biodeeprank refers to the model with the modified aggregation network and weighting mechanism, and using word embeddings for the biomedical domain [ ] ; attn-biodeeprank refers to the final model that additionally replaces the recurrent layer by a self-attention layer.
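a small numpy sketch of the passage-extraction step described above follows: local passage attention per query term and query-term gating are combined into a global weight for each passage. the attention values below are random placeholders standing in for what the trained network would produce.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(3)
    num_query_terms, num_passages = 4, 7

    # placeholder scores standing in for the outputs of the attention layers
    passage_logits = rng.normal(size=(num_query_terms, num_passages))
    query_term_logits = rng.normal(size=num_query_terms)

    a_p = softmax(passage_logits, axis=1)  # local attention over passages, per query term
    a_u = softmax(query_term_logits)       # importance of each query term (term gating)

    # global weight of passage i w.r.t. query term k: a_g[k, i] = a_u[k] * a_p[k, i]
    a_g = a_u[:, None] * a_p

    # rank passages by their best global weight over all query terms
    best = a_g.max(axis=0)
    print(np.argsort(-best))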
interestingly, although the model is not trained with this information, the attention weights seem to focus on these relevant passages, as indicated by the results in fig. ( ) (figure caption: quality of retrieved passages as a function of the confidence attributed by the model). this paper describes a new neural ranking model based on the deeprank architecture. evaluated on a biomedical question answering task, the proposed model achieved similar performance to a range of other strong systems. we intend to further explore the proposed approach by considering semantic matching signals in the fast retrieval module, and by introducing joint learning for document and passage retrieval. the network implementation and code for reproducing these results are available at https://github.com/bioinformatics-ua/bioasq.
references:
- aueb at bioasq : document and snippet retrieval
- bert: pre-training of deep bidirectional transformers for language understanding
- a deep relevance matching model for ad-hoc retrieval
- natural language question answering: the view from here
- learning deep structured semantic models for web search using clickthrough data
- a structured self-attentive sentence embedding
- mindlab neural network approach at bioasq b
- deep relevance ranking using enhanced document-query interactions
- results of the sixth edition of the bioasq challenge
- deeprank
- the probabilistic relevance framework: bm25 and beyond
- a latent semantic model with convolutional-pooling structure for information retrieval
- an overview of the bioasq large-scale biomedical semantic indexing and question answering competition
- learning fine-grained image similarity with deep ranking
- biowordvec, improving biomedical word embeddings with subword information and mesh
- a hierarchical attention retrieval model for healthcare question answering
acknowledgments. this work was partially supported by the european regional development fund (erdf) through the compete operational programme, and by national funds through fct - foundation for science and technology, projects ptdc/eei-ess/ / and uid/cec/ /.
key: cord- - sbicp v authors: macavaney, sean; soldaini, luca; goharian, nazli title: teaching a new dog old tricks: resurrecting multilingual retrieval using zero-shot learning date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: sbicp v
while billions of non-english speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-english languages. this is primarily due to a lack of data sets that are suitable for training ranking algorithms. in this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on english collections to non-english queries and documents. our model is evaluated in a zero-shot setting, meaning that we use it to predict relevance scores for query-document pairs in languages never seen during training. our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for arabic, chinese mandarin, and spanish. we also show that augmenting the english training collection with some examples from the target language can sometimes improve performance. every day, billions of non-english speaking users [ ] interact with search engines; however, commercial retrieval systems have been traditionally tailored to english queries, causing an information access divide between those who can and those who cannot speak this language [ ].
non-english search applications have been equally under-studied by most information retrieval researchers. historically, ad-hoc retrieval systems have been primarily designed, trained, and evaluated on english corpora (e.g., [ , , , ]). more recently, a new wave of supervised state-of-the-art ranking models has been proposed by researchers [ , , , , , , ]; these models rely on neural architectures to rerank the head of search results retrieved using a traditional unsupervised ranking algorithm, such as bm25. like previous ad-hoc ranking algorithms, these methods are almost exclusively trained and evaluated on english queries and documents. the absence of rankers designed to operate on languages other than english can largely be attributed to a lack of suitable publicly available data sets. this aspect particularly limits supervised ranking methods, as they require samples for training and validation. for english, previous research relied on collections such as trec robust [ ], the - trec web track [ ], and ms marco [ ]. no datasets of similar size exist for other languages. while most recent approaches have focused on ad-hoc retrieval for english, some researchers have studied the problem of cross-lingual information retrieval. under this setting, document collections are typically in english, while queries get translated to several languages; sometimes, the opposite setup is used. throughout the years, several cross-lingual tracks were included as part of trec. three trec editions [ ] offered queries in english, german, dutch, spanish, french, and italian; for all three years, the document collection was kept in english. clef also hosted multiple cross-lingual ad-hoc retrieval tasks across several editions [ ]. early systems for these tasks leveraged dictionary and statistical translation approaches, as well as other indexing optimizations [ ]. more recently, approaches that rely on cross-lingual semantic representations (such as multilingual word embeddings) have been explored. for example, vulić and moens [ ] proposed bwesg, an algorithm to learn word embeddings on aligned documents that can be used to calculate document-query similarity. sasaki et al. [ ] leveraged a data set of wikipedia pages in multiple languages to train a learning-to-rank algorithm for japanese-english and swahili-english cross-language retrieval. litschko et al. [ ] proposed an unsupervised framework that relies on aligned word embeddings. ultimately, while related, these approaches are only beneficial to users who can understand documents in two or more languages, instead of directly tackling non-english document retrieval. a few monolingual ad-hoc data sets exist, but most are too small to train a supervised ranking method. for example, trec produced several non-english test collections: spanish [ ], chinese mandarin [ ], and arabic [ ]. other languages were explored, but the document collections are no longer available. the clef initiative includes some non-english monolingual datasets, though these are primarily focused on european languages [ ]. recently, zheng et al. [ ] introduced sogou-qcl, a large query log dataset in mandarin. such datasets are only available for languages that already have large, established search engines. inspired by the success of neural retrieval methods, this work focuses on studying the problem of monolingual ad-hoc retrieval on non-english languages using supervised neural approaches.
in particular, to circumvent the lack of training data, we leverage transfer learning techniques to train arabic, mandarin, and spanish retrieval models using english training data. in the past few years, transfer learning between languages has proven to be a remarkably effective approach for low-resource multilingual tasks (e.g. [ , , , ]). our model leverages a pre-trained multilingual transformer model to obtain an encoding for queries and documents in different languages; at train time, this encoding is used to predict relevance of query-document pairs in english. we evaluate our models in a zero-shot setting; that is, we use them to predict relevance scores for query-document pairs in languages never seen during training. by leveraging a pre-trained multilingual language model, which can be easily trained from abundant aligned [ ] or unaligned [ ] web text, we achieve competitive retrieval performance without having to rely on language-specific relevance judgements. during the peer review of this article, a preprint [ ] was published with similar observations to ours. in summary, our contributions are:
- we study zero-shot transfer learning for ir in non-english languages.
- we propose a simple yet effective technique that leverages contextualized word embeddings as a multilingual encoder for query and document terms. our approach outperforms several baselines on multiple non-english collections.
- we show that including additional in-language training samples may help further improve ranking performance.
- we release our code for pre-processing, initial retrieval, training, and evaluation of non-english datasets. we hope that this encourages others to consider cross-lingual modeling implications in future work.
zero-shot multilingual ranking. because large-scale relevance judgments are largely absent in languages other than english, we propose a new setting to evaluate learning-to-rank approaches: zero-shot cross-lingual ranking. this setting makes use of relevance data from one language that has a considerable amount of training data (e.g., english) for model training and validation, and applies the trained model to a different language for testing. more formally, let s be a collection of relevance tuples in the source language, and t be a collection of relevance judgments from another language. each relevance tuple (q, d, r) consists of a query, document, and relevance score, respectively. in typical evaluation environments, s is segmented into multiple splits for training (s_train) and testing (s_test), such that there is no overlap of queries between the two splits. a ranking algorithm is tuned on s_train to define the ranking function r_{s_train}(q, d) ∈ ℝ, which is subsequently tested on s_test. we propose instead tuning a model on all data from the source language (i.e., training r_s(·)), and testing on a collection from the second language (t); a sketch of this protocol follows below. we evaluate on monolingual newswire datasets from three languages: arabic, mandarin, and spanish. the arabic document collection contains k documents (ldc t ), and we use topics/relevance information from the - trec multilingual track ( and topics, respectively). for mandarin, we use k news articles from ldc t . mandarin topics and relevance judgments are utilized from trec and ( and topics, respectively). finally, the spanish collection contains k articles from ldc t , and we use topics from trec and ( topics each).
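the following is a minimal sketch of the zero-shot cross-lingual evaluation protocol described above: a ranker is tuned on all source-language (english) relevance tuples s and then applied unchanged to the target-language collection t. the function names (train_ranker, rerank, metric) are placeholders, not the authors' released api.

```python
# A minimal sketch of zero-shot cross-lingual evaluation: train on source-language
# relevance tuples S, apply the unchanged ranker to the target-language collection T.
# train_ranker, rerank, and metric are placeholder callables for illustration.
def zero_shot_eval(S, T, train_ranker, rerank, metric):
    """S, T: lists of (query, doc, relevance) tuples in source / target language."""
    ranker = train_ranker(S)                    # trained and validated on English only
    results = {}
    for query in {q for q, _, _ in T}:
        candidates = [(d, r) for q, d, r in T if q == query]
        ranking = rerank(ranker, query, [d for d, _ in candidates])
        results[query] = metric(ranking, dict(candidates))
    return sum(results.values()) / len(results)
```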
we use the topics, rather than the query descriptions, in all cases except trec spanish , in which only descriptions are provided. the topics more closely resemble real user queries than descriptions. we test on these collections because they are the only document collections available from trec at this time. we index the text content of each document using a modified version of anserini with support for the languages we investigate [ ]. specifically, we add anserini support for lucene's arabic and spanish light stemming and stop word lists (via spanishanalyzer and arabicanalyzer). we treat each character in mandarin text as a single token.
modeling. we explore the following ranking models:
- unsupervised baselines. we use the anserini [ ] implementation of bm25, rm3 query expansion, and the sequential dependency model (sdm) as unsupervised baselines. in the spirit of the zero-shot setting, we use the default parameters from anserini (i.e., assuming no data of the target language).
- pacrr [ ] models n-gram relationships in the text using learned 2d convolutions and max pooling atop a query-document similarity matrix.
- knrm [ ] uses learned gaussian kernel pooling functions over the query-document similarity matrix to rank documents.
- vanilla bert [ ] uses the bert [ ] transformer model, with a dense layer atop the classification token to compute a ranking score (a code sketch is given below). to support multiple languages, we use the base-multilingual-cased pretrained weights. these weights were trained on wikipedia text from 104 languages.
we use the embedding layer output from the base-multilingual-cased model for pacrr and knrm. in pilot studies, we investigated using cross-lingual muse vectors [ ] and the output representations from bert, but found the bert embeddings to be more effective.
experimental setup. we train and validate models using the trec robust collection [ ]. trec robust contains topics, k documents, and k relevance judgments in english (folds - from [ ] for training, fold for validation). thus, the model is only exposed to english text in the training and validation stages (though the embedding and contextualized language models are trained on large amounts of unlabeled text in the languages). the validation dataset is used for parameter tuning and for the selection of the optimal training epoch (via ndcg@ ). we train using pairwise softmax loss with adam [ ]. we evaluate the performance of the trained models by re-ranking the top documents retrieved with bm25. we report map, precision@ , and ndcg@ to gauge the overall performance of our approach, and the percentage of judged documents in the top ranked documents (judged@ ) to evaluate how suitable the datasets are to approaches that did not contribute to the original judgments. we present the ranking results in table . we first point out that there is considerable variability in the performance of the unsupervised baselines; in some cases, rm3 and sdm outperform bm25, whereas in other cases they underperform. similarly, the pacrr and knrm neural models also vary in effectiveness, though they more frequently perform much worse than bm25. this makes sense because these models capture matching characteristics that are specific to english. for instance, n-gram patterns captured by pacrr for english do not necessarily transfer well to languages with different constituent order, such as arabic (vso instead of svo).
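the following is a minimal sketch of a "vanilla bert"-style re-ranker over the multilingual checkpoint mentioned above, using the huggingface transformers library; the scoring head, truncation length, and loss helper are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch of a "vanilla BERT"-style re-ranker over bert-base-multilingual-cased.
# The dense head, truncation length, and loss helper are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class VanillaBertRanker(nn.Module):
    def __init__(self, name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.score = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]        # [CLS] representation
        return self.score(cls).squeeze(-1)       # one relevance score per query-doc pair

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = VanillaBertRanker()

def score_pair(query, doc, max_len=512):
    enc = tokenizer(query, doc, truncation=True, max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).item()

def pairwise_softmax_loss(pos_score, neg_score):
    # scores are 0-dim tensors produced during training (not the detached float above)
    return -torch.log_softmax(torch.stack([pos_score, neg_score]), dim=0)[0]
```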
an interesting observation is that the vanilla bert model (which, recall, is tuned only on english text) generally outperforms a variety of approaches across the three test languages. this is particularly remarkable because it is a single trained model that is effective across all three languages, without any difference in parameters. the exceptions are the arabic dataset, in which it performs only comparably to bm25, and the map results for spanish. for spanish, rm3 is able to substantially improve recall (as evidenced by map), and since vanilla bert acts as a re-ranker atop bm25, it is unable to take advantage of this improved recall, despite significantly improving the precision-focused metrics. in all cases, vanilla bert exhibits judged@ above %, indicating that these test collections are still valuable for evaluation. to test whether a small amount of in-language training data can further improve bert ranking performance, we conduct an experiment that uses the other collection for each language as additional training data. the in-language samples are interleaved into the english training samples (see the sketch below). results for this few-shot setting are shown in table . we find that the added topics for arabic (+ ) and spanish (+ ) significantly improve the performance. this results in a model significantly better than bm25 for arabic, which suggests that there may be substantial distributional differences between the english trec training and arabic test collections. we further back this up by training an "oracle" bert model (training on the test data) for arabic, which yields a substantially better model (p@ = . , ndcg@ = . , map = . ). we introduced a zero-shot multilingual setting for evaluation of neural ranking methods. this is an important setting due to the lack of training data available in many languages. we found that contextualized language models (namely, bert) hold a clear advantage, and are generally more suitable for cross-lingual transfer than prior models (which may rely more heavily on phenomena exclusive to english). we also found that additional in-language training data may improve the performance, though not necessarily. by releasing our code and models, we hope that cross-lingual evaluation will become more commonplace.
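the following is a tiny sketch of the few-shot augmentation just described, interleaving a small number of in-language training samples into the english training stream; the interleaving policy (uniform shuffling) is an assumption for illustration.

```python
# A tiny sketch of the few-shot augmentation described above: mix a small number of
# in-language (target) training samples into the English training stream.
# The shuffling policy is an illustrative assumption.
import random

def interleave(english_samples, in_language_samples, seed=0):
    """Return a single training stream mixing English and target-language samples."""
    rng = random.Random(seed)
    mixed = list(english_samples) + list(in_language_samples)
    rng.shuffle(mixed)   # simple uniform interleaving; other schedules are possible
    return mixed
```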
references:
- probabilistic models of information retrieval based on measuring the divergence from randomness
- ms marco: a human generated machine reading comprehension dataset
- clef - overview of results
- cross-language information retrieval (clir) track overview
- learning to rank: from pairwise approach to listwise approach
- a survey of automatic query expansion in information retrieval
- trec web track overview
- word translation without parallel data
- deeper text understanding for ir with contextual neural language modeling
- bert: pre-training of deep bidirectional transformers for language understanding
- overview of the fourth text retrieval conference (trec- )
- overview of the third text retrieval conference (trec- )
- pacrr: a position-aware neural ir model for relevance matching
- parameters learned in the comparison of retrieval models using term dependencies
- google's multilingual neural machine translation system: enabling zero-shot translation
- cross-lingual transfer learning for pos tagging without cross-lingual resources
- adam: a method for stochastic optimization
- cross-lingual language model pretraining
- unsupervised cross-lingual information retrieval using monolingual data only
- cedr: contextualized embeddings for document ranking
- a markov random field model for term dependencies
- an introduction to neural information retrieval
- the trec arabic/english clir track
- neural information retrieval: at the end of the early years
- multilingual information retrieval: from research to practice
- cross-lingual learning-to-rank with shared representations
- cross-lingual transfer learning for multilingual task oriented dialog
- cross-lingual relevance transfer for document retrieval
- the sixth text retrieval conference (trec- )
- overview of the trec robust retrieval track
- overview of the fifth text retrieval conference (trec- )
- monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings
- end-to-end neural ad-hoc ranking with kernel pooling
- anserini: reproducible ranking baselines using lucene
- simple applications of bert for ad hoc document retrieval
- transfer learning for sequence tagging with hierarchical recurrent networks
- the digital language divide
- the st international acm sigir conference on research & development in information retrieval
this work was supported in part by arcs foundation.
key: cord- -tbq okmj authors: batra, vishwash; haldar, aparajita; he, yulan; ferhatosmanoglu, hakan; vogiatzis, george; guha, tanaya title: variational recurrent sequence-to-sequence retrieval for stepwise illustration date: - - journal: advances in information retrieval doi: . / - - - - _ sha: doc_id: cord_uid: tbq okmj
we address and formalise the task of sequence-to-sequence (seq2seq) cross-modal retrieval. given a sequence of text passages as query, the goal is to retrieve a sequence of images that best describes and aligns with the query. this new task extends traditional cross-modal retrieval, where each image-text pair is treated independently, ignoring broader context. we propose a novel variational recurrent seq2seq (vrss) retrieval model for this seq2seq task. unlike most cross-modal methods, we generate an image vector corresponding to the latent topic obtained from combining the text semantics and context. this synthetic image embedding point associated with every text embedding point can then be employed for either image generation or image retrieval as desired.
we evaluate the model for the application of stepwise illustration of recipes, where a sequence of relevant images is retrieved to best match the steps described in the text. to this end, we build and release a new stepwise recipe dataset for research purposes, containing k recipes (sequences of image-text pairs) having a total of k image-text pairs. to our knowledge, it is the first publicly available dataset to offer rich semantic descriptions in a focused category such as food or recipes. our model is shown to outperform several competitive and relevant baselines in the experiments. we also provide a qualitative analysis of how semantically meaningful the results produced by our model are, through human evaluation and comparison with relevant existing methods. there is growing interest in cross-modal analytics and search in multimodal data repositories. a fundamental problem is to associate images with some corresponding descriptive text. such associations often rely on semantic understanding, beyond traditional similarity search or image labelling, to provide human-like visual understanding of the text and reflect abstract ideas in the image. (figure caption: stepwise recipe illustration example showing a few text recipe instruction steps alongside one full sequence of recipe images; note that retrieval of an accurate illustration of a given step depends on previously acquired context information.) cross-modal retrieval systems must return outputs of one modality from a data repository, while a different modality is used as the input query. the multimodal repository usually consists of paired objects from two modalities, but may be labelled or unlabelled. classical approaches to compare data across modalities include canonical correlation analysis [ ], partial least squares regression [ ], and their numerous variants. more recently, various deep learning models have been developed to learn shared embedding spaces from paired image-text data, either unsupervised, or supervised using image class labels. the deep models popularly used include deep belief networks [ ], correspondence autoencoders [ ], deep metric learning [ ], and convolutional neural networks (cnns) [ ]. with all these models it is expected that, by learning from pairwise aligned data, the common representation space will capture semantic similarities across modalities. most such systems, however, do not consider sequences of related data in the query or result. in traditional image retrieval using text queries, for example, each image-text pair is considered in isolation, ignoring any broader 'context'. a context-aware image-from-text retrieval model must look at pairwise associations and also consider sequential relationships. such sequence-to-sequence (seq2seq) cross-modal retrieval is possible when contextual information and semantic meaning are both encoded and used to inform the retrieval step. for stepwise recipe illustration, an effective retrieval system must identify and align a set of relevant images corresponding to each step of a given text sequence of recipe instructions. more generally, for the task of automatic story picturing, a series of suitable images must be chosen to illustrate the events and abstract concepts found in a sequential text taken from a story. an example of the instruction steps and illustrations of a recipe taken from our new stepwise recipe dataset is shown in fig. .
in this paper, we present a variational recurrent learning model to enable seq2seq retrieval, called the variational recurrent sequence-to-sequence (vrss) model. vrss produces a joint representation of the image-text repository, where the semantic associations are grounded in context by making use of the sequential nature of the data. stepwise query results are then obtained by searching this representation space. more concretely, we incorporate the global context information encoded in the entire text sequence (through the attention mechanism) into a variational autoencoder (vae) at each time step, which converts the input text into an image representation in the image embedding space. to capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior of the distribution of the topic given the text input follows the distribution conditional on the latent topic from the previous time step. by doing so, our model can naturally capture sequential semantic structure. our main contributions can be summarised below:
- we formalise the task of sequence-to-sequence (seq2seq) retrieval for stepwise illustration of text.
- we propose a new variational recurrent seq2seq (vrss) retrieval model for seq2seq retrieval, which employs temporally-dependent latent variables to capture the sequential semantic structure of text-image sequences.
- we release a new stepwise recipe dataset ( k recipes, k total image-text pairs) for research purposes, and show that vrss outperforms several cross-modal retrieval alternatives on this dataset, using various performance metrics.
our work is related to: cross-modal retrieval, story picturing, variational recurrent neural networks, and cooking recipe datasets. a number of pairwise-based methods over the years have attempted to address the cross-modal retrieval problem in different ways, such as metric learning [ ] and deep neural networks [ ]. for instance, an alignment model [ ] was devised that learns inter-modal correspondences using the ms-coco [ ] and flickr- k [ ] datasets. other work [ ] proposed unifying joint image-text embedding models with multimodal neural language models, using an encoder-decoder pipeline. a later method [ ] used hard negatives to improve their ranking loss function, which yielded significant gains in retrieval performance. such systems focus only on isolated image retrieval when given a text query, and do not address the seq2seq retrieval problem that we study here. in a slight variation [ ], the goal was to retrieve an image-text multimodal unit when given a text query. for this, they proposed a gated neural architecture to create an embedding space from the query texts and query images along with the multimodal units that form the retrieval result set, and then performed semantic matching in this space. the training minimized a structured hinge loss, and there was no sequential nature to the data used.
story picturing. an early story picturing system [ ] retrieved landscape and art images to illustrate ten short stories based on key terms in the stories and image descriptions as well as a similarity linking of images. the idea was pursued further with a system [ ] for helping people with limited literacy to read, which split a sentence into three categories and then retrieved a set of explanatory pictorial icons for each category.
to our knowledge, an application [ ] that ranks and retrieves image sequences based on longer text paragraphs as queries was the first to extend the pairwise image-text relationship to matching image sequences with longer paragraphs. they employed a structural ranking support vector machine with latent variables and used a custom-built disneyland dataset, consisting of blog posts with associated images as the parallel corpus from which to learn joint embeddings. we follow a similar approach, creating our parallel corpus from sequential stepwise cooking recipes rather than unstructured blog posts, and design an entirely new seq2seq model to learn our embeddings. the visual storytelling dataset (vist) [ ] was built with a motivation similar to our own, but for generating text descriptions of image sequences rather than the other way around. relying on human annotators to generate captions, vist contains sequential image-text pairs with a focus on abstract visual concepts, temporal event relations, and storytelling. in our work, we produce a similar sequenced dataset in a simple, automated manner. a recent joint sequence-to-sequence model [ ] learned a common image-text semantic space and generated paragraphs to describe photo streams. this bidirectional attention recurrent neural network was evaluated on both of the above datasets. despite being unsuitable for our inverse problem, vist has also been used for retrieving images when given text, in work related to ours. in an approach called coherent neural story illustration (cnsi), an encoder-decoder network [ ] was built to first encode sentences using a hierarchical two-level sentence-story gated recurrent unit (gru), and then sequentially decode into a corresponding sequence of illustrative images. a previously proposed coherence model [ ] was used to explicitly model co-references between sentences.
variational recurrent neural networks. our model is partly inspired by the variational recurrent neural network (vrnn) [ ], which introduces latent random variables into the hidden state of an rnn by combining it with a variational autoencoder (vae). they showed that, using high-level latent random variables, vrnn can model the variability observed in structured sequential data such as natural speech and handwriting. vrnn has recently been applied to other sequential modelling tasks such as machine translation [ ]. our proposed vrss model introduces temporally-dependent latent variables to capture the sequential semantic structure of text/image sequences. different from existing approaches, we take into account the global context information encoded in the entire query sequence. we use a vae for cross-modal generation by converting the text into a representation in the image embedding space instead of using it to reconstruct the text input. finally, we use the max-margin hinge loss to enforce similarity between text and paired image representations.
cooking recipe datasets. the first attempt at automatic classification of food images was the food- dataset [ ] having k images across categories. since then, the new recipe m dataset [ ] gained wide attention, which paired each recipe with several images to build a collection of m food images for m recipes. recent work [ ] proposed a cross-modal retrieval model that aligns recipe m images and recipes in a shared representation space.
as this dataset does not offer any sequential data for stepwise illustration, this association is between images of the final dish and the corresponding entire recipe text. our stepwise recipe dataset, by comparison, provides an image for each instruction step, resulting in a sequence of image-text pairs for each recipe. in [ ] they release a dataset of sequenced image-text pairs in the cooking domain, with a focus on text generation conditioned on images. recipeqa [ ] is another popular dataset, used for multimodal comprehension and reasoning, with k questions about the k recipes and illustrative images for each step of the recipes. recent work [ ] used it to analyse image-text coherence relations, thereby producing a human-annotated corpus with coherence labels to characterise different inferential relationships. the recipeqa dataset reveals associations between image-text pairs much like our stepwise recipe dataset, and we therefore utilise it to augment our own dataset. we construct the stepwise recipe dataset, composed of illustrated, step-by-step recipes from three websites. recipes were automatically web-scraped and cleaned of html tags. the information about data and scripts will be made available on github. the construction of such an image-text parallel corpus has several challenges, as highlighted in previous work [ ]. the text is often unstructured, without information about the canonical association between image-text pairs. each image is semantically associated with some portion of the text in the same recipe, and we assume that the images chosen by the author to augment the text are semantically meaningful. we thus perform text segmentation to divide the recipe text and associate segments with a single image each. we perform text-based filtering [ ] to ensure text quality (a sketch follows below): (1) descriptions should have a high unique word ratio covering various part-of-speech tags, therefore descriptions with a high noun ratio are discarded; (2) descriptions with high repetition of tokens are discarded; and (3) some predefined boilerplate prefix/suffix sequences are removed. our constructed dataset consists of about k recipes with k associated images. furthermore, we augment our parallel corpus using similarly filtered recipeqa data [ ], which contains images for each step of the recipes in addition to visual question answering data. the final dataset contains over k recipes in total and k images. the seq2seq retrieval task is formalised as follows: given a sequence of text passages, x = {x_1, x_2, ..., x_T}, retrieve a sequence of images i = {i_1, i_2, ..., i_T} (from a data repository) which best describes the semantic meanings of the text passages, i.e., maximising p(i|x), which is factorised over time steps so that the image retrieved at step t depends on the text passages and on the images associated with the previous steps. we address the seq2seq retrieval problem by considering three aspects: (1) encoding the contextual information of text passages; (2) capturing the semantics of the images retrieved (in a story/recipe); and (3) learning the relatedness between each text passage and its corresponding image. it is natural to use rnns to encode a sequence of text passages. here, we encode a text sequence using a bi-directional gru (bi-gru). given a text passage, we use the attention mechanism to capture the contextual information of the whole recipe. we map the text embedding into a latent topic z_t by using a vae. in order to capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior of the distribution of the topic given the text input follows a distribution conditional on the latent topic z_{t−1} from the previous step.
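the following is a minimal sketch of the text-quality filters described above (unique-word ratio, noun ratio, token repetition); the thresholds and the use of nltk's part-of-speech tagger are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of the text-quality filters described above (unique-word ratio,
# noun ratio, token repetition). Thresholds and the NLTK tagger are assumptions.
from collections import Counter
import nltk  # assumes: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def keep_description(text, max_noun_ratio=0.8, min_unique_ratio=0.5, max_repeat_ratio=0.3):
    tokens = nltk.word_tokenize(text.lower())
    if not tokens:
        return False
    # (1) discard descriptions dominated by nouns or with too few unique words
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    noun_ratio = sum(tag.startswith("NN") for tag in tags) / len(tags)
    unique_ratio = len(set(tokens)) / len(tokens)
    if noun_ratio > max_noun_ratio or unique_ratio < min_unique_ratio:
        return False
    # (2) discard descriptions with highly repeated tokens
    most_common_count = Counter(tokens).most_common(1)[0][1]
    if most_common_count / len(tokens) > max_repeat_ratio:
        return False
    # (3) boilerplate prefix/suffix removal would be applied separately (e.g. via regex)
    return True
```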
we decode the corresponding image vector i_t conditional on the latent topic, to learn the relatedness between text and image with a multi-layer perceptron, and obtain a synthetic image embedding point generated from its associated text embedding point. our proposed variational recurrent seq2seq (vrss) model is illustrated in fig. . below, we describe each of the main components of the vrss model. we use a bi-gru to learn the hidden representations of the text passage (e.g. one recipe instruction) in the forward and backward directions. the two learned hidden states are then concatenated to form the text segment representation. to encode a sequence of such text passages (e.g. one recipe), a hierarchical bi-gru is used, which first encodes each text segment and subsequently combines them.
image encoder. to generate the vector representation of an image, we use the pre-trained modified resnet cnn [ ]. in experiments, this model produced a well-distributed feature space when trained on the limited domain, namely food-related images. this was verified using t-sne visualisations [ ], which showed less clustering in the generated embedding space as compared to embeddings obtained from models pre-trained on imagenet [ ]. to capture global context, we feed the bi-gru encodings into a top-level bi-gru. assuming the hidden state output of each text passage x_l in the global context is h_l^c, we use an attention mechanism to capture its similarity with the hidden state output of the t-th text passage h_t, giving attention weights α_l. the context vector is then encoded as the combination of the l text passages weighted by these attentions, c_t = Σ_{l=1}^{l} α_l h_l^c. this ensures that any given text passage is influenced more by others that are semantically similar. at the t-th step of the text sequence, for text x_t, the bi-gru output h_t is combined with the context c_t and fed into a vae to generate the latent topic z_t. two prior networks f_θ^μ and f_θ^Σ define the prior distribution of z_t conditional on the previous z_{t−1}. we also define two inference networks f_φ^μ and f_φ^Σ, which are functions of h_t, c_t, and z_{t−1}. unlike the typical vae setup, where the text input x_t is reconstructed by the generation networks, here we generate the corresponding image vector i_t. to generate the image vector conditional on z_t, generation networks are defined which are also conditional on z_{t−1}. the generation loss for image i_t is then: −kl(q(z_t | x_{≤t}, z
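the following is a minimal pytorch sketch of one vrss time step as described above: attention over the recipe-level context, a prior on z_t conditioned on z_{t−1}, a posterior conditioned on (h_t, c_t, z_{t−1}), reparameterisation, and an mlp that decodes a synthetic image vector. layer sizes, the simple dot-product attention, and module names are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal PyTorch sketch of one VRSS time step: context attention, a prior on z_t
# conditioned on z_{t-1}, a posterior conditioned on (h_t, c_t, z_{t-1}), and an MLP
# decoding a synthetic image vector. Layer sizes and names are assumptions.
import torch
import torch.nn as nn

class VRSSStep(nn.Module):
    def __init__(self, h_dim, z_dim, img_dim):
        super().__init__()
        self.prior = nn.Linear(z_dim, 2 * z_dim)                   # f_theta: z_{t-1} -> (mu, logvar)
        self.posterior = nn.Linear(2 * h_dim + z_dim, 2 * z_dim)   # f_phi: (h_t, c_t, z_{t-1})
        self.decode = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                    nn.Linear(h_dim, img_dim))     # z_t -> image vector

    def forward(self, h_t, context_states, z_prev):
        # h_t: (h_dim,); context_states: (L, h_dim); z_prev: (z_dim,)
        alpha = torch.softmax(context_states @ h_t, dim=0)          # attention over L passages
        c_t = (alpha.unsqueeze(1) * context_states).sum(dim=0)      # context vector
        p_mu, p_logvar = self.prior(z_prev).chunk(2, dim=-1)
        q_mu, q_logvar = self.posterior(torch.cat([h_t, c_t, z_prev], dim=-1)).chunk(2, dim=-1)
        z_t = q_mu + torch.randn_like(q_mu) * torch.exp(0.5 * q_logvar)  # reparameterisation
        img_vec = self.decode(z_t)                                   # synthetic image embedding
        # KL(q || p) between two diagonal Gaussians, used in the generation loss
        kl = 0.5 * (p_logvar - q_logvar
                    + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1).sum()
        return img_vec, z_t, kl, alpha
```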