Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'. The Distant Reader harvested & cached your content into a collection/corpus. It then applied sets of natural language processing and text mining against the collection. The results of this process were reduced to a database file -- a 'study carrel'. The study carrel can then be queried, thus bringing to light specific characteristics of your collection (a small example of such a query appears after the frequency tables below). These characteristics can help you summarize the collection as well as enumerate things you might want to investigate more closely. This report is terse; when processing is complete you will be linked to a more complete narrative report. Eric Lease Morgan

Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
43

Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
3339

Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
55

Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
4 user; 4 image; 3 task; 3 query; 3 document; 2 word; 2 review; 2 model; 2 graph; 2 dataset; 2 claim; 2 Lucene; 2 BM25; 1 view; 1 tweet; 1 topic; 1 text; 1 term; 1 system; 1 symptom; 1 session; 1 sentence; 1 seed; 1 schema; 1 recommendation; 1 ranker; 1 question; 1 product; 1 premise; 1 patent; 1 passage; 1 ontology; 1 node; 1 network; 1 location; 1 list; 1 lexicon; 1 language; 1 label; 1 item; 1 irony; 1 feature; 1 entity; 1 english; 1 embedding; 1 early; 1 domain; 1 dmp; 1 disease; 1 damage

Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
961 model; 762 query; 721 document; 559 task; 522 word; 512 user; 469 information; 465 text; 436 result; 435 dataset; 418 image; 383 approach; 381 term; 354 retrieval; 350 representation; 345 embedding; 339 network; 338 method; 329 system; 327 review; 323 datum; 310 language; 282 feature; 273 score; 271 set; 252 graph; 242 work; 234 search; 227 performance; 226 attention; 218 number; 214 node; 208 context; 200 evaluation; 199 learning; 197 domain; 195 analysis; 193 topic; 192 training; 188 vector; 185 sentence; 180 function; 175 label; 174 time; 174 similarity; 166 claim; 165 relevance; 157 product; 157 premise; 155 sentiment

Top 50 proper nouns; "What are the names of persons or places?"
---------------------------------------------------------------
161 al; 134 q; 132 et; 118 IR; 103 Sect; 94 BM25; 86 Table; 84 Fig; 77 Eq; 68 j; 68 Retrieval; 65 S; 63 Lucene; 61 u; 60 Information; 60 English; 56 k; 54 t; 52 K; 51 Twitter; 51 COLTR; 49 D; 48 T; 47 Bantu; 45 .; 42 d; 42 BERT; 41 s; 41 i; 41 DOI; 39 TREC; 39 Neural; 37 m; 37 TransRev; 36 sha; 33 c; 33 Task; 32 M; 31 L; 30 eRisk; 30 C; 29 VRSS; 29 F; 28 y; 28 w; 28 BC; 27 LSTM; 27 CNN; 27 A; 26 Wikipedia

Top 50 personal pronouns; "To whom are things referred?"
--------------------------------------------------------
1842 we; 428 it; 214 they; 207 i; 96 them; 22 one; 20 us; 12 you; 9 he; 6 itself; 4 u; 4 ours; 4 me; 3 she; 3 s; 3 ourselves; 2 themselves; 2 ndcg@10; 2 him; 2 's; 1 Π; 1 her; 1 f

Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
4101 be; 1083 use; 695 have; 466 base; 347 learn; 304 show; 300 propose; 227 do; 219 consider; 207 provide; 195 give; 173 follow; 170 generate; 166 make; 163 train; 160 include; 145 rank; 138 evaluate; 135 compare; 134 set; 131 find; 127 compute; 124 contain; 120 embed; 117 define; 116 describe; 115 perform; 114 obtain; 109 support; 107 retrieve; 107 represent; 106 identify; 103 see; 100 introduce; 94 predict; 93 improve; 92 relate; 92 present; 92 exist; 91 take; 91 apply; 89 extract; 87 select; 87 combine; 82 focus; 80 capture; 79 report; 78 require; 77 outperform; 77 need

Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
482 not; 403 -; 282 different; 282 also; 269 more; 244 such; 224 other; 222 only; 206 well; 199 first; 197 same; 193 neural; 168 semantic; 161 then; 160 large; 152 new; 151 however; 144 similar; 135 e.g.; 134 relevant; 131 most; 130 good; 128 deep; 117 previous; 116 high; 112 social; 109 therefore; 104 specific; 104 available; 103 non; 101 multi; 97 as; 96 long; 95 many; 92 single; 83 modal; 83 cross; 82 online; 81 original; 79 second; 78 several; 78 common; 75 multiple; 74 important; 74 further; 73 simple; 71 local; 70 very; 70 standard; 70 small

Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
------------------------------------------------------------------------------------
84 good; 48 most; 19 least; 13 wide; 11 high; 10 Most; 7 near; 6 large; 6 bad; 4 long; 3 small; 3 old; 3 low; 3 late; 3 close; 2 big; 1 weak; 1 topmost; 1 strong; 1 slow; 1 slight; 1 simple; 1 short; 1 hard; 1 great; 1 fast; 1 easy; 1 early; 1 Least; 1 ImageNet; 1 -which; 1 -there; 1 -d

Top 50 lemmatized superlative adverbs; "How are actions described to the extreme?"
----------------------------------------------------------------------------------
83 most; 15 least; 8 well; 1 widest

Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
7 github.com; 2 neural-ir-explorer.ec.tuwien.ac.at; 1 www.gurobi.com; 1 slidewiki.org; 1 ielab

Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
2 http://neural-ir-explorer.ec.tuwien.ac.at/; 2 http://github.com/david-morris/SlideImages/; 2 http://github.com/bioinformatics-ua/BioASQ; 1 http://www.gurobi.com/; 1 http://slidewiki.org/; 1 http://ielab; 1 http://github.com/ly233/Seed-Guided-Topic-Model; 1 http://github.com/Valentyn1997/kg-alignment-lessons-learned; 1 http://github.com/MaziarMF/deep-k-means

Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
(none found)
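Every number and list above is derived from the study carrel's database file, and the carrel can be queried directly. Below is a minimal sketch of such a query in Python, assuming the carrel ships as an SQLite file; the path ('etc/reader.db') and the table and column names ('items', 'words', 'keywords') are illustrative assumptions, not the Distant Reader's documented schema.

    # Minimal sketch of querying a study carrel's database file.
    # CAUTION: the path, tables, and columns below are assumptions for
    # illustration; consult the carrel's actual schema before running.
    import sqlite3

    connection = sqlite3.connect("etc/reader.db")
    cursor = connection.cursor()

    # How big is my corpus, and how long is the average item?
    cursor.execute("SELECT COUNT(*), AVG(words) FROM items")
    count, average_words = cursor.fetchone()
    print(f"{count} items; {average_words:.0f} words on average")

    # What is my collection about? Re-create the keyword frequencies.
    cursor.execute("""
        SELECT keyword, COUNT(*) AS frequency
        FROM keywords
        GROUP BY keyword
        ORDER BY frequency DESC
        LIMIT 50""")
    for keyword, frequency in cursor.fetchall():
        print(frequency, keyword)

    connection.close()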
Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
--------------------------------------------------------------------------------
6 document is relevant; 3 approaches do not; 3 data using t; 3 graph embedding methods; 3 image does not; 3 reviews are not; 3 word embedding models; 3 word embedding vectors; 2 approach does not; 2 approach is better; 2 approaches are not; 2 data are available; 2 dataset containing text; 2 dataset is less; 2 embedding is then; 2 image embedding space; 2 images retrieved so; 2 information is not; 2 methods are not; 2 model is not; 2 model was able; 2 results include articles; 2 retrieval using block; 2 retrieval using monolingual; 2 users are more; 2 users do not; 2 word embedding techniques; 2 work is not; 2 work was partially; 1 approach includes verbs; 1 approach is also; 1 approach is beneficial; 1 approach is different; 1 approach is domain; 1 approach is not; 1 approach is semi; 1 approach provided results; 1 approach provides users; 1 approach use l; 1 approach was not; 1 approaches are almost; 1 approaches are common; 1 approaches are incomparable; 1 approaches are mkl; 1 approaches are often; 1 approaches are only; 1 approaches have also; 1 approaches is also; 1 approaches is lower; 1 approaches perform better

Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
---------------------------------------------------------------------------------------
3 reviews are not available; 2 information is not available; 1 approach does not explicitly; 1 approach is not only; 1 approach was not scalable; 1 approaches are not antagonist; 1 approaches are not probabilistic; 1 data is not available; 1 dataset is not publicly; 1 documents are not available; 1 documents containing no seed; 1 methods are not yet; 1 model was not fine; 1 queries have no lemmas; 1 result was not strictly; 1 review is not available; 1 set does not necessarily; 1 system is not able; 1 systems are not able; 1 systems is not directly; 1 text is not available; 1 users do not accurately; 1 users do not simply; 1 users were not aware; 1 work is not possible

A rudimentary bibliography
--------------------------

id = cord-020896-yrocw53j
author = Agarwal, Mansi
title = MEMIS: Multimodal Emergency Management Information System
date = 2020-03-17
keywords = damage; system; tweet
summary = We present MEMIS, a system that can be used in emergencies like disasters to identify and analyze the damage indicated by user-generated multimodal social media posts, thereby helping the disaster management groups in making informed decisions. To this end, we propose MEMIS, a multimodal system capable of extracting information from social media, one that employs both images and text for identifying damage and its severity in real time (refer Sect. ). Therefore, we effectively have three models for each modality: the first for filtering the informative tweets, then one for those pertaining to the infrastructural damage (or any other category related to the relief group), and finally one for assessing the severity of the damage present. Similarly, if at least one of the text and image modalities predicts an informative tweet as containing infrastructural damage, the tweet undergoes severity analysis. Here, we use attention fusion to combine the feature interpretations from the text and image modalities for the severity analysis module [12, 26].
doi = 10.1007/978-3-030-45439-5_32
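The MEMIS entry above mentions attention fusion of text and image features for severity analysis. The following is a minimal numpy sketch of one common form of attention fusion (softmax-weighted mixing of modality vectors); the dimensions and the scoring vector are illustrative placeholders, not the paper's trained architecture.

    # Minimal sketch of attention fusion over two modality feature vectors.
    # The "learned" parameters are random placeholders; in a MEMIS-style
    # system they would be trained end-to-end with the damage classifier.
    import numpy as np

    rng = np.random.default_rng(0)
    text_features = rng.normal(size=128)   # e.g., from a text encoder
    image_features = rng.normal(size=128)  # e.g., from an image CNN

    # One scalar attention score per modality, via a shared scoring vector.
    scoring_vector = rng.normal(size=128)  # placeholder for a learned weight
    scores = np.array([scoring_vector @ text_features,
                       scoring_vector @ image_features])
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over modalities

    # Fused representation: attention-weighted sum of the modality features.
    fused = weights[0] * text_features + weights[1] * image_features
    print("modality weights:", weights.round(3), "fused shape:", fused.shape)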
id = cord-020843-cq4lbd0l
author = Almeida, Tiago
title = Calling Attention to Passages for Biomedical Question Answering
date = 2020-03-24
keywords = document; passage
summary = This paper presents a pipeline for document and passage retrieval for biomedical question answering built around a new variant of the DeepRank network model in which the recursive layer is replaced by a self-attention layer combined with a weighting mechanism. On the other hand, models such as the Deep Relevance Matching Model (DRMM) [3] or DeepRank [10] follow an interaction-based approach, in which matching signals between query and document are captured and used by the neural network to produce a ranking score. The main contribution of this work is a new variant of the DeepRank neural network architecture in which the recursive layer originally included in the final aggregation step is replaced by a self-attention layer followed by a weighting mechanism similar to the term gating layer of the DRMM. The proposed model was evaluated on the BioASQ dataset, as part of a document and passage (snippet) retrieval pipeline for biomedical question answering, achieving similar retrieval performance when compared to more complex network architectures.
doi = 10.1007/978-3-030-45442-5_9
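For reference, the self-attention layer named in the Almeida entry follows the standard scaled dot-product formulation, Attention(Q, K, V) = softmax(QK^T / sqrt(d))V. A generic numpy sketch of that computation (random placeholder projections, not the paper's exact layer):

    # Generic scaled dot-product self-attention over a sequence of vectors.
    # Projection matrices are random placeholders for learned weights.
    import numpy as np

    rng = np.random.default_rng(1)
    d = 64
    sequence = rng.normal(size=(10, d))  # e.g., 10 matching-signal vectors

    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = sequence @ W_q, sequence @ W_k, sequence @ W_v

    logits = Q @ K.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax

    attended = weights @ V  # each position is a weighted mix of all positions
    print(attended.shape)   # (10, 64)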
id = cord-020880-m7d4e0eh
author = Barrón-Cedeño, Alberto
title = CheckThat! at CLEF 2020: Enabling the Automatic Identification and Verification of Claims in Social Media
date = 2020-03-24
keywords = claim; task
summary = Task 3 asks to retrieve text snippets from a given set of Web pages that would be useful for verifying a target tweet's claim. Finally, the lab offers a fifth task that asks to predict the check-worthiness of the claims made in English political debates and speeches. Task 3 is defined as follows: Given a check-worthy claim on a specific topic and a set of text snippets extracted from potentially-relevant webpages, return a ranked list of all evidence snippets for the claim. Once we acquire annotations for Task 1, we share with participants the Web pages and text snippets from them solely for the check-worthy claims, which would enable the start of the evaluation cycle for Task 3. Task 4 is defined as follows: Given a check-worthy claim on a specific topic and a set of potentially-relevant Web pages, predict the veracity of the claim.
doi = 10.1007/978-3-030-45442-5_65

id = cord-020912-tbq7okmj
author = Batra, Vishwash
title = Variational Recurrent Sequence-to-Sequence Retrieval for Stepwise Illustration
date = 2020-03-17
keywords = VRSS; image; text
summary = We evaluate the model for the application of stepwise illustration of recipes, where a sequence of relevant images is retrieved to best match the steps described in the text. More concretely, we incorporate the global context information encoded in the entire text sequence (through the attention mechanism) into a variational autoencoder (VAE) at each time step, which converts the input text into an image representation in the image embedding space. To capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior of the distribution of the topic given the text input follows the distribution conditional on the latent topic from the previous time step. We propose a new variational recurrent seq2seq (VRSS) retrieval model for seq2seq retrieval, which employs temporally-dependent latent variables to capture the sequential semantic structure of text-image sequences. Our work is related to: cross-modal retrieval, story picturing, variational recurrent neural networks, and cooking recipe datasets.
doi = 10.1007/978-3-030-45439-5_4

id = cord-020814-1ty7wzlv
author = Berrendorf, Max
title = Knowledge Graph Entity Alignment with Graph Convolutional Networks: Lessons Learned
date = 2020-03-24
keywords = GCN; entity
summary = In this work, we focus on the problem of entity alignment in Knowledge Graphs (KG) and we report on our experiences when applying a Graph Convolutional Network (GCN) based model for this task. Graph Convolutional Networks (GCN) [7, 9], which have recently become increasingly popular, are at the core of state-of-the-art methods for entity alignment in KGs [3, 6, 22, 24, 27]. We investigate the reproducibility of the published results of a recent GCN-based method for entity alignment and uncover differences between the method's description in the paper and the authors' implementation. Overview of used datasets with their sizes in the number of triples (edges), entities (nodes), relations (different edge types) and alignments. GCN-Align [22] is a GCN-based approach to embed all entities from both graphs into a common embedding space. Semi-supervised entity alignment via knowledge graph embedding with awareness of degree difference; entity alignment between knowledge graphs using attribute embeddings.
doi = 10.1007/978-3-030-45442-5_1
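The matching step behind a GCN-Align-style method, once both graphs are embedded into a common space, reduces to nearest-neighbour search. A minimal sketch of that step, with random placeholder embeddings standing in for the GCN output:

    # Minimal sketch of the alignment step on top of precomputed entity
    # embeddings from two knowledge graphs (random placeholders here).
    import numpy as np

    rng = np.random.default_rng(2)
    kg1 = rng.normal(size=(5, 32))  # entities of graph 1, embedded
    kg2 = rng.normal(size=(7, 32))  # entities of graph 2, embedded

    # Cosine similarity between every cross-graph pair of entities.
    kg1_norm = kg1 / np.linalg.norm(kg1, axis=1, keepdims=True)
    kg2_norm = kg2 / np.linalg.norm(kg2, axis=1, keepdims=True)
    similarity = kg1_norm @ kg2_norm.T  # shape (5, 7)

    # Align each graph-1 entity with its nearest graph-2 neighbour.
    for e1, e2 in enumerate(similarity.argmax(axis=1)):
        print(f"kg1 entity {e1} -> kg2 entity {e2} "
              f"(cosine {similarity[e1, e2]:.3f})")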
id = cord-020890-aw465igx
author = Brochier, Robin
title = Inductive Document Network Embedding with Topic-Word Attention
date = 2020-03-17
keywords = document; topic; word
summary =
doi = 10.1007/978-3-030-45439-5_22

id = cord-020808-wpso3jug
author = Cardoso, João
title = Machine-Actionable Data Management Plans: A Knowledge Retrieval Approach to Automate the Assessment of Funders' Requirements
date = 2020-03-24
keywords = dmp; ontology
summary = In order to guide researchers through the process of managing their data, many funding agencies (e.g., the National Science Foundation (NSF), the European Commission (EC), or the Fundação para a Ciência e Tecnologia (FCT)) have created and published their own open access policies, as well as requiring that any grant proposals be accompanied by a Data Management Plan (DMP). The DMP is a document describing the techniques, methods and policies on how data from a research project is to be created or collected, documented, accessed, preserved and disseminated. The second part comprises the execution of the following four tasks and results in both the collection of the necessary mappings between the ontology and the identified DMP templates, and the creation of DL queries based on the funders' requirements. The DMP Common Standard Ontology (DCSO) was created with the objective of providing an implementation of the DMP Common Standards model expressed through the usage of semantic technology, which has been considered a possible solution in the data management and preservation domains [9].
doi = 10.1007/978-3-030-45442-5_15

id = cord-020908-oe77eupc
author = Chen, Zhiyu
title = Leveraging Schema Labels to Enhance Dataset Search
date = 2020-03-17
keywords = dataset; label; schema
summary =
doi = 10.1007/978-3-030-45439-5_18

id = cord-020899-d6r4fr9r
author = Doinychko, Anastasiia
title = Biconditional Generative Adversarial Networks for Multiview Learning with Missing Views
date = 2020-03-17
keywords = Cond; view
summary = In this paper, we present a conditional GAN with two generators and a common discriminator for multiview learning problems where observations have two views, but one of them may be missing for some of the training samples. We address the problem of multiview learning with Generative Adversarial Networks (GANs) in the case where some observations may have missing views without there being an external resource to complete them. We demonstrate that generated views allow us to achieve state-of-the-art results on a subset of Reuters RCV1/RCV2 collections compared to multiview approaches that rely on Machine Translation (MT) for translating documents, before training the models, into languages in which their versions do not exist. We achieve state-of-the-art performance compared to multiview approaches that rely on external view-generating functions on multilingual document classification, which is another challenging application beyond image analysis, the domain of choice for the design of new GAN models.
doi = 10.1007/978-3-030-45439-5_53

id = cord-020914-7p37m92a
author = Dumani, Lorik
title = A Framework for Argument Retrieval: Ranking Argument Clusters by Frequency and Specificity
date = 2020-03-17
keywords = claim; premise
summary = From an information retrieval perspective, an interesting task within this setting is finding the best supporting (pro) and attacking (con) premises for a given query claim from a large corpus of arguments [31]. Given a user's keyword query, the system retrieves, ranks, and presents premises supporting and attacking the query, taking into account the similarity of the query with the premise, its corresponding claim, and other contextual information. We assume that we work with a large corpus of argumentative text, for example collections of political speeches or forum discussions, that has already been mined and transferred into claims with the corresponding premises and stances. We consider the following problem: Given a controversial claim or topic, for example "We should abandon fossil fuels", a user searches for the most important premises from the corpus supporting or attacking it.
doi = 10.1007/978-3-030-45439-5_29

id = cord-020916-ds0cf78u
author = Fard, Mazar Moradi
title = Seed-Guided Deep Document Clustering
date = 2020-03-17
keywords = SD2C; seed; word
summary = The main contributions of this study can be summarized as follows: (a) we introduce the Seed-guided Deep Document Clustering (SD2C) framework, the first attempt, to the best of our knowledge, to constrain clustering with seed words based on a deep clustering approach; and (b) we validate this framework through experiments based on automatically selected seed words on five publicly available text datasets with various sizes and characteristics.
The constrained clustering problem we are addressing in fact bears strong similarity to that of seed-guided dataless text classification, which consists in categorizing documents based on a small set of seed words describing the classes/clusters. This can be done by enforcing that seed words have more influence either on the learned document embeddings, a solution we refer to as SD2C-Doc, or on the cluster representatives, a solution we refer to as SD2C-Rep. Note that the second solution can only be used when the clustering process is based on cluster representatives (i.e., $R = \{r_k\}_{k=1}^{K}$ with $K$ the number of clusters), which is indeed the case for most current deep clustering methods [1].
doi = 10.1007/978-3-030-45439-5_1

id = cord-020888-ov2lzus4
author = Formal, Thibault
title = Learning to Rank Images with Cross-Modal Graph Convolutions
date = 2020-03-17
keywords = PRF; image; model
summary = While most of the current approaches for cross-modal retrieval revolve around learning how to represent text and images in a shared latent space, we take a different direction: we propose to generalize the cross-modal relevance feedback mechanism, a simple yet effective unsupervised method that relies on standard information retrieval heuristics and the choice of a few hyper-parameters. The model can be understood very simply: similarly to PRF methods in standard information retrieval, the goal is to boost images that are visually similar to top images (from a text point of view), i.e. images that are likely to be relevant to the query but were initially badly ranked (which is likely to happen in the web scenario, where text is crawled from the source page and can be very noisy).
doi = 10.1007/978-3-030-45439-5_39
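The cross-modal relevance feedback mechanism described in the Formal entry can be sketched in a few lines: mix each image's initial text-based score with its visual similarity to the current top-ranked images. The mixing weight, feedback depth, and input matrices below are illustrative assumptions.

    # Minimal sketch of cross-modal pseudo-relevance feedback re-ranking.
    import numpy as np

    rng = np.random.default_rng(3)
    n_images = 100
    text_scores = rng.random(n_images)             # initial text-based scores
    visual_sim = rng.random((n_images, n_images))  # pairwise visual similarity
    visual_sim = (visual_sim + visual_sim.T) / 2   # make it symmetric

    k, alpha = 10, 0.7                    # feedback depth and mixing weight
    top_k = np.argsort(-text_scores)[:k]  # current top images by text score

    # Feedback score: average visual similarity to the top-k images.
    feedback = visual_sim[:, top_k].mean(axis=1)
    final_scores = alpha * text_scores + (1 - alpha) * feedback

    print("new top 5 image ids:", np.argsort(-final_scores)[:5])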
id = cord-020901-aew8xr6n
author = García-Durán, Alberto
title = TransRev: Modeling Reviews as Translations from Users to Items
date = 2020-03-17
keywords = embedding; review; user
summary =
doi = 10.1007/978-3-030-45439-5_16

id = cord-020830-97xmu329
author = Ghanem, Bilal
title = Irony Detection in a Multilingual Context
date = 2020-03-24
keywords = arabic; irony
summary = We show that these monolingual models trained separately on different languages using multilingual word representation or text-based features can open the door to irony detection in languages that lack annotated data for irony. We aim here to bridge the gap by tackling ID in tweets from both multilingual (French, English and Arabic) and multicultural perspectives (Indo-European languages whose speakers share quite the same cultural background vs. ). We can justify that by the fact that the language of the Arabic and French tweets is quite informal and has many dialect words that may not exist in the pretrained embeddings we used, compared to the English ones (lower embedding coverage ratio), which makes it harder for the CNN to learn a clear semantic pattern. The CNN architecture trained on cross-lingual word representation shows that irony has a certain similarity between the languages we targeted despite the cultural differences, which confirms that irony is a universal phenomenon, as already shown in previous linguistic studies [9, 24, 35].
doi = 10.1007/978-3-030-45442-5_18

id = cord-020834-ch0fg9rp
author = Grand, Adrien
title = From MAXSCORE to Block-Max Wand: The Story of How Lucene Significantly Improved Query Evaluation Performance
date = 2020-03-24
keywords = Lucene; Suel
summary = We share the story of how an innovation that originated from academia (block-max indexes and the corresponding block-max Wand query evaluation algorithm of Ding and Suel [6]) made its way into the open-source Lucene search library. We see this paper as having two main contributions beyond providing a narrative of events: first, we report results of experiments that attempt to match the original conditions of Ding and Suel [6] and present additional results on a number of standard academic IR test collections. Support for block-max indexes was the final feature that was implemented, based on the developers' reading of the paper by Ding and Suel [6], which required invasive changes to Lucene's index format. The story of block-max Wand in Lucene provides a case study of how an innovation that originated in academia made its way into the world's most widely-used search library and achieved significant impact in the "real world" through hundreds of production deployments worldwide (if we consider the broader Lucene ecosystem, which includes systems such as Elasticsearch and Solr).
doi = 10.1007/978-3-030-45442-5_3

id = cord-020835-n9v5ln2i
author = Jangra, Anubhav
title = Text-Image-Video Summary Generation Using Joint Integer Linear Programming
date = 2020-03-24
keywords = ILP; image
summary =
doi = 10.1007/978-3-030-45442-5_24

id = cord-020815-j9eboa94
author = Kamphuis, Chris
title = Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants
date = 2020-03-24
keywords = BM25; Lucene
summary = Experiments on three newswire collections show that there are no significant effectiveness differences between them, including Lucene's often maligned approximation of document length. Although learning-to-rank approaches and neural ranking models are widely used today, they are typically deployed as part of a multi-stage reranking architecture, over candidate documents supplied by a simple term-matching method using traditional inverted indexes [1]. Our goal is a large-scale reproducibility study to explore the nuances of different variants of BM25 and their impact on retrieval effectiveness. Their findings are confirmed: effectiveness differences in IR experiments are unlikely to be the result of the choice of BM25 variant a system implements. We implemented a variant that uses exact document lengths, but is otherwise identical to the Lucene default. Storing exact document lengths would allow different ranking functions to be swapped at query time more easily, as no information would be discarded at index time.
doi = 10.1007/978-3-030-45442-5_4
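For orientation, the variants compared by Kamphuis et al. all share the textbook BM25 skeleton and differ in details such as the idf formulation and how document length is stored. A sketch of one common per-term scoring function (a Lucene-style idf with +1 inside the log; the parameters and statistics are toy values):

    # Sketch of a textbook BM25 per-term score. BM25 variants (including
    # Lucene's) differ mainly in the idf form and document-length handling,
    # which is exactly what the reproducibility study compares.
    import math

    def bm25_term(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # Lucene-style
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return idf * norm

    # Toy statistics: a term occurring 3 times in a 150-word document, in a
    # 10,000-document collection where 200 documents contain the term.
    score = bm25_term(tf=3, doc_len=150, avg_doc_len=250, df=200, n_docs=10_000)
    print(f"BM25 contribution of this term: {score:.3f}")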
id = cord-020806-lof49r72
author = Landin, Alfonso
title = Novel and Diverse Recommendations by Leveraging Linear Models with User and Item Embeddings
date = 2020-03-24
keywords = item; recommendation
summary = In this paper, we present EER, a linear model for the top-N recommendation task, which takes advantage of user and item embeddings to improve novelty and diversity without harming accuracy. In this paper, we propose a method to augment an existing linear recommendation model to make more diverse and novel recommendations, while maintaining similar accuracy. Experiments conducted on three datasets show that our proposal outperforms the original model in both novelty and diversity while maintaining similar levels of accuracy. On the other hand, as the results in Table 3 show, ELP is able to provide good figures in novelty and diversity, thanks to the embedding model capturing non-linear relations between users and items. It is common in the field of recommender systems for methods with lower accuracy to have higher values in diversity and novelty. FISM: factored item similarity models for top-n recommender systems.
doi = 10.1007/978-3-030-45442-5_27

id = cord-020794-d3oru1w5
author = Leekha, Maitree
title = A Multi-task Approach to Open Domain Suggestion Mining Using Language Model for Text Over-Sampling
date = 2020-03-24
keywords = LMOTE
summary = In this work, we introduce a novel over-sampling technique to address the problem of class imbalance, and propose a multi-task deep learning approach for mining suggestions from multiple domains. Experimental results on a publicly available dataset show that our over-sampling technique, coupled with the multi-task framework, outperforms state-of-the-art open domain suggestion mining models in terms of the F-1 measure and AUC. In our study, we generate synthetic positive reviews until the number of suggestion and non-suggestion class samples becomes equal in the training set. All comparisons have been made in terms of the F-1 score of the suggestion class for a fair comparison with prior work on representational learning for open domain suggestion mining [5] (see Baseline in Table 3). In this work, we proposed a multi-task learning framework for open domain suggestion mining along with a novel language-model-based over-sampling technique for text (LMOTE).
doi = 10.1007/978-3-030-45442-5_28

id = cord-020851-hf5c0i9z
author = Losada, David E.
title = eRisk 2020: Self-harm and Depression Challenges
date = 2020-03-24
keywords = early; task
summary =
doi = 10.1007/978-3-030-45442-5_72

id = cord-020801-3sbicp3v
author = MacAvaney, Sean
title = Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-Shot Learning
date = 2020-03-24
keywords = TREC; english
summary = In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our models are evaluated in a zero-shot setting, meaning that we use them to predict relevance scores for query-document pairs in languages never seen during training. [28] leveraged a data set of Wikipedia pages in 25 languages to train a learning-to-rank algorithm for Japanese-English and Swahili-English cross-language retrieval. In particular, to circumvent the lack of training data, we leverage transfer learning techniques to train Arabic, Mandarin, and Spanish retrieval models using English training data. Because large-scale relevance judgments are largely absent in languages other than English, we propose a new setting to evaluate learning-to-rank approaches: zero-shot cross-lingual ranking.
doi = 10.1007/978-3-030-45442-5_31

id = cord-020931-fymgnv1g
author = Meng, Changping
title = ReadNet: A Hierarchical Transformer Framework for Web Article Readability Analysis
date = 2020-03-17
keywords = English; feature; sentence
summary =
doi = 10.1007/978-3-030-45439-5_3

id = cord-020904-x3o3a45b
author = Montazeralghaem, Ali
title = Relevance Ranking Based on Query-Aware Context Analysis
date = 2020-03-17
keywords = query; term
summary = The primary goal of the proposed model is to combine the exact and semantic matching between query and document terms, which has been shown to produce effective performance in information retrieval. In basic retrieval models such as BM25 [30] and the language modeling framework [29], the relevance score of a document is estimated based on explicit matching of query and document terms. Finally, our proposed model for relevance ranking provides the basis for natural integration of semantic term matching and local document context analysis into any retrieval model. [13] proposed a generalized estimate of document language models using a noisy channel, which captures semantic term similarities computed using word embeddings. Note that in this experiment, we only consider methods that select expansion terms based on word embeddings and not other information sources such as the top retrieved documents for each query (PRF).
doi = 10.1007/978-3-030-45439-5_30
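The combination of exact and semantic matching described in the Montazeralghaem entry can be illustrated with a naive interpolation; the embeddings and the mixing weight below are placeholders, and the paper's actual model is considerably more involved.

    # Minimal sketch of mixing exact and semantic query-document matching.
    import numpy as np

    rng = np.random.default_rng(4)
    vocab = ["retrieval", "search", "ranking", "document"]
    embeddings = {w: rng.normal(size=16) for w in vocab}  # placeholders

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def score(query_terms, doc_terms, mix=0.6):
        # Exact matching: fraction of query terms literally in the document.
        exact = sum(t in doc_terms for t in query_terms) / len(query_terms)
        # Semantic matching: best embedding similarity per query term.
        semantic = np.mean([max(cosine(embeddings[q], embeddings[d])
                                for d in doc_terms)
                            for q in query_terms])
        return mix * exact + (1 - mix) * semantic

    print(score(["retrieval", "ranking"], ["search", "document", "ranking"]))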
id = cord-020848-nypu4w9s
author = Morris, David
title = SlideImages: A Dataset for Educational Image Classification
date = 2020-03-24
keywords = dataset; image
summary = Currently, many document analysis systems are trained in part on scene images due to the lack of large datasets of educational image data. In this paper, we address this issue and present SlideImages, a dataset for the task of classifying educational illustrations. SlideImages contains training data collected from various sources, e.g., Wikimedia Commons and the AI2D dataset, and test data collected from educational slides. Born-digital and educational images need further benchmarks on challenging information retrieval tasks in order to test generalization. While document scans and born-digital educational illustrations have materially different appearance, these papers show that the utility of deep neural networks is not limited to scene image tasks (Fig. 1). The related DocFigure dataset covers similar images and has much more data than SlideImages. In this paper, we have presented the task of classifying educational illustrations and images in slides and introduced a novel dataset, SlideImages.
doi = 10.1007/978-3-030-45442-5_36

id = cord-020811-pacy48qx
author = Muhammad, Shamsuddeen Hassan
title = Incremental Approach for Automatic Generation of Domain-Specific Sentiment Lexicon
date = 2020-03-24
keywords = lexicon
summary = To this end, we propose an approach to automatically generate a domain-specific sentiment lexicon using a vector model enriched by weights. Although research has been carried out on corpus-based approaches for the automatic generation of a domain-specific lexicon [1, 4, 5, 7, 9, 10, 14], existing approaches focused on the creation of a lexicon from a single corpus [4]. To this end, this work proposes an incremental approach for the automatic generation of a domain-specific sentiment lexicon. We aim to investigate an incremental technique for automatically generating a domain-specific sentiment lexicon from a corpus. Can we automatically generate a sentiment lexicon from a corpus and improve on the existing approaches? After detecting the domain shift, we merge the distributions using an approach similar to the one discussed above (in updating using the same corpus) and generate the lexicon.
doi = 10.1007/978-3-030-45442-5_81

id = cord-020918-056bvngu
author = Nchabeleng, Mathibele
title = Evaluating the Effectiveness of the Standard Insights Extraction Pipeline for Bantu Languages
date = 2020-03-17
keywords = Bantu; Runyankore; language
summary =
doi = 10.1007/978-3-030-45439-5_11

id = cord-020832-iavwkdpr
author = Nguyen, Dat Quoc
title = ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents
date = 2020-03-24
keywords = chemical; patent
summary = ChEMU involves two key information extraction tasks over chemical reactions from patents. In this paper, we propose a new evaluation lab (called ChEMU) focusing on information extraction over chemical reactions from patents. Our goals are: (1) to develop tasks that impact chemical research in both academia and industry, (2) to provide the community with a new dataset of chemical entities, enriched with relational links between chemical event triggers and arguments, and (3) to advance the state-of-the-art in information extraction over chemical patents. The ChEMU lab at CLEF-2020 offers the two information extraction tasks of named entity recognition (Task 1) and event extraction (Task 2) over chemical reactions from patent documents.
doi = 10.1007/978-3-030-45442-5_74

id = cord-020820-cbikq0v0
author = Papadakos, Panagiotis
title = Dualism in Topical Relevance
date = 2020-03-24
keywords = query; user
summary = To this end, in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms (if they exist) for eventually producing an answer which provides a better overview of the related conceptual and information space. In their comments for these queries, users mention that the selected (i.e., dual) list "provides a more general picture" and "more relevant and interesting results, although contradicting". For the future, we plan to define the appropriate antonym selection algorithms and relevance metrics, implement the proposed functionality in a meta-search setting, and conduct a large-scale evaluation with real users over exploratory tasks, to identify for which queries the dual approach is beneficial and for what types of users.
doi = 10.1007/978-3-030-45442-5_40
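The dualism idea above, querying for the antonyms of the original terms alongside the terms themselves, can be sketched with WordNet; the query-rewriting policy here is only an illustration, not the authors' algorithm.

    # Minimal sketch of collecting WordNet antonyms for a "dual" query.
    # Requires the NLTK WordNet data: nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    def antonyms(term):
        """Collect antonym lemmas of a term across all its WordNet senses."""
        found = set()
        for synset in wn.synsets(term):
            for lemma in synset.lemmas():
                found.update(ant.name() for ant in lemma.antonyms())
        return found

    query = ["abandon", "fossil", "fuels"]
    dual_terms = {term: antonyms(term) for term in query}
    print(dual_terms)  # antonyms found per query term, possibly empty sets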
id = cord-020909-n36p5n2k
author = Papadakos, Panagiotis
title = bias goggles: Graph-Based Computation of the Bias of Web Domains Through the Eyes of Users
date = 2020-03-17
keywords = biased; domain
summary =
- the bias goggles model for computing the bias characteristics of web domains for a user-defined concept, based on the notions of Biased Concepts (BCs), Aspects of Bias (ABs), and the metrics of the support of the domain for a specific AB and BC, and its bias score for this BC;
- the introduction of the Support Flow Graph (SFG), along with graph-based algorithms for computing the AB support score of domains, which include adaptations of the Independence Cascade (IC) and Linear Threshold (LT) propagation models, and the new Biased-PageRank (Biased-PR) variation that models different behaviours of a biased surfer;
- an initial discussion about performance and implementation issues;
- some promising evaluation results that showcase the effectiveness and efficiency of the approach on a relatively small dataset of crawled pages, using the new AGBR and AGS metrics;
- a publicly accessible prototype of bias goggles.
doi = 10.1007/978-3-030-45439-5_52
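Of the graph algorithms listed in the bias goggles entry, the Biased-PR variation is the easiest to picture: a personalized PageRank whose surfer teleports to a seed set of biased domains rather than to all nodes uniformly. A generic sketch of that core idea (toy graph; the paper's specific surfer behaviours are not reproduced):

    # Minimal personalized-PageRank sketch: the random surfer teleports
    # only to a seed set of "biased" domains. Graph and seeds are toys.
    import numpy as np

    # Column-stochastic transitions: entry [j, i] is P(move to j | at i).
    links = np.array([
        [0.0, 0.5, 0.0, 0.0],
        [1.0, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 1.0],
        [0.0, 0.0, 0.5, 0.0],
    ])
    seeds = np.array([1.0, 0.0, 0.0, 0.0])  # teleport only to domain 0
    damping = 0.85

    rank = np.full(4, 0.25)
    for _ in range(100):  # power iteration to (approximate) convergence
        rank = damping * links @ rank + (1 - damping) * seeds

    print(rank.round(3))  # a support-like score for each domain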
id = cord-020871-1v6dcmt3
author = Papariello, Luca
title = On the Replicability of Combining Word Embeddings and Retrieval Models
date = 2020-03-24
keywords = Fisher; model
summary =
doi = 10.1007/978-3-030-45442-5_7

id = cord-020905-gw8i6tkn
author = Qu, Xianshan
title = An Attention Model of Customer Expectation to Improve Review Helpfulness Prediction
date = 2020-03-17
keywords = attention; product; review
summary = To model such customer expectations and capture important information from a review text, we propose a novel neural network which leverages review sentiment and product information. In order to address the above issues, we propose a novel neural network architecture to introduce sentiment and product information when identifying helpful content from a review text. In the cold-start scenario, our proposed model demonstrates an AUC improvement of 5.4% and 1.5% on the Amazon and Yelp data sets, respectively, when compared to the state-of-the-art model. From Table 5, we see that adding a sentiment attention layer (HSA) to the base model (HBiLSTM) results in an average improvement in the AUC score of 2.0% and 2.6%, respectively, on the Amazon and Yelp data sets. In this paper, we describe our analysis of review helpfulness prediction and propose a novel neural network model with attention modules to incorporate sentiment and product information.
doi = 10.1007/978-3-030-45439-5_55

id = cord-020872-frr8xba6
author = Santosh, Tokala Yaswanth Sri Sai
title = DAKE: Document-Level Attention for Keyphrase Extraction
date = 2020-03-24
keywords = CRF; document
summary =
doi = 10.1007/978-3-030-45442-5_49

id = cord-020936-k1upc1xu
author = Sanz-Cruzado, Javier
title = Axiomatic Analysis of Contact Recommendation Methods in Social Networks: An IR Perspective
date = 2020-03-17
keywords = BM25; user
summary =
doi = 10.1007/978-3-030-45439-5_12

id = cord-020885-f667icyt
author = Sharma, Ujjwal
title = Semantic Path-Based Learning for Review Volume Prediction
date = 2020-03-17
keywords = graph; network; node
summary = In this work, we present an approach that uses semantically meaningful, bimodal random walks on real-world heterogeneous networks to extract correlations between nodes and bring together nodes with shared or similar attributes. We propose a novel method that incorporates restaurants and their attributes into a multimodal graph and extracts multiple bimodal low-dimensional representations for restaurants based on available paths through shared visual, textual, geographical and categorical features. In this section, we discuss prior work that leverages graph-based structures for extracting information from multiple modalities, focusing on the auto-captioning task that introduced such methods. For each of these sub-networks, we perform random walks and use a variant of the heterogeneous skip-gram objective introduced in [6] to generate low-dimensional bimodal embeddings. Our attention-based model combines separately learned bimodal embeddings using a late-fusion setup for predicting the review volume of the restaurants.
doi = 10.1007/978-3-030-45439-5_54

id = cord-020875-vd4rtxmz
author = Suwaileh, Reem
title = Time-Critical Geolocation for Social Good
date = 2020-03-24
keywords = LMP; location
summary = To address this problem, I aim to exploit different techniques such as training neural models, enriching the tweet representation, and studying methods to mitigate the lack of labeled data. In my work, I am interested in tackling the Location Mention Prediction (LMP) problem during time-critical situations. The location taggers have to address many challenges, including microblogging-specific challenges (e.g., tweet sparsity, noisiness, rapidly changing streams, hashtag riding, etc.) and task-specific challenges (e.g., time-criticality of the solution, scarcity of labeled data, etc.). Alternatively, Sultanik and Fink [25] used an Information Retrieval (IR) based approach to identify the location mentions in tweets. Moreover, Hoang and Mothe [8] combined syntactic and semantic features to train traditional ML-based models, whereas Kumar and Singh [13] trained a Convolutional Neural Network (CNN) model that learns a continuous representation of tweet text and then identifies the location mentions.
doi = 10.1007/978-3-030-45442-5_82

id = cord-020903-qt0ly5d0
author = Tamine, Lynda
title = What Can Task Teach Us About Query Reformulations?
date = 2020-03-17
keywords = session; task
summary = Task-based sessions represent significantly different background contexts to be used in the perspective of better understanding users' query reformulations. Using insights from large-scale search logs, our findings clearly show that the task is an additional relevant search unit that helps better understand users' query reformulation patterns and predict the next user query. To design support processes for task-based search systems, we argue that we need to: (1) fully understand how the user's task, performed in natural settings, drives query reformulation changes; and (2) gauge the level of similarity of these change trends with those observed in time-based sessions. With this in mind, we perform large-scale log analyses of users naturally engaged in tasks to examine query reformulations from both the time-based session and the task perspectives. To identify query reformulation patterns, most previous works used large-scale log analyses segmented into time-based sessions.
doi = 10.1007/978-3-030-45439-5_42

id = cord-020891-lt3m8h41
author = Witschel, Hans Friedrich
title = KvGR: A Graph-Based Interface for Explorative Sequential Question Answering on Heterogeneous Information Sources
date = 2020-03-17
keywords = graph; question; user
summary =
doi = 10.1007/978-3-030-45439-5_50

id = cord-020932-o5scqiyk
author = Zhong, Wei
title = Accelerating Substructure Similarity Search for Formula Retrieval
date = 2020-03-17
keywords = list; query
summary = In text similarity search, query processing can be accelerated through dynamic pruning [18], which typically estimates score upper bounds to prune documents unlikely to be in the top K results. As a result, the posting list entry also stores the root node ID for indexed paths, in order to reconstruct matched substructures at merge time. Define the partial upperbound matrix $W = \{w_{i,j}\}_{|T_q| \times |\mathcal{T}|}$, where $\mathcal{T} = \{T(m), m \in T_q\}$ are all the token paths from the query OPT ($\mathcal{T}$ is essentially the same as the tokenized $P(T_q)$), and a binary variable $x_{|\mathcal{T}| \times 1}$ indicating which corresponding posting lists are placed in the non-requirement set. We have presented rank-safe dynamic pruning strategies that produce an upperbound estimation of structural similarity in order to speed up formula search using subtree matching. Our dynamic pruning strategies and specialized inverted index are different from traditional linear text search pruning methods, and they further associate the query structure representation with posting lists.
doi = 10.1007/978-3-030-45439-5_47

id = cord-020927-89c7rijg
author = Zhuang, Shengyao
title = Counterfactual Online Learning to Rank
date = 2020-03-17
keywords = COLTR; DBGD; ranker
summary =
doi = 10.1007/978-3-030-45439-5_28

id = cord-020846-mfh1ope6
author = Zlabinger, Markus
title = DSR: A Collection for the Evaluation of Graded Disease-Symptom Relations
date = 2020-03-24
keywords = disease; symptom
summary =
doi = 10.1007/978-3-030-45442-5_54
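Finally, the rank-safe dynamic pruning strategy in the Zhong entry above follows a familiar pattern: fully score a candidate only if a cheap upper bound on its score can still beat the current k-th best. A minimal sketch of that pattern with toy scores (Zhong et al. apply it to substructure similarity over math formulas):

    # Minimal sketch of rank-safe dynamic pruning with a top-k heap.
    import heapq

    def top_k(candidates, k):
        """candidates: iterable of (doc_id, upper_bound, exact_score_fn)."""
        heap = []  # min-heap of the best k exact scores seen so far
        for doc_id, upper_bound, exact_score_fn in candidates:
            threshold = heap[0][0] if len(heap) == k else float("-inf")
            if upper_bound <= threshold:
                continue              # rank-safe skip: cannot enter the top k
            score = exact_score_fn()  # expensive scoring, done only rarely
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
        return sorted(heap, reverse=True)

    # Toy candidates whose exact score is 90% of the (cheap) upper bound.
    toy = [(i, ub, lambda s=ub: s * 0.9)
           for i, ub in enumerate([3.2, 1.1, 4.0, 0.5])]
    print(top_k(toy, k=2))  # ~ [(3.6, 2), (2.88, 0)]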