key: cord-0039563-ky2nzzh7
authors: Couto, Francisco M.; Krallinger, Martin
title: Proposal of the First International Workshop on Semantic Indexing and Information Retrieval for Health from Heterogeneous Content Types and Languages (SIIRH)
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_87
sha: 4de667a4f35ba884063611ba6f9c9737c5def9a1
doc_id: 39563
cord_uid: ky2nzzh7

The application of Information Retrieval (IR) and deep learning strategies to explore the vast amount of rapidly growing health-related content is of utmost importance, but it is also particularly challenging because of the very specialized domain language and the implicit differences in language characteristics depending on the content type. This workshop aims to present and discuss current and future directions for IR and machine learning approaches devoted to the retrieval and classification of different types of health-related documents, ranging from layman or patient-generated texts to highly specialized medical literature and clinical records. It includes a session on the MESINESP shared task, supported by the Spanish National Language Technology plan (Plan TL), in order to address the importance and impact of community evaluation efforts, in particular BioASQ, BioCreative, eHealth CLEF, MEDIQA and TREC, as scenarios for exploring evaluation settings and generating data collections of key importance for promoting the development and comparison of IR resources. Additionally, an open session will address IR technologies for heterogeneous health-related content open to multiple languages, with a particular interest in the exploitation of structured controlled vocabularies and entity linking, covering the following topics: multilingual and non-English health-related IR, concept indexing and text categorization; generation of evaluation resources for biomedical document IR strategies; scalability, robustness and reproducibility of health IR and text mining resources; use of specialized machine translation and advanced deep learning approaches for improving health-related search results; medical Question Answering search tools; retrieval of multilingual health-related web content; and other related topics.

Keywords: Semantic indexing · Ontologies · Controlled vocabularies · Information Retrieval · Text mining · Natural language processing · Biomedical informatics

There is an increasing interest in exploiting the vast amount of rapidly growing content related to health [7] by means of Information Retrieval (IR) [12] and deep learning strategies [14, 18]. Health-related content is particularly challenging because of the highly specialized domain language and the implicit differences in language characteristics depending on the content type: patient-generated content such as discussion forums [15], blogs [8], social media [17] and other Internet sources; healthcare documentation and clinical records [6]; professional or scientific publications [9]; clinical practice guidelines; clinical trials documentation; medical questionnaires; medical informed consent documents; etc. Moreover, it is also critical to provide search solutions for non-English content as well as cross-language or multilingual IR solutions [4, 10, 16].

Efficient retrieval of biomedical documents is key for evidence-based medicine, for preparing systematic reviews and for retrieving particular clinical case studies. Because caregivers and healthcare professionals search under particular conditions (a limited amount of time spent per patient), they also need more sophisticated retrieval approaches applied to electronic health records [11], a type of content that is highly challenging due to its telegraphic and domain-specific language and the presence of negations and abbreviations. There is also interest in processing patient-generated content such as social media and patient fora, a key resource for rare disease research, for patient selection/stratification in clinical trials and for discovering new patient-reported symptoms and treatment-related adverse effects. In the health domain, indexing strategies relying on structured controlled vocabularies, such as MeSH/DeCS or SNOMED CT, represent a critical component of efficient biomedical search engines, enabling query expansion and refinement [2] and the improvement of recommender systems [3].
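The query expansion role of a controlled vocabulary mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `VOCABULARY` dictionary is a hand-written toy, not an extract of MeSH, DeCS or SNOMED CT, the function names are hypothetical, and the naive substring matching stands in for the entity linking step a real system would need.

```python
from typing import Dict, List

# Toy vocabulary (hypothetical entries, NOT real MeSH/DeCS records):
# preferred term -> synonyms, including a Spanish variant.
VOCABULARY: Dict[str, List[str]] = {
    "myocardial infarction": ["heart attack", "infarto de miocardio"],
    "hypertension": ["high blood pressure", "hipertensión"],
}


def build_lookup(vocabulary: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every preferred term and synonym (lowercased) to its preferred term."""
    lookup: Dict[str, str] = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for synonym in synonyms:
            lookup[synonym.lower()] = preferred
    return lookup


def expand_query(query: str, vocabulary: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by the other variants of each
    vocabulary concept that the query mentions (naive substring matching)."""
    lookup = build_lookup(vocabulary)
    text = query.lower()
    expanded = [query]
    for surface, preferred in lookup.items():
        if surface not in text:
            continue
        for variant in [preferred] + vocabulary[preferred]:
            # Skip variants already present in the query or already added.
            if variant.lower() not in text and variant not in expanded:
                expanded.append(variant)
    return expanded


if __name__ == "__main__":
    print(expand_query("heart attack treatment", VOCABULARY))
```

Because the synonyms of one concept can span languages, the same lookup also hints at how a controlled vocabulary can support cross-language retrieval, although a production system would rely on the vocabulary's official translations rather than a hand-built dictionary.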

Currently, most biomedical NLP and IR research is done on English documents [13], and only a few tasks have been carried out on non-English texts [5]. Many structured controlled vocabularies are also available only in English [19]. Nonetheless, it is important to note that a considerable amount of medically relevant content is published in languages other than English; clinical texts in particular are, with a few exceptions, written entirely in the native language of each country. The critical importance of semantic indexing with medical vocabularies motivated several shared tasks in the past, in particular the BioASQ tracks, which attracted a considerable number of participants and had an impact on the field. Following the outline of previous medical indexing efforts, in particular the success of the BioASQ tracks centered on PubMed, the BioASQ MESINESP task, supported by the Spanish National Language Technology plan (Plan TL), proposes to carry out the first task on semantic indexing of Spanish medical texts. This workshop will be a forum where the community can present and discuss current and future directions for the area, based on the experience of participating in the MESINESP shared task or in other medical IR, QA or text categorization evaluation campaigns, as well as on the exploitation of evaluation settings and data collections generated through these kinds of community evaluation efforts (both during and after the competition period).

In addition to the MESINESP and shared task/evaluation campaign participation experience session, the workshop will include an Open Session covering IR technologies for heterogeneous health-related content open to multiple languages with a particular interest in the exploitation of structured controlled vocabularies and entity linking for document indexing and semantic search applications.

Among the proposed topics for the Open Session are: (1) multilingual and non-English health-related IR, concept indexing and text categorization strategies; (2) generation of evaluation resources for biomedical document IR strategies; (3) scalability, robustness, reproducibility, utility and usability [1] of health IR and text mining resources; (4) use of specialized machine translation and advanced deep learning approaches for improving health-related search results; (5) medical Question Answering search tools; and (6) retrieval of multilingual health-related web content. Note that we will also consider other submissions related to innovative, cutting-edge health and biomedical IR strategies, including evaluation and the generation of Gold Standard evaluation data sets.

All teams implementing systems for MESINESP will be invited to submit an article describing their participation strategy. The program committee will review the papers and select those that will be given a presentation slot at the workshop. For the Open Session, we will invite researchers to submit novel IR approaches for processing heterogeneous health-related content, with particular interest in non-English content, novel content types, and semantic indexing strategies exploiting structured controlled vocabularies and ontologies.

We expect that investigation of these topics will continue after the workshop, based on new insights obtained through discussions during the event. As a venue for compiling the results of this follow-up work, a journal special issue will be organized, to be published a few months after the workshop.

BioCreative III interactive task: an overview

Knowledge representation and management: a linked data perspective

Hybrid semantic recommender system for chemical compounds

Findings of the WMT 2019 biomedical translation shared task: evaluation for MEDLINE abstracts and biomedical terminologies

Impact of translation on named-entity recognition in radiology texts

Text analysis and information extraction from Spanish written documents

Data and Text Processing for

How valuable is medical social media data? Content analysis of the medical web

Finding mentions of abbreviations and their definitions in Spanish clinical cases: the BARR2 shared task evaluation results

Overview of the CLEF eHealth evaluation lab

Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review

Information retrieval and text mining technologies for chemistry

Text mining for bioinformatics using biomedical literature

BioBERT: pre-trained biomedical language representation model for biomedical text mining

AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums

Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results

Detecting drugs and adverse events from Spanish social media streams

BiOnt: deep learning using multiple biomedical ontologies for relation extraction

The MeSpEN resource for English-Spanish medical machine translation and terminologies: census of parallel corpora, glossaries and term translations