key: cord-0747286-ktinqoc7
title: PADI-web 3.0: A new framework for extracting and disseminating fine-grained information from the news for animal disease surveillance
authors: Valentin, Sarah; Arsevska, Elena; Rabatel, Julien; Falala, Sylvain; Mercier, Alizé; Lancelot, Renaud; Roche, Mathieu
date: 2021-12-03
journal: One Health
DOI: 10.1016/j.onehlt.2021.100357
sha: fffabe20c59b9fb5bd418106f4d749fccb81311b
doc_id: 747286
cord_uid: ktinqoc7

PADI-web (Platform for Automated extraction of animal Disease Information from the web) is a biosurveillance system dedicated to monitoring online news sources for the detection of emerging animal infectious diseases. PADI-web has collected more than 380,000 news articles since 2016. Compared to other existing biosurveillance tools, PADI-web focuses specifically on animal health and has a fully automated pipeline based on machine-learning methods. This paper presents the new functionalities of PADI-web based on the integration of: (i) a new fine-grained classification system, (ii) automatic methods to extract terms and named entities with text-mining approaches, (iii) semantic resources for indexing keywords and (iv) a notification system for end-users. Compared to other biosurveillance tools, PADI-web, which is integrated in the French Platform for Animal Health Surveillance (ESA Platform), offers strong coverage of the animal sector, a multilingual approach, an automated information extraction module and a notification tool configurable according to end-user needs.

Over the past decades, the number of outbreaks due to (re)emerging animal and human infectious diseases has been increasing in many parts of the world. In addition to the well-known role of human and animal mobility in the spread of pathogens, climate change and biodiversity loss are likely to exacerbate the global burden of these diseases [1, 2]. National and international institutions currently face a global paradox: reconciling the expansion of trade with the control of risks to human and animal health. In this context, the Epidemic Intelligence Service of the US Centers for Disease Control and Prevention (CDC) is considered to be the earliest public health system dedicated to early warning [3]. It was created to enhance the surveillance and eradication of both infectious and noninfectious human diseases, such as poliomyelitis and leukemia, and it also monitors for bioterrorism. The epidemic intelligence (EI) concept, as it is used today, was developed in the early 2000s. The French Institut de Veille Sanitaire (Institute of Health Surveillance) and the European Centre for Disease Prevention and Control (ECDC) proposed an EI framework to enhance disease surveillance in Europe in 2006 [4, 5]. Eight years later, the World Health Organisation (WHO) published a comprehensive guide providing key definitions and detailing the implementation of early warning activities [6]. EI corresponds to a formalized surveillance process that encompasses 'all activities related to the early identification of potential health hazards that may represent a risk to health, and their verification, assessment and investigation' [6]. It relies on two main channels of information: indicator-based surveillance (IBS) and event-based surveillance (EBS). Indicator-based surveillance is defined as 'the systematic collection, monitoring, analysis and interpretation of structured data (i.e. indicators)' [6]. It corresponds to conventional surveillance of formal sources and is based on established case definitions.
Event-based surveillance is defined by the WHO as 'the organized collection, monitoring, assessment and interpretation of mainly unstructured ad hoc information regarding health events or risks, which may represent an acute risk to human [or animal] health' [6]. The definitions and concepts from both ECDC and WHO were developed for public health. However, they have been successfully transferred to other domains, such as plant health [7] and both terrestrial and aquatic animal health [8-10]. Both EBS and IBS can be formally represented as consecutive steps corresponding to the flow of epidemiological information from its detection to its communication to the relevant authorities [4, 11, 12] (see Fig. 1). The first stages of the EI process consist of identifying and extracting relevant information from heterogeneous data. This process is based on four steps: (1) data detection, (2) data triage, (3) signal verification and (4) risk assessment and alert communication.

1. Data detection consists of defining the modalities (e.g., format) through which raw data are detected and collected. The strategy differs according to the type of source. IBS usually relies on well-established notification procedures submitted by a country for a given disease according to international health regulations. EBS systems use specific queries implemented through RSS (Really Simple Syndication) feeds to detect news about disease outbreaks.
2. Data triage is an important step to avoid overwhelming the EI system with irrelevant data. Raw data are filtered (i.e. triaged) based on given selection criteria. In IBS, selection criteria vary according to institutional mandates, such as (i) geographical coverage, e.g., worldwide (World Organisation for Animal Health, OIE) or regional (ECDC), and (ii) thematic mandate, e.g., animal health or public health. In EBS, data triage is based on (i) data filtering to remove duplicates and irrelevant data and (ii) data selection to sort information according to the EBS system's priority criteria. Data deemed relevant to early warning activities are referred to as 'signals'.
3. Signal verification aims to validate the truthfulness of a signal. This step is crucial in the EBS workflow since the data sources are informal. Once validated, a signal can become an 'event', i.e. a manifestation of the threat in a given affected host or population at a given location and date [13]. In the EBS context, this definition encompasses events of all known origins (infectious, zoonotic, food safety, chemical, radiological or nuclear) as well as pathogens of unknown origin.
4. The final stage deals with risk assessment and alert communication. Risk assessment aims at determining the level of risk related to a detected event. The event becomes an alert if the risk is considered significant to health. Alerts are communicated through channels adapted to the relevant authorities (e.g., national public health networks, ministries of health, international organizations) or to a larger network (e.g., end-users of EBS systems).

This paper presents new functionalities of the EBS system PADI-web (Platform for Automated extraction of animal Disease Information from the web, https://padi-web.cirad.fr/), dedicated to the monitoring of online news sources for the detection of emerging or new animal infectious diseases. PADI-web has collected more than 380,000 news items since 2016.
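As a concrete illustration of the signal, event and alert notions defined above, an event can be modeled as a simple record holding a threat, a host or population, a location and a date. The following sketch is ours alone; the field names are illustrative and do not reflect any actual PADI-web schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Event:
    """A verified signal: a threat manifesting in a given host or
    population at a given location and date (illustrative fields only)."""
    threat: str                        # e.g., "African swine fever"
    host: str                          # affected host/population, e.g., "domestic pigs"
    location: str                      # e.g., "Shandong, China"
    event_date: date
    verified: bool = False             # set to True after signal verification (step 3)
    risk_level: Optional[str] = None   # set during risk assessment (step 4)

signal = Event("African swine fever", "domestic pigs", "Shandong, China",
               date(2018, 8, 3))
signal.verified = True                 # the signal becomes an event
signal.risk_level = "alert"            # risk deemed significant to health
```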
The first descriptions of PADI-web have been published elsewhere [8, 14]. This paper presents new functionalities of PADI-web based on the integration of: (i) a new fine-grained classification system, (ii) automatic methods to extract terms and named entities with text-mining approaches, (iii) semantic resources for indexing keywords and (iv) an automatic notification system for end-users that uses these new methods.

PADI-web was developed to meet the needs of the French Epidemic Intelligence System (FEIS) via online news monitoring. FEIS has been involved in the activities of the French Platform for Animal Health Surveillance (ESA Platform) since 2013. FEIS aims to identify, monitor and analyze reports of animal health hazards (including zoonotic diseases) threatening animal populations in France by monitoring official and unofficial information sources. PADI-web has been integrated into FEIS activities through ad hoc use depending on the epidemiological news. PADI-web successfully identified signals for current outbreaks of diseases that are notifiable to the OIE, such as avian influenza (AI), African swine fever (ASF), foot-and-mouth disease (FMD) and bluetongue (BTV) [15], and human diseases such as COVID-19 [16]. Moreover, PADI-web is able to detect the first signals of emerging infectious disease outbreaks in a timely manner, as illustrated by the detection of primary FMD outbreaks in East Asia in 2016 [8]. PADI-web also provided alerts of ASF, FMD and BTV emergence within previously unaffected areas, for which we could not find any official confirmation [8]. The paper is structured as follows: in Section 2, we discuss related work; in Sections 3 and 4, we describe the proposed extensions of PADI-web and the results obtained; and in Section 5, we conclude the work.

In the context of EBS implementations, two important tasks must be considered. First, EBS systems must identify relevant texts (e.g., news, documents, sentences) related to epidemiological issues; the related work on this issue is detailed in Section 2.1. Second, epidemiological information (e.g., locations, symptoms, etc.) related to an event has to be extracted from these relevant texts; this issue is presented in Section 2.2.

Most EBS systems (e.g., MediSys, HealthMap, GPHIN, Argus, AquaticHealth.net, PADI-web [8]) involve binary classification, i.e., news articles are identified as relevant or irrelevant. Interestingly, there is no formal definition of relevance, which significantly limits comparisons of EBS performance. In addition, the lack of shared gold standards and annotated resources hampers knowledge and experience sharing. Two types of classification methods are most common: (i) PADI-web 1.0 (the first version [8]), MediSys and AquaticHealth.net use a keyword-based approach; (ii) GPHIN, Argus and HealthMap rely on machine learning-based classifiers. PADI-web 1.0 [8] categorized collected news articles using a list of 32 outbreak-related keywords: news articles are classified as relevant if their text (title and body) contains one of the keywords related to an outbreak event (e.g., 'outbreak', 'cases', 'spread') [17].
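Below is a minimal sketch of such keyword-based triage in the spirit of PADI-web 1.0. The keyword list is a small illustrative subset, not the actual 32-keyword list used by the system:

```python
# Illustrative subset of outbreak-related keywords (not the real list).
OUTBREAK_KEYWORDS = {"outbreak", "cases", "spread", "epidemic", "infected"}

def is_relevant(title: str, body: str) -> bool:
    """An article is relevant if its title or body contains any keyword."""
    text = f"{title} {body}".lower()
    return any(keyword in text for keyword in OUTBREAK_KEYWORDS)

print(is_relevant("Bird flu outbreak hits poultry farm", "..."))  # True
print(is_relevant("Local festival draws record crowds", "..."))   # False
```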
MediSys classification relies on a more sophisticated approach involving Boolean combinations and keyword weightings. A document is considered relevant if it matches one of a predefined set of alerts [18, 19]. Two types of signals (single and combination) are implemented. A single signal attributes positive and negative weights to relevant and irrelevant keywords, respectively; an article is kept if the sum of the weights of the keywords it contains is above a given threshold. A combination signal is based on keywords combined with Boolean expressions ('AND' and 'AND NOT'): documents are selected if they contain at least two relevant keywords and do not include any irrelevant keywords. News articles on AquaticHealth.net are tagged by users if they contain specific key terms, usually the scientific names of diseases. This strategy is based on the assumption that authors using correct scientific terminology are more likely to disseminate relevant information.

Other systems (i.e. HealthMap, BioCaster, Argus and GPHIN) rely on supervised machine learning classifiers, namely Bayesian algorithms (used by three of these systems) and support vector machines (SVMs). The classifiers are trained on manually labeled data and automatically learn rules to label unclassified news articles. HealthMap uses a Bayesian machine learning algorithm. GPHIN automatically computes a relevance score for each retrieved report; this score corresponds to the confidence estimate of an SVM classifier [20]. Relevant articles in Argus are identified by keyword matching (with a set of concepts and keywords relevant to infectious disease surveillance) combined with Bayesian software tools. In the GPHIN, HealthMap and Argus systems, experts further evaluate the automatic classification. In GPHIN, articles with a high relevance score are published immediately, while low-scoring reports are discarded automatically. Analysts triage the remaining medium-relevance reports and also review automatically discarded articles to verify that relevant information has not been erroneously filtered out [20]. BioCaster's classification is fully automated: a naive Bayes classifier was trained on a gold-standard corpus in which each article was manually assigned to one of four classes (alert, publish, check and reject). In practice, a binary classification was implemented, with the alert, publish and check classes merged into a single 'relevant' category. In its current version, PADI-web integrates a supervised classifier [14]. Our paper proposes an original fine-grained classification based on sentence classification to highlight new epidemiological information, as described in Section 3.
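As an illustration of the supervised approach shared by these systems, the sketch below trains a linear SVM on TF-IDF features with scikit-learn. It is a toy example under our own assumptions (four placeholder articles standing in for a manually labeled corpus), not the code of any of the cited systems:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled corpus standing in for manually labeled news articles.
docs = [
    "New outbreak of avian influenza confirmed in poultry farms",
    "Local festival draws record crowds this weekend",
    "African swine fever cases spread to neighboring province",
    "Stock markets rallied on strong earnings reports",
]
labels = ["relevant", "irrelevant", "relevant", "irrelevant"]

# TF-IDF features feeding a linear SVM classifier.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(docs, labels)
print(classifier.predict(["Foot-and-mouth disease detected in cattle herd"]))
```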
Epidemiological information related to an event must then be extracted from relevant texts (e.g., articles, sentences). Information extraction (IE) aims at locating specific pieces of information in textual data [21]. Entity extraction, also called named entity recognition (NER), is an IE subtask that seeks to locate and classify textual elements into predefined categories: (i) locations (e.g., 'Lagos', 'China'), (ii) temporal expressions (e.g., 'last month', 'July 28, 1990'), (iii) organizations (e.g., 'Ministry of Health'), (iv) person names, (v) quantities (e.g., '2'), and so on. This list of predefined categories can be extended to include domain-specific (thematic) entities. In the animal health domain, we deal with (i) disease names (e.g., 'avian influenza', 'AI'), (ii) causal agents (e.g., 'H5N1 virus'), (iii) animal species (e.g., 'chicken'), (iv) symptoms (e.g., 'appetite loss'), and so on. In the context of online news, it is important to distinguish geographic entity extraction and resolution, which aim at correctly extracting and identifying all locations in a text, from the identification of event-related locations.

Two types of approaches are used, and combined, to extract entities from texts: (1) dictionary-based approaches and (2) classifier-based approaches. The dictionary-based approach involves matching terms from a document against a list of keywords. Some dictionaries have an ontological structure rather than being a simple list of terms; ontologies aim at modeling the relations between entities [22]. In the health domain, an ontology can represent the causality relationship between a disease and a pathogen [23]. Geographical dictionaries are usually called gazetteers. Dictionaries and ontologies need regular updates to include new terms, which requires time-consuming manual work. Note also that terms can be ambiguous: 'May' can refer to a date, a location or a person's name. In the sentence 'The virus can be transmitted between pigs by their body fluids', the term 'body fluids' refers to the transmission route, yet in another context it may relate only to an anatomical concept. In location extraction, this level of ambiguity is referred to as geo/non-geo ambiguity [24].

To overcome the rigidity of the dictionary-based approach, an alternative is to consider NER as a classification task, where the type of entity is the label to assign. Conditional random fields (CRFs) are among the most prominent classifiers used for NER [25] and are at the core of well-established pretrained NER tools, including StanfordNER [26] and NLTK [27]. This approach is designed for sequential data: CRFs predict the probability of an output sequence given an input sequence [28]. The classification approach is particularly suitable for misspelt locations or short texts such as tweets, for which gazetteer lookups suffer from low precision due to irrelevant matches [29]. While classifier-based approaches achieve good results, they are limited to the categories used during training. Recent tools allow users to add new types of entities to NER algorithms by training the model on annotated datasets, such as the neural network-based NER algorithm of the SpaCy package [30]. Locations are also prone to another level of ambiguity, which occurs when several distinct places share the same name: geo/geo ambiguity, or referent ambiguity. Several methods are described in the literature to address these kinds of spatial ambiguity [24, 31, 32].
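To make the classifier-based approach concrete, the sketch below applies a pretrained general-purpose SpaCy model to a news-like sentence. Note that such a model recognizes locations, organizations, dates and quantities but not domain entities such as disease names, which is precisely why a domain-specific model is needed (see Section 3.5):

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The Ministry of Health reported 12 cases of avian influenza "
          "near Lagos on July 28.")
for ent in doc.ents:
    # Typical output: ('Ministry of Health', 'ORG'), ('12', 'CARDINAL'),
    # ('Lagos', 'GPE'), ('July 28', 'DATE'); 'avian influenza' is missed.
    print(ent.text, ent.label_)
```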
The PADI-web pipeline involves six steps, ranging from online news collection to the extraction of epidemiological features: (1) data collection, (2) data processing, (3) data classification, (4) sentence classification, (5) information extraction and (6) user notification. All these steps are summarized in the following subsections; they were previously detailed in [14]. The extensions proposed in PADI-web 3.0 (i.e. sentence classification, named entity annotation, terminology extraction and notification) are described in Subsections 3.4, 3.5 and 3.6. The different steps of the pipeline are shown in Fig. 2.

PADI-web retrieves articles daily from the news aggregator Google News through customized RSS feeds. An RSS feed is a combination of terms (disease names, symptoms or hosts). These terms were identified by an approach combining text mining and domain expertise. The RSS feeds are of two types:

• Disease-based RSS feeds consist of disease names (e.g., bluetongue OR BTV) and target seven animal diseases.
• Symptom-based RSS feeds include clinical signs and hosts (e.g., fever AND pigs) without any disease names. These feeds enable the detection of diseases that are not yet monitored by PADI-web, as well as unknown diseases [33].

RSS feeds are implemented in 16 languages (e.g., English, Chinese, Arabic, Italian, French, Russian and Turkish). To avoid duplicates, PADI-web checks whether a collected article already exists in the database based on its URL. The webpages of non-duplicate news articles are then visited to retrieve their content. The BeautifulSoup and readability-lxml Python libraries are used to collect the content of the webpages and remove irrelevant elements (e.g., pictures, hyperlinks) [34]. Used in the preprocessing stage of the PADI-web pipeline, these libraries isolate the cleaned content of web documents, i.e. only the body, title, etc. All news articles that are not in English are translated using the Translator API of the Microsoft Azure system [35].
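A minimal sketch of this collection-and-cleaning step is shown below. The URL is a placeholder, and the snippet only illustrates how the two libraries fit together; it is not PADI-web's actual scraping code:

```python
import requests
from bs4 import BeautifulSoup
from readability import Document  # pip install readability-lxml

# Placeholder URL standing in for a link harvested from an RSS feed.
html = requests.get("https://example.com/news-article", timeout=10).text

doc = Document(html)          # readability isolates the main article content
title = doc.title()
summary_html = doc.summary()  # cleaned HTML of the article body

# BeautifulSoup strips the remaining markup down to plain text.
body_text = BeautifulSoup(summary_html, "lxml").get_text(separator=" ", strip=True)
print(title)
print(body_text[:200])
```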
The classification step selects relevant news among all the news collected. A relevant news article is one related to a disease event; this includes descriptions of a current outbreak as well as its socioeconomic impacts, preparedness, prevention and control measures, etc. The classification module is based on a supervised machine learning approach described in [14]. For classification tasks, annotation is usually performed at the document level. The labels are often related to the news' relevance, so as to filter out irrelevant articles [14, 36-38]. Other classification frames assign a broad thematic label to the news, such as 'outbreak-related' or 'socioeconomic' [39].

PADI-web already includes a classification module for texts (i.e. news articles). In PADI-web 3.0, the sentence classification feature extends it to automatically classify all the sentences of a new text. This enables the highlighting of fine-grained epidemiological information (e.g., risk events, preventive and control measures, etc.). Each new text acquired by PADI-web is segmented into sentences (with SpaCy, https://spacy.io/). These sentences are then automatically processed by a classifier based on a dedicated classification model. The sentences are represented with a classical vector space model (i.e. bag-of-words). To learn the models and automatically classify sentences, labeled training data are needed. For the proposed fine-grained classification, a dedicated corpus was built [40]; it is summarized below. A detailed version of the guidelines and the labeled corpus are publicly available in a Dataverse repository [40].

For each sentence, two labeling approaches are proposed: event type and information type. The event type label differentiates sentences referring to the current or recent outbreak (Current event and Risk event) from sentences referring to old outbreaks (Old event) or general information (General). Sentences that do not contain any epidemiological information are considered irrelevant (Irrelevant). The information type label describes the sentence's epidemiological topic. As epidemiological topics, we include the notification of a suspected or confirmed event, the description of a disease in an area (Descriptive epidemiology and Distribution), preventive or control measures against a disease outbreak (Preventive and control measures), an event's economic and/or political impacts (Economic and political consequences), its suspected or confirmed transmission mode (Transmission pathway), the expression of concern and/or facts about risk factors (Concern and risk factors) and general information about the epidemiology of a pathogen or a disease (General epidemiology).

The information extraction step aims to extract epidemiological entities from the relevant news content. In PADI-web 3.0, the previous information extraction method, founded on rule-based systems and data mining techniques, has been entirely replaced by the named entity recognition process detailed below. As described in Section 2.2, several tools exist for NER; SpaCy has been integrated into PADI-web 3.0. SpaCy includes powerful NER models that recognize general named entities in texts in several languages. With PADI-web, we can use a classical model to identify well-known named entities such as locations and organizations. Moreover, specific models for entity recognition in the animal disease surveillance domain (e.g., host, etc.) can be used (see Section 2.2). We used a labeled dataset [41] to learn and integrate a domain-specific model, which is able to detect host and disease names, as well as the numbers of cases related to an outbreak. Both types of entities (i.e. general and specific) are thus recognized in PADI-web 3.0. For location names, calls to the GeoNames gazetteer API [42] associate each recognized location name with a GeoNames entity ID (see Fig. 3).
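Since each recognized location name is resolved against the GeoNames gazetteer, a minimal sketch of such a lookup through the public searchJSON endpoint is given below. The 'demo' username is a placeholder (a free GeoNames account is required), and PADI-web's actual client code may batch or cache these calls differently:

```python
import requests

def geonames_lookup(name: str, username: str = "demo") -> dict:
    """Resolve a location name to a GeoNames entity via the search API."""
    response = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": name, "maxRows": 1, "username": username},
        timeout=10,
    )
    results = response.json().get("geonames", [])
    return results[0] if results else {}

place = geonames_lookup("Lagos")
print(place.get("geonameId"), place.get("countryName"))
```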
To annotate textual data, Brat was integrated into PADI-web 3.0. Brat is a powerful annotation tool that contains all the functionalities needed to prepare and annotate a corpus (see Fig. 4). Moreover, PADI-web can export texts from the PADI-web database and convert them into the Brat format. PADI-web also includes the option of using Brat-annotated corpora to update an existing model with new examples.

Terminology extraction is an important task in the natural language processing (NLP) domain: it enables the extraction of relevant and discriminative terms from textual data. In PADI-web 3.0, we integrated a tool called BioTex [43] to highlight terms extracted from the textual data (i.e. news) of PADI-web. We chose BioTex because it was initially built for the medical domain: it combines linguistic and statistical information adapted to biomedical areas. To select appropriate terms, BioTex uses two principles: (i) a combination of statistical measures, e.g., term frequency-inverse document frequency (TF-IDF), Okapi BM25 and the C-value [43]; and (ii) a list of syntactic structures of terms learned from relevant sources in the medical domain, e.g., MeSH (Medical Subject Headings). The terms extracted with BioTex can be either single words (e.g., 'pig') or multiword terms (e.g., 'wild pig', 'domestic pig').

To index data dealing with the agriculture domain, the AGROVOC thesaurus has been integrated into PADI-web 3.0. AGROVOC is the largest linked open dataset dealing with the agriculture domain; it is a thesaurus containing 38,780 concepts and more than 808,000 terms.

Each keyword source belongs to one of three source types handled by PADI-web 3.0: semantic resources (i.e. AGROVOC), users (e.g., epidemiology-related keywords provided by a user) and automatic processes (i.e. SpaCy and BioTex). Keywords are organized according to their source type and are colored accordingly. In the text body, icons standing for keyword matches follow the same color code as the keyword lists (see Fig. 5).

Finally, a keyword alignment mechanism is integrated into PADI-web 3.0 to align recognized entities and keywords with elements of the PADI-web database. This mechanism is based on the Levenshtein distance [44, 45]: two keywords are aligned if their distance is below a user-specified threshold. The Levenshtein distance between two strings is the minimum number of operations (insertion, deletion or substitution of a single character) needed to transform the source string into the target string; the measure is normalized between 0 and 1 by dividing it by the length of the longer string.
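A minimal sketch of this normalized distance is given below; the alignment threshold itself is left to the caller since, in PADI-web, it is user-specified:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to transform string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance normalized to [0, 1] by the longer length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

# Two keywords are aligned if their distance is below a threshold.
print(normalized_levenshtein("swine fever", "swine fevers"))  # 1/12, about 0.083
```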
The notification system presented below builds on the new functionalities described in Subsections 3.4 and 3.5. Registered users can subscribe to notification emails in PADI-web 3.0 to receive regular emails summarizing basic information about recently collected texts. Notification emails aim to provide a PADI-web user with a compact list of recently collected texts. It is also possible to filter articles by specifying text sources (e.g., RSS feeds) in the user's preferences. Emails are sent daily or weekly (according to the user's preferences) and contain three types of information:

• The list of new texts collected during the period covered by the notification. The maximum number of texts listed in an email is thirty; a link to the PADI-web search page is provided for a complete inspection (see Fig. 6).
• Information about the classification of the new texts covered by the notification email. For each classification task, the number of new texts assigned to each classification label is provided (see Fig. 7).
• Information about keywords and entities extracted from the new texts. The 30 most frequent keywords found in the texts are listed with their respective frequencies, i.e., the number of texts in which each entity or keyword occurs (see Fig. 7).

Daily notification emails are sent every morning starting from 7 am (PADI-web server time). Weekly notifications are sent every Monday morning.

This section presents the experimental results obtained with sentence classification. We evaluated a global classification model able to identify both the event type and the information type of an unlabeled sentence (see Section 3.4). We used an annotated corpus containing 1244 sentences (from 87 news articles). We applied classical NLP techniques to transform the sentences into numerical vectors: punctuation removal, conversion to lowercase, tokenization and term frequency-inverse document frequency (TF-IDF) weighting [46, 47]. We compared three classifiers widely used for text classification: naive Bayes (NB), support vector machines (SVMs) and multilayer perceptrons (MLPs). We estimated the performance of the trained models using precision, recall, F-measure and 5-fold cross-validation. Precision is the number of correctly classified sentences of a given class over the number of sentences attributed to that class. Recall is the number of correctly classified sentences of a given class over the true number of sentences of that class. The F-measure is the harmonic mean of precision and recall.

The results obtained with the MLP classifier are presented in Tables 1 and 2. Classification scores for some classes (e.g., Current event, Descriptive epidemiology) are better than for others (e.g., Old event, Distribution). This can be explained by the unbalanced datasets used. Moreover, during the annotation phase, many instances of certain classes (e.g., General epidemiology) shared the same sentence structure (e.g., 'The virus causes a hemorrhagic fever with high mortality rates in pigs'), which favors machine learning approaches. MLP and SVM achieved comparable performance and outperformed the NB classifier [48]. These behaviors were identical for both event type and information type classifications [48].

In PADI-web 3.0, a selection of model families is trained on the current dataset (random forests with various parameters, linear support vector classification, neural networks, Gaussian-based models, K-nearest neighbors, etc.) using a 5-fold cross-validation scheme. The model obtaining the highest mean accuracy score is used to classify each newly retrieved sentence. For article-level classification tasks, a model is trained and built automatically every night using existing user classification labels. Currently, the trained classification models integrated into PADI-web 3.0 reach a mean accuracy score of 0.94 for the article-level Relevance task using a random forest classifier. For the sentence-level Event type and Information type classification tasks, the accuracy is 0.66 and 0.49, respectively, with a random forest classifier in both cases. The preprocessing tasks applied in our experiments to optimize the results summarized in this paper and detailed in [48] have yet to be integrated into PADI-web 3.0.
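To illustrate the evaluation protocol described above (TF-IDF features, an MLP classifier and 5-fold cross-validation with macro-averaged precision, recall and F-measure), here is a minimal scikit-learn sketch. The ten sentences are toy placeholders for the 1244-sentence corpus [40], and the pipeline is our own reconstruction rather than the authors' experimental code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy sentences standing in for the annotated corpus (5 per class).
sentences = [
    "An outbreak of African swine fever was confirmed yesterday.",
    "Authorities reported new avian influenza cases in the region.",
    "Twelve farms are under quarantine after the detection.",
    "The virus was detected in a backyard poultry flock last week.",
    "Officials confirmed foot-and-mouth disease in a cattle herd.",
    "The virus causes a hemorrhagic fever with high mortality in pigs.",
    "Bluetongue is transmitted by biting midges of the genus Culicoides.",
    "Avian influenza viruses circulate naturally among wild birds.",
    "Swine fever affects both domestic pigs and wild boar.",
    "The disease typically spreads through contaminated feed.",
]
labels = ["Current event"] * 5 + ["General epidemiology"] * 5

# Lowercasing and TF-IDF weighting, then a multilayer perceptron.
pipeline = make_pipeline(TfidfVectorizer(lowercase=True),
                         MLPClassifier(max_iter=500, random_state=0))

# 5-fold cross-validation with macro-averaged precision, recall and F-measure.
scores = cross_validate(pipeline, sentences, labels, cv=5,
                        scoring=("precision_macro", "recall_macro", "f1_macro"))
print(scores["test_f1_macro"].mean())
```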
This paper presented the extraction of epidemiological event information for animal disease surveillance using the new functionalities of PADI-web. These functionalities are based on semantic information and fine-grained information integrated into machine learning approaches. As noted in [49], health agencies have been reluctant to incorporate the outputs of EBS biosurveillance tools into their systems because many technical issues had not yet been addressed. In the development of PADI-web 3.0 and its new functionalities, experts were solicited at different levels: identification of users' operational needs, creation of annotation guidelines, corpus annotation, qualitative evaluation of method outputs and regular feedback loops concerning the development methods and outputs. Health experts are inclined to integrate automatic processes beyond event-based surveillance outputs, as these directly support the decision-making process [50]. The links and collaborations between informatics and epidemiology should therefore be strengthened and promoted.

All persons who meet authorship criteria are listed as authors, and all authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the manuscript. Furthermore, each author certifies that this material or similar material has not been and will not be submitted to or published in any other publication before its appearance in One Health. All persons who have made substantial contributions to the work reported in the manuscript (e.g., technical help, writing and editing assistance, general support) are named in the Acknowledgements.

This work was funded by the French General Directorate for Food (DGAL), the French Agricultural Research Centre for International Development (CIRAD) and the SONGES Project (FEDER and Occitanie). This work was also supported by the French National Research Agency under the Investments for the Future Program, referred to as ANR-16-CONV-0004. We thank Valérie De Waele and Aline Vilain (Moriskin project coordinated by Sciensano (Institute of Public and Animal Health), Belgian Federal Public Service Health) for the annotation of the corpus cited in this paper. This study was partially funded by EU grant 874850 MOOD and is catalogued as MOOD014. The contents of this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission.

References:
• Impacts of biodiversity on the emergence and transmission of infectious diseases
• Biodiversity loss and the rise of zoonotic pathogens
• The epidemic intelligence Service of the Center for Disease Control
• What is epidemic intelligence, and how is it being improved in Europe? Weekly Releases
• Epidemic intelligence: a new framework for strengthening disease surveillance in Europe
• Assessment and Response to Acute Public Health Events: Implementation of Early Warning and Response with a Focus on Event-Based Surveillance, Interim Version Edition
• Development and testing of the media monitoring tool MedISys for the monitoring, early identification and reporting of existing and emerging plant health threats
• Web monitoring of emerging animal infectious diseases integrated in the French animal health epidemic intelligence system
• Using AquaticHealth.net to detect emerging trends in aquatic animal health
• Using internet intelligence to manage biosecurity risks: a case study for aquatic animal health
• On behalf of the early alerting, reporting project of the Global Health Security Initiative, evaluation of epidemic intelligence systems integrated in the early alerting and reporting project for the detection of A/H5N1 influenza events
• International epidemic intelligence at the Institut de Veille Sanitaire
• The new international health regulations: a revolutionary change in global health security, The New Zealand Med
• PADI-web: a multilingual event-based surveillance system for monitoring animal infectious diseases
• PADI-web: an event-based surveillance system for detecting, classifying and processing online news
• Monitoring online media reports for early detection of unknown diseases: insight from a retrospective study of COVID-19 emergence
• PADI-web: platform for automated extraction of animal disease information from the web
• Joint Research Centre, Institute for the Protection and the Security of the Citizen, How to Maximise Event-Based Surveillance Web-Systems: the Example of ECDC/JRC Collaboration to Improve the Performance of MedISys
• Text mining from the web for medical intelligence
• Mining knowledge from text using information extraction
• What is an ontology?
• A framework for enhancing spatial and temporal granularity in report-based health surveillance systems
• Web-a-where: geotagging web content
• Conditional random fields: probabilistic models for segmenting and labeling sequence data
• The Stanford CoreNLP natural language processing toolkit
• Proceedings of the ACL Interactive Poster and Demonstration Sessions
• Named entity recognition based on conditional random fields
• Location detection and disambiguation from twitter messages
• spaCy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing
• InfoXtract location normalization: a hybrid approach to geographic references in information extraction
• Extracting and exploring the geo-temporal semantics of textual resources
• Identification of terms for detecting early signals of emerging infectious disease outbreaks on the web
• Beautiful Soup documentation
• Customized neural machine translation with Microsoft Translator
• Classifying disease outbreak reports using N-grams and semantic features
• Biological, Translational, and Clinical Language Processing
• An exploratory study of a text classification framework for internet-based surveillance of emerging epidemics
• Automatic online news monitoring and classification for syndromic surveillance
• Annotation of epidemiological information in animal disease-related news articles: guidelines and manually labelled corpus
• PADI-web corpus: labeled textual data in animal health domain
• Assessment of the accuracy of GeoNames gazetteer data
• Biomedical term extraction: overview and a new methodology
• Binary Codes Capable of Correcting Deletions
• An information-theoretic definition of similarity
• The impact of preprocessing on text classification
• Term-weighting approaches in automatic text retrieval
• Extraction and Combination of Epidemiological Information from Informal Sources for Animal Infectious Diseases Surveillance
• Social media and internet-based data in global systems for public health surveillance: a systematic review
• Regular expression based medical text classification using constructive heuristic approach