key: cord-0290291-jks3ooyn
authors: Krauer, F.; Schmid, B. V.
title: Mapping the plague through natural language processing
date: 2021-04-30
journal: nan
DOI: 10.1101/2021.04.27.21256212
sha: 528c392b89d0bfc67923b51ac62820b5ec1d1632
doc_id: 290291
cord_uid: jks3ooyn

Plague has caused three major pandemics with millions of casualties over the past centuries. There is a substantial amount of historical and modern primary and secondary literature about the spatial and temporal extent of epidemics, the circumstances of transmission, and symptoms and treatments. Many quantitative analyses rely on structured data, but the extraction of specific information such as the time and place of outbreaks is a tedious process. Machine learning algorithms for natural language processing (NLP) can potentially facilitate the establishment of datasets, but their use in plague research has not yet been explored much. We investigated the performance of five pre-trained NLP libraries (Google NLP, Stanford CoreNLP, spaCy, germaNER and Geoparser.io) for the extraction of location data from a German plague treatise published in 1908, compared to the gold standard of manual annotation. Of all tested algorithms, we found that Stanford CoreNLP had the best overall performance, but spaCy showed the highest sensitivity. Moreover, we demonstrate how word associations can be extracted and displayed with simple text mining techniques in order to gain a quick insight into salient topics. Finally, we compared our newly digitised plague dataset to a re-digitised version of the famous Biraben plague list and updated the spatio-temporal extent of the second pandemic plague mentions. We conclude that all NLP tools have their limitations, but they are potentially useful to accelerate the collection of data and the generation of a global plague outbreak database.

The dissemination of historical plague across the globe has been studied and discussed for many decades.
The second pandemic entered Europe in 1347 [1], but it may have started a century earlier in Central Asia and China [2]. The pandemic caused millions of casualties before gradually disappearing from Europe and the Mediterranean in the 18th century. The third pandemic is thought to have started towards the end of the 18th century in Yunnan (China) [3]. It reached Hong Kong in 1894, from where it spread globally. Since the 19th century, and perhaps even earlier, many scholars have collected data on plague outbreaks in a more or less systematic manner. Among the earliest [...] information such as the size of the outbreaks, symptoms, treatments, control measures, putative transmission routes and ecological aspects. The body of publications is large, but narrative text must be converted into quantitative data in order to be usable for epidemiological analyses. However, the extraction of data from running text is time and labour intensive.

In the past few years, advances in machine learning algorithms and increasing computing efficiency [...]

The POS analysis partitions a running text into tokens (usually single words) and returns information about the morphological class of each token (e.g. nouns, verbs). The NER analysis identifies and classifies tokens or combinations of tokens into pre-defined categories based on rules (i.e. a dictionary), statistical predictions, or both. A special case of NLP NER is the extraction of geographical data from a text (geoparsing). Geoparsing consists of two main steps: 1. tagging, i.e. the identification of a geographical entity (toponym), and 2. geocoding, i.e. the linkage of the geographical entity with GIS data such as coordinates. The GIS information is usually looked up in a geographical gazetteer. In theory, both steps can be done by hand and/or separately, but automated workflows may be preferable because they are faster and potentially more reproducible.
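The two geoparsing steps can be illustrated with a minimal Python sketch (the paper's own code is in R). The dictionary-based tagger and the mini-gazetteer below are toy stand-ins for a trained NER model (e.g. spaCy) and a full gazetteer service such as GeoNames; the coordinates are illustrative.

```python
# Minimal two-step geoparsing sketch: (1) tagging, i.e. toponym
# recognition, and (2) geocoding, i.e. gazetteer lookup. The dictionary
# tagger and mini-gazetteer are toy stand-ins for a trained NER model
# and a real gazetteer; coordinates are illustrative.

GAZETTEER = {          # toponym -> (latitude, longitude)
    "Hong Kong": (22.3, 114.2),
    "Yunnan": (25.0, 101.5),
}

def tag_toponyms(text, known=GAZETTEER):
    """Step 1: identify geographical entities (here: substring lookup)."""
    return [name for name in known if name in text]

def geocode(toponyms, gazetteer=GAZETTEER):
    """Step 2: link each tagged toponym to GIS coordinates."""
    return {t: gazetteer[t] for t in toponyms if t in gazetteer}

sentence = "It reached Hong Kong in 1894, from where it spread globally."
coords = geocode(tag_toponyms(sentence))
```

In a real pipeline, the tagger would be a statistical NER model and the gazetteer lookup a web service query; the two-step structure stays the same.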
General NLP libraries have to be combined with a geocoding service to deliver the same results as a dedicated geoparser.

In general, text mining tools can accelerate the generation of large datasets, but their performance has to be sufficient to outweigh the errors arising from the automated process. The performance of these algorithms depends on the chosen model or algorithm, and on the structure and language of the text. Ideally, an NLP algorithm has a high recall or sensitivity (e.g. the proportion of locations that are correctly identified as locations) and a high specificity (e.g. the proportion of non-locations that are correctly identified as non-locations). Various NLP algorithms and libraries have been tested for modern English medical and non-medical texts, and their performances differ substantially (see e.g. [27,33]). The literature on performance evaluation of NLP libraries for historical texts is sparser.

[The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. This version was posted April 30, 2021; https://doi.org/10.1101/2021.04.27.21256212 doi: medRxiv preprint]

For example, the sensitivity and the precision of the Edinburgh Geoparser, a popular tool for historical [...]

In a first preprocessing step, we cleaned the raw OCR text manually. We removed interspersed tables, page numbers and page headers, and corrected misaligned text. We also removed end-of-line hyphenations and notes in the book margins that were erroneously included in the running text. We checked the text file for OCR errors by looking for special characters and for words that were not recognized by the Notepad++ Spell Checker.
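The recall/sensitivity and specificity measures defined above can be sketched as follows. The token positions and totals are illustrative; the paper's formal definitions of all measures are in its supplement Table S1.

```python
# Sketch of the tagging evaluation against a manually annotated gold
# standard. Token positions and totals are illustrative.

def evaluate(gold, pred, n_tokens):
    """gold, pred: sets of token positions tagged as locations;
    n_tokens: total number of tokens in the text."""
    tp = len(gold & pred)          # locations correctly tagged
    fp = len(pred - gold)          # non-locations tagged as locations
    fn = len(gold - pred)          # locations missed (false negatives)
    tn = n_tokens - tp - fp - fn   # non-locations correctly left untagged
    return {
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

metrics = evaluate(gold={1, 4, 7}, pred={1, 4, 9}, n_tokens=10)
```

With three gold locations, of which two are found plus one false positive, this yields a sensitivity of 2/3 and a specificity of 6/7.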
We then established the gold standard dataset of location toponyms, with both authors independently annotating the preprocessed text using the annotation tool WebAnno (version 3.5.9) [37]. We then compared the two annotations and established a consensus document. This list of toponyms contained all geographical entities in the text, irrespective of whether the location was linked to plague or not. We included all administrative place, region or country names as well as natural features such as "the Black Sea". Associative toponyms such as "the Bishop of Avignon" were excluded because they are not true locations. This gold standard list was used for the evaluation of the tagging performance of the various NLP libraries (see below). We then used this list to generate the final dataset of places with plague outbreaks. For this, we extracted text snippets of 50 characters before and after each toponym to obtain the context, and decided for each case individually whether it was linked to a specific plague outbreak. Furthermore, we extracted the corresponding years (usually a four-digit string) using regular expressions (regex) and allocated them manually to the corresponding toponyms. We also linked the referenced author names (i.e. the source of the information) with the corresponding places wherever available. Finally, we batch geocoded these locations using the REST [...]

Geoparser.io only returns toponyms and the corresponding GIS information, but not the tokenization of the complete text.
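The two curation aids described above, a regex for four-digit year strings and a window of 50 characters before and after each toponym, can be sketched as follows. The German example sentence is illustrative, not from the treatise.

```python
import re

# Sketch of the year regex and the +/-50-character context window used
# to decide whether a toponym is linked to a plague outbreak. The
# example sentence is illustrative.

YEAR = re.compile(r"\b\d{4}\b")  # four-digit year strings

def context_snippets(text, toponym, width=50):
    """Return the text surrounding each occurrence of `toponym`."""
    snippets = []
    for m in re.finditer(re.escape(toponym), text):
        start = max(0, m.start() - width)
        snippets.append(text[start:m.end() + width])
    return snippets

text = "Im Jahre 1348 wurde Avignon von der Pest heimgesucht."
years = YEAR.findall(text)
snippets = context_snippets(text, "Avignon")
```

The allocation of a year to a toponym still has to be done by hand, as the paper notes, because the order and format of years and places vary throughout the text.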
All algorithms accept running text except germaNER, which requires a priori tokenization. We therefore used the tokenization returned by spaCy as an input for germaNER. The German Stanford CoreNLP Java library (version 2018-10-05) was downloaded from the Stanford NLP GitHub page (https://stanfordnlp.github.io/CoreNLP/human-languages.html) and accessed through the R package coreNLP (version 0.4.2) [46]. SpaCy (v2.0) was downloaded and accessed through the [...] To facilitate the automated geoparsing approach, we removed all words or sentences in parentheses, which were mainly author names and references and thus irrelevant for the tagging.

We then assessed the performance of each of the five approaches in identifying toponyms, compared to the gold standard, using various indicators. For this, we first combined all results and the gold standard [...] The formal definitions of all measures are given in supplement Table S1.

[...] also accepted partial (fuzzy) matches for GeoNames. We then compared the performance of these two services to the geocoded gold standard dataset. For this, we combined the three datasets and calculated the Euclidean distances between the three centroid coordinates for each toponym. We assessed the performance only for exactly located entities. We considered two places a match if both types were "country" and the country ISO codes agreed.
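The centroid-distance comparison between geocoding results can be sketched as follows. The paper reports Euclidean distances between centroid coordinates; the haversine great-circle distance used here is a common alternative for latitude/longitude pairs. The coordinates are illustrative, and the 30 km / half-bounding-box threshold follows the matching rule for sub-national entities described in the Methods.

```python
from math import asin, cos, radians, sin, sqrt

# Sketch of the centroid-distance matching between a gold-standard
# geocode and a comparator service. Haversine distance stands in for
# the Euclidean distance reported in the paper; coordinates are
# illustrative.

EARTH_RADIUS_KM = 6371

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def is_match(std, comp, std_bbox_km, same_country):
    """Match if both points lie in the same country and their centroids
    are within 30 km (small entities) or half the bounding-box diameter
    (entities with a bounding box larger than 30 km)."""
    limit = 30 if std_bbox_km <= 30 else std_bbox_km / 2
    return same_country and haversine_km(std, comp) < limit
```

For country-level entities, the simpler ISO-code comparison described above applies instead.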
For entities that were not countries, we considered it a match if the standard and the comparator were in the same country and the Euclidean distance between the centroids of the standard and the comparator was less than 30 km (for small entities with a standard bounding box of up to 30 km), or less than half of the bounding box diameter of the standard (for larger entities with a standard bounding box diameter of more than 30 km). Based on the count of matches, we calculated the proportion of toponyms identified (i.e. whether there was a result or not) and the proportion of toponyms correctly identified for each approach. We also examined the mismatches and checked whether there was a potential regional or other bias in the geocoding. All geocoding [...]

[...] transmission for further analysis, as well as the word for mouse, which was historically often used as a synonym for rats. We then aimed to learn more about the context of these most frequent lemmata. Context cannot be extracted straightforwardly from a text, but we can use word embeddings (i.e. neighbouring words) to investigate which words frequently occur together. For this, we constructed a co-occurrence matrix, a technique that counts how often any two words occur together within a given window. The matrix was constructed with a window of five words before and after each lemma. We then chose the ten most frequently co-occurring words for each of the selected nouns above and visualized the connections between them with a network plot.

Finally, we summarized the spatial and temporal coverage of our dataset and compared it with a re-digitised version of Biraben's list (see supplemental Text S1). For this, we merged the two datasets by [...]

[...] Germanized spelling (e.g. "Hoschiarpur" for "Hoshiarpur"), Latin spelling (e.g.
"Centumcellae" for 288 "Civitavecchia"), composite entities (e.g. "Gurjewscher Kreis"), historic regions (e.g. "Podolien") or 289 ambiguous words (e.g. "Sind" is a location but also a conjugated verb form of "to be"). The only 290 token that was falsely identified as a location by all algorithms was "Santa Maria", which can be both 291 a church name as well as a place name. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 30, 2021. ; To evaluate the geocoding performances, we compared the 4087 locations from the gold standard 304 that were true plague locations and that were identifiable with the exact coordinates. Geonames As shown in Figure 2A , outbreaks in towns were discussed more than twice as much as outbreaks in 315 villages. Surprisingly, the word for clothes occurred as often as the word for rats. Figure 2B shows 316 that the dissemination by ships was featured mostly in the 14 th -15 th and from the mid-17 th to 19 th 317 century. Quarantine occurred from 1500 and onwards. The discussion of rats and mice was generally 318 sparse until the end of the 19 th century when the third pandemic started. It was at this time when the 319 relationship of rats, fleas and plague transmission was discovered by Simond [50]. It is however 320 unclear, whether in earlier epidemics dying rodents were rarely mentioned because the connection to 321 plague transmission was unknown at the time or because they were not the main actors in the plague 322 transmission cycle. We also explored how these most frequent topics related to each other and other 323 lemmata with a word embedding approach. The resulting word associations are shown in Figure 2C . As anticipated, the words plague, village, town, human and year co-occurred often. We also found a 325 word cluster describing the symptoms of plague (buboes, carbuncle, petechiae, fever). 
The "evil" was 326 also often associated with the arrival of ships and sailors in the ports, which often triggered quarantine. Interestingly, the word for clothes was strongly connected to house, sick people, bed, town, equipment 328 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 30, 2021. In the 18 th century the focus appears to have shifted to Eastern Europe and North Africa. Finally, in 370 the 19 th century the majority of outbreaks seemed to be reported in southeast Europe and West Asia. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 30, 2021. ; https://doi.org/10.1101/2021.04.27.21256212 doi: medRxiv preprint NLP libraries combined with geoparsers/geocoding tools are extremely useful to quickly generate 397 quantitative data, but they have some shortcomings when it comes to digitizing plague treatises. As 398 anticipated, these models cannot distinguish whether the mention of a geographical unit is related to 399 a specific plague outbreak or not. This information can only be extracted from the context, but 400 standard models are not trained to recognize these situations. In this study, we have checked the link 401 to a plague outbreak for each location entry manually, which is far from ideal. Moreover, the detection 402 of time units was not optimal. We did not test the year numbers recognition formally, but we observed 403 that Google, spaCy and Stanford CoreNLP don't differentiate between years and any other number. For our gold standard, we used regular expressions (regex), which can identify specific combinations 405 of letters or numbers. 
The final linking of a specific year with a specific plague location was again done manually, since the order of appearance and the format in which years and locations were reported were not consistent throughout the text. Thus, current NLP algorithms cannot replace manual work entirely. The decision to use these tools is therefore a trade-off between time gained and precision lost. For larger texts, it may be useful to perform a pilot study on a subset of the text and compare the manual annotation to an NLP approach, as we did in our study. If the sensitivity is above an acceptable level and the required additional manual effort is limited, NLP might be a suitable approach. In terms of performance, it is more important to have a high sensitivity than a high specificity, because it is easier to remove false positives from the results than to look for false negatives (missed locations) in the text. The main potential (and challenge) of NLP and geoparsing for plague research lies in custom-trained models and reproducible, automated workflows. Many of the analyses that we did manually or in separate steps can potentially be improved with an automated procedure. Preprocessing of the raw OCR text prior to applying the NLP algorithms is inevitable, but OCR errors [...]
References

- Laying the Corpses to Rest: Grain, Embargoes, and Yersinia pestis in the Black Sea
- Plague and the Fall of Baghdad (1258)
- Plague: A Disease Which Changed the Path of Human Civilization
- Die grossen Volkskrankheiten des Mittelalters: Historisch-pathologische Untersuchungen. Gesammelt und in erweiterter Bearbeitung
- Versuch einer geographischen Darstellung einiger Pestepidemien (1891)
- A history of epidemics in Britain
- Geschichte der Pestepidemien in Russland von der Gründung des Reiches bis [...]
- The "black death" [...]
- The second plague pandemic and its recurrences in the Middle East: 1347-1894
- Out of the West: Formation of a Permanent Plague Reservoir in South-Central Asia (1349-1356) and its Implications
- The Last Plague in the Baltic Region
- La peste dans les possessions insulaires du Grand Seigneur (xviie-xixe siècles)
- The Black Death in Iran, according to Iranian Historical Accounts from the Fourteenth through Fifteenth Centuries
- Digital epidemiology: what is it, and where is it going?
- A systematic review of natural language processing and text mining of symptoms from electronic patient-authored text data
- Automatically Analyzing Large Texts in a GIS Environment: The Registrar General's Reports and Cholera in the 19th Century
- Text mining and annotation of outbreak reports of the Third Plague Pandemic
- What's missing in geographical parsing? Lang Resour Eval
- Edinburgh geoparser for georeferencing digitized historical collections
- Climate and society in long-term perspective: Opportunities and pitfalls in the use of historical datasets
- Biraben 2.0: A Black Death Digital Archive
- A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures
- Mapping the plague through natural language processing
- Google Cloud Natural Language API
- Training and evaluating a German named entity recognizer with semantic generalization
- spaCy v2 (Explosion)
- GermaNER: Free Open German Named Entity Recognition Tool
- Wrappers Around Stanford CoreNLP Tools
- Wrapper to the 'spaCy' 'NLP' Library
- A Replicable Comparison Study of NER Software: StanfordNLP, NLTK
- Comparing the Performance of Text Analytics APIs, Part 1: The Bigger Players
- Digitizing historical plague
- Dangers of Noncritical Use of Historical Plague Data. Emerging Infectious Diseases 24
- Plague persistence in Western Europe: a hypothesis
- Biraben's lists of the plague epidemics of the second plague pandemic, 1346-c. 1690: problems, basis, uses. Annales de démographie historique n°138
- Putting Africa on the Black Death map: Narratives from genetics and history
- The plague that never left: restoring the Second Pandemic to Ottoman and Turkish history in the time of COVID-19

The R code and the digitised plague datasets are available in a public repository [38].

Competing interests statement