Language Resources for Historical Newspapers: the Impresso Collection

Maud Ehrmann*, Matteo Romanello*, Simon Clematide†, Phillip Benjamin Ströbel†, Raphaël Barman*
*Digital Humanities Laboratory, EPFL
†Institute for Computational Linguistics, Zurich University
*{maud.ehrmann, matteo.romanello, raphael.barman}@epfl.ch
†{siclemat, pstroebel}@cl.uzh.ch

Abstract
Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. While this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge (and real promise of digitization) is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet, the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published as part of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby to strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

Keywords: historical and multilingual language resources, historical texts, multi-layered historical semantic annotations, OCR, named entity processing, topic modeling, text reuse, digital humanities

1. Introduction
Digitization efforts are slowly but steadily contributing an increasing amount of facsimiles of cultural heritage documents. As a result, it is nowadays commonplace for many memory institutions to create and maintain digital repositories which offer rapid, time- and location-independent access to documents (or surrogates thereof), make it possible to virtually bring together dispersed collections, and ensure the preservation of fragile documents thanks to on-line consultation (Terras, 2011). Beyond this great achievement in terms of preservation and accessibility, the next fundamental challenge (and real promise of digitization) is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past' (Kaplan and di Lenardo, 2017).
In this regard, and following decisive grassroots efforts led by libraries to improve OCR (Optical Character Recognition) technology and generalize full-text search over historical document collections (see, e.g., the Impact project, http://www.impact-project.eu, and Trove, https://trove.nla.gov.au), the Digital Humanities (DH), Natural Language Processing (NLP) and Computer Vision (CV) communities are pooling forces and expertise to push forward the processing of facsimiles, as well as the extraction, linking and representation of the complex information enclosed in transcriptions of digitized collections. These interdisciplinary efforts were recently streamlined within the far-reaching Europe Time Machine project (https://www.timemachine.eu), which aims, in general, at the application of artificial intelligence technologies to cultural heritage data and, in particular, at achieving text understanding of historical material.

This momentum is particularly vivid in the domain of digitized newspaper archives, for which there has been a notable increase of research initiatives over the last years. Besides individual works dedicated to the development of tools (Yang et al., 2011b; Dinarelli and Rosset, 2012; Moreux, 2016; Wevers, 2019), or to the usage of those tools (Kestemont et al., 2014; Lansdall-Welfare et al., 2017), events such as evaluation campaigns (Rigaud et al., 2019; Clausner et al., 2019) or hackathons (see the 2017 edition of the Coding Da Vinci cultural hackathon: https://www.deutsche-digitale-bibliothek.de/content/journal/aktuell/kicking-coding-da-vinci-berlin?lang=en) based on digitized newspaper data sets have multiplied. Additionally, several large consortium projects proposing to apply computational methods to historical newspapers at scale have recently emerged, including ViralTexts (a project mapping networks of reprinting in 19th-century newspapers and magazines; US, 2012-2016; https://viraltexts.org), Oceanic Exchanges (a project tracing global information networks in historical newspaper repositories from 1840 to 1914; US/EU, 2017-2019; https://oceanicexchanges.org), impresso (https://impresso-project.ch), NewsEye (a digital investigator for historical newspapers; EU, 2018-2021; https://www.newseye.eu), and Living with Machines (a project which aims at harnessing digitised newspaper archives; UK, 2018-2023; https://www.turing.ac.uk/research/research-projects/living-machines) (Ridge et al., 2019). These efforts are contributing a pioneering set of text and image analysis tools, system architectures, and graphical user interfaces covering several aspects of historical newspaper processing and exploitation.

Yet, the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges (Sporleder, 2010; Piotrowski, 2012).
First, the language under study mostly corresponds to earlier language stages and usually features significant orthographic variation (Bollmann, 2019). Second, due to the acquisition process and/or the document conservation state, inputs can be extremely noisy, with errors which do not resemble tweet misspellings or speech transcription hesitations, for which adapted approaches have already been devised (Linhares Pontes et al., 2019a; Chiron et al., 2017; Smith and Cordell, 2018). Further, due to the diversity of the material in terms of genre, domain and time period, language resources such as corpora, benchmarks and knowledge bases that can be used for lexical and semantic processing of historical texts are rather sparse and heterogeneous. Finally, archives and texts from the past are not as anglophone as today's information society, making multilingual resources and processing capacities even more essential (Neudecker and Antonacopoulos, 2016).

Overall, and as demonstrated by Vilain et al. (2007), the transfer of NLP approaches from one domain or time period to another is not straightforward, and the performance of tools initially developed for homogeneous texts of the immediate past degrades when they are applied to historical material (Ehrmann et al., 2016). This echoes the statement of Plank (2016), according to whom what is considered standard or canonical data in NLP (i.e. the contemporary news genre) is more a historical coincidence than an objective reality: non-canonical, heterogeneous, biased and noisy data is more prevalent than is commonly believed, and historical texts are no exception. In this respect, and in light of the above, historical language(s) can be considered to belong to the family of less-resourced languages for which further efforts are still needed.

To help alleviate this deficiency, this paper presents a 'full-stack' historical newspaper data set collection composed of text and image resources produced, curated and published within the context of the 'impresso - Media Monitoring of the Past' project (https://impresso-project.ch). These resources relate to historical newspaper material in French, German and Luxembourgish and include: OCRed texts together with their related facsimiles and language models; benchmarks for article segmentation, black letter OCR and named entity processing; and multi-layer semantic annotations (named entities, topic modeling and text reuse). The objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents. More precisely, these resources can support:
(a) NLP research and applications dealing with historical language, with a set of 'ready-to-parse' historical texts covering 150 years in French and German, and a set of language models;
(b) model training and performance assessment for three tasks, namely article segmentation, OCR transcription and named entity processing (for the first time on such material for the latter), with manually transcribed and annotated corpora;
(c) historical corpus exploration and digital history research, with various stand-off semantic annotations.
To the best of our knowledge, the impresso resource collection represents the most complete historical newspaper data set series to date.
In the following, we introduce the impresso project (Section 2), present the impresso resource collection (Sections 3, 4 and 5), account for major existing historical language resources (Section 6), and conclude (Section 7).

2. Mining 200 years of historical newspapers: the impresso project
'impresso - Media Monitoring of the Past' is an interdisciplinary research project in which a team of computational linguists, designers and historians collaborate on the semantic indexing of a multilingual corpus of digitized historical newspapers. The project is funded by the Swiss National Science Foundation for a period of three years (2017-2020) and involves three main applicants: the DHLAB of the Ecole polytechnique fédérale de Lausanne (EPFL), the ICL of the University of Zurich, and the C2DH of the University of Luxembourg. The primary goals of the project are to apply text mining techniques to transform noisy and unstructured textual content into semantically indexed, structured, and linked data; to develop innovative visualization interfaces (https://impresso-project.ch/app/) enabling the seamless exploration of complex and vast amounts of historical data; to identify needs on the side of historians which may also translate into new text mining applications and new ways to study history; and to reflect on the usage of digital tools in historical sciences from a practical, methodological, and epistemological point of view.

In doing so, impresso addresses the challenges posed by large-scale collections of digitized newspapers, namely: (1) newspaper silos: due to legal restrictions and digitisation policy constraints, data providers (libraries, archives and publishers) are bound to provide incomplete, non-representative collections which have been subjected to digitization and OCR processing of varying quality; (2) big, messy data: newspaper digital collections are characterised by incompleteness, duplicates, and abundant inconsistencies; (3) noisy, historical text: imperfect OCR, faulty article segmentation and the lack of appropriate linguistic resources greatly affect the robustness of image and text mining algorithms; (4) large and heterogeneous corpora: processing and exploitation require a solid system architecture and infrastructure, and interface design should favor efficient search and discovery of relevant content; and (5) transparency: critical assessment of the inherent biases in exploratory tools, digitized sources and the annotations extracted from them is paramount for an informed usage of data in a digital scholarship context.

With respect to source processing, impresso applies and improves a series of state-of-the-art natural language and image processing components which produce, in fine, a large-scale, multilingual, semantically indexed historical newspaper collection. The various lexical and semantic annotations generated thereby are combined and delivered to digital scholars via a co-designed, innovative and powerful graphical user interface. Furthermore, and this is the focus of the present paper, those sources and annotations are also published apart from the interface for further usage by cultural heritage partners and the DH and/or NLP communities. Finally, some of the text and image mining components are subject to systematic evaluation, for which ground truth data are produced.
All publicly released impresso resources, i.e. corpora, benchmarks and annotations, are published on the project's website (https://impresso-project.ch/project/datasets) and in the impresso Zenodo community (https://zenodo.org/communities/impresso), with detailed documentation. Table 2 summarizes the links and DOIs of the data sets.

3. Impresso Corpora
The first resource is a set of normalized, 'ready-to-process' newspaper textual corpora which, for copyright reasons, do not correspond to the full impresso newspaper collection accessible through the interface.

3.1. Original Sources
impresso gathers a consortium of Swiss and Luxembourgish research and cultural heritage institutions and focuses primarily on sources from these countries in French, German, and Luxembourgish. Provided by its partners (namely the Swiss National Library, the National Library of Luxembourg, the Media Center and State Archives of Valais, the Swiss Economic Archives, the journal Le Temps (Ringier group), the journal Neue Zürcher Zeitung, and other local and international data providers), the impresso original sources correspond, as of November 2019, to 76 newspapers. Concretely speaking, sources consist either of both OCR output and images, or of OCR output only. Images are thus either served online via the IIIF Image API of the impresso infrastructure, or accessed directly via the data provider's IIIF endpoint (IIIF is defined by the International Image Interoperability Framework, an interoperable technology and community framework for image delivery: https://iiif.io). Text and layout acquisition outputs (i.e. OCR and OLR) come, for their part, in a variety of METS/ALTO format flavors, sometimes complemented by proprietary formats of private service providers. Overall, the current collection amounts to ca. 77TB, text and images combined. More newspaper titles in French and English will be acquired and ingested during the last year of the project.

3.2. Legal Framework
Original sources are subject to copyright law and impresso has received permission from its partners to use them, provided that the legal terms of use are respected upon online access and/or download. More specifically, digital documents are subject to two different rights statements: (1) public domain, or unrestricted: documents are no longer in copyright and may be used without restriction for all purposes, including commercial ones; (2) academic use, or restricted: documents are still under copyright and their use is restricted to personal and/or academic purposes, with or without the possibility to download the text. The present impresso corpus release includes unrestricted documents and part of the restricted ones (for personal and academic usage). Depending on negotiations with data providers and on the inclusion of new collections, the situation is very likely to evolve in the future, and the impresso original source release will be complemented.

3.3. Source Processing
The original files provided by our partners encode the structure and the text of digital objects according to the METS/ALTO XML library standards. METS (Metadata Encoding and Transmission Standard, http://www.loc.gov/standards/mets) encodes various metadata as well as information on the physical and logical structure of the object, while ALTO (Analyzed Layout and Text Object, https://www.loc.gov/standards/alto) represents information on OCR-recognized texts, i.e. describes text and layout information (coordinates of columns, lines and words on a page).
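To make this description concrete, the following minimal sketch (not part of the impresso tool chain) shows how word tokens and their coordinates could be pulled out of an ALTO file with Python's standard library. The namespace URI corresponds to ALTO v3 and the attribute names follow the ALTO schema; both vary across versions and producers, so they may need adjusting.

```python
import xml.etree.ElementTree as ET

# Assumed ALTO v3 namespace; older/newer files use ns-v2#/ns-v4# instead.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"

def extract_words(alto_path):
    """Yield (text, x, y, width, height) for every <String> element on a page."""
    tree = ET.parse(alto_path)
    for string in tree.getroot().iter(f"{{{ALTO_NS}}}String"):
        # Some producers emit decimal coordinates, hence float() rather than int().
        yield (
            string.get("CONTENT"),
            float(string.get("HPOS")), float(string.get("VPOS")),
            float(string.get("WIDTH")), float(string.get("HEIGHT")),
        )

if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    for word, x, y, w, h in extract_words("page_example.alto.xml"):
        print(word, x, y, w, h)
```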
While very precise and complete, these XML files contain more information than necessary in a text mining context and are cumbersome to process. Moreover, the METS and ALTO schemas are flexible and libraries usually adapt them according to their text acquisition capacities, resulting in a variety of input variants. Combined with the existence of different file hierarchies, source identifiers and image mappings, as well as other OCR/OLR proprietary formats, these inputs require, to say the least, a great deal of processing before they can finally be parsed.

To this end, each library input is converted into 'canonical' files where information is encoded according to the impresso JSON schemas (https://github.com/impresso/impresso-schemas), from which 'ready-to-process' files can easily be derived. Defined iteratively and shared with other newspaper projects, these JSON schemas act as a central, common format which a) allows the seamless processing of various data sources; b) preserves only the information necessary for NLP processing and interface rendering; and c) drastically reduces file sizes, thereby allowing easier processing in distributed environments.

Schemas and converters are published and documented online and are not described further here. An important point to mention, though, is that we mint and assign unique, canonical identifiers to newspaper issues, pages, as well as content items (i.e. newspaper contents below the page level, such as articles, advertisements, images, tables, weather forecasts, obituaries, etc.).

3.4. Release
The impresso corpora are released in two versions, both distributed as compressed archives (bzip2) of data in newline-delimited JSON format: 1) the 'canonical' version, with a fine-grained logical and physical representation of newspaper contents, including image coordinates, and 2) the 'ready-to-process' version, which offers 'reconstructed' content item full texts, that is to say continuous strings not divided into OCR token units. This reconstruction significantly reduces the overhead when parsing the entire data set, which amounts to 145GB compressed (restricted and unrestricted).

The impresso corpus currently contains 76 newspapers: 50 from Switzerland and 26 from Luxembourg. As mentioned previously, contents are subject to different license regimes, depending on the permissions given by cultural heritage institutions and rights holders. In Table 1 we provide some basic statistics about our corpora, divided by license type. The release will contain all contents in the public domain (unrestricted), as well as those available for academic use and for which the text can be downloaded (restricted with download, negotiations ongoing). The released corpora amount to almost 10 billion tokens of textual content, covering a time span of more than 200 years (see Figure 1), and contain roughly 3 million images.

Figure 1: Distribution of tokens over years (whitespace tokenization was applied).

Number of items    Unrestricted    Restricted (with download)    Restricted (w/o download)    Total
# issues           79,746          337,163                       187,860                      604,769
# pages            399,363         4,132,821                     399,363                      4,931,547
# tokens           572,030,104     9,374,592,395                 2,641,896,310                12,588,518,809
# content items    1,461,700       38,948,561                    4,269,189                    44,679,450
# images           32,964          3,030,126                     417,732                      3,480,822

Table 1: Global statistics on the impresso corpora.
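As an illustration of how the released archives could be consumed, here is a minimal sketch that streams content items out of one bzip2-compressed, newline-delimited JSON file. The file name and the record fields used below ('id' and 'ft' for the reconstructed full text) are assumptions for illustration; the authoritative field definitions are those of the published impresso JSON schemas.

```python
import bz2
import json

def iter_content_items(path):
    """Stream records from a bzip2-compressed, newline-delimited JSON archive."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == "__main__":
    # Hypothetical archive name and field names; see the impresso JSON schemas
    # for the actual structure of 'canonical' vs. 'ready-to-process' records.
    for item in iter_content_items("GDL-1900.jsonl.bz2"):
        print(item.get("id"), len(item.get("ft", "")))
```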
3.5. Metadata
Contextual information about digital collections is essential and we attempt to provide as much information as possible, even though this is neither the core expertise nor part of the main objectives of the project. Impresso newspaper metadata corresponds to descriptive (e.g. title, dates, place of publication), structural (issue, page, content items), and administrative metadata (file timestamps, file creator, preservation metadata). These metadata were provided by the cultural institutions and, most of the time, completed by the impresso team (either technical or descriptive metadata). Since this metadata set is not intended to replace professional library information but is rather meant for statistical 'data science' purposes, each record contains links to authority information such as the original bibliographic notice and the library portal. Impresso newspaper metadata is encoded in JSON format, covers all newspapers, and is published under a CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).

4. Impresso Benchmarks
In order to support the training and evaluation of some processing components, several benchmarks were produced. They include material from both the restricted and unrestricted collections, for which rights clearance has been achieved. All are released under open licenses.

4.1. Article Segmentation Ground Truth
Exploration and automatic processing of digitized newspaper sources is greatly hindered by the sometimes low quality of legacy OCR and OLR (when present) processes: content items are incorrectly transcribed and incorrectly segmented. In an effort to address these shortcomings, impresso developed an approach for content item recognition and classification exploiting both textual and visual features (Barman et al., 2020). The objectives were, on the one hand, to filter out noisy or unwanted material before the application of subsequent NLP processes (e.g. removing all weather tables and title banners before running topic modeling or text reuse detection) and, on the other hand, to allow faceted search on content item types (e.g. restricting a keyword search to items of type 'editorial').

To this end, a set of newspaper images was manually annotated and several experiments were conducted (Barman, 2019). Although newspaper content items can be of many types (there is little to no agreement among historians and/or librarians about a 'base' taxonomy of newspaper content items), we chose to focus on four classes that were deemed suitable for developing a first prototype, as well as meaningful within the impresso context:
1. Feuilleton, i.e. an excerpt of a larger work published over time in several issues of a newspaper, corresponding to the French roman-feuilleton or the English serial;
2. Weather forecast, i.e. a text or image with a weather prediction, or even a report of past weather measurements;
3. Obituary, i.e. a small notice published by the relatives of a deceased person;
4. Stock exchange table, i.e. a table reporting the values of different national stocks.

Three newspapers from the French-speaking part of Switzerland covering a period of ca. 200 years (1798-2017) were considered for the annotation: the Gazette de Lausanne, the Impartial and the Journal de Genève. To obtain a diachronic ground truth, three issues were sampled every three or five years for the whole duration of each newspaper. The sampled images were annotated using the VGG Image Annotator v2.0.8 (VIA), a simple web interface for annotating images with annotation export in JSON format (Dutta and Zisserman, 2019).
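The sketch below shows how such a VIA JSON export could be loaded and flattened into (image, label, box) records. The field names ('regions', 'shape_attributes', 'region_attributes') follow the VIA 2.x export format as we understand it, and the attribute key holding the class label ('label') is a placeholder; both may need to be adapted to the actual dump.

```python
import json

def load_via_annotations(path, label_key="label"):
    """Flatten a VIA 2.x JSON export into (filename, label, x, y, width, height) tuples."""
    with open(path, encoding="utf-8") as f:
        project = json.load(f)
    records = []
    for entry in project.values():                 # one entry per annotated image
        for region in entry.get("regions", []):
            shape = region["shape_attributes"]      # e.g. {"name": "rect", "x": ..., ...}
            label = region["region_attributes"].get(label_key, "unknown")
            if shape.get("name") == "rect":
                records.append((entry["filename"], label,
                                shape["x"], shape["y"], shape["width"], shape["height"]))
    return records

if __name__ == "__main__":
    # Hypothetical file name; the impresso release ships VIA as well as COCO files.
    for rec in load_via_annotations("via_export.json"):
        print(rec)
```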
Concretely speaking, each annotated image is associated with the list of its regions (i.e. coordinates) and their corresponding labels. Overall, 4624 page scans were annotated (among which 1208 with at least one annotation), amounting to 2773 annotated regions. Work is ongoing and, once the models have reached a satisfying level of precision, they will be applied to the whole collection to filter out elements before text processing and to enable faceted search over content item types. This article segmentation data set (annotations and images) is published under a CC-BY-SA 4.0 license, using the VIA as well as the standard COCO object annotation format (http://cocodataset.org/#format-data) (Lin et al., 2014).

4.2. Black Letter OCR Ground Truth
We created a publicly available ground truth (i.e., a manually corrected version of the text) for black letter newspaper print in order to assess the OCR quality of the German-language Neue Zürcher Zeitung (NZZ) (Ströbel, Phillip Benjamin and Clematide, Simon, 2019). We sampled one front page per year for the long period during which the NZZ was published in black letter (1780-1947), resulting in a diachronic ground truth of 167 pages. We used the Transkribus tool (https://transkribus.eu/Transkribus) to complete the annotations. We published the ground truth as TIFF images and corresponding XML files (https://github.com/impresso/NZZ-black-letter-ground-truth). First experiments on improving the OCR for this kind of data showed that elaborate deep learning models (Weidemann et al., 2018) reach character accuracies of 99.52%, and that they are transferable to other newspaper data and to better images than those present in the ground truth (Ströbel and Clematide, 2019).

4.3. Named Entity Processing Ground Truth
After image segmentation and transcription, the last impresso benchmark relates to an information extraction task: named entity (NE) processing. NE processing tools are increasingly being used in the context of historical documents, and research activities in this domain target texts of different kinds (e.g. museum records, state-related documents, genealogical data, historical newspapers) and different tasks (NE recognition and classification, entity linking, or both). Experiments involve different time periods, focus on different domains, and use different typologies. This great diversity demonstrates how many and varied the needs (and the challenges) are, but also makes performance comparison difficult, if not impossible.

In this context, the impresso project organises a CLEF 2020 Evaluation Lab named 'HIPE' (Identifying Historical People, Places and other Entities) (Ehrmann et al., 2020) (see CLEF 2020: https://clef2020.clef-initiative.eu and HIPE: https://impresso.github.io/CLEF-HIPE-2020). The HIPE shared task puts forward two NE processing tasks, namely: (1) named entity recognition and classification (NERC), with two sub-tasks of increasing difficulty targeting high-level vs. finer-grained entity types, and (2) named entity linking.
The HIPE corpus is composed of content items from the impresso Swiss and Luxembourgish newspapers, as well as from American newspapers (from the Swiss National Library, the National Library of Luxembourg, and the Library of Congress, respectively), sampled on a diachronic basis. For each language, articles from four different newspapers were sampled on a decade time-bucket basis, according to the time span of the newspaper (the longest duration spans ca. 200 years). More precisely, articles were first randomly sampled from each year of the considered decades, with the constraints of having a title and more than 100 characters. Subsequently to this sampling, a manual triage was applied in order to keep journalistic content only and to remove undesirable items such as feuilletons, crosswords, weather tables, time schedules, obituaries, and whatever a human could not even read because of OCR noise.

This material was manually annotated according to the HIPE annotation guidelines (available at https://doi.org/10.5281/zenodo.3677171), derived from the Quaero annotation guide (http://www.quaero.org/media/files/bibliographie/quaero-guide-annotation-2011.pdf). Originally designed for the annotation of 'extended' named entities (i.e. more than the 3 or 4 traditional entity classes) in French speech transcriptions, the Quaero guidelines have furthermore been used on historical press corpora (Rosset et al., 2012). HIPE slightly recasts and simplifies them, considering only a subset of entity types and components, as well as of the linguistic units eligible as named entities.

The annotation campaign was carried out by the task organizers with the support of trilingual collaborators. We used INCEpTION as an annotation tool (Klie et al., 2018), with the visualisation of image segments alongside the OCR transcriptions. For each language, a sub-sample of the corpus was annotated by two annotators and inter-annotator agreement is computed, before and after adjudication. As of March 2020, 21,000 top-level entity mentions had been annotated and linked to Wikidata.

For each task and language the corpus is divided into training, dev and test data sets, with the only exception of English, for which only dev and test sets are produced. These manually annotated materials are released in IOB format with hierarchical information. Even though many evaluation campaigns on NE were organized over the last decades (e.g. MUC, IREX, ACE, CoNLL, KBP, ESTER, HAREM, QUAERO, GERMEVAL), only one considered French historical texts (Galibert et al., 2012) and, to the best of our knowledge, this is the first multilingual, diachronic, named entity-annotated historical corpus.
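As an illustration of how such IOB-encoded releases are typically consumed, the sketch below reads a generic CoNLL-style TSV file into sentences of (token, tag) pairs. The exact column layout of the HIPE files (number and names of annotation columns, comment lines) is defined in the shared task documentation, so the indices used here are assumptions.

```python
def read_iob(path, token_col=0, tag_col=1, delimiter="\t"):
    """Read a CoNLL/IOB-style file into a list of sentences of (token, tag) pairs.

    Assumes blank lines separate sentences and lines starting with '#' are comments;
    adjust the column indices to the actual HIPE column layout.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                if current:
                    sentences.append(current)
                    current = []
            elif not line.startswith("#"):
                cols = line.split(delimiter)
                current.append((cols[token_col], cols[tag_col]))
    if current:
        sentences.append(current)
    return sentences

if __name__ == "__main__":
    for sent in read_iob("hipe_sample.tsv")[:3]:   # hypothetical file name
        print(sent)
```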
5. Impresso Lexical and Semantic Annotations
Finally, a wealth of annotations as well as language models are automatically computed over the whole impresso collection. They include: at the lexical level, linguistic preprocessing (lemmatisation and historical spelling normalization), word embeddings, OCR quality assessment and n-grams; at the referential level, NE mentions and linked entities; at the conceptual level, topics, topic models, and topic-annotated content items; at the collection level, text reuse clusters and passages; and, finally, visual signatures of the photographs and pictures contained in the newspapers. These enrichments of our content items are represented as stand-off annotations and are released under CC-BY or CC-BY-SA 4.0 licenses. However, not all annotation data sets are fully ready at the moment; the following sections present those which are part of the current release.

5.1. OCR Quality Assessment
In order to automatically assess the loss of information due to OCR noise, we compute a simple OCR quality measure inspired by the spell-checker approach of Alex and Burns (2014). In our case, it basically corresponds to the proportion of words of a historical newspaper article that can be found in the Wikipedia corpus of the corresponding language. Given the multilingual nature of our texts and the large number of names in newspapers, this offers a practical approach, especially for German, where common nouns and proper nouns are capitalized. Before actually comparing the words, we normalise diacritical marks the same way as our text retrieval system Solr does before indexing the content. Therefore, for instance, we consider the frequently occurring OCR errors Bäle or Bàle as equivalent to the correct spelling of the town Bâle, because they are all normalized to the same string bale. The reason for this normalisation approach in OCR assessment is that we want to inform impresso users about the real loss of recall they should expect when actually running standard keyword queries over our text collection (Bäle will be found even if the user searches for Bàle, but Bâte would not return any result, and this is the loss we want to account for).

The OCR quality assessment is a number between 0 and 1 that is distributed along with our data as a stand-off annotation for each content item. Impresso interface users will probably quickly grasp the meaning of these numbers by simply being exposed to texts and their corresponding OCR quality assessment, and will learn to interpret them with respect to the type of article (e.g. stock market price listings contain many abbreviations that lower the score). As our approach is unsupervised, we still need to formally evaluate it, similarly to Alex and Burns (2014), by testing whether there is a reasonable correlation between the automatically computed quality and some ground truth character error rate.
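The measure described above can be pictured with the following sketch, which strips diacritics in a way comparable to Solr's folding and computes the share of an article's tokens found in a language-specific wordlist. The tokenizer, the toy wordlist and the folding details are simplifications of the actual impresso pipeline.

```python
import re
import unicodedata

def fold(token):
    """Lowercase and strip diacritics, so that 'Bäle', 'Bàle' and 'Bâle' all map to 'bale'."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def ocr_quality(text, wordlist):
    """Proportion of an article's tokens found in a (diacritics-folded) wordlist."""
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if fold(t) in wordlist)
    return known / len(tokens)

if __name__ == "__main__":
    # Toy wordlist; in practice it would be derived from Wikipedia for each language.
    wordlist = {fold(w) for w in ["le", "conseil", "de", "Bâle", "a", "décidé"]}
    print(ocr_quality("Le conseil de Bàle a déc1dé", wordlist))  # 5 of 6 tokens known
```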
5.2. Word Embeddings
As mentioned earlier, the full impresso collection cannot be distributed due to copyright restrictions. Having the material at hand, however, allows us to compute historical newspaper genre-specific lexical resources, such as word embeddings, that can be distributed to the public. Specifically, we build classical type-level word embeddings with fasttext (https://fasttext.cc). This choice is motivated by fasttext's support for subword modeling (Bojanowski et al., 2016), which is a useful feature in the presence of OCR errors. There has been recent work on top of fasttext which brings the embeddings of misspelled words even closer to those of their correct versions via supervised training material (Piktus et al., 2019).

Well-known drawbacks of type-level word embeddings are that (a) they force their users to adhere to the same tokenisation rules that their producers applied and, more severely, (b) they cannot differentiate the meanings of ambiguous words, or of words that change their meaning in certain constructions. The simple character-based approach proposed by Akbik et al. (2018) ('contextualized string embeddings', https://github.com/zalandoresearch/flair) has successfully tackled these two problems and led to excellent results for NER. Our own experiments with NER on noisy French historical newspapers additionally proved the resilience of these embeddings, trained on in-domain material, to OCR errors (Bircher, 2019).

Within the impresso interface, word embeddings are mainly used for suggesting similar words in the keyword search (including cross-lingually), thereby supporting query expansion with semantic or OCR noise variants. Query expansion is also offered for the lexical n-gram viewers. Two types of word embeddings derived from the impresso text material are published: character-based contextualized string embeddings and classical type-level word embeddings with subword information.
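As a rough illustration of how such type-level embeddings with subword information could be built, the following sketch uses the fasttext Python bindings on a plain-text dump of the corpus. The file names and the hyperparameters (dimension, character n-gram range) are illustrative, not the settings actually used for the published impresso embeddings.

```python
import fasttext  # pip install fasttext

# Hypothetical input: one preprocessed article per line, plain text.
CORPUS_FILE = "impresso_fr_plaintext.txt"

# Skipgram model with character n-grams (minn/maxn), which helps with OCR noise;
# the dimension and n-gram range below are illustrative values only.
model = fasttext.train_unsupervised(
    CORPUS_FILE,
    model="skipgram",
    dim=300,
    minn=3,
    maxn=6,
)

# Thanks to subword information, even OCR-garbled forms receive a vector.
print(model.get_nearest_neighbors("conseil"))
print(model.get_word_vector("consei1")[:5])

model.save_model("impresso_fr_fasttext.bin")
```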
5.3. Topic Models
The impresso web application supports faceted search with respect to language-specific topics (French, German, Luxembourgish). We use the well-known MALLET toolkit (http://mallet.cs.umass.edu), which allows the training and inference of topic models with Latent Dirichlet Allocation (Blei et al., 2003).

First, linguistic preprocessing is applied to the data. For POS tagging, the spaCy library (https://spacy.io) is used because of its robustness in the presence of OCR noise. However, spaCy lemmatization is not always very satisfactory, and further analyzers and sources are used to complement its results. For German, we rely mostly on the broad-coverage morphological analyser GERTWOL (http://www2.lingsoft.fi/doc/gertwol), and we are currently working on the problem of lemmatizing words with historical spelling and/or OCR errors (see Jurish (2012) for earlier work based on finite-state approaches for German). For French, we use the full-form lexicon Morphalou (http://www.cnrtl.fr/lexiques/morphalou) (ATILF, 2019) to complete lemma information not provided by spaCy. Dealing with the low-resourced Luxembourgish language is more difficult (although spaCy now has PoS tagging support for this language), mostly because of the many spelling variants and reforms this language has seen over the last 150 years.

Then, under the assumption that topics are more interpretable if they consist of nouns and proper nouns only, we reduce the corpus size by excluding all other parts of speech based on the information obtained from spaCy. As an additional benefit, this filtering drastically reduces the number of tokens of the corpus that topic modeling has to deal with. Next, topics are computed on this reduced, preprocessed material. Although the German part of the collection is of reasonable size, the French material is still too big for MALLET, and a sampling of articles containing at least 10 nouns and/or proper nouns is applied.

In order to keep the facets for topic search manageable and interpretable, and at the same time account for the diversity of contents found in newspapers, we set the number of topics for German and French to 100. For the French topics, we directly fit topic distributions for about a third of our overall data; topic inference with the model trained on this sample is used for the remaining articles. Topic inference also solves the problem that our collection is continuously growing, and that recomputing topic models from scratch each time is not feasible. Additionally, historians prefer to have semantically stable topic models for their work. Therefore, we also apply topic inference to newly added German texts.

Topic models, as well as topics and content item topic assignments, are released in JSON format (documented online at https://github.com/impresso/impresso-schemas). Topics are also available within the impresso web interface, where (a) they serve as search facets, i.e. users can restrict their search results to articles containing certain topics; (b) users can select topics as entry points to explore the topic-modeling-based soft clustering of articles over the entire corpus; or (c) they provide the basis for an article recommender system based on topic distribution similarity. Future work will focus on the evolution of topics over time and on cross-lingual topic modeling.
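To make the pipeline above concrete, here is a compressed sketch of the noun/proper-noun filtering and topic training steps. It uses gensim's LdaModel as a stand-in for MALLET and a small spaCy model; the model name, the number of topics shown and the toy documents are illustrative only.

```python
import spacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Small French pipeline as an example; impresso also processes German and Luxembourgish.
nlp = spacy.load("fr_core_news_sm")

def nouns_only(text):
    """Keep only lemmas of nouns and proper nouns, mirroring the impresso preprocessing."""
    return [tok.lemma_.lower() for tok in nlp(text) if tok.pos_ in {"NOUN", "PROPN"}]

docs = [
    "Le Conseil fédéral a discuté le budget de la Confédération.",
    "Le match de football a eu lieu hier soir à Genève.",
]
tokenized = [nouns_only(d) for d in docs]

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(toks) for toks in tokenized]

# MALLET is used in the actual pipeline; gensim's LDA is only a stand-in here,
# and two topics on two toy documents is purely for demonstration.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Inference on a new, unseen article (how newly ingested texts are handled).
new_doc = dictionary.doc2bow(nouns_only("Le budget du canton de Vaud est débattu."))
print(lda.get_document_topics(new_doc))
```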
5.4. Text Reuse
Text reuse can be defined as the meaningful reiteration of text beyond the simple repetition of common language. It is such a broad concept that it can be understood at different levels and studied in a large variety of contexts. In a publishing or teaching context, plagiarism can be seen as text reuse, should portions of someone else's text be repeated without appropriate attribution. In the context of literary studies, text reuse is often used as a synonym for literary phenomena like allusions, paraphrases and direct quotations.

Text reuse is a very common phenomenon in historical newspapers too. Nearly identical articles may be repurposed in multiple newspapers as they stem from the very same press release. In newspapers from the period before the advent of press agencies, text reuse instances are interesting for studying the dynamics of information spreading, especially when newspapers in the same language but from different countries are considered. In more recent newspapers text reuse is very frequent, as cut-and-paste journalism became an increasingly common practice.

We used passim (https://github.com/dasmiq/passim) (Smith et al., 2015) to perform the automatic detection of text reuse. Passim is an open source software that uses n-grams to effectively search for alignment candidates, the Smith-Waterman algorithm to perform the local alignment of candidate document pairs, and single-link clustering to group similar passages into text reuse clusters.

As a pre-processing step, we used passim to identify boilerplate within our corpus. This step allows us to reduce the input size by approximately 10%, by removing mostly short passages that are repeated within the same newspaper within a time window of 30 days. We then ran passim on the entire corpus after the boilerplate passages had been removed: passim outputs all text passages that were identified as belonging to a text reuse cluster. As opposed to boilerplate detection, text reuse detection explicitly targets reuse instances across two or more sources (i.e. newspapers).

We post-process passim's output to add the following information:
• size, i.e. the number of text passages in the cluster;
• lexical overlap, expressed as the proportion of unique tokens shared by all passages in a cluster;
• time delta: the overall time window covered by a given cluster (expressed in number of days);
• time gap: following Salmi et al. (2019), we compute the longest gap (expressed in number of days) between the publication of any two passages in a cluster.

This information is added to each text reuse cluster with the goal of easing the retrieval as well as the analysis of detected text reuse. Since passim detects several million clusters in the entire impresso corpus, we need to further characterize each cluster if we want to enable historians to find instances of text reuse that are of interest to them. Each of these additional dimensions characterizes a certain aspect of reuse: lexical overlap allows for distinguishing almost exact copies of a piece of news from re-phrasings or paraphrases; time delta is an indicator of the longevity of a given piece of news; and, finally, time gap captures the viral nature of news spreading, especially its pace of publication.

We release as a resource (in JSON format) the boilerplate and text reuse passages as detected by passim, as well as the additional information we compute at cluster level. This data can be used to filter out duplicates from the input corpus, given the detrimental effects that such duplicates have on semantic models (e.g. topics, word embeddings) (Schofield et al., 2017).

Text reuse information is currently used in the impresso interface as an additional navigation aid, as it points users to existing reuses of the news article in focus. Future upgrades of the interface will include a dedicated text reuse explorer, which will allow users to search over and browse through all text reuse clusters, and to filter them based on several criteria (i.e. size, lexical overlap, time gap, time delta).
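The cluster-level statistics listed above can be illustrated with the following sketch; the input structure (a list of passages with a date and a token list per cluster) is a simplified stand-in for passim's actual output, and the overlap and gap definitions are one plausible reading of the descriptions given in this section.

```python
from datetime import date

def cluster_stats(passages):
    """Compute size, lexical overlap, time delta and time gap for one text reuse cluster.

    `passages` is a list of dicts with a 'date' (datetime.date) and a 'tokens' (list of str).
    """
    token_sets = [set(p["tokens"]) for p in passages]
    shared = set.intersection(*token_sets)          # tokens shared by all passages
    union = set.union(*token_sets)
    dates = sorted(p["date"] for p in passages)
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]  # gaps between consecutive dates
    return {
        "size": len(passages),
        "lexical_overlap": len(shared) / len(union) if union else 0.0,
        "time_delta": (dates[-1] - dates[0]).days,
        "time_gap": max(gaps) if gaps else 0,
    }

if __name__ == "__main__":
    cluster = [
        {"date": date(1871, 1, 5), "tokens": "la dépêche annonce la victoire".split()},
        {"date": date(1871, 1, 9), "tokens": "une dépêche annonce la grande victoire".split()},
    ]
    print(cluster_stats(cluster))
```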
Dataset                                                  DOI
Impresso Historical Newspaper Textual Material           10.5281/zenodo.3706823
Impresso Newspaper Metadata                              10.5281/zenodo.3706833
Impresso OCR Quality Assessment                          10.5281/zenodo.3709465
Impresso OCR ground truth                                10.5281/zenodo.3333627
Impresso Article Segmentation Ground Truth               10.5281/zenodo.3706863
Impresso HIPE Shared Task Named Entity Gold Standard     10.5281/zenodo.3706857
Impresso Word Embeddings                                 10.5281/zenodo.3706808
Impresso Topic Modelling Data                            10.5281/zenodo.3706840
Impresso Text Reuse Data                                 10.5281/zenodo.3706850

Table 2: Impresso datasets DOIs.

6. Related Work
This section briefly summarizes previous efforts with respect to historical language resources. We focus here on historical newspapers and refer the reader to Sporleder (2010) and Piotrowski (2012) for further information on historical language in general.

Digitized newspaper corpora, understood here as consisting of both images and OCR, primarily exist thanks to the considerable efforts of national libraries, either as individual institutions or as part of consortia, e.g. the Europeana Newspapers project (Neudecker and Antonacopoulos, 2016). Those institutions are the custodians of these digital assets which, after having long been hidden behind digital library portals, are now increasingly making their way to the public via APIs and/or data dumps (e.g. the French National Library APIs, http://api.bnf.fr, and the National Library of Luxembourg open data portal, https://data.bnl.lu/data/historical-newspapers). The impresso corpora are by no means meant to compete with these repositories, but rather to complement them with derived, working 'secondary' versions of the material in a form that is suitable for NLP needs. To our knowledge, and since corpus preparation is often done by private companies mandated to develop digital portals, no 'ready-to-process' set of historical newspaper corpora such as the impresso one exists.

Several instances of OCR and article segmentation benchmarks exist thanks to, among others, the long-standing series of conferences and shared tasks organized by the document analysis community (in particular the ICDAR conferences, e.g. http://icdar2019.org/). The impresso annotated data sets are, in this regard, not new but complementary: German black letter ground truth is not common and, given the variety of historical newspaper material, article segmentation over page scans of different sources is beneficial.

With respect to word embeddings, the companion website of Hamilton et al. (2016) (https://nlp.stanford.edu/projects/histwords) provides word2vec embeddings for French and German derived from Google n-grams. More recently, Riedl (2019) released German word embedding data sets derived from historical newspapers.

In the last years, a few gold standards were publicly released for named entities: Galibert et al. (2012) shared a French named entity-annotated corpus of historical newspapers from the end of the 19th century, and Neudecker (2016) published four data sets of 100 pages each for Dutch, French, and German (including Austrian) as part of the Europeana Newspapers project. Besides, Linhares Pontes et al. (2019b) have recently published a data set for the evaluation of NE linking in which various types of OCR noise were introduced. In comparison, the HIPE corpus has a broader temporal coverage and additionally covers English.

Regarding topic modeling, Yang et al. (2011a) give an overview of earlier work on historical newspapers.

Finally, as far as text reuse is concerned, very few resources and/or benchmarks have been published to date. Franzini et al. (2018) have published a ground truth dataset to benchmark the detection of a specific type of text reuse (i.e. literary quotations). The Viral Texts project has published an online interface, the Viral Texts Explorer (https://viraltexts.northeastern.edu/clusters), which makes searchable and explorable the text reuse clusters extracted from 19th-century newspapers. A similar online interface was also provided by Salmi et al. (2019) for 13 million text reuse clusters extracted from the Finnish press (1771–1920).
7. Conclusion and Perspectives
We have presented a series of historical newspaper data sets, the impresso resource collection, composed of corpora, benchmarks, semantic annotations and language models in French, German, Luxembourgish and English, covering ca. 200 years. Produced in the context of a collaborative, interdisciplinary project which aims at enabling critical text mining of 200 years of newspapers, this collection includes different types of resources that can support the needs of several communities. The textual corpora we release are large-scale, diachronic, multilingual and of real-world OCR quality. Their availability will foster further research on NLP methods applied to historical texts (e.g. OCR post-correction, semantic drift, named entity processing). Similarly, our benchmarks will fill an important gap in the adaptation of existing approaches via e.g. transfer learning, as well as enable performance assessment and comparison. The language models will naturally find their use in many applications, while the lexical and semantic annotations will support historical corpus exploration and be suitable for use at public participatory events such as hackathons. As future work we intend to integrate more textual material (French and English notably), to release additional annotations (image visual signatures, historical n-grams and named entities), and to serialize our data in more formats in addition to JSON.

8. Acknowledgements
We warmly thank the impresso team as well as the student collaborators Camille Watter, Stefan Bircher and Julien Nguyen Dang for their annotation work. The authors also gratefully acknowledge the financial support of the Swiss National Science Foundation (SNSF) for the project impresso - Media Monitoring of the Past under grant number CRSII5 173719.

9. Bibliographical References
Akbik, A., Blythe, D., and Vollgraf, R. (2018). Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.
Alex, B. and Burns, J. (2014). Estimating and rating the quality of optically character recognised text. Pages 97–102. ACM Press.
ATILF. (2019). Morphalou. ORTOLANG (Open Resources and TOols for LANGuage), www.ortolang.fr.
Barman, R., Ehrmann, M., Clematide, S., Oliveira, S. A., and Kaplan, F. (2020). Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers (submitted). Journal of Data Mining and Digital Humanities. https://arxiv.org/abs/2002.06144.
Barman, R. (2019). Historical newspaper semantic segmentation using visual and textual features. Master thesis, EPFL.
Bircher, S. (2019). Toulouse and Cahors refer to locations, but T<