id: work_j27wztqtp5ctncb7i7hico3lt4
author: Maud Ehrmann
title: Language Resources for Historical Newspapers: the Impresso Collection
date: 2020.0
pages: 11
extension: .pdf
mime: application/pdf
words: 8477
sentences: 660
flesch: 48
summary: In this context, this paper presents a collection of historical newspaper data sets composed of text and image material. In terms of genre, domain and time period, language resources such as corpora, benchmarks and knowledge bases that can be used for lexical and semantic processing of historical texts are rather sparse and heterogeneous. The data sets comprise historical newspaper material in French, German and Luxembourgish and include: OCRed texts together with their related facsimiles and language models; benchmarks for article segmentation, black letter OCR and named entity processing; and multi-layer semantic annotations (named entities, topic modeling and text reuse). With respect to source processing, impresso applies and improves a series of state-of-the-art natural language and image processing components which ultimately produce a large-scale, multilingual, semantically indexed historical newspaper collection. n-grams; at referential level, NE mentions and linked entities; at conceptual level, topics, topic models, and topic-annotated content items; at collection level, text reuse clusters and passages; and, finally, visual signatures of photographs and pictures contained in newspapers.
cache: ./cache/work_j27wztqtp5ctncb7i7hico3lt4.pdf
txt: ./txt/work_j27wztqtp5ctncb7i7hico3lt4.txt