LIBER Webinar: Generating Metadata with Artificial Intelligence
libereurope.eu CC BY

HOST
Jeannette Frey
LIBER President
Director, Bibliothèque Cantonale et Universitaire (BCU) Lausanne
Jeannette.frey@bcu.unil.ch

SPEAKER
Martijn Kleppe
Head of Research, National Library of the Netherlands (KB)
Martijn.Kleppe@kb.nl

NOTES
○ The webinar is being recorded.
○ Slides and a recording will be shared by email after the webinar.
○ Questions? Put them in the chat box.
○ 10-15 minutes of discussion will take place following the presentations.

Generating metadata with AI
EXPERIENCE OF THE KB, NATIONAL LIBRARY OF THE NETHERLANDS
Martijn Kleppe – Head of Research
Martijn.kleppe@kb.nl | @martijnkleppe | www.kb.nl/martijnkleppe
https://twitter.com/martijnkleppe

Kleppe, M., Veldhoen, S., Waal-Gentenaar, M., Oudsten, B. den, & Haagsma, D. (2019). Exploration possibilities Automated Generation of Metadata. http://doi.org/10.5281/zenodo.3375192
https://www.kb.nl/en/news/2019/kb-explores-artificial-intelligence-to-generate-metadata
Sara Veldhoen, Meta van der Waal-Gentenaar, Dorien Haagsma, Brigitte den Oudsten

Outline
I. Introduction
II. Set-up experiment
III. Lessons learned
IV. Next steps

I.
INTRODUCTION

Introduction
• About me
• Research Department at KB, National Library of the Netherlands (18 fte)
• Topics: Digital Preservation, Public Library Research, Copyright, Data Science
• KB Research Agenda 2018-2022
https://zenodo.org/communities/kbnl/search?page=1&size=20
https://www.kb.nl/en/organisation/research-expertise

Introduction - KB Research Agenda
• 5 Research themes:
  • Information society
  • Publications
  • Access & Sharing
  • Customers
  • Impact
• 8 Research groups with KB colleagues from the whole organisation
https://www.kb.nl/en/organisation/research-expertise/research-agenda-2018-2022
https://doi.org/10.5281/zenodo.1254226

Introduction - KB Research Agenda
• Short term: Proof of Concept, internships, researcher-in-residence, workshops
• Long term: Collaborate with partners: academic, libraries, industry
• 1 Research group on (Semi-) automated metadata

Introduction – Research Group
• Literature review:
  • Media sector
  • Heritage institutes
  • Libraries
• Site visits
• Which part of the process do we focus on?

Introduction – Metadata process at KB

Introduction – Research Group
• Literature review:
  • Media sector
  • Heritage institutes
  • Libraries
• Site visits
• Which part of the process do we focus on?
• ICT with Industry Workshop

Introduction – ICT with Industry Workshop
• Dutch Research Council (NWO)
• Formulate use-case & get selected
• Small funding required (1.5K EUR)
• Full week
• 13 participants
• Working space & hotel
Lorentz Center Leiden
https://www.lorentzcenter.nl/lc/web/2019/1061/info.php3?wsid=1061&venue=Oort

II. SET-UP EXPERIMENT

[Diagram, built up over several slides, with the elements: Researcher, Search Interface, Repository, Publications, Manual Annotation of keywords]
• Since 1789: Physical Repository, physical publications
• Since 2003: Digital Repository, full text digital publications
• Since 2019: Physical & Digital Repository, physical & full text digital publications

Set up - Research question
[Diagram labels: Dissertations, Brinkman Topics, Mapping, + metadata]
Research question: How can we automatically label dissertations with relevant keywords from the Brinkman thesaurus?
Set up – Data & Thesaurus
• Data – Dissertations: Full text and metadata via 6 university libraries
• Thesaurus Brinkman - ‘Brinkeys’: 15K keywords, since 1885
Challenge: Map dissertations from the university libraries to titles in the KB Catalog

Set up – Data & Thesaurus
In the Ideal World:
Every thesis has an ISBN
Every author has an ORCID
Every thesis is in Dutch (or English)
A title is always written consistently
Author names are written consistently
All text is in UTF-8
Every university uses the same keywords consistently

Set up – Approaches
• Naive Baselines:
  • Lexical overlap between titles and Brinkeys
  • Lexical overlap between the universities' keywords and Brinkeys
• Methods:
  • Naive Bayes: simple machine learning algorithm that predicts a Brinkey on the basis of the words that appear in the title and/or a summary
  • Word Embeddings: neural networks that place the meaning of words in a continuous virtual “vector space”
  • fastText

Set up – Approaches
• Annif (http://annif.org/)
  • Finnish National Library
  • Use own thesaurus
  • Open Source
  • Combination of techniques
• Ariadne (https://www.oclc.org/research/themes/data-science/ariadne.html)
  • OCLC Research
  • Trained on a lot of data
  • Scores very well
  • Not open source

Set up – Results
Focus on Recall: if the system outputs a list of twenty possible Brinkeys, are the correct Brinkeys according to our thesaurus among them?

III. LESSONS LEARNED

Lessons
• Data, data, data: Quality, Amount
• Do not underestimate preprocessing
• How to keep up with researchers that go beyond state of the art?
• “The human perspective, expertise and skill will remain necessary for guaranteeing the quality that we as the KB, National Library of the Netherlands represent”
• Results still vague for cataloguers at KB
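The naive lexical-overlap baseline from the Set up – Approaches slide and the recall measure from the Set up – Results slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`tokenize`, `suggest_by_overlap`, `recall_at_k`), the tiny thesaurus, and the sample title are invented, and this is not the code used in the KB experiment. The sketch ranks thesaurus terms by how many of their words also occur in a title, then checks what share of the correct keywords appear among the top k suggestions (k = 20 on the slide).

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def suggest_by_overlap(title, thesaurus, k=20):
    """Rank thesaurus terms by the number of words they share with the title.

    Terms without any shared word are dropped; ties are broken alphabetically.
    """
    title_tokens = tokenize(title)
    scored = []
    for term in thesaurus:
        overlap = len(title_tokens & tokenize(term))
        if overlap > 0:
            scored.append((-overlap, term))
    scored.sort()
    return [term for _, term in scored[:k]]


def recall_at_k(suggested, correct, k=20):
    """Share of the correct keywords that appear in the first k suggestions."""
    if not correct:
        return 0.0
    hits = set(suggested[:k]) & set(correct)
    return len(hits) / len(correct)


# Hypothetical example data, not taken from the Brinkman thesaurus.
thesaurus = ["machine learning", "library science", "metadata", "dutch history"]
title = "Machine learning for metadata generation in library catalogues"
correct = ["machine learning", "metadata", "cataloguing"]

suggestions = suggest_by_overlap(title, thesaurus, k=20)
score = recall_at_k(suggestions, correct, k=20)
print(suggestions)
print(round(score, 2))
```

In this toy example the keyword "cataloguing" is missed because no word in the title matches it exactly ("catalogues" is a different token). Gaps of this kind are one reason to look beyond plain lexical overlap to the learned methods listed on the slide.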
NEXT STEPS

Next steps
[Diagram with the elements: Researcher, Search Interface, Physical & Digital Repository, Ingest physical & full text digital publications, Manual Annotation of keywords]
Follow up 1: Interface that suggests keywords to annotators
http://lab.kb.nl/
https://lab.kb.nl/tool/brinkeys-tool
Follow up 2: Apply techniques to other types of documents

Next steps
• Experiments with full text data:
  • Do we need full text or is title or summary sufficient?
  • Do we need different approaches per type of text?
  • Set up a dedicated & highly secure server with full text files
• Main focus on Annif:
  • Open source
  • Use own thesaurus
  • Active user community (http://swib.org/swib19/programme.html)
• Experiment with other types of materials: documents of the 16th & 17th century

ACKNOWLEDGEMENT
Fantastic participants ICT with Industry Workshop
Alex Brandsen, Leiden University
Hugo de Vos, Leiden University
Karen Goes, VU Amsterdam
Lin Huang, Leiden University
Hugo Huurdeman, University of Amsterdam
Aruembyeol Kim, VU Amsterdam
Sepideh Mesbah, TU Delft
Myrthe Reuver, Radboud University
Shenghui Wang, University of Twente & OCLC
Richard Zijdeman, IISG & Stirling University
Iris Hendrickx, Radboud University

Great colleagues at KB
Erik Vos, Arjan Dekker, Lida Zoutewelle, Angelique Tempels, Meta van der Waal-Gentenaar, Enno Meijers, Sara Veldhoen, Brigitte den Oudsten, Irene Wolters, Rene van der Ark, Willem Jan Faber, Dorien Haagsma

Interested in more? Working on similar challenges?
LET’S COLLABORATE!
Generating metadata with AI
EXPERIENCE OF THE KB, NATIONAL LIBRARY OF THE NETHERLANDS
Martijn Kleppe – Head of Research
Martijn.kleppe@kb.nl | @martijnkleppe | www.kb.nl/martijnkleppe

THANKS!
Questions? Please put them in the chat box.
Slides and a recording will be sent to all registered delegates.

Intro Extro slides.pdf
20191108 - Webinar Liber v0.pdf