aghassibake-supporting-2020 ---- Supporting Data Visualization Services in Academic Libraries

The Journal of Interactive Technology and Pedagogy, Issue Eighteen
December 10, 2020

Negeen Aghassibake, University of Washington Libraries
Justin Joque, University of Michigan Library
Matthew L. Sisk, Navari Family Center for Digital Scholarship, University of Notre Dame

Abstract

Data visualization in libraries is not a part of traditional forms of research support, but it is an emerging area that is increasingly important given the growing prominence of data in, and as a form of, scholarship. In an era of misinformation, visual and data literacy are necessary skills for the responsible consumption and production of data visualizations and the communication of research results. This article summarizes the findings of Visualizing the Future, which is an IMLS National Forum Grant (RE-73-18-0059-18) to develop a literacy-based instructional and research agenda for library and information professionals with the aim of creating a community of praxis focused on data visualization. The grant aims to create a diverse community that will advance data visualization instruction and use beyond hands-on, technology-based tutorials toward a nuanced, critical understanding of visualization as a research product and form of expression. This article will review the need for data visualization support in libraries, review environmental scans on data visualization in libraries, emphasize the need for a focus on the people involved in data visualization in libraries, discuss the components necessary to set up these services, and conclude with the literacies associated with supporting data visualization.

Introduction

Now, more than ever, accurately assessing information is crucially important to discourse, both public and academic.
Universities play an important role in teaching students how to understand and generate information. But at many institutions, learning how to communicate findings from the research process effectively is considered idiosyncratic to each field or the express domain of a particular department (e.g., applied mathematics or journalism). Data visualization is the use of spatial elements and graphical properties to display and analyze information, and this practice may follow disciplinary customs. However, there are many commonalities in how we visualize information and data, and the academic library, at the heart of the university, can play a significant role in teaching these skills. In this article, we identify a number of challenges in teaching complex technological and methodological skills like visualization, and we outline a rationale and a strategy for implementing these types of services in academic libraries. The same argument can be made for any academic support unit, whether college, library, or independently based.

Why Do We Need Data Visualization Support in Libraries?

In many ways the argument for developing data visualization services in libraries mirrors the discussion surrounding the inclusion and extension of digital scholarship support services throughout universities. In academic settings, libraries serve as a natural hub for services that can be used by many departments and fields. Often, data visualization expertise (like GIS or text-mining expertise) is tucked away in a particular academic department, making it difficult for students and researchers from different fields to access it. As libraries already play a key role in advocacy for information literacy and ethics, they may also serve as unaffiliated, central places to gain basic competencies in associated information and data skills. Training patrons to accurately analyze, assess, and create data visualizations is a natural enhancement to this role.
Building competencies in these areas will aid patrons in their own understanding and use of complex visualizations. It may also help to create a robust learning community and knowledge base around this form of visual communication. In an age of “fake news” and “post-truth politics,” visual literacy, data literacy, and data visualization have become exceedingly important. Without knowing the ways that data can be manipulated, patrons are less capable of assessing the utility of the information being displayed or making informed decisions about the visual story being told. Presently, many academic libraries are investing resources in data services and subscriptions. Training students, faculty, and researchers in ways of effectively visualizing these data sources increases their use and utility. Finally, having data visualization skills within the library also comes with an operational advantage, allowing more effective sharing of data about the library.

We are the Visualizing the Future Symposia, an Institute of Museum and Library Services National Forum Grant-funded group created to develop instructional and research materials on data visualization for library professionals and to build a community of practice around data visualization. The grant was designed to address the lack of community around data visualization in libraries. More information about the grant is available at the Visualizing the Future website. While we have only included the names of the three main authors, this work is a product of the entire cohort, which includes: Delores Carlito, David Christensen, Ryan Clement, Sally Gore, Tess Grynoch, Jo Klein, Dorothy Ogdon, Megan Ozeran, Alisa Rod, Andrzej Rutkowski, Cass Wilkinson Saldaña, Amy Sonnichsen, and Angela Zoss.
We are currently halfway through our grant work and, in addition to providing publicly available resources for teaching visualization, are also in the process of collecting and synthesizing shared insights into developing and providing data visualization instruction. The present article represents some of the key findings of our grant work.

Current Environment

In order to identify some broad data visualization needs and values, we reviewed three environmental scans. The first was carried out at Duke University by Angela Zoss (2018), who is one of the co-investigators on the grant, and is based on a survey that received 36 responses from 30 separate institutions. The second, by S.K. Van Poolen (2017), focuses on an overview of the discipline and includes results from a survey of Big Ten Academic Alliance institutions and others. The final report, by Ilka Datig for Primary Research Group Inc (2019), provides a number of in-depth case studies. While none of the studies claims to provide an exhaustive list of every person or institution providing data visualization support in libraries, in combination they provide an effective overview of the state of the field.

Institutions

The combined environmental scans represent around thirty-five institutions, primarily academic libraries in the United States. However, the Zoss survey also includes data from the Australian National University, a number of Canadian universities, and the World Bank Group. The universities represented vary greatly in size and include large research institutions, such as the University of California Los Angeles, and small liberal arts schools, such as Middlebury and Carleton College. Some appointments were full-time, while others reported visualization as a part of other job responsibilities. In the Zoss survey, roughly 33% of respondents reported the word “visualization” in their job title.

Types of activities

The combined scans include a variety of services and activities.
According to the Zoss survey, the two most common activities (i.e., the activities that the most respondents said they engaged in) were providing consultations on visualization projects and giving short workshops or lectures on data visualization. Other services offered include providing internal data visualization support for analyzing and communicating library data; training on visualization hardware and spaces (e.g., large-scale visualization walls, 3D CAVEs); and managing such spaces and hardware.

Resources needed

These three environmental scans also collectively identify a number of resources that are critical for supporting data visualization in libraries. One of the key elements is training for new librarians, or librarians new to this type of work, both on visualization itself and on teaching and consulting on data visualization. The scans also note that teaching and supporting visualization software effectively requires access to the software and to learning materials, as well as ample time for librarians to learn, create, and experiment themselves so that they can be effective teachers. Finally, they outline the need for communities of practice across institutions and shared resources to support visualization.

It’s About the People

In all of our work and research so far, one important element seems worth stressing and calling out on its own: it is the people who make data visualization services work. Even visualization services focused on advanced instructional spaces or immersive and large-scale displays require expertise to help patrons learn how to use the space, maintain and manage technology, schedule events to create interest, and, especially in the case of advanced spaces, create and manage content to suggest the possibilities. An example of this is the North Carolina State University Libraries’ Andrew W.
Mellon Foundation-funded project “Immersive Scholar” (Vandegrift et al.), which brought visiting artists to produce immersive artistic visualization projects in collaboration with staff for the large scale displays at the library. We encourage any institution that is considering developing or expanding data visualization services to start by defining skill sets and services they wish to offer rather than the technology or infrastructure they intend to build. Some of these skills may include programming, data preparation, and designing for accessibility, which can support a broad range of services to meet user needs. Unsupported infrastructure (stale projects, broken technology, etc.) is a continuing problem in providing data visualization services, and starting any conversation around data visualization support by thinking about the people needed is crucial to creating sustainable, ethical, and useful services. As evidenced by both the information in the environmental scans and the experiences of Visualizing the Future fellows, one of the most consistently important ways that libraries are supporting visualization is through consultations and workshops that span technologies from Excel to the latest virtual reality systems. Moreover, using these techniques and technologies effectively requires more than just technical know-how; it requires in-depth considerations of design aesthetics, sustainability, and the ethical use and re-use of data. Responsible and effective visualization design requires a variety of literacies (discussed below), critical consideration of where data comes from, and how best to represent data—all elements that are difficult to support and instruct without staff who have appropriate time and training. Services Data visualization services in libraries exist both internally and externally. 
Internally, data visualization is used for assessment (Murphy 2015), marketing librarians’ skills and demonstrating the value of libraries (Bouquin and Epstein 2015), collection analysis (Finch and Flenner 2016), internal capacity building (Bouquin and Epstein 2015), and in other areas of libraries that primarily benefit the institution. External services, in contrast, support students, faculty, researchers, non-library staff, and community members. Some examples of services include individual consultations, workshops, creating spaces for data visualization (both physical and virtual), and providing support for tools. Some libraries extend visualization services into additional areas, like the New York University Health Sciences Library’s “Data Visualization Clinic,” which provides a space for attendees to share and receive feedback on their data visualizations from their peers (LaPolla and Rubin 2018), and the North Carolina State University Libraries’ Coffee and Viz Series, “a forum in which NC State researchers share their visualization work and discuss topics of interest” that is also open to the public (North Carolina State University Libraries 2015).

In order to offer these services, libraries need staff who have some interest in, and/or experience with, data visualization. Some models include functional roles, such as data services librarians or data visualization librarians. These functional librarian roles ensure that the focus is on data and data visualization, and that there is dedicated, funded time available to work on data visualization learning and support. It is important to note that if there is a need for research data management support, it may require a position separate from data visualization. Data services are broad and needs can vary, so some assessment of the community’s greatest needs would help focus functional librarian positions.
Functional librarian roles may lend themselves to external facing support and community building around data visualization outside of internal staff. A needs assessment can help identify user-centered services, outreach, and support that could help create a community around data visualization for students, faculty, researchers, non-library staff, and members of the public. Having a community focused on data visualization will make sure that services, spaces, and tools are utilized and meeting user needs. There is also room to develop non-librarian, technical data visualization positions, such as data visualization specialists or tool-specific specialist positions. These positions may not always have an outreach or community building focus and may be best suited for internal library data visualization support and production. Offering data visualization support as a service to users is separate from data visualization support as a part of library operations, and the decision on how to frame the positions can largely be determined by library needs. External data visualization services can include workshops, training sessions, consultations, and classroom instruction. These services can be focused on specific tools, such as Tableau, R, Gephi, and so on. They can be focused on particular skills, such as data cleaning and normalizing, dashboard design, and coding. They can also address general concerns, such as data visualization transparency and ethics, which may be folded into all of the services. There are some challenges in determining which services to offer: Is there an interest in data visualization in the community? This question should be answered before any services are offered to ensure services are utilized. If there are any liaison or outreach librarians at your institution, they may have deeper insight into user needs and connections to the leaders of their user groups. 
Are there staff members who have dedicated time to effectively offer these services and support your users? Is there funding for tools you want to teach? Do you have a space to offer these services? This does not have to be anything more complicated than a room with a projector, but if these services begin to grow, it is important to consider how effective they will be with a larger population. For example, a cap on the number of attendees for a tool-specific workshop might be needed to ensure the attendees receive enough individual support throughout the session. If any of these areas is not addressed, there will be challenges in providing data visualization services and support. Successful data visualization services have adequate staffing, access to the required tools and data, space to offer services (not necessarily a data wall or makerspace, but simply a space with sufficient room to teach and collaborate), and a community that is already interested in, and in need of, data visualization services.

Literacies

The skills that are necessary to provide good data visualization services are largely practical. We derive the following list from our collective experience, both as data visualization practitioners and as part of the Visualizing the Future community of practice. While the list is not meant to be exhaustive, these are the core competencies that should be developed to offer data visualization services, whether by an individual or across a team.

A strong design sense: Without an understanding of how information is effectively conveyed, it is difficult to create or assess visualizations. Thus, data visualization experts need to be versed in the main principles of design (e.g., Gestalt, accessibility) and how to use these techniques to effectively communicate visual information.
Awareness of the ethical implications of data visualizations: Although the finer details are usually assessed on a case-by-case basis, a data visualization expert should be able to recognize when a visualization is misleading and have the agency to decline to create biased products. This is a critical part of enabling the practitioner to be an active partner in the creation of visualizations.

An understanding of, if not expertise in, a variety of visualization types: Network visualizations, maps, glyphs, and Chernoff faces, for example. There are many specialized forms of data visualization and no individual can be an expert in all of them, but a data visualization practitioner should at least be conversant in many of them. Although universal expertise is impractical, a working knowledge of when particular techniques should be used is a very important literacy.

A similar understanding of a variety of tools: Some examples include Tableau, PowerBI, Shiny, and Gephi. There are many different tools in current use for creating static graphics and interactive dashboards. Again, universal expertise is impractical, but a competent practitioner should be aware of the tools available and capable of making recommendations outside their expertise.

Familiarity with one or more coding languages: Many complex data visualizations happen at the command line (at least partially), so an effective practitioner needs to be at least familiar with the languages most commonly used (likely either R or Python). Not every data visualization expert needs to be a programmer, but familiarity with the potential of these tools is necessary.

Conclusion

The challenges inherent in building and providing data visualization instruction in academic libraries provide an opportunity to address larger pedagogical issues, especially around emerging technologies, methods, and roles in libraries and beyond.
In public library settings, the need for services may be even greater, with patrons unable to find accessible training sources when they need to analyze, assess, and work with diverse types of data and tools. While the focus of our grant work has been on data visualization, the findings reflect the general difficulties of balancing the need and desire to teach tools and invest in infrastructure with the value of teaching concepts and investing in individuals. It is imperative that work teaching and supporting emerging technologies and methods focus on supporting the people and the development of literacies rather than just teaching the use of specific tools. To do so requires the creation of spaces and networks to share information and discoveries.

Bibliography

Bouquin, Daina and Helen-Ann Brown Epstein. 2015. “Teaching Data Visualization Basics to Market the Value of a Hospital Library: An Infographic as One Example.” Journal of Hospital Librarianship 15, no. 4: 349–364. https://doi.org/10.1080/15323269.2015.1079686.

Datig, Ilka. 2019. Profiles of Academic Library Use of Data Visualization Applications. New York: Primary Research Group Inc.

Finch, Jannette L. and Angela R. Flenner. 2016. “Using Data Visualization to Examine an Academic Library Collection.” College & Research Libraries 77, no. 6: 765–778. https://doi.org/10.5860/crl.77.6.765.

“Immersive Scholar.” Accessed June 26, 2020. https://www.immersivescholar.org/.

LaPolla, Fred Willie Zametkin and Denis Rubin. 2018. “The ‘Data Visualization Clinic’: a library-led critique workshop for data visualization.” Journal of the Medical Library Association 106, no. 4: 477–482. https://doi.org/10.5195/jmla.2018.333.

Murphy, Sarah Anne. 2015. “How data visualization supports academic library assessment.” College & Research Libraries News 76, no. 9: 482–486. https://doi.org/10.5860/crln.76.9.9379.

North Carolina State University Libraries. “Coffee & Viz.” Accessed December 4, 2019. https://www.lib.ncsu.edu/news/coffee–viz.

Van Poolen, S.K. 2017. “Data Visualization: Study & Survey.” Practicum study at the University of Illinois.

Zoss, Angela. 2018. “Visualization Librarian Census.” TRLN Data Blog. Last modified June 16, 2018. https://trln.github.io/data-blog/data%20visualization/survey/visualization-librarian-census/.

About the Authors

Negeen Aghassibake is the Data Visualization Librarian at the University of Washington Libraries. Her goal is to help library users think critically about data visualization and how it might play a role in their work. Negeen holds an MS in Information Studies from the University of Texas at Austin.

Matthew Sisk is a spatial data specialist and Geographic Information Systems Librarian based in Notre Dame’s Navari Family Center for Digital Scholarship. He received his PhD in Paleolithic Archaeology from Stony Brook University in 2011 and has worked extensively in GIS-based archaeology and ecological modeling. His research focuses on human-environment interactions, the spatial scale of environmental toxins, and community-based research.

Justin Joque is the Visualization Librarian at the University of Michigan. He completed his PhD in Communications and Media Studies at the European Graduate School and holds a Master of Science in Information (MIS) from the University of Michigan.

This entry is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
altman-building-2021 ---- Chapter 8

Building a Machine Learning Pipeline

Audrey Altman
Digital Public Library of America

As a new machine learning (ML) practitioner, it is important to develop a mindful approach to the craft. By mindful, I mean possessing the ability to think clearly about each individual piece of the process, and understanding how each piece fits into the larger whole. In my experience, there are many good tutorials available that will help you work with an individual tool, deploy a specific algorithm, or complete a single task. It is more difficult to find guidelines for building a holistic system that supports the entire ML workflow. My aim is to help you build just such a system, so that you are free to focus on inquiry and discovery rather than struggling with infrastructure and process. I write this as a software developer who has, at one time or another, been on the wrong end of all the recommendations presented here, and hopes to save you from similar headaches. Many of the examples and design choices are drawn from my experiences at the Digital Public Library of America, where I have worked alongside a very talented team of developers. This is by no means an exhaustive text, but rather a bit of pragmatic advice and a jumping-off point for further research, designed to give you a clearer idea of which questions to ask throughout your practice.

This article reviews the basic machine learning workflow, discussing design considerations along the way. It offers recommendations for data storage, guidelines on selecting and working with ML algorithms, and questions to guide tool selection. Finally, it describes some challenges with scaling up. My hope is that the insight presented here, combined with your good judgement, will empower you to get started with the actual practice of designing and executing a machine learning project.
Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 8

Algorithm selection

As you begin ingesting and preparing data, you’ll want to explore possible machine learning algorithms to run on your dataset. Choose an algorithm that fits your research question and data. If you’re not sure which algorithm to choose and not constrained by time, experiment with several different options and see which one yields the best results. Start by determining what general type of learning algorithm you need, and proceed from there to research and select one that specifically addresses your research question.

In supervised learning, you train a model to predict an output condition based on given input conditions; for example, predicting whether or not a patient has some disease based on their symptoms, or the topic of a news article based on keywords in the text. In order for supervised learning to work, you need labeled training data, meaning data in which the outcome is already known. Examples include records of symptoms in patients who were known to have the disease (or not), or news articles that have already been assigned topics.

Classification and regression are both types of supervised learning. In a classification problem, you are predicting one of a discrete number of possible outcomes. For example, “based on what I know about this book, will it make the New York Times Best Seller list?” is a classification problem because there are two discrete outcomes: yes or no. Classification algorithms include naive Bayes, decision trees, and k-nearest neighbor. Regression problems try to predict an outcome from a continuum of possibilities, for example, “based on what I know about this book, what will its retail price be?” Regression algorithms include linear regression and regression trees.

In unsupervised learning, the ML algorithm discovers a new pattern.
The training data is unlabeled, meaning there is no indication of how the data should be organized at the outset. A common example is clustering, in which the algorithm groups items together based on features it finds mathematically significant. Perhaps you have a collection of news articles (with no existing topic labels), and you want to discover common themes or topics that appear throughout the collection. The algorithm will not tell you what the themes or topics are, but will show which articles group together. It is then up to the researcher to work out the common thread.

In addition to serving your research question, your algorithm should also be a good fit for your data. Specific considerations will vary for each dataset and algorithm, so make sure you know the strengths and weaknesses of your algorithm and how they relate to the unique qualities of your dataset. For example, algorithms differ in their abilities to handle datasets with a very large number of features, handle datasets with high variance, efficiently process very large datasets, and glean meaningful intelligence from very small datasets. Is it important that your algorithm be easy to explain? Some algorithms, such as neural nets, function as black boxes, and it is difficult to decipher how they arrive at their decisions. Other algorithms, such as decision trees, are easy to understand. Can you prepare your data for the algorithm with a reasonable amount of preprocessing? Can you find examples of success (or failure) from people using similar datasets with the same algorithm? Asking these sorts of questions will help you to choose an algorithm that works well for your data, and will also inform how you prepare your data for optimal use.

Finally, consider whether or not you are constrained by time, hardware, or available toolsets. Different algorithms require different amounts of time and memory to train and/or execute. Different ML tools offer implementations of different algorithms.
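To make the supervised learning case concrete, here is a minimal sketch of a k-nearest neighbor classifier applied to the best-seller question described in this section. The two features (page count and the author's number of prior best sellers) and all of the data values are invented for illustration, and the hand-written `knn_predict` function is an assumption of this sketch rather than any particular library's implementation.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest labeled examples."""
    # `training` is a list of (features, label) pairs: this is the labeled training data.
    by_distance = sorted(training, key=lambda example: math.dist(example[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled data: (page_count, author_prior_bestsellers) -> made the list?
# A real project would also scale the features so that page count does not dominate.
books = [
    ((320, 4), "yes"),
    ((280, 3), "yes"),
    ((350, 5), "yes"),
    ((150, 0), "no"),
    ((600, 0), "no"),
    ((200, 1), "no"),
]

print(knn_predict(books, (300, 4)))
print(knn_predict(books, (180, 0)))
```

Because the outcome is one of two discrete labels, this is a classification problem; predicting a retail price from the same features would instead call for a regression algorithm.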
The machine learning pipeline

The metaphor of a pipeline is often used for a machine learning workflow. This metaphor captures the idea of data channeled through a series of sequential transformations. However, it is important to note that each stage in the process will need to be repeated and honed throughout the course of your project. Therefore, don’t think of yourself as building a single intelligent model, such as a decision tree or clustering algorithm. Instead, build a pipeline with pieces that can be swapped in and out as needed. Data flows through the pipeline and outputs a version of a decision tree, clustering algorithm, or other intelligent model. Throughout your process, you will tweak your pipeline, making many intelligent models. Eventually you will select the best model for your use case. To use another metaphor, don’t build a car, build an assembly line for making cars.

While the final output of a machine learning workflow is some sort of intelligent model, there are many factors that make repetition and iteration necessary. ML processes often involve subjective decisions, such as which data points to ignore, or which configurations to select for your algorithm. You will want to test different possibilities to see what works best. As you learn more about your dataset throughout the course of the project, you will go back and tweak parts of your process. You may discover biases in your data or algorithms that need to be addressed. If you are working collaboratively, you will be incorporating asynchronous feedback from members of your team. At some point, you may need to introduce new or revised data, or try a new tool or algorithm. It is also prudent to expect and plan for errors. Human errors are inevitable, and hardware errors, such as network timeouts or memory overloads, are common. For all of these reasons, you will be well-served by a pipeline composed of modular, repeatable steps, each with discrete and stable output.
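One way to picture a pipeline of modular, repeatable steps is sketched below. Everything here is an assumption made for illustration: the `run_pipeline` helper, the two cleanup steps, and the sample records are hypothetical and not drawn from any DPLA code. Each step is an ordinary function that can be swapped out, and each step's output is written to its own file so it can be inspected or reused.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(records, steps, out_dir):
    """Run each named step in order, saving every step's output to its own file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for number, (name, step) in enumerate(steps, start=1):
        records = step(records)
        # Discrete, stable output: one JSON file per step, numbered in run order.
        (out_dir / f"{number:02d}_{name}.json").write_text(json.dumps(records, indent=2))
    return records


# Hypothetical steps; any of them can be replaced without touching the others.
def drop_untitled(records):
    return [r for r in records if r.get("title")]


def normalize_titles(records):
    return [{**r, "title": r["title"].strip().lower()} for r in records]


steps = [("drop_untitled", drop_untitled), ("normalize_titles", normalize_titles)]
sample = [{"title": "  Moby Dick "}, {"title": ""}, {"title": "Emma"}]

# A temporary directory keeps this demonstration free of leftover files.
with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(sample, steps, tmp)
    saved = sorted(path.name for path in Path(tmp).iterdir())

print(result)
print(saved)
```

Trying a different cleanup rule then means editing one entry in `steps` and rerunning, rather than redoing the work by hand.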
A modular pipeline supports a batch processing workflow, in which whole datasets undergo a series of transformations. During each step of the process, a large amount of data (possibly the entire dataset) is transformed all at once and then incrementally stored. This can be contrasted with a real-time workflow, in which individual records are transformed instantaneously (e.g. a librarian updates a single record in a library catalog); or a streaming workflow, in which a continuous flow of data is pushed through an entire pipeline, often without incremental storage along the way (e.g. performing analysis on a continuous stream of new tweets). Batch processing is common in the research and development phase of an ML project, and may also be a good choice for a production system.

When designing any step in the batch processing pipeline, assume that at some point you will need to repeat it either exactly as is, or with modifications. Documenting your process lets you compare the outputs of different variations and communicate the ways in which your choices impact the final results. If you’re writing code, version control software can help. If you’re doing more manual data manipulations, such as editing data in spreadsheets, you will need an intentional system of documenting exactly which transformations you are applying to your data. It is generally preferable to automate processes wherever possible so that you can repeat them with ease and consistency.

A concrete example from my own experience demonstrates the importance of a pipeline that supports repetition. In my first ever ML project, I worked with a set of XML library data converted to CSV. I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results.
92 Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 8

This whole process took me countless hours, and when an updated dataset became available, there was no way to reproduce my painstaking cleanup process. I was stuck with outdated data, and my final output was doomed to grow more and more irrelevant as time wore on. Since then, I have always written repeatable scripts for all my data cleanup tasks.

Each decision you make will have an impact on the final results, so it is important to keep clear documentation and to verify your assumptions and hypotheses wherever possible. Sometimes there will be explicit tests to perform; at other times, you may just need to look at the data: make a quick visualization, perform a simple calculation, or glance through a sample of records. Be cognizant of the potential to introduce error or bias. For example, you could remove a field that you don’t think is important, but that would, in fact, have a meaningful impact on the final result. All of these precautions will strengthen confidence in your final outcomes and make them intelligible to your collaborators and other audiences.

The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results.

Data acquisition

The first step is to acquire the data that you will be using for your machine learning project. You may need to combine data from several different sources. There are many ways to acquire data, including downloading files, querying a database or API, or scraping web pages. Depending on the size of the source data and how it is made available, this can be a quick and simple step or the most challenging bottleneck in your pipeline.
However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. Having a raw, immutable copy of your initial dataset (or datasets) ensures that you can always go back to the beginning of your ML process and start over with exactly the same input. It will also save you from the possibility that the source data will change from beneath you, thereby compromising your ability to compare the outputs of different operations (for more on this, see the section on data storage). If possible, it’s often worthwhile to learn about how the original data was created, especially if you are getting data from multiple sources that differ in subtle ways.

Data preparation

Data preparation involves cleaning data and transforming it into an appropriate format for subsequent machine learning tasks. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you’ve started training and testing models.

The first step of data preparation is to parse your acquired data and transform it into a common, usable schema. Acquired data often comes in file formats that are good for data sharing, such as XML, JSON, or CSV. You can parse these files into whatever schema makes sense to manage the various transformations you want to perform, but it can help to have a sense of where you are headed. Your eventual choice of data format will likely be dictated by your ML algorithms; likely candidates include multidimensional arrays, tensors, matrices, and DataFrames. Look ahead to specific functions in the specific libraries you plan to use, and see what type of input data is required. You don’t have to use these same formats during your data preparations, though it can simplify the process.

Data cleanup and transformation is an art.
Data is messy, and the messier the data, the harder it is to analyze and uncover underlying patterns. Yet, we are only human, and perfect data is far beyond our reach. To strike a workable balance, focus on those cleanup tasks that you know (or strongly suspect) will have a significant impact on the final product. Cleanup and transformation operations include removing punctuation or stopwords from textual data, standardizing date and number formats, replacing missing or dummy values with a meaningful default, and excluding data that is known to be erroneous or atypical. You will select relevant data points, and you may need to represent them in a new way: a birth date becomes age range; a place name becomes geo-coordinates; a text document becomes a word density vector. There are many possible normalizations to perform, depending on your dataset and which algorithm(s) you plan to use. It’s not a bad idea to ensure that there’s a genuinely unique identifier for each record (even if you don’t see an immediate need for one).

This is also a good time to reflect on any biases that might be inherent in your data, and whether or not you can adjust for them; even if you cannot, understanding how they might impact the ML process will help you conduct a more nuanced analysis and frame your final results. At the very least, you can record biases in the documentation so that future researchers will be aware of them and react accordingly.

As you become more familiar with the data, you will likely hone your cleanup process and iterate through the steps multiple times. The more you can learn about the data, the better your preparations will be. During the data preparation phase, practitioners often make use of visualizations and query frameworks to picture their data holistically, identify patterns, and find errors or outliers. Some ML tools support these features out-of-the-box, or are intentionally interoperable with external query and visualization tools.
For a lightweight tool, consider spreadsheet or notebook software. Depending on your use case, it may be worthwhile to put your data into a temporary database or search index so that you can make use of a more sophisticated query interface.

Model testing and training

During the testing and training phase, you will build multiple models and determine which one gives you the best results. One of the main ways you will tune your model is by trying multiple combinations of hyperparameters. A hyperparameter is a value that you set before you run the learning process, which impacts how the learning process works. Hyperparameters control things like the number of learning cycles an algorithm will iterate through, the number of layers in a neural net, the characteristics of a cluster, or the number of decision trees in a forest. Often, you will also want to circle back to your data preparation steps to try different configurations, apply new enhancements, or address new problems and particularities that you’ve uncovered.

The process is deceptively simple: try out different configurations until you get a good result. The challenge comes when you try to define what constitutes a good (or good-enough) result. Measuring the quality of a machine learning model takes finesse. Start by asking: What would you expect to see if the model learned perfectly? Equally important, what would you expect to see if the model didn’t learn anything at all? You can often utilize randomness as a stand-in for no learning, e.g. “if a result was selected at random, the probability of the desired outcome would be X”. These two questions will help you to set benchmarks at both extremes of the realm of possible outcomes. Perfection is elusive, and the return on investment dwindles after a while, so be prepared to stop training once you’ve arrived at an acceptably good model.

In a supervised learning problem the dataset is split into training and testing datasets.
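The split into training and testing data, and the two benchmarks described above (perfect learning and no learning at all), can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names, the 80/20 split, the fixed seed, and the toy labels are all assumptions made for the example rather than recommendations.

```python
import random


def split_dataset(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (training, testing)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def random_baseline(possible_labels, actual, seed=0):
    """Accuracy of guessing labels at random: a stand-in for "no learning"."""
    rng = random.Random(seed)
    guesses = [rng.choice(possible_labels) for _ in actual]
    return accuracy(guesses, actual)


# Hypothetical labeled records: (record id, label).
records = [(i, "fiction" if i % 2 else "nonfiction") for i in range(100)]
training, testing = split_dataset(records)
test_labels = [label for _, label in testing]

floor = random_baseline(["fiction", "nonfiction"], test_labels)  # no learning
ceiling = accuracy(test_labels, test_labels)  # perfect learning
print(len(training), len(testing), floor, ceiling)
```

A trained model's score on the testing records can then be read against these two reference points: a useful model should land clearly above `floor`, and a score at `ceiling` on a real dataset is usually a reason to look for leakage between the two sets.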
The algorithm uses the training data to “learn” a set of rules that it can subsequently apply to new, unseen data to predict the outcome. The testing dataset (also called a validation dataset) is used to test how well the model performs. Often, a third dataset is held out as well, reserved for final testing after the model has been trained. This third dataset provides an additional bulwark against bias and overfitting. Results are typically evaluated based on some statistical measurement that is directly relevant to your research question. In a classification problem, you might optimize for recall or precision. In a regression problem, you can use formulas such as the root-mean-square deviation to measure how well the regression line matches the actual data points. How you choose to optimize your model will depend on your specific context and priorities.

Testing an unsupervised model is not as straightforward, since there is no preconceived notion of correct and incorrect categorization. You can sometimes rely on a known pattern in the underlying dataset that you would reasonably expect to be reflected in a successful model. There may also be characteristics of the final model that indicate success. For example, if you are working with a clustering algorithm, models with dense, well-defined clusters are probably better than sparse clusters with vague boundaries. In unsupervised learning, you may want to hold back some portion of your data to perform an independent validation of your results, or you may use the entire dataset to build the model; it depends on what type of testing you want to perform.

Application of results

As the final step of your workflow, you will use your intelligent model to perform some task. Perhaps you will use it for scholarly analysis of a dataset, or perhaps you will integrate it into a software product.
If it is the former, consider how to export any final data and preserve the artifacts of your project. If it is the latter, consider how the model, its outputs, and its continued maintenance will fit into existing systems and workflows. Planning for interoperability may influence decisions from tool selection to data formats and storage.

Immutable data storage

Immutable data storage can benefit the batch-processing ML pipeline, especially during the initial research and development phase. This type of data storage supports iteration and allows you to compare the results of many different experiments. Treating data as immutable means that after each significant change or set of changes to your data, you save a new snapshot of the dataset that is never edited or changed. It also allows you to be flexible and adaptive with your data model. Immutable data storage has become a popular choice for data-intensive or “big data” applications as a way to easily assemble large quantities of data, often from multiple sources, without having to spend time upfront crafting a strict data model. You may have heard the term “data lake” to refer to such large, unstructured collections of data. This can be contrasted with a “data warehouse”, which usually indicates a highly structured, centralized repository such as a relational database.

To demonstrate how immutable storage supports iteration and experimentation, consider the following scenario: You start with an input file my_data.csv, and then perform some cleanup operation over the data, such as converting all measurements in miles to kilometers, rounded to the nearest whole number. If you were treating your data as mutable, you might overwrite the original contents of my_data.csv with the transformed values. The problem with this approach comes if you want to test some alteration of your cleanup operation. Say, for example, you wanted to round all your conversions to the nearest tenth instead.
Since you no longer have your original data, you would have to start the entire ML process from the top. If you instead treated your data as immutable, you would keep my_data.csv in its original state, and save the output of your cleanup operation in a new file, say my_clean_data.csv. That way, you could return to my_data.csv as many times as you wished, try different operations on this data, and easily compare the results of these operations knowing the source data was exactly the same for each one. Think of each immutable dataset as a place in your process that you can safely reset to anytime you want to try something new or correct for some bias or failure.

To illustrate the benefits of a flexible data model, consider a mutable data store, such as a relational database. Before you put any data into the database, you would first need to design a system of tables with set fields and datatypes, and the relationships between those tables. This can feel like putting the cart before the horse, especially if you are starting with a dataset with which you are not yet intimately familiar, and you want the ability to experiment with different algorithms, all of which might require slightly different transformations on the original dataset. Revisiting the example in the previous paragraph, you might initially have defined your distance datatype as an integer (when you were rounding to the nearest whole number), and would later have to change it to a floating point number (when you were rounding to the nearest tenth). Making this change would mean altering the database schema and migrating all of the existing data to the new type, which is a nontrivial task, especially if you later decide to revert to the original type. By contrast, if you were working with immutable CSV files, it would be much easier to write out two files, one with each data type, and keep whichever one ultimately proved most effective.
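The miles-to-kilometers scenario can be sketched as follows. This is one possible illustration, not a prescribed implementation: the column names (`id`, `distance_miles`, `distance_km`), the output file names, and the `convert_miles_to_km` helper are hypothetical, and the sample rows are made up. The raw file is only ever read, and each variation of the cleanup writes its own new file.

```python
import csv
import tempfile
from pathlib import Path

KM_PER_MILE = 1.609344


def convert_miles_to_km(source, target, ndigits=None):
    """Read `source` and write converted distances to a new file `target`.

    ndigits=None rounds to a whole number; ndigits=1 rounds to tenths.
    The source file is only read, and an existing target is never overwritten.
    """
    if target.exists():
        raise FileExistsError(f"{target} already exists; snapshots are write-once")
    with source.open(newline="") as src, target.open("w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["id", "distance_km"])
        for row in csv.DictReader(src):
            km = round(float(row["distance_miles"]) * KM_PER_MILE, ndigits)
            writer.writerow([row["id"], km])


# A throwaway working directory with a tiny, made-up raw dataset.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "my_data.csv"
raw.write_text("id,distance_miles\n1,10\n2,2.5\n")

whole = workdir / "my_clean_data_whole.csv"
tenths = workdir / "my_clean_data_tenths.csv"
convert_miles_to_km(raw, whole)  # rounds to the nearest whole number
convert_miles_to_km(raw, tenths, ndigits=1)  # rounds to the nearest tenth
```

Both outputs are derived from the same untouched input, so they can be compared directly, and a third rounding rule could be tried later without redoing anything upstream.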
Throughout your ML process, you can create several incremental datasets that are essentially read-only. There’s no one correct data storage format, but ideally you would use something simple and space-efficient with the capacity to interoperate with different tools, such as flat files (plain text files without extraneous markup, such as TXT or CSV) or a compact columnar file format such as Parquet. Even if your data is ultimately destined for a different kind of datastore, such as a relational database or triplestore, consider using simple, immutable storage as an intermediary to facilitate iteration and experimentation. If you’re concerned about overwhelming your local drive, cloud storage is a good option, especially if you can read and write directly from your programs or software services.

One final benefit of immutable storage relates to scale. Batch processing workflows and immutable data storage work well with distributed data processing frameworks, such as MapReduce and Spark. Therefore, if you need to scale your ML project using distributed processing, the integration will be more seamless (for more, see the section on scaling up).

Organizing Immutable Data

Organizing immutable data stores can be a challenge, especially with multiple users. A little planning can save you from losing track of your experiments and results. A well-ordered directory structure, informative and consistent file names, liberal use of timestamps, and disciplined note-taking are simple but effective strategies. For example, say you were acquiring MARCXML records from an API feed, parsing out subject terms, and building a clustering algorithm around these terms. Let us explore one possible way that you could organize your data outputs through each step of the machine learning pipeline.

To enforce a naming convention, create a helper method that generates the output path for each run of a particular data process.
This output path includes the date and timestamp of the run. That way you won’t have to think about naming each individual file, and can avoid the phenomenon of a mess of files called my_clean_data.csv, my_cleaner_data.csv, my_final_cleanest_data.csv, etc. Your file path for the acquired data might be in the format:

myProject/acquisitions/marc_YYYYMMDD_HHMMSS.xml

In this case, “YYYYMMDD” represents the date and “HHMMSS” represents the timestamp. Your file path for prepared and cleaned data might be:

myProject/clean_datasets/subjects_YYYYMMDD_HHMMSS.csv

Finally, each clustering model you build could be saved using the file path pattern:

myProject/models/cluster_YYYYMMDD_HHMMSS

Following this general pattern, you can organize all of the outputs for your entire project. Using date and timestamps in the file name also enables easy sorting and retrieval of the most recent output.

For each data output, you will want to maintain a record of the exact input, any special attributes of the process (e.g. “this time I rounded decimals to the nearest hundredth”), and metrics that will help you determine success or failure of the process. If you can generate this information automatically for each process, all the better for ensuring an accurate record. One strategy is to include a second helper method in your program that will generate and write out a companion file to each data output. The companion file contains information that will help evaluate results, detect errors, perform optimizations, and differentiate between any two data outputs. In the example project, you could accompany the acquisition output with a text file detailing the exact API call used to fetch the data, the number of records acquired, and the runtime for the process.
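The two helper methods described above might look something like the following sketch. Everything here is an assumption made for illustration: the function names `output_path` and `write_companion`, the choice of JSON-formatted text for the companion file, the decision to place it beside the data file with a `.txt` extension, and the example API URL and record counts, which are invented placeholders.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def output_path(project_dir, stage, name, extension="", now=None):
    """Build <project_dir>/<stage>/<name>_YYYYMMDD_HHMMSS<extension>."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(project_dir) / stage / f"{name}_{stamp}{extension}"


def write_companion(data_path, details):
    """Write a .txt companion beside data_path, sharing its name and timestamp."""
    companion = data_path.with_suffix(".txt")
    companion.parent.mkdir(parents=True, exist_ok=True)
    companion.write_text(json.dumps(details, indent=2))
    return companion


# A fixed run time keeps the example repeatable; a real run would use now().
run_time = datetime(2020, 1, 2, 3, 4, 5)
project = Path(tempfile.mkdtemp()) / "myProject"

marc_path = output_path(project, "acquisitions", "marc", ".xml", now=run_time)
note = write_companion(
    marc_path,
    {
        "api_call": "https://example.org/api?set=demo",  # placeholder URL
        "records_acquired": 1200,  # made-up figure
        "runtime_seconds": 42.5,  # made-up figure
    },
)
print(marc_path.name, note.name)
```

Deriving the companion name from the data path, rather than generating a second timestamp, guarantees that the two files always share the same date and time in their names.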
Keeping companion files as close as possible to their outputs helps prevent accidental separation, so save it to:

myProject/acquisitions/marc_YYYYMMDD_HHMMSS.txt

In this case, the date and timestamp should exactly match that of its companion XML file.

When running processes that test and train models, you can include information in your companion file about hyperparameters and whatever metrics you are using to evaluate the quality of the model. In our example, the companion file to each cluster model may contain the file path for the cleaned input data, the number of clusters, and a measure of cluster variance.

Working with machine learning algorithms

New technologies and software advances make machine learning more accessible to “lay” users, by which I mean those of us without advanced degrees in mathematics or data science. Yet, the algorithms are complex, and you need at least an intuitive understanding of how they work if you hope to implement them correctly. I use the following three questions as a guide for understanding an algorithm. Keep in mind that any one project will likely make use of several complex algorithms along the way. These questions help ensure that I have the information I truly need, and avoid getting bogged down with details best left to mathematicians.

• What do the inputs and outputs of the algorithm mean? There are two parts to answering this question. First is the data structure, e.g. “this is a vector with 300 integers.” Second is knowing what this data describes, e.g. “each vector represents a document, and each integer specifies the number of times a particular word appears in that document.” You also need to be aware of specific implementation details: perhaps the input needs to be normalized in some way, perhaps the output has been smoothed (a technique that compensates for noisy data or outliers).
This may seem straightforward, but it can be a lot to keep track of once you’ve gone through several layers of processing and abstraction.

• What effect do different hyperparameters have on the algorithm? Part of the machine learning process is tuning hyperparameters, or trying out multiple configurations until you get satisfying results. Part of the frustration is that you can’t try every possible configuration, so you have to do some intelligent guesswork. Twiddling hyperparameters can feel enigmatic and unintuitive, since it can be difficult to predict their impact on the final outcome. The better you understand hyperparameters and their roles in the ML process, the more likely you are to make reasonable guesses and adjustments, though you should always be prepared for a surprise.

• Can you explain how this algorithm works to a layperson and why it’s beneficial to the project? There are two benefits to articulating a response to this question. First, it ensures that you really understand the algorithm yourself. And second, you will likely be called on to give this explanation to collaborators and other stakeholders. A good explanation will build excitement around the project, while a befuddling one could sow doubt or disinterest. It can be difficult to strike a balance between general summary and technical equations, since your stakeholders will likely include people with diverse backgrounds, so do your best and look for opportunities for people with different expertise to help refine your team’s understanding of the algorithm.

Learning more about the underlying math can help you make better, more nuanced decisions about how to deploy the algorithm, and is fascinating in its own right. In most cases, though, I have found that the above three questions provide a solid foundation for machine learning research.

Tool selection

Tool selection is an important part of your process and should be approached thoughtfully.
A good approach is to articulate and prioritize the needs of your team, and make selections that meet these needs. I’ve listed some possible questions for consideration below, many of which you will recognize as general concerns for any tool selection process.

• What sorts of features and interfaces do the tools offer? If you require a specific algorithm, the ability to make data visualizations, or query interfaces, you can find tools to meet these specific needs.

• How well do tools interoperate with one another, or with other parts of your existing systems? One of the advantages of a well-designed pipeline is that it will enable you to swap out software components if the need arises. For example, if your data is in a format that is interoperable with many systems, it frees you from being tied down to any specific tool.

• How do the tools align with the skill sets and comfort levels of your team? For example, consider what coding languages your collaborators know, and whether or not they have the capacity to learn a new one. If you have someone who is already a wiz with a preferred spreadsheet program, see if you can export data into a compatible file format.

• Are the tools stable, well-documented, and well-supported? Machine learning is a fast-changing field, with new algorithms, services, and software features being developed all the time. Something new and exciting that hasn’t yet been road-tested may not be worth the risk if there is a more dependable alternative. Furthermore, there tends to be more scholarship, documented use cases, and tutorials for older, more widely-adopted tools.

• Are you concerned about speed and scale? Don’t get bogged down with these considerations if you’re just trying to get a working pilot off the ground, but it can help to at least be aware of how problems are likely to manifest as your volume of data increases, or as you integrate into time-sensitive workflows.
You and your team can work through these questions and articulate additional requirements relevant to your specific context.

Scaling up

Scaling up in machine learning generally means that you need to work with a larger volume of data, or that you need processes to execute faster. Recent advances in hardware and software make the execution of complex computations orders of magnitude faster and more efficient than they were even a decade ago, and you can often achieve quite a bit by working on a personal computer. Yet, time is valuable, and it can be difficult to iterate and experiment effectively when individual processes take too long to execute.

There are many ML software packages that can help you make efficient use of whatever hardware you have, including your personal computer. Some examples at the time of writing are Apache Spark, TensorFlow, Scikit-learn, and Microsoft Cognitive Toolkit, each with its own strengths and applications. In addition to providing libraries for building and testing models, these software packages optimize algorithmic performance, memory resources, data throughputs, and/or parallel computations. They can make a remarkable difference in both processing speed and the amount of data you can comfortably handle. There are also services that allow you to submit executable code and data to the cloud for processing, such as Google AI Platform.

Managing your own hardware upgrades is not without challenge. You may be lucky enough to have access to a high-powered computer capable of accelerated processing. A common example is a computer with GPUs (graphics processing units), which break complex processes into many small tasks and run them in parallel. However, these powerful machines can be prohibitively expensive. Another scaling technique is distributed or cluster computing, in which complex processes are distributed across multiple computers, often in the cloud.
A cloud cluster can bring significant cost savings, but managing one requires specialized knowledge and the learning curve can be rather steep. It is also important to note that different algorithms require different scaling techniques. Some clustering algorithms, for example, scale well with GPUs but not with distributed computing.

Even with the right hardware and software, scaling up can be a tricky business. ML processes tend to have dramatic spikes in memory or network use, which can tax your systems. Not all ML algorithms scale well, causing memory use or execution time to grow exponentially as more data is added. Sometimes you have to add additional, complexity-reducing steps to your pipeline to handle data at scale. Some of the more common machine learning languages, such as Python and R, execute relatively slowly, putting the onus on developers to optimize operations for efficiency. In anticipation of these and other challenges, it is often a good idea to start with a scaled-down pilot or proof of concept, and not to underestimate the time and resources necessary to scale up from there.

Conclusion

New technologies make it possible for more researchers and developers to leverage the power of machine learning. Building an effective machine learning system means supporting the entire workflow, from data acquisition to final analysis. Practitioners must be mindful of how each implementation decision and subjective choice (from the way you structure and store your data to the algorithms you use to the ways you validate your results) will impact the efficiency of operations and the quality of learned intelligence. This article has offered some practical guidelines for building ML systems with modular, repeatable processes and intelligible, verifiable results.
There are many resources available for further research, both online and in your libraries, and I encourage you to consult with subject specialists, data scientists, mathematicians, programmers, and data engineers. May your data be clean, your computations efficient, and your results profound.

Further Reading

I include here a few suggestions for further reading on key topics. I have also found that in the fast-changing world of machine learning technologies, blogs, internet communities, and online classes can be a great source of information that is current, introductory, and/or geared toward practitioners.

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining. Boston: Pearson Addison Wesley. See chapter 2 for data preparation strategies. Later chapters introduce common classification and clustering algorithms.

Marz, Nathan and James Warren. 2015. Big Data: Principles and best practices of scalable real-time data systems. Shelter Island: Manning. “Part 1: Batch Layer” discusses immutable storage in depth.

Kleppmann, Martin. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Boston: O’Reilly. “Chapter 10: Batch Processing” is especially relevant if you are interested in scaling up.
bandyopadhyay-beyond-2021 ---- Beyond Node Embedding: A Direct Unsupervised Edge Representation Framework for Homogeneous Networks

Sambaran Bandyopadhyay (1) and Anirban Biswas (2) and Narasimha Murty (3) and Ramasuri Narayanam (4)

Abstract. Network representation learning has traditionally been used to find lower dimensional vector representations of the nodes in a network. However, there are very important edge driven mining tasks of interest to the classical network analysis community, which have mostly been unexplored in the network embedding space. For applications such as link prediction in homogeneous networks, vector representation (i.e., embedding) of an edge is derived heuristically just by using simple aggregations of the embeddings of the end vertices of the edge. Clearly, this method of deriving edge embedding is suboptimal and there is a need for a dedicated unsupervised approach for embedding edges by leveraging edge properties of the network. Towards this end, we propose a novel concept of converting a network to its weighted line graph which is ideally suited to find the embedding of edges of the original network. We further derive a novel algorithm to embed the line graph, by introducing the concept of collective homophily. To the best of our knowledge, this is the first direct unsupervised approach for edge embedding in homogeneous information networks, without relying on the node embeddings. We validate the edge embeddings on three downstream edge mining tasks. Our proposed optimization framework for edge embedding also generates a set of node embeddings, which are not just the aggregation of edges. Further experimental analysis shows the connection of our framework to the concept of node centrality.

1 Introduction

Network representation learning (also known as network embedding) has gained significant interest over the last few years.
Traditionally, network embedding [22, 12, 28] maps the nodes of a homogeneous network (where nodes denote entities of similar type) to lower di- mensional vectors, which can be used to represent the nodes. It has been shown that such continuous node representations outperform conventional graph algorithms [2] on several node based downstream mining tasks like node classification, community detection, etc. Edges are also important components of a network. From the point of downstream network mining analytics, there are plenty of network applications - such as computing edge betweenness centrality [20] and information diffusion [24] - which heavily depend on the infor- mation flow in the network. Compared to the conventional down- stream node embedding tasks (such as node classification), these tasks are more complex in nature. But similar to node based ana- lytics, there is a high chance to improve the performance of these tasks in a continuous lower dimensional vector space. Thus, it makes 1 IBM Research & IISc, Bangalore, email: sambband@in.ibm.com 2 Indian Institute of Science, Bangalore, email: anirbanb@iisc.ac.in 3 Indian Institute of Science, Bangalore, email: mnm@iisc.ac.in 4 IBM Research, Bangalore, email: ramasurn@in.ibm.com sense to address these problems in the context of network embed- ding via direct representation of the edges of a network. As a first step towards this direction, it is important to design dedicated edge embedding schemes and validate the quality of those embeddings on some basic edge-centric downstream tasks. (a) Synthetic Graph (b) node2vec (c) line2vec Figure 1: Edge Visualization: (a) We created a small synthetic net- work with two communities. So, there are three types of edges: Green (or red) edges with both the end points belonging to the green (or red respectively) community; Blue edges with end points belonging to two different communities. 
(b) Edge embeddings (8 dimensional) obtained by averaging the node2vec embeddings of the end vertices, visualized with t-SNE. (c) Direct edge embeddings (8 dimensional) obtained by line2vec, visualized with t-SNE. Clearly, line2vec is superior: it visually separates the edge communities, unlike the conventional way of aggregating node embeddings to obtain edge representation.
In the literature, there are indirect ways to compute the embedding of an edge in an information network. For tasks like link prediction, where a classifier needs to be trained on both positive (existing) and negative (not existing) edge representations, a simple aggregation function [12] such as vector average or Hadamard product has been used on the representations of the two end vertices to derive the vector representation of the corresponding edge. Typically node embedding algorithms use the homophily property [18] by respecting different orders of node proximities in a network. As the inherent objective functions of these algorithms are focused on the nodes of the network, using an aggregation function on these node embeddings to get the edge embedding could be suboptimal. We demonstrate the shortcoming of this approach in Figure 1, where the visualization of the edge embeddings derived by aggregating node embeddings (taking the average of the two end nodes) from node2vec [12] on a small synthetic graph does not maintain the edge community structure of the network. In contrast, the direct edge embedding approach line2vec, proposed in this paper, completely adheres to the community structure, as edges of different types are visually segregated in the t-SNE plot shown in Figure 1(c). So there is a need to develop algorithms for directly embedding edges (i.e., not via aggregating node embeddings) in information networks. We address this research gap in this paper in a natural way. Following are the contributions:
• We propose a novel edge embedding framework line2vec, for homogeneous social and information networks. To the best of our knowledge, this is the first work to propose a dedicated unsupervised edge embedding scheme which avoids aggregation of the end node embeddings.
• We exploit the concept of line graph for edge representation by converting the given network to a weighted line graph. We further introduce the concept of collective homophily to embed the line graph and produce the embedding of the edges of the given network.
• We conduct experiments on three edge-centric downstream tasks. Though our approach is proposed for embedding edges, we further show that a set of robust node embeddings, which are not just the aggregation of edges, are also generated in the process.
• We experimentally discover the non-trivial connection of the classical concept of node centrality with the optimization framework of line2vec.
The source code of line2vec is available at https://bit.ly/2kfiS2l to ease the reproducibility of the results.
Though edge centric network mining tasks such as edge centrality, network diffusion and link prediction can benefit from edge embeddings, applying edge embeddings to tackle them is non-trivial and needs a separate body of work. For example, finding central edges in the network amounts to detecting a subset of points in the embedding space which are diverse between each other and represent a majority of the other points. We leave them to be addressed in some future work.
arXiv:1912.05140v1 [cs.SI] 11 Dec 2019
2 Related Work and Research Gaps
Node embedding in information networks has received great interest from the research community. We refer the readers to the survey article [33] for a comprehensive survey on network embedding and cite only some of the more prominent works in this paragraph.
DeepWalk [22] and node2vec [12] are two node embedding approaches which employ different types of random walks to capture the local neighborhood of a node and maximize the likelihood of the node context. Struc2vec [23] is another random walk based strategy which finds similar embeddings for nodes which are structurally similar. A deep autoencoder based node embedding technique (SDNE) that preserves structural proximity is proposed in [31]. Different types of node embedding approaches for attributed networks are also present in the literature [35, 3, 9]. A semi-supervised graph convolution network based node embedding approach is proposed in [14] and further extended in GraphSAGE [13] which learns the node embeddings with different types of neighborhood aggregation methods on attributes. Recently, node embedding approaches based on semi-supervised attention networks [28], maximizing mutual information [29], and in the presence of outliers [4] have been proposed.
Compared to the above, representing edges in information networks is significantly less mature. Some preliminary works exist which use random walk on edges for community detection in networks [15] or to classify large-scale documents into large-scale hierarchically-structured categories [11]. [1] focuses on the asymmetric behavior of the edges in a directed graph for deriving node embeddings, but it represents a potential edge just by a scalar which determines its chance of existence. [25, 30] derive embeddings for different types of edges in a heterogeneous network, but their proposed method essentially uses an aggregation function inside the optimization framework to generate edge embeddings from the node embeddings. For knowledge bases, embedding entities and relation types in a low dimensional continuous vector space [5, 7, 10] has been shown to be useful. But, several fundamental concepts of graph embedding, such as homophily, are not directly applicable to them.
[19] proposes a dual-primal GCN based semi-supervised node embedding approach which first aggregates edge features by convolution, and then learns the node embeddings by employing a graph attention on the incident edge features of a node. To the best of our knowledge, [36] is the only work which proposes a supervised approach based on adversarial training and an auto-encoder, purely for edge representation learning in homogeneous networks. But their framework needs a large amount of labelled edges to train the GAN, which makes it restrictive for real world applications. Hence in this paper, we propose a task-independent unsupervised dedicated edge embedding framework for homogeneous information networks to address the research gaps.
3 Problem Description
An information network is typically represented by a graph $G = (V, E, W)$, where $V = \{v_1, v_2, \cdots, v_n\}$ is the set of nodes (a.k.a. vertices), each representing a data object. $E \subseteq \{(v_i, v_j) \mid v_i, v_j \in V\}$ is the set of edges. We assume $|E| = m$. Each edge $e \in E$ is associated with a weight $w_{v_i,v_j} > 0$ (1 if $G$ is unweighted), which indicates the strength of the relation. The degree of a node $v$ is denoted as $d_v$, which is the sum of weights of the incident edges. $N(v)$ is the set of neighbors of the node $v \in V$. For the given network $G$, edge representation learning is to learn a function $f : e \mapsto x \in \mathbb{R}^K$, i.e., it maps every edge $e \in E$ to a $K$ dimensional vector called edge embedding, where $K < m$. These edge embeddings should preserve the underlying edge semantics of the network, as described below.
Edge Importance: Not all the edges in a network are equally important. For example, in a social network, millions of fans can be connected to a movie star. But any two fans of a movie star may not be similar to each other. So connections of this type are weaker compared to an edge which connects two friends who individually have far fewer connections [16].
Edge Proximity: The edges which are close to each other in terms of their topology or semantics should have similar embeddings. Similar to the concepts of node proximities [31], it is easy to define first and higher order edge proximities via the incidence matrix.
4 Solution Approach: line2vec
We propose an elegant solution (referred to as line2vec) to embed each edge of the given network. First we map the network to a weighted line graph, where each edge of the original network is transformed into a node. Then we propose a novel approach for embedding the nodes of the line graph, which essentially provides the edge embeddings of the original network. For simplicity of presentation, we assume that the given network is undirected. Nevertheless, it can trivially be generalized for directed graphs.
4.1 Line Graph Transformation
Given an undirected graph $G = (V, E)$, the line graph $L(G)$ is the graph such that each node of $L(G)$ is an edge in $G$ and two nodes of $L(G)$ are neighbors if and only if their corresponding edges in $G$ share a common endpoint vertex [32]. Formally $L(G) = (V_L, E_L)$ where $V_L = \{(v_i, v_j) : (v_i, v_j) \in E\}$ and $E_L = \{((v_i, v_j), (v_j, v_k)) : (v_i, v_j) \in E, (v_j, v_k) \in E\}$. Figure 2 shows how to convert a graph into the line graph [8]. Hence the line graph transformation induces a bijection from the set of edges of the given graph to the set of nodes of the line graph, $l : e \mapsto v$, where $\forall e \in E, \exists\, v \in V_L$, and if two edges $e_i, e_j \in E$ are adjacent there will be a corresponding edge $e \in E_L$ in the line graph.
Figure 2: Transformation process of a graph into its line graph. (a) Represents an information network $G$. (b) Each edge in the original graph has a corresponding node in the line graph. Here the green edges represent the nodes in the line graph. (c) For each adjacent pair of edges in $G$ there exists an edge in $L(G)$. The dotted lines here are the edges in the line graph. (d) The line graph $L(G)$ of the graph $G$.
4.2 Weighted Line Graph Formation
We propose to construct a weighted line graph for our problem even if the original graph is unweighted. These weights help the random walk in the later stage of line2vec to focus more on the relevant nodes in the line graph. It is evident from Section 4.1 that a node of degree $k$ in the original graph $G$ produces $k(k-1)/2$ edges in the line graph $L(G)$. Therefore high degree nodes in the original graph may get over-represented in the line graph. Often many of these incident edges are not that important to the concerned node in the given network, but they can potentially change the movement frequency of a random walk in the line graph. We follow a simple strategy to overcome this problem. The goal is to ensure that the line graph not only reflects the topology of the original graph $G$ (which is guaranteed by the Whitney graph isomorphism theorem [32] in almost all cases) but also that the dynamics of the graph are not affected by the transformation process.
The edge weights are defined to facilitate a random walk on $L(G)$, as described in Section 4.3.1. Intuitively, if we start a random walk from a node $v_{ij} \equiv (v_i, v_j) \in L(G)$ and want to traverse to $v_{jk} \equiv (v_j, v_k) \in L(G)$, then it is equivalent to selecting the node $v_j \in G$ from $(v_i, v_j)$ and moving to $v_k \in G$. If $G$ is undirected, we define the probability of choosing $v_j$ to be proportional to $\frac{d_{v_i}}{d_{v_i} + d_{v_j}}$. Here, $d_{v_i}$ and $d_{v_j}$ are the degrees of the end point nodes of the edge $(v_i, v_j)$, and an edge in general is more important to the endpoint node having lower degree than to the other endpoint with a higher degree [16]. Then selecting $v_k$ is proportional to the edge weight of $e_{jk} \equiv (v_j, v_k) \in E$.
Hence, for any two adjacent edges $e_{ij} \equiv (v_i, v_j)$ and $e_{jk} \equiv (v_j, v_k)$, we define the edge weight for the edge $(e_{ij}, e_{jk})$ of the line graph $L(G)$ as follows:
$$w(e_{ij}, e_{jk}) = \frac{d_i}{d_i + d_j} \times \frac{w_{jk}}{\sum_{r \in N(v_j)} w_{jr} - w_{ij}} \qquad (1)$$
This completes the formation of the weighted line graph from any given network.
4.3 Embedding the Line Graph
Here we propose a novel approach to embed the nodes of the line graph. The line graph is a special type of graph which comes with some nice properties. Below is one important observation that we exploit in embedding the line graph.
Lemma 1 Each (non-isolated) node in the graph $G$ induces a clique in the corresponding line graph $L(G)$.
Proof 1 Let us assume that a (non-isolated) node $v$ in the graph $G$ has $n_v$ edges connected to it. So these $n_v$ edges are neighbors of each other. Hence in the corresponding line graph $L(G)$, each of these edges would be mapped to a node and each of these nodes is connected to all the other $n_v - 1$ nodes. Thus there is a clique of size $n_v$ induced in the line graph by node $v$.
This can be visualized in Fig. 2, where the node 1 in (a) with degree 3 induces a clique of size 3, consisting of the nodes (1,2), (1,3) and (1,4), in the corresponding line graph in (d). Lemma 1 is interesting because it tells us that the nodes of the line graph exhibit some collective property, rather than just a pairwise property. To clarify, in the given network, two nodes are pairwise connected by an edge, but in the line graph, a group of nodes form a clique. Pairwise homophily [18], which has been the backbone of many standard embedding algorithms [31], is not sufficient for embedding the line graph. Hence we propose a new concept, 'collective homophily', applicable to the line graph. We explain it below.
Figure 3: Collective Homophily ensures the embeddings of the edges which are connected via a common node in the network, stay within a sphere of small radius.
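The construction of Sections 4.1 and 4.2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' released implementation: the function name `weighted_line_graph`, the `(u, v, weight)` input format, and the choice to store one directed weight per ordered pair of adjacent edges (Eq. 1 is not symmetric in $e_{ij}$ and $e_{jk}$) are all hypothetical.

```python
from collections import defaultdict


def weighted_line_graph(edges):
    """Return {(e_ij, e_jk): weight} following Eq. 1.

    `edges` is a list of (u, v, weight) triples of an undirected simple graph.
    Line-graph nodes are edges of G, stored as sorted tuples (an assumption
    that requires node labels to be mutually comparable).
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w

    # Degree d_v is the sum of weights of the incident edges (Section 3).
    degree = {v: sum(nbrs.values()) for v, nbrs in adj.items()}

    weights = {}
    for j, nbrs in adj.items():          # j is the shared endpoint v_j
        for i, w_ij in nbrs.items():     # walk currently sits on edge (v_i, v_j)
            for k, w_jk in nbrs.items():  # candidate next edge (v_j, v_k)
                if i == k:
                    continue
                e_ij = tuple(sorted((i, j)))
                e_jk = tuple(sorted((j, k)))
                choose_j = degree[i] / (degree[i] + degree[j])
                choose_k = w_jk / (degree[j] - w_ij)
                weights[(e_ij, e_jk)] = choose_j * choose_k
    return weights


# A star with center 1 and three leaves: node 1 induces a clique of size 3.
star = weighted_line_graph([(1, 2, 1.0), (1, 3, 1.0), (1, 4, 1.0)])
```

For the star above, the three edges incident on node 1 become a clique in the line graph (Lemma 1), giving six directed weights; each equals $\frac{1}{1+3} \times \frac{1}{3-1} = 0.125$.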
4.3.1 Collective Homophily and Cost Function Formulation
We emphasize that all the nodes which are part of a clique in a line graph should be close to each other in the embedding space. One way to enforce collective homophily is to introduce a sphere (of small radius $R \in \mathbb{R}$) in the embedding space and ensure that the embeddings of the nodes (in the line graph) which are part of a clique remain within the sphere. Hence any two embeddings within a sphere are at most a distance of $2R$ apart from each other. The concept is explained in Fig. 3. The smaller the radius $R$, the closer the embeddings of neighboring edges are to each other, and hence the better the enforcement of collective homophily. Note that a sum of pairwise homophily losses in the embedding space may lead to some pairs being very close to each other while others may still be quite far apart. So, we formulate the objective function to embed the (weighted) line graph as follows.
Let us introduce some notation. Bold face letters like $\mathbf{u}$ (or $\mathbf{v}$) denote a node in the line graph $L(G)$, which can also be denoted by $\mathbf{u}_{uv}$ when the correspondence with the edge $(u,v) \in E$ in the original graph $G$ is required. Normal face letters like $u, v$ denote nodes in the given graph. $x_{\mathbf{v}} \in \mathbb{R}^K$ (equivalently $x_{uv}$) denotes the embedding of the node $\mathbf{v}_{uv}$ in the line graph (or the edge $(u,v) \in E$).
To map the nodes of the line graph to vectors, first we want to preserve different orders of node proximities in the line graph. For this, a truncated random walk based sampling strategy $S$ is used to provide a set of nodes $N_S(\mathbf{v})$ as context to a node $\mathbf{v}$ in the network. Here we employ the random walk proposed by [12], which balances between the BFS and DFS search strategies in the graph. As the generated line graph is a weighted one, we consider the weights of the edges while computing the node transition probabilities. Let $X$ denote the matrix with each row as the embedding $x_{\mathbf{v}}$ of a node $\mathbf{v}$ of the line graph.
Assuming conditional independence of the nodes, we seek to maximize (w.r.t. $X$) the log likelihood of the context of a node as:
$$\sum_{\mathbf{v} \in V_L} \log P(N_S(\mathbf{v}) \mid x_{\mathbf{v}}) = \sum_{\mathbf{v} \in V_L} \sum_{\mathbf{v}' \in N_S(\mathbf{v})} \log P(\mathbf{v}' \mid x_{\mathbf{v}})$$
Each of the above probabilities can be represented using the standard softmax function parameterized by the dot product of $x_{\mathbf{v}'}$ and $x_{\mathbf{v}}$. As usual, we also approximate the computationally expensive denominator of the softmax function using some negative sampling strategy $\bar{N}(\mathbf{v})$ for any node $\mathbf{v}$. The above equation, after simple algebraic manipulations, leads to maximizing the following:
$$\sum_{\mathbf{v} \in V_L} \Big[ \sum_{\mathbf{v}' \in N_S(\mathbf{v})} x_{\mathbf{v}'} \cdot x_{\mathbf{v}} - |N_S(\mathbf{v})| \log \Big( \sum_{\bar{\mathbf{v}} \in \bar{N}(\mathbf{v})} \exp(x_{\bar{\mathbf{v}}} \cdot x_{\mathbf{v}}) \Big) \Big] \qquad (2)$$
Next, we implement the concept of collective homophily as proposed above. Each node $u \in V$ (in the original network) induces a clique in the line graph (Lemma 1). An edge $(u,v) \in E$ corresponds to the node $\mathbf{v}_{uv} \in V_L$ in the line graph. So we want all the nodes of the form $\mathbf{v}_{uv} \in V_L$, where $v \in N(u)$ (neighbors of $u$), to belong to a sphere centered at $c_u \in \mathbb{R}^K$ and of radius $R_u$. As collective homophily suggests that embeddings of these nodes must be close to each other, we minimize the sum of all such (squared) radii. This with Eq. 2 gives the final cost function of line2vec as follows.
$$\min_{X, R, C} \; \sum_{\mathbf{v} \in V_L} \Big[ |N_S(\mathbf{v})| \log \Big( \sum_{\bar{\mathbf{v}} \in \bar{N}(\mathbf{v})} \exp(x_{\bar{\mathbf{v}}} \cdot x_{\mathbf{v}}) \Big) - \sum_{\mathbf{v}' \in N_S(\mathbf{v})} x_{\mathbf{v}'} \cdot x_{\mathbf{v}} \Big] + \alpha \sum_{u \in V} R_u^2$$
$$\text{such that} \quad \|x_{uv} - c_u\|_2^2 \le R_u^2, \; \forall v \in N(u), \; \forall u \in V; \qquad R_u \ge 0, \; \forall u \in V \qquad (3)$$
Here, $\alpha$ is a positive weight factor. The constraint $\|x_{uv} - c_u\|_2^2 \le R_u^2$ ensures that the embeddings $x_{uv}$ belong to the sphere of radius $R_u$ centered at $c_u$. We use $R$ and $C$ to denote the set of all such radii and centers respectively.
4.3.2 Solving the Optimization
Equation 3 is a non-convex constrained optimization problem.
We use the penalty function technique [6] to convert this to an unconstrained optimization problem as follows:
$$\min_{X, R, C} \; \sum_{\mathbf{v} \in V_L} \Big[ |N_S(\mathbf{v})| \log \Big( \sum_{\bar{\mathbf{v}} \in \bar{N}(\mathbf{v})} \exp(x_{\bar{\mathbf{v}}} \cdot x_{\mathbf{v}}) \Big) - \sum_{\mathbf{v}' \in N_S(\mathbf{v})} x_{\mathbf{v}'} \cdot x_{\mathbf{v}} \Big] + \alpha \sum_{u \in V} R_u^2 + \lambda \sum_{u \in V} \sum_{v \in N(u)} g\big(\|x_{uv} - c_u\|_2^2 - R_u^2\big) + \sum_{u \in V} \gamma_u \, g(-R_u) \qquad (4)$$
Here the function $g : \mathbb{R} \to \mathbb{R}$ is defined as $g(t) = \max(t, 0)$. So it imposes a penalty on the cost function in Eq. 4 when the argument inside $g$ is positive, i.e., when there is a violation of the constraints in Eq. 3. We use a linear penalty $g(t)$ as the gradient does not vanish even when $t \to 0^+$. To solve the unconstrained optimization in Eq. 4, we use stochastic gradient descent, computing gradients w.r.t. each of $X$, $R$ and $C$. We take a subgradient of $g(t)$ at $t = 0$. All the penalty parameters $\lambda$ and $\gamma_u$ corresponding to the penalty functions are positive. When there is any violation of a constraint (or sum of constraints), the corresponding penalty parameter is increased to impose more penalty. We give more importance to the constraints of the type $R_u \ge 0$, as violating them may change the intuition of the solution. So we use a separate penalty parameter $\gamma_u$ for each of them, which makes it possible to penalize each such constraint differently. One can show that under appropriate assumptions, any convergent subsequence of solutions to the unconstrained penalized problems must converge to a solution of the original constrained problem [6].
Very small values of the penalty parameters might lead to the violation of constraints, and very large values would make the gradient descent algorithm oscillate. So, we start with smaller values of $\lambda$ and the $\gamma_u$'s and keep increasing them until all the constraints are satisfied or the gradients become too large, making abrupt function changes. Note that, theoretically some of the constraints in Eq. 3 may still be violated, but experimentally we found them satisfied to a large extent (Section 5).
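To make the penalty terms of Eq. 4 concrete, the following Python sketch evaluates only the collective homophily part of the cost (the $\alpha$, $\lambda$ and $\gamma_u$ terms) and deliberately leaves out the skip-gram term. The function and argument names (`collective_homophily_cost`, `edge_emb`, `centers`, `radii`) and the dictionary-based data layout are assumptions made for illustration; they are not taken from the line2vec source code.

```python
def g(t):
    """Linear penalty g(t) = max(t, 0) used in Eq. 4."""
    return max(t, 0.0)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def collective_homophily_cost(edge_emb, centers, radii, alpha, lam, gamma):
    """Radius term plus both penalty terms of Eq. 4 (skip-gram term omitted).

    edge_emb: {(u, v): embedding of edge (u, v)}, one entry per undirected edge
    centers:  {u: sphere center c_u}
    radii:    {u: radius R_u}
    gamma:    {u: penalty parameter gamma_u for the constraint R_u >= 0}
    """
    radius_term = alpha * sum(r * r for r in radii.values())

    # Spherical error: each edge embedding must lie inside the spheres of
    # BOTH of its endpoints, since v is in N(u) and u is in N(v).
    spherical_error = 0.0
    for (u, v), x_uv in edge_emb.items():
        for node in (u, v):
            violation = squared_distance(x_uv, centers[node]) - radii[node] ** 2
            spherical_error += g(violation)

    # Negative error: penalize radii that drop below zero.
    negative_error = sum(gamma[u] * g(-radii[u]) for u in radii)

    return radius_term + lam * spherical_error + negative_error


# One edge whose embedding sits at the center of a's sphere but outside b's.
cost = collective_homophily_cost(
    edge_emb={("a", "b"): [0.0, 0.0]},
    centers={"a": [0.0, 0.0], "b": [2.0, 0.0]},
    radii={"a": 1.0, "b": 1.0},
    alpha=0.1, lam=1.0, gamma={"a": 1.0, "b": 1.0},
)
```

In this toy case the radius term contributes $0.1 \times (1 + 1) = 0.2$ and the only violated constraint contributes $\lambda \times (4 - 1) = 3$, so `cost` is 3.2. In an SGD loop, $\lambda$ and the $\gamma_u$ values would be increased between iterations whenever the corresponding error stays positive, as described above.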
In the final solution, $x_{\mathbf{v}}$ gives the vector representation of node $\mathbf{v}$ of the line graph, which is essentially the embedding of the corresponding edge in the original network.
4.4 Key Observations and Analysis
Both the edge embedding properties mentioned in Section 3 are preserved in the construction and embedding of the weighted line graph. Particularly, if two edges have a common incident node in the original network, the corresponding two nodes in the transformed line graph would be neighbors. Also two edges having similar neighborhoods in the original network lead to two nodes having similar neighborhoods in the transformed line graph. The random walk and collective homophily preserve both pairwise and collective node proximity of the line graph in the embedding space. Thus different orders of edge proximities of the original network are captured well in the edge embeddings. Also the construction of edge weights in the line graph (Sec. 4.2) ensures that the underlying importance of edges of the original network is preserved in the transformed line graph, and hence in the embeddings through the truncated random walk.
Time Complexity: Edge embedding is computationally more difficult than node embedding, as the number of edges in a real life network is more than the number of nodes. From Lemma 1, each node $u$ in the original network induces a clique of size $d_u$ (degree of $u$ in $G$). Hence the total number of edges in the line graph is $m_L = \sum_{u \in V} \binom{d_u}{2} = \sum_{u \in V} \frac{d_u (d_u - 1)}{2} \le |V| d^2$, where $d$ is the maximum degree of a node in the given network. So, the construction of the line graph would take $O(|V| d^2)$ time. Next, we use an alias table for fast computation of the corpus of node sequences by the random walks in $O(m_L \log(m_L)) = O(|V| d^2 \log(|V| d))$, assuming the number of random walks on the line graph, the maximum length of a random walk, the context window size and the number of negative samples for each node to be constant, as they are the hyper parameters of the skip-gram model.
Then, the first term (under the sum over the nodes in $V_L$) of Eq. 4 can be computed in $O(|V_L|) = O(|E|)$ time. Next, the term weighted by $\alpha$ can be computed in $O(|V|)$ time. Then, for the term weighted by $\lambda$, we need to visit each node in $V$ and for each such node, its neighbors in the original graph, which can be computed in a total of $O(|E|)$ time. The last term of Eq. 4 can be computed in $O(|V|)$ time. As we use penalty methods to solve it, the runtime of solving Eq. 4 is $O(|E| + |V|)$. Hence the total runtime complexity of line2vec is $O(|V| d^2 \log(|V| d))$. So in the worst case (e.g., a nearly complete graph), the run time complexity is $O(|V|^3 \log |V|)$. But for most of the real life social networks, the maximum degree can be considered as a constant (i.e., does not grow with the number of nodes). Hence for them, the run time complexity is $O(|V| \log |V|)$.
5 Experimental Evaluation
We conduct detailed experiments on three downstream edge centric network mining tasks and thoroughly analyze the proposed optimization framework of line2vec.
5.1 Design of Baseline Algorithms
Unsupervised direct edge embedding for information networks is itself a novel problem. Existing approaches only aggregate the embeddings of the two end nodes to find the embedding of an edge. So as baselines, we only consider the publicly available implementations of a set of popular unsupervised node embedding algorithms which can work using only the link structure of the graph: DeepWalk, node2vec, SDNE, struc2vec and GraphSAGE (official unsupervised implementation for un-attributed networks). We have considered different types of node aggregation methods such as taking the average, Hadamard product, or vector norms of the two end node embeddings [12] to generate the edge embeddings for the baseline algorithms. It turns out that the average aggregation method performs the best among them.
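The baseline aggregation step can be illustrated with a short Python sketch. The function name `aggregate_edge` is hypothetical, and reading "vector norms" as the element-wise weighted-L1 and weighted-L2 operators of node2vec [12] is an assumption made here.

```python
def aggregate_edge(x_u, x_v, method="average"):
    """Derive an edge embedding from the embeddings of its two end nodes."""
    pairs = list(zip(x_u, x_v))
    if method == "average":
        return [(a + b) / 2 for a, b in pairs]
    if method == "hadamard":
        return [a * b for a, b in pairs]
    if method == "weighted_l1":   # assumed: element-wise |a - b|
        return [abs(a - b) for a, b in pairs]
    if method == "weighted_l2":   # assumed: element-wise (a - b)^2
        return [(a - b) ** 2 for a, b in pairs]
    raise ValueError(f"unknown aggregation method: {method}")


# Toy 2-dimensional node embeddings for the end nodes of one edge.
edge_avg = aggregate_edge([1.0, 2.0], [3.0, 4.0], method="average")
```

Every operator is symmetric in its two arguments, so the result does not depend on the order in which the end nodes of an undirected edge are listed.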
So we report the performance of the baseline methods with average node aggregation, where the embedding of an edge $(u,v)$ is computed by taking the average of the node embeddings of $u$ and $v$.
5.2 Datasets Used and Setting Hyper-parameters
We used five real world publicly available datasets for the experiments. A summary of the datasets is given in Table 1. For Zachary's karate club and Dolphin social network (http://www-personal.umich.edu/~mejn/netdata/), there are no ground truth community labels given for the nodes. So we use a modularity based community detection algorithm, and label the nodes based on the communities they belong to. For Cora, Pubmed (https://linqs.soe.ucsc.edu/data) and MSA [26], the ground truth node communities are available. The ground truth edge labels are derived as follows. If an edge connects two nodes of the same community (intra community edge), the label of that edge is the common community label. If an edge connects nodes of different communities (inter community edge), then that edge is not considered for calculating the accuracy of downstream tasks. Note that all the edges (both intra and inter community) are considered for learning the edge embeddings. We also provide the size of the generated weighted line graphs in Table 1. Note that the line graphs are still extremely sparse in nature, which enables the application of efficient data structures and computation on sparse graphs here.
Table 1: Summary of the datasets used.
| Dataset | #Nodes | #Edges | #Edge-Labels | #Nodes in L(G) | #Edges in L(G) |
| Zachary's Karate club | 34 | 78 | 3 | 78 | 528 |
| Dolphin social network | 62 | 159 | 4 | 159 | 923 |
| Cora | 2708 | 5278 | 7 | 5278 | 52301 |
| Pubmed | 19717 | 44327 | 3 | 44327 | 699385 |
| MSA | 30101 | 204926 | 3 | 204926 | 6149555 |
We set the parameter $\alpha$ in Eq. 3 to be 0.1 in the experiments. At that value, the two components in the cost function in Eq. 3 contribute roughly the same to the total cost in the first iteration of line2vec. The dimension ($K$) of the embedding space is set as 8 for Karate club and Dolphin social network as they are small in size, and it is set as 128 for the other three larger datasets (for all the algorithms). For faster convergence of SGD, we set the initial learning rate higher and decrease it over the iterations. We vary the penalty parameters in Eq. 4 over the iterations as discussed in Section 4.3.2 to ensure that the constraints are largely satisfied.
5.3 Penalty Errors of line2vec Optimization
We have shown the values of two different penalty errors (or constraint violation errors of the penalty method based optimization) over the iterations of line2vec in Figure 5. For all the datasets, the total spherical error $\sum_{u \in V} \sum_{v \in N(u)} g(\|x_{uv} - c_u\|_2^2 - R_u^2)$ converges to a small value very close to zero and the negative error $\sum_{u \in V} g(-R_u)$ remains zero. This means that almost all the constraints of the line2vec formulation are satisfied in the final solution.
5.4 Downstream Edge Mining Tasks
Edge visualization: It is important to understand if the edge embeddings are able to separate the communities visually. We use the embeddings of the edges in $\mathbb{R}^K$ as input, and use t-SNE [17] to plot the edge embeddings in a 2 dimensional space. Fig. 4 shows the edge visualizations by line2vec, along with the baseline algorithms, on the Cora dataset. Note that line2vec is able to visually separate the communities well compared to all the other baselines. The same trend was observed even in Fig. 1 for the small synthetic network. Line2vec, being a direct approach for edge embedding via collective homophily, outperforms all the baselines which aggregate node embeddings to generate the embeddings for the edges.
Edge Clustering: Like node clustering, edge clustering is also important to understand the flow of information within and between the communities.
For clustering the embeddings of the edges, we apply the KMeans++ algorithm. To evaluate the quality of clustering, we use unsupervised clustering accuracy [34] which uses different permutations of the labels and chooses the assignment which gives the best possible accuracy. Figure 6a shows that line2vec outperforms all the baselines for edge clustering on all the datasets. DeepWalk and node2vec also perform well among the baselines.
Multi-class Edge Classification: We use only 10% of the edges with ground truth labels (as generated in Section 5.2) as the training set, because getting labels is expensive in networks. A logistic regression classifier is trained on the edge embeddings generated by different algorithms. The performance on the test set is reported using Micro F1 score. Figure 6b shows that line2vec is better than or highly competitive with the state-of-the-art embedding algorithms. node2vec and DeepWalk follow line2vec closely. On the Dolphin dataset, node2vec outperforms line2vec marginally. Performance of line2vec for edge classification again shows the superiority of a direct edge embedding scheme over the node aggregation approaches.
(a) DeepWalk (b) node2vec (c) SDNE (d) struc2vec (e) GraphSAGE (f) line2vec
Figure 4: Edge visualization on Cora dataset. Different colors represent different edge communities.
(a) Spherical Error (b) Non-negative Error
Figure 5: Both spherical error $\sum_{u \in V} \sum_{v \in N(u)} g(\|x_{uv} - c_u\|_2^2 - R_u^2)$ and non-negative error $\sum_{u \in V} g(-R_u)$ in the penalty function based optimization of line2vec converge to zero very fast on all the datasets.
5.5 Ablation Study of line2vec
The idea of line2vec is to embed the line graph for generating the edge embeddings of a given network. There are two main novel components in line2vec: first, the construction of the weighted line graph; and second, more importantly, proposing the concept of collective homophily on the weighted line graph. In this subsection, we show the incremental benefit of each component through a small experiment of edge visualization on the Dolphin dataset, as shown in Fig. 7. We use node2vec (N2V) as the starting point because the skip-gram objective component of line2vec (L2V) is similar to node2vec. Although visually there is not much difference between Sub-figures 7a and 7b, there is some improvement when we apply node2vec on the weighted line graph (without using collective homophily) in Sub-fig. 7c. Finally, the superiority of line2vec because of using collective homophily on the weighted line graph is clear from Sub-fig. 7d. Thus, both the novel components of line2vec have their incremental benefits for the overall algorithm.
5.6 Parameter Sensitivity of line2vec
Figure 8 shows the sensitivity of line2vec with respect to the hyper-parameter $\alpha$ (in Eq. 3 of the main paper) on Karate and Dolphin datasets. We have shown the variation of performance for edge classification (Micro F1 score) and edge clustering (unsupervised accuracy). From the figure, one can observe that optimal performance in most of the cases is obtained when the value of $\alpha$ is from 0.05 to 0.1. Around these values, the losses from the two components of line2vec in Eq. 3 are close to each other. For our other experiments, we fix $\alpha = 0.1$ for all the datasets.
5.7 Interpretation of $c_u$ as Node Embedding
line2vec is dedicated to direct edge embedding in information networks. Lemma 1 suggests that each node in the given network $G$ induces a clique in the line graph $L(G)$. Based on the concept of collective homophily, corresponding to a node $u$ in $G$, the clique in the line graph is enclosed by a sphere centered at $c_u \in \mathbb{R}^K$ (Eq. 3).
Intuitively, the center acts as a point which is close to the embeddings of all the nodes in the clique induced by $u$ (or equivalently, all the edges incident on $u$ in $G$). Hence the role of this center in the embedding space is similar to the role of the node to its adjacent edges in the graph. This motivates us to consider $c_u$ as the node embedding of $u \in V$ in $G$. If $(u,v) \in E$, then the edge embedding of $(u,v)$ should be close to both $c_u$ and $c_v$, which in turn pulls $c_u$ and $c_v$ close to each other. Thus, node proximities are also captured in $c_u$.
We use clustering of the nodes (a.k.a. community detection) of the given network to validate the quality of the node embeddings obtained from the centers of the line2vec optimization. We use k-means++ clustering, as before, on the set of points $c_u, \forall u \in V$ and validate the clustering quality by using unsupervised accuracy [4]. Figure 6c shows that line2vec, though designed specifically for edge embedding, performs really well on a node based mining task. Specifically, for the Karate and Dolphins networks, the gain over the best of the baselines is significant. This result is interesting as we aimed to find edge embeddings, but also generate a set of efficient node embeddings, which are not just the aggregation of the incident edges.
5.8 Connection of Node Centrality with $R_u$
This subsection analyzes the interpretation of the radius $R_u$ of the sphere enclosing the clique induced by node $u \in V$ in the embedding space. When a node $u$ has few incident edges, and the neighbors are very close to each other in the embedding space (e.g., they are all from the same sub-community), a small radius $R_u$ should be enough to enclose all the edges incident on $u$. But when the neighbors of the node $u$ are diverse in nature, the corresponding edges would also be different in terms of strength and semantics.
For example, an influential researcher may be directly connected to many other researchers in a research network, but only a few of them may be direct collaborators. Hence, a larger sphere is needed to enclose the clique in the line graph induced by such a node. This intuition connects the radius $R_u$ of a sphere in the embedding space of the line graph to the centrality [27] of the node u in the given network. A node that is loosely connected (i.e., with few or very similar neighbors) in the network is less central, and a node that is strongly connected (with many or diverse neighbors) is considered highly central.

As real-life networks are noisy [4], we first experiment with a small synthetic graph, shown in Figure 9, to show the connection between $R_u$ and the centrality of the node u ∈ V. It has three communities and a central (red colored) node connecting all the communities. Each community has three sub-communities, which are connected via the green colored nodes. The degree of each node in this network is kept roughly the same. We use closeness centrality [21], which is widely used in the network analysis literature. The closeness centrality of the nodes is plotted in Fig. 9(b). The nodes on the y-axis are sorted by their closeness centrality values and, as expected, the red node tops the list as it is well connected to all the communities, followed by the green nodes, with the yellow nodes placed at the bottom. We run line2vec on this synthetic graph and plot the radius $R_u$ for each node u in Fig. 9(c). Here also, the nodes are sorted in the same order as in sub-figure 9(b). As one can see, the red node has the highest value of the radius. As this node is connected to a diverse set of nodes in the network, it needs a larger sphere to enclose the induced clique in the line graph. We also observe that most of the green nodes have higher values of $R_u$ than the yellow nodes. The correlation coefficient between the closeness centrality and the radius $R_u$ is 0.56. A more prominent trend can be observed for betweenness centrality [27], where the correlation coefficient with the radius $R_u$ is 0.86.

On all the real-world datasets, we show the correlation of $R_u$ with the two centrality metrics for all the nodes in Table 2. The high positive correlation between them suggests that the radius $R_u$ of a node is roughly proportional to the centrality of the node u in the network. However, a detailed analysis is required to see the scope of introducing a new type of node centrality based on the values of $R_u$.

Figure 6: Performance comparisons: (a) Micro F1 score of edge classification. (b) Edge clustering with KMeans++. (c) Node clustering with KMeans++. Here we use $c_u$ as the embedding of the node u in the given network.

Figure 7: Edge visualization on the Dolphin dataset by t-SNE, with panels (a) N2V, (b) N2V+LG, (c) N2V+WLG, (d) L2V. Edge embeddings are obtained (a) by using node2vec on the input graph and then taking the average of the end node embeddings for each edge, (b) by using node2vec on an unweighted (conventional) line graph, (c) by using node2vec on our proposed weighted line graph, (d) by line2vec. Clearly, there is an incremental improvement in quality from using the weighted line graph and then collective homophily, as reflected in (c) and (d) respectively.

Figure 8: Sensitivity of line2vec with respect to the hyper-parameter α (in Eq. 3 of the main paper) on the Karate and Dolphin datasets: (a) edge classification (Micro F1 score) and (b) edge clustering (unsupervised accuracy).

Figure 9: Relationship between radius $R_u$ associated with each node and closeness centrality in a synthetic graph.
(a) shows the structure of the synthetic network. (b) shows the closeness centrality of the nodes, where the nodes on the y-axis are sorted by their centrality values. (c) shows $R_u$ for all the nodes. Nodes on the y-axis of (c) are sorted in the same order as in (b). The colors of the lines in (b) and (c) correspond to the three different types of nodes (colored accordingly) in (a). This figure also shows the high overlap between the top few nodes in both lists.

Table 2: Pearson correlation coefficient (CC) values between the radius ($R_u$) and centrality values of nodes for different networks. The centrality measures considered here are betweenness and closeness centrality.

Dataset          Karate  Dolphins  Cora  Pubmed  MSA
Betweenness CC   0.81    0.66      0.29  0.26    0.35
Closeness CC     0.68    0.78      0.79  0.59    0.72

6 Discussion and Future Work

We proposed a novel, unsupervised, dedicated edge embedding framework for homogeneous information and social networks. We convert the given network to a weighted line graph and introduce the concept of collective homophily to embed the weighted line graph. Our framework is quite generic. The skip-gram based component in the objective function of line2vec can easily be replaced with another approach, such as graph convolution on the weighted line graph. Besides, we also plan to extend this methodology to heterogeneous information networks and knowledge bases. There are several edge-centric applications in networks. This work, being the first one towards a direct edge embedding, can serve as a basis for solving some of them in the context of network embedding and help to move network representation learning beyond node embedding.

REFERENCES

[1] Sami Abu-El-Haija, Bryan Perozzi, and Rami Al-Rfou, ‘Learning edge representations via low-rank asymmetric projections’, in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1787–1796. ACM, (2017).
[2] Lada A Adamic and Eytan Adar, ‘Friends and neighbors on the web’, Social networks, 25(3), 211–230, (2003).
[3] Sambaran Bandyopadhyay, Harsh Kara, Aswin Kannan, and M Narasimha Murty, ‘Fscnmf: Fusing structure and content via non-negative matrix factorization for embedding information networks’, arXiv preprint arXiv:1804.05313, (2018).
[4] Sambaran Bandyopadhyay, N Lokesh, and M Narasimha Murty, ‘Outlier aware network embedding for attributed networks’, in Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 12–19, (2019).
[5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko, ‘Translating embeddings for modeling multi-relational data’, in Advances in neural information processing systems, pp. 2787–2795, (2013).
[6] Kurt Bryan and Yosi Shibberu, ‘Penalty functions and constrained optimization’, Dept. of Mathematics, Rose-Hulman Institute of Technology. http://www.rosehulman.edu/~bryan/lottamath/penalty.pdf, (2005).
[7] Muhao Chen and Chris Quirk, ‘Embedding edge-attributed relational hierarchies’. SIGIR, (2019).
[8] Tim S Evans and Renaud Lambiotte, ‘Line graphs of weighted networks for overlapping communities’, The European Physical Journal B, 77(2), 265–272, (2010).
[9] Hongchang Gao and Heng Huang, ‘Deep attributed network embedding.’, in IJCAI, volume 18, pp. 3364–3370, (2018).
[10] Zheng Gao, Gang Fu, Chunping Ouyang, Satoshi Tsutsui, Xiaozhong Liu, Jeremy Yang, Christopher Gessner, Brian Foote, David Wild, Ying Ding, et al., ‘edge2vec: Representation learning using edge semantics for biomedical knowledge discovery’, BMC bioinformatics, 20(1), 306, (2019).
[11] Mohammad Golam Sohrab, Toru Nakata, Makoto Miwa, and Yutaka Sasaki, ‘Edge2vec: Edge representations for large-scale scalable hierarchical learning’, Computación y Sistemas, 21(4), 569–579, (2017).
[12] Aditya Grover and Jure Leskovec, ‘node2vec: Scalable feature learning for networks’, in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, (2016).
[13] Will Hamilton, Zhitao Ying, and Jure Leskovec, ‘Inductive representation learning on large graphs’, in Advances in Neural Information Processing Systems, pp. 1025–1035, (2017).
[14] Thomas N Kipf and Max Welling, ‘Semi-supervised classification with graph convolutional networks’, arXiv preprint arXiv:1609.02907, (2016).
[15] Suxue Li, Haixia Zhang, Dalei Wu, Chuanting Zhang, and Dongfeng Yuan, ‘Edge representation learning for community detection in large scale information networks’, in International Workshop on Mobility Analytics for Spatio-temporal and Social Data, pp. 54–72. Springer, (2017).
[16] David Liben-Nowell and Jon Kleinberg, ‘The link-prediction problem for social networks’, Journal of the American society for information science and technology, 58(7), 1019–1031, (2007).
[17] Laurens van der Maaten and Geoffrey Hinton, ‘Visualizing data using t-sne’, Journal of machine learning research, 9(Nov), 2579–2605, (2008).
[18] Miller McPherson, Lynn Smith-Lovin, and James M Cook, ‘Birds of a feather: Homophily in social networks’, Annual review of sociology, 27(1), 415–444, (2001).
[19] Federico Monti, Oleksandr Shchur, Aleksandar Bojchevski, Or Litany, Stephan Günnemann, and Michael M Bronstein, ‘Dual-primal graph convolutional networks’, arXiv preprint arXiv:1806.00770, (2018).
[20] M.E.J. Newman, Networks: An Introduction, Oxford University Press, Oxford, UK, 2010.
[21] Tore Opsahl, Filip Agneessens, and John Skvoretz, ‘Node centrality in weighted networks: Generalizing degree and shortest paths’, Social networks, 32(3), 245–251, (2010).
[22] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena, ‘Deepwalk: Online learning of social representations’, in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, (2014).
[23] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo, ‘struc2vec: Learning node representations from structural identity’, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394. ACM, (2017).
[24] E. Rogers, Diffusion of Innovations, Free Press, New York, USA, 1995.
[25] Yu Shi, Qi Zhu, Fang Guo, Chao Zhang, and Jiawei Han, ‘Easing embedding learning by comprehensive transcription of heterogeneous information networks’, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2190–2199. ACM, (2018).
[26] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-june Paul Hsu, and Kuansan Wang, ‘An overview of microsoft academic service (mas) and applications’, in Proceedings of the 24th international conference on world wide web, pp. 243–246. ACM, (2015).
[27] Oskar Skibski, Talal Rahwan, Tomasz P Michalak, and Makoto Yokoo, ‘Attachment centrality: An axiomatic approach to connectivity in networks’, in Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 168–176. International Foundation for Autonomous Agents and Multiagent Systems, (2016).
[28] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio, ‘Graph attention networks’, in International Conference on Learning Representations, (2018).
[29] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm, ‘Deep graph infomax’, in International Conference on Learning Representations, (2019).
[30] Janu Verma, Srishti Gupta, Debdoot Mukherjee, and Tanmoy Chakraborty, ‘Heterogeneous edge embedding for friend recommendation’, in European Conference on Information Retrieval, pp. 172–179. Springer, (2019).
[31] Daixin Wang, Peng Cui, and Wenwu Zhu, ‘Structural deep network embedding’, in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. ACM, (2016).
[32] H. Whitney, ‘Congruent graphs and the connectivity of graphs’, American Journal of Mathematics, 54(1), 150–168, (1932).
[33] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu, ‘A comprehensive survey on graph neural networks’, arXiv preprint arXiv:1901.00596, (2019).
[34] Junyuan Xie, Ross Girshick, and Ali Farhadi, ‘Unsupervised deep embedding for clustering analysis’, in International conference on machine learning, pp. 478–487, (2016).
[35] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang, ‘Network representation learning with rich text information.’, in IJCAI, pp. 2111–2117, (2015).
[36] Yang Zhou, Sixing Wu, Chao Jiang, Zijie Zhang, Dejing Dou, Ruoming Jin, and Pengwei Wang, ‘Density-adaptive local edge representation learning with generative adversarial network multi-label edge classification’, in 2018 IEEE International Conference on Data Mining (ICDM), pp. 1464–1469. IEEE, (2018).
bielak-attre2vec-2021 ---- AttrE2vec: Unsupervised Attributed Edge Representation Learning Preprint · December 2020
AttrE2vec: Unsupervised Attributed Edge Representation Learning

Piotr Bielak (a), Tomasz Kajdanowicz (a), Nitesh V. Chawla (a, b)
(a) Department of Computational Intelligence, Wroclaw University of Science and Technology, Poland
(b) Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA

Abstract

Representation learning has overcome the often arduous and manual featurization of networks through (unsupervised) feature learning, as it results in embeddings that can apply to a variety of downstream learning tasks.
Representation learning on graphs has focused mainly on shallow (node-centric) or deep (graph-based) learning approaches. While there have been approaches that work on homogeneous and heterogeneous networks with multi-typed nodes and edges, there is a gap in learning edge representations. This paper proposes a novel unsupervised inductive method called AttrE2Vec, which learns a low-dimensional vector representation for edges in attributed networks. It systematically captures the topological proximity, attributes affinity, and feature similarity of edges. In contrast to current advances in edge embedding research, our proposal extends the body of methods providing representations for edges by capturing graph attributes in an inductive and unsupervised manner. Experimental results show that, compared to contemporary approaches, our method builds more powerful edge vector representations, reflected by higher quality measures (AUC, accuracy) in downstream tasks such as edge classification and edge clustering. This is also confirmed by analyzing low-dimensional embedding projections.

Keywords: representation learning, graphs, edge embedding, random walk, neural network, attributed graph.

1. Introduction

Complex networks, including attributed and heterogeneous networks, are ubiquitous, from recommender systems to citation networks and biological systems [1]. These networks present a multitude of machine learning problem statements, including node classification, link prediction, and community detection. A fundamental aspect of any such machine learning (ML) task, transductive or inductive, is the availability of featurized data. Traditionally, researchers have identified several network characteristics suited to specific ML tasks and used them for the learning algorithm. This practice is arduous, as it often entails customizing to each specific ML task, and it is also limited to the computable characteristics.
Preprint submitted to Information Sciences, January 1, 2021. arXiv:2012.14727v1 [cs.LG] 29 Dec 2020

Figure 1: Our proposed AttrE2vec model compared to other methods in the task of an attributed graph embedding. Colors denote edge features. On the left we can see a graph, where the features are aligned to substructures of the graph. On the right, the features were shuffled (ca. 50%). Traditional approaches fail to build robust representations, whereas our method includes feature information to construct the embedding vectors.

This has led to a surge in (unsupervised) algorithms and methods that learn embeddings from the networks, such that these embeddings form the featurized representation of the network for the ML tasks [2, 3, 4, 5, 6]. This area of research is generally referred to as representation learning in networks. Generally, the embeddings generated by representation learning methods are agnostic to the end use-case, as they are generated in an unsupervised fashion.

Traditionally, the focus was on representation learning on homogeneous networks, i.e. networks that have a single type of nodes and edges and do not have attributes attached to the nodes and edges [4]. Existing representation learning models mainly focus on transductive learning, where a model can only be trained using the entire input graph. This means that the model requires all the nodes and a fixed structure of the network in the training phase, e.g., Node2vec [7], DeepWalk [8] and GCN [9], to some extent. Besides, there have been methods focused on heterogeneous networks that incorporate different typed nodes and edges in a network, as well as content at each node [10, 11]. On the other hand, a less explored and exploited approach is the inductive setting. In this approach, only a part of the network is used to train the model to infer embeddings for new nodes.
Several attempts have been made in the inductive setting, including EP-B [12], GraphSAGE [13], GAT [14], SDNE [15], TADW [16], AHNG [17] or PVECB [18]. There is also recent progress on heterogeneous graph embedding, e.g., MIFHNE [19] or models based on graph neural networks [20].

State-of-the-art network embedding techniques are mostly unsupervised, i.e., they aim at learning low-dimensional representations that preserve the structure of an input graph, e.g., GraphSAGE [13], DANE [21], line2vec [22], RCAN [23]. Nevertheless, semi-supervised or supervised methods can learn vector representations, but only for a specific downstream prediction task, e.g., TADW [16] or FSCNMF [24]. Hence it has been shown in the literature that not much supervision is required to learn the embeddings.

In recent years, proposed models mainly focus on graphs that do not contain attributes related to nodes and edges [4]. This is especially noticeable for edge attributes. The majority of proposed approaches consider node attributes only, omitting the richness of the edge feature space while learning the representation. Nevertheless, models such as DANE [21], GraphSAGE [13], SDNE [15] or CAGE [25], which make use of node features, and EGNN [26], NEWEE [27], EGAT [28], which consume edge attributes, have been successfully introduced.

Table 1: Comparison of most representative graph embedding methods with their abilities to learn the representation, with or without attributes, reasoning types and short characteristics. The most prominent and appropriate methods selected to compare to AttrE2vec in experiments are marked with bold text.

Columns: Method; Representation (Nodes, Edges); Attributed (Nodes, Edges); Reasoning (Transduct., Induct.); Family

Supervised
ECN [29] (2016)             X X           neigh. aggr.
GCN [9] (2017)              X X X X       GCN/GNN
ECC [30] (2017)             X X X         GCN, DL
FSCNMF [24] (2018)          X X X         GCN
GAT [14] (2018)             X X X X       AE, DL
Planetoid [31] (2018)       X X X X       GNN
EGNN [26] (2019)            X X X X X X   GNN
EdgeConv [32] (2019)        X X           GNN
EGAT [28] (2019)            X X X X X X   GNN
Attribute2vec [33] (2020)   X X X         GCN

Unsupervised
DeepWalk [8] (2014)         X X           RW, skip-gram
TADW [16] (2015)            X X X         RW, MF
LINE [34] (2015)            X X           RW, skip-gram
Node2vec [7] (2016)         X X           RW, skip-gram
SDNE [15] (2016)            X X X X       AE
GraphSAGE [13] (2017)       X X X X       RW
EP-B [12] (2017)            X X X X       AE
Struc2vec [35] (2017)       X X           RW, skip-gram
DANE [21] (2018)            X X X X       AE
Line2vec [22] (2019)        X X           RW, skip-gram
NEWEE [27] (2019)           X X X X       RW, skip-gram
AttrE2vec (2020)            X X X X X     RW, AE, DL

Both node-based embedding methods and graph neural network inspired methods do not generalize effectively to both transductive and inductive settings, especially when there are attributes associated with edges. This work is motivated by the idea of unsupervised learning on networks with attributed edges such that the embeddings are generalizable across tasks and are inductive.

To that end, we develop AttrE2vec, a novel unsupervised learning model that adapts an auto-encoder and a self-attention network with the use of feature reconstruction and graph structural loss. To learn an edge representation, AttrE2vec splits the edge neighborhood into two parts, one for each end node of the edge, and then generates random edge walks in both neighborhoods. All walks are then aggregated over the node and edge attributes using one of the proposed strategies (Avg, Exp, GRU, ConcatGRU). These are accumulated with the original node and edge features and then fed to an attention layer and a dense layer to encode the edge. The embeddings are subsequently inferred via a two-step loss function that covers both feature reconstruction and graph structural loss.
As a consequence, AttrE2vec can explicitly incorporate feature information from nodes and edges many hops away to effectively produce plausible edge embeddings for the inductive setting.

In summary, our main contributions are as follows:
• we propose a novel unsupervised AttrE2vec method, which learns a low-dimensional vector representation for edges that are attributed
• we exploit the concept of a graph-topology-driven edge feature aggregation, from simple ones to learnable GRU based, that captures edge topological proximity and similarity of edge features
• the proposed method is inductive and allows getting the representation for edges not present in the training phase
• we conduct various experiments and show that our AttrE2vec method has superior performance over all of the baseline methods on edge classification and clustering tasks.

2. Related work and Research Gap

Embedding information networks has received significant interest from the research community. We refer the readers to the survey articles for a comprehensive overview of network embedding [4, 5, 3, 2] and cite only some of the most prominent works that are relevant.

Unsupervised network embedding methods use only the network structure or original attributes of nodes and edges to construct embeddings. The most common method is DeepWalk [8], which works in two phases: it constructs node neighborhoods by performing fixed-length random walks and then employs the skip-gram [7] model to preserve the co-occurrences between nodes and their neighbors. This two-phase framework later inspired other network embedding methods, which propose different strategies for constructing node neighborhoods or modeling co-occurrences between nodes, e.g., node2vec [7], Struc2vec [35], GraphSAGE [13], line2vec [22] or NEWEE [27]. Another group of unsupervised methods utilizes auto-encoders or graph neural networks to obtain embeddings.
SDNE [15] uses an auto-encoder architecture to preserve first and second-order proximities by jointly optimizing the loss in neighborhood reconstruction. Other auto-encoder based representatives are EP-B [12] and DANE [21].

Supervised network embedding methods are constructed as end-to-end methods for particular tasks like node classification or link prediction. These methods require the network structure, attributes of nodes and edges (if the method is capable of using them) and some annotated target like a node class. The representatives are ECN [29], ECC [30], FSCNMF [24], GAT [14], planetoid [31], EGNN [26], GCN [9], EdgeConv [32], EGAT [28], Attribute2vec [33].

Edge representation learning has already been tackled by several methods, i.e. ECN [29], EGNN [26], line2vec [22], EdgeConv [32], EGAT [28]. However, none of these methods was able to directly take into account attributes of edges and at the same time perform the learning in an unsupervised manner. All the characteristics of the representative node and edge representation learning methods are grouped in Table 1.

3. Method

3.1. Motivation

In the following paragraphs, we explain our three-fold motivation to propose AttrE2vec.

Edge embeddings. For a decade, network processing approaches have gathered more and more attention as graph data is produced in an increasing number of systems. Network embedding traditionally provided the notion of vectorizing nodes, which was used in node classification or clustering. However, edge representation learning did not gather enough attention and was accomplished through node embedding transformation [36]. Nevertheless, such an approach is problematic. For instance, inferring the edge type from neighboring nodes' embeddings may not be the best choice for edge type classification in heterogeneous social networks. We claim that efficient edge clustering, edge attribute regression, or link prediction tasks require dedicated and specific edge representations.
We expect that a representation learning approach devoted strictly to edges provides more powerful vector representations than traditional methods that require node embeddings trained upfront and transform nodes' embeddings to represent edges.

Inductive embedding methods. A vast majority of contemporary network representation learning methods are transductive (see Table 1). This means that any change to the graph requires retraining the whole method to provide predictions for unseen cases. This property limits the applicability of such methods due to high computational costs. In contrast, the inductive approach builds a predictive ability that can be applied to unseen cases and does not need retraining; in general, inductive methods have a lower computational cost. Considering these advantages, we expect modern edge embedding methods to be inductive.

Encoding graph attributes in embeddings. Much of the real-world data exhibits rich attribute sets or meta-data that contain crucial information, e.g., about the similarity of nodes or edges. Traditionally, graph representation learning has focused on exploiting the network structure, omitting the related content. Thus, we may expect to consume attributes as a regularizer over the structure. This would allow overcoming the limitation that arises when the only edge discriminating ability is encoded in the edges' attributes and not in the graph's structure. In that case, relying only on the network would produce inconclusive embeddings.

3.2. Attributed graph edge embedding

We denote an attributed graph as $G = (V, E)$, where $V$ is a set of nodes and $E = \{(u,v) \in V \times V\}$ a set of edges. Every node $u$ and every edge $e = (u,v)$ has associated features: $m_u \in \mathbb{R}^{d_V}$ and $f_{uv} \in \mathbb{R}^{d_E}$, where $M \in \mathbb{R}^{|V| \times d_V}$ and $F \in \mathbb{R}^{|E| \times d_E}$ are the node and edge feature matrices, respectively. By $d_V$ we denote the dimensionality of the node feature space and by $d_E$ the dimensionality of the edge feature space.
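The notation above can be made concrete with a small container. This is a minimal sketch, assuming plain Python lists instead of a tensor library; the class name `AttributedGraph` and its methods `dims` and `matrices` are hypothetical and are not part of the AttrE2vec code. It only illustrates that $M$ has one row of length $d_V$ per node and $F$ has one row of length $d_E$ per edge.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


@dataclass
class AttributedGraph:
    """Hypothetical container for G = (V, E) with node and edge features."""
    nodes: List[int]
    edges: List[Edge]
    node_features: Dict[int, List[float]]   # m_u, each of length d_V
    edge_features: Dict[Edge, List[float]]  # f_uv, each of length d_E

    def dims(self) -> Tuple[int, int]:
        """Return (d_V, d_E) after checking that the lengths are consistent."""
        d_v = {len(self.node_features[u]) for u in self.nodes}
        d_e = {len(self.edge_features[e]) for e in self.edges}
        if len(d_v) != 1 or len(d_e) != 1:
            raise ValueError("inconsistent feature dimensionality")
        return d_v.pop(), d_e.pop()

    def matrices(self) -> Tuple[List[List[float]], List[List[float]]]:
        """Stack features row-wise: M is |V| x d_V and F is |E| x d_E."""
        M = [self.node_features[u] for u in self.nodes]
        F = [self.edge_features[e] for e in self.edges]
        return M, F


# Toy data invented for this illustration only.
toy = AttributedGraph(
    nodes=[0, 1, 2],
    edges=[(0, 1), (1, 2)],
    node_features={0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]},
    edge_features={(0, 1): [1.0, 0.0, 0.0], (1, 2): [0.0, 1.0, 0.0]},
)
d_V, d_E = toy.dims()
M, F = toy.matrices()
print((len(M), d_V), (len(F), d_E))  # (3, 2) (2, 3)
```

Keeping the row order of `M` and `F` tied to the order of `nodes` and `edges` is a design choice of this sketch; it makes a row index usable as a node or edge identifier.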
The edge embedding task is defined as learning a function $g : E \to \mathbb{R}^{d}$, which takes an edge and outputs its low-dimensional vector representation. Note that the embedding dimension $d$ should be much smaller than the original edge feature dimensionality $d_E$, i.e., $d \ll d_E$. More specifically, we aim at using the topological structure of the graph together with the node and edge attributes: $f : (E, F, M) \to \mathbb{R}^{d}$.

Figure 2: Overview of the AttrE2vec model. The model first computes edge random walks on two neighborhoods of a given edge $(u, v)$. The walks of each neighborhood are aggregated into $S_u$ and $S_v$. Both are combined with the edge features $f_{uv}$ using an Encoder module, which results in the edge embedding vector $h_{uv}$. The loss function consists of two parts: structural loss ($L_{\cos}$) and feature reconstruction loss ($L_{\mathrm{MSE}}$).

3.3. AttrE2vec

In contrast to traditional node embedding methods, we shift the focus from nodes to edges and consider a graph from an edge perspective. Given any edge $e = (u, v)$, we can observe three natural sources of knowledge: the edge attributes themselves and the two neighborhoods $N_u$ and $N_v$, located behind nodes $u$ and $v$, respectively. In AttrE2vec, we exploit all three sources jointly.

First, we obtain aggregations (summaries) $S_u$, $S_v$ of both neighborhoods $N_u$, $N_v$. We want to capture the topological structure of the neighborhood, so we perform $k$ edge random walks of length $L$, which start from node $u$ (or $v$, respectively) and use a uniformly distributed neighbor sampling approach (DeepWalk-like) to obtain the next edge. Each $i$-th walk $w_u^i$ started from node $u$ is hence a sequence of edges.

$$\mathrm{RW}(G, k, L, u) \to \{w_u^1, w_u^2, \ldots, w_u^k\}$$

$$w_u^i \equiv (u, u_2), (u_3, u_4), \ldots, (u_{L-1}, u_L)$$

Next, we take the attributes of the edges (and nodes, if applicable) in each random walk and aggregate them into a single vector using the walk aggregation model $\mathrm{Agg}_w$.
$$S_u^i = \mathrm{Agg}_w(w_u^i, F, M)$$

Later, the aggregated walks are combined using the neighborhood aggregation model $\mathrm{Agg}_n$, which summarizes the neighborhood into $S_u$ (and $S_v$, respectively). The proposed implementations of these aggregations are given in Section 3.4.

$$S_u = \mathrm{Agg}_n(\{S_u^1, S_u^2, \ldots, S_u^k\})$$

Finally, we obtain the low-dimensional edge embedding $h_{uv}$ using an encoder module Enc. It combines the edge attributes $f_{uv}$ with the summarized neighborhood information $S_u$, $S_v$. We employ a simple Multilayer Perceptron (MLP) with 3 inputs (each of size equal to the edge feature dimensionality) and an attention mechanism over these inputs, to check how much of the information of each input is used to create the embedding vector (see Figure 3):

$$h_{uv} = \mathrm{Enc}(f_{uv}, S_u, S_v)$$

Figure 3: Encoder module architecture

The overall illustration of the method is contained in Figure 2 and the inference algorithm is shown in Algorithm 1.

Algorithm 1: AttrE2vec inference algorithm
  Data: graph $G$, edge list $x_e$, edge features $F$, node features $M$
  Params: number of random walks per node $k$, random walk length $L$
  Result: edge embedding vectors $h_{uv}$
  begin
    foreach $(u, v)$ in $x_e$ do
      foreach $i$ in $(1 \ldots k)$ do
        $w_u^i = \mathrm{RW}(G, L, u)$
        $S_u^i = \mathrm{Agg}_w(w_u^i, F, M)$
        $w_v^i = \mathrm{RW}(G, L, v)$
        $S_v^i = \mathrm{Agg}_w(w_v^i, F, M)$
      end
      $S_u = \mathrm{Agg}_n(\{S_u^1, \ldots, S_u^k\})$
      $S_v = \mathrm{Agg}_n(\{S_v^1, \ldots, S_v^k\})$
      $h_{uv} = \mathrm{Enc}(f_{uv}, S_u, S_v)$
    end
  end

3.4. Aggregation models

For the purpose of the neighborhood aggregation model $\mathrm{Agg}_n$, we use an average over the vectors $S_u^i$, as there is no particular ordering of these vectors (each one was generated by an equally important random walk). In the case of walk aggregation, we propose the following:

• average – computes a simple average of the edge attribute vectors in the random walk;

$$S_u^i = \frac{1}{L} \sum_{n=1}^{L} f_{u_n u_{n+1}}$$

• exponential – computes a weighted average, where the weights are exponentials of the negated position in the random walk, so that edges further away are less important than the near ones;

$$S_u^i = \frac{1}{L} \sum_{n=1}^{L} e^{-n} f_{u_n u_{n+1}}$$

• GRU – uses a Gated Recurrent Unit [37] architecture, where the hidden and input dimensions are equal to the edge attribute dimension; the aggregated representation is the output of the last hidden vector; the aggregation process starts at the end of the random walk and proceeds to the beginning;

$$S_u^i = \mathrm{GRU}(\{f_{u_n u_{n+1}}, f_{u_{n-1} u_n}, \ldots, f_{u_1 u_2}\})$$

• ConcatGRU – similar to the GRU-based aggregator, but here we also use the node feature information by concatenating the node attributes with the edge attributes; hence the GRU input size is equal to the sum of the edge and node dimensions; in case there are no node features available, one could use network-specific features, like degree or betweenness, or more advanced techniques like Node2vec; the hidden dimension size and the aggregation direction are unchanged;

$$S_u^i = \mathrm{ConcatGRU}(\{f_{u_n u_{n+1}} \oplus m_{u_n}, \ldots, f_{u_1 u_2} \oplus m_{u_1}\})$$

3.5. Learning AttrE2vec's parameters

AttrE2vec is designed to make the most use of edge attributes and information about the structure of the network.
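The two parameter-free walk aggregators of Section 3.4, together with the uniform edge random walk and the neighborhood average, can be sketched in a few lines. This is a minimal illustration under stated assumptions: a walk is taken to be a chain of L consecutive edges, edge features are stored per direction, and all function names and the toy graph are hypothetical. The GRU-based aggregators need a trained recurrent network and are not covered here.

```python
import math
import random


def edge_random_walk(adj, start, length, rng):
    """Sample `length` consecutive edges from `start`, picking neighbors uniformly."""
    walk, current = [], start
    for _ in range(length):
        nxt = rng.choice(adj[current])
        walk.append((current, nxt))
        current = nxt
    return walk


def agg_average(walk, edge_feats):
    """Average aggregator: (1/L) * sum of the edge feature vectors in the walk."""
    dim = len(edge_feats[walk[0]])
    return [sum(edge_feats[e][j] for e in walk) / len(walk) for j in range(dim)]


def agg_exponential(walk, edge_feats):
    """Exponential aggregator: the n-th edge (n = 1..L) is weighted by exp(-n)."""
    dim = len(edge_feats[walk[0]])
    return [
        sum(math.exp(-n) * edge_feats[e][j] for n, e in enumerate(walk, start=1))
        / len(walk)
        for j in range(dim)
    ]


def agg_neighborhood(summaries):
    """Neighborhood aggregator: plain average over the k walk summaries."""
    dim = len(summaries[0])
    return [sum(s[j] for s in summaries) / len(summaries) for j in range(dim)]


# Toy triangle; every edge gets the same features in both directions.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
feats = {}
for u, v, f in [(0, 1, [1.0, 0.0]), (1, 2, [0.0, 1.0]), (0, 2, [1.0, 1.0])]:
    feats[(u, v)] = f
    feats[(v, u)] = f

rng = random.Random(0)
walks = [edge_random_walk(adj, 0, 4, rng) for _ in range(8)]  # k = 8, L = 4
S_u = agg_neighborhood([agg_average(w, feats) for w in walks])
print(S_u)
```

Both aggregators ignore node features; only the ConcatGRU variant described above consumes them.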
Therefore, we propose a loss function that consists of two main parts:

• structural loss $L_{\cos}$ – computes a cosine embedding loss; this function tries to minimize the cosine distance between a given embedding $h$ and the embeddings of edges sampled from the random walks $h^{+}$ (positive), and simultaneously to maximize the cosine distance between the embedding $h$ and the embeddings of edges sampled from the set of all edges in the graph $h^{-}$ (negative), except for those in the random walks:

$$L_{\cos} = \frac{1}{|B|} \sum_{h_{uv} \in B} \left( \sum_{h_{uv}^{+}} \left(1 - \cos(h_{uv}, h_{uv}^{+})\right) + \sum_{h_{uv}^{-}} \cos(h_{uv}, h_{uv}^{-}) \right)$$

where $B$ denotes a minibatch of edges and $|B|$ the minibatch size,

• feature reconstruction loss $L_{\mathrm{MSE}}$ – computes the mean squared error between the actual edge features and the outputs of a decoder (implemented as a 3-layer MLP, see Figure 4) that reconstructs the edge features based on the edge embeddings;

$$L_{\mathrm{MSE}} = \frac{1}{|B|} \sum_{(h_{uv}, f_{uv}) \in B} \left(\mathrm{DEC}(h_{uv}) - f_{uv}\right)^{2}$$

where $B$ denotes a minibatch of edges and $|B|$ the minibatch size.

Figure 4: Decoder module architecture

We combine the values of the above loss functions using a mixing parameter $\lambda \in [0, 1]$. The higher the value of this parameter, the more structural information is preserved and the less focus is put on the feature reconstruction. The total loss of AttrE2vec is given as follows:

$$L = \lambda \cdot L_{\cos} + (1 - \lambda) \cdot L_{\mathrm{MSE}}$$

4. Experiments

To evaluate the proposed model's performance, we perform three tasks: edge classification, edge clustering, and embedding visualization on three real-world datasets. We first train our model on a small subset of edges (inductive setting). Then we use the model to infer embeddings for edges from the test set. Finally, we evaluate them in all downstream tasks: by predicting the class of edges in citation graphs (edge classification), by applying the K-means++ algorithm (edge clustering; as defined in [22]) and by the dimensionality reduction method T-SNE (embedding visualization).
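Since the experiments vary the mixing parameter λ, it may help to see the objective of Section 3.5 in code form. The sketch below is a plain-Python illustration, not the authors' PyTorch implementation: the function names are hypothetical, the squared error is assumed to be summed over feature dimensions and averaged over the minibatch, and the embeddings and decoder outputs are passed in as ready-made lists.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, 1e-12)  # guard against zero vectors


def structural_loss(batch):
    """batch: list of (h, positives, negatives); returns the averaged cosine loss."""
    total = 0.0
    for h, positives, negatives in batch:
        total += sum(1.0 - cosine(h, p) for p in positives)  # pull positives closer
        total += sum(cosine(h, n) for n in negatives)         # push negatives away
    return total / len(batch)


def reconstruction_loss(batch):
    """batch: list of (decoded, target) feature vectors; squared error per edge."""
    total = 0.0
    for decoded, target in batch:
        total += sum((d - t) ** 2 for d, t in zip(decoded, target))
    return total / len(batch)


def total_loss(l_cos, l_mse, lam=0.5):
    """Mix both parts: lam = 1 keeps only structure, lam = 0 only reconstruction."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_cos + (1.0 - lam) * l_mse


# One edge whose positive sample is orthogonal and whose negative is identical.
l_cos = structural_loss([([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]])])
l_mse = reconstruction_loss([([1.0, 2.0], [0.0, 0.0])])
print(total_loss(l_cos, l_mse, lam=0.5))
```

With five positive and ten negative samples per edge, the negative term can dominate the sum; whether and how the two sums are normalized per sample is not stated in the text, so the sketch follows the formula literally.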
We compare our model to several baselines and contemporary methods in all experiments, see Table 1. Finally, we check the influence of AttrE2vec's hyperparameters and perform an ablation study on artificially generated datasets.

We implement our model in the popular deep learning framework PyTorch. All experiments were performed on an NVIDIA GTX1080Ti. Upon acceptance in the journal, we will make our code available at https://github.com/attre2vec/attre2vec and include our DVC [38] pipeline so that all experiments can be easily reproduced.

4.1. Datasets

Table 2: Datasets used in the experiments.

| Name | Initial node features | Initial edge features | Pre-processed node features | Pre-processed edge features | Nodes | Edges | Classes | Training instances (inductive) | Training instances (transductive) |
|---|---|---|---|---|---|---|---|---|---|
| Cora | 1 433 | 0 | 32 | 260 | 2 485 | 5 069 | 7+1 | 160 | 5 069 |
| Citeseer | 3 703 | 0 | 32 | 260 | 2 110 | 3 668 | 6+1 | 140 | 3 668 |
| Pubmed | 500 | 0 | 32 | 260 | 19 717 | 44 324 | 3+1 | 80 | 44 324 |

In order to make the gathered evaluation evidence comparable, we focused on well-known datasets that appear in the literature, namely Cora [39], Citeseer [39] and Pubmed [40]. These are citation networks of scientific papers in several research areas, where nodes are the papers and edges denote citations between papers. We summarize basic statistics about the datasets before and after the pre-processing steps in Table 2. The raw datasets contain node features only, in the form of high-dimensional sparse bags of words. For Cora and Citeseer, these are binary vectors showing which of the most popular words were used in a given paper, and for Pubmed, the features are in the form of TF-IDF vectors.
To adjust the datasets to our problem setting, we apply the following pre-processing steps to obtain edge-level features, which are used to train and evaluate our AttrE2vec model:

• we create dense vector representations of the nodes' features by applying Doc2vec [41] in the PV-DBOW variant with a target dimension size of 128;
• for each edge (u,v) and its symmetrical version (v,u) (necessary to perform uniform, undirected random walks) we extract the following features:
  – 1 feature – cosine similarity of the raw node features for nodes u and v (binary BoW; for Pubmed transformed from TF-IDF to binary BoW),
  – 2 features – the ratios of the number of used words (number of ones in the BoW) to all possible words in the document (length of the BoW vector) in each paper u and v,
  – 256 features – concatenation of the Doc2vec features for nodes u and v,
  – 1 feature – a binary indicator, which denotes whether this is an original edge (1) or its symmetrical counterpart (0),
• we apply standardization (StandardScaler in Scikit-Learn [42]) to the edge feature matrix.

Moreover, we extracted new node features as 32-dimensional Node2vec embeddings to make it possible to evaluate one of our model versions (AttrE2vec with the ConcatGRU aggregator), which uses both edge and node attributes.

The raw datasets label each node with the research area the paper comes from. To apply this knowledge in the edge classification problem setting, we applied the following rule: if an edge has two nodes from the same class (research area), the edge receives this class; if the two nodes have different classes, the edge between these nodes is assigned a cross-domain citation class.
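The labeling rule and the four hand-crafted edge features described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the bag-of-words vectors are assumed to be binary lists, and the 256 Doc2vec features as well as the final standardization step are left out.

```python
import math
from typing import List, Sequence


def edge_label(label_u: int, label_v: int, cross_domain: int) -> int:
    """Same-class endpoints keep their class; otherwise use the cross-domain class."""
    return label_u if label_u == label_v else cross_domain


def bow_cosine(bow_u: Sequence[int], bow_v: Sequence[int]) -> float:
    """Cosine similarity of two binary bag-of-words vectors."""
    dot = sum(a * b for a, b in zip(bow_u, bow_v))
    # For 0/1 vectors the squared norm equals the number of ones.
    norm = math.sqrt(sum(bow_u)) * math.sqrt(sum(bow_v))
    return dot / norm if norm else 0.0


def used_word_ratio(bow: Sequence[int]) -> float:
    """Share of vocabulary words that occur in the document."""
    return sum(bow) / len(bow)


def handcrafted_features(bow_u, bow_v, is_original: bool) -> List[float]:
    """Cosine similarity, two used-word ratios, and the edge direction indicator."""
    return [
        bow_cosine(bow_u, bow_v),
        used_word_ratio(bow_u),
        used_word_ratio(bow_v),
        1.0 if is_original else 0.0,
    ]


# Hypothetical papers over a four-word vocabulary, 7 research areas (0..6).
CROSS_DOMAIN = 7
paper_u, paper_v = [1, 1, 0, 0], [1, 0, 1, 0]
print(handcrafted_features(paper_u, paper_v, is_original=True))
print(edge_label(3, 3, CROSS_DOMAIN), edge_label(1, 2, CROSS_DOMAIN))  # 3 7
```

The extra class explains the "+1" in the class counts of Table 2: each dataset gains one cross-domain citation class on top of its research areas.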
To ensure a fair comparison, we follow the dataset preparation scheme from EP-B [12], i.e., for each dataset (Cora, Citeseer, Pubmed) we sample 10 train/validation/test sets, where the train set consists of 20 edges per class and the validation and test sets contain 1 000 randomly chosen edges each. While reporting the resulting metrics, we show the mean values over these ten sampled sets (together with the standard deviation).

4.2. Baselines

We compare our method against several baseline methods. In the simplest case, we use the edge features obtained during the pre-processing phase for all datasets (further referred to as Doc2vec).

Many standard approaches employ simple node embedding transformations to obtain edge embeddings. The authors of Node2vec [36] proposed binary operators like averaging, Hadamard product, or L1 and L2 norms of vector differences. Here, we will use the following methods to obtain node embeddings: DeepWalk [8], Node2vec [36], SDNE [43] and Struc2vec [35]. In preliminary experiments, we evaluated these methods and found that the Average operator and an embedding size of 64 give the best results. We will use these models in 2 setups: (a) Avg(M,M) – using only the averaged node features, (b) Avg(M,M)⊕F – as before, but concatenated with the edge features from the dataset (in total 324-dimensional vectors). We also checked a scheme that computes a 64-dimensional PCA reduction of the concatenated features, to have vector sizes comparable with the 64-dimensional embedding of our model, but it turned out to perform poorly. Note that SDNE has the capability of inductive reasoning, but because no such implementation is available, we decided to evaluate this method in the transductive scheme (which works in favor of the method).

Figure 5: Architecture of the MLP(M,M).

Figure 6: Architecture of the MLP(M,M,F).

We also extend our body of baselines with more sophisticated approaches – two dense autoencoder architectures.
In the first setting, MLP(M,M), we train a model (see Figure 5) which reconstructs the concatenated embeddings of connected nodes. In the second baseline, MLP(M,M,F), the autoencoder (see Figure 6) is extended by edge attributes. In both settings, we employ the mean squared error as the model loss function. The output of the encoders (embeddings) is used in the downstream tasks. The input node embeddings are obtained using the methods mentioned above, i.e., DeepWalk, Node2vec, SDNE, and Struc2vec.

The last baseline is Line2vec [22], which is directly dedicated to edges; we use an embedding size of 64.

4.3. Edge classification

To evaluate our model in an inductive setting, we need to make sure that the test edges are unseen during the model training procedure, so we remove them from the graph. Note that all baselines (except for GraphSage, see Table 1) require all edges during the training phase (i.e., these are transductive methods).

After each training epoch of AttrE2vec, we evaluate the embeddings using an L2-regularized Logistic Regression (LR) classifier and compute the AUC. The regression model is trained on edge embeddings from the train set and evaluated on edge embeddings from the validation set. We take the model with the highest AUC value on the validation set.

Table 3: AUC values for edge classification. F denotes the edge attributes (also referred to as "Doc2vec"), M – node attributes (e.g., embeddings computed using "Node2vec"), ⊕ – concatenation operator, Avg(M,M) – average operator on node embeddings, MLP(·) – encoder output of an MLP autoencoder trained on the given attributes. AUC in bold shows the highest value and AUC in italic the second highest value.
| Setting | Method group | Name | Vector size | AUC Citeseer | AUC Cora | AUC Pubmed |
|---|---|---|---|---|---|---|
| Transductive | Edge features only | F (Doc2vec) | 260 | 86.13 ± 0.95 | 88.67 ± 0.51 | 79.15 ± 1.41 |
| Transductive | Line2vec | | 64 | 86.19 ± 0.28 | 91.75 ± 1.07 | 84.88 ± 1.19 |
| Transductive | Avg(M,M) | DeepWalk | 64 | 58.40 ± 1.08 | 59.98 ± 1.32 | 51.04 ± 1.23 |
| Transductive | Avg(M,M) | Node2vec | 64 | 58.26 ± 0.89 | 59.59 ± 1.11 | 51.03 ± 1.01 |
| Transductive | Avg(M,M) | SDNE | 64 | 54.28 ± 1.57 | 55.91 ± 1.11 | 50.00 ± 0.00 |
| Transductive | Avg(M,M) | Struc2vec | 64 | 61.29 ± 0.86 | 61.30 ± 1.58 | 54.67 ± 1.46 |
| Transductive | MLP(M,M) | DeepWalk | 64 | 55.88 ± 1.68 | 57.87 ± 1.53 | 51.23 ± 0.77 |
| Transductive | MLP(M,M) | Node2vec | 64 | 55.35 ± 2.26 | 57.44 ± 0.87 | 51.48 ± 1.55 |
| Transductive | MLP(M,M) | SDNE | 64 | 55.56 ± 0.93 | 56.02 ± 1.22 | 50.00 ± 0.00 |
| Transductive | MLP(M,M) | Struc2vec | 64 | 59.93 ± 1.43 | 59.76 ± 1.80 | 53.27 ± 1.32 |
| Transductive | Avg(M,M)⊕F | DeepWalk | 324 | 86.13 ± 0.95 | 88.67 ± 0.51 | 79.15 ± 1.41 |
| Transductive | Avg(M,M)⊕F | Node2vec | 324 | 86.13 ± 0.95 | 88.67 ± 0.51 | 79.15 ± 1.41 |
| Transductive | Avg(M,M)⊕F | SDNE | 324 | 86.14 ± 1.03 | 88.70 ± 0.51 | 79.15 ± 1.41 |
| Transductive | Avg(M,M)⊕F | Struc2vec | 324 | 86.21 ± 0.97 | 88.73 ± 0.48 | 79.24 ± 1.36 |
| Transductive | MLP(M,M,F) | DeepWalk | 64 | 84.58 ± 1.11 | 86.47 ± 0.87 | 78.60 ± 1.84 |
| Transductive | MLP(M,M,F) | Node2vec | 64 | 84.65 ± 1.05 | 86.71 ± 0.68 | 78.84 ± 1.71 |
| Transductive | MLP(M,M,F) | SDNE | 64 | 84.32 ± 1.13 | 85.99 ± 0.77 | 78.34 ± 1.07 |
| Transductive | MLP(M,M,F) | Struc2vec | 64 | 83.95 ± 1.16 | 85.54 ± 0.96 | 77.19 ± 1.42 |
| Inductive | Avg(M,M) | GraphSage | 64 | 54.84 ± 1.90 | 55.16 ± 1.36 | 51.14 ± 1.64 |
| Inductive | MLP(M,M) | GraphSage | 64 | 55.19 ± 1.04 | 55.47 ± 1.66 | 50.36 ± 1.54 |
| Inductive | Avg(M,M)⊕F | GraphSage | 324 | 86.14 ± 0.95 | 88.68 ± 0.51 | 79.16 ± 1.41 |
| Inductive | MLP(M,M,F) | GraphSage | 64 | 84.63 ± 1.11 | 86.14 ± 0.45 | 78.00 ± 1.85 |
| Inductive | AttrE2vec (our) | Avg | 64 | 88.97 ± 0.82 | 93.43 ± 0.56 | 87.68 ± 1.25 |
| Inductive | AttrE2vec (our) | Exp | 64 | 88.91 ± 1.10 | 92.80 ± 0.38 | 86.18 ± 1.41 |
| Inductive | AttrE2vec (our) | GRU | 64 | 88.92 ± 1.13 | 93.06 ± 0.63 | 86.39 ± 1.21 |
| Inductive | AttrE2vec (our) | ConcatGRU | 64 | 88.56 ± 1.34 | 92.93 ± 0.61 | 86.34 ± 1.18 |

Moreover, an early stopping strategy is implemented: if the validation AUC metric does not improve for more than 15 epochs, the learning is terminated. Our approach to model selection is aligned with the schema proposed in [44], because this approach is more natural than relying on the loss function. This is repeated for all 10 data splits (see Section 4.1 for details).
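The model selection rule with early stopping can be sketched independently of the model itself. In the minimal example below, the function name `select_best_epoch` and the score trace are hypothetical; in a real run the iterable would train one epoch, fit the logistic regression on the train embeddings and yield the validation AUC.

```python
from typing import Iterable, Tuple


def select_best_epoch(
    val_auc_per_epoch: Iterable[float], patience: int = 15
) -> Tuple[int, float, int]:
    """Return (best_epoch, best_auc, last_epoch_run).

    Training stops once the validation AUC has not improved
    for more than `patience` consecutive epochs.
    """
    best_epoch, best_auc, last_epoch = -1, float("-inf"), -1
    for epoch, auc in enumerate(val_auc_per_epoch):
        last_epoch = epoch
        if auc > best_auc:
            best_epoch, best_auc = epoch, auc  # keep this checkpoint
        elif epoch - best_epoch > patience:
            break  # no improvement for more than `patience` epochs
    return best_epoch, best_auc, last_epoch


# Hypothetical validation AUC trace: improves once, then plateaus.
trace = [0.50, 0.80] + [0.70] * 30
print(select_best_epoch(trace))  # (1, 0.8, 17)
```

Because the function consumes its input lazily, passing a generator that performs the actual training means no further epochs are computed after the stop condition is met.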
We report the mean and standard deviation of the AUC over the 10 test sets (see Table 3).

We choose AdamW [45] with a learning rate of 0.001 to optimize our model's parameters. We also set the number of positive samples to $|h^{+}| = 5$ and of negative samples to $|h^{-}| = 10$ in the cosine embedding loss. The mixing coefficient is set to $\lambda = 0.5$, equally including the influence of features and topological graph structure. We choose an embedding size of 64 as a reasonable value while dealing with edge features of size 260.

In Table 3, we summarize the AUC values for the baseline methods and for our model. Even though the vectors' original dimensionality is relatively high (260), good results are already yielded using only the edge features (Doc2vec). However, adding structural information about the graph could further improve the results.

Representations from node embedding methods, transformed into edge embeddings using the average operator Avg(M,M), achieve poor results of about 50-60% AUC. However, if these are combined with the edge features from the datasets, Avg(M,M)⊕F, the AUC values increase significantly to about 86%, 88% and 79% for Citeseer, Cora, and Pubmed, respectively. Unfortunately, this results in an even higher vector dimensionality (324).

The MLP-based approaches lead to similar conclusions. Using only node embeddings, MLP(M,M), we achieve quite poor results of about 50% (on Pubmed) up to 60% (on Cora). With the MLP(M,M,F) approach, we observe that edge features improve the classification results. The AUC values are still slightly worse than with the concatenation operator (Avg(M,M)⊕F), but we can reduce the edge embedding size to 64.

The Line2vec [22] algorithm achieves very good results without considering edge feature information: we get about 86%, 92% and 85% AUC for Citeseer, Cora, and Pubmed, respectively. These values are higher than for any other baseline approach.

Our model performs the best among all evaluated methods.
For Citeseer, we gain about 3 percentage points (pp) compared to the best baselines: Line2vec, Struc2vec (Avg(M,M)⊕F) or GraphSage (Avg(M,M)⊕F). Note that the algorithm is trained on only 140 edges in the inductive setting, whereas all transductive baselines require the whole graph for training. The gains on Cora are 2 pp, and on Pubmed we achieve up to 4 pp (and up to 8 pp compared only to GraphSage (Avg(M,M)⊕F)). Our model with the Average (Avg) aggregator works the best, whereas the Gated Recurrent Unit (GRU) aggregator achieves the second-best results.

4.4. Edge clustering

Similarly to Line2vec [22], we apply the K-Means++ algorithm to the resulting embedding vectors and compute an unsupervised clustering accuracy [46]. We summarize the results in Table 4. Our model performs the best in all but one case and achieves significantly better results than the other baseline methods. The only exception is the Pubmed dataset, where Line2vec achieves the best clustering accuracy. The other baseline methods perform similarly as in the edge classification task. Hence, we do not discuss them in detail, and we encourage the reader to go through the detailed results.

4.5. Embedding visualization

For all tested baseline methods and our proposed AttrE2vec method, we compute 2-dimensional projections of the produced embeddings using the T-SNE [47] method. We visualize them in Figure 7. In our subjective opinion, these plots correspond to the AUC scores reported in Table 3: the higher the AUC, the better the group separation. In detail, the raw Doc2vec edge features seem to form groups, which unfortunately overlap to some degree. We cannot observe any pattern in the node embedding-based settings (Avg(M,M) and MLP(M,M)); they tend to be quasi-random. When these are concatenated with the edge attributes (Avg(M,M)⊕F and MLP(M,M,F)), we observe a slightly better grouping, but it is still not satisfactory. The AttrE2vec model produces much better-formed groups, with only a little overlap.
To summarize, based on the observed groups' separability and AUC metrics, our approach works the best among all methods.

Figure 7: 2-D T-SNE projections of embedding vectors for all evaluated methods. Columns denote the aggregation approach, except for F, which denotes the edge attributes, and g(E), which is an edge embedding obtained from the graph structure only. Rows gather particular methods.

Table 4: Accuracy on edge clustering. F denotes the edge attributes (also referred to as "Doc2vec"), M – node attributes (e.g., embeddings computed using "Node2vec"), ⊕ – concatenation operator, Avg(M,M) – average operator on node embeddings, MLP(·) – encoder output of an MLP autoencoder trained on the given attributes. Accuracy in bold shows the highest value and accuracy in italic the second highest value.

| Setting | Method group | Name | Vector size | Accuracy Citeseer | Accuracy Cora | Accuracy Pubmed |
|---|---|---|---|---|---|---|
| Transductive | Edge features only | F (Doc2vec) | 260 | 54.13 ± 2.73 | 54.64 ± 5.86 | 46.33 ± 1.53 |
| Transductive | Line2vec | | 64 | 54.73 ± 2.56 | 63.50 ± 1.92 | 55.26 ± 1.36 |
| Transductive | Avg(M,M) | DeepWalk | 64 | 28.89 ± 1.06 | 21.93 ± 0.86 | 27.24 ± 0.50 |
| Transductive | Avg(M,M) | Node2vec | 64 | 26.82 ± 0.67 | 21.32 ± 0.62 | 27.17 ± 0.74 |
| Transductive | Avg(M,M) | SDNE | 64 | 21.01 ± 0.50 | 17.97 ± 0.47 | 31.38 ± 0.69 |
| Transductive | Avg(M,M) | Struc2vec | 64 | 25.21 ± 1.33 | 20.15 ± 0.64 | 32.02 ± 1.49 |
| Transductive | MLP(M,M) | DeepWalk | 64 | 26.36 ± 1.37 | 21.06 ± 0.57 | 27.40 ± 0.93 |
| Transductive | MLP(M,M) | Node2vec | 64 | 26.37 ± 1.64 | 21.31 ± 0.98 | 27.67 ± 0.78 |
| Transductive | MLP(M,M) | SDNE | 64 | 22.27 ± 0.76 | 17.15 ± 0.36 | 28.44 ± 1.21 |
| Transductive | MLP(M,M) | Struc2vec | 64 | 24.22 ± 0.83 | 19.56 ± 0.49 | 31.31 ± 1.70 |
| Transductive | Avg(M,M)⊕F | DeepWalk | 324 | 54.13 ± 2.73 | 54.70 ± 5.85 | 46.33 ± 1.53 |
| Transductive | Avg(M,M)⊕F | Node2vec | 324 | 54.13 ± 2.73 | 54.70 ± 5.85 | 46.33 ± 1.53 |
| Transductive | Avg(M,M)⊕F | SDNE | 324 | 55.29 ± 2.06 | 55.43 ± 4.63 | 46.33 ± 1.53 |
| Transductive | Avg(M,M)⊕F | Struc2vec | 324 | 55.59 ± 1.51 | 52.47 ± 6.52 | 46.32 ± 1.29 |
| Transductive | MLP(M,M,F) | DeepWalk | 64 | 48.74 ± 4.03 | 47.38 ± 4.72 | 46.49 ± 1.20 |
| Transductive | MLP(M,M,F) | Node2vec | 64 | 50.80 ± 2.30 | 48.48 ± 3.38 | 46.15 ± 1.43 |
| Transductive | MLP(M,M,F) | SDNE | 64 | 46.17 ± 3.15 | 44.87 ± 3.54 | 45.74 ± 1.89 |
| Transductive | MLP(M,M,F) | Struc2vec | 64 | 47.35 ± 3.73 | 44.38 ± 3.04 | 45.40 ± 1.72 |
| Inductive | Avg(M,M) | GraphSage | 64 | 18.79 ± 0.62 | 17.70 ± 1.05 | 27.04 ± 0.71 |
| Inductive | MLP(M,M) | GraphSage | 64 | 18.92 ± 0.98 | 17.89 ± 0.85 | 27.09 ± 0.81 |
| Inductive | Avg(M,M)⊕F | GraphSage | 324 | 54.06 ± 2.54 | 54.82 ± 6.86 | 46.49 ± 1.64 |
| Inductive | MLP(M,M,F) | GraphSage | 64 | 48.79 ± 4.04 | 47.49 ± 5.41 | 45.15 ± 1.54 |
| Inductive | AttrE2vec (our) | Avg | 64 | 59.82 ± 3.30 | 65.42 ± 1.71 | 48.86 ± 2.46 |
| Inductive | AttrE2vec (our) | Exp | 64 | 59.07 ± 4.65 | 66.36 ± 3.62 | 48.02 ± 2.55 |
| Inductive | AttrE2vec (our) | GRU | 64 | 60.16 ± 2.25 | 66.15 ± 3.71 | 49.41 ± 1.49 |
| Inductive | AttrE2vec (our) | ConcatGRU | 64 | 60.71 ± 2.75 | 66.00 ± 2.21 | 50.27 ± 3.75 |

5. Hyperparameter Sensitivity of AttrE2vec

We investigate the effect of the hyperparameters by considering each of them independently, i.e., varying a given parameter and preserving the default values for all other parameters. The evaluation is performed for two inductive variants of our model: with the Average aggregator and with the GRU aggregator. We use all three datasets (Cora, Citeseer, Pubmed) and report the AUC values. We choose the following hyperparameter value sets (values with an asterisk denote the default value for that parameter):

• length of random walk: $L \in \{4, 8^{*}, 16\}$,
• number of random walks: $k \in \{4, 8, 16^{*}\}$,
• embedding size: $d \in \{16, 32, 64^{*}\}$,
• mixing parameter: $\lambda \in \{0, 0.25, 0.5^{*}, 0.75, 1\}$.

Figure 8: Effects of hyperparameters on Cora, Citeseer and Pubmed datasets.

The results of all experiments are summarized in Figure 8. We observe that for both aggregation variants, Avg and GRU, the trends are similar, so we include and discuss only the results for the Average aggregator.

In general, the higher the number of random walks $k$ and the length of a single random walk $L$, the better the results. One may choose higher values of these parameters, but this significantly increases the random walk computation time and the model training time.

Unsurprisingly, the embedding size (embedding dimension) follows the same trend. With more dimensions, we can fit more information into the created representations. However, as the goal of embedding is to find low-dimensional vector representations, we should keep the dimensionality reasonable. Our chosen values (16, 32, 64) seem plausible while working with 260-dimensional edge features.
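The one-parameter-at-a-time protocol of this section can be written down as a small configuration generator. The dictionary keys and the function name below are hypothetical; the value sets and defaults are the ones listed above.

```python
from typing import Dict, Iterator, List, Tuple

# Default values (marked with an asterisk in the text).
DEFAULTS: Dict[str, float] = {
    "walk_length": 8,       # L
    "num_walks": 16,        # k
    "embedding_size": 64,   # d
    "mixing": 0.5,          # lambda
}

# Value sets explored for each hyperparameter.
GRID: Dict[str, List[float]] = {
    "walk_length": [4, 8, 16],
    "num_walks": [4, 8, 16],
    "embedding_size": [16, 32, 64],
    "mixing": [0.0, 0.25, 0.5, 0.75, 1.0],
}


def one_at_a_time(
    defaults: Dict[str, float], grid: Dict[str, List[float]]
) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (varied_parameter, config) pairs, changing one parameter per config."""
    for name, values in grid.items():
        for value in values:
            config = dict(defaults)  # copy, so the defaults stay untouched
            config[name] = value
            yield name, config


configs = list(one_at_a_time(DEFAULTS, GRID))
print(len(configs))  # 14
```

The 14 configurations include the all-default setting once per parameter; a full grid over the same value sets would instead require 3 · 3 · 3 · 5 = 135 training runs per dataset and aggregator.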
As for the loss mixing parameter $\lambda$, we observe that values that are too high negatively influence the model performance. The greater the value, the more critical the structural loss becomes. Simultaneously, the feature loss becomes less relevant. Choosing $\lambda = 0$ causes the loss function to consider feature reconstruction only and to completely ignore the embedding loss. This yields significantly worse results and confirms that our approach of combining both feature reconstruction and structural embedding loss is justified. In general, the best values are achieved when both loss factors have an equal influence ($\lambda = 0.5$).

6. Ablation study

Figure 9: AttrE2vec performance for various noise levels $p$ and mixing parameter values $\lambda \in \{0, 0.5, 1\}$.

Figure 10: 2-D representations of ideal and noisy graph edges using AttrE2vec with $\lambda \in \{0, 0.5, 1\}$.

We performed an ablation study to check whether our method AttrE2vec is invariant to noise introduced in an artificially generated network. We use a barbell graph, which consists of two fully connected graphs and a path which connects them (see Figure 1). The graph has seven nodes in each full graph and seven nodes in the path, giving a total of 50 edges. Next, we generate features from 3 clusters in a 200-dimensional space using isotropic Gaussian blobs. We assign the features to 3 parts of the graph: the first to the edges in one of the full graphs, the second to the edges in the path and the third to the edges in the other full graph. The edge classes match the feature clusters (i.e., three classes). Therefore, the structure is aligned with the features, so any good structure-based embedding method can fit this data very well (see Figure 1). A problem occurs when the features (and hence the classes) are shuffled within the graph structure. Methods that employ only a structural loss function will fail.
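The synthetic setup can be reproduced with the standard library alone. The sketch below builds the barbell graph edge by edge and draws isotropic Gaussian features around one random center per graph part; the function names, the seed, the center range and the unit standard deviation are assumptions made for illustration.

```python
import itertools
import random


def barbell_edges(clique_size=7, path_size=7):
    """Return (left clique edges, path edges, right clique edges) of a barbell graph."""
    left = list(range(clique_size))
    path = list(range(clique_size, clique_size + path_size))
    right = list(range(clique_size + path_size, 2 * clique_size + path_size))
    left_edges = list(itertools.combinations(left, 2))
    right_edges = list(itertools.combinations(right, 2))
    # The path is attached to one node of each clique.
    chain = [left[-1]] + path + [right[0]]
    path_edges = list(zip(chain, chain[1:]))
    return left_edges, path_edges, right_edges


def blob_features(parts, dim=200, spread=1.0, center_box=10.0, seed=0):
    """Draw one isotropic Gaussian blob per graph part; classes are numbered from 1."""
    rng = random.Random(seed)
    features, labels = {}, {}
    for label, edges in enumerate(parts, start=1):
        center = [rng.uniform(-center_box, center_box) for _ in range(dim)]
        for edge in edges:
            features[edge] = [rng.gauss(c, spread) for c in center]
            labels[edge] = label
    return features, labels


left, path, right = barbell_edges()
features, labels = blob_features([left, path, right])
print(len(left) + len(path) + len(right))  # 50
```

Each clique of seven nodes contributes 21 edges and the seven path nodes add 8 connecting edges, which gives the 50 edges stated in the text. Libraries such as NetworkX (`barbell_graph`) and scikit-learn (`make_blobs`) offer ready-made generators that could replace the hand-written helpers.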
We want to check how our model AttrE2vec, which includes both a structural and a feature-based loss, performs with different amounts of such noise.

We use the graph mentioned above and introduce noise by shuffling p% of all edge pairs which are from different classes, i.e., an edge with class 2 (originally located in the path) may be swapped with one from the full graphs (classes 1 or 3). We use our AttrE2vec model with an Average aggregator in the transductive setting (due to the graph size) and report the edge classification AUC for different values of $p \in \{0, 0.1, \ldots, 0.5, \ldots, 0.9, 1\}$ and $\lambda \in \{0, 0.5, 1\}$. The values of the mixing parameter $\lambda$ allow us to check how the model behaves when working only with a feature-based loss ($\lambda = 0$), only with a structural loss ($\lambda = 1$), and with both losses at equal importance ($\lambda = 0.5$). We train our model for five epochs and repeat the computations ten times for every $(p, \lambda)$ pair, due to the randomness of the shuffling procedure. We report the mean and standard deviation of the AUC value in Figure 9.

Using only the feature loss or a combination of both losses allows us to achieve nearly 100% AUC in the classification task. The fluctuations appear due to the low number of training epochs and the local optima problem. The performance of the model that uses only the structural loss ($\lambda = 1$) decreases with higher shuffling probabilities, and from a certain point it starts improving slightly, because the shuffling results in a complete swap of two classes, i.e., all features and classes from one graph part are exchanged with all features and classes from another part of the graph.

We also demonstrate how our method reacts to noisy data for various $\lambda \in \{0, 0.5, 1\}$. There are two graphs: one where the features are aligned with substructures of the graph and a second one with shuffled features (ca. 50%), see Figure 10. Keeping AttrE2vec at $\lambda = 0.5$ allows noisy graphs to be represented fairly well.

7. Conclusions and future work

We introduce AttrE2vec, a novel unsupervised and inductive embedding model that learns attributed edge embeddings by leveraging a self-attention network with an auto-encoder over the attribute space and a structural loss on aggregated random walks. AttrE2vec can directly aggregate feature information from edges and nodes many hops away to infer embeddings not only for present nodes, but also for new nodes. Extensive experimental results show that AttrE2vec obtains state-of-the-art results in edge classification and clustering on Cora, Pubmed and Citeseer.

Acknowledgments

The work was partially supported by the National Science Centre, Poland, grants No. 2016/21/D/ST6/02948 and 2016/23/B/ST6/01735, as well as by the Department of Computational Intelligence, Wrocław University of Science and Technology statutory funds.

References

[1] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, J. Leskovec, R. Barzilay, P. Battaglia, Y. Bengio, M. Bronstein, S. Günnemann, W. Hamilton, T. Jaakkola, S. Jegelka, M. Nickel, C. Re, L. Song, J. Tang, M. Welling, R. Zemel, Open graph benchmark: Datasets for machine learning on graphs (May 2020). arXiv:2005.00687. URL http://arxiv.org/abs/2005.00687
[2] D. Zhang, J. Yin, X. Zhu, C. Zhang, Network Representation Learning: A Survey, IEEE Transactions on Big Data 6 (1) (2018) 3–28. doi:10.1109/tbdata.2018.2850013.
[3] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A Comprehensive Survey on Graph Neural Networks, IEEE Transactions on Neural Networks and Learning Systems (2019) 1–21. doi:10.1109/TNNLS.2020.2978386.
[4] B. Li, D. Pi, Network representation learning: a systematic literature review, Neural Computing and Applications 32 (21) (2020) 16647–16679. doi:10.1007/s00521-020-04908-5.
[5] I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré, K. Murphy, Machine Learning on Graphs: A Model and Comprehensive Taxonomy (2020). URL http://arxiv.org/abs/2005.03675
[6] S. Bahrami, F. Dornaika, A.
Bosaghzadeh, Joint auto-weighted graph fusion and scalable semi-supervised learning, Information Fusion 66 (2021) 213–228.
[7] A. Grover, J. Leskovec, Node2vec: Scalable feature learning for networks, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 13-17-Augu, 2016, pp. 855–864. doi:10.1145/2939672.2939754.
[8] B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online Learning of Social Representations, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14, ACM Press, New York, New York, USA, 2014, pp. 701–710. doi:10.1145/2623330.2623732. URL http://dl.acm.org/citation.cfm?doid=2623330.2623732
[9] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, International Conference on Learning Representations, ICLR, 2017, pp. 1–14. arXiv:1609.02907. URL http://arxiv.org/abs/1609.02907
[10] Y. Dong, N. V. Chawla, A. Swami, Metapath2vec: Scalable representation learning for heterogeneous networks, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. Part F1296, ACM, New York, NY, USA, 2017, pp. 135–144. doi:10.1145/3097983.3098036. URL https://dl.acm.org/doi/10.1145/3097983.3098036
[11] S. Wang, V. V. Govindaraj, J. M. Górriz, X. Zhang, Y. Zhang, Covid-19 classification by fgcnet with deep feature fusion from graph convolutional network and convolutional neural network, Information Fusion 67 (2021) 208–229.
[12] A. García-Durán, M. Niepert, Learning graph representations with embedding propagation, in: Advances in Neural Information Processing Systems, Vol. 2017-Decem, 2017, pp. 5120–5131.
[13] W. L. Hamilton, R. Ying, J.
Leskovec, Inductive representation learning on large graphs, in: Advances in Neural Information Processing Systems, Vol. 2017-Decem, 2017, pp. 1025–1035.
[14] P. Veličković, A. Casanova, P. Liò, G. Cucurull, A. Romero, Y. Bengio, Graph attention networks, in: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, International Conference on Learning Representations, ICLR, 2018, pp. 1–12. arXiv:1710.10903.
[15] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 13-17-Augu, 2016, pp. 1225–1234. doi:10.1145/2939672.2939753.
[16] C. Yang, Z. Liu, D. Zhao, M. Sun, E. Y. Chang, Network representation learning with rich text information, in: IJCAI International Joint Conference on Artificial Intelligence, Vol. 2015-Janua, 2015, pp. 2111–2117.
[17] M. Liu, J.
Liu, Y. Chen, M. Wang, H. Chen, Q. Zheng, Ahng: Representation learning on attributed heterogeneous network, Information Fusion 50 (2019) 221–230, cited By :3. URL www.scopus.com [18] L. Lan, P. Wang, J. Zhao, J. Tao, J. Lui, X. Guan, Improving network embedding with partially available vertex and edge content, Information Sciences 512 (2020) 935–951. doi:10.1016/j.ins. 2019.09.083. [19] B. Li, D. Pi, Y. Lin, I. Khan, L. Cui, Multi-source information fusion based heterogeneous network embedding, Information Sciences 534 (2020) 53–71. doi:10.1016/j.ins.2020.05.012. [20] C. Zhang, D. Song, C. Huang, A. Swami, N. V. Chawla, Heterogeneous graph neural network, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2019, pp. 793–803. doi:10.1145/3292500.3330961. URL https://dl.acm.org/doi/10.1145/3292500.3330961 [21] H. Gao, H. Huang, Deep attributed network embedding, in: IJCAI International Joint Conference on Artificial Intelligence, Vol. 2018-July, 2018, pp. 3364–3370. doi:10.24963/ijcai.2018/467. [22] S. Bandyopadhyay, A. Biswas, N. Murty, R. Narayanam, Beyond node embedding: A direct unsu- pervised edge representation framework for homogeneous networks (2019). arXiv:1912.05140. [23] Y. Chen, T. Qian, Relation constrained attributed network embedding, Information Sciences 515 (2020) 341–351. doi:10.1016/j.ins.2019.12.033. [24] S. Bandyopadhyay, H. Kara, A. Kannan, M. N. Murty, FSCNMF: Fusing structure and content via non-negative matrix factorization for embedding information networks (2018). arXiv:1804.05313. [25] D. Nozza, E. Fersini, E. Messina, CAGE: Constrained deep Attributed Graph Embedding, Infor- mation Sciences 518 (2020) 56–70. doi:10.1016/j.ins.2019.12.082. [26] J. Kim, T. Kim, S. Kim, C. D. Yoo, Edge-labeling graph neural network for few-shot learning, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recogni- tion, Vol. 2019-June, 2019, pp. 11–20. 
arXiv:1905.01436, doi:10.1109/CVPR.2019.00010. [27] Q. Li, Z. Cao, J. Zhong, Q. Li, Graph representation learning with encoding edges, Neurocomputing 361 (2019) 29–39. doi:10.1016/j.neucom.2019.07.076. [28] L. Gong, Q. Cheng, Exploiting edge features for graph neural networks, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019, pp. 9203–9211. doi:10.1109/CVPR.2019.00943. [29] C. Aggarwal, G. He, P. Zhao, Edge classification in networks, in: 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016, Institute of Electrical and Electronics Engineers Inc., 2016, pp. 1038–1049. doi:10.1109/ICDE.2016.7498311. [30] M. Simonovsky, N. Komodakis, Dynamic edge-conditioned filters in convolutional neural networks on graphs, in: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Vol. 2017-Janua, 2017, pp. 29–38. doi:10.1109/CVPR.2017.11. [31] T. D. Bui, S. Ravi, V. Ramavajjala, Neural Graph Learning: Training Neural Networks Using Graphs, dl.acm.org 2018-Febua (2018) 64–71. doi:10.1145/3159652.3159731. [32] Y. Wang, Y. Sun, M. M. Bronstein, J. M. Solomon, Z. Liu, S. E. Sarma, Dynamic Graph CNN for Learning on Point Clouds, ACM Transactions on Graphics 38 (5) (2019) 146. doi:10.1145/3326362. [33] T. Wanyan, C. Zhang, A. Azad, X. Liang, D. Li, Y. Ding, Attribute2vec: Deep network embedding through multi-filtering GCN (apr 2020). arXiv:2004.01375. URL http://arxiv.org/abs/2004.01375 [34] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, LINE: Large-scale information network embedding, in: WWW 2015 - Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1067–1077. doi:10.1145/2736277.2741093. [35] L. F. Ribeiro, P. H. Saverese, D. R. Figueiredo, Struc2vec: Learning node representations from structural identity, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. Part F1296, 2017, pp. 385–394. 
doi:10.1145/3097983.3098061. [36] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2016, pp. 855–864. [37] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Net- 21 http://dx.doi.org/10.1145/2939672.2939753 www.scopus.com www.scopus.com www.scopus.com http://dx.doi.org/10.1016/j.ins.2019.09.083 http://dx.doi.org/10.1016/j.ins.2019.09.083 http://dx.doi.org/10.1016/j.ins.2020.05.012 https://dl.acm.org/doi/10.1145/3292500.3330961 http://dx.doi.org/10.1145/3292500.3330961 https://dl.acm.org/doi/10.1145/3292500.3330961 http://dx.doi.org/10.24963/ijcai.2018/467 http://arxiv.org/abs/1912.05140 http://dx.doi.org/10.1016/j.ins.2019.12.033 http://arxiv.org/abs/1804.05313 http://dx.doi.org/10.1016/j.ins.2019.12.082 http://arxiv.org/abs/1905.01436 http://dx.doi.org/10.1109/CVPR.2019.00010 http://dx.doi.org/10.1016/j.neucom.2019.07.076 http://dx.doi.org/10.1109/CVPR.2019.00943 http://dx.doi.org/10.1109/ICDE.2016.7498311 http://dx.doi.org/10.1109/CVPR.2017.11 http://dx.doi.org/10.1145/3159652.3159731 http://dx.doi.org/10.1145/3326362 http://arxiv.org/abs/2004.01375 http://arxiv.org/abs/2004.01375 http://arxiv.org/abs/2004.01375 http://arxiv.org/abs/2004.01375 http://dx.doi.org/10.1145/2736277.2741093 http://dx.doi.org/10.1145/3097983.3098061 http://arxiv.org/abs/1412.3555 http://arxiv.org/abs/1412.3555 works on Sequence Modeling (dec 2014). arXiv:1412.3555. URL http://arxiv.org/abs/1412.3555 [38] R. Kuprieiev, D. Petrov, R. Valles, P. Redzyński, C. da Costa-Luis, A. Schepanovski, I. Shcheklein, S. Pachhai, J. Orpinel, F. Santos, A. Sharma, Zhanibek, D. Hodovic, P. Rowlands, Earl, A. Grigorev, N. Dash, G. Vyshnya, maykulkarni, Vera, M. Hora, xliiv, W. Baranowski, S. Mangal, C. Wolff, nik123, O. Yoktan, K. Benoy, A. Khamutov, A. Maslakov, Dvc: Data version control - git for data & models (May 2020). 
doi:10.5281/zenodo.3859749. URL https://doi.org/10.5281/zenodo.3859749 [39] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (3) (2008) 93. doi:10.1609/aimag.v29i3.2157. URL https://ojs.aaai.org/index.php/aimagazine/article/view/2157 [40] G. Namata, B. London, L. Getoor, B. Huang, Query-driven Active Surveying for Collective Clas- sification, in: Proceedings ofthe Workshop on Mining and Learn- ing with Graphs, Edinburgh, Scotland, UK., 2012, pp. 1–8. [41] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: 31st International Conference on Machine Learning, ICML 2014, Vol. 4, 2014, pp. 2931–2939. arXiv:1405.4053. URL http://arxiv.org/abs/1405.4053 [42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret- tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. [43] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, ACM, New York, NY, USA, 2016, pp. 1225–1234. doi:10.1145/2939672.2939753. URL http://doi.acm.org/10.1145/2939672.2939753 [44] D. Q. Nguyen, T. D. Nguyen, D. Phung, A self-attention network based node embedding model (jun 2020). arXiv:2006.12100. URL http://arxiv.org/abs/2006.12100 [45] I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization (nov 2017). arXiv:1711.05101. URL http://arxiv.org/abs/1711.05101 [46] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: M. F. Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine Learning, Vol. 48 of Proceedings of Machine Learning Research, PMLR, New York, New York, USA, 2016, pp. 478–487. 
URL http://proceedings.mlr.press/v48/xieb16.html [47] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html 22 View publication statsView publication stats http://arxiv.org/abs/1412.3555 http://arxiv.org/abs/1412.3555 http://arxiv.org/abs/1412.3555 http://arxiv.org/abs/1412.3555 https://doi.org/10.5281/zenodo.3859749 https://doi.org/10.5281/zenodo.3859749 http://dx.doi.org/10.5281/zenodo.3859749 https://doi.org/10.5281/zenodo.3859749 https://ojs.aaai.org/index.php/aimagazine/article/view/2157 https://ojs.aaai.org/index.php/aimagazine/article/view/2157 http://dx.doi.org/10.1609/aimag.v29i3.2157 https://ojs.aaai.org/index.php/aimagazine/article/view/2157 http://arxiv.org/abs/1405.4053 http://arxiv.org/abs/1405.4053 http://arxiv.org/abs/1405.4053 http://doi.acm.org/10.1145/2939672.2939753 http://dx.doi.org/10.1145/2939672.2939753 http://doi.acm.org/10.1145/2939672.2939753 http://arxiv.org/abs/2006.12100 http://arxiv.org/abs/2006.12100 http://arxiv.org/abs/2006.12100 http://arxiv.org/abs/1711.05101 http://arxiv.org/abs/1711.05101 http://arxiv.org/abs/1711.05101 http://proceedings.mlr.press/v48/xieb16.html http://proceedings.mlr.press/v48/xieb16.html http://www.jmlr.org/papers/v9/vandermaaten08a.html http://www.jmlr.org/papers/v9/vandermaaten08a.html https://www.researchgate.net/publication/348079131 1 Introduction 2 Related work and Research Gap 3 Method 3.1 Motivation 3.2 Attributed graph edge embedding 3.3 AttrE2vec 3.4 Aggregation models 3.5 Learning AttrE2vec's parameters 4 Experiments 4.1 Datasets 4.2 Baselines 4.3 Edge classification 4.4 Edge clustering 4.5 Embedding visualization 5 Hyperparameter Sensitivity of AttrE2vec 6 Ablation study 7 Conclusions and future work
bahnemann-transforming-2021 ---- Transforming Metadata into Linked Data to Improve Digital Collection Discoverability: A CONTENTdm Pilot Project

OCLC RESEARCH REPORT

Greta Bahnemann, Minnesota Digital Library
Michael Carroll, Temple University Libraries
Paul Clough, University of Miami Libraries
Mario Einaudi, The Huntington Library, Art Museum, and Botanical Gardens
Chatham Ewing, Cleveland Public Library
Jeff Mixter, OCLC Research
Jason Roy, Minnesota Digital Library
Holly Tomren, Temple University Libraries
Bruce Washburn, OCLC Research
Elliot Williams, University of Miami Libraries

© 2021 OCLC. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/

January 2021
OCLC Research
Dublin, Ohio 43017 USA
www.oclc.org

ISBN: 978-1-55653-185-9
DOI: 10.25333/fzcv-0851
OCLC Control Number: 1230259668

ORCID iDs
Greta Bahnemann, Minnesota Digital Library: https://orcid.org/0000-0002-5823-7217
Michael Carroll, Temple University Libraries: https://orcid.org/0000-0003-3736-0678
Paul Clough, University of Miami Libraries: https://orcid.org/0000-0001-6939-2805
Mario Einaudi, The Huntington Library, Art Museum, and Botanical Gardens: https://orcid.org/0000-0002-6859-594X
Chatham Ewing, Cleveland Public Library: https://orcid.org/0000-0002-8402-0652
Jeff Mixter, OCLC Research: https://orcid.org/0000-0002-8411-2952
Jason Roy, Minnesota Digital Library: https://orcid.org/0000-0002-3644-1970
Holly Tomren, Temple University Libraries: https://orcid.org/0000-0002-6062-1138
Bruce Washburn, OCLC Research: http://orcid.org/0000-0003-4396-7345
Elliot Williams, University of Miami Libraries: https://orcid.org/0000-0001-6925-7144

Please direct correspondence to: OCLC Research, oclcresearch@oclc.org
Suggested citation: Bahnemann, Greta, Michael Carroll, Paul Clough, Mario Einaudi, Chatham Ewing, Jeff Mixter, Jason Roy, Holly Tomren, Bruce Washburn, and Elliot Williams. 2021. Transforming Metadata into Linked Data to Improve Digital Collection Discoverability: A CONTENTdm Pilot Project. Dublin, OH: OCLC Research. https://doi.org/10.25333/fzcv-0851.

CONTENTS

Acknowledgments
Executive Summary
Introduction
Three-Phase Project Plan
  Phase 1: Mapping textual metadata to entities
  Phase 2: Tools for managing metadata in Wikibase
  Phase 3: Wikibase entities drive discovery
The Wikibase Environment
Developing A Data Model
  Describing the “type” of a creative work at three levels
  Distinguishing between instances of concepts and ontological classes
  Managing the data model in Wikibase
  Managing source metadata outside of the data model
Gathering and Transforming Metadata
  Selecting and analyzing collections from pilot partner CONTENTdm sites
  Optimizing tools and workflows for reconciliation and transformation
  Adding related entities to the CONTENTdm Wikibase from external sources
    Creating entities in advance for anticipated matches
    Testing an alternative OpenRefine reconciliation endpoint
    Creating placeholder entities for things that could not be reconciled
Representing Compound Objects
Syndicating Data in Standard Schemas
Wikibase Ecosystem Advantages
  Implementing authority control
  Decreasing cataloging inefficiencies, increasing descriptive quality
  Generating data visualizations
User Interface Extensions
  MediaWiki gadgets
    Adding the Mirador viewer
    Showing contextual information from Wikidata
    Contextual Data and Image from DBPedia and Wikimedia Commons Embedded in the Wikibase User Interface
    Revealing constraint violations
  CONTENTdm custom pages
    Embedding Schema.org JSON-LD in CONTENTdm pages
    Showing contextual information for headings based on Wikibase data
New Applications
  The Image Annotator
    User study results
  The Retriever
  The Describer
  The Explorer and the Transportation Hub
  The Field Analyzer
Cohort Communication
Partner Reflections
  Cleveland Public Library
  The Huntington Library, Art Museum, And Botanical Gardens
  Minnesota Digital Library
    Invitation
    Development of three tools by OCLC
    Leveraging the power of linked data
    Concluding thoughts
  Temple University Libraries
  University of Miami Libraries
Key Findings and Conclusions
  Testing the linked data value proposition
  Evaluating a shared data model
  Selecting and transforming metadata
  Continuing the journey to linked data
  Working partnerships represent strength in numbers
Notes

FIGURES

FIGURE 1 Planned project phases
FIGURE 2 The Wikibase Ecosystem
FIGURE 3 A CONTENTdm class hierarchy data model
FIGURE 4 Example type, classification used, and process or format properties and values for a description of a postcard
FIGURE 5 A depicts statement for the concept of “Dogs”
FIGURE 6 A type classification of “dog” for a specific dog
FIGURE 7 The “dog” class is defined by the concept of “Dogs”
FIGURE 8 Wikibase templates for proposing new properties
FIGURE 9 Unmapped CONTENTdm metadata displayed in the Wikibase user interface using a Gadget extension
FIGURE 10 Wikibase Discussion page for a collection review
FIGURE 12 A “placeholder” entity for a person without an established identity
FIGURE 13 Example “has creative work part” statements and sequencing for the first four parts of an album
FIGURE 14 Other names associated with the Los Angeles Dodgers entity
FIGURE 15 First parts of the description of Jasper Wood
FIGURE 16 SPARQL Query map visualization of places depicted in works from a collection
FIGURE 17 Mirador image viewer embedded in the Wikibase user interface
FIGURE 18 Contextual data and image from DBPedia and Wikimedia Commons embedded in the Wikibase user interface
FIGURE 19 A constraint violation indicating that the “occupation” property should only be used for instances of the type “person”
FIGURE 20 Schema.org data evaluated using Google’s Structured Data Testing Tool
FIGURE 21 Additional contextual information displayed in CONTENTdm based on entity descriptions in the pilot Wikibase
FIGURE 22 Image Annotator initial view of an image and subjects
FIGURE 23 Image Annotator cropping an image of a person
FIGURE 24 Image Annotator after adding more depicted subjects
FIGURE 25 Wikibase item updated with illustrated depicts statements
FIGURE 26 Retriever search results from Wikidata, VIAF, and FAST for “Lake Vermilion”
FIGURE 27 Retriever entity editor
FIGURE 28 Wikibase entity created by the Retriever
FIGURE 29 Editing essential details for an entity in the Describer
FIGURE 30 Explorer home page
FIGURE 31 Explorer Transportation Hub and related collections
FIGURE 32 Explorer search results for “strike”
FIGURE 33 Explorer view of a truck bringing employees home during a PTC walkout
FIGURE 34 Explorer view of a protest against the Philadelphia Transportation Company
FIGURE 35 Explorer view of an 1899 Cleveland transit strike in Public Square
FIGURE 36 Explorer view of streetcars parked on the street during a transit strike
FIGURE 37 Field Analyzer field usage chart
FIGURE 38 Field Analyzer list of field values

ACKNOWLEDGMENTS

The OCLC CONTENTdm Linked Data Pilot project team consisted of the following OCLC staff: Hanning Chen, Eric Childress, Shane Huddleston, Jeff Mixter, Mercy Procaccini, and Bruce Washburn.

The Linked Data project team wishes to thank the project partners who enthusiastically and generously collaborated with us in this endeavor. Your vision for and commitment to a linked data future have been illuminating and inspiring. OCLC particularly appreciates the efforts of those who contributed to or co-authored this report:
• Cleveland Public Library: Chatham Ewing, Rachel Senese, Amia Wheatley
• The Huntington Library, Art Museum, and Botanical Gardens: Mario Einaudi
• Minnesota Digital Library: Greta Bahnemann, Jolie Graybill, Jason Roy
• Temple University Libraries: Michael Carroll, Stefanie Ramsay, Holly Tomren
• University of Miami Libraries: Paul Clough, Elliot Williams

The team also acknowledges the consultation, guidance, and support provided by our OCLC colleagues: Dave Collins, Rachel Frick, Marti Heyman, Erik Mayer, Carolyn Morgan, Andrew Pace, Taylor Surface, and Diane Vizine-Goetz.
Thank you to Jeanette McNicol for the excellent design of this report and to Erica Melko for her skillful editing.

EXECUTIVE SUMMARY

In the CONTENTdm Linked Data Pilot project, OCLC partnered with institutions that manage their digital collections with OCLC’s CONTENTdm service to investigate methods for, and the feasibility of, transforming metadata into linked data to improve the discoverability and management of digitized cultural materials and their descriptions. This report, Transforming Metadata into Linked Data to Improve Digital Collection Discoverability, describes the course of the project and its primary areas of investigation and summarizes key findings and conclusions generated by the collaborative study.

The project was designed to help the OCLC team and the pilot participants better understand the following questions:
• How divergent are the descriptive data practices across the institutions using CONTENTdm, and what tools are needed to make that assessment?
• Can a shared and extensible data model be developed to support the differing needs and demands for a range of material types and institution types?
• What is the right mix of human attention and automation to effectively reconcile metadata headings to linked data entities?
• What types of tools can help extend the description of cultural materials to subject matter experts?
• After metadata from different institutions and collections is transformed, are there new discovery tools that can help researchers find new, or previously hidden, connections through a centralized discovery system?
• What are the institutional and individual interests in the paradigm shift of moving to linked data?
Five organizations representing a cross-section of different types of institutions (The Huntington Library, Art Museum, and Botanical Gardens; the Cleveland Public Library; the Minnesota Digital Library; Temple University Libraries; and University of Miami Libraries) participated in the project. The pilot focused on developing efficient workflows for transforming metadata, evaluating existing interfaces to leverage linked data, and testing applications built in the Wikibase environment for managing the newly created linked data.

Over the course of the pilot, the project team and partners observed improved metadata management and discovery in action and reflected on the potential benefits: higher-quality and richer metadata can be managed with greater efficiency by staff, and linked data can be used to add contextual information and to create a network of connections that better reflects knowledge in the real world. This context and these connections can help researchers achieve a fuller understanding of collection materials, inviting increased engagement and use by community members.

While the pilot project findings are based on a limited set of institutions and collections, they strongly suggest that there is significant potential for improved discovery and more efficient data management when the materials that have been digitized are described using a shared data model, when headings are associated with linked data entities and relationships, and when the entities and relationships are brought together into a single aggregation.

An overarching question driving the linked data project was, for a paradigm shift of this magnitude, how can the foundational changes be made more scalable, affordable, and sustainable?
The project showed that the scope and magnitude of the effort required to completely analyze, transform, and reconcile all current descriptive metadata into consistently modeled linked data is beyond the reach of a single centralized agency. It will require substantial and shared resource commitments from a decentralized community of practitioners who will need to be supplied with easily accessible tools and workflows for carrying out the transition. Evidence gathered during the project and detailed in this report about data modeling, metadata reconciliation, and data analysis provides new knowledge about how these tools and workflows could be designed and used.

INTRODUCTION

The CONTENTdm[1] Linked Data Pilot project (also referred to throughout this report as the “Linked Data project”) is the latest (as of 2020) in a series of investigations[2] that OCLC has organized and led over several years in the interest of developing a shared understanding of how libraries, archives, and museums can make the transition to linked data. OCLC works in partnership with these institutions to increase researchers’ ability to discover, evaluate, and use digitized cultural materials, principally through its support of the CONTENTdm service for building, preserving, and showcasing a library’s unique digital collections. This Linked Data project was focused on envisioning and evaluating the scalable and affordable systems and workflows that will be needed to produce rich linked data representations of entities and relationships, which will then help to make visible connections that were formerly invisible.
The project was grounded in the context of the linked data value proposition, which states that four best practices for publishing structured data on the web (using URIs, or Uniform Resource Identifiers, as names for things; using HTTP URIs so that people can look up those names; providing useful information, using standards, when someone looks up a URI; and including links to other URIs so that people can discover more things) lead to an interconnected global network of data that can serve both developers and researchers.[3]

Five organizations representing a cross-section of different types of institutions (The Huntington Library, Art Museum, and Botanical Gardens; the Cleveland Public Library; the Minnesota Digital Library; Temple University Libraries; and University of Miami Libraries) participated as partners in the project. The pilot participants collaborated with OCLC on a range of focused studies, including developing efficient workflows for transforming source metadata into linked data, evaluating CONTENTdm interface customizations to leverage linked data for discovery and syndication, and testing new applications built in the Wikibase environment for data retrieval, image annotation, editing, metadata analysis, and discovery.

This report describes the course of the CONTENTdm Linked Data Pilot project and its primary areas of investigation, shares the experiences of the five participating partner institutions, and summarizes key findings and conclusions generated by that collaborative study. The Linked Data project’s focus on sustainability and scalability posed many questions to pursue, including:

How divergent are the descriptive data practices across the institutions using CONTENTdm, and what tools are needed to make that assessment?
The large volume of cultural material descriptive metadata stored in CONTENTdm offered an excellent test bed for evaluating a large-scale transition to linked data.
Additionally, the outcomes and findings from OCLC’s Metadata Refinery project, completed in 2016, and its Project Passage[4] linked data prototype, completed in 2018, provided important insights into how to implement a system to facilitate the mapping, reconciliation, storage, and retrieval of structured data for unique digital materials. This pilot project built on those insights and successes. The sections below that describe the Wikibase[5] environment, the steps for gathering and transforming metadata, and the prototype “Field Analyzer” application highlight the challenges of applying this work at scale.

Can a shared and extensible data model be developed to support the differing needs and demands for a range of material types and institution types?
The wide variety of data models and descriptive practices currently used across CONTENTdm could be significantly easier for staff to manage if there were a shared data model available, and if that shared model could also support rich discovery for researchers in a single, aggregated discovery system. This project set out to develop a shared data model, building on existing standards but allowing for extensions as evidence surfaced in the source metadata for additional classes and relationships. The section below on developing the data model provides an overview and examples of the results of that work.

What is the right mix of human attention and automation to effectively reconcile metadata headings to linked data entities?
The project spent substantial time and effort on testing reconciliation workflows and prototyping new tools to make this work more efficient while maintaining quality. Prototyping a new metadata reconciliation endpoint helped us understand the potential for improving the performance of what can be a time-consuming automated process.
The development of the “Retriever” web application for finding related entities in other systems and transforming them into new Wikibase entities addressed a cataloger workflow stumbling block. Both prototypes are described below. What types of tools can help extend the description of cultural materials to subject matter experts? The project team developed—and the participants tested—an “Image Annotator” prototype application that could be used by either library staff or subject matter experts from outside the library to associate subject headings with depicted entities in images, envisioning how the transformed data along with new tools could open the door to more and richer descriptions from an engaged community. The description below of the Image Annotator includes a summary of its usability test results. After metadata from different institutions and collections is transformed, are there new discovery tools that can help researchers find new, or previously hidden, connections through a centralized discovery system? The “Explorer” prototype application, developed during the project and described below, demonstrated the ability to search across data from a range of repositories, with searching and faceting powered by entities derived from authority files and from vocabularies created by librarians. And the “Transportation Hub” virtual collection included in the Explorer gave the project team and participants a way to test linked data discovery in action, working with thematically related item descriptions that were supplied by a cross-section of institutions and collections and transformed into separate entities and relationships. Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 13 What are the institutional and individual interests in the paradigm shift of moving to linked data? The close collaboration between OCLC and the pilot project partners was one of the most rewarding aspects of the project. 
Given that most of the project was carried out as people and the organizations they work for were experiencing transformative disruptions to their lives and work as the 2020 COVID-19 pandemic began and unfolded, it was unclear at first what relative priority and attention the pilot could receive. But attention and participation from the participants, and support from OCLC, never wavered, and we mutually benefited greatly from the endeavor. Look to the following sections on cohort communication and the partner reflections for more insights and perspectives on the impact of this project and the partners’ first-hand views on the implications for our shared futures.

The CONTENTdm Linked Data Pilot project is another stage in a growing body of linked data research and development that OCLC has undertaken over the past decade. The findings of the project, detailed in this report, about data modeling, metadata reconciliation, and data analysis provide new knowledge about how these tools and workflows could be designed and used, which we anticipate will inform future linked data investigations and developments from the library, archives, and museum communities.

Three-Phase Project Plan

The pilot project was planned as a one-year effort to be carried out in three phases (figure 1) so that the project could address the most pressing questions first and allow for reconsideration and adjustments to the plan as it progressed:

• Phase 1: Concentrated on mapping metadata for digital collections to descriptions of related entities: works, people, organizations, places, concepts, and events.
Three partner institutions joined the project in Phase 1: The Huntington Library, Art Museum, and Botanical Gardens; the Cleveland Public Library; and the Minnesota Digital Library. 14 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability • Phase 2: Focused on a needs assessment and prototypes for managing metadata in the Wikibase environment. Two more partner institutions joined the project in Phase 2 after the OCLC team had developed a better understanding of the institutional support requirements, and to expand representation of materials from academic research libraries: Temple University Libraries and University of Miami Libraries. • Phase 3: Anticipated testing an end-user discovery experience based entirely on the data and tools developed within the Wikibase environment. CONTENTdm Linked Data Planned Project Phases FIGURE 1. Planned project phases.6 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/phase-diagram.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 15 PHASE 1: MAPPING TEXTUAL METADATA TO ENTITIES In the first phase, the plan was to focus on the systems and workflows needed to clean up, analyze, and reconcile CONTENTdm metadata for input into a linked data environment. Building on the project team’s prior experience with the environment in OCLC’s earlier linked data pilot, Project Passage, the Wikibase extension to the MediaWiki platform was selected as the project’s linked data environment. The Wikibase environment of related databases, indexes, and services is described in fuller detail below. In this phase, linked data was expected to be shown in the CONTENTdm interface, delivering data from the pilot project Wikibase using CONTENTdm’s custom Javascript feature. 
PHASE 2: TOOLS FOR MANAGING METADATA IN WIKIBASE In the second phase, the work was expected to focus on the Wikibase editing interface and on supplementary tools that could be used to extend that environment. These tools would help bridge the gap between CONTENTdm staff user expectations and the features and limitations of the Wikibase environment. The design and development of mechanisms for returning data from the Wikibase to the production CONTENTdm environment were also expected to be part of this phase. PHASE 3: WIKIBASE ENTITIES DRIVE DISCOVERY The focus of phase three was intended to be on a discovery interface that relied solely on data within the Wikibase to evaluate the features that could be part of a redesigned CONTENTdm discovery system. As the project unfolded, the project team made adjustments to the original plan, responding to new findings from its early phases. For example, work on some staff tools for editing Wikibase data began during Phase 1 (planned for Phase 2). On the other hand, the initial plan included the prototyping of a user interface for entity editing as an alternative to the Wikibase user interface but the team did not completely build and test that prototype before the project ended. The Phase 2 work anticipated that the project would encourage loading data from the Wikibase back into the CONTENTdm system using its “Catcher”7 web service that can add and edit metadata using a standard XML-based method. But given the conditions of the pilot project, the project partners could not be sure if the modified headings would conflict with their ongoing data management work. The first two phases concentrated on building and evaluating workflows for analyzing, transforming, and reconciling CONTENTdm metadata into Wikibase Linked Data with as complete and lossless a result as was feasible. 
In the third phase, a new course was charted to see how much linked data could be generated from CONTENTdm with minimal human intervention and to evaluate the results in a front-end discovery application that would more clearly demonstrate the linked data value proposition. These types of adjustments to the project plan are expected in a research-oriented pilot, where the issues and questions that will naturally surface over time cannot be fully defined at the outset.

The Wikibase Environment

To proceed through the planned phases, the project needed an effective and proven platform for working with linked data. Based on the successful results from Project Passage (2018), the CONTENTdm Linked Data Pilot project used the Wikibase environment, which includes several interrelated APIs, databases, indexes, and services (figure 2):

• MediaWiki is the primary software platform, the same software on which the Wikipedia8 encyclopedia and other “wikis” operate.
• To handle structured data, the Wikibase extension to MediaWiki is used, which is the same software that supports the Wikidata9 knowledge base. Together, MediaWiki and Wikibase provide both a user interface for searching and editing and a range of APIs for access to authentication and editing services.
• But to support linked data, a parallel system is synchronized with the Wikibase data, including its own linked database or triplestore that can be accessed using a linked data query language called SPARQL.10 A SPARQL Query service user interface is also provided.

These powerful tools are the product of years of open source software development and support provided by the Wikidata and Wikimedia communities.

The Wikibase Ecosystem FIGURE 2. The Wikibase Ecosystem.11 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/wikibase-system-architecture.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 17 Developing A Data Model CONTENTdm repositories employ a wide range of vocabularies and institution-specific data dictionaries. Some institutions apply patterns to their data descriptions that are consistent across all their collections, while others use different patterns for different collections, either due to evolution of their institutional preferences over time and the effort required to maintain and revise “legacy” patterns in previously described collections, or to account for special characteristics in the data and use cases associated with specific collections. For the Linked Data project Wikibase, a single data model was needed that could reflect the variations seen in the metadata across CONTENTdm sites. Rather than selecting an existing data model to which we’d force CONTENTdm metadata to conform, the pilot project tested the theory that, through sampling current metadata and looking for general patterns, a model could be developed that was driven by data and that avoided speculation. Where appropriate, the properties and classes defined in the project data model were linked to equivalent properties and classes in other ontologies and vocabularies. CONTENTdm Class Hierarchy Data Model FIGURE 3. A CONTENTdm class hierarchy data model.12 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/class-ontology.png 18 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability This work began by looking across an inventory of CONTENTdm metadata for the most common metadata practices, leveraging CONTENTdm’s ability to help institutions relate their local vocabulary terms to the Dublin Core13 element set and associated controlled vocabularies. 
This step identified the classes and relationships that would be encountered most frequently in the pilot participants’ data and gave a starting point for building the pilot project data model. A field analysis survey was conducted for about 13 million records, selected from all CONTENTdm sites, that evaluated the most frequently used fields to identify important properties for creative works. From that same CONTENTdm survey, the most frequently used terms were extracted to build an initial class taxonomy for creative works. This method was later revised based on conversations with partners and colleagues. The class hierarchy from the project’s data model is illustrated in figure 3. DESCRIBING THE “TYPE” OF A CREATIVE WORK AT THREE LEVELS As analysis of the pilot project participants’ data began, new classes and relationships were encountered and were evaluated as possible extensions to the data model. One part of the model that changed substantially was the Creative Work taxonomical branch. It was originally populated with the “types” of creative works based on how they were described in the source metadata, but that resulted in a large and unstructured list of classes. After consulting with the pilot partners and with colleagues at the J. Paul Getty Trust, the team decided to revise the model using a three-level approach. At the top level, creative work “type” classes were mapped to the Dublin Core DCMI Type14 terms. An immediate benefit of that decision was the ability to neatly facet results across the different DCMI Types, a common way of providing a high-level filter for search and retrieval of digital collections. To refine the top level DCMI Type classes, a second level “classification used” property was created that was associated with 25 “classification” entities. The set of classification entities was developed based on work done in the Linked.Art15 project as well as through consultation with colleagues at the pilot partner Minnesota Digital Library. 
If more detail was needed, a third level for the “process or format” property could be used to connect the item to any conceptual entity. An example of this revision to the data model is illustrated in figure 4 for a postcard, which is a type of “image,” uses the classification “Prints,” and adds a “process or format” of “Postcards.” Example of Mapped DCMI Data Levels for a Postcard FIGURE 4. Example type, classification used, and process or format properties and values for a description of a postcard.16 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/entity-Q73226.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 19 DISTINGUISHING BETWEEN INSTANCES OF CONCEPTS AND ONTOLOGICAL CLASSES Distinguishing between instances of concepts and ontological classes presented a data modeling challenge. This challenge is related to how, in the library domain, controlled vocabularies have been developed and translated to ontology-based systems. Concepts derived from a controlled vocabulary can be used both as conceptual entities for subject headings and as ontological classifications for a specific instance of the subject. A good example of this dual use is the concept entity of “Dogs.” As a concept it can be used to describe what a photograph depicts as seen in figure 5. “Depicts” Statement for the Concept of “Dogs” FIGURE 5. A depicts statement for the concept of “Dogs.”17 View a larger image online. But “dog” can also be used as an ontological class to describe specific dogs, such as the dog named “Buck” who appears in a photograph (figure 6). Type Classification of “Dog” for a Specific Dog FIGURE 6. A type classification of “dog” for a specific dog.18 View a larger image online. 
https://researchworks.oclc.org/cdmld/screenshots/entity-Q147731.png https://researchworks.oclc.org/cdmld/screenshots/entity-Q142481.png 20 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability To distinguish the conceptual entity of “Dogs” from the ontological class “dog” in the pilot data model, an “is defined by” property was created, based on the property “isDefinedBy,” which is found in the linked data modeling vocabulary RDF Schema19 to connect the class to the conceptual entity that describes it (figure 7). The “Dog” Class “isDefinedBy” the Concept of “Dogs” FIGURE 7. The “dog” class is defined by the concept of “Dogs.”20 View a larger image online. MANAGING THE DATA MODEL IN WIKIBASE OCLC staff took advantage of the components built into the Wikibase infrastructure to manage the process of developing the data model, using a template form to submit and review proposals for new properties and classes. This approach helped illustrate the expected advantages that these additions to the model would bring and provided a history to look back on as the project proceeded. OCLC staff found that these templates and the proposal/review/acceptance workflow were an effective way for a small but distributed team to manage the process and recommends this approach to others who are building a system using the Wikibase software platform (figure 8). https://researchworks.oclc.org/cdmld/screenshots/entity-Q73829.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 21 Wikibase Templates for Proposing New Properties FIGURE 8. Wikibase templates for proposing new properties.21 View the larger Wikibase images property proposal and property proposal/is defined by online. MANAGING SOURCE METADATA OUTSIDE OF THE DATA MODEL Some of the CONTENTdm source metadata fell outside of the information that was expected to be accounted for in the data model. 
The data model was intended to support the description of cultural materials, but the source metadata also included technical information about their digital representations and administrative data associated with the cataloging process. To prevent this additional information from being lost in the transformation process, the associated fields and values were indexed in a system separate from the Wikibase. The indexed data included the identifier for the associated entity in the project Wikibase. This allowed unmapped elements to be displayed in the Wikibase user interface (illustrated in figure 9) without disrupting the data model with entities and relationships that were administrative or technical in nature.

https://researchworks.oclc.org/cdmld/screenshots/cdm-property-proposal.png https://researchworks.oclc.org/cdmld/screenshots/cdm-property-proposal-is-defined-by.png

Unmapped CONTENTdm Metadata Displayed in the Wikibase User Interface Using a Gadget Extension FIGURE 9. Unmapped CONTENTdm metadata displayed in the Wikibase user interface using a Gadget extension.22 View a larger image online.

Gathering and Transforming Metadata

The primary focus of the first phase of the linked data project involved assembling metadata describing digitized cultural materials and transforming it to descriptions of related entities. The following notes provide a detailed view of that work, including how metadata was selected for inclusion and analyzed, the development of tools and workflows to manage the transformation, and how the database was enriched to build more connections between entities.
https://researchworks.oclc.org/cdmld/screenshots/entity-Q143578.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 23 SELECTING AND ANALYZING COLLECTIONS FROM PILOT PARTNER CONTENTDM SITES Pilot project participants were asked to suggest a small group of CONTENTdm collections that they wanted to work with. OCLC suggested working with collections of varying sizes and content types but emphasized that the described materials should be primarily visual (photographs, prints, maps, etc.) rather than finding aids or PDF documents. In some cases, for very large collections, OCLC chose to represent a subset of the entire collection, given the pilot project’s resource constraints.23 Wikibase Discussion Page for a Collection Review FIGURE 10. Wikibase Discussion page for a collection review.24 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/cdm-item-talk-Q148309.png 24 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability OCLC staff exported CONTENTdm metadata for each suggested collection and created an entity description for it in the CONTENTdm Wikibase and used the Wikibase “Discussion Page” feature to develop a metadata crosswalk, analyzing fields used in the collection and mapping them to Wikibase properties and classes. After OCLC staff created the initial crosswalk, individual meetings were held with each pilot site to review the initial analysis and address questions. This process highlighted the importance of domain expertise when thinking through the metadata transformation process, as institution-specific, and sometimes collection-specific, cataloging practices cannot always be discerned by others outside the institution. 
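The crosswalk step described above can be pictured with a minimal Python sketch. The field names, property labels, and sample record are hypothetical and chosen only for illustration; they are not taken from the pilot's actual crosswalks. Fields without a mapping are set aside rather than dropped, in the spirit of the separate index for unmapped metadata described earlier.

```python
# Hypothetical crosswalk: local CONTENTdm field label -> data model property label.
# None of these names come from the pilot project; they are placeholders.
CROSSWALK = {
    "Title": "title",
    "Photographer": "photographer",
    "Subject": "depicts",
    "Date": "date created",
}


def apply_crosswalk(record, crosswalk):
    """Split a flat record into mapped statements and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        prop = crosswalk.get(field)
        if prop is None:
            # No mapping yet: keep the field so nothing is lost in transformation.
            unmapped[field] = value
        else:
            mapped.setdefault(prop, []).append(value)
    return mapped, unmapped


# Hypothetical source record with one administrative field ("Scanner").
record = {
    "Title": "Harbor at dusk",
    "Photographer": "Doe, Jane",
    "Scanner": "flatbed",
}
mapped, unmapped = apply_crosswalk(record, CROSSWALK)
```

A review meeting of the kind described above would, in this picture, amount to inspecting the `unmapped` output and deciding whether each leftover field deserves a property in the data model or belongs in the separate index.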
OPTIMIZING TOOLS AND WORKFLOWS FOR RECONCILIATION AND TRANSFORMATION

After analyzing collection fields and reviewing the analysis with the pilot participants, OCLC created a project for each collection in the program OpenRefine25 (figure 11), which provides tools for data analysis, cleanup, and reconciliation. OpenRefine has a significant learning curve but is a tool OCLC has used frequently for metadata analysis; it was a natural fit for this project and proved to be an effective platform.

CONTENTdm Collection Metadata in an OpenRefine Project FIGURE 11. CONTENTdm collection metadata in an OpenRefine project.26 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/openrefine-project.png

As OCLC staff gained more experience with CONTENTdm metadata, reusable OpenRefine recipes were developed for carrying out generic data transformation tasks, which helped speed up the data processing for OCLC staff. For example, recipes were developed for looking up an item’s Wikibase identifier using its IIIF27 Manifest URL and retrieving data from the pilot project’s linked data “triplestore”28 database (“IIIF” is an image interoperability standard, and it defines a “manifest” that represents the digital content associated with a collection or item), for converting personal names from indirect order to direct order, and for extracting and formatting individual height and width values and the corresponding unit from extent data text strings. The code for each recipe was documented and stored in a Wikibase Help page for sharing and reuse by OCLC staff.

An important advantage of the OpenRefine platform is its ability to reconcile strings of text against external vocabularies to obtain a persistent identifier for the thing that the text string describes.
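As a rough illustration of what a reconciliation exchange involves, the Python sketch below builds a batch of queries in the general shape used by OpenRefine-compatible reconciliation services and then picks out the candidates that seem safe to accept without human review. The score threshold, the entity identifiers, and the sample response are assumptions made for illustration; no request is sent to a real endpoint, and this is not the pilot project's code.

```python
import json


def build_queries(headings, entity_type=None, limit=3):
    """Build a JSON batch of reconciliation queries, one per heading string."""
    queries = {}
    for i, heading in enumerate(headings):
        query = {"query": heading, "limit": limit}
        if entity_type:
            query["type"] = entity_type
        queries[f"q{i}"] = query
    return json.dumps(queries)


def auto_matches(response, min_score=90.0):
    """Return {query key: entity id} for top candidates that are flagged as a
    match or score at least min_score (an assumed, tunable threshold)."""
    accepted = {}
    for key, body in response.items():
        candidates = body.get("result", [])
        if not candidates:
            continue
        top = max(candidates, key=lambda c: c.get("score", 0))
        if top.get("match") or top.get("score", 0) >= min_score:
            accepted[key] = top["id"]
    return accepted


payload = build_queries(["Dogs", "Los Angeles Dodgers (Baseball team)"])

# Hand-written sample response with hypothetical identifiers and scores.
sample_response = {
    "q0": {"result": [{"id": "Q1", "name": "Dogs", "score": 100, "match": True}]},
    "q1": {"result": [{"id": "Q2", "name": "Dodgers", "score": 54, "match": False}]},
}
accepted = auto_matches(sample_response)
```

Headings that do not clear the threshold (here, `q1`) are the ones that would be left for a person to review, which is one way to frame the balance between automation and human attention.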
The reconciliation feature is built into OpenRefine and can be configured to compare strings against external OpenRefine-compatible reconciliation endpoints. OCLC staff worked with the OpenRefine reconciliation endpoint software29 developed for the Wikidata community and reconfigured it as an endpoint for the project Wikibase. That way, OpenRefine could be used to reconcile text strings against matches found through the OpenRefine reconciliation endpoints for the CONTENTdm Wikibase and could also use the similar endpoint supported by the Wikidata community to reconcile strings against Wikidata. OCLC also made use of OpenRefine endpoints developed and hosted by others to reconcile against the OCLC FAST30 subject terminology system, the VIAF31 authority file service, and the GeoNames32 service for geographic data.

After cleaning up and reconciling the CONTENTdm metadata, OCLC staff exported the data from OpenRefine and used locally developed scripts written in the Python33 scripting language to restructure the data to match the format specified for the Wikidata QuickStatements34 application. This is a tab-separated format with a set of rules for adding data to a Wikibase, with each row representing a single component of the item’s description. OCLC also used the Pywikibot35 library to develop another application that could read the QuickStatements data and load it into the Wikibase.
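To make the tab-separated format more concrete, here is a minimal Python sketch that turns one item description into QuickStatements-style rows: a `CREATE` row, a label row, and one row per statement. The property and item identifiers (`P1`, `P2`, `Q10`) and the sample values are hypothetical, and the sketch is not the locally developed script mentioned above.

```python
def quoted(text):
    """Wrap a plain string value in double quotes.
    Embedded double quotes are not handled in this sketch."""
    return f'"{text}"'


def to_quickstatements(label, statements, lang="en"):
    """Return tab-separated rows that create one item and describe it.

    statements is a list of (property id, already formatted value) pairs:
    entity ids are passed bare, string values are passed through quoted().
    """
    rows = [
        ["CREATE"],
        ["LAST", f"L{lang}", quoted(label)],  # label in the given language
    ]
    for prop, value in statements:
        rows.append(["LAST", prop, value])
    return "\n".join("\t".join(row) for row in rows)


example = to_quickstatements(
    "Harbor at dusk",
    [
        ("P1", "Q10"),                 # hypothetical "type" property and class item
        ("P2", quoted("8 x 10 in.")),  # hypothetical string-valued property
    ],
)
print(example)
```

Keeping one row per statement mirrors the format's rule that each row represents a single component of the description, which also makes a failed row easy to find and retry.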
ADDING RELATED ENTITIES TO THE CONTENTDM WIKIBASE FROM EXTERNAL SOURCES

The most significant barrier to quickly transforming and loading CONTENTdm metadata into the project Wikibase was the absence, especially in the early stages, of Wikibase entity descriptions for the people, organizations, places, concepts, and events that are represented in the CONTENTdm records. In a linked data environment, each of those related entities must have its own entity description in the system, so that relationships can be defined between the entities. For example, when transforming a CONTENTdm record for a photograph, a “photographer” property should be added to the entity describing the photograph with a link to a separate entity for the photographer. Unless those related entities are already in the project Wikibase and can be matched through OpenRefine reconciliation, the data loading process stalls until data for the related entities can be found, transformed, and loaded.

To move the process along and create entities as quickly as possible, OCLC staff initially created entities just for the creative work and its direct string-based properties (e.g., its title, description, height, width, IIIF Manifest URL, etc.). Once that step was completed, OpenRefine and the project’s SPARQL Query Service were used to look up the newly created Wikibase identifier for each item and those identifiers were added into a new column in the OpenRefine project. That step was followed by the creation of one or more new OpenRefine projects focused on reconciling strings for related entities and making connections between those entities and the creative work entities in the Wikibase.
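The identifier lookup in the second step might resemble the following Python sketch. It builds a SPARQL query that maps IIIF Manifest URLs to item identifiers and then reads a result in the standard SPARQL JSON results layout. The property identifier `P99`, the `example.org` namespaces, and the choice to store manifest URLs as IRIs are all assumptions made for illustration; the actual property and prefixes in the project Wikibase are not given in this report.

```python
def manifest_lookup_query(manifest_urls, prop="P99",
                          prefix_iri="https://example.org/prop/direct/"):
    """Build a SPARQL query that finds the item for each manifest URL.

    prop and prefix_iri are placeholders. URLs are assumed to be well-formed
    IRIs; this sketch does not escape unusual characters.
    """
    values = " ".join(f"<{url}>" for url in manifest_urls)
    return (
        f"PREFIX wdt: <{prefix_iri}>\n"
        "SELECT ?item ?manifest WHERE {\n"
        f"  VALUES ?manifest {{ {values} }}\n"
        f"  ?item wdt:{prop} ?manifest .\n"
        "}"
    )


def bindings_to_map(result_json):
    """Turn SPARQL JSON results into {manifest URL: item id}."""
    lookup = {}
    for row in result_json["results"]["bindings"]:
        item_iri = row["item"]["value"]
        # The identifier is taken to be the last path segment of the item IRI.
        lookup[row["manifest"]["value"]] = item_iri.rsplit("/", 1)[-1]
    return lookup


query = manifest_lookup_query(["https://example.org/iiif/1/manifest.json"])

# Hand-written sample result; a real one would come from the query service.
sample_result = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "https://example.org/entity/Q123"},
                "manifest": {
                    "type": "uri",
                    "value": "https://example.org/iiif/1/manifest.json",
                },
            }
        ]
    }
}
id_by_manifest = bindings_to_map(sample_result)
```

The resulting dictionary plays the role of the new identifier column: each source row can be joined to its newly created entity before the related-entity statements are added.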
Creating entities in advance for anticipated matches OCLC also created Wikibase entity descriptions in advance for concepts and places that were anticipated to be mentioned in the CONTENTdm source data so the OpenRefine reconciliation process would find something to match against. Entities for concepts were based on a set of headings from OCLC’s FAST subject vocabulary. Staff selected subject headings that are widely used in other databases with the expectation that these would represent headings that would also occur in CONTENTdm metadata. The subject headings were transformed and loaded into the Wikibase as concept entities. This created an initial set of about 75,000 concept entities. In a second step, the FAST data was analyzed to find “broader concept” relationships for the 75,000 concept entities, and new concept entities were created for all of the “broader concept” FAST headings. Adding broader concept entities resulted in a total of over 100,000 concept entities being added to the Wikibase to support the CONTENTdm metadata matching process. Entities for places that were anticipated to be found in CONTENTdm metadata were created based on information from the GeoNames geographical database, beginning with data describing cities with a population larger than 15,000 along with other place descriptions from administratively higher levels (countries, states, provinces, territories, counties, etc.). The GeoNames data processing produced about 70,000 Place entities for reconciliation. This step of prepopulating the Wikibase with descriptions of entities for anticipated CONTENTdm headings helped reduce the barriers for entity creation. But the limits that were applied to the external sources meant that there were potential matches still to be found in FAST or GeoNames that had not been included, and potentially additional or richer data available from VIAF and Wikidata. 
Unmatched headings were reconciled against those services in OpenRefine, and if matches were found, the external source data was retrieved, converted, and loaded into the project Wikibase, and reconciliation was attempted again. OCLC also developed a separate application called the “Retriever,” described in more detail later in this report, that staff used to search for matches in Wikidata, VIAF, and FAST and create new entities with a simple web interface.

Testing an alternative OpenRefine reconciliation endpoint

During the second phase of the pilot, OCLC prototyped a new reconciliation endpoint for matching against headings in the project Wikibase, relying on separate indexes of entity data to speed up the reconciliation process. The performance metrics for this prototype service were very encouraging, as it does not rely on SPARQL Queries and the triplestore for matching, which can slow the process down. The index response times were consistently much faster in OCLC tests. This efficiency gain comes at the cost of replicating and synchronizing data from the Wikibase in another index, but for this project the costs were easily managed. OCLC staff provided a detailed presentation on this prototype work as part of the OCLC DevConnect Online 202036 series. The DevConnect webinar sparked some interest from a few developers who work on OpenRefine and the OpenRefine Reconciliation Service API, and OCLC has consulted with them to determine if any of our optimizations can be incorporated into their projects.

Creating placeholder entities for things that could not be reconciled

Some entities mentioned in CONTENTdm records could not be found in the controlled vocabularies and authority control systems that were used by OCLC for reconciliation.
This was anticipated, and indeed one of the points of carrying out this pilot was to better understand how these references appear and how to account for them in a Wikibase, where the established identity of the entity is of great importance. The solution the team settled on was to create a “placeholder” entity with as much information about the referenced entity as could be extracted from the CONTENTdm description, for instance its type (person, organization, etc.), birth and death dates (if present), occupation, and a consistently applied component of the Wikibase description that would help suggest, for potential future matches during reconciliation, that the entity’s identity had not yet been established (figure 12).

A “Placeholder” Entity for a Person without an Established Identity FIGURE 12. A “placeholder” entity for a person without an established identity.37 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/entity-Q144548.png

Representing Compound Objects

Descriptions of cultural materials that consist of multiple parts, such as a photograph album or the recto and verso views of a postcard, can be structured in CONTENTdm as “compound objects.” Compound objects maintain the sequential order of related digitized items and can include an “object description” of the whole item, along with more detailed “item descriptions” about each part.38 The project team tested two ways to maintain this structure and descriptive detail in Wikibase entities that were created from CONTENTdm compound object metadata. In the most granular and detailed approach, for a photograph album, an entity for the album was created along with separate entities for the album cover and its individual pages. Each cover or page entity has a “part of creative work” property linking back to the album entity.
While this approach acknowledges the whole-part relationship of pages to the album, that relationship on its own cannot represent the sequential order of the parts. To document their sequential order, “has creative work part” statements were added to the description of the album, linking to the related parts, and each statement was qualified with a “series ordinal” property to represent the numeric sequence of the pages (figure 13).

Example “has creative work part” Statements and Sequencing for the First Four Parts of an Album FIGURE 13. Example “has creative work part” statements and sequencing for the first four parts of an album.39 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/entity-Q144548.png

A review of other compound objects revealed a more typical pattern: they included very little item-level descriptive data beyond a default caption such as “Page 1,” “Page 2,” etc. It was also noted that the IIIF Manifest that is present for compound objects maintains the structure, sequence, and captions of related images. That led to a decision to describe most compound objects as a single entity without separate entities for the items in the compound object, relying on access to the structure, sequence, and caption-level metadata in the corresponding IIIF Manifest.

Syndicating Data in Standard Schemas

The data managed in the CONTENTdm Wikibase is accessible through MediaWiki APIs and the Wikibase user interface and can be transformed by Wikibase into several formats, including the RDF40 linked data formats Turtle,41 N-Triples,42 and JSON-LD,43 along with the non-RDF formats of a proprietary Wikibase JSON44 object and Serialized PHP.45 The pilot study also evaluated mechanisms for transforming data from Wikibase into schemas used by other systems where this data may eventually be shared.
Specifically, OCLC added equivalent class and equivalent property statements for the CONTENTdm data model’s classes and properties, which were then used by a conversion program to crosswalk the data into either the DPLA Metadata Application Profile46 or Schema.org.47 The Schema.org transformation was used with a CONTENTdm Custom Javascript extension, described below, to test embedding JSON-LD linked data within CONTENTdm item pages.

A separate conversion process was developed to convert the project’s Wikibase data model representation into an RDF OWL Ontology48 description. This conversion demonstrated the portability of both the Wikibase data model and the instance data if Wikibase were to be replaced by another structured data management system. The exported ontology also provided a clear way to see the model, separate from the instance data, which helped the project team explain how the data was created and structured.

In testing how data exports could be created and used, OCLC developed a conversion process that took the Wikibase JSON data and generated JSON-LD data that conformed to the JSON-LD 1.1 specification and followed the emerging W3C best practices for JSON-LD49 as well as IIIF JSON-LD design patterns.50 This conversion demonstrated the versatility and portability of the Wikibase JSON data and provided “developer friendly” data for our prototype applications, such as the Explorer application described in this report, to use.

Wikibase Ecosystem Advantages

The selection of the MediaWiki environment and its Wikibase extension brings several advantages right out of the box. Without custom software development or user interface design and testing, these can be employed to produce new data management and user experience benefits.

IMPLEMENTING AUTHORITY CONTROL

CONTENTdm currently has a traditional record-oriented data model, where headings for various entities are based on a single string.
Varying cataloging practices and sources for controlled vocabularies can, in that approach, create obstacles to searching for the name of a person, organization, concept, place, or event if you do not know the exact form of the heading. But in the Wikibase environment, any number of different heading strings, in different languages, can be associated with an entity, greatly increasing the effectiveness of recall while strongly supporting precision as well.

FIGURE 14. Other names associated with the Los Angeles Dodgers entity.51 View a larger image online.

For example, in CONTENTdm a precise search to find works associated with the Los Angeles Dodgers baseball team may (depending on the cataloging practices of the institution) need to use the Library of Congress (LC) heading “Los Angeles Dodgers (Baseball team).” But in the Wikibase environment, the entity describing that organization could be found using that LC preferred form, or any of several current colloquial names or previous official names, including “LA Dodgers,” “Brooklyn Dodgers,” “Trolley Dodgers,” “Brooklyn Grays,” and others (figure 14). In the Wikibase environment each entity is registered with and retrievable with its own unique identifier, separate from any and all names with which it may be associated.

DECREASING CATALOGING INEFFICIENCIES, INCREASING DESCRIPTIVE QUALITY

In a record-oriented system like CONTENTdm, if a cataloger wants to include biographic or other descriptive information about an entity associated with a work, such as information about the photographer of an image or about a depicted person, that information needs to be added as the value of a field in every record where it is applicable.
Then, if information about the related entity needs to change, all the associated records need to be updated to keep that information current and synchronized. This data management overhead may be one reason why descriptions of related entities are not common in traditional cataloging environments.
https://researchworks.oclc.org/cdmld/screenshots/entity-Q166325.png

FIGURE 15. First parts of the description of Jasper Wood.52 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/entity-Q147700.png

In the Wikibase environment, entities for works and for things associated with the work are maintained separately. The description of the photographer, or of the depicted person, can be entered and maintained in one entity description as illustrated in figure 15, and any changes to that description can be immediately seen through the relationships the entity has to other entities. This efficiency improvement could encourage richer descriptions of related entities, including context and relations that are not typically added in existing record-oriented systems.

GENERATING DATA VISUALIZATIONS

As the system architecture diagram included in this report represents, the Wikibase ecosystem includes a component that watches for changes in the Wikibase entity descriptions, retrieves that data in the form of linked data triples, updates the data in a linked data database or “triplestore,” and provides a separate user interface for querying that data using the SPARQL Query language. The user interface has built-in tools for constructing SPARQL queries and determining how the results can be visualized.
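As a rough illustration of the kind of query such an interface accepts, the Python sketch below assembles a SPARQL query for places depicted by works in a collection. The property and item identifiers and the `wdt:`/`wd:` prefixes are placeholders, not the pilot Wikibase's actual identifiers. The `#defaultView:Map` comment is the hint the Wikidata Query Service interface uses to select a map rendering; it is assumed here, not confirmed by the report, that the pilot's query interface behaves the same way.

```python
from string import Template

# Hypothetical identifiers: a real Wikibase assigns its own P and Q numbers.
P_DEPICTS = "P10"
P_COORDINATES = "P20"
P_PART_OF_COLLECTION = "P30"
Q_COLLECTION = "Q1000"

# SPARQL variables start with "?", so "$" is free for template placeholders.
QUERY_TEMPLATE = Template("""#defaultView:Map
SELECT ?work ?workLabel ?place ?placeLabel ?coords WHERE {
  ?work wdt:$collection_prop wd:$collection ;
        wdt:$depicts ?place .
  ?place wdt:$coords ?coords .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}""")


def build_places_query(collection_id: str = Q_COLLECTION) -> str:
    """Return a SPARQL query for places depicted by works in one collection."""
    return QUERY_TEMPLATE.substitute(
        collection_prop=P_PART_OF_COLLECTION,
        collection=collection_id,
        depicts=P_DEPICTS,
        coords=P_COORDINATES,
    )


if __name__ == "__main__":
    print(build_places_query())
```

The printed text can be pasted into a query service interface once the placeholder identifiers are replaced with real ones.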
The SPARQL query language is a powerful tool for making connections between and across entities, producing results that would be difficult and, in some cases, not feasible in a traditional record-oriented system. As shown in figure 16, a simple SPARQL query can retrieve all of the entities for places that are said to be depicted by works in a collection and, using the geographic coordinates in the place entity description, locate the place in a map visualization along with information about the related work.

FIGURE 16. SPARQL Query map visualization of places depicted in works from a collection.53 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/sparql-visualization.png

User Interface Extensions

MEDIAWIKI GADGETS

The MediaWiki platform for Wikibase provides a Gadgets54 extension that can be used to develop and add custom features to the user interface. OCLC staff took advantage of this feature to extend the interface, both to alter the user experience and to provide quality assurance tools.

Adding the Mirador viewer

Mirador55 is a configurable, extensible, and easy-to-integrate image viewer that enables image annotation and comparison of images from repositories dispersed around the world. It can interpret the metadata and images that are included in IIIF Presentation Manifests. CONTENTdm generates IIIF manifests for all its image-based content, so Mirador was a great fit for this pilot project. Without an embedded image viewer, the Wikibase item entity displays are limited to text and are static. The Mirador viewer adds a degree of interactivity to the user experience: images can be viewed in detail, and for compound objects pages can be turned, without leaving the Wikibase user interface, as shown in figure 17.
FIGURE 17. Mirador image viewer embedded in the Wikibase user interface.56 View a larger image online.

Showing contextual information from Wikidata

One of the most important value propositions of working with linked data is for entities to link to other related things in other systems, leveraging the network to obtain more contextual data “on the fly” instead of duplicating data across systems. And in the linked data project Wikibase, many entities included identifiers for descriptions of the same entity in Wikidata.
https://researchworks.oclc.org/cdmld/screenshots/entity-Q165895.png

OCLC developers created a Wikibase gadget that could detect the presence of the related Wikidata identifier in an entity description, make a connection to Wikidata in real-time to find an associated Wikipedia article link, and use the Wikipedia link to obtain a summary description of the entity and, in many cases, a related image from Wikimedia Commons. OCLC developers found that this MediaWiki Gadget was simple to write. But the Gadget depended on a separate and more complex application created by OCLC developers that made all the system connections and carried out the database searches for contextual information and cached that information so as not to overburden the other shared services. The resulting contextual information included in a Wikibase entity description of San Francisco is illustrated in figure 18.

FIGURE 18. Contextual data and image from DBPedia and Wikimedia Commons embedded in the Wikibase user interface.57 View a larger image online.

Revealing constraint violations

Constraints58 are a Wikibase quality assurance feature that can be defined for properties and classes to describe their expected or allowed uses.
For example, the property for “birthplace” might have a type constraint set indicating that the property should only be used for items that are an instance of the class “Person,” that the object of the birthplace statement should be an instance of the class “Place” or one of its subclasses, and a cardinality constraint indicating that an entity should not have more than one birthplace statement. Leveraging the project’s SPARQL Query Service and its triplestore, OCLC developed a gadget that can compare the properties set for an item with any constraints set for the property and return a list of “constraint violations.” In some cases, these will represent errors in the description that should be changed. In other cases, they can point to adjustments that may be needed to the data model.
https://researchworks.oclc.org/cdmld/screenshots/entity-Q71945.png

As illustrated in figure 19, a constraint violation is noted for the Soviet space dog “Laika.” An occupation property has been set to “Astronauts,” but the type constraint for the occupation property indicates that it should only be used for instances of the type “person.” This view helps the project team see the violations generated by unexpected data and decide whether to modify these constraints, in this case based on what the community decides about occupations and whether they can be associated with beings other than persons.

FIGURE 19. A constraint violation indicating that the “occupation” property should only be used for instances of the type “person.”59 View a larger image online.

CONTENTDM CUSTOM PAGES

A very useful feature of the CONTENTdm system is the ability to create Custom Pages60 using CSS and Javascript to adjust and extend the default user interface features.
You can find a wide array of examples in the CONTENTdm Customization Cookbook site.61 The CONTENTdm pilot used this customization feature to test how linked data from the pilot project Wikibase could power two enhancements to the production CONTENTdm system’s item displays.
https://researchworks.oclc.org/cdmld/screenshots/entity-Q73246.png

Embedding Schema.org JSON-LD in CONTENTdm pages

Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the internet, on web pages, in email messages, and beyond. By mapping the CONTENTdm Wikibase data model to Schema.org classes and properties, and by developing a conversion program to generate Schema.org-compatible descriptions of entities in the Wikibase, OCLC developed a CONTENTdm customization that embeds the Schema.org data within a CONTENTdm item page, formatted as JSON-LD, to make the content of the page easier for search engines to find and interpret (table 1).

TABLE 1. Example Schema.org JSON-LD for a CONTENTdm entity

The visibility to search engines of this embedded JSON-LD Schema.org metadata can be evaluated using applications like Google’s Structured Data Testing Tool (figure 20).62

FIGURE 20. Schema.org data evaluated using Google’s Structured Data Testing Tool.63 View a larger image online.
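The crosswalk-and-embed step can be sketched in Python. The mapping tables, local property names, and sample entity below are invented for illustration and are not the pilot's data model or conversion program. Only the Schema.org terms (`Photograph`, `CreativeWork`, `name`, `creator`, `about`, `description`) and the `application/ld+json` script type are standard.

```python
import json

# Hypothetical crosswalk from local property labels to Schema.org terms.
PROPERTY_MAP = {
    "title": "name",
    "creator": "creator",
    "subject": "about",
    "description": "description",
}

# Hypothetical crosswalk from local classes to Schema.org types.
CLASS_MAP = {"photograph": "Photograph"}


def to_schema_org(entity: dict) -> dict:
    """Crosswalk one entity description into a Schema.org JSON-LD document."""
    doc = {
        "@context": "https://schema.org",
        "@type": CLASS_MAP.get(entity["type"], "CreativeWork"),
    }
    for local_name, value in entity["statements"].items():
        term = PROPERTY_MAP.get(local_name)
        if term is not None:  # properties without an equivalent are skipped
            doc[term] = value
    return doc


def to_script_tag(doc: dict) -> str:
    """Wrap the JSON-LD in the script element used to embed it in a page."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Invented sample entity, for demonstration only.
SAMPLE_ENTITY = {
    "type": "photograph",
    "statements": {
        "title": "Streetcar on Main Street",
        "creator": {"@type": "Person", "name": "Unknown photographer"},
        "subject": ["Streetcars"],
        "internal note": "not mapped, so not exported",
    },
}

if __name__ == "__main__":
    print(to_script_tag(to_schema_org(SAMPLE_ENTITY)))
```

Skipping unmapped properties keeps the embedded description limited to terms a search engine can interpret. A production version would also need to escape any `</script>` sequence appearing inside string values.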
Showing contextual information for headings based on Wikibase data

Similar to the Wikibase user interface gadget that adds contextual information about a single entity by connecting through Wikidata to obtain related information from Wikipedia, DBPedia, and Wikimedia Commons, an application was written that could be called by a CONTENTdm Custom Javascript component. Using the CONTENTdm item identifier to find the related entity for the work in the pilot Wikibase, the application also finds other entities related to the work in the Wikibase (the collection of which it is a part, subjects that it is about, its creator, and more) and, for each of those entities, looks for and displays more information, including an abstract and a thumbnail image. This customization, shown in figure 21, was demonstrated to the pilot participants and there was interest in applying it to some of their collections, but the project did not see a production implementation of it, beyond OCLC’s testing, before the pilot period ended.
https://researchworks.oclc.org/cdmld/screenshots/google-structured-data-testing-tool.png

FIGURE 21. Additional contextual information displayed in CONTENTdm based on entity descriptions in the pilot Wikibase.64 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/cdm15725-p16003coll7-14.png

New Applications

The Linked Data project was well served by the “out of the box” features and functions of the MediaWiki platform, its Wikibase extension, the SPARQL Query service interface, the MediaWiki Gadgets component, and CONTENTdm Custom Pages.
But for more complex investigations that were carried out during the project, the following prototype applications were developed:

• The Image Annotator to evaluate how subject matter experts could assist catalogers in describing images
• The Retriever to make the process of finding and adding new entity descriptions more efficient
• The Describer to investigate alternatives to the default Wikibase editing interface
• The Explorer and the Transportation Hub to demonstrate the value of aggregation and new discovery system features that maximize the value of linked data
• The Field Analyzer to assist metadata managers with analyzing their current collections

THE IMAGE ANNOTATOR

The CONTENTdm metadata transformation and reconciliation process produced descriptions of creative works that included, among other statements, relationships to other entities that the creative work either depicted or was in a more general sense “about.” The distinction between these two relationships was not always certain and a project goal was to better understand how this distinction is discerned by those managing digital collections. There was also an interest in testing whether the Wikibase platform could serve as the basis for new application development, in this case for an interface that would let domain experts and others augment the transformed CONTENTdm metadata with new annotations.

The Image Annotator application was developed and tested to investigate those questions. Given the Wikibase entity identifier for a creative work, it initially presents the work’s image along with a list of the “about” or “depicts” statements that are part of the entity description.
This selective presentation of just some of the elements associated with the entity description was designed to give focus to the questions at hand: What is the image about, what does it depict, and can portions of the image be associated with depicted things? A user can review the statements that were created as part of the CONTENTdm metadata conversion process and quickly update any statements that need adjusting, for example changing an “about” statement to a “depicts” statement if they determine that the related entity is truly depicted in the image (figure 22).

FIGURE 22. Image Annotator initial view of an image and subjects.65 View a larger image online.

And for any “depicts” statements, the user can apply the image cropping tool to associate the appropriate portion of the image with the depicted entity, providing a much finer-grained reckoning of the item and supplementing the Wikibase with new images associated with other entities (figure 23). The IIIF Image API supports the management of these selections and the persistent retrieval of the associated images.

FIGURE 23. Image Annotator cropping an image of a person.66 View a larger image online.
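A cropped selection of this kind maps naturally onto the region segment of a IIIF Image API request URL, which follows the pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The Python sketch below builds such a URL from a pixel rectangle. The server base URL and image identifier are made up, and `max` is the full-size keyword in version 3 of the API (version 2 used `full`). The report does not describe how the Image Annotator itself stores its selections, so this is only one plausible shape.

```python
from urllib.parse import quote


def iiif_crop_url(base_url, identifier, x, y, width, height,
                  size="max", rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API URL for a rectangular region given in pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("region width and height must be positive")
    if x < 0 or y < 0:
        raise ValueError("region origin must not be negative")
    # Region syntax is "x,y,w,h", measured from the image's top-left corner.
    region = f"{x},{y},{width},{height}"
    # Reserved characters in the identifier must be percent-encoded.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


if __name__ == "__main__":
    # Hypothetical server and identifier, for illustration only.
    print(iiif_crop_url("https://iiif.example.org/image", "coll7:14",
                        120, 80, 300, 400))
```

Because the selection is carried entirely in the URL, the same crop can be retrieved again later without storing a derivative image.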
https://researchworks.oclc.org/cdmld/screenshots/image-annotator-1.png
https://researchworks.oclc.org/cdmld/screenshots/image-annotator-2.png

The subject relationships shown in the Image Annotator are based on the CONTENTdm source data, but this new application gives users the opportunity to supplement the entity description with more “about” or “depicts” statements by searching for related entities and adding the new connections, with another cropped image if appropriate, as illustrated in figure 24 with the addition of the subjects “Baseball umpires” and “Catchers (Baseball)” with associated images.

FIGURE 24. Image Annotator after adding more depicted subjects.67 View a larger image online.

Once the changes have been made in the Image Annotator, they can be saved to the Wikibase where they are immediately visible in its user interface (figure 25).
https://researchworks.oclc.org/cdmld/screenshots/image-annotator-3.png

FIGURE 25. Wikibase item updated with illustrated depicts statements.68 View a larger image online.

User study results

The usability of the Image Annotator was tested in November and December 2019 in three separate “Think Aloud” user studies.69 In this type of study, test participants use the system while continuously thinking aloud, that is, verbalizing their thoughts as they move through the user interface.
Reactions to and suggestions made about the Image Annotator were for the most part very positive, but also identified user interface and indexing improvements that would need to be implemented before it could become a truly productive tool.
https://researchworks.oclc.org/cdmld/screenshots/entity-Q148552.png

The test results indicated that the Image Annotator application was usable, as everyone was able to complete the exercise steps. The results also helpfully revealed usability issues that had not been previously encountered during OCLC staff testing. The test study wrap-up discussions provided especially good feedback and suggestions for improvements. Test participants noted that the Image Annotator would be a useful tool for cleaning up metadata and would provide an easy way to bring subject matter experts from outside the library into the process of describing cultural materials.

The test results also identified several areas of needed improvement before the Image Annotator would be ready for regular use, including search and retrieval, scalability, user interface issues, and guidelines for descriptive practice.

In the area of search and retrieval, it was not clear to participants that “free text” subject annotations that did not match a heading would not be retained when the entity was updated. Participants found that expected search results were not returned; for example, a search for “kite” did not show a match for “kites.” And some participants hoped that vocabularies from nonlibrary domains could be included as the source for related entities: “That’s ultimately what makes digital collections meaningful.”

Scalability of the manual effort was a noted concern, in that it takes time to make annotations for individual objects, and collections can include thousands of objects. Some wondered whether crowdsourcing, under certain management and controls, could address that concern.
The Image Annotator’s mechanism for adding and removing annotations presented some usability obstacles: the absence of the “camera” button for “about” headings was confusing, a cropping icon would be preferable to a camera icon for that button, the camera and “add a depiction” buttons compete for attention, and there isn’t a way to delete a cropped image without deleting the entire depicts statement.

Some participants wondered how many subjects are “enough,” and what subjects are notable enough to deserve annotation. They wished for easier access to other descriptive metadata for the work, to help identify additional depicted subjects. The gray area between “about” and “depicts” was discussed by all without a clear consensus on when to select one or the other, though generally participants felt that “depicts” should have more of a guarantee that you will see the depicted thing in its entirety or as a significant portion in the object. For example, a photograph of Public Square in Cleveland would depict Public Square but be about Cleveland.

All the test subjects noted that the Image Annotator was enjoyable to use:

• “I like this little system.”
• “Once you get going it’s actually kind of fun.”
• “One of the things you’re offering is a way to have fun, quite literally a window into new ways of thinking about what we do.”

THE RETRIEVER

When describing an entity using the Wikibase interface, the workflow can come to a halt if you are trying to establish a relationship between the entity you are describing and some other entity when it is not yet in the Wikibase. To fill this gap, OCLC developed an application called the “Retriever” that can quickly search for an entity described in other systems and transfer those descriptions into the Wikibase as a new entity.
For example, if you are describing a photograph of Lake Vermilion in Minnesota and want to add a “depicts” statement linking the photograph entity to that place entity, but there isn’t already an entity in the Wikibase describing Lake Vermilion, you’d need to stop editing the photograph’s description, switch to a new Wikibase editing window to create a new entity for the lake, and then return to the photograph entity description to add the statement claiming that the photograph depicts the lake. That kind of disruption to the workflow can be reduced if there is a way to quickly add the missing entity’s description to the Wikibase.

When a Wikibase is in its early stages, unless it is prepopulated with entity descriptions from another source, this situation will be commonplace. But in many instances the missing entity is already described in some other authority control system or vocabulary. A tool that can find those descriptions, transform the data to align with the classes and properties in the Wikibase, provide an opportunity for human review and correction of the transformed data, and then automatically load the source data into the Wikibase as a new entity can help bridge the gap and keep the cataloging work flowing.

OCLC designed the Retriever to provide a simple keyword search interface to look for matching items in Wikidata, VIAF, and FAST (figure 26), a user interface for reviewing and editing data extracted from those sources (figure 27), and a back-end process for loading the transformed data into the Wikibase (figure 28). This application was originally developed, for the same use case, in OCLC’s Project Passage. The user interface component of the application was rewritten in the Linked Data pilot to use a different Javascript framework, but the functionality was generally the same.
FIGURE 26. Retriever search results from Wikidata, VIAF, and FAST for “Lake Vermilion.”70 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/retriever-1.png

FIGURE 27. Retriever entity editor.71 View a larger image online.

FIGURE 28. Wikibase entity created by the Retriever.72 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/retriever-2.png
https://researchworks.oclc.org/cdmld/screenshots/entity-Q221424.png

There is a server-based part of the Retriever application that takes search requests from the browser, handles mapping of external source data elements to the project Wikibase properties and classes, and utilizes the Python Pywikibot library to carry out data loading into the Wikibase.

THE DESCRIBER

The Linked Data project’s goals included testing editing interface alternatives to the Wikibase default user interface. OCLC began development of a prototype web application named the “Describer” that aspired to provide a guided mode to cataloging entities for works, illustrated in figure 29. The user experience in the Describer would begin by prompting the cataloger to choose the type and classification of the material they were describing. Based on those selections, the Describer would begin prompting for additional details that would be common or expected for entities of that type and classification, factoring in property constraints and other details of the underlying data model. The Describer could also incorporate capabilities and features that had been previously tested in the Image Annotator and in the Retriever.
Work on the Describer prototype was not completed before the end of the pilot, but the initial testing suggested promise while also revealing the importance of carefully documenting the data model constraints in order to drive the user experience. Though not part of this pilot, a related investigation that could prove similarly illuminating would be to evaluate a language designed to express the shape of the data, such as SHACL73 or ShEx,74 as the mechanism for defining how the data model works and how that relates to user interface development.

FIGURE 29. Editing essential details for an entity in the Describer.75 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/describer-1.png

THE EXPLORER AND THE TRANSPORTATION HUB

An important value proposition for making the transition to linked data is the ability to browse or navigate across the graph of data connections to find important related entities and reveal relationships that would be hard to see in a more traditional record-oriented search and retrieval system. To evaluate this potential OCLC developed a prototype web application named the “Explorer” to focus on the most frequently occurring connections between entities, see relationships that were described by different institutions for different items in different collections, look for thematically related content, and follow the graph-based connections to locate important related entities.
The home page of the Explorer lists entities organized across a subset of categories, sorted by frequency, to help researchers jump into the browsing experience and quickly see what the pilot project data is mostly “about” (figure 30). The Explorer also has a keyword search interface.

While the collections that were selected for evaluation in the first two phases of the pilot project were all interesting and, as a group, gave us a good idea about the range of data transformation and reconciliation challenges we’d likely encounter when working with other CONTENTdm sites and collections, they were not chosen with any special attention paid to how the materials they describe might thematically overlap. To generate more topically related connections across the pilot participants’ data, OCLC assembled a new selection of CONTENTdm metadata records based on the topic of transportation. Using a general search for transportation-related subjects (the subject terms used were “streetcars,” “transportation,” “roads,” “highways,” “airports,” “railroads,” “automobiles,” “ferries,” “rockets,” “ships,” “boats,” “streets,” “paths”), OCLC staff applied a search across all collections for each pilot participant’s CONTENTdm site, gathered the resulting metadata records, and transformed the data for loading into Wikibase, reconciling as many headings to related entities as could be done without significant amounts of human attention. This step provided more data for us to use in assessing the scalability of this data transformation process, as we could compare this more automated and streamlined effort with the very thorough and largely manual process that had been applied to the initial set of pilot project collections.

FIGURE 30. Explorer home page.76 View a larger image online.
https://researchworks.oclc.org/cdmld/screenshots/explorer-1.png

OCLC established a new virtual collection entity in the Wikibase for a “CONTENTdm Transportation Hub” and associated all the new Wikibase items for the related works to this collection, along with their original source collection. In the Explorer, the Transportation Hub collection can be selected as the starting point for browsing and selection, with facets helping to narrow the scope to different topics, things depicted, source collections, and more (figure 31).

FIGURE 31. Explorer Transportation Hub and related collections.77 View a larger image online.

The Transportation Hub can also be used to narrow a keyword search. For example, a keyword search for “strike” shown in figure 32 matches descriptions of items associated with labor strikes of various kinds (among other things) and narrowing the keyword search result to the Transportation Hub collection can highlight images and other works associated with transit strikes.
https://researchworks.oclc.org/cdmld/screenshots/explorer-2.png

FIGURE 32. Explorer search results for “strike.”78 View a larger image online.

For a researcher interested in that topic, the Explorer can return very different perspectives on a particular transit strike; for example, a Philadelphia Evening Bulletin newspaper photograph depicting the effect of the Philadelphia Transportation Company strike of August 1944 on transportation options for workers (figure 33) contrasts with a John W. Mosley Photograph Collection image of a protest from the previous year in support of hiring African American trolley drivers (figure 34).
For a researcher interested in that topic, the Explorer can return very different perspectives on a particular transit strike. https://researchworks.oclc.org/cdmld/screenshots/explorer-3.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 51 Explorer View of a Truck Bringing Employees Home During a PTC Walkout FIGURE 33. Explorer view of a truck bringing employees home during a PTC walkout.79 View a larger image online. Explorer View of a Protest against the Philadelphia Transportation Company FIGURE 34. Explorer view of a protest against the Philadelphia Transportation Company. 80 View a larger image online. https://researchworks.oclc.org/cdmld/screenshots/explorer-4.png https://researchworks.oclc.org/cdmld/screenshots/explorer-5.png 52 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability Explorer View of an 1899 Cleveland Transit Strike in Public Square FIGURE 35. Explorer view of an 1899 Cleveland transit strike in Public Square. 81 View a larger image online. The Transportation Hub helps to find images of transit strikes and their impacts in collections across institutions, including a Cleveland Public Library photograph of crowds surrounding a trolley car during a transit strike in during 1899 (figure 35) and a University of Miami photograph of parked trolley cars during a strike in Havana (figure 36). The Transportation Hub helps to find images of transit strikes and their impacts in collections across institutions. https://researchworks.oclc.org/cdmld/screenshots/explorer-6.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 53 Explorer View of Streetcars Parked on the Street during a Transit Strike FIGURE 36. Explorer view of streetcars parked on the street during a transit strike. 82 View a larger image online. 
The processes could be automated and extended to provide different views of how fields are defined and used across collections in a simple web application. THE FIELD ANALYZER Late in the pilot project, the OCLC developers saw a need for a new tool that could visualize how CONTENTdm fields are defined across different collections for participating institutions. This field- level analysis had been carried out in earlier phases of the project as a largely manual process, using CONTENTdm APIs and custom applications to gather data and reformat it for analysis in OpenRefine. After those manual processes had been ironed out, OCLC staff found that the processes could be automated and extended to provide different views of how fields are defined and used across collections in a simple web application. https://researchworks.oclc.org/cdmld/screenshots/explorer-7.png 54 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability Field Analyzer Field Usage Chart FIGURE 37. Field Analyzer field usage chart. 83 View a larger image online. Many participants found it to be a useful addition to their CONTENTdm toolkit, giving them a cross-collection view of how their collection vocabularies are defined. During the pilot, the data that could be listed and visualized by the Field Analyzer was based on a “snapshot” of records copied from CONTENTdm and needed to be periodically refreshed to reflect any subsequent changes made. Pilot participants expressed interest in having the Field Analyzer maintained after the end of the pilot for ongoing use, with access to “live” or frequently synchronized data, and for all collections. https://researchworks.oclc.org/cdmld/screenshots/field-analyzer-1.png Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 55 Field Analyzer List of Field Values FIGURE 38. Field Analyzer list of field values. 84 View a larger image online. 
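The kind of cross-collection view the Field Analyzer provides can be illustrated with a minimal sketch. This is not OCLC's implementation: the snapshot structure, collection aliases, field names, and Dublin Core mappings below are all hypothetical, and a real tool would gather this data from CONTENTdm rather than from a hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical snapshot of field definitions per collection. The collection
# aliases, field names, and Dublin Core mappings are invented for illustration.
SNAPSHOT = {
    "scrapbooks": [
        {"name": "Title", "dc": "title"},
        {"name": "Creator", "dc": "creator"},
        {"name": "Subject", "dc": "subject"},
    ],
    "newspapers": [
        {"name": "Title", "dc": "title"},
        {"name": "Photographer", "dc": "creator"},
        {"name": "Subject", "dc": "coverage"},
    ],
}


def field_usage(snapshot):
    """Map each field name to {collection: Dublin Core mapping}."""
    usage = defaultdict(dict)
    for collection, fields in snapshot.items():
        for field in fields:
            usage[field["name"]][collection] = field["dc"]
    return dict(usage)


def inconsistent_fields(usage):
    """Return field names that are mapped to more than one Dublin Core element."""
    return sorted(
        name
        for name, by_collection in usage.items()
        if len(set(by_collection.values())) > 1
    )


usage = field_usage(SNAPSHOT)
for name in sorted(usage):
    print(name, usage[name])
print("Inconsistent mappings:", inconsistent_fields(usage))
```

In this toy snapshot the "Subject" field is flagged because the two collections map it to different elements, which is the sort of inconsistency a cross-collection view is meant to surface before cleanup.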
Figure 38, larger image: https://researchworks.oclc.org/cdmld/screenshots/field-analyzer-2.png

Cohort Communication

Communication is key to the success of any project and was vital for collaborating effectively with the Linked Data project participants. In addition to using the CONTENTdm Community Center for addressing questions and tracking progress, the project participants and OCLC staff met every two weeks for an “office hour.” These sessions covered work in progress, planning for future stages, and demonstrations of new applications and processes. Apart from those regularly occurring topics, many sessions included a more open-ended group exploration of other questions, including:

• What local authority sources are used for reconciling headings?
• What user research practices have been applied to evaluate your systems?
• How are access, use, and reuse rights managed at your institution? Are these rights documented in CONTENTdm? Are there different rights assigned for physical vs. digital materials?
• How is CONTENTdm technical and administrative metadata managed?
• How and when should a “placeholder” entity description be created, for things that lack an established identity?
• What are current local practices for metadata cleanup, and do the work location changes made in response to the COVID-19 pandemic impact the priority of that work?
• Who is using the CONTENTdm Catcher utility, and who else might use it but isn’t yet?
• How could advancing racial equity in CONTENTdm descriptive metadata be facilitated?

The office hours served as a key point of communication and connection over the course of the project. Engaging in discussions about the challenges and day-to-day work of managing digital collection metadata and receiving real-time feedback about developed applications and tools provided OCLC staff with critical insights to inform and improve project outputs.
Exploring the questions listed above as a group helped participants to share experiences and learn from one another. The regular connection points that the sessions provided were especially valuable with the onset of the COVID-19 pandemic and ensuing facility closures. Amid many disruptions, the meetings included periodic check-ins for project participants to discuss and reflect on the effects of the pandemic on their work and their libraries in the near- to long-term; the reported impacts were varied and substantial.

Partner Reflections

At the end of the Linked Data project, the project partners provided their perspectives, representing both complementary and contrasting views of their experiences, the benefits returned, and implications for the future.

CLEVELAND PUBLIC LIBRARY (CHATHAM EWING)

Cleveland Public Library (CPL) has partnered with OCLC on metadata issues over the last several years, beginning with Project Passage in 2018-19 and then in 2019-20 with this Linked Data project. CPL believed the projects would have several potential benefits. The projects presented an opportunity to motivate our staff and institution to revisit, revise, and improve our metadata content and structure. It had the potential to lead to better and more accurate description and consequently improved discovery for our customers. The projects also provided motivation to rethink how we might enable more effective sharing through platforms such as DPLA or WorldCat.

Over the course of the projects, our digital library staff engaged with and learned from other partners and OCLC’s team. OCLC’s team helped us deeply consider how linked data could have an impact on our descriptive work.
CPL staff presented on collections and processes, using our scrapbooks project as an example; were recorded live for the purposes of a user interface usability study; submitted CPL collections for analysis through the Field Analyzer and got back a useful matrix that enabled analysis of our work; and frequently conferred with OCLC team members as well as other partners. Our staff worked diligently to raise questions related to public library practice. Though the latter part of the Linked Data project happened during COVID-19, hampering our ability to explore some socially oriented goals with CPL digital library partners, staff were grateful for the intellectual lifeline the Linked Data project provided during the lockdown, and we eagerly anticipate working with project tools in the future to keep exploring some of the community-oriented possibilities for metadata brought up by the project.

We believe that much of what we anticipated did happen, but additional insights emerged from the process. The experience and results strongly validate implementing more, and more effective, approaches to the use of linked data in digital library contexts in public libraries, and we strongly support the report’s call for further investigation into using linked data. We also agree with the recognition that tools for reconciling data, particularly data such as name authorities and discipline-specific thesauri, should be an integral part of any advance with regard to digital library tools within OCLC’s suite of digital library applications.

But we feel there is another observation to make: digital collections described using linked data might be able to help explain what is “uniquely the same” about Cleveland as a place in the United States and the world.
Representing what is locally unique yet making local information legible to outsiders and creating a mechanism where differences can be understood, bridged, and linked is an important part of what public libraries using newer descriptive systems can surface. Because we already involve diverse community members and community partners in the creation of digital items for our collections, it seemed natural to think about how we might include our partners in description, as well.

During the project we looked at several examples of collecting information about and from CPL digital collections. We looked at our digital collection of scrapbooks, local newspapers, local theaters, and the library archives, and we tried to think about places where we had drawn description from the language of the communities we were working with rather than from internationally scoped name authority lists or cataloging thesauri. And that was promising, we thought. But it was when the project took a turn and collections like these were juxtaposed against one another that things became interesting.

The “Transportation Hub” was a useful example of how this concept was explored by the project. Each institution’s collecting around transportation was pulled together into a gathered collection, and the project implemented a platform that offered a glimpse of how to explore the role the digitized items played in each of the communities documented by the separate collections. The project team at OCLC discussed the challenges of normalizing and reconciling the data in the collections for the Transportation Hub, and the process highlighted typical challenges in reconciling data and enabling searching across multiple institutional collections.
And, as we mentioned before, the process also highlighted the labor-intensive nature of such work and spotlighted a long-standing need for more robust tools within the OCLC suite of applications for managing controlled vocabularies across collections in the context of digital library tools.

However, the OCLC staff’s discussion of the process also raised the question of what to do about significant local variances in uncontrolled description language for digitized items. Perhaps we should also uncover and share different communities’ understandings of more generalized concepts? It would seem that a linked data system holds up the promise of capturing some locally generated data that reflect local variances while also offering traditional authoritative descriptive data. We feel that a linked data system that includes some broader, more locally oriented mechanism for participation in description would be a powerful tool for our work doing digitization in our community. At CPL, we began to consider how we might describe collections in ways that not only make use of authorized names, subjects, and thesaurus terms to describe Cleveland’s unique and local digital collections but also describe our city and region’s uniqueness and make it legible through networks of links to alternative information, including local lingo, alternate lists of names, and diverse uses of descriptive language found in lists and resources that might supplement more standard descriptions.

A natural extension of this kind of thinking is that (for public libraries at least) tools need to be easy to use not only for professional catalogers but also for community experts.
A wiki offers a simple, public-facing user interface, but other interfaces might also be designed to be more inclusive and accessible, allowing digital projects to easily incorporate local, grounded expertise that librarians cannot often be expected to have. Linked data systems can facilitate that inclusivity by creating and making connections between related or synonymous local terminology and concepts.

Perhaps using linked data for description even has the potential to decenter hierarchies and master narratives about cultural heritage that may be implicit in approved authoritative descriptive practice, allowing for alternative hierarchies and assumptions to surface and enrich descriptive practice. And the local/global break with regard to epistemological understanding revealed through the Transportation Hub implies other breaks that could be drawn out from other collection gatherings that might also contribute to rich, differentiated hierarchies of description that will enable more diverse access through richer and more inclusive community-generated description. Perhaps we could design a system that is usable by expert catalogers (because solid hierarchies are a backbone of effective access), comprehensible to local metadata experts (because local historians have awesome expertise), and also open enough to capture (and sift) the kinds of description generated at the level of the general user.

This might not only lead to higher quality and more comprehensive metadata, but also, if handled well, create opportunities for deep community listening. We could generate access points to our information based upon empirical observation of how our communities create links within our information. This kind of engagement might enable libraries to engage patrons and learners, using digitization as a process for delving into what really makes their communities wonderful and unique.
THE HUNTINGTON LIBRARY, ART MUSEUM, AND BOTANICAL GARDENS (MARIO EINAUDI)

The invitation to join this project in August 2019 came as the Huntington Library was reviewing the digital collections accessible in the Huntington Digital Library (HDL), which had been launched in 2011. In 2018 it was determined that an overhaul and a full review of the metadata and structure of the 23 collections was needed. We had hopes that the Linked Data project might aid us in this endeavor.

Following the initial ingest of materials selected from three of our collections, the review and initial cleanup work, and the testing done by Bruce Washburn to feed our metadata into the Wikibase, it quickly became apparent that this pilot would not be able to help with that cleanup directly. Rather, the Linked Data project provided us the context to better understand our workflows, our metadata, and how we structured that metadata.

Importantly, this project did demonstrate the incredible value of linked data as a way of creating and maintaining metadata. Linked data in the Wikibase enabled the creation of a web of connections and context that is lacking in many other systems. A good example of the power of the tools developed using linked data was the Image Annotator. This tool allows the user to highlight a section of an image and then apply one of the known entities to that highlighted section. This creates links between that image and other images in the collection that would not otherwise exist unless the cataloger remembered that x also appeared in y and z. It provided a tantalizing look at a new tool for cataloging materials.

It would have been good to test some of these tools outside of the pilot. The Image Annotator, if reconfigured for use with a future CONTENTdm, would be a great improvement.
Subject specialists could be brought on board a project and asked to identify people or places, with the linked data providing added indexing in a controlled environment in the background. Also, the Explorer tool would enhance discovery across collections, both internally to the library and, if part of a larger linked data universe, to other libraries, large or small.

While the project was focused on the benefits of, and how to create, linked data, one tool grew out of the need to analyze the extant data in the participants’ systems. And that tool, the Field Analyzer, proved so useful that it stands above all the others. This tool enabled us to review all our collections systematically and plan cleanup more effectively. It has allowed us to pursue our goal of descriptive uniformity across all CONTENTdm collections. A companion tool that would replace, or build on, the Catcher interface, allowing for the cleaned-up metadata to be pushed back into our CONTENTdm site, would also have been a real boon. But the complexities faced in cleaning up the data, along with the entity-based structure within Wikibase, foreclosed that option.

Throughout this Linked Data project, OCLC staff were incredible, providing guidance, soliciting input, posing questions, and seeking solutions that engaged all the participants. The tools developed and the cleanup done by Bruce Washburn and Jeff Mixter show all the power and promise of linked data, as well as some of the hurdles. Yet, this is a path that should be followed, especially as CONTENTdm shows its age. The leap forward to a new solution has been greatly helped by the solid work done by all on this project. We will use the knowledge gained from this project to rethink our workflows and our descriptive metadata with an eye toward the promise of linked data.
MINNESOTA DIGITAL LIBRARY (GRETA BAHNEMANN AND JASON ROY)

Invitation

In July of 2019, the Minnesota Digital Library (MDL) was asked to join the CONTENTdm Linked Data Pilot project. Initially, we were one of three pilot partners. This invitation was an opportunity for us to see the practical application of Wikidata to MDL’s collection of images. MDL is a collection of digitized cultural heritage materials composed of images, text-based materials, cartographic materials, and more, with 67% of our collection represented by images. Given our high percentage of images, we were especially interested to see how our image metadata would reconcile and work with Wikidata. Would MDL’s metadata withstand this kind of work?

Development of three tools by OCLC

During the Linked Data project, OCLC developed three tools to assist the project participants:

1. Retriever: designed to help pilot partners search for and create entity descriptions. Especially helpful for those new to the process.
2. Image Annotator: a subject analysis tool that has the potential to change how we describe cultural heritage materials.
3. Field Analyzer: developed in response to the need of the pilot project participants but has usefulness beyond the pilot. This tool provides partners with a backend look at their data, and gives a comprehensive view of how data is mapped, field names used, etc. It quickly shows the inconsistencies in a collection’s data regarding field names, mapping, etc.

The Image Annotator has the most potential to change the user’s understanding of digital content. With its capacity to provide both a layer of subject analysis and descriptive details to images in CONTENTdm, it is no less than groundbreaking.
For example, an albumen photograph of a late 19th century home in Minneapolis can be “about” the concept/subjects of “Richardsonian Romanesque Style Architecture” and/or “Rock-face Construction;” but it can also “depict” things found in the image, such as a horse-drawn wagon, a fire hydrant, pedestrians, named individuals, etc. This added layer of meaning and contextualization can only add to the user’s understanding of the image. This is a type of analysis traditionally associated with the fields of fine art, architecture, and urban planning, and it has the potential to add more nuanced description to cultural heritage materials and change how users understand these materials. While this tool is valuable and has huge potential for changing how we describe cultural heritage materials, it can be a labor-intensive process that may not be sustainable on a large scale.

Leveraging the power of linked data

In terms of linked data support, a lot of initial effort was spent discussing how controlled vocabularies might best be ingested and stored within the CONTENTdm framework. The rationale behind this, one would believe, was to ensure that they would integrate better with our more hyperlocal vocabularies and taxonomies. That is, how best to blend national vocabularies alongside locally created terms to best describe the source material. Unfortunately, by bringing this “national” data into our local systems and storing it there, we are taking away some of the power of linked data; power that comes in the form of networked vocabularies that work best in a layer above our localized instances.
Linked data is powerful, in part because it is not tied to any one system, but rather, integrates content across collections, thereby creating user-discoverable connections across collections and, more importantly, repositories.

What may be a path forward is an opportunity for CONTENTdm to create web services that call upon these linked data sources at the point of need. This would allow catalogers and metadata creators the opportunity to align their local descriptive practices more closely with national and international initiatives. CONTENTdm would store the URI, not the term itself, thus creating linkages that would allow for more accurate and consistent sharing without “hardwiring” terms into the CONTENTdm data store.

It is, ultimately, the data store itself that is the most valuable piece of information. From this building block we can construct user interfaces and applications, share out our metadata for others to package, and scale out across multiple, shared repositories. We consume this data to create our local, default CONTENTdm view, but this same data can be packaged and shared in new ways. Applications such as that created by the Minnesota Digital Library85 consume the same data but build it out in different ways; additionally, this same data is openly shared with and aggregated by the Digital Public Library of America86 for use in their national initiative. Same data, different views.

Ultimately, it is the data that must remain interoperable enough to work across systems and alongside other data sources. Within the limited timeframe of our project, OCLC was able to provide a proof of concept of the potential for enhancing CONTENTdm metadata through linked data integrations by way of a single new view that builds upon the existing CONTENTdm user-facing discovery layer.
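The suggestion above, that a record store the URI rather than the term and resolve it at the point of need, can be sketched in a few lines. This is only an illustration under stated assumptions: the example.org URIs, the labels, and the record title are hypothetical, and the `LABEL_SOURCE` dictionary stands in for what would be a call to a remote linked data web service.

```python
from dataclasses import dataclass, field

# Stand-in for a remote linked data service. A real system would call a web
# service here; the example.org URIs and the labels are hypothetical.
LABEL_SOURCE = {
    "https://example.org/vocab/streetcars": "Streetcars",
    "https://example.org/place/minneapolis": "Minneapolis (Minn.)",
}


@dataclass
class Record:
    """A descriptive record that stores subject URIs, not subject strings."""
    title: str
    subject_uris: list = field(default_factory=list)


def resolve_label(uri, source=LABEL_SOURCE):
    """Look up the current label for a URI, falling back to the URI itself."""
    return source.get(uri, uri)


def display_subjects(record, source=LABEL_SOURCE):
    """Resolve every stored URI to a label at display time."""
    return [resolve_label(uri, source) for uri in record.subject_uris]


record = Record(
    title="Streetcar on a city street",  # hypothetical record
    subject_uris=[
        "https://example.org/vocab/streetcars",
        "https://example.org/place/minneapolis",
    ],
)
print(display_subjects(record))

# The vocabulary maintainer revises a label. The stored record is untouched,
# but the next display picks up the change.
LABEL_SOURCE["https://example.org/vocab/streetcars"] = "Street-railroads"
print(display_subjects(record))
```

The design point is that the record never changes when the vocabulary does: only the link is stored locally, so a heading revised at the source appears consistently everywhere that URI is used.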
Concluding thoughts

We believe that this work should result in the further decoupling of some of these tight integrations in order to achieve our desired results: separating out the data store from the data view layer; leveraging the URI for further linkages out toward reliable and trustworthy linked data sources within the data store itself; and allowing for the open sharing of our data (and our assets as well through the existing IIIF infrastructure) with others to achieve large scales of discovery and to better network our data alongside that of our colleagues.

Included in all of this should be a discussion of the future application of the tools OCLC developed for this project. The Image Annotator and Field Analyzer could be integrated into the CONTENTdm package/workflow. The Image Annotator would help CONTENTdm users (both administrators and crowd-sourced end users) provide more robust, nuanced description, and the Field Analyzer would help CONTENTdm adopters see their data, across multiple collections, in a single interface. Both tools should be developed further, thereby making CONTENTdm more user-friendly for both administrators and end users.

The Minnesota Digital Library was excited to be a part of this pilot project. In addition to learning more about the practical application of Wikidata, it was a great opportunity to get to know staff at OCLC and speak about the potential future of CONTENTdm in a collaborative environment.

TEMPLE UNIVERSITY LIBRARIES (HOLLY TOMREN AND MICHAEL CARROLL)

In 2019 we joined the CONTENTdm Linked Data project. Compared with our previous experience with Project Passage, which was more about one-by-one original description, the CONTENTdm pilot was more about batch transformation of existing metadata.
OCLC staff consulted with us about how they planned to map our metadata and to answer any of our questions, and we provided feedback about the mappings as well as any questions we had about the data model, which was now much expanded from what we had started with in Project Passage. This gave us a sense of what a future data migration would look like, and how migration to a linked data model can be even more complex than a migration from one flat metadata model to another.

After OCLC transformed our data, we evaluated it to see how this could help us look at our metadata differently, where there is room for further data enrichment, and what new relationships and connections we can create with a system that is built to do so. One thing that particularly stood out was the range of ways we could browse our data using the Explorer tool that OCLC developed. At Temple, our customized library discovery layer is built on three concepts: Search, Browse, and Recommend. But so far, we have only implemented Search. As we’ve thought internally about Browse features, we’ve struggled with a way to approach this that is different from the standard Title, Author, Subject browse from the past. The CONTENTdm Explorer offers a model that provides a variety of different starting points for browsing and then allows a user to traverse a graph of relationships, which is inspiring as we continue to develop our local discovery environment.

Linked data also provides different opportunities for how we can search our CONTENTdm metadata, particularly through more indirect relationships between entities in the system.
For example, we were thinking of the use case where we might have an “On This Day” feature to post on social media. Using the SPARQL endpoint, we were able to develop queries that find images depicting people born on a certain day, or images depicting people born in Philadelphia, which could help us select featured images for different scenarios.

Participating in the project introduced us to the Wikibase interface and to exciting tools for enhancing the discovery of and engagement with digital records. The Wikibase offered a glimpse into what a digital collections database that employs linked data might look like and how the cataloging process might change. For instance, the inclusion of clickable headings for each entity has the potential to make it even easier for student catalogers to understand the context of the terms they use to describe an image.

The Describer prototype tool was a simplified visual interface that enables cataloging based on the resource type classification. This tool felt more approachable than the Wikibase interface. Like controlled vocabularies in CONTENTdm, the text box of the Describer tool automatically suggested verified terms, but the tool felt more intuitive and more tailored to what the user was typing.

It was very useful from a cataloging perspective to have access to the image, supported by IIIF standards and viewers, and to be able to zoom in to see details while describing it. We also thought the Image Annotator had a lot of potential for being able to associate a part of an image with a specific depicts or subject property, and it would be interesting to see how that could be incorporated into the end user discovery experience.

One of the potential impacts of this project would be to rethink our cataloging workflows in accordance with a linked data structure.
The Temple University team described existing images as a group exercise that proved challenging without the original objects in front of us. It became clear during this exercise that we would also need to generate more nuanced descriptions when cataloging in order to develop a richer network of relationships between entities.

UNIVERSITY OF MIAMI LIBRARIES (PAUL CLOUGH AND ELLIOT WILLIAMS)

Participating in the Linked Data project was an opportunity for us to understand more concretely what it would take to transform our existing collections into linked data and what a linked data version of CONTENTdm might look like. Interacting with our metadata in the Wikibase environment raised valuable questions about how our existing metadata practices might complicate the transition to linked data, such as a lack of standardization of elements and inconsistent uses of existing vocabularies, and inspired us to focus more on data normalization and consistency. Some of the insights and tools that came out of the project, such as the Field Analyzer, will be immediately useful for our work in CONTENTdm, even outside of the transition to linked data. Participating in a cohort with other CONTENTdm users and OCLC staff was also a great opportunity to learn from and with our peers.

The Linked Data project demonstrated the amount of work involved in the transition to linked data, but also that the tools exist and that the workflows can be developed. While we appreciate the promise of linked data, we believe that more work still needs to be done to show that the effort will be worth it.
KEY FINDINGS AND CONCLUSIONS

The linked data project reaffirmed some prior lessons learned and provided new insights across a range of concerns, including the expected benefits of working in a linked data environment, the potential to develop a shared data model, a reality check on the effort to transform metadata to linked data, and the essential benefits of a strong partnership.

TESTING THE LINKED DATA VALUE PROPOSITION

The project confirmed key aspects of the linked data value proposition: that cultural material discovery and data management can be significantly improved when the materials are described using a shared and extensible data model, when metadata string-based headings are transformed to linked data entities and relationships, and when those entities and relationships are brought together into a single discovery system. In this environment, the technology works in service to both the staff, who can more easily and accurately impart the expertise they have about the collections they steward, and to the researcher, who can see more robust connections between, and context about, the cultural materials that make up CONTENTdm collections.

In project prototype applications, entities can be retrieved by searches that use a persistent identifier rather than a string heading. This capability provides integrated authority control for the entities and greatly improves the precision and recall performance metrics for discovery.

As CONTENTdm string headings are reconciled and converted to entities, additional information from external data sources can automatically and efficiently enrich the entity description. This supports new discovery and data visualization capacities that would be expensive or impossible to achieve in the current CONTENTdm system.
For example, place entity descriptions can be enriched with geographic coordinates, which can then be used to generate map-based visualizations of places depicted in cultural materials.

In an entity-oriented system like Wikibase, different types of entities have their own distinct representation. This design contrasts with record-oriented systems where the creative work is the primary entity and other types of things are only present as statements representing notes and headings that are associated with the work. Data management and maintenance efficiencies are gained by transforming these statements into entities. For example, a biographical statement about a person can be associated with that person’s entity description, rather than repeated as a note in every record that is in some way about that person.

EVALUATING A SHARED DATA MODEL

Building an initial data model with a high-level structure informed by other standards, including Dublin Core and Schema.org, provided a solid set of initial classes and properties. The model could be effectively and responsively expanded based on new entities and relationships represented in the source metadata. The metadata and mapping discussions with pilot partners helped OCLC develop the data model, as data was encountered in the CONTENTdm sources that OCLC had not anticipated.

SELECTING AND TRANSFORMING METADATA

Data transformation tools should be shared and the workflows decentralized. This will be essential to making the conversion scalable, as the workload is too great for a central agency to carry out. Domain expertise is needed to determine how locally defined fields are used at the institution level and sometimes at the collection level.
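To illustrate why that local expertise matters, the following minimal Python sketch assumes two collections that use the same local field nickname for different purposes. The collection aliases, field nicknames, and target property names are all hypothetical and are not taken from the pilot data model; the point is only that a mapping may have to be declared per collection, and that fields without a mapping are better carried along than silently discarded.

```python
# Hypothetical per-collection field mappings. Collection aliases, field
# nicknames, and target property names are invented for illustration only.
FIELD_MAPPINGS = {
    "photographs": {"title": "name", "subjec": "depicts", "creato": "creator"},
    "letters": {"title": "name", "subjec": "about", "creato": "author"},
}


def map_record(collection, record):
    """Split a flat metadata record into mapped and unmapped fields.

    Mapped fields are renamed to the shared property chosen for this
    collection; everything else is returned separately for later review.
    """
    mapping = FIELD_MAPPINGS[collection]
    mapped, unmapped = {}, {}
    for field, value in record.items():
        if field in mapping:
            mapped[mapping[field]] = value
        else:
            unmapped[field] = value
    return mapped, unmapped


# The same local field ("subjec") resolves to a different shared property
# depending on which collection the record comes from.
photo = {"title": "Public Square", "subjec": "Streetcars", "notes": "Gift of donor"}
letter = {"title": "Letter to a colleague", "subjec": "Astronomy"}

photo_mapped, photo_unmapped = map_record("photographs", photo)
letter_mapped, letter_unmapped = map_record("letters", letter)
```

Keeping the unmapped fields in a separate structure, rather than dropping them, leaves room for a domain expert to decide later how each one should be modeled.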
Though it required considerable manual effort, most headings for concepts and places found in CONTENTdm source metadata could be reconciled to matching entities described in other sources, including the Wikidata knowledge base, the VIAF authority file, FAST, and GeoNames. Not surprisingly, given the relative lack of notability of some of the represented people and organizations, their headings often could not be found in any of the external sources OCLC used for reconciliation, which led to manual data entry for a “placeholder” entity.

Other than the initial field mapping review, pilot participants did not get a more in-depth “behind-the-scenes” view of the data processing workflows, which could have been offered as “office hour” homework or a workshop. In retrospect that appears to be a missed opportunity.

For the transition to linked data to be comprehensive and complete, a set of new CONTENTdm tools is called for that can be applied to transformation and reconciliation workflows in a decentralized way, along with fundamental changes to the centralized CONTENTdm system. A paradigm shift of this scale will necessarily take time to carry out and calls for long-term strategies and planning.

CONTINUING THE JOURNEY TO LINKED DATA

Substantial resource commitments will be required to carry out these data transformations across all CONTENTdm institutions and collections, but the community does not need to wait for the transformation to linked data to be fully completed before it can see benefits. This work already yields data management and discovery benefits in the current CONTENTdm environment, and downstream linked data transformation efficiencies accrue as metadata makes greater use of shared vocabularies and persistent identifiers.
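The reconciliation pattern described in this section, matching a string heading to an entity with a persistent identifier and falling back to a locally minted “placeholder” when no match exists, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pilot's workflow: the in-memory index stands in for external sources such as Wikidata, VIAF, FAST, or GeoNames, and every identifier, label, and function name below is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    identifier: str  # persistent identifier (made-up values in this sketch)
    label: str
    source: str


# Stand-in for external reconciliation sources; contents are hypothetical.
EXTERNAL_INDEX = {
    "cleveland (ohio)": Entity("ext:0001", "Cleveland (Ohio)", "example-gazetteer"),
}


def normalize(heading):
    """Lowercase a heading and collapse runs of whitespace."""
    return " ".join(heading.lower().split())


def reconcile(headings):
    """Map each distinct heading to a matched entity or a new placeholder."""
    resolved = {}
    placeholder_count = 0
    for heading in headings:
        key = normalize(heading)
        if key in resolved:
            continue  # variant spellings of a known heading share one entity
        match = EXTERNAL_INDEX.get(key)
        if match is None:
            # No external match: mint a local placeholder entity instead.
            placeholder_count += 1
            match = Entity(
                f"placeholder:{placeholder_count}",
                " ".join(heading.split()),
                "local",
            )
        resolved[key] = match
    return resolved


result = reconcile(["Cleveland (Ohio)", "  cleveland   (ohio) ", "Jane Q. Example"])
```

In this toy run, the two spellings of the place heading collapse onto one identifier, while the unmatched personal name receives a placeholder that could later be merged with an externally described entity if one is found.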
For the transition to linked data to be comprehensive and complete, a set of new CONTENTdm tools are called for that can be applied to transformation and reconciliation workflows in a decentralized way, along with fundamental changes to the centralized CONTENTdm system. A paradigm shift of this scale will necessarily take time to carry out and calls for long-term strategies and planning. 66 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability Several of the prototype applications developed during the pilot point the way to advantageous additions to the CONTENTdm toolkit. In particular, the Image Annotator encourages domain experts to enrich material descriptions, and the Field Analyzer helps CONTENTdm users make sense of the variations in field definitions and uses across their collections (a prerequisite for more holistic data rationalization and transformation). The project participants encouraged OCLC to pursue these and other improvements as part of CONTENTdm’s evolution into a linked data platform. WORKING PARTNERSHIPS REPRESENT STRENGTH IN NUMBERS The value of library participants as partners in this project cannot be overstated. As colleagues and thought partners in the work, participants connected with project staff in regularly scheduled office hours throughout the project. Through these meetings and regular communications, project participants shared their thoughts on topics ranging from philosophical approaches and concepts to technical details and provided ongoing feedback that steered the project work toward tools and applications of greatest practical value for library staff and researchers. Recognizing the critical insights contributed by the project partners confirms the importance of involving library staff in this manner for similar technical research projects. Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 67 N O T E S 1. 
OCLC’s CONTENTdm digital content management service overview: https://www.oclc.org/en/contentdm.html. 2. An overview of OCLC’s history of Linked Data research projects: https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-outputs.html. 3. W3C. “Linked Data.” https://www.w3.org/wiki/LinkedData. 4. An overview of OCLC’s linked data pilot Project Passage: https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-prototype.html; See also: Godby, Jean, Karen Smith-Yoshimura, Bruce Washburn, Kalan Davis, Karen Detling, Christine Fernsebner Eslao, Steven Folsom, Xiaoli Li, Marc McGee, Karen Miller, Honor Moody, Holly Tomren, and Craig Thomas. 2019. Creating Library Linked Data with Wikibase: Lessons Learned from Project Passage. Dublin, OH: OCLC Research. https://doi.org/10.25333/faq3-ax08. 5. The Wikibase environment includes several components: The MediaWiki Platform: https://www.mediawiki.org/wiki/MediaWiki; MediaWiki. “Wikibase:Overview—MediaWiki extension for managing structured data. Updated 29 December 2020, at 19:51. https://www.mediawiki.org/wiki/Wikibase; Wikipedia. “Triplestore: [...] a purpose-built database for the storage and retrieval of triples through semantic queries.” Updated 12 November 2020, at 18:12 (UTC). https://en.wikipedia.org/wiki/Triplestor; Wikipedia. “SPARQL” (Query service for reading data from the triplestore). Updated 3 January 2021, at 14:42 (UTC). https://en.wikipedia.org/wiki/SPARQL. 6. CONTENTdm Linked Datat Planned Project Phases diagram: https://researchworks.oclc.org/cdmld/screenshots/phase-diagram.png. 7. OCLC. 2020. “Guide to the CONTENTdm Catcher.” Updated 7 August 2020. https://help.oclc.org/Metadata_Services/CONTENTdm/CONTENTdm_Catcher/Guide_to_the _CONTENTdm_Catcher. 8. Wikipedia: https://www.wikipedia.org/. 9. Wikidata: The free knowledge base. Updated 30 December 2019, at 04:00. https://www.wikidata.org/wiki/Wikidata:Main_Page. 10. SPARQL [Linked Data] query language for RDF. 
W3C Recommendation 15 January 2008. https://www.w3.org/TR/rdf-sparql-query/. https://doi.org/10.25333/C3FC9Q https://www.oclc.org/en/contentdm.html https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-outputs.html https://www.w3.org/wiki/LinkedData https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-prototype.html https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-prototype.html https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-prototype.html https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-prototype.html https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-prototype.html https://www.mediawiki.org/wiki/MediaWiki https://www.mediawiki.org/wiki/Wikibase https://en.wikipedia.org/wiki/Triplestore https://en.wikipedia.org/wiki/SPARQL https://researchworks.oclc.org/cdmld/screenshots/phase-diagram.png https://help.oclc.org/Metadata_Services/CONTENTdm /CONTENTdm_Catcher/Guide_to_the_CONTENTdm_Catcher https://help.oclc.org/Metadata_Services/CONTENTdm /CONTENTdm_Catcher/Guide_to_the_CONTENTdm_Catcher https://www.wikipedia.org/ https://www.wikidata.org/wiki/Wikidata:Main_Page https://www.w3.org/TR/rdf-sparql-query/ 68 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 11. Wikibase system architecture diagram: https://researchworks.oclc.org/cdmld/screenshots /wikibase-system-architecture.png. 12. CONTENTdm Class data model visualization: https://researchworks.oclc.org/cdmld /screenshots/class-ontology.png. 13. Background on the Dublin Core Metadata Initiative and the Dublin Core element set: https://dublincore.org. 14. The Dublin Core Metadata Initiative DCMI Type Vocabulary: https://www.dublincore.org/specifications/dublin-core/dcmi-type-vocabulary/. 15. Linked Art project: https://linked.art/. 16. 
Example use of type, classification, and process or format properties in the description of a postcard: https://researchworks.oclc.org/cdmld/screenshots/entity-Q73226.png. 17. A depicts statement for the concept of “Dogs”: https://researchworks.oclc.org/cdmld/screenshots/entity-Q147731.png. 18. A type statement of “dog” for a specific dog: https://researchworks.oclc.org/cdmld/screenshots/entity-Q142481.png. 19. The RDF Linked Data modeling vocabulary “RDF Schema”: https://www.w3.org/TR/rdf-schema/. 20. The class “dog” is defined by the concept “Dogs”: https://researchworks.oclc.org/cdmld/screenshots/entity-Q73829.png. 21. Wikibase templates for proposing new properties: https://researchworks.oclc.org/cdmld/screenshots/cdm-property-proposal.png; https://researchworks.oclc.org/cdmld/screenshots/cdm-property-proposal-is-defined-by.png. 22. Unmapped CONTENTdm metadata displayed in the Wikibase user interface with a Gadget extension: https://researchworks.oclc.org/cdmld/screenshots/entity-Q143578.png. 23. Collections evaluated for the pilot project: • Cleveland Public Library º Cleveland Picture Collection: https://cplorg.contentdm.oclc.org/digital/collection /p4014coll18/search/searchterm/cleveland%20picture%20collection/field/collec /mode/exact/conn/and/order/sortda/ad/asc; º Jasper Wood photos of Cleveland: https://cdm16014.contentdm.oclc.org/digital /collection/p4014coll18/search/searchterm/jasper+wood/field/creato/mode/all /conn/and; º John G. White Collection of Chess and Checkers, Chess Player Portraits Collection: https://cdm16014.contentdm.oclc.org/digital/collection/p4014coll20/search /searchterm/Chess Portraits. 
https://researchworks.oclc.org/cdmld/screenshots/wikibase-system-architecture.png https://researchworks.oclc.org/cdmld/screenshots/wikibase-system-architecture.png https://researchworks.oclc.org/cdmld/screenshots/class-ontology.png https://researchworks.oclc.org/cdmld/screenshots/class-ontology.png https://dublincore.org https://www.dublincore.org/specifications/dublin-core/dcmi-type-vocabulary/ https://linked.art/ https://researchworks.oclc.org/entity/Q73226 https://researchworks.oclc.org/cdmld/screenshots/entity-Q73226.png https://researchworks.oclc.org/cdmld/screenshots/entity-Q147731.png https://researchworks.oclc.org/entity/Q147731 https://researchworks.oclc.org/cdmld/screenshots/entity-Q142481.png https://www.w3.org/TR/rdf-schema/ https://researchworks.oclc.org/cdmld/screenshots/entity-Q73829.png https://researchworks.oclc.org/cdmld/screenshots/cdm-property-proposal.png https://researchworks.oclc.org/cdmld/screenshots/cdm-property-proposal-is-defined-by.png https://researchworks.oclc.org/cdmld/screenshots/entity-Q143578.png https://cplorg.contentdm.oclc.org/digital/collection/p4014coll18/search/searchterm/cleveland%20picture%20collection/field/collec/mode/exact/conn/and/order/sortda/ad/asc https://cplorg.contentdm.oclc.org/digital/collection/p4014coll18/search/searchterm/cleveland%20picture%20collection/field/collec/mode/exact/conn/and/order/sortda/ad/asc https://cplorg.contentdm.oclc.org/digital/collection/p4014coll18/search/searchterm/cleveland%20picture%20collection/field/collec/mode/exact/conn/and/order/sortda/ad/asc https://cdm16014.contentdm.oclc.org/digital/collection/p4014coll18/search/searchterm/jasper+wood/field/creato/mode/all/conn/and https://cdm16014.contentdm.oclc.org/digital/collection/p4014coll18/search/searchterm/jasper+wood/field/creato/mode/all/conn/and https://cdm16014.contentdm.oclc.org/digital/collection/p4014coll18/search/searchterm/jasper+wood/field/creato/mode/all/conn/and 
https://cdm16014.contentdm.oclc.org/digital/collection/p4014coll20/search/searchterm/Chess Portraits https://cdm16014.contentdm.oclc.org/digital/collection/p4014coll20/search/searchterm/Chess Portraits Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 69 • The Huntington Library, Art Museum, and Botanical Gardens: º Edwin Hubble Papers: https://cdm16003.contentdm.oclc.org/digital/collection /p15150coll2/search/searchterm/Edwin%20Hubble%20Papers/field/physic/mode /exact/conn/and; º Palmer Conner Collection of Color Slides of Los Angeles, 1950 - 1970: https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm /Palmer+Conner+Collection+of+Color+Slides+of+Los+Angeles%2C+1950+-+1970/field /physic/mode/all/conn/and/order/nosort; º Photographs of the California Missions by William Henry Jackson https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm /Photographs%20of%20the%20California%20Missions%20by%20William%20 Henry%20Jackson/field/physic/mode/exact/conn/and; º Verner Collection of Panoramic Negatives https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm /Verner+Collection+of+Panoramic+Negatives/field/physic/mode/all/conn/and /order/title. • Minnesota Digital Library: º American Swedish Institute: https://reflections.mndigital.org/?f%5Bcollection _name_ssi%5D%5B%5D=American+Swedish+Institute; º Becker County Historical Society: https://reflections.mndigital.org/?f%5Bcollection _name_ssi%5D%5B%5D=Becker+County+Historical+Society; º Kanabec County Historical Society: https://reflections.mndigital.org/?f%5Bcollection _name_ssi%5D%5B%5D=Kanabec+County+Historical+Society. • Temple University: º John W. Mosley Photograph Collection: https://digital.library.temple.edu/digital/collection/p15037coll17; º Temple History in Photographs. 
Templana Event Album Collection: https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search /searchterm/Templana%20Event%20Album%20Collection/field/reposa/mode /exact/conn/and; º Temple History in Photographs. Templana Photograph Collection: https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search /searchterm/Templana%20Photograph%20Collection/field/reposa/mode/exact /conn/and; º Temple History in Photographs. Temple Times Photographs: https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search /searchterm/Temple%20Times%20Photographs/field/reposa/mode/exact/conn/and; º Temple University Libraries. YWCA Philadelphia Branches Records: https://digital.library.temple.edu/digital/search/collection /p16002coll6!p15037coll19!p15037coll14!p16002coll2/searchterm /YWCA%20Philadelphia%20Branches%20Records/field/digitb/mode/exact/conn/and. • University of Miami: º Cuban Map Collection: https://merrick.library.miami.edu/cubanHeritage/chc0468/; https://cdm16003.contentdm.oclc.org/digital/collection/p15150coll2/search/searchterm/Edwin%20Hubble%20Papers/field/physic/mode/exact/conn/and https://cdm16003.contentdm.oclc.org/digital/collection/p15150coll2/search/searchterm/Edwin%20Hubble%20Papers/field/physic/mode/exact/conn/and https://cdm16003.contentdm.oclc.org/digital/collection/p15150coll2/search/searchterm/Edwin%20Hubble%20Papers/field/physic/mode/exact/conn/and https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Palmer+Conner+Collection+of+Color+Slides+of+Los+Angeles%2C+1950+-+1970/field/physic/mode/all/conn/and/order/nosort https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Palmer+Conner+Collection+of+Color+Slides+of+Los+Angeles%2C+1950+-+1970/field/physic/mode/all/conn/and/order/nosort https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Palmer+Conner+Collection+of+Color+Slides+of+Los+Angeles%2C+1950+-+1970/field/physic/mode/all/conn/and/order/nosort 
https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Photographs%20of%20the%20California%20Missions%20by%20William%20Henry%20Jackson/field/physic/mode/exact/conn/and https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Photographs%20of%20the%20California%20Missions%20by%20William%20Henry%20Jackson/field/physic/mode/exact/conn/and https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Photographs%20of%20the%20California%20Missions%20by%20William%20Henry%20Jackson/field/physic/mode/exact/conn/and https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Verner+Collection+of+Panoramic+Negatives/field/physic/mode/all/conn/and/order/title https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Verner+Collection+of+Panoramic+Negatives/field/physic/mode/all/conn/and/order/title https://hdl.huntington.org/digital/collection/p15150coll2/search/searchterm/Verner+Collection+of+Panoramic+Negatives/field/physic/mode/all/conn/and/order/title https://reflections.mndigital.org/?f%5Bcollection_name_ssi%5D%5B%5D=American+Swedish+Institute https://reflections.mndigital.org/?f%5Bcollection_name_ssi%5D%5B%5D=American+Swedish+Institute https://reflections.mndigital.org/?f%5Bcollection_name_ssi%5D%5B%5D=Becker+County+Historical+Society https://reflections.mndigital.org/?f%5Bcollection_name_ssi%5D%5B%5D=Becker+County+Historical+Society https://reflections.mndigital.org/?f%5Bcollection_name_ssi%5D%5B%5D=Kanabec+County+Historical+Society https://reflections.mndigital.org/?f%5Bcollection_name_ssi%5D%5B%5D=Kanabec+County+Historical+Society https://digital.library.temple.edu/digital/collection/p15037coll17 https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Templana%20Event%20Album%20Collection/field/reposa/mode/exact/conn/and 
https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Templana%20Event%20Album%20Collection/field/reposa/mode/exact/conn/and https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Templana%20Event%20Album%20Collection/field/reposa/mode/exact/conn/and https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Templana%20Photograph%20Collection/field/reposa/mode/exact/conn/and https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Templana%20Photograph%20Collection/field/reposa/mode/exact/conn/and https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Templana%20Photograph%20Collection/field/reposa/mode/exact/conn/and https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Temple%20Times%20Photographs/field/reposa/mode/exact/conn/and https://cdm16002.contentdm.oclc.org/digital/collection/p245801coll0/search/searchterm/Temple%20Times%20Photographs/field/reposa/mode/exact/conn/and https://digital.library.temple.edu/digital/search/collection/p16002coll6!p15037coll19!p15037coll14!p16002coll2/searchterm/YWCA%20Philadelphia%20Branches%20Records/field/digitb/mode/exact/conn/and/order/title/ad/asc https://digital.library.temple.edu/digital/search/collection/p16002coll6!p15037coll19!p15037coll14!p16002coll2/searchterm/YWCA%20Philadelphia%20Branches%20Records/field/digitb/mode/exact/conn/and/order/title/ad/asc https://digital.library.temple.edu/digital/search/collection/p16002coll6!p15037coll19!p15037coll14!p16002coll2/searchterm/YWCA%20Philadelphia%20Branches%20Records/field/digitb/mode/exact/conn/and/order/title/ad/asc https://merrick.library.miami.edu/cubanHeritage/chc0468/ 70 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability º Latin American and Caribbean Photograph Collection: https://merrick.library.miami.edu/cdm/search/collection/asm0304; º Rosenstiel 
School of Marine & Atmospheric Science Photograph Collection: https://merrick.library.miami.edu/rsmas/rsmasphotos/. 24. Wikibase Discussion page for a collection review: https://researchworks.oclc.org/cdmld/screenshots/cdm-item-talk-Q148309.png. 25. The OpenRefine software for cleaning up, analyzing, and reconciling metadata: https://openrefine.org/. 26. CONTENTdm collection metadata viewed in OpenRefine: https://researchworks.oclc.org/cdmld/screenshots/openrefine-project.png. 27. IIIF International Image Interoperability Framework website: https://iiif.io/. 28. A triplestore is a database to manage linked data “triples”, which are a combination of a subject, predicate, and object: https://en.wikipedia.org/wiki/Triplestore. 29. Wikidata OpenRefine reconciliation endpoint software. See Delpeuch, Antonin. (2017) 2020. “Wetneb/Openrefine-Wikibase.” Python. https://github.com/wetneb/openrefine-wikibase. 30. OCLC’s FAST (Faceted Application of Subject Terminology) system: https://www.oclc.org/research/areas/data-science/fast.html. 31. VIAF OpenRefine reconciliation endpoint service: http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php. 32. The GeoNames service for geographic data: https://www.geonames.org/. 33. The Python scripting language. See “Manual:Pywikibot/Overview - MediaWiki.” n.d. Accessed 7 January 2021. https://www.mediawiki.org/wiki/Manual:Pywikibot/Overview. https://www.python.org/. 34. “Help:QuickStatements - Wikidata.” Edited on 4 January 2021, at 10:41. https://www.wikidata.org/wiki/Help:QuickStatements. 35. Pywikibot Python library overview. See “Manual:Pywikibot/Overview - MediaWiki.” n.d. Accessed 7 January 2021. https://www.mediawiki.org/wiki/Manual:Pywikibot/Overview. https://www.mediawiki.org/wiki/Manual:Pywikibot/Overview. 36. OCLC DevConnect Online 2020 presentation on the alternative OpenRefine reconciliation endpoint software developed during the pilot project. See Mixter, Jeff, and Bruce Washburn. 2020. 
“Building an OpenRefine Reconciliation Endpoint for a Wikibase project: Lessons Learned.” Produced by OCLC, 20 May 2020. MP4 video presentation, 58:01. https://www.oclc.org/en/events/2020/devconnect-online-2020/devconnect-2020-creating -linked-descriptive-data-for-contentdm.html. 37. A “placeholder” entity for a person without an established identity: https://researchworks.oclc.org/cdmld/screenshots/entity-Q144548.png. https://merrick.library.miami.edu/cdm/search/collection/asm0304 https://merrick.library.miami.edu/rsmas/rsmasphotos/ https://researchworks.oclc.org/cdmld/screenshots/cdm-item-talk-Q148309.png https://openrefine.org/ https://researchworks.oclc.org/cdmld/screenshots/openrefine-project.png https://iiif.io/ https://en.wikipedia.org/wiki/Triplestore https://github.com/wetneb/openrefine-wikibase https://www.oclc.org/research/areas/data-science/fast.html http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php https://www.geonames.org/ https://www.python.org/ https://www.wikidata.org/wiki/Help:QuickStatements https://www.mediawiki.org/wiki/Manual:Pywikibot/Overview https://www.oclc.org/en/events/2020/devconnect-online-2020/devconnect-2020-creating-linked-descriptive-data-for-contentdm.html https://www.oclc.org/en/events/2020/devconnect-online-2020/devconnect-2020-creating-linked-descriptive-data-for-contentdm.html https://researchworks.oclc.org/cdmld/screenshots/entity-Q144548.png https://researchworks.oclc.org/entity/Q144548 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 71 38. An example CONTENTdm compound object for a photograph album. See University of Miami Libraries. “Album Documenting a Sea Journey to Trinidad, Venezuela, and Grenada.” Latin American and Caribbean Photograph Collection. Digital Collections. Accessed 7 January 2021, https://cdm17191.contentdm.oclc.org/digital/collection/asm0304/id/1311. 39. 
Example “has creative work part” statements for the parts of an album: https://researchworks.oclc.org/cdmld/screenshots/entity-Q73586.png. 40. RDF Resource Description Framework standard for linked data. See W3C Semantic Web. “RDF: Resource Description Framework.” Updated 15 March 2014, at 21:35. https://www.w3.org/RDF/. 41. RDF Turtle textual syntax. See Beckett, David, Tim Berners-Lee, Eric Prud’hommeaux, and Gavin Carothers. 2014. “RDF - Semantic Web Standards.” https://www.w3.org/TR/turtle/. 42. RDF N-triples plain text syntax. See W3C Semantic Web. 2014. “RDR 1.1 N-Triples: A Line-based Syntax for an RDF Graph.” https://www.w3.org/TR/n-triples/. 43. RDF JSON-LD format for linked data. See Sporny, Manu, Dave Longley, Gregg Kellogg, Markus Lanthaler, Pierre-Antoine Champin, and Niklas Lindström. 2020. “JSON-LD 1.1: A JSON-based Serialization for Linked Data.” W3C Editor’s draft. Edited by Gregg Kellogg, Pierre-Antoine Champin and Dave Longley. Posted 14 November 2020. https://w3c.github.io/json-ld-syntax/. 44. JSON (JavaScript Object Notation) data format. See Wikipedia. “JSON.” Updated 31 December 2020, at 22:32 (UTC). https://en.wikipedia.org/wiki/JSON. 45. The PHP Group. “Object Serialization: Serializing Objects - Objects In Sessions. PHP Manual. Accessed 7 January 2021. https://www.php.net/manual/en/language.oop5.serialization.php. 46. DPLA Metadata Application Profile documentation: https://pro.dp.la/hubs/metadata-application-profile. 47. Schema.org metadata schema documentation. See “Organization of Schemas.” 2021. https://schema.org/docs/schemas.html. 48. W3C Semantic Web. “Web Ontology Language (OWL).” Updated 11 December 2013, at 11:38. https://www.w3.org/OWL/. 49. Kellogg, Greg (ed). 2020. “JSON -LD Best Practices: W3C Editor’s Draft 20 February 2020.” W3C (MIT, ERCIM, Keio, Beihang). https://w3c.github.io/json-ld-bp/. 50. Appleby, Michael, Tom Crane, Robert Sanderson, Jon Stroop, and Simeon Warner. 2018. “JSON-LD Design Patterns.” Chap. 
3 in IIIF Design Patterns. International Image Interoperability Framework Consortium. https://iiif.io/api/annex/notes/design_patterns/#json-ld-design-patterns. 51. Other names associated with the Los Angeles Dodgers entity: https://researchworks.oclc.org/cdmld/screenshots/entity-Q166325.png. 52. First parts of the description of Jasper Wood: https://researchworks.oclc.org/cdmld/screenshots/entity-Q147700.png. 53. SPARQL Query map visualization of places depicted in works from a collection: https://researchworks.oclc.org/cdmld/screenshots/sparql-visualization.png. https://cdm17191.contentdm.oclc.org/digital/collection/asm0304/id/1311 https://researchworks.oclc.org/cdmld/screenshots/entity-Q73586.png https://www.w3.org/RDF/ https://www.w3.org/TR/turtle/ https://www.w3.org/TR/n-triples/ https://w3c.github.io/json-ld-syntax/ https://en.wikipedia.org/wiki/JSON https://www.php.net/manual/en/language.oop5.serialization.php https://pro.dp.la/hubs/metadata-application-profile https://schema.org/docs/schemas.html https://www.w3.org/OWL/ https://w3c.github.io/json-ld-bp/ https://iiif.io/api/annex/notes/design_patterns/#json -ld-design-patterns https://researchworks.oclc.org/cdmld/screenshots/entity-Q166325.png https://researchworks.oclc.org/cdmld/screenshots/entity-Q147700.png https://researchworks.oclc.org/cdmld/screenshots/sparql-visualization.png 72 Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 54. Wikibase Gadgets extension documentation. See MediaWiki. “Extension:Gadgets.” Updated 16 October 2020, at 11:36. https://www.mediawiki.org/wiki/Extension:Gadgets. 55. Mirador IIIF-compatible image viewer project website: https://projectmirador.org/. 56. Mirador image viewer embedded in the Wikibase user interface: https://researchworks.oclc.org/cdmld/screenshots/entity-Q165895.png. 57. 
Contextual data and image from DBPedia and Wikimedia Commons embedded in the Wikibase user interface: https://researchworks.oclc.org/cdmld/screenshots/entity-Q71945.png. 58. Constraints quality assurance Wikibase mechanism documentation. See MediaWiki. “Extension:Wikibase Quality Extensions.” Archived 7 January 2019, at 13:45. https://www.mediawiki.org/wiki/Extension:Wikibase_Quality_Extensions. 59. A constraint violation indicating that the “occupation” property should only be used for instances of the type “person” https://researchworks.oclc.org/cdmld/screenshots /entity-Q73246.png. 60. OCLC CONTENTdm Custom Pages with CSS and JavaScript documentation. Updated 28 June 2018. https://help.oclc.org/Metadata_Services/CONTENTdm/Advanced_website_customization / Custom_pages/Custom_pages_with_CSS_and_JavaScript. 61. OCLC. CONTENTdm Advanced Website Customization Cookbook website: https://help.oclc.org /Metadata_Services/CONTENTdm/Advanced_website_customization/Customization_cookbook. 62. Google’s Structured Data Testing Tool: https://search.google.com/structured-data/testing-tool. [Google has announced that this tool is being discontinued.] 63. CONTENTdm Schema.org data evaluated using the Google Structured Data Testing tool: https://researchworks.oclc.org/cdmld/screenshots/google-structured-data-testing-tool.png. 64. Additional contextual information displayed in CONTENTdm based on entity descriptions in the pilot Wikibase: https://researchworks.oclc.org/cdmld/screenshots /cdm15725-p16003coll7-14.png. 65. Image Annotator initial view with subjects: https://researchworks.oclc.org/cdmld/screenshots/image-annotator-1.png. 66. Image Annotator cropped image of a person: https://researchworks.oclc.org/cdmld/screenshots/image-annotator-2.png. 67. Image Annotator after adding more depicted subjects: https://researchworks.oclc.org/cdmld/screenshots/image-annotator-3.png. 68. 
Wikibase item updated with depicted subjects and associated cropped images: https://researchworks.oclc.org/cdmld/screenshots/entity-Q148552.png. 69. Nielsen, Jakob. 2012. “Thinking Aloud: The #1 Usability Tool.” Nielsen Norman Group. Posted 15 January 2020. https://www.nngroup.com/articles/thinking-aloud-the-1-usability-tool/. https://www.mediawiki.org/wiki/Extension:Gadgets https://projectmirador.org/ https://researchworks.oclc.org/cdmld/screenshots/entity-Q165895.png https://researchworks.oclc.org/cdmld/screenshots/entity-Q71945.png https://www.mediawiki.org/wiki/Extension:Wikibase_Quality_Extensions https://researchworks.oclc.org/cdmld/screenshots/entity-Q73246.png https://researchworks.oclc.org/cdmld/screenshots/entity-Q73246.png https://help.oclc.org/Metadata_Services/CONTENTdm/Advanced_website_customization/Custom_pages/Custom_pages_with_CSS_and_JavaScript https://help.oclc.org/Metadata_Services/CONTENTdm/Advanced_website_customization/Custom_pages/Custom_pages_with_CSS_and_JavaScript https://help.oclc.org/Metadata_Services/CONTENTdm/Advanced_website_customization/Customization_cookbook https://help.oclc.org/Metadata_Services/CONTENTdm/Advanced_website_customization/Customization_cookbook https://search.google.com/structured-data/testing-tool https://researchworks.oclc.org/cdmld/screenshots/cdm15725-p16003coll7-14.png https://researchworks.oclc.org/cdmld/screenshots/cdm15725-p16003coll7-14.png https://researchworks.oclc.org/cdmld/screenshots/image-annotator-1.png https://researchworks.oclc.org/cdmld/screenshots/image-annotator-2.png https://researchworks.oclc.org/cdmld/screenshots/image-annotator-3.png https://researchworks.oclc.org/cdmld/screenshots/entity-Q148552.png https://www.nngroup.com/articles/thinking-aloud-the-1-usability-tool/ Transforming Metadata into Linked Data to Improve Digital Collection Discoverability 73 70. 
Retriever search results from Wikidata, VIAF, and FAST for “lake vermilion”: https://researchworks.oclc.org/cdmld/screenshots/retriever-1.png. 71. Retriever entity editor: https://researchworks.oclc.org/cdmld/screenshots/retriever-2.png. 72. Wikibase entity created by the Retriever: https://researchworks.oclc.org/cdmld/screenshots/entity-Q221424.png. 73. Knublauch, Holger, and Dimitris Kontokostas (eds). 2017. “Shapes Constraint Language (SHACL): W3C Recommendation 20 July 2017.” W3C. https://www.w3.org/TR/shacl/. 74. ShEx Shape Expressions Language W3C Recommendation. See Prud’hommeaux, Eric, Lovka Boneva, Jose Labra Gayo, and Gregg Kellogg. 2017. “Shape Expressions Language 2.0: Draft Community Group Report 27 March 2017.” W3C. http://shex.io/shex-semantics-20170327/. 75. Editing essential details for an entity in the Describer: https://researchworks.oclc.org/cdmld/screenshots/describer-1.png. 76. Explorer home page: https://researchworks.oclc.org/cdmld/screenshots/explorer-1.png. 77. Explorer Transportation Hub and related collections: https://researchworks.oclc.org/cdmld/screenshots/explorer-2.png. 78. Explorer search results for “strike”: https://researchworks.oclc.org/cdmld/screenshots/explorer-3.png. 79. Explorer view of a truck bringing workers home during a PTC walkout: https://researchworks.oclc.org/cdmld/screenshots/explorer-4.png. 80. Explorer view of a protest against the Philadelphia Transportation Company: https://researchworks.oclc.org/cdmld/screenshots/explorer-5.png. 81. Explorer view of an 1899 Cleveland transit strike in Public Square: https://researchworks.oclc.org/cdmld/screenshots/explorer-6.png. 82. Explorer view of streetcars parked on the street during a transit strike: https://researchworks.oclc.org/cdmld/screenshots/explorer-7.png. 83. Field Analyzer field usage chart: https://researchworks.oclc.org/cdmld/screenshots/field-analyzer-1.png. 84. 
Field Analyzer list of field values: https://researchworks.oclc.org/cdmld/screenshots/field-analyzer-2.png. 85. Minnesota Digital Library website. See University of Minnesota. “Minnesota Reflections.” https://reflections.mndigital.org/. 86. Digital Public Library of America website: https://dp.la/.
For more information about our work related to digitizing library collections, please visit: oc.lc/digitizing 6565 Kilgour Place Dublin, Ohio 43017-3395 T: 1-800-848-5878 T: +1-614-764-6000 F: +1-614-764-6096 www.oclc.org/research ISBN: 978-1-55653-185-9 DOI: 10.25333/fzcv-0851 RM-PR-216817-WWAE 2101 OCLC RESEARCH REPORT
cohen-machine-2021 ---- Chapter 12

Machine Learning + Data Creation in a Community Partnership for Archival Research

Jason Cohen, Berea College
Mario Nakazawa, Berea College

Introduction: Cultural Heritage and Archival Preservation in Eastern Kentucky

In this chapter, two researchers, Jason Cohen and Mario Nakazawa, describe the contexts for an archivally focused project that emerged from a partnership between the Pine Mountain Settlement School (PMSS)1 in Harlan County, Kentucky, and scholars and students at Berea College. In this process, we have entered into a critical dialogue with our sources and knowledge production that Roopika Risam calls for in “self-reflexive” investigations in the digital humanities (2015, para. 16). Risam’s intervention, nevertheless, does not explicitly distinguish questions of class and the concomitant geographic constraints that often accompany the economic and social disadvantages of poverty (Ahmed et al. 2018). Our work demonstrates how class and geography are tied, even in digital archives, to the need for reflexive and diverse approaches to humanist materials. For instance, a recent invited contribution to Proceedings of the IEEE articulates a need for diversity in computing and technology without mentioning class or region as factors shaping these related issues of diversity (Stephan et al. 2012, 1752–5). Given these constraints, perhaps it is also pertinent to acknowledge that the machine learning application we describe in this chapter is itself not particularly novel in scope or method: we describe our data acquisition and preparation, and two parallel implementations of commercially available tools for facial recognition. What stands out as unique are the ethical and practical concerns tied to bringing unique archival materials out of their local contexts into a larger conversation about computer vision as a tool that helps liberate, and at the same time possibly endanger, a subaltern cultural heritage.

1 See http://pinemountainsettlementschool.com.

In that light, we enter our archival investigation into what Bruno Latour has productively named “actor-network theory” (2007, 11–13) because, as we suggest below, our actions were highly conditioned not only by the physical and social spaces our research occupies and where its events occur, but also by the nature of the historical artifacts themselves, which act powerfully to shape our work in these contexts. Moreover, the partnership model of curation and archiving that we pursued in this project complicates the very concept of agency because the actions forming the project emerged from a continuing dialogue rather than any one decision or hierarchy. As we suggest later, a distributed model for decisions (Sabharwal 2015, 52–5) also revealed the limitations of using a participatory and identity-based model for archival development and management. Indeed, those historical artifacts will exert influence on this network of relations long after any one of us involved in the current project has ceased to pursue them. When we came to this project, we asked a version of a classic question that has arisen in a variety of forms beginning with very early efforts by Bell Laboratories, among others, to translate data structures to suit the often flexible needs of humanist data: “what aspects of life are formalizable?” (Weizenbaum 1976, 12). We discovered that while an ontology may represent a formalized relationship of an archive to a database or finding aid, it also asks questions about the ethical implications of what information and embedded relationships can be adequately formalized by an abstract schema.
The Promises and Realities of Technology After Coal in Eastern Kentucky

Despite the longstanding threats of having to adapt to a post-coal economy, Harlan County, Kentucky continues to rely on coal and the mountains from which that coal is extracted as two of the cornerstones that shape the identity of the territory as well as the people who call it home. The mountains of Eastern Kentucky, like much of Appalachia, are by turns beautiful and devastated, and both authors of this essay have found conversations with Eastern Kentucky’s citizens about the role the mountains play and the traditions that emerge from them both insightful and, at times, heartbreaking. This dramatic landscape, with its drastic challenges, may not sound like a place likely to find uses for machine learning. You would not be alone in your assumption. Standing far from urban centers of technology and mobility, Eastern Kentucky combines deeply structural problems of generational poverty with a hard-won understanding that, since the moment of the region’s colonization, outsiders have taken resources and made uninformed decisions about what the region needs, or where it should turn in order to gain a better purchase on the narrative of American progress, self-improvement, and the unavoidable allures of development-driven capitalism. Suspicion of outsiders is endemic here. And unfortunately, economic and social conditions, such as the high workplace injury rates associated with mining and extraction-related industries, the effects of the pharmaceutical industry’s abuse of prescription opioids to treat a wide array of medical pain symptoms without treating the underlying causal conditions, and the systematic dismantling of federal- and state-level social support programs, have become increasingly acute concerns today. But this trajectory is not new: when President Lyndon B.
Johnson announced the beginning of the War on Poverty in 1964, he landed an hour away in Martin County, and subsequently, drove through Harlan on a regional tour to inaugurate the initiative. Successive generations have sought to leave a mark, and all the while, the residents have been collecting their own local histories of their place. Our project, centered on recovering a latent social network of historical families represented by the images held in one local archive, mobilizes this tension between insiders’ persistence and outsiders’ interventions to think about how, as Bruno Latour puts it, we can “reassemble the social” while still respecting the local (2007, 191–2). PMSS occupies a unique position in this social and physical landscape: both local in its emplacement and attention, and a site of philanthropic work that attracted outside money as well as human and cultural capital, PMSS is at once of Harlan County and beyond it. As we suggest in the later sections of this essay, PMSS’s position, both within local and straddling regional boundaries, complicates the network we identified. More than that, however, its split position complicates the relationships of power and filiation embedded in its historical social network.

While an economy centered on coal continues to define the Eastern Kentucky regional identity, a second history can be told about this place and its people, one centered on resilience, independence, simplicity, and beauty, both of the land and its people. This second history has made outsiders’ recent appeals for the region to court technology as a potential solution for what comes “after coal” particularly attractive to a region that prides itself on its capacity to sustain, outlast, and overcome obstacles. While that techno-utopian vision offers another version of the self-aggrandizing Silicon Valley bootstraps success story J.D.
Vance narrates in Hillbilly Elegy (2016), like Vance’s story itself, those narratives most often get told by outsiders to outsiders using regional stereotypes as the grounds for a sales pitch. In reality, however, those efforts have largely proven difficult to sustain, and at times have become the sources of potentially explosive accusations of fraud and malfeasance. Recently, for instance, organizations including Mined Minds2 have been accused by residents aiming to prepare for a post-coal economy of misleading students at best and of fraud at worst. As with the timber, coal, and gas extraction industries that preceded these software development firms’ aspirations, the promises of technology have not been kind to Eastern Kentucky, and in particular, as with those extraction industries that preceded them, the technological-industrial complex making its pitch in Kentucky’s mountains has not returned resources to the region’s residents whom the work was intended at least nominally to support (Hochschild 2018; Campbell 2019; Bailey 2017).

In this context of technology, culture, and the often controversial position machine learning occupies in generating obscure metrics for its classifiers that may embed bias, our project aims to activate its archival holdings and bring critical awareness to the question of how to actively engage with a paper archive of a local place as we venture further into our pervasively digital moment. The School operates today as a regional cultural heritage institution; it opened in 1913 as a residential school and operated as an educational institution until 1974, at which point it transformed itself into an environmental and cultural outreach institution focused on developing its local community and maintaining the richness of the region’s cultural resources and heritage.
Every year since 1974, PMSS has brought hundreds of students and citizens onto its campus to learn about nature and the landscape, traditional crafts and artistic practices, and musical and dance forms, among many other programs. Similarly, it has created a space for locals to come together for social events, community celebrations, and festival days, and at the same time, has become a destination for national-level events that create community from shared interests including foodways, wildflowers, traditional dance forms, and other wide-ranging attractions.

2 See http://www.minedminds.org/.

Project Background: Preserving Cultural Heritage in Harlan County

The archives of the Pine Mountain Settlement School emerge from its shifting history. The majority of its papers relate to its time as a traditional institution of education, including student records (which continue to be restricted for several reasons, including FERPA constraints, and personal and community interests in privacy), minutes of its board meetings (again, partially restricted), and financial and narrative accounts of its many activities across a year. The school’s records are unique because they provide a snapshot, year by year and month by month, of the region’s interests and challenges during key years of the 20th Century, spanning the First World War to Vietnam. In addition, they detail the relations the School maintained with a philanthropic base of donors who helped to support it and shape it, and who, beyond its local relations, placed it in contact with a larger set of cultural interactions than a boarding school that relied on tuition or other profit-driven means to sustain its operations would have had.
While the archival holdings continued to be informally developed by its directors and staff, who kept the official papers organized roughly by year, the archive itself sat largely neglected after 1974. Beginning around the turn of the millennium, a volunteer archivist named Helen Wykle began digitizing items one by one, and soon, hosted a curated selection of those digital surrogates along with interpretive and descriptive narration on a WordPress installation, The Pine Mountain Settlement School Collections.3 The PMSS Collections WordPress site has been continuously running and frequently updated by Wykle and the volunteer community members she has organized since 1999.4 Together with her collaborators and volunteers, Wykle has grown the WordPress site to over 2200 pages, including over 30,000 embedded images that include photographs and newspapers; scanned memos, meeting minutes and other textual material (in JPG and PDF formats); HTML transcriptions and bibliographies hard-coded into the pages; scanned images of 3-D collections objects like textile looms or wood carving tools; partially scanned runs of serial publications; and other composite visual material. None of those objects was hosted within a regular and complete metadata hierarchy or ontology: no regular scheme of fields or file-naming convention was followed, no controlled vocabulary was maintained, no object-types were defined, no specific fields were required prior to posting, and perhaps unsurprisingly as a result, the search and retrieval functions of the site had deteriorated noticeably.

In 2016, Jason Cohen approached PMSS with the idea of using its archives as the basis for curricular development at Berea College.5 Working in collaboration beginning in 2017, Mario Nakazawa and Cohen developed two courses in digital and computational humanities, led a team-directed study in augmented reality in coordination with Pine Mountain, contributed materials and methods for a new course in Appalachian Studies, and promoted the use of PMSS archival materials in several other extant courses in history and art history, among others. These new college courses each make use of PMSS historical documents as a shared core of visual and textual material in a digital and computational humanities concentration that clusters around critical archival and textual studies.6

3 See https://pinemountainsettlement.net/.
4 Jason Cohen and Mario Nakazawa wish to extend a note of appreciation to Helen Hays Wykle, Geoff Marietta, the former director of PMSS, and Preston Jones, its current director, for welcoming us and enabling us to access the physical archives at PMSS from 2016–20.
5 Jason Cohen would like to recognize the support this project received from the National Endowment for the Humanities’ “Humanities Connections” grant. See grant number AK-255299-17, description online at https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=AK-255299-17.

The success of that initial collaboration and course development seeded the potential in 2019–2021 for a Whiting Public Engagement7 fellowship focused on developing middle and high school curricula for use in Kentucky public schools with PMSS archival materials. That Whiting-funded project has generated over 80 lessons keyed to Kentucky state standards; these lessons are currently in use at nine schools across eight school districts, and each school is using PMSS materials to highlight its own regional and local interests. The work we have done with these archives has thus far reached the classrooms of at least eleven different middle and high school teachers, and as a result, touched over 450 students in eastern and central Kentucky public schools.
We mention these numbers in order to demonstrate that our collaboration has been neither shallow nor fleeting. We have come to know these archives quite well, and because they are not adequately cataloged, the only way to get to know them is to spend time reading through the materials one page at a time. An ancillary consequence of this durable collaboration and partnership across the public-academic divide is the shared recognition early in 2019 that the PMSS archival database and its underlying data structure (a flat SQL database generated by the WordPress interface) would provide inadequate stability for records management and quality control in future development. In addition, we discovered that the interpretive materials and metadata associated with the WordPress installation were also insufficient for linked metadata across the objects in this expanding digital archive, for reasons discussed below.

As partners, we decided together to migrate to a ContentDM instance hosted by the Kentucky Virtual Library,8 a consortium to which Berea College belongs, and which is open to future membership from PMSS. That decision led a team of Berea College undergraduate and faculty researchers to scrape the data from the PMSS archive site and supplement the images and transcriptions it contains with available textual metadata drawn from the site.9 Alongside the WordPress instance as our reference, we were also granted access to a Dropbox account that hosted higher resolution versions of the images featured on the blog. The scraper pulled over 19,228 unique images (and located over 11,000 duplicate images in the process), 732 document transcriptions for scanned texts on the site, and 380 subject and person bibliographies, including Library of Congress Subject Headings that had been hard-coded into the site’s HTML.
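The chapter does not say how the scraper told unique images from duplicates. As one plausible illustration only, the sketch below assumes that duplicates are byte-identical files and groups them by a SHA-256 digest; the function names, file names, and sample bytes are all hypothetical and are not taken from the PMSS scraper.

```python
import hashlib
from collections import defaultdict


def group_by_content(images):
    """Group image names by the SHA-256 digest of their bytes.

    `images` maps a (hypothetical) file name or URL to raw image bytes.
    """
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return groups


def split_unique_and_duplicates(images):
    """Return (unique, duplicates): the first name seen per digest, then the rest."""
    unique, duplicates = [], []
    for names in group_by_content(images).values():
        unique.append(names[0])
        duplicates.extend(names[1:])
    return unique, duplicates


# Toy stand-ins for scraped files; the names and bytes are invented.
sample = {
    "loom_1920.jpg": b"\xff\xd8loom",
    "loom_1920_copy.jpg": b"\xff\xd8loom",
    "maypole_1935.jpg": b"\xff\xd8maypole",
}
unique, duplicates = split_unique_and_duplicates(sample)
print(len(unique), "unique;", len(duplicates), "duplicate")
```

Exact hashing of this kind only catches identical files. Re-scanned or resized copies of the same photograph would need a perceptual comparison instead, which is a different technique from the one sketched here.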
We also extracted the unique object identifiers and labels associated with each image, which in WordPress are not associated with the image objects themselves. We used that data to populate the ContentDM instance and returned a sparse but stable skeleton for future archival development. In the process, we also learned a good deal about how a future implementation of a controlled vocabulary, an image acquisition and processing pipeline, and object documentation standards should work in the next stages of our collaborative PMSS archival development.

6 In the original version of the collaboration, we had planned also to teach basic computer programming to high school students during a summer program that also would have used that same set of materials, but with the paired departures of the original co-PI as well as the former director, that plan has thus far remained unfulfilled.
7 See https://www.whiting.org/content/jason-cohen.
8 See https://kdl.kyvl.org/.
9 Jason Cohen wishes to thank Mario Nakazawa, Bethanie Williams, and Tradd Schmidt for undertaking this project with him. The GitHub repo for the PMSS scraper is hosted here: https://github.com/Tradd-Schmidt/PMSS_Scraper.

As we developed and refined this new point of entry to the digital archives using the ContentDM hosting and framework, some of the ethical issues surrounding this local archive came more clearly into focus. A parallel set of questions arose in response in the first instance to J.D. Vance’s work, and in the second, to outsiders’ claims for technological solutions to the deterioration of local and cultural heritage.
Because we were creating virtual archival surrogates for materials housed at Pine Mountain, for instance, questions arose from the PMSS board members related to privacy and use of historical materials. Further, the board was concerned that even historical materials could bear on families present in the community today. We found that while profession-wide responses to archival constraints are shaped predominantly by discussions of copyright and fair use, issues of personal privacy are often left tacit. This gap between legal use and public interests in privacy reveals how tasks executed using techniques in machine learning may impinge upon the ethical constraints of public trust and civic obligation.10

Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the “historical validation” of primary source materials (2017, 424–5). When an AI system recognizes an object, Calo remarks, that object is validated. But how should one handle the lack of a specific vocabulary within a given training set? One answer, of course, would be to train a new set, but that response is becoming increasingly prohibitive for smaller cultural heritage projects like ours: the time and computational power required to execute the training is non-negligible. In addition, training resources (such as data sets, algorithms, and platforms) are increasingly becoming monetized, and we do not have the margins to buy access to new data for training. As a consequence, questions stemming from how one labels material in a controlled vocabulary were also at issue.
We encountered a failure in historical validation when, for instance, our AI system labeled a “spinning wheel” as a wheel, but did not detect its historical relationship to weaving and textiles. That validation was further obscured when the system also failed to categorize a second form of “spinning wheel,” which refers locally to a home-made merry-go-round.11 In other words, not only did the system flatten a spinning wheel into a generic wheel, it also missed the regional homology between textile production and play, a cultural crux that reveals how this place envisions an intersection between work and recreation. By breaking the associations between two forms of “spinning wheel,” our system erased a small but significant site of cultural inheritance. How, we asked, should one handle such instances of effacement? At one level, one would expect an archival system to be able to identify the primitive machine for spinning wool, flax, or other raw materials into usable thread for textiles, but what about the merry-go-round? And what should one do when a system neglects both of these meanings and reduces the object to the same status as a wheel on a tractor, car, or carriage?

10 The professional conversation in archive and collections management has not been as rich as the one emerging in AI contexts more broadly. For a recent discussion of the conflict in the roles of public trust and civic service that emerge from the context of the powers artificial intelligence holds for image recognition in policing applications, see Elizabeth Joh, “Artificial Intelligence and Policing: First Questions,” Seattle University Law Review 41: 1139–44.
11 See “Spinning Wheel” in Cassidy 1985–2012.

Similarly, when competing naming conventions arose for landmarks, we took care to consider which name should be granted priority as the default designation, and we asked how one should designate a local or historical name, whether for a road, waterway, knob, or other feature, in relationship to a more widely accepted nomenclature such as state route designations or standardized toponyms. As we attempted to address the challenge of multiple naming conventions, we encountered some of the same challenges that archivists find in dealing with indigenous peoples and their textual, material, and physical artifacts.12 Following an example derived from the Passamaquoddy people, we implemented a small set of “traditional knowledge labels”13 to describe several forms of information, including (a) restrictions on images that should not be shown to strangers (to protect family privacy), (b) places that should remain undisclosed (for instance, wild ginseng, ramp, orchid, or morel mushroom patches), and (c) educational materials focused on “how it was done” as related to local skills and crafts that have more modern implementations, but for which the traditional practices have remained meaningful. This included cases such as Maypole dancing and festivals, which remain endowed with ritual significance.

In the final analysis, neither the framework supplied by copyright and fair use nor the one supplied by data validation proved singularly adequate to our purposes, but they did provide guidelines from which our facial recognition project could proceed, as we discuss below.

Machine Learning in a Local Archive

These preliminary discussions of ethics and convention may seem unrelated to the focus this collection adopts toward machine learning and artificial intelligence in the archive.
However, as we have begun to suggest, the data migration to ContentDM opened the door to machine learning for this project, and those initial steps framed the pitfalls that we continue to navigate as we move forward. As we suggested at the outset, the technical machine-learning task that we set for ourselves is not cutting-edge research so much as an application of existing technologies to a new aspect of archival investigation. We proposed (and succeeded with) an application of commercial facial recognition software to identify the persons in historic photographs in the PMSS archives. We subsequently proposed, and are currently working, to identify the photographs sharing common but unnamed faces and, in coordination with photographs of known people, to re-create the social network of this historic institution across slices of its history. We describe the next steps briefly below, but let us tarry for a moment with the question of how the ethical concerns we navigated up to this point also influenced our approach to facial recognition.

The first of those concerns has to do with commercial and public access to archival materials that, as we suggested above, include materials that are designated as restricted use in some way. We demonstrated to the local members at Pine Mountain how our use case and its constraints for digital archives fit with the current standards for the fair use of copyrighted materials based on the “substantive transformation” of reproduced objects (Levendowski 2018, 622–9). Since we are not making available large bodies of materials still protected by copyright, and since our use of select materials shifts the context within which they are presented, we were able to negotiate with PMSS to allow us to design a system for facial recognition using the ContentDM instance as our image source.
What that negotiation did not consider, however, is the case in which fair use does not provide a sufficiently high standard of control for the institution involved in the application of algorithms to institutional memory or its technological dependencies.

First, to test the facial recognition processes, we reached back to the most primitive and local version of facial recognition software that we could find, Google’s Picasa Web Albums API, a platform that was retired in May 2016 and fully deprecated as of March 2018 (Sabharwal 2016). We chose Picasa because it is a self-contained software application that operates using a locally hosted script and locally hosted images. Given its deprecated status and its location on a local machine, we were confident that no cloud services would be ingesting the images we fed into the system for our trial. This meant that we could test small data examples without fear of having to upload an entire corpus of material that could subsequently be incorporated into commercial facial recognition engines or pop up unexpectedly in search results. We thus began by upholding a high threshold for privacy and insisting on finding ways for PMSS to maintain control over these images within the grasp of its local directories.

12 One well-documented digital approach to handling indigenous archival materials includes the Mukurtu platform for indigenous cultural heritage: https://mukurtu.org/.
13 For the original traditional knowledge labels, see: https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels.

The Picasa system created surprisingly good results within the scope we allowed it. It was highly successful at matching the small group of known faces we supplied as test materials.
While it would be difficult to supply a numerical match rate, first because of this limited test set, and second because we have not expanded the test to a broad sample using another platform, we were anecdotally surprised at how robust Picasa’s matching was in practice. For instance, Picasa matched the images of a single person’s face, Celia Cathcart, from pictures of her as a teenager to images of her as a grandmother. It recognized Cathcart in a group of basketball players, and it also identified her face from side-view and off-center angles, as in a photograph of her looking down at her newborn child. The most immediate limitation of Picasa lies in its tagging, which required manual entry of every name and did not allow any automation.

Following the success of that hand-tagging and cross-image identification process, we discussed with our partners whether the next step, using Amazon Web Services’ computer vision and facial recognition platform, Rekognition, would be acceptable. They agreed, and we ran the images through the AWS application, testing our results against samples pulled from our Picasa run to verify the results. Perhaps unsurprisingly, AWS Rekognition fared even better with those test cases. Using a single photograph, the AWS application identified all of the Picasa matches as well as three new images that had not previously been tagged with Cathcart’s name. The same pattern held for other images in our sample group: Katherine Pettit was positively identified across more likenesses than had been previously tagged, and Alice Cobb was also positively tracked across images. This positive attribution also reveals a limitation of the metadata: while these three women we have named are important historical figures at PMSS, and while they are widely acknowledged in the archive and well-represented in the photographic record, not all of the photographs have been well-tagged or fully documented in the archive.
The newly tagged images that we found would enrich the metadata available to the archive not because these images include surprising faces, but rather because the tagging has been inconsistent, and over time, previously known faces have become less easy to discern.

Like other recent discussions of private materials disclosed within systems trained for matching and similarity, we found that the ethics of private materials for this non-private purpose provoked strong reactions. While some of the reaction was positive, with community members happy to have more images of the School’s founding director, Katherine Pettit, identified, those same community members were not comfortable with our role as researchers identifying people in the photographs in their community’s archive, unsupervised. They wanted instead to verify each positive identification, a point that we agreed with, but which also hindered the process of moving through 19,000 images. They wanted to maintain authority, and while we saw our efforts as contributions to their goals of better describing their archival holdings, it turns out that the larger scope of automation we brought to the project was intimidating. While its legal status and direct ethics seemed settled before the beginning of the project, ultimately, this project contributed to a sense among some individuals at PMSS that they were losing control of their own archive.14 That fear of a loss of control led to another reckoning with the project, as we discuss in the next section.

What Machine Learning Cannot Learn: An Ethics of the Archive

It became clear, at the same moment we validated our test case, that our research goals and those of our partners had quickly diverged. We had discussed the scope and use of PMSS materials with our partners at PMSS and laid out our shared goals for the project in a formally drafted “Memorandum of Understanding” (MOU) adapted from the US Department of Justice (2008; 2017).
As we described in the MOU, both partners considered it mutually beneficial for the archive and its metadata to be able to identify faces of named as well as unnamed people. We aimed to capture single-person images as well as groups in order to enrich the archive with cross-links to other photographs or archival materials with a shared subject heading, and we hoped to increase the number of names included in object attributes. Despite those conversations and multiple revisions of the MOU draft, what we discovered was ultimately different from the path our planning had indicated. Instead of creating an historical social network using the five decades of photographs we had prepared, we found that the history of the social network and the family and kinship relationships detailed through those images was deeply personal for the community living in the region today. We found out the hard way that those kinships reflected economic changes in status and power, realignments among families and their communities, and new patterns in the social fabric formed by the warp of personal relationships and the weft of local institutions (schools, hospitals, and local governance). Revealing those changes was not always something that our partners wanted us to do, and these were not patterns we had sought to discover: they are simply there, embedded in the images and the relations among images. These social changes in local alignments (tied in complex ways to marriages and separations, legal conflicts and resolutions, changes in ownership of residential and commercial interests, and other material reflections of that social fabric) remain highly charged and, for those continuing to live in the area, they revealed potentially unexpected parts of the lived realities and values of the place.
As a result, even though we had an MOU that worked for the technical details of the project, we could not find common ground for how to handle the competing social and ethical values of the project.

As we problem-solved, we tried to describe new forms of restriction and to generate appropriately sensitive guidelines to handle future use and access, but it turned out that all of these approaches were threatening to the values of a tightly knit community. They, rightly, want to tell their story, and so many people have told it so poorly for so long that they wish to have sole access to the materials from which the narratives are assembled. As researchers interested in open access and stable platform management, we have disagreements with the scholarly and archival implications of this decision, but we ultimately respect the resolve and underlying values that accompany the difficult choices PMSS makes about its public audiences and the corresponding goals it maintains for its collections. Interestingly, Wykle has come to view our work with PMSS collections as another form of the material and cultural extraction that has dominated the region for generations. While we see our work in light of preservation and access as well as our lasting commitment to PMSS and the region, we have also come to recognize the powerful explanatory force that the idea of “extraction” has become for the communities in a region that has suffered the negative effects of many forms of extractive industry.

14 See, for another example of the ethical quandaries that may be associated with legal applications of machine learning techniques, Ema et al. 2019.

In acknowledging the limitations of our own efforts, we would posit that our case study offers a counter-example to works that suggest how AI systems can be designed automatically to meet the needs of their constituents (Winfield et al. 2019).
We tried to use a design approach to address our research goals and our partner’s needs, and it turned out that the dynamically constructed and evolving nature of those needs outstripped the capacity we could build into our available system of machine learning.

The divergence of our goals has led the collaboration to an impasse. Given that we had already outlined further steps in our initial documents that could not be satisfied after the partners identified their divergent intentions, the collaborative scope the partners initially described was not completely fulfilled. The divergence of goals became stark: as researchers interested in the relevance and sustainability of these archives, we were moving the collections toward a more accessible and comprehensive platform with open documentation and protocols for future development. By contrast, the PMSS staff were moving toward more stringent and local controls over access to the archives in order to limit dissemination. At this juncture, we had some negotiating to do. First, we made the ContentDM instance a password-protected, private sandbox rather than a public instance of a virtual digital collection. As PMSS owns the material, they decided shortly thereafter to issue a take-down order of the ContentDM instance, and we complied. As the ContentDM materials were ultimately accessible in the public domain on their live site, this decision revealed how personal the challenges had become. Nothing included in the take-down order was unique or new material; rather, the ContentDM site simply provided a more accessible format for existing primary material on the WordPress site, stripped of its interpretive and secondary contexts.
If there is a silver lining, it lies in this context for use: the “academic divorce” we underwent by discontinuing our collaboration has made it possible for us to continue conducting research on the publicly available archival materials without being obligated to host a live and dynamic repository for further materials. As a result, we can test best approaches without having to worry about pushing them to a live production site. Within this constraint, we aim to continue re-creating the historical social network without compromising our partners’ needs for privacy and control of their production site.

The mutual decision to terminate further partnership activities based in archival development arose because of these differing paths forward. That decision meant that any further enrichment of the archival materials would not become publicly available, which we saw as a penalty against using the archive at a moment when archives need as much advocacy and visible support as possible. Under these constraints of private accessibility, we have continued to work on the AWS Rekognition pipeline and have successfully identified all of the faces of named people featured in the archive, with face and name labels now associated with over 1,900 unique images. Our next step, delayed to Spring 2021 as a result of the COVID-19 pandemic, includes the creation of an associative network that first identifies unnamed faces in each image using unique identifiers. The second element of that process will be to generate an historical social network using the co-occurrence among those faces as well as the faces of named people in the available images.
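A minimal sketch of the co-occurrence step just described, under stated assumptions: it is illustrative Python rather than the project's pipeline, the `cooccurrence_edges` helper and the `face_*` and `img_*` identifiers are hypothetical, and the input is assumed to be a mapping from each image to the anonymized identifiers of the faces detected in it. Each pair of faces that appears in the same image adds one unit of weight to the edge between them.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(faces_by_image):
    """Count how often each pair of face identifiers appears in the same image."""
    edges = Counter()
    for face_ids in faces_by_image.values():
        # De-duplicate and sort so every pair has a single canonical ordering.
        unique_ids = sorted(set(face_ids))
        for pair in combinations(unique_ids, 2):
            edges[pair] += 1
    return edges


# Hypothetical, anonymized input: image key -> identifiers of faces found in it.
sample = {
    "img_001": ["face_a", "face_b", "face_c"],
    "img_002": ["face_a", "face_b"],
    "img_003": ["face_c"],
}

edges = cooccurrence_edges(sample)
for (left, right), weight in sorted(edges.items()):
    print(left, right, weight)
```

In this toy input, `face_a` and `face_b` share two images, so their edge carries a weight of 2, while an image with a single face contributes no edges. A weighted edge list of this kind is a common starting point for computing network measures.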
Given that our metadata enrichment has already included date associations for most of the images, we are confident that we will be able to reconstruct historically specific networks for a given year or range of years, and moreover, that the association between dates and named people will help us to identify further members of the community who are not currently named in the photographs because of the small groups involved in activities and clubs, as well as the generally limited student and teacher populations during any given year.

We are now far more sensitive to how the local concerns of this community shape our research methods and outcomes. The longer-term hope, one that it is not at all clear we will be allowed to pursue, would be to use natural language processing tools on the archive’s textual materials, particularly named entity recognition and word vectors, to search and match images where known names occur proximate to the names of unmatched faces. The present goal, however, remains to create a more replete and densely connected network of faces and the places they occupied when they were living in the gentle shadows of Pine Mountain. In order to abide by PMSS community wishes for privacy, we will be using anonymized aggregate results without identifying individuals in the photographs. While this method has the drawback of not being able to reveal the complexity of the historical relations at the granular level of individuals, it will allow us to report on the persistence or variation in network metrics, such as network density, centrality, path length, and betweenness measures, among others. In this way, we aim to be able to measure and report on the network and its changes over time without reporting on individuals. We arrived at an anonymizing method as a solution to the dissolved partnership by asking about the constraints of FERPA as well as by looking back at federal and commercial facial recognition practices.
In each case, the dark side of these technological tools remains one associated with surveillance, and in the language of Eastern Kentucky, extraction. We mention this not only to be transparent about our recognition of these limitations, but also in the hopes of opening a new dialogue with our partners that might stem from generating interesting discoveries without compromising their sense of the local ownership of their archival materials. Nonetheless, in order to report on the most interesting aspects, the actual people and their local histories of place, the work to be done would remain more at a human level than at a technical one.

Conclusion

In conclusion, our project describes a success that remains imbricated with a shortcoming in machine learning. The machine learning tasks and algorithms our project implemented serve a mimetic function in the distilled picture of the community they reflect. By matching historical faces to names, the project embraces a form of digital surrogacy: we have aimed to produce a meta-historical account of the present institution’s social and cultural function as a site of social networking and local knowledge transmission. As Robyn Caplan and danah boyd have recently suggested, the “bureaucratic functions” these algorithms promote can be understood by the ways in which they structure users’ behaviors (2018, 3). We would like to supplement Caplan and boyd’s insight regarding the potential coercions involved in how data structures implicitly shape their contents as well as their users’ behaviors. Not only do algorithms promote a kind of bureaucracy, to ends that may be positive and negative, and sometimes both at once, but further, those same structures may reflect or shape public behaviors and interactions beyond a single platform. As we move between digital and public spheres, our work similarly shifts its scope.
The research that we intended to have positive community effects was instead read by that very same set of people as an attempt to displace a community from the center of its own history. In other words, the bureaucratic functions embedded in PMSS as an institution saw our new approach to their storytelling as an unwanted and external intervention. As their response suggests, the internal and extant structures for governing their community, its stories, and the people who tell them, saw our contribution as an effort to co-opt their control. Where we thought we were offering new tools for capturing, discovering, and telling stories, they saw what Safiya Noble has recently characterized in a specifically racialized context as “algorithms of oppression” (2018). Here the oppression would be geographic, socio-economic, and cultural, rather than racial; nevertheless, the perception that one is being oppressed by systems set into place by agents working beyond one’s own community remains a shared foundation in Noble’s argument and in the unexpected reception of our project. As we move forward with our own project into unknown territories, in which our work-products may never see the light of day because of the value conflicts bound up in making archival objects public and accessible, we have found a real and lasting respect for the institutional dependencies and emplacements within which we all do our work. We hope to channel some of those functions of emplacement to create new forms of accountability and restraint that will allow us to move forward, but at least for now, we have found with our project one limitation of machine learning, and it is not the machine.

References

Ahmed, Manan, Maira E. Álvarez, Sylvia A. Fernández, Alex Gil, Rachel Hendery, Moacir P. de Sá Pereira, and Roopika Risam. 2018. “Torn Apart / Separados.” Group for Experimental Methods in Humanistic Research. https://xpmethod.plaintext.in/torn-apart/volume/2/.
Bailey, Ronald. 2017. “The Noble, Misguided Plan to Turn Coal Miners Into Coders.” Reason, November 25, 2017. https://reason.com/2017/11/25/the-noble-misguided-plan-to-tu/.
Calo, Ryan. 2017. “Artificial Intelligence Policy: A Primer and Roadmap.” University of California, Davis Law Review 51: 399-435.
Caplan, Robyn, and danah boyd. 2018. “Isomorphism through algorithm: Institutional dependencies in the case of Facebook.” Big Data & Society (January-June): 1-12. https://doi.org/10.1177/2053951718757253.
Cassidy, Frederic G., et al., eds. 1985-2012. Dictionary of American Regional English. Cambridge, MA: Belknap Press. https://www.daredictionary.com.
Ema, Arisa, et al. 2019. “Clarifying Privacy, Property, and Power: Case Study on Value Conflict Between Communities.” Proceedings of the IEEE 107, no. 3 (March): 575-80. https://doi.org/10.1109/JPROC.2018.2837045.
Harkins, Anthony, and Meredith McCarroll, eds. 2019. Appalachian Reckoning: A Region Responds to Hillbilly Elegy. Morgantown, WV: West Virginia University Press.
Hochschild, Arlie. 2018. “The Coders of Kentucky.” The New York Times, September 21, 2018. https://www.nytimes.com/2018/09/21/opinion/sunday/silicon-valley-tech.html.
Joh, Elizabeth. 2018. “Artificial Intelligence and Policing: First Questions.” Seattle University Law Review 41 (4): 1139-44.
Latour, Bruno. 2007. Reassembling the Social: An Introduction to Actor-Network-Theory. New York: Oxford University Press.
Levendowski, Amanda. 2018. “How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem.” Washington Law Review 93 (2): 579-630.
Mukurtu CMS. https://mukurtu.org/. Accessed December 12, 2019.
Noble, Safiya. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press.
Passamaquoddy People. “Passamaquoddy Traditional Knowledge Labels.” https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels. Accessed December 12, 2019.
Risam, Roopika. 2015. “Beyond the Margins: Intersectionality and the Digital Humanities.” DHQ: Digital Humanities Quarterly 9 (2). http://digitalhumanities.org/dhq/vol/9/2/000208/000208.html.
Robertson, Campbell. 2019. “They Were Promised Coding Jobs in Appalachia. Now They Say It Was a Fraud.” The New York Times, May 12, 2019. https://www.nytimes.com/2019/05/12/us/mined-minds-west-virginia-coding.html.
Sabharwal, Anil. 2016. “Moving on from Picasa.” Google Photos Blog. Last modified March 26, 2018. https://googlephotos.blogspot.com/2016/02/moving-on-from-picasa.html.
Sabharwal, Arjun. 2015. Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Boston: Chandos.
Stephan, Karl D., Katina Michael, M.G. Michael, Laura Jacob, and Emily P. Anesta. 2012. “Social Implications of Technology: The Past, the Present, and the Future.” Proceedings of the IEEE 100, Special Centennial Issue (May): 1752-1781. https://doi.org/10.1109/JPROC.2012.2189919.
United States Department of Justice. 2008. “Guidelines for a Memorandum of Understanding.” https://www.justice.gov/sites/default/files/ovw/legacy/2008/10/21/sample-mou.pdf.
United States Department of Justice. 2017. “Sample Memorandum of Understanding.” http://www.doj.state.or.us/wp-content/uploads/2017/08/mou_sample_guidelines.pdf.
Vance, J.D. 2016. Hillbilly Elegy: A Memoir of a Family and Culture in Crisis. New York: Harper.
Weizenbaum, Joseph. 1976. Computer Power and Human Reason: From Judgment to Calculation. New York: W.H. Freeman and Co.
Winfield, Alan F., Katina Michael, Jeremy Pitt, and Vanessa Evers. 2019. “Machine Ethics: the design and governance of ethical AI and autonomous systems.” Proceedings of the IEEE 107, no. 3 (March): 509-17. https://doi.org/10.1109/JPROC.2019.2900622.
hansen-can-2021 ----

Chapter 14

Can a Hammer Categorize Highly Technical Articles?

Samuel Hansen
University of Michigan

When everything looks like a nail...

I was sure I had the most brilliant research project idea for my course in Digital Scholarship techniques. I would use the Mathematical Subject Classification (MSC) values assigned to the publications in MathSciNet1 to create a temporal citation network which would allow me to visualize how new mathematical subfields were created and perhaps even predict them while they were still in their infancy. I thought it would be an easy enough project. I already knew how to analyze network data and the data I needed already existed; I just had to get my hands on it. I even sold a couple of my fellow coursemates on the idea and they agreed to work with me. Of course nothing is as easy as that, and numerous requests for data went without response. Even after I reached out to personal contacts at MathSciNet, we came to understand we would not be getting the MSC data the entire project relied upon. Not that we were going to let a little setback like not having the necessary data stop us.

After all, this was early 2018 and there had already been years of stories about how artificial intelligence, machine learning in particular, was going to revolutionize every aspect of our world (Kelly 2014; Clark 2015; Parloff 2016; Sangwani 2017; Tank 2017). All the coverage made it seem like AI was not only a tool with as many applications as a hammer, but that it also magically turned all problems into nails. While none of us were AI experts, we knew that machine learning was supposed to be good at classification and categorization. The promise seemed to be that if you had stacks of data, a machine learning algorithm could dive in, find the needles, and arrange them into neatly divided piles of similar sharpness and length. Not only that, but there were pre-built tools that made it so almost anyone could do it.
For a group of people whose project was on life support because we could not get the categorization data we needed, machine learning began to look like our only potential savior. So, machine learning is what we used.

1 See https://mathscinet.ams.org/.

I will not go too deep into the actual process, but I will give a brief outline of the techniques we employed. Machine-learning-based categorization needs data to classify, which in our case were mathematics publications. While this can be done with titles and abstracts, we wanted to provide the machine with as much data as we could, so we decided to work with full-text articles. Since we were at the University of Wisconsin at the time, we were able to connect with the team behind GeoDeepDive,2 who have agreements with many publishers to provide the full text of articles for text and data mining research (“GeoDeepDive: Project Overview” n.d.). GeoDeepDive provided us with the full text of 22,397 mathematics articles, which we used as our corpus. In order to classify these articles, which were already pre-processed by GeoDeepDive with CoreNLP,3 we first used the Python package Gensim4 to process the articles into a Python-friendly format and to remove stopwords. Then we randomly sampled one-third of the corpus to create a topic model using the MALLET5 topic modeling tool. Finally, we applied the model to the remaining articles in our corpus. We then coded the words within the generated topics to subfields within mathematics and used those codes to assign articles a subfield category. In order to make sure our results were not just a one-off, we repeated this process multiple times and checked for variance in the results. There was none: the results were uniformly poor. That might not be entirely fair.
There were interesting aspects to the results of the topic modeling, but when it came to categorization they were useless. Of the subfield codes assigned to articles, only two were ever the dominant result for any given article: Graph Theory and Undefined. Even that does not tell the whole story, as Undefined was the runaway winner in the article classification race, with more than 70% of articles classified as Undefined in each run, including one run in which it hit 95%. The topics generated by MALLET were often plagued by gibberish caused by equations in the mathematics articles, and there was at least one topic in each run that was filled with the names of months and locations. Add to this that the technical language of mathematics is filled with words that have non-technical definitions (for example, map or space), or words which have their own subfield-specific meanings (such as homomorphism or degree), both of which frustrate attempts to code a subfield. These issues help make it clear why so many articles ended up as “Undefined.”

Even for the one subfield which had a unique enough vocabulary for our topic model to partially be able to identify, Graph Theory, the results were marginally positive at best. We were able to obtain Mathematical Subject Classification (MSC) values for around 10% of our corpus. When we compared the articles we categorized as Graph Theory to the articles which had been assigned the MSC value for Graph Theory (05Cxx), we found we had a textbook recall-versus-precision problem. We could either correctly categorize nearly all of the Graph Theory articles with a very high rate of false positives (high recall and low precision) or we could almost never incorrectly categorize an article as Graph Theory, but miss over 30% that we should have categorized as Graph Theory (high precision and low recall). Needless to say, we were not able to create the temporal subfield network I had imagined.
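The recall-versus-precision trade-off described above can be made concrete with a small worked example. This is a hedged sketch, not the project's code: the article identifiers and both prediction sets are invented for illustration, and `precision_recall` is a hypothetical helper. Precision is the share of articles labeled Graph Theory that truly belong there; recall is the share of true Graph Theory articles that were found.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted vs. actual members."""
    true_positives = len(predicted & actual)
    # Guard against empty sets so the helper never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Invented identifiers: articles that carry the Graph Theory MSC value.
actual_graph_theory = {"a1", "a2", "a3", "a4"}

# A permissive labeling finds every true article but adds false positives.
broad_guess = {"a1", "a2", "a3", "a4", "x1", "x2", "x3", "x4"}

# A cautious labeling makes no mistakes but misses half of the true articles.
strict_guess = {"a1", "a2"}

broad = precision_recall(broad_guess, actual_graph_theory)
strict = precision_recall(strict_guess, actual_graph_theory)
print("broad:", broad)
print("strict:", strict)
```

The permissive labeling trades precision for recall and the cautious one does the reverse, which is the same pattern the Graph Theory comparison exhibited.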
While we could reasonably claim that we learned very interesting things about the language of mathematics and its subfields, we could not claim we even came close to automatically categorizing mathematics articles. When we had to report back on our work at the end of the course, our main result was that basic, off-the-shelf topic modelling does not work well when it comes to highly technical articles from subjects like mathematics. It was also a welcome lesson in not believing the hype of machine learning, even when a problem looks exactly like the kind machine learning was supposed to excel at solving. While we had a hammer and our problem looked like a nail, it seemed that the former was a ball peen and the latter a railroad tie. In the end, even in the land of hammers and nails, the tool has to match the task.

2 See https://geodeepdive.org/.
3 See https://stanfordnlp.github.io/CoreNLP/.
4 See https://radimrehurek.com/gensim/.
5 See http://mallet.cs.umass.edu/topics.php.

Though we failed to accomplish automated categorization of mathematics, we were dilettantes in the world of machine learning. I believe our project is a good example of how machine learning is still a long way from being the magic tool that some, though not all (Rahimi and Recht 2017), have portrayed it to be. Let us look at what happens when smarter and more capable minds tackle the problem of classifying mathematics and other highly technical subjects using advanced machine learning techniques.

Finding the Right Hammer

To illustrate the quest to find the right hammer I am going to focus on three different projects that tackled the automated categorization of highly technical content, two of which also attempted to categorize mathematical content and one that looked to categorize scholarly works in general.
These three projects provide examples of many of the approaches and practices employed by experts in automated classification and demonstrate the two main paths that these types of projects follow to accomplish their goals. Since we have been discussing mathematics, let us start with those two projects.

Both projects began because the participants were struggling to categorize mathematics publications so they would be properly indexed and searchable in digital mathematics databases: the Czech Digital Mathematics Library (DML-CZ)6 and NUMDAM7 in the case of Radim Řehůřek and Petr Sojka (Řehůřek and Sojka 2008), and Zentralblatt MATH (zbMath)8 in the case of Simon Barthel, Sascha Tönnies, and Wolf-Tilo Balke (Barthel, Tönnies, and Balke 2013). All of these databases rely on the aforementioned MSC9 to aid in indexing and retrieval, and so their goal was to automate the assignment of MSC values to lower the time and labor cost of requiring humans to do this task. The main differences between their tasks related to the number of documents they were working with (thousands for Řehůřek and Sojka and millions for Barthel, Tönnies, and Balke), the amount of the works available (full text for Řehůřek and Sojka, and titles, authors, and abstracts for Barthel, Tönnies, and Balke), and the quality of the data (mostly OCR scans for Řehůřek and Sojka and mostly TeX for Barthel, Tönnies, and Balke). Even with these differences, both projects took a similar approach, and it is the first of the two main pathways toward classification I spoke of earlier: using a predetermined taxonomy and a set of pre-categorized data to build a machine learning categorizer.

In the end, while both projects determined that the use of Support Vector Machines (Gandhi 2018)10 provided the best categorization results, their implementations were different. The Řehůřek and Sojka SVMs were trained with terms weighted using augmented term frequency11 and dynamic decision threshold12 selection using s-cut13 (Řehůřek and Sojka 2008, 549), and Barthel, Tönnies, and Balke’s with term weighting using term frequency–inverse document frequency14 and Euclidean normalization15 (Barthel, Tönnies, and Balke 2013, 88), but the main difference was how they handled formulae. In particular, the Barthel, Tönnies, and Balke group split their corpus into words and formulae and mapped them to separate vectors, which were then merged together into a combined vector used for categorization. Řehůřek and Sojka did not differentiate between words and formulae in their corpus, and they did note that their OCR scans’ poor handling of formulae could have hindered their results (Řehůřek and Sojka 2008, 555). In the end, not having the ability to handle formulae separately did not seem to matter, as Řehůřek and Sojka claimed microaveraged F1 scores of 89.03% (Řehůřek and Sojka 2008, 549) when classifying the top level MSC category with their best performing SVM. When this is compared to the microaveraged F1 of 67.3% obtained by Barthel, Tönnies, and Balke (Barthel, Tönnies, and Balke 2013, 88), it would seem that either Řehůřek’s and Sojka’s implementation of SVMs or their access to full text led to a clear advantage. This advantage becomes less clear when one takes into account that Řehůřek and Sojka were only working with top level MSCs where they had at least 30 (60 in the case of their best result) articles, and their limited corpus meant that many top-level MSC categories would not have been included. Looking at the work done by Barthel, Tönnies, and Balke makes it clear that these less common MSC categories such as K-Theory or Potential Theory, for which Barthel, Tönnies, and Balke achieved microaveraged F1 measures of 18.2% and 24% respectively, have a large impact on the overall effectiveness of the automated categorization. Remember, this is only for the top level of MSC codes, and the work of Barthel, Tönnies, and Balke suggests it would get worse when trying to apply the second and third level for full MSC categorization to these less-common categories. This leads me to believe that in the case of categorizing highly technical mathematical works to an existing taxonomy, people have come close to identifying the overall size of the machine learning hammer, but are still a long way away from finding the right match for the categorization nail.

Now let us shift from mathematics-specific categorization to subject categorization in general and look at the work Microsoft has done assigning Fields of Study (FoS) in the Microsoft Academic Graph (MAG), which is used to create their Microsoft Academic article search product.16 While the MAG FoS project is also attempting to categorize articles for proper indexing and search, it represents the second path which is taken by automated categorization projects: using machine learning techniques to both create the taxonomy and to classify.

6 See https://dml.cz/.
7 See http://www.numdam.org/.
8 See https://zbmath.org/.
9 Mathematical Subject Classification (MSC) values in MathSciNet and zbMath are a particularly interesting categorization set to work with as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or closely related, subfield as the article’s content before they are published. This multi-step process of review yields a built-in accuracy check for the categorization.
10 Support Vector Machines (SVMs) are machine learning models which are trained using a pre-classified corpus to split a vector space into a set of differentiated areas (or categories) and then attempt to classify new items by where in the vector space the trained model places them. For a more in-depth, technical explanation, see: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.
11 Augmented term frequency refers to the number of times a term occurs in the document divided by the number of times the most frequently occurring term appears in the document.
12 The decision threshold is the cut-off for how close to a category the SVM must determine an item to be in order for it to be assigned that category. Řehůřek and Sojka’s work varied this threshold dynamically.
13 Score-based local optimization, or s-cut, allows a machine-learning model to set different thresholds for each category with an emphasis on local, or category, instead of global performance.
14 Term frequency–inverse document frequency provides a weight for a term depending on how frequently it occurs across the corpus. A term which occurs rarely across the corpus but with a high frequency within a single document will have a higher weight when classifying the document in question.
15 A Euclidean norm provides the distance from the origin to a point in an n-dimensional space. It is calculated by taking the square root of the sum of the squares of all coordinate values.
16 See https://academic.microsoft.com/.

Microsoft took a unique approach in the development of their taxonomy. Instead of relying on the corpus of articles in the MAG to develop it, they relied primarily on Wikipedia for its creation.
They generated an initial seed by referencing the Science Metrix classification scheme17 and a couple thousand FoS Wikipedia articles they identified internally. They then used an iterative process to identify more FoS in Wikipedia based on whether they were linked to Wikipedia articles that were already identified as FoS and whether the new articles represented valid entity types; for example, an entity type of protein would be added and an entity type of person would be excluded (Shen, Ma, and Wang 2018, 3). This work allowed Microsoft to develop a list of more than 200,000 Fields of Study for use as categories in the MAG.

Microsoft then used machine learning techniques to apply these FoS to their corpus of over 140 million academic articles. The specific techniques are not as clear as they were with the previous examples, likely due to Microsoft protecting their specific methods from competitors, but the article published to the arXiv by their researchers (Shen, Ma, and Wang 2018) and the write-up on the MAG website do make it clear that they used vector-based convolutional neural networks which relied on Skip-gram (Mikolov et al. 2013) embeddings and bag-of-words/entities features to create their vectors (“Microsoft Academic Increases Power of Semantic Search by Adding More Fields of Study” 2018). One really interesting part of the machine learning method used by Microsoft was that it did not rely only on information from the article being categorized. It also utilized the information in the MAG about the citations to, and references from, the article, and used the FoS assigned to those citations and references to influence the FoS of the original article.

The identification of potential FoS and their assignment to articles was only a part of Microsoft’s purpose.
In order to fully index the MAG and make it searchable they also wished to determine the relationships between the FoS; in other words, they wanted to build a hierarchical taxonomy. To achieve this they used the article categorizations and defined a Field of Study A as the parent of B if the articles categorized as B were close to a subset of the articles categorized as A (a more formal definition can be found in Shen, Ma, and Wang 2018, 4). This work, which created a six-level hierarchy, was mostly automated, but Microsoft did inspect and manually adjust the relationships between FoS on the highest two levels.

To evaluate the quality of their FoS taxonomy and categorization work, Microsoft randomly sampled data at each of the three steps of the project and used human judges to assess their accuracy. The accuracy assessments of the three steps were not as complete as they would be with the mathematics categorization, as that approach would evaluate terms across the whole of their data sets, but the projects are of very different scales so different methods are appropriate. In the end Microsoft estimates the accuracy of the FoS at 94.75%, the article categorization at 81.2%, and the hierarchy at 78% (Shen, Ma, and Wang 2018, 5). Since MSC was created by humans there is no meaningful way to compare the FoS accuracy measurements, but the categorization accuracy falls somewhere between that of the two mathematics projects. This is a very impressive result, especially when the aforementioned scale is taken into account. Instead of trying to replace the work of humans categorizing mathematics articles indexed in a database, which for 2018 was 120,324 items in MathSciNet18 and 97,819 in zbMath,19 the FoS project is trying to replace the human categorization of all items indexed in MAG, which was 10,616,601 in 2018.20

17 See http://science-metrix.com/?q=en/classification.
18 See https://mathscinet.ams.org/mathscinet/search/publications.html?dr=pubyear&yrop=eq&arg3=2018.
19 See https://zbmath.org/?q=py%3A2018.
20 See https://academic.microsoft.com/publications/33923547.

Both zbMath and MathSciNet were capable of providing the human labor to do the work of assigning MSC values to the mathematics articles they indexed in 2018.21 Therefore using an automated categorization, which at best could only get the top level right with 90% accuracy, was not the right approach. On the other hand, it seems clear that no one could feasibly provide the human labor to categorize all articles indexed by MAG in 2018, so an 80% accurate categorization is a significant accomplishment. To go back to the nail and hammer analogy, Microsoft may have used a sledgehammer but they were hammering a rather giant nail.

Are You Sure it’s a Nail?

I started this chapter talking about how we have all been told that AI and machine learning were going to revolutionize everything in the world. That they were the hammers and all the world’s problems were nails. I found that this was not the case when we tried to employ it, in an admittedly rather naive fashion, to automatically categorize mathematical articles. From the other examples I included, it is also clear computational experts find the automatic categorization of highly technical content a hard problem to tackle, one where success is very much dependent on what it is being measured against. In the case of classifying mathematics, machine learning can do a decent job but not enough to compete with humans.
In the case of classifying everything, scale gives machines an edge, as long as you have the computational power and knowledge wielded by a company like Microsoft.

This collection is about the intersection of AI, machine learning, deep learning, and libraries. While there are definitely problems in libraries where these techniques will be the answer, I think it is important to pause and consider if artificial intelligence techniques are the best approach before trying to use them. Libraries, even those like the one I work in, which are lucky enough to boast of incredibly talented IT departments, do not tend to have access to a large amount of unused computational power or numerous experts in bleeding-edge AI. They are also rather notoriously limited budget-wise and would likely have to decide between existing budget items and developing an in-house machine learning program. Those realities, combined with the legitimate questions which can be raised about the efficacy of machine learning and AI with respect to the types of problems a library may encounter, such as categorizing the contents of highly technical articles, make me worry.

While there will be many cases where using AI makes sense, I want to be sure libraries are asking themselves a lot of questions before starting to use it. Questions like: is this problem large enough in scale to substitute machines for human labor given that machines will likely be less accurate? Or: will using machines to solve this problem cost us more in equipment and highly technical staff than our current solution, and have we factored in the people and services a library may need to cut to afford them? Or: does the data we have to train a machine contain bias, and will it therefore produce a biased model which will only serve to perpetuate existing inequities and systemic oppression? Not to mention: is this really a problem or are we just looking for a way to employ machine learning to say that we did?
In the cases where the answers to these questions are yes, it will make sense for libraries to employ machine learning. I just want libraries to look really carefully at how they approach problems and solutions, to make sure that their problem is, in fact, a nail, and then to look even closer and make sure it is the type of nail a machine-learning hammer can hit.

21 When an article is indexed by MathSciNet it receives initial MSC values from a subject area editor who then passes the article along to an external expert reviewer who suggests new MSC values, completes partial values, and provides potential corrections to the MSC values assigned by the editors (“Mathematical Reviews Guide For Reviewers” 2020) and then the subject area editors will make the final determination in order to make sure internal styles are followed. zbMath follows a similar procedure.

References

Barthel, Simon, Sascha Tönnies, and Wolf-Tilo Balke. 2013. “Large-Scale Experiments for Mathematical Document Classification.” In Digital Libraries: Social Media and Community Networks, edited by Shalini R. Urs, Jin-Cheon Na, and George Buchanan, 83–92. Cham: Springer International Publishing.

Clark, Jack. 2015. “Why 2015 Was a Breakthrough Year in Artificial Intelligence.” Bloomberg, December 8, 2015. https://www.bloomberg.com/news/articles/2015-12-08/why-2015-was-a-breakthrough-year-in-artificial-intelligence.

Gandhi, Rohith. 2018. “Support Vector Machine—Introduction to Machine Learning Algorithms.” Medium. July 5, 2018. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.

“GeoDeepDive: Project Overview.” n.d. Accessed May 7, 2018. https://geodeepdive.org/about.html.

Kelly, Kevin. 2014. “The Three Breakthroughs That Have Finally Unleashed AI on the World.” Wired, October 27, 2014. https://www.wired.com/2014/10/future-of-artificial-intelligence/.

“Mathematical Reviews Guide For Reviewers.” 2015. American Mathematical Society. February 2015. https://mathscinet.ams.org/mresubs/guide-reviewers.html.

“Microsoft Academic Increases Power of Semantic Search by Adding More Fields of Study.” 2018. Microsoft Academic (blog). February 15, 2018. https://www.microsoft.com/en-us/research/project/academic/articles/microsoft-academic-increases-power-semantic-search-adding-fields-study/.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–3119. Curran Associates, Inc. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

Parloff, Roger. 2016. “From 2016: Why Deep Learning Is Suddenly Changing Your Life.” Fortune. September 28, 2016. https://fortune.com/longform/ai-artificial-intelligence-deep-machine-learning/.

Rahimi, Ali, and Benjamin Recht. 2017. “Back When We Were Kids.” Presentation at the NIPS 2017 Conference. https://www.youtube.com/watch?v=Qi1Yry33TQE.

Řehůřek, Radim, and Petr Sojka. 2008. “Automated Classification and Categorization of Mathematical Knowledge.” In Intelligent Computer Mathematics, edited by Serge Autexier, John Campbell, Julio Rubio, Volker Sorge, Masakazu Suzuki, and Freek Wiedijk, 543–57. Berlin: Springer Verlag.

Sangwani, Gaurav. 2017. “2017 Is the Year of Machine Learning. Here’s Why.” Business Insider, January 13, 2017. https://www.businessinsider.in/2017-is-the-year-of-machine-learning-heres-why/articleshow/56514535.cms.

Shen, Zhihong, Hao Ma, and Kuansan Wang. 2018. “A Web-Scale System for Scientific Knowledge Exploration.” Paper presented at the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, July 2018. http://arxiv.org/abs/1805.12216.

Tank, Aytekin. 2017. “This Is the Year of the Machine Learning Revolution.” Entrepreneur, January 12, 2017. https://www.entrepreneur.com/article/287324.
harper-generative-2021 ---- Chapter 2 Generative Machine Learning

Charlie Harper, PhD
Case Western Reserve University

Introduction

Generative machine learning is a hot topic. With the 2020 election approaching, Facebook and Reddit have each issued their own bans on the category of machine-generated or -altered content that is commonly termed “deep fakes” (Cohen 2020; Romm, Harwell, and Stanley-Becker 2020). Calls for regulation of the broader, and very nebulous, category of fake news are now part of US political debates, too. Although well known and often discussed in newspapers and on TV because of their dystopian implications, deep fakes are just one application of generative machine learning. There is a remarkable need for others, especially humanists and social scientists, to become involved in discussions about the future uses of this technology, but this first requires a broader awareness of generative machine learning’s functioning and power. Many articles on the subject of generative machine learning exist in specialized, highly technical literature, but there is little that covers this topic for a broader audience while retaining important high-level information on how the technology actually operates.

This chapter presents an overview of generative machine learning with particular focus on generative adversarial networks (GANs). GANs are largely responsible for the revolution in machine-generated content that has occurred in the past few years, and their impact on our future extends well beyond that of producing purposefully deceptive fakes. After covering generative learning and the working of GANs, this chapter touches on some interesting and significant applications of GANs that are not likely to be familiar to the reader. The hope is that this will serve as the start of a larger discussion on generative learning outside of the confines of technical literature and sensational news stories.
Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 2

Figure 2.1: The three most-common letters following “F” in two Markov chains trained on an English and Italian dictionary. Three examples of generated words are given for each Markov chain that show how the Markov chain captures high-level information about letter arrangements in the different languages.

What is Generative Machine Learning?

Machine learning, which is a subdomain of Artificial Intelligence, is roughly divided into three paradigms that rely on different methods of learning: supervised, unsupervised, and reinforcement learning (Murphy 2012, 1–15; Burkov 2019, 1–8). These differ in the types of datasets used for learning and the desired applications. Supervised and unsupervised machine learning use labeled and unlabeled datasets, respectively, to assign unseen data to human-generated labels or statistically-constructed groups. Both supervised and unsupervised approaches are commonly used for classification and regression problems, where we wish to predict categorical or quantitative information about new data. A combined form of these two paradigms, called semi-supervised learning, that mixes labeled and unlabeled data also exists. Reinforcement learning, on the other hand, is a paradigm in which an agent learns how to function in a specific environment by being rewarded or penalized for its behavior. For example, reinforcement learning can be used to train a robot to successfully navigate around obstacles in a physical space.

Generative machine learning, rather than being a specific learning paradigm, encompasses an ever-growing variety of techniques that are capable of generating new data based on learned patterns. The process of learning these patterns can engage both supervised and unsupervised learning. A simple, statistical example of one type of generative learning is a Markov chain.
From a given set of data, a Markov chain calculates and stores the probabilities of a following state based on a current state. For example, a Markov chain can be trained on a list of English words to store the probabilities of any one letter occurring after another letter. These probabilities chain together to represent the chance of moving from the current letter state (e.g. the letter q) to a succeeding letter state (e.g. the letter u) based on the data from which it has learned. If another Markov chain were trained on Italian words instead of English, the probabilities would change, and for this reason, Markov chains can capture important high-level information about datasets (Figure 2.1). They can then be sampled to generate new data by starting from a random state and probabilistically moving to succeeding states.

Figure 2.2: Images generated with a simple statistical model appear as noise as the model is insufficient to capture the structure of the real data (Markov chains trained using wine bottles and circles from Google’s QuickDraw dataset).

In Figure 2.1, you can see the probability that the letter “F” transitions to the three most common succeeding letters in English and Italian. A few examples of “words” generated by two Markov chains trained on an English and Italian dictionary are also given. The example words are generated by sampling the probability distributions of the Markov chain, letter by letter, so that the generated words are statistically random, but guided by the learned probability of one letter following another. The different probabilities of letter combinations in English and Italian result in distinctly different generated words. This exemplifies how a generative model can capture specific aspects of a dataset to create new data.
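The letter-level Markov chain described above is small enough to sketch in a few lines. What follows is only an illustration under stated assumptions, not the code behind Figure 2.1: the seven-word training list is invented, the `^` and `$` start and end markers are a convenience of this sketch, and the function names (`train`, `transition_probs`, `generate`) are hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed markers for the beginning and end of a word


def train(words):
    """Count, for each letter, how often every other letter follows it."""
    counts = defaultdict(Counter)
    for word in words:
        chars = [START] + list(word.lower()) + [END]
        for current, following in zip(chars, chars[1:]):
            counts[current][following] += 1
    return counts


def transition_probs(counts, state):
    """Turn the raw follower counts of one state into probabilities."""
    total = sum(counts[state].values())
    return {nxt: n / total for nxt, n in counts[state].items()}


def generate(counts, rng, max_len=12):
    """Sample a new 'word' letter by letter from the learned transitions."""
    state, letters = START, []
    while len(letters) < max_len:
        followers = counts[state]
        state = rng.choices(list(followers), weights=list(followers.values()))[0]
        if state == END:
            break
        letters.append(state)
    return "".join(letters)


# Toy training data (hypothetical); a real run would use a full dictionary.
english = train(["quick", "queen", "quiet", "fish", "fresh", "flash", "fair"])
probs_after_q = transition_probs(english, "q")
probs_after_f = transition_probs(english, "f")
word = generate(english, random.Random(0))

print("after q:", probs_after_q)
print("after f:", probs_after_f)
print("generated:", word)
```

Training the same functions on an Italian word list would yield a different table of transition probabilities and therefore differently flavored output, which is the point the figure makes.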
The letter combinations are nonsense, but they still reflect the high-level structure of Italian and English words in the way letters join together, such as the different utilization of vowels in each language. These basic Markov chains demonstrate the essence of generative learning: a generative approach learns a distribution over a dataset, or in other words, a mathematical representation of a dataset, which can then be sampled to generate new data that exists within the learned structure of that dataset. How convincing the generated data appears to a human observer depends on the type and tuning of the machine learning model chosen and the data upon which the model has been trained.

So, what happens if we build a comparable Markov chain with image data1 instead of words, and then sample, pixel by pixel, from it to generate new images? The results are just noise, and the generated images reveal no hint of a wine bottle or circle to the human eye (Figure 2.2). The very simple generative statistical model we have chosen to use is incapable of capturing the distribution of the underlying images sufficiently to produce realistic new images. Other types of generative statistical models, like Naive Bayes or a higher-order Markov chain,2 could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this.3 Image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in English and Italian. Capturing the intricate and often-inscrutable distributions that underlie real-world media, like full-sized photographs of people, is where deep (i.e. using neural networks) generative learning shines and where generative adversarial networks have revolutionized machine-generated content.

1 In many examples, I have used the Google QuickDraw Dataset to highlight features of generative machine learning. The dataset is freely available (https://github.com/googlecreativelab/quickdraw-dataset) and licensed under CC BY 4.0.
2 The order of a Markov chain reflects how many preceding states are taken into account. For example, a 2nd order Markov chain would look at the preceding two letters to calculate the probability of a succeeding letter. Rudimentary autocomplete is a good example of Markov chains in application.

Generative Adversarial Networks

The problem of capturing the complexity of an image so that a computer can generate new images leads directly to the emergence of Generative Adversarial Networks, which are a neural-network-based model architecture within the broader sphere of generative machine learning. Although prior deep learning approaches to generating data, particularly variational autoencoders, already existed, it was a breakthrough in 2014 that changed the fabric and power of generative machine learning. Like every big development, it has an origin story that has moved into legend with its many retellings.

According to the handed-down tale (Giles 2018), in 2014 doctoral student Ian Goodfellow was at a bar with friends when the topic of generating photos arose. His friends were working out a method to create realistic images by using complex statistical analyses of existing images. Goodfellow countered that it would not work; there were too many variables at play within such data. Instead, he put forth the idea of pairing two neural networks against each other in a type of zero-sum game where the goal was to generate believable fake images.
According to the story, he developed this idea into working code that night, and his paired neural network architecture produced results the very first time. This was the birth of Generative Adversarial Networks, or GANs. Goodfellow’s work was quickly disseminated in what is one of the most influential papers in the recent history of machine learning (Goodfellow et al. 2014).

GANs have progressed in almost miraculous ways since 2014, but the crux of their architecture remains the coupling of two neural networks. Each neural network has a specific function in the pairing. The first network, called the generator, is tasked with generating fake examples of some dataset. To produce this data, it randomly samples from an n-dimensional latent space often labeled Z. In simple terms, the generator takes random noise (really a random list of n numbers, where n is the dimensionality of the latent space) as its input and outputs its attempt at a fake piece of data, such as an image, clip of audio, or row of tabular information. The second neural network, called the discriminator, takes both fake and real data as input. Its role is to correctly discriminate between fake and real examples [4]. The generator and discriminator networks are then coupled together as adversaries, hence “adversarial” in the name. The output from the generator flows into the discriminator, and information on the success or failure of the discriminator to identify fakes (i.e. the discriminator’s loss) flows back through the network so that the generator and discriminator each know how well they are performing compared to the other. All of this happens automatically, without any need for human supervision. When the generator finds it is doing poorly, it learns to produce better examples by updating its weights and biases through traditional backpropagation (see especially Langr and Bok 2019, 3–16 for a more detailed summary of this).
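The coupling described above can be sketched in a few lines of plain Python. This is a deliberately tiny illustration, not the architecture from Goodfellow et al. (2014): both "networks" are single linear units, the data and the latent space are one-dimensional, and the class names, parameter names, learning rate, and the choice of the non-saturating generator loss (minimising -log D(G(z))) are all assumptions made for this example.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


class Generator:
    """Hypothetical one-unit generator: x_fake = a * z + b."""

    def __init__(self) -> None:
        self.a, self.b = 1.0, 0.0

    def sample(self, z: float) -> float:
        return self.a * z + self.b


class Discriminator:
    """Hypothetical one-unit binary classifier: P(x is real)."""

    def __init__(self) -> None:
        self.w, self.c = 0.1, 0.0

    def prob_real(self, x: float) -> float:
        return sigmoid(self.w * x + self.c)


def train_step(gen, disc, real_x, rng, lr=0.05):
    """One adversarial update using a single real example."""
    z = rng.gauss(0.0, 1.0)          # random noise drawn from Z (n = 1)
    fake_x = gen.sample(z)

    # Discriminator update: minimise -[log D(real) + log(1 - D(fake))].
    d_real = disc.prob_real(real_x)
    d_fake = disc.prob_real(fake_x)
    grad_w = -(1.0 - d_real) * real_x + d_fake * fake_x
    grad_c = -(1.0 - d_real) + d_fake
    disc.w -= lr * grad_w
    disc.c -= lr * grad_c

    # Generator update: minimise -log D(fake); the gradient flows back
    # through the discriminator to the generator's parameters.
    d_fake = disc.prob_real(fake_x)
    grad_x = -(1.0 - d_fake) * disc.w    # d(-log D(x)) / dx
    gen.a -= lr * grad_x * z
    gen.b -= lr * grad_x
    return d_real, d_fake


# Demo: "real" data are draws from a Gaussian centred on 4 (an assumption).
rng = random.Random(0)
gen, disc = Generator(), Discriminator()
for _ in range(500):
    train_step(gen, disc, rng.gauss(4.0, 0.5), rng)
print("generator parameters after training:", gen.a, gen.b)
```

Note that no label or human judgment appears anywhere in the loop: the only signal the generator receives is the discriminator's current opinion of its output, which is the sense in which training proceeds without supervision. A toy model this small is not expected to converge reliably; it only shows the direction in which information flows.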
As backpropagation updates the generator network’s weights and biases, the generator inherently begins to map regions of the randomly sampled Z space to characteristics found in the real dataset. Conversely, as the discriminator finds that it is failing to identify the improved fakes, it learns to separate these out in new ways.

[3] This is not to imply that these models do not have immense practical applications in other areas of machine learning.
[4] Its function is exactly that of any other binary classifier found in machine learning.

Figure 2.3: At the heart of a GAN are two neural networks, the generator and the discriminator. As the generator learns to produce fake data, the discriminator learns to separate it out. The pairing of the two in an adversarial structure forces each to improve at its given task.

Figure 2.4: A GAN being trained on wine bottle sketches from Google’s QuickDraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. Moving from left to right, the generator begins by outputting random noise and progressively generates better sketches as it tries to trick the discriminator.

At first, the generator outputs random data and the discriminator easily catches these fakes (Figure 2.4). As the results of the discriminator feed back into the generator, however, the generator learns to trick its foe by creating more convincing fakes. The discriminator in turn learns to better separate out these more convincing fakes. Turn after turn, the two networks drive one another to become better at their specialized tasks, and the generated data becomes increasingly like the real data [5]. At the end of training, ideally, it will not be possible to distinguish between real and fake (Figure 2.5).
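For readers who want the formal version of this contest, the zero-sum game can be written as the value function given in the original publication (Goodfellow et al. 2014). The symbols below (G for the generator, D for the discriminator, p_data for the distribution of real data, p_z for the distribution over the latent space Z, and p_g for the distribution of generated data) follow that paper's notation and are not otherwise used in this chapter.

```latex
\begin{align}
\min_{G}\,\max_{D}\; V(D, G)
  &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
   + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr] \\
D^{*}_{G}(x)
  &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)},
  \qquad
  p_{g} = p_{\text{data}} \;\Longrightarrow\; D^{*}_{G}(x) = \tfrac{1}{2}
\end{align}
```

The second line is the formal counterpart of the ideal end state: when the generated distribution matches the real one, the best possible discriminator can do no better than assign every example a probability of one half.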
In the original publication, the first GANs were trained on sets of small images, like the Toronto Face Dataset, which contains 32 × 32 pixel grayscale photos of faces and facial expressions (Goodfellow et al. 2014). Although the generator’s results were convincing when compared to the originals, the fake images were still small, colorless, and pixelated. Since then, an explosion of research into GANs and increased computational power have led to strikingly realistic images. The most recent milestone was reached in 2019 by researchers with NVIDIA, who built a GAN that generates high-quality photo-realistic images of people (Karras, Laine, and Aila 2019). When contrasted with the results of 2014 (Figure 2.6), the stunning progression of GANs is self-evident, and it is difficult to believe that the person on the right does not exist.

[5] See https://poloclub.github.io/ganlab/ (accessed Jan 17, 2020) (Kahng et al. 2019).

Figure 2.5: The fully trained generator from Figure 2.4 produces examples that are not readily distinguishable from real-world data. The top row of sketches was produced by the GAN and the bottom row was drawn by humans.

Some Applications of Generative Adversarial Networks

Over the past five years, many papers on implementations of GANs have been released by researchers (Alqahtani, Kavakli-Thorne, and Kumar 2019; Wang, She, and Ward 2019). The list of applications is extensive and ever growing, but it is worth pointing out some of the major examples as of 2019 and why they are significant. These examples highlight the vast power of GANs and underscore the importance of understanding and carefully scrutinizing this type of machine learning.
Data Augmentation

One major problem in machine learning has always been the lack of labeled datasets, which are required by supervised learning approaches. Labeling data is time consuming and expensive. Without good labeled data, trained models are limited in their power to learn and in their ability to generalize to real-world problems. Services such as Amazon’s Mechanical Turk have attempted to crowdsource the tedious process of manually assigning labels to data, but labeling has remained a bottleneck in machine learning. GANs are helping to alleviate this bottleneck by generating new labeled data that is indistinguishable from the real data. This process can grow a small labeled dataset into one that is larger and more useful for training purposes. In the area of medical imaging and diagnostics this may have profound effects (Yi, Walia, and Babyn 2019). For example, GANs can produce photorealistic images of skin lesions that expert dermatologists are able to separate from real images only slightly over 50% of the time (Baur, Albarqouni, and Navab 2018), and they can synthesize high-resolution mammograms for training better cancer detection algorithms (Korkinof et al. 2018).

Figure 2.6: An image of a generated face from the original GAN publication (left) and the 2019 milestone (right) shows how the ability of GANs to produce photo-realistic images has evolved since 2014.

A corollary effect of these developments in medical imaging is the potential to publicly release large medical datasets and thereby expand researchers’ access to important data. Whereas the dissemination of traditional medical images is constrained by strict health privacy laws, generated images may not be governed by such rules. I qualify this statement with “may” because any restrictions or ethical guidelines for the use of medical data that is generated from real patient data require extensive discussion and legal reviews that have not yet happened.
Under certain conditions, it may also be possible to infer original data from a GAN (Mukherjee et al. 2019). How institutional review boards, professional medical organizations, and courts weigh in on this topic will be seen in the coming years.

In addition to generating entirely new data, a GAN can augment datasets by expanding their coverage to new domains. For example, autonomous vehicles must cope with an array of road and weather conditions that are unpredictable. Training a model to identify pedestrians, street signs, road lines, and so on with images taken on a sunny day will not translate well to variable real-world conditions. Using one dataset, in a process known as style transfer, GANs can translate one image to other domains (Figure 2.7). This can include creating night road scenes from day scenes (Romera et al. 2019) and producing images of street signs under varying lighting conditions (Chowdhury et al. 2019). This added data permits models to account for greater variability under operating conditions without the high cost of photographing all possible conditions and manually labeling them. Beyond medicine and autonomous vehicles, generative data augmentation will progressively impact other imaging-heavy fields (Shorten and Khoshgoftaar 2019) like remote sensing (L. Ma et al. 2019; D. Ma, Tang, and Zhao 2019).

Figure 2.7: The images on the left are originals and the images on the right have been modified by a GAN with the ability to translate images between the domains of “dirty lens” and “clean lens” on a vehicle (from Uřičář et al. 2019, fig. 11).

Creativity and Design

The question of whether machines can possess creativity or artistic ability is philosophically difficult to answer (Mazzone and Elgammal 2019; McCormack, Gifford, and Hutchings 2019). Still, in 2018, Christie’s auctioned off its first piece of GAN art for $432,500 (Cohn 2018), and GANs are increasingly assisting humans in the creative process for all forms of media. Simple models, like CycleGAN, are already able to stylize images in the manner of Van Gogh or Monet (Zhu et al. 2017), and more varied stylistic GANs are emerging. GauGAN, a beta tool released by NVIDIA, is a great example of GAN-assisted creativity in action. GauGAN allows you to rough out a scene using a paintbrush for different categories, like clouds, flowers, and houses (Figure 2.8). It then converts this into a photo reflecting what you have drawn. The online demo [6] remains limited, but the underlying model is powerful and has massive potential (Park et al. 2019). Recently, Martin Scorsese’s The Irishman made headlines for its digital de-aging of Robert De Niro and other actors. Although this process did not involve GANs, it is highly likely that in the future, GANs will become a major part of cinematic post-production (Giardina 2019) through assistive tools like GauGAN.

Figure 2.8: This example of GauGAN in action shows a sketched-out scene on the left turned into a photo-realistic landscape on the right. *If any representatives of Christie’s are reading, the author would be happy to auction this piece.

Fashion and product design are also being impacted by the use of GANs. Text-to-image synthesis, which can take free text or categories as input to generate a photo-realistic image, has promising potential (Rostamzadeh et al. 2018). By accepting text as input, GANs can let designers rapidly generate new ideas or visualize concepts for products at the start of the design process. For example, a recently published GAN for clothing design accepts basic text and outputs modeled images of the described clothing (Banerjee et al. 2019; Figure 2.9). In an example of automotive design, a single sketch can be used to generate realistic photos of multiple perspectives of a vehicle (Radhakrishnan et al. 2018).
The many fields that rely on quick sketching or visual prototyping, such as architecture or web design, are likely to be influenced by the use of GAN-assisted design software in coming years.

In a similar vein, GANs have an emerging role in the creation of new medicines, chemicals, and materials (Zhavoronkov 2018). By training a GAN on existing chemical and material structures, research is showing that novel chemicals and materials can be designed with particular properties (Gómez-Bombarelli et al. 2018; Sanchez-Lengeling and Aspuru-Guzik 2018). This is facilitated by how information is encoded in the GAN’s latent space (the n-dimensional space from which the generator samples; see “Z” in Figure 2.3). As the generator learns to produce realistic examples, certain aspects of the original data become encoded in regions of the latent space. By moving through this latent space or sampling particular areas, new data with desired properties can then be generated.

[6] See http://nvidia-research-mingyuliu.com/gaugan/ (last accessed January 12, 2019).

Figure 2.9: Text-to-image synthesis can generate images of new fashions based on a description. From the input “maroon round neck mini print a-line bodycon short sleeves” a GAN has produced these three photos (from Banerjee et al. 2019, fig. 11).

Figure 2.10: Two examples of linearly spaced mappings across the latent space between generated images A and B. Note that by taking one image and moving closer to another, you can alter properties in the image, such as adding steam, removing a cup handle, or changing the angle of view. These characteristics of the dataset are learned by the generator during training and encoded in the latent space. (GAN built on coffee cup sketches from Google’s QuickDraw dataset.)
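Moving through the latent space between two points can be sketched as a simple linear interpolation. The function below is an illustrative assumption rather than code from any GAN library: it only produces the evenly spaced latent vectors, and each of them would then be passed to a trained generator (not shown here) to obtain the corresponding image.

```python
from typing import List, Sequence


def interpolate_latent(
    z_a: Sequence[float], z_b: Sequence[float], steps: int
) -> List[List[float]]:
    """Return `steps` evenly spaced latent vectors from z_a to z_b inclusive."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimensionality")
    if steps < 2:
        raise ValueError("need at least two steps (the two endpoints)")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # fraction of the way from z_a (t=0) to z_b (t=1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path


# Example with a 2-dimensional latent space; real latent spaces are usually
# much larger (the dimensionality n is chosen when the GAN is designed).
for z in interpolate_latent([0.0, 2.0], [1.0, 0.0], steps=3):
    print(z)
```

The first and last vectors reproduce the two starting points, and the intermediate vectors are the places where blended properties would be expected to appear.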
This can be seen by periodically sampling the latent space and generating an image as one moves between two generated images (Figure 2.10). In the same way, by moving in certain directions or sampling from particular areas of the latent space, new chemicals or medicines with specific properties can be generated [7].

[7] This is also relevant to facial manipulation discussed below.

Impersonation and the Invisible

I have reserved some of the more dystopian and likely better-known applications of GANs for last. This is the area where GANs’ ability to generate convincing media is challenging our perceptions of reality and raising extreme ethical questions (Harper 2018). Deep fakes are, of course, the best known of these. They can include the creation of fake images, videos, and audio of an individual or the modification of any media to alter what someone appears to be doing or saying. In images and video in particular, GANs make it possible to swap the identity of an individual and manipulate facial attributes or expressions (Tolosana et al. 2020). A large portion of technical literature is, in fact, now devoted to detecting faked and altered media (see Tolosana et al. 2020, Tables IV and V). It remains to be seen how successful any approaches will be. From a theoretical perspective, anything that can detect fakes can also be used to train a better generator, since the training process of a GAN is founded on outsmarting a detector (i.e. the discriminator network).

Figure 2.11: GANs are providing a method to reconstruct hidden images of people and objects. Images 1–3 show reconstructions as compared to an input occluded image (OCC) and a ground truth image (GT) (from Fulgeri et al. 2019, fig. 6).

One shocking extension of deep fakes that has emerged is transcript-to-video creation, which generates a video of someone speaking from a written text.
If you want to see this at work, you can view clips of Nixon giving the speech written in the case of an Apollo 11 disaster [8]. As of now, deep fakes like this remain choppy and are largely limited to politicians and celebrities because they require large datasets and additional manipulation, but this limitation is not likely to last. If the evolution of GANs for images is any predictor, the entire emerging field of video generation is likely to progress rapidly. One can imagine the incorporation of text-to-image and deep fakes enabling someone to produce an image of, say, “politician X doing action Y,” simply by typing it.

[8] See http://news.mit.edu/2019/mit-apollo-deepfake-art-installation-aims-to-empower-more-discerning-public-1125.

An application of GANs that parallels deep fakes and is likely more menacing in the short term is the infilling or adding of hidden, invisible, or predicted information to existing media. One nascent use is video prediction from an image. For example, in 2017, researchers were able to build a GAN that produced 1-second video clips from a single starting frame (Vondrick and Torralba 2017). This may not seem impressive, but video is notoriously difficult to work with because the content of a succeeding frame can vary so drastically from the preceding frame (for other examples of ongoing research into video prediction, see Cai et al. 2018; Wen et al. 2019). For still images, occluded object reconstruction, in which a GAN is trained to produce a full image of a person or object that is partially hidden behind something else, is progressing (Fulgeri et al. 2019; see Figure 2.11). For some applications, like autonomous driving, this could save lives, as it would help to pick out when a partially occluded pedestrian is about to emerge from behind a parked car. On the other hand, for surveillance technology, it can further undermine anonymity. Indeed, such GANs are already being explicitly studied for surveillance purposes (Fabbri, Calderara, and Cucchiara 2017). Lastly, I would be remiss if I did not mention that researchers have designed a GAN that can generate an image of what you are thinking about, using EEG signals (Tirupattur et al. 2018).

GANs and the Future

The tension between the creation of more realistic generated data and the technology to detect maliciously generated information is only beginning. The machine learning and data science platform Kaggle is replete with publicly accessible Python code for building GANs and detecting fake data. Money, too, is freely flowing in this domain of research; the 2019 Deepfake Detection Challenge sponsored by Facebook, AWS, and Microsoft boasted one million dollars in prizes (https://www.kaggle.com/c/deepfake-detection-challenge, accessed April 20, 2020). Meanwhile, industry leaders, such as NVIDIA, continue to fund the training of better and more convincing GANs. The structure of a GAN, with its generator and detector paired adversarially, is now being mirrored in society as groups of researchers competitively work to create and discern generated data. The path that this machine-learning arms race will take is unpredictable, and, therefore, it is all the more important to scrutinize it and make it comprehensible to the broader publics whom it will affect.

References

Alqahtani, Hamed, Manolya Kavakli-Thorne, and Gulshan Kumar. 2019. “Applications of Generative Adversarial Networks (GANs): An Updated Review.” Archives of Computational Methods in Engineering, December. https://doi.org/10.1007/s11831-019-09388-y.

Banerjee, Rajdeep H., Anoop Rajagopal, Nilpa Jha, Arun Patro, and Aruna Rajan. 2019. “Let AI Clothe You: Diversified Fashion Generation.” In Computer Vision – ACCV 2018 Workshops, edited by Gustavo Carneiro and Shaodi You, 75–87. Cham: Springer International Publishing.

Baur, Christoph, Shadi Albarqouni, and Nassir Navab. 2018. “Generating Highly Realistic Images of Skin Lesions with GANs.” September. https://arxiv.org/abs/1809.01410.

Burkov, Andriy. 2019. The Hundred-Page Machine Learning Book. Self-published, Amazon.

Cai, Haoye, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. 2018. “Deep Video Generation, Prediction and Completion of Human Action Sequences.” In Computer Vision – ECCV 2018, edited by Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, 374–90. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-01216-8_23.

Chowdhury, Sohini Roy et al. 2019. “Automated Augmentation with Reinforcement Learning and GANs for Robust Identification of Traffic Signs Using Front Camera Images.” In 53rd Asilomar Conference on Signals, Systems & Computers, 79–83. N.p.: IEEE. https://doi.org/10.1109/IEEECONF44664.2019.9049005.

Cohen, Libby. 2020. “Reddit Bans Deepfakes with ‘Malicious’ Intent.” The Daily Dot. January 10, 2020. https://www.dailydot.com/layer8/reddit-deepfakes-ban/.

Cohn, Gabe. 2018. “AI Art at Christie’s Sells for $432,500.” The New York Times, October 25, 2018. https://www.nytimes.com/2018/10/25/arts/design/ai-art-sold-christies.html.

Fabbri, Matteo, Simone Calderara, and Rita Cucchiara. 2017. “Generative Adversarial Models for People Attribute Recognition in Surveillance.” In 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). N.p.: IEEE. https://doi.org/10.1109/AVSS.2017.8078521.

Fulgeri, Federico, Matteo Fabbri, Stefano Alletto, Simone Calderara, and Rita Cucchiara. 2019. “Can Adversarial Networks Hallucinate Occluded People With a Plausible Aspect?” Computer Vision and Image Understanding 182 (May): 71–80.

Giardina, Carolyn. 2019. “Will Smith, Robert De Niro and the Rise of the All-Digital Actor.” The Hollywood Reporter, August 10, 2019. https://www.hollywoodreporter.com/behind-screen/rise-all-digital-actor-1229783.

Giles, Martin. 2018. “The GANfather: The Man Who’s Given Machines the Gift of Imagination.” MIT Technology Review 121, no. 2 (March/April): 48–53.

Gómez-Bombarelli, Rafael, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. 2018. “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules.” ACS Central Science 4, no. 2 (February): 268–76. https://doi.org/10.1021/acscentsci.7b00572.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, edited by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, 27:2672–2680. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Harper, Charlie. 2018. “Machine Learning and the Library or: How I Learned to Stop Worrying and Love My Robot Overlords.” Code4Lib Journal, no. 41 (August). https://journal.code4lib.org/articles/13671.

Kahng, Minsuk, Nikhil Thorat, Duen Horng Polo Chau, Fernanda B. Viegas, and Martin Wattenberg. 2019. “GAN Lab: Understanding Complex Deep Generative Models Using Interactive Visual Experimentation.” IEEE Transactions on Visualization and Computer Graphics 25, no. 1 (January 2019): 310–320. https://doi.org/10.1109/tvcg.2018.2864500.

Karras, Tero, Samuli Laine, and Timo Aila. 2019. “A Style-Based Generator Architecture for Generative Adversarial Networks.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4396–4405. N.p.: IEEE. https://doi.org/10.1109/CVPR.2019.00453.

Korkinof, Dimitrios, Tobias Rijken, Michael O’Neill, Joseph Yearsley, Hugh Harvey, and Ben Glocker. 2018. “High-Resolution Mammogram Synthesis Using Progressive Generative Adversarial Networks.” Preprint, submitted July 9, 2018. https://arxiv.org/abs/1807.03401.

Langr, Jakub, and Vladimir Bok. 2019. GANs in Action: Deep Learning with Generative Adversarial Networks. Shelter Island, NY: Manning Publications.

Ma, Dongao, Ping Tang, and Lijun Zhao. 2019. “SiftingGAN: Generating and Sifting Labeled Samples to Improve the Remote Sensing Image Scene Classification Baseline In Vitro.” IEEE Geoscience and Remote Sensing Letters 16, no. 7 (July): 1046–1050. https://doi.org/10.1109/lgrs.2018.2890413.

Ma, Lei, Yu Liu, Xueliang Zhang, Yuanxin Ye, Gaofei Yin, and Brian Alan Johnson. 2019. “Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review.” ISPRS Journal of Photogrammetry and Remote Sensing 152 (June): 166–77. https://doi.org/10.1016/j.isprsjprs.2019.04.015.

Mazzone, Marian, and Ahmed Elgammal. 2019. “Art, Creativity, and the Potential of Artificial Intelligence.” Arts 8, no. 1 (March): 1–9. https://doi.org/10.3390/arts8010026.

McCormack, Jon, Toby Gifford, and Patrick Hutchings. 2019. “Autonomy, Authenticity, Authorship and Intention in Computer Generated Art.” In Computational Intelligence in Music, Sound, Art and Design, edited by Anikó Ekárt, Antonios Liapis, and María Luz Castro Pena, 35–50. Cham: Springer International Publishing.

Mukherjee, Sumit, Yixi Xu, Anusua Trivedi, and Juan Lavista Ferres. 2019. “Protecting GANs against Privacy Attacks by Preventing Overfitting.” Preprint, submitted December 31, 2019. https://arxiv.org/abs/2001.00071v1.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. Cambridge, Mass: MIT Press.

Park, Taesung, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. “Semantic Image Synthesis with Spatially-Adaptive Normalization.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2332–2341. N.p.: IEEE. https://doi.org/10.1109/CVPR.2019.00244.

Radhakrishnan, Sreedhar, Varun Bharadwaj, Varun Manjunath, and Ramamoorthy Srinath. 2018. “Creative Intelligence – Automating Car Design Studio with Generative Adversarial Networks (GAN).” In Machine Learning and Knowledge Extraction, edited by Andreas Holzinger, Peter Kieseberg, A Min Tjoa, and Edgar Weippl, 160–75. Cham: Springer International Publishing.

Romera, Eduardo, Luis M. Bergasa, Kailun Yang, Jose M. Alvarez, and Rafael Barea. 2019. “Bridging the Day and Night Domain Gap for Semantic Segmentation.” In 2019 IEEE Intelligent Vehicles Symposium (IV), 1312–18. N.p.: IEEE. https://doi.org/10.1109/IVS.2019.8813888.

Romm, Tony, Drew Harwell, and Isaac Stanley-Becker. 2020. “Facebook Bans Deepfakes, but New Policy May Not Cover Controversial Pelosi Video.” The Washington Post. January 7, 2020. https://www.washingtonpost.com/technology/2020/01/06/facebook-ban-deepfakes-sources-say-new-policy-may-not-cover-controversial-pelosi-video/.

Rostamzadeh, Negar, Seyedarian Hosseini, Thomas Boquet, Wojciech Stokowiec, Ying Zhang, Christian Jauvin, and Chris Pal. 2018. “Fashion-Gen: The Generative Fashion Dataset and Challenge.” Preprint, submitted June 21, 2018. https://arxiv.org/abs/1806.08317.

Sanchez-Lengeling, Benjamin, and Alán Aspuru-Guzik. 2018. “Inverse Molecular Design Using Machine Learning: Generative Models for Matter Engineering.” Science 361, no. 6400 (July): 360–365. https://doi.org/10.1126/science.aat2663.

Shorten, Connor, and Taghi M. Khoshgoftaar. 2019. “A Survey on Image Data Augmentation for Deep Learning.” Journal of Big Data 6 (60): 1–48. https://doi.org/10.1186/s40537-019-0197-0.

Tirupattur, Praveen, Yogesh Singh Rawat, Concetto Spampinato, and Mubarak Shah. 2018. “Thoughtviz: Visualizing Human Thoughts Using Generative Adversarial Network.” In Proceedings of the 26th ACM International Conference on Multimedia, 950–958. New York: Association for Computing Machinery. https://doi.org/10.1145/3240508.3240641.
Tolosana, Ruben, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. “DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection.” Preprint, submitted January 1, 2020. https://arxiv.org/abs/2001.00179.

Uřičář, Michal, Pavel Křížek, David Hurych, Ibrahim Sobh, Senthil Yogamani, and Patrick Denny. 2019. “Yes, We GAN: Applying Adversarial Techniques for Autonomous Driving.” In IS&T International Symposium on Electronic Imaging, 1–16. Springfield, VA: Society for Imaging Science and Technology. https://doi.org/10.2352/ISSN.2470-1173.2019.15.AVM-048.

Vondrick, Carl, and Antonio Torralba. 2017. “Generating the Future with Adversarial Transformers.” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2992–3000. N.p.: IEEE. https://doi.org/10.1109/CVPR.2017.319.

Wang, Zhengwei, Qi She, and Tomas E. Ward. 2019. “Generative Adversarial Networks: A Survey and Taxonomy.” Preprint, submitted June 4, 2019. https://arxiv.org/abs/1906.01529.

Wen, Shiping, Weiwei Liu, Yin Yang, Tingwen Huang, and Zhigang Zeng. 2019. “Generating Realistic Videos From Keyframes With Concatenated GANs.” IEEE Transactions on Circuits and Systems for Video Technology 29 (8): 2337–48. https://doi.org/10.1109/TCSVT.2018.2867934.

Yi, Xin, Ekta Walia, and Paul Babyn. 2019. “Generative Adversarial Network in Medical Imaging: A Review.” Medical Image Analysis 58 (December): 1–20. https://doi.org/10.1016/j.media.2019.101552.

Zhavoronkov, Alex. 2018. “Artificial Intelligence for Drug Discovery, Biomarker Development, and Generation of Novel Chemistry.” Molecular Pharmaceutics 15, no. 10 (October): 4311–13. https://doi.org/10.1021/acs.molpharmaceut.8b00930.

Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks.” In 2017 IEEE International Conference on Computer Vision (ICCV), 2242–2251. N.p.: IEEE. https://doi.org/10.1109/ICCV.2017.244.
hintze-artificial-2021 ---- Chapter 1 Artificial Intelligence in the Humanities: Wolf in Disguise, or Digital Revolution?
Arend Hintze, Dalarna University
Jorden Schossau, Michigan State University
In Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 1

Introduction

Artificial Intelligence, with its ability to machine learn coupled to an almost human-like understanding, sounds like the ideal tool for the humanities. Instead of using primitive quantitative methods to count words or catalogue books, current advancements promise to reveal insights that otherwise could only be obtained by years of dedicated scholarship. But are these technologies imbued with intuition or understanding, and do they learn like humans? Are they capable of developing their own perspective, and can they aid in qualitative research?

In the 1980s and 1990s, as home computers were becoming more common, Hollywood was sensationalizing the idea of smart or human-like artificially intelligent (AI) machines through movies such as Terminator, Blade Runner, Short Circuit, and Bicentennial Man. At the same time, the home experience of personal computing highlighted the difference between the intelligent machines of Hollywood and the reality of how “dumb” machines really were. Home machines, and even industry machines, could answer natural language questions of only the simplest complexity. Instead, users or programmers needed to painstakingly implement an algorithm to address their question. Then, the user was required to wait for the machine to slavishly follow each programmed instruction while hoping that whoever entered the instructions did not make a mistake. Despite the Hollywood sensation of intelligent machines, people understood that computers did not and could not think like humans, but that they do excel at performing repetitive tasks with extreme speed and fidelity. This shaped the expectations for interacting with computers.
Computers became efficient tools that required specific instruction in order to achieve a desired outcome.

Computational technology and user experience changed drastically over the next 20 years. Technology became much more intuitive to use while it also became much more powerful at handling large data sets. For instance, Google can return search results for websites in response to even the silliest or sparsest request, with a decent chance that the results are relevant to the question asked. Did you read a manual before you used your smartphone, or did you, like everyone else, just “figure it out”? Or, as a consequence of modern-day media and its on-demand services, children ask to skip a song playing on a radio broadcast. The older technologies quickly feel archaic.

These technological advancements go hand in hand with developments in the field of machine learning and artificial intelligence. The automotive industry is on the cusp of fully self-driving cars. Electronic assistants are not only keeping track of our dates and responding to spoken language, they will also soon start making our appointments by speaking to other humans on our behalf. Databases are getting new voice-controlled intuitive interfaces, changing a typical incomprehensible “SELECT AVG(salary) FROM employeeList WHERE yearHired > 2012;” to a spoken “Average salary of our employees hired after 2012?”

Another phenomenon is the trend in many disciplines to go from “qualitative” to “quantitative” research, or to think about the “system” rather than the “components.” The field that probably experienced this trend first was biology. While biology is obviously descriptive about species of organisms, biologists have also always wanted to understand the mechanisms that drive life on Earth, spanning micro to macro scales.
Consequently, a lot is known about the individual chemical components that constitute our metabolism, the components that drive cell division and DNA replication, and which genes are involved in, for example, developmental processes. However, in many cases, our scientific knowledge only covers single functions of single components. In the context of the cell, the state of the organism and how other components interact matter a lot. Cancer, for example, cannot be explained by a single mutation on a single gene but involves many complex interactions (Hanahan and Weinberg 2011). Ecosystems don’t collapse because a single insect dies, but because indirect changes in the food chain interact in complex ways (for a review of the different theories, see Tilman 1996). As a result, systems biology emerged. Systems biologists use large data sets and are often dependent on computer models to understand phenomena on the systems level.

Bioinformatics is one example of an entire field that emerged as a result of using computers to study entire systems that were otherwise humanly intractable. The Human Genome Project, which set out to sequence the complete human genome, finished in 2003, a time when consumer data storage was limited by the amount of data that fit on a DVD (4.7 GB). While the human genome fits on a DVD, the data that came from the sequencing machines was much larger. Short repetitive sequences first needed assembly, which at that time was a high-performance computing task.

Other fields have since undergone their own computational revolutions, and now the humanities are beginning theirs. Computers have been a part of core library infrastructure and experience for some time now, cataloging entries in a database and allowing intuitive user exploration of that database. However, the digital humanities go beyond this (Fitzpatrick 2012).
The ability to analyze (crawl) extremely large corpora of different sources, monitor the internet using the Internet of Things as large sensor arrays, and detect patterns by using sophisticated algorithms can each produce a treasure trove of quantitative data. Until this point these tasks could only be described or analyzed qualitatively. Additionally, artificial intelligence promises models of the human mind (Yampolskiy and Fox 2012). Machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain will eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). This would ultimately close the gap between quantitative and qualitative approaches by allowing an inspection of experience.

However, this bridging between quantitative and qualitative methods causes a possible tension for the humanities, which have historically defined themselves by qualitative methodologies. When qualitative experiences or responses can be finely quantified, such as sadness caused by reading a particular passage, or the curiosity caused by viewing certain works of art, then the field will undergo a revolution. When this happens, we will be able to quantify and discuss how sadness was learned by reading, or how much surprise was generated by viewing an artwork.

This is exactly the point where the metaphors break down. Current computational models of the mind are not sophisticated enough to allow these kinds of inferences. Machine learning algorithms work well for what they do but have nothing to do with what a person would call learning. Artificial intelligence is a broad, encompassing field. It includes methods that might have appeared to be magic only a couple of years ago (such as generative adversarial networks).
Algorithmic finesse resulting from these advances is capable of beating humans in chess (Campbell, Hoane Jr, and Hsu 2002), but it is only a very specialized algorithm that has nothing to do with the way humans play or learn chess. This means we are back to the problem we had in the 1980s. Instead of being disappointed by the difference between modern technology and Hollywood technology, we are disappointed by the difference between modern technology and the experience implied by the labels given to those technologies. Applying misnomers such as “smart,” “intelligent,” “search,” and “learning” to modern technologies that have little to do with those terms is misleading. It is possible that such technology was deliberately branded with these terms for improved marketing and sales, effectively redefining them and obscuring their original meaning. Consequently, we are again disappointed by the mismatch between the expectations of our computing infrastructure and the reality of our experiences.

The following paragraphs will explore current Machine Learning and Artificial Intelligence technologies, explain how quantitative or qualitative they really are, and explore what the possible implications for future Digital Humanities could be.

Learning: Phenomenon versus Mechanism

Learning is an electrochemical process that involves cells, their genetic makeup, and how they are interconnected. Some interplay between external stimuli and receptor proteins in specialized sensor neurons leads to electrochemical signals propagating over a network of interconnected cells, which themselves respond with physical and genetic changes to said stimuli, probably also dependent on previous stimuli (Kandel, Schwartz, and Jessell 2000). This concoction of elaborate terms might suggest that we know in principle which parts are involved and where they are, but we are far from an understanding of the learning mechanism.
The description above is as generic as saying that a city functions because cars drive on streets. Even though we might know a lot about long-term potentiation or the mechanism of neurons that fire together wiring together (also known as Hebbian learning), neither of these processes actually mechanistically explains how learning works. Neuroscience, neurophysiology, and cognitive science have not been able to discover this complete process in such a way that we can replicate it, though some inroads are being made (El-Boustani et al. 2018). Similarly, we find promising new interdisciplinary efforts like “cognitive computational neuroscience” that try to bridge the gap between neuroscience, cognitive science, and computation (Kriegeskorte and Douglas 2018). So, unfortunately, while the components involved can be identified, the question of “how learning works” cannot be answered mechanistically.

However, a lot is known about the phenomenon of learning. It happens during the lifetime of an organism. What happens between the lifetimes of related organisms is an adaptive process called evolution: inheritance, variation, and natural selection over many generations, up to 3.5 billion years here on Earth, enabled populations of organisms to succeed in their environments in any way they could. Evolutionary forces found ways for organisms to adapt to their environment during their own lifetimes. While this can take many forms, such as storing energy, seeking shelter, or having a fight or flight response, it has led to the phenomenon we now call learning. Instead of discussing the diversity of learning in the animal kingdom, we will discuss the richest example: human learning.

Here, learning is defined as the cognitive adaptation to external stimulus. The phenomenon of learning can be observed as an increase in performance over time. Learning makes the organism better at doing something.
In humans, because we have language and a much higher degree of abstract thinking, an improvement in performance can be facilitated very quickly. While it takes time to learn how to juggle, the ability to find the mean of a series of samples can be quickly communicated by reading Wikipedia. Both types of lifetime adaptations are called learning. However, these lifetime adaptations are facilitated by two different cognitive processes: explicit or implicit learning. (There are more than these two mechanisms, but these are the two major ones.)

Explicit learning, or episodic memory, is fact-based memory. What you did yesterday, what happened in your childhood, or the list of things you should buy when you go shopping, are all memories. Currently, the engram theory best explains this mechanism (Poo et al. 2016 elaborates on the origins of the term). Explicit memory can be retrieved relatively easily and then used to inform future decisions: “Press the green button if the capital of Italy is Paris, otherwise press the red.” The rate of learning for explicit memory can be much higher than for implicit memory, and it can also be communicated more quickly. Abstract communication, such as “I saw a wolf,” allows us to transfer the experience of seeing a wolf quickly to other individuals, even though their evoked explicit memory might not be identical to ours.

Learning by using implicit memory, sometimes called procedural memory, is facilitated by much slower processes (Schacter, Chiu, and Ochsner 1993). It is generally based on the idea that learning is a combination of expectation, observation or action, and internal model changes. For example, a recovering hospital patient who has suffered a stroke is handed an apple. In this exchange, the patient forms an expectation of where their hand will be to accept the apple. The patient engages their muscles to move the forearm and hand to accept the apple, which is the action. Then the patient observes that the arm did not arrive at the correct position (due to neurological damage). This discrepancy between expectation and action-outcome drives internal changes so that the patient’s brain learns how to adequately control the arm. Presumably, everything considered a skill is based on this process. While very flexible, this form of memory is neither easily communicated nor fast to acquire. For instance, while juggling can be described, it cannot be communicated in such a way that it enables the recipient to juggle without additional training.

This description of explicit and implicit learning is an amalgamation of many different hypotheses and observations. Also, these processes are not as well segregated in practice as outlined here. What is important is what these two learning mechanisms are based on: observations lead to memory, and internal predictions together with exploration lead to improved models about the world. Lastly, these learning processes only exist in organisms because they previously conferred an evolutionary advantage: organisms that could memorize and then act on those memories had more offspring than those that did not. This interaction of learning and evolution is called the Baldwin effect (Weber and Depew 2003). Organisms that could explore the environment, make predictions about it, and use observations to optimize their internal models were similarly more capable than organisms that could not.

Machines do not Learn; They are Trained

Now prepared with a proper intuition about learning, we can turn our attention to machine learning. After all, our intuitions should be meaningful in the computational domain as well, because learning always follows the same pattern. One might be disappointed when looking over the table of contents of a machine learning book and finding only methods for creating static transformation functions (see Russell and Norvig 2016, one of the putative foundations of machine learning and AI).
There will typically be a distinction between supervised and unsupervised learning, between categorical and continuous data, and maybe a section about other “smart” algorithms. You will not find a discussion about implicit and explicit memory, let alone methods for implementing these concepts. So, if these important sections in our imaginary machine learning book do not discuss the mechanisms of learning, then what are they discussing?

Unsupervised learning describes algorithms that report information based on associations within the data. Clustering algorithms are a popular example of unsupervised learning. These use similarity between data points to form and report on distinct groups of data. Clustering is a very important method, but it is only a well-designed algorithm and is not adaptive.

Supervised learning describes algorithms that refine a transformation function to convert from a certain input to a certain output. The idea is to balance specific and general refining such that the transformation function correctly transforms all known examples but generalizes enough to work well on new variations. For example, we would like the machine to transform image data into textual labels, such as “house” or “car.” The input is an image and the output is a label. The input image data are provided to the machine, and small adjustments to the machine’s function are made depending on how well it provided the correct output. After many iterations, the result is ideally a machine that can transform all image data to correct labels, and even operate correctly on new variations of images not provided before. Supervised learning is extremely powerful and is yet to be fully explored.

However, supervised learning is quite dissimilar to actual learning. A common argument is that supervised learning uses feedback in a “student-teacher” paradigm of making changes with feedback until proper behavior is achieved, so it could be considered learning.
But this feedback is external, objective, and not at all similar to our prediction and comparison model that, for instance, operates without an all-knowing oracle whispering “good” or “bad” into our ears. Humans and other organisms instead compare predictions with outcomes, and the choices are driven by an intersection of desire and prediction.

What seems astonishing is the diverse and specialized capabilities that these two rather simple types of computation, clustering and classification, can produce. Their economic impact is enormous, and we are still finding new ways to combine neural networks and exploit deep learning techniques to create amazing data transformations, such as deep fake videos. But so far, each astounding example of AI, through machine learning or some other method, is not showcasing all these capabilities as one machine, but instead each as an independently achieved computational marvel. Each of these examples does only exactly what it was trained to do in a narrow domain and no more. Siri, or any other voice assistant for that matter, does not drive a car (López, Quesada, and Guerrero 2017), Watson does not play chess (Ferrucci et al. 2013), and Google’s AlphaGo cannot understand spoken language (Gibney 2016). Even hybrid approaches, such as combining speech recognition, chess playing, and autonomous driving, would only be a combination of specialty strategies, not an entity trained from the ground up.

Modern machine learning gives us an amazing collection of very applicable, but extremely specialized, computational tools that may be customized to particular data sets, but the resulting machines do not learn autonomously as you or I do.
There are cutting-edge technologies, such as so-called neuromorphic chips (Nawrocki, Voyles, and Shaheen 2016) and other computational brain models that more closely mimic brain function, but they are not what has been sensationalized in the media as machine learning or AI, and they have yet to show competence on difficult problems that is competitive with standard supervised learning.

Curiously, many people in the machine learning community defend the term “learning,” arguing there is no difference between learning and training. In traditional machine learning, the trained algorithm is deployed as a service, after which it no longer improves. If the data set ever changes, then a new training set including correct labels needs to be generated and a new training phase initiated. However, if the teacher can be forever bundled with the learner and training continued during the deployment phase, even on new never-before-seen data, then indeed the delineation between learning and training is far less clear. Approaches to such lifelong learning exist, but they struggle with what is called catastrophic forgetting, the phenomenon that only the most recent experiences are learned at the expense of older ones (French 1999). Continued training during deployment is also the objective of Continuous Delivery for machine learning. Unfortunately, creating a new training set is typically the most expensive endeavor in standard supervised machine learning development. Adequate training then becomes difficult or impossible without involving thousands or millions of human inputs to keep up with training and using the online machine on an ever-evolving data set. Some have tried to use such “human-in-the-loop” methods, but the resulting machine then becomes only a slight extension of the humans who are forever caught in the loop. Is it an intelligent machine, or a human trapped in a machine?
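The separation between a training phase and a frozen deployment phase described above can be illustrated with a small sketch. This is an illustration only, not a method taken from the chapter: it assumes a simple perceptron as a stand-in for a supervised learner, and the function names (`make_data`, `train_perceptron`, `accuracy`) and the two labeling rules are hypothetical. The model is refined from labeled examples, then frozen, then evaluated on data whose labeling rule has changed.

```python
import random


def make_data(rule, n, rng):
    """Draw n labeled 2-D points; x1 is kept at least 0.2 away from zero."""
    data = []
    for _ in range(n):
        x1 = rng.uniform(0.2, 1.0) * rng.choice([-1, 1])
        x2 = rng.uniform(-1.0, 1.0)
        data.append(((x1, x2), rule(x1, x2)))
    return data


def predict(model, point):
    """Apply the (static) transformation function: point -> label 0 or 1."""
    w1, w2, b = model
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else 0


def train_perceptron(examples, epochs=100, lr=0.1):
    """Training phase: nudge the weights whenever the external label disagrees."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            error = label - predict((w1, w2, b), (x1, x2))  # teacher feedback
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error
    return (w1, w2, b)


def accuracy(model, examples):
    """Fraction of examples for which the frozen model returns the given label."""
    hits = sum(predict(model, point) == label for point, label in examples)
    return hits / len(examples)


def old_rule(x1, x2):
    """Hypothetical labeling rule in force while the training set was built."""
    return 1 if x1 > 0 else 0


def new_rule(x1, x2):
    """Hypothetical rule after the data set changes: the labels are flipped."""
    return 1 if x1 < 0 else 0


rng = random.Random(0)
model = train_perceptron(make_data(old_rule, 200, rng))  # training phase

# Deployment phase: the model is frozen and receives no further feedback.
acc_same = accuracy(model, make_data(old_rule, 200, rng))
acc_shifted = accuracy(model, make_data(new_rule, 200, rng))
print("accuracy, unchanged data:", acc_same)
print("accuracy, changed data:  ", acc_shifted)
```

In this sketch nothing in the deployed model reacts to the change in the data. Restoring its performance would require a newly labeled training set and another call to `train_perceptron`, which is the cost the paragraph above describes.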
To combat this problem of generating the training set, researchers altered the standard supervised learning paradigm of flexible learner and rigid teacher to make the teacher likewise flexible to generate new data, continually probing the bounds of the student machine. This is the method of Generative Adversarial Networks, or GANs (Goodfellow et al. 2014). The teacher generates training examples and the student discerns between those generated examples and the original labeled training data. After many iterations, the teacher is improved to better fool the student, and the student is improved to better discern generated training data. As amazing as they are, GANs only partially mitigate the problematic requirement for human-labeled training data, because GANs can only mimic a known labeled distribution. If that distribution ever changes, then new labeled data must be generated, and again we have the same problem as before. Unfortunately, GANs have been sensationalized as magic, and the public and hobbyist expectation is that GANs are a way toward much better artificial intelligence. Disappointment is inevitable because GANs only allow us to explore what it would be like to have more training data from the same data sets we were using before.

These expectations are important for machine learning and AI. We are very familiar with learning, to the point where our whole identity as humans could be generously defined as the result of being a monkey with an exceptional proclivity for learning. If we now approach AI and machine learning with expectations that these technologies learn as we do, or are an equally general-purpose intelligence, then we will be bitterly disappointed. The best example of such discrepancy is how easily neural networks trained by deep learning can be fooled. Images that are seemingly identical and differ only by a few pixels are grossly misclassified, a mistake no human would make (Nguyen, Yosinski, and Clune 2015).
Fortunately, we know about these biases and the possible shortcomings of these methods. As long as we have the right expectations, we can take their flaws into account and still enjoy the prospects they provide.

Trained Machines: Tool or Provocation?

On one side we have the natural sciences, characterized by hypothesis-driven experimentation reducing reality to an abstract model of causal interactions. This approach can inform us about the consequences of our possible actions, but only as far in the future as the model can adequately predict. With machine learning and AI, we can move this temporal horizon of prediction farther into the future. While weather models might still struggle to predict precipitation seven days in advance, global climate models predict in detail the effects of global warming in 100 years. But these models are nihilist, void of values, and cannot themselves answer the question of whether humans would prefer to live in one possible future or another. Is sunshine better than rain? The humanities, on the other hand, are home to exactly these problems. What are our values? How do we understand what is essential? Now that we know the facts, how should we choose? Do we speak for everyone? The questions seem to be endless, but they are what makes our human experience so special, and what separates the humanities from the sciences.

Labels such as learning or intelligence are too easily anthropomorphized. A technology branded in this way suggests human-like properties: intelligence, common sense, or even subjective opinion. From a name like “deep learning” we expect a system that develops a deep and intuitive understanding with insights more profound than our own. However, these systems do not provide an alternative perspective, but as explained above, are only as good or as biased as the scientist selecting their training data.
Just because humans and machine learning are both black boxes, in the sense that their inner workings are opaque, does not mean they share other qualities. For instance, having labeled the ML training process as “learning” does not imply that ML algorithms are curious and learn from observations. While these new computerized quantitative measures might be welcomed by some scholars, there will be others who view them as an existential threat to the very nature of the humanities. Are these quantitative methods sneaking into the humanities disguised by anthropomorphic terms, like a wolf shrouded in a sheep’s fleece? From this viewpoint, having the wrong expectations not only provokes disappointment but also floods the humanities with sophisticated technologies that dilute and muddy the nature of the qualitative research that makes the humanities special.

However, this imminent clash between quantitative and qualitative research also provides a unique opportunity. Suppose there is a question that can only be answered subjectively and qualitatively. If so, it would define a hard boundary against the aforementioned reductionism of the purely causal quantitative approach. At the same time, such a boundary presents the perfect target for an artificially intelligent system to prove its utility. If a computational human analog can be created, then it must be capable of performing the same tasks as a humanities researcher. In other words, it must be able to answer subjective and qualitative questions, regardless of its computational and quantitative construction. Failing at such a task would be equivalent to failing the famous Turing test, thereby proving the AI is not yet human-like enough. In this way, the qualitative nature of the humanities poses a challenge, and maybe a threat, to artificially intelligent systems.
While some might say the threat is mutual, past successes of interdisciplinary research suggest otherwise: the digital humanities could become the forefront of AI research.

Beyond machine training, towards general purpose intelligence

Currently, machines do not learn but must be trained, typically with human-labeled data. ML algorithms are not smart as we are, but they can solve specific tasks in sophisticated ways. Perhaps sentience will only be a product of enough time and training data, but the path to sentience probably requires more than time and data. The process that gave rise to human intelligence was evolution. This opportunistic process optimized brains over endless generations to perform ever-changing tasks, and it is the only known example of a process that resulted in such complex intelligence. None of the earlier described computational methods even remotely follow this paradigm: researchers designed ad hoc algorithms that solved well-defined problems. The next iteration of these methods is either an incremental improvement of existing code, a new methodological invention, or an application to a new data set. These improvements do not compound to make AI tools better generalists, but instead contribute to the diversity of the existing tools.

One approach that does not suffer from these shortcomings is neuroevolution (Floreano, Dürr, and Mattiussi 2008). Currently, the field of neuroevolution is in its infancy, but finding new and creative solutions to otherwise unsolved problems, such as controlling robots driving cars, is a popular area of focus (Lehman et al. 2020). At the same time, memory formation (Marstaller, Hintze, and Adami 2013), information integration in the brain (Tononi 2004), and how systems evolve the ability to learn (Sheneman, Schossau, and Hintze 2019) are also being researched, as they are building blocks of general purpose intelligence.
While it is not clear how thinking machines will ultimately emerge, they are on the horizon. The dualism of a quantitative system that can be subjective and understand the qualitative nature of existence makes it a strange artifact that cannot be ignored.

References

Campbell, Murray, A Joseph Hoane Jr, and Feng-hsiung Hsu. 2002. “Deep Blue.” Artificial Intelligence 134 (1–2): 57–83.
El-Boustani, Sami, Jacque P K Ip, Vincent Breton-Provencher, Graham W Knott, Hiroyuki Okuno, Haruhiko Bito, and Mriganka Sur. 2018. “Locally Coordinated Synaptic Plasticity of Visual Cortex Neurons in Vivo.” Science 360 (6395): 1349–54.
Ferrucci, David, Anthony Levas, Sugato Bagchi, David Gondek, and Erik T Mueller. 2013. “Watson: Beyond Jeopardy!” Artificial Intelligence 199: 93–105.
Fitzpatrick, Kathleen. 2012. “The Humanities, Done Digitally.” In Debates in the Digital Humanities, edited by Matthew K. Gold, 12–15. Minneapolis: University of Minnesota Press.
Floreano, Dario, Peter Dürr, and Claudio Mattiussi. 2008. “Neuroevolution: From Architectures to Learning.” Evolutionary Intelligence 1 (1): 47–62.
French, Robert M. 1999. “Catastrophic Forgetting in Connectionist Networks.” Trends in Cognitive Sciences 3 (4): 128–35.
Gibney, Elizabeth. 2016. “Google AI Algorithm Masters Ancient Game of Go.” Nature News 529 (7587): 445.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 27 (NIPS 2014), edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, 2672–80. N.p.: Neural Information Processing Systems Foundation.
Hanahan, Douglas, and Robert A Weinberg. 2011. “Hallmarks of Cancer: The Next Generation.” Cell 144 (5): 646–74.
Kandel, Eric R, James H Schwartz, and Thomas M Jessell. 2000. Principles of Neural Science. 4th ed. New York: McGraw-Hill.
Kriegeskorte, Nikolaus, and Pamela K Douglas. 2018. “Cognitive Computational Neuroscience.” Nature Neuroscience 21: 1148–60.
Lehman, Joel, et al. 2020. “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities.” Artificial Life 26 (2): 274–306.
López, Gustavo, Luis Quesada, and Luis A Guerrero. 2017. “Alexa vs. Siri vs. Cortana vs. Google Assistant: A Comparison of Speech-Based Natural User Interfaces.” In International Conference on Applied Human Factors and Ergonomics, edited by Isabel L. Nunes, 241–50. Cham: Springer.
Marstaller, Lars, Arend Hintze, and Christoph Adami. 2013. “The Evolution of Representation in Simple Cognitive Networks.” Neural Computation 25 (8): 2079–2107.
Nawrocki, Robert A, Richard M Voyles, and Sean E Shaheen. 2016. “A Mini Review of Neuromorphic Architectures and Implementations.” IEEE Transactions on Electron Devices 63 (10): 3819–29.
Nguyen, Anh, Jason Yosinski, and Jeff Clune. 2015. “Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 427–36. N.p.: IEEE.
Poo, Mu-ming, et al. 2016. “What Is Memory? The Present State of the Engram.” BMC Biology 14: 1–18.
Russell, Stuart J, and Peter Norvig. 2016. Artificial Intelligence: A Modern Approach. Malaysia: Pearson Education Limited.
Schacter, Daniel L, C-Y Peter Chiu, and Kevin N Ochsner. 1993. “Implicit Memory: A Selective Review.” Annual Review of Neuroscience 16 (1): 159–82.
Sheneman, Leigh, Jory Schossau, and Arend Hintze. 2019. “The Evolution of Neuroplasticity and the Effect on Integrated Information.” Entropy 21 (5): 1–15.
Tilman, David. 1996. “Biodiversity: Population versus Ecosystem Stability.” Ecology 77 (2): 350–63.
Tononi, Giulio. 2004. “An Information Integration Theory of Consciousness.” BMC Neuroscience 5: 1–22.
Weber, Bruce H, and David J Depew. 2003. Evolution and Learning: The Baldwin Effect Reconsidered. Cambridge, MA: MIT Press.
Yampolskiy, Roman V, and Joshua Fox. 2012. “Artificial General Intelligence and the Human Mental Model.” In Singularity Hypotheses: A Scientific and Philosophical Assessment, edited by Ammon H. Eden, James H. Moor, Johnny H. Søraker, and Erik Steinhart, 129–45. Heidelberg: Springer.
janco-machine-2021 ---- Chapter 4 Machine Learning in Digital Scholarship Andrew Janco Haverford College
Introduction
We are entering an exciting time when research on machine learning and innovation no longer requires background knowledge in programming, mathematics, or data science. Tools like RunwayML, the Teachable Machine, and Google AutoML allow researchers to train project-specific classification and object detection models. Other tools such as Prodigy or INCEpTION provide the means to train custom named entity recognition and named entity linking models. Yet without a clear way to communicate the value and potential of these solutions, humanities scholars are unlikely to incorporate them into their research practices.
Since 2014, dramatic innovations in machine learning have occurred, providing new capabilities in computer vision, natural language processing, and other areas of applied artificial intelligence. Scholars in the humanities, however, are often skeptical. They are eager to realize the potential of these new methods in their research and scholarship, but they do not yet have the means to do so. They need to make connections between machine capabilities, research in the sciences, and tangible outcomes for humanities scholarship, but very often, drawing these connections is more a matter of chance than deliberate action. Is it possible to make such connections deliberately and identify how machine learning methods can benefit a scholar’s research?
This article outlines a method for connecting the technical possibilities of machine learning with the intellectual goals of academic researchers in the humanities. It argues for a reframing of the problem. Rather than appropriating innovations from computer science and artificial intelligence, this approach starts from humanities-based methods and practices.
This shift allows us to work from the needs of humanities scholars in terms that are familiar and have recognized value to their peers. Machines can augment scholars’ tasks with greater scale, precision, and reproducibility than are possible for a single scholar alone. However, only relatively basic and repetitive tasks can presently be delegated to machines.
This article argues that John Unsworth’s concept of “scholarly primitives” is an effective tool for identifying basic tasks that can be completed by computers in ways that advance humanities research (2000). As Unsworth writes, primitives are “basic functions common to scholarly activity across disciplines, over time, and independent of theoretical orientation.” They are the building blocks of research and analysis. As the roots and foundations of our work, “primitives” provide an effective starting point for the augmentation of scholarly tasks. Here it is important to note that the end goal is not the automation of scholarship, but rather the delegation of appropriate tasks to machines. As François Chollet recently noted,
Our field isn’t quite “artificial intelligence” — it’s “cognitive automation”: the encoding and operationalization of human-generated abstractions / behaviors / skills. The “intelligence” label is a category error. (2020)
This view shifts our focus from the potential intelligence of machines towards their ability to complete useful tasks for human ends. Specifically, they can augment scholars’ work by performing repetitive tasks at scale with superhuman speed and precision. I proceed from this understanding to argue for an experimental and interpretive approach to machine learning that highlights the value of the interaction between the scholar and machine rather than what machines can produce.
***
Unsworth’s notion of a “scholarly primitive” takes its meaning from programming and refers to the most basic operations and data types of a programming language. Primitives form the building blocks for all other components and operations of the language. This borrowing of terminology also suggests that primitives are not universal. A sequence of characters called a string is a primitive in Python, but not in Java or C. The architecture of a language’s primitives changes over time and evolves with community needs. The Python and C communities, for example, have embraced Unicode as a standard to allow strings in every human language (including emojis). Other communities continue to use a range of character encodings, which grants greater flexibility to the individual programmer and avoids the notion that there should be a common standard.
For scholarship, the term offers a metaphor and point of departure. It poses a question: What are the most basic elements of scholarly research and analysis? Unsworth offers several initial examples of primitives to illustrate their value without a claim that they are comprehensive, including discovering, annotating, comparing, referring, sampling, illustrating, and representing. These terms offer a “list of functions (recursive functions) that could be the basis for a manageable but also useful tool-building enterprise in humanities computing.” Primitives can thus guide us in the creation of computational tools for scholarship.
For example, with the primitive of comparison, a scholar might study different editions of a text, searching for similarities and differences that often lead to new insights or highlight ideas that would otherwise be taken for granted. As a tool, comparison can (but does not always) reveal new information. For an assignment in graduate school, I compared a historical calendar that showed the days of the week against entries in Stalin’s appointment book.
The simple juxtaposition revealed that none of Stalin’s appointments were on a Sunday. This example raises questions for further investigation and interpretation. If Stalin was an atheist who worked at all times of the day and night, why wouldn’t he schedule meetings on Sundays? Perhaps it was a legacy from Stalin’s youth spent in seminary? Is there a similar pattern in other periods of Stalin’s life? The craft of humanities research relies on many such simple initial queries. It should be noted that these little experiments are just the beginning of a research project. Nonetheless, the utility of comparison is clear. If anything, it seems so basic as to go unnoticed. This particular comparison offered an insight and new knowledge that led to further research questions.
Such beginnings are often a matter of luck. However, machine learning offers an opportunity to increase the dimensionality of comparisons. The similarities and differences between two editions of a text can easily be quantified using Levenshtein distance.1 However, that will only capture the differences at the level of characters on a page. With machine learning, we can train embeddings that account for semantics, authors, time periods, genders and other features of a text and its contents simultaneously. We can quantify similarity in new ways that facilitate new forms of comparison. This approach builds on the original meaning and purpose of comparison as a form of “scholarly primitive,” but opens additional directions for research and opportunities for insights. Rather than relying on happenstance or intuition to find productive comparisons, we can systematically search and compare research materials.
The second “scholarly primitive” that lends itself well to augmentation is annotation. This activity takes different forms across disciplines. A literary scholar might underline notable sections of a text or write a note in the margins.
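Returning briefly to the character-level comparison of two editions mentioned above, the following is a minimal sketch of Levenshtein distance in plain Python. The function names, the `similarity` helper, and the two sample readings are illustrative assumptions, not material from any edition discussed here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # previous[j] is the distance between the part of `a` processed so far
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Two made-up readings of the same line in different editions.
edition_1 = "the colour of the sky"
edition_2 = "the color of the skies"
print(levenshtein(edition_1, edition_2), round(similarity(edition_1, edition_2), 2))
```

As the text notes, a score like this only registers differences between characters on the page; it says nothing about meaning, which is where trained embeddings come in.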
A historian transcribes information from an archival source into a notebook. At their core, these actions add observations and associations to the original materials. They are the first, most basic step in the research process, connecting information in a source to a larger set of research materials. We add context and meaning to materials that make them part of a larger collection.
When working with texts or images, machine learning models are presently capable of making simple annotations and associations. For example, named entity recognition (NER) models are able to recognize person names, place names, and other key words in text. Each label is an annotation that makes a claim about the content of the text. “Steamboat Springs” or “New York City” are linked to an entity called PLACE. Once again, we are speaking about the most basic first steps that scholars perform during research. I know that Steamboat Springs is a place. It’s where I grew up. However, another scholar, one less versed in small mountain towns in Colorado, might not recognize the town name. They might identify it as a spring or a ski resort; perhaps a volcanic field in Nevada. The idea of “scholarly primitives” forces us to confront the importance of domain knowledge and the role that it plays in the interpretation of materials. To teach a machine to find entities, we must first explain everything in very specific terms. We can train the machine to use surrounding contextual information in order to predict correctly whether “Steamboat Springs” refers to a town, a spring, or a ski resort.
As part of a project with Philip Gleissner, I trained a model that correctly identifies Soviet journal names in diary entries. For instance, the machine uses contextual clues to identify when the term Volga refers to the journal by that name and not to the river or the automobile. When is a mention of “October” a journal name and not a month, a factory name, or the revolution?
The trained model makes it possible to identify references to journals in a corpus of over 400,000 diary entries. This in turn makes it possible to research the diaries with a focus on reader reception. Normally, this would be a laborious and time-consuming task. Each time the machine predicts an entity in the text, it adds annotations. What was simply text is now marked as an entity. As part of this project, we had to define the relevant entities, create training data, and train the model to accomplish a specific task. This process has tangible value for scholarship because it forces us to break down complicated research processes into their most basic tasks and processes.
1 Named after the Soviet mathematician Vladimir Levenshtein, Levenshtein distance uses the number of changes that would be needed to make two objects identical as a measure of their similarity.
As noted before, annotation can be an act of association and linking. Natural language processing is capable of not only recognizing entities in a text, but also associating that text with a record in a knowledge base. This capability is called named entity linking. Using embeddings, a statistical language model can not only predict that “Steamboat Springs” is a town, but that it is a specific town with the record Q984721 in Wikidata. This association opens a wealth of contextual information about the place, including its population, latitude and longitude, and elevation. A scholar might have ample knowledge and experience reading literature, specifically Milton. A machine does not, but it has access to context information that enriches analysis and permits associations. The result is a reading of a literary work that accounts for contextual knowledge. To be sure, named entity linking is not a replacement for domain knowledge.
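To make the idea of linking a mention to a knowledge-base record more concrete, here is a deliberately simple sketch. It swaps the embeddings described above for plain keyword overlap, and the miniature knowledge base is hypothetical: only the identifier Q984721 comes from the text, while the second record, its identifier, and all keywords are made up for illustration.

```python
# Hypothetical miniature knowledge base. Only the identifier Q984721 is taken
# from the text; the second record and every keyword are illustrative.
KNOWLEDGE_BASE = [
    {"id": "Q984721", "label": "Steamboat Springs", "kind": "town",
     "keywords": {"colorado", "town", "ski", "mountain"}},
    {"id": "EXAMPLE-2", "label": "Steamboat Springs", "kind": "volcanic field",
     "keywords": {"nevada", "volcanic", "geothermal", "field"}},
]


def link_entity(mention, context):
    """Return the candidate record whose keywords overlap most with the
    words around the mention, or None if no record has that label."""
    context_words = {word.strip(".,;:!?").lower() for word in context.split()}
    candidates = [record for record in KNOWLEDGE_BASE
                  if record["label"].lower() == mention.lower()]
    if not candidates:
        return None
    # Keyword overlap stands in for the similarity an embedding model
    # would compute between the context and each candidate record.
    return max(candidates,
               key=lambda record: len(record["keywords"] & context_words))


record = link_entity("Steamboat Springs",
                     "I grew up in this small mountain town in Colorado.")
print(record["id"], record["kind"])
```

Once a mention resolves to a record, whatever that record holds becomes available as context for the passage, which is the enrichment described above.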
However, named entity linking is able to augment a scholar’s contextual knowledge of materials and make that information available for study during research. At this point, we are asking the machine not only to sort or filter data, but to reason actively about its contents.
Machine learning offers the potential to automate humanities annotation tasks at scale. This is true of basic tasks, such as recognizing that a given text is a letter. It is also true of object recognition tasks, such as identifying a state seal in a letterhead or other visual attributes. A Haverford College student was doing research on documents in a digital archive of more than three thousand case investigations of disappeared persons during the Guatemalan Civil War, which we are building with the Grupo de Apoyo Mutuo (GAM). They noticed that many of the documents were signed with a thumbprint. The student and I trained an image classification model to identify those documents, thus providing the capability to search the entire collection of documents for this visual attribute. The thumbprints provided a proxy for literacy and allowed the student to study the collection in new ways. Similarly, documents containing the state seal of Guatemala are typically letters from the government in reply to GAM’s requests for information about disappeared persons.
At present, several excellent tools exist to facilitate machine annotation of images and texts. Google’s Teachable Machine offers an intuitive web application that humanities faculty and students can use to train classification models for images, sounds, and poses. To take the example above, the user would upload images of correspondence. They would then upload images of documents that are not letters.2 Once training begins, a base model is loaded and trained on the new categories. Because the model already has existing training on image categories, it is able to learn the new category with only a few examples. This process is called transfer learning.
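The logic of transfer learning can be sketched without any machine learning library. In the toy example below, `base_features` is a hypothetical stand-in for a frozen, pretrained base model, and only a small "head" (one centroid per label) is fitted to a handful of new examples. The tiny pixel grids are invented, and none of this reflects how Teachable Machine is actually implemented.

```python
from statistics import mean


def base_features(image):
    """Stand-in for a frozen, pretrained base model (hypothetical).
    `image` is a list of pixel rows with values between 0 and 1; the two
    features are average brightness and average horizontal contrast."""
    pixels = [p for row in image for p in row]
    contrast = [abs(row[i] - row[i + 1])
                for row in image for i in range(len(row) - 1)]
    return (mean(pixels), mean(contrast))


def train_head(examples):
    """Fit only the new head: one centroid per label in feature space."""
    by_label = {}
    for image, label in examples:
        by_label.setdefault(label, []).append(base_features(image))
    return {label: tuple(mean(dim) for dim in zip(*features))
            for label, features in by_label.items()}


def predict(head, image):
    """Return the label whose centroid is closest to the image's features."""
    features = base_features(image)
    return min(head, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(features, head[label])))


# Invented miniature scans: letters are light with dark strokes,
# non-letters (photographs, say) are darker and smoother.
letter_a = [[1.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0]]
letter_b = [[1.0, 1.0, 0.0, 1.0], [0.0, 1.0, 1.0, 1.0]]
photo_a = [[0.3, 0.4, 0.3, 0.4], [0.4, 0.3, 0.4, 0.3]]
photo_b = [[0.2, 0.3, 0.3, 0.2], [0.3, 0.3, 0.2, 0.2]]

head = train_head([(letter_a, "letter"), (letter_b, "letter"),
                   (photo_a, "not letter"), (photo_b, "not letter")])
new_scan = [[1.0, 0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.0]]
print(predict(head, new_scan))
```

Because the feature extractor is reused unchanged, four labeled examples are enough to fit the head here; in a real system the base model is a neural network trained on a large image collection.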
For more advanced tasks, Google offers AutoML Vision and Natural Language, which are able to process large collections of text or images and to deploy trained models using Google Cloud infrastructure. Similar products are available from Amazon, IBM, and other companies. Runway ML offers a locally installed program with more advanced capabilities than the Teachable Machine. Runway ML works with a wide range of machine learning models and is an excellent way for scholars to explore their capabilities without having to write code.3 The accessibility of tools like Runway allows for low-stakes experimentation and exploration. It is also a particularly good way for scholars to explore new methods and discover new materials.
2 In the Google Cloud Terms of Service there is specific assurance that your data will not be shared or used for any other purpose than the training of the model. More expert analysis may find concerns, and caution is always warranted. At present, there seems to be no more risk in using cloud services for ML tasks than there is in using cloud services more generally. See https://cloud.google.com/terms/.
3 Teachable Machine, https://teachablemachine.withgoogle.com/; Google AutoML, https://cloud.google.com/automl/; RunwayML, https://runwayml.com/.
For Unsworth, discovery is largely the process of identifying new resources. We can find new sources in a library catalog, on the shelf, or in a conversation. These activities require a human in the loop because it is the person’s incomplete knowledge of a source that makes it a “discovery” when found. Given that machines reason about the content of text and images in ways that are quite unlike those of humans, machine learning opens new possibilities for discovery.
When it comes to the differences between our own habits of mind and the computational processes of artificial networks, we may speak of “neurodiversity.” Scholars can benefit from these differences, since the strengths of machine thinking complement our needs.
Machine learning models offer a variety of ways to identify similarity and difference with research materials. Yale’s PixPlot, for example, uses a convolutional network to train image embeddings which are then plotted relative to one another in two-dimensional space with t-distributed stochastic neighbor embedding (t-SNE) (Duhaime n.d.).4 PixPlot creates a striking visualization of hundreds or thousands of images, which are organized and clustered by their relative visual similarity. As a research tool, PixPlot and similar projects offer a quick means to identify statistically relevant similarities and clusters. This visualization reveals what patterns are most evident to the machine and provides a discovery tool for associations that might not be evident to a human researcher. Ben Schmidt has applied a comparable process to “machine read” and visualize fourteen million texts in the HathiTrust (n.d., 2018).5 Using the relative co-occurrence of words in a book, Schmidt is able to train book embeddings. Schmidt’s vectors provide an original way to organize and label texts based purely on the machine’s “reading” of a book. These machine-generated labels and clusters can be compared against human-generated metadata.
The value of this work is the human investigation of what machine models find significant in a collection of research materials. For example, with topic modeling, a scholar must interpret what a particular algorithm has identified as a statistically significant topic by interpreting a cryptic chain of words. The topic “menu, platter, coffee, ashtray” is likely related to a diner. In these efforts, Scattertext offers an effective tool to visualize what terms are most distinctive of a text category.
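What it means for a term to be "distinctive" of a category can be illustrated with a small sketch. The scoring below is a smoothed log-ratio of relative frequencies, chosen for simplicity; it is an assumption of this example and should not be read as Scattertext's own scoring method. The function name and the four sample sentences are likewise invented.

```python
from collections import Counter
import math


def distinctive_terms(docs_a, docs_b, top=3):
    """Rank words by the smoothed log-ratio of their relative frequency in
    category A versus category B. Returns (most typical of A, most typical of B)."""
    counts_a = Counter(word for doc in docs_a for word in doc.lower().split())
    counts_b = Counter(word for doc in docs_b for word in doc.lower().split())
    vocabulary = set(counts_a) | set(counts_b)
    # Add-one smoothing keeps words that are absent from one category
    # from producing a division by zero.
    total_a = sum(counts_a.values()) + len(vocabulary)
    total_b = sum(counts_b.values()) + len(vocabulary)
    score = {word: math.log((counts_a[word] + 1) / total_a)
                   - math.log((counts_b[word] + 1) / total_b)
             for word in vocabulary}
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(vocabulary, key=lambda word: (-score[word], word))
    return ranked[:top], ranked[-top:][::-1]


# Invented miniature corpora standing in for poetry and prose.
poetry = ["o rose thou art sick", "the rose and the worm"]
prose = ["the report was filed on monday", "the committee read the report"]
typical_of_poetry, typical_of_prose = distinctive_terms(poetry, prose)
print(typical_of_poetry, typical_of_prose)
```

Words shared evenly by both categories, such as "the", score near zero and drop out of both lists, which is the same intuition behind plotting terms by how strongly they lean toward one category.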
In a given corpus of text, I can identify which words are most exemplary of poetry and which words are most exemplary of prose. Scattertext creates a striking and useful visualization, or it can be used in the terminal to process large collections of text.
4 See also https://artsexperiments.withgoogle.com/tsnemap/.
5 At time of writing, Schmidt’s digital monograph Creating Data (n.d.) is a work in progress, with most sections empty until the official publication.
Conclusion
As a conceptual tool, “scholarly primitives” has considerable promise to connect the intellectual goals of academic researchers in the humanities with the technical possibilities of machine learning. Rather than focusing on the capabilities of machine learning methods and the priorities of machine learning researchers, this method offers a means to build from the existing research practices of humanities scholars. It allows us to identify what kinds of tasks would benefit from being augmented. Using “primitives” shifts the focus away from large abstract goals, such as research findings and interpretive methods, to micro-methods and actions of humanities research. By augmenting these activities, we are able to benefit from the scale and precision afforded by computational methods, as well as the valuable interplay between scholars and machines as humanities research practices are made explicit and reproducible.
References
Chollet, François. 2020. “Our Field Isn’t Quite ‘Artificial Intelligence’ — It’s ‘Cognitive Automation’: The Encoding and Operationalization of Human-Generated Abstractions / Behaviors / Skills. The ‘Intelligence’ Label Is a Category Error.” Twitter, January 6, 2020, 10:45 p.m. https://twitter.com/fchollet/status/1214392496375025664.
Duhaime, Douglas. n.d. “PixPlot.” Yale DHLab. Accessed July 12, 2020. https://dhlab.yale.edu/projects/pixplot/.
Schmidt, Benjamin. n.d. “A Guided Tour of the Digital Library.” In Creating Data: The Invention of Information in the American State, 1850-1950. http://creatingdata.us/datasets/hathi-features/.
Schmidt, Benjamin. 2018. “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries.” Journal of Cultural Analytics, October. https://doi.org/10.22148/16.025.
Unsworth, John. 2000. “Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?” Paper presented at the Symposium on Humanities Computing: Formal Methods, Experimental Practice, King’s College, London, May 2000. http://www.people.virginia.edu/~jmu2m/Kings.5-00/primitives.html.
jiang-cross-2021 ---- Chapter 6 Cross-Disciplinary ML Research is like Happy Marriages: Five Strengths and Two Examples Meng Jiang University of Notre Dame
Top Strengths in ML+X Collaboration
Cross-disciplinary research refers to research and creative practices that involve two or more academic disciplines (Jeffrey 2003; Karniouchina, Victorino, and Verma 2006). These activities may range from those that simply place disciplinary insights side by side to much more integrative or transformative approaches (Aagaard-Hansen 2007; Muratovski 2011). Cross-disciplinary research matters, because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (Urquhart et al. 2011; O’Rourke, Crowley, and Gonnerman 2016; Miller and Leffert 2018).
One of the most popular cross-disciplinary research topics/programs is Machine Learning + X (or Data Science + X). Machine learning (ML) is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. ML has been used in a variety of applications (Murthy 1998), such as email filtering and computer vision; however, most applications still fall in the domain of computer science and engineering. Recently, the power of ML+X, where X can be any other discipline (such as physics, chemistry, biology, sociology, and psychology), has become well recognized.
ML tools can reveal profound insights hiding in ballooning datasets (Kohavi et al. 1994; Pedregosa et al. 2011; Kotsiantis 2012; Mullainathan and Spiess 2017).
However, cross-disciplinary research, which ML+X is part of, is challenging. Collaborating with investigators outside one’s own field requires more than just adding a co-author to a paper or proposal. True collaborations will not always be without conflict, because a lack of information leads to misunderstandings. For example, ML experts may have little domain knowledge in the field of X, and researchers in X may not understand ML either. The knowledge gap limits the progress of collaborative research. So how can we start and manage successful cross-disciplinary research? What can we do to facilitate collaborative behaviors?
In this essay, I will compare cross-disciplinary ML research to “happy marriages,” discussing some characteristics they share. Specifically, I will present the top strengths of conducting cross-disciplinary ML research and give two examples based on my experience of collaborating with historians and psychologists.
Marriage is one of the most common “collaborative” behaviors. Couples expect to have happy marriages, just like collaborators expect to have successful project outcomes (Robinson and Blanton 1993; Pettigrew 2000; Xu et al. 2007). Extensive studies have revealed the top strengths of happy marriages (DeFrain and Asay 2007; Gordon and Baucom 2009; Prepare/Enrich, n.d.), which can be reflected in cross-disciplinary ML research. Here I focus on five of them:
1. Collaborators (“partners” in the language of marriage) are satisfied with communication.
2. Collaborators feel very close to each other.
3. Collaborators discuss their problems well.
4. Collaborators handle their differences creatively.
5. There is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.).
First of all, communication is the exchange of information to achieve a better understanding; and collaboration is defined as the process of working together with another person to achieve an end goal. Effective collaboration is about sharing information, knowledge, and resources to work together through satisfactory communication. Ineffectiveness or lack of communication is one of the biggest challenges in ML+X collaboration.
Second, researchers in different disciplines meet different challenges through the process of collaboration. Making the challenges clear to understand and finding solutions together are at the core of effective collaboration.
Third, researchers in different disciplines can collaborate only when they recognize mutual interest and feel that the research topics they have studied in depth are very close to each other. Collaborators must be interested in solving the same, big problem.
Fourth, collaborators must embrace their differences on concepts and methods and take advantage of them. For example, one researcher can introduce a complementary method to the mix of other methods that the collaborator has been using for a long time; or one can have a new, impactful dataset and evaluation method to test the techniques proposed by the other.
Fifth, in strong collaboration, there is a balance between separateness and togetherness. Meetings are an excellent use of time for having integrated perspectives and productive discourse around difficult decisions. However, excessive collaboration happens when researchers are depleted by too many meetings and emails. It can lead to inefficient, unproductive meetings. So it is important to find a balance.
Next, I, as a computer scientist and ML expert, will discuss two ML+X collaborative projects. ML experts bring mathematical modeling and computational methods for mining knowledge from data.
The solutions usually have good generalizability; however, they still need to be tailored for specialized domains or disciplines.
Example 1: ML + History
The history professor Liang Cai and I have collaborated on an international research project titled “Digital Empires: Structured Biographical and Social Network Analysis of Early Chinese Empires.” Dr. Cai is well known for her contributions to the fields of early Chinese empires, Classical Chinese thought (in particular, Confucianism and Daoism), digital humanities, and the material culture and archaeological texts of early China (Cai 2014). Our collaboration explores how digital humanities expand the horizon of historical research and help visualize the research landscape of Chinese history. Historical research is often constrained by sources and the human cognitive capacity for processing them. ML techniques may enhance historians’ abilities to organize and access sources as they like. ML techniques can even create new kinds of sources at scale for historians to interpret. “The historians pose the research questions and visualize the project,” said Cai. “The computer scientists can help provide new tools to process primary sources and expand the research horizon.”
We conducted a structured biographical analysis to leverage the development of machine learning techniques, such as neural sequence labeling and textual pattern mining, which allowed classical sources of Chinese empires to be represented in an encoded way. The project aims to build a digital biographical database that sorts out different attributes of all recorded historical actors in available sources. Breaking with traditional formats, ML+History creates new opportunities and augments our way of understanding history.
First, it helps scholars, especially historians, change their research paradigm, allowing them to generalize their arguments with sufficient examples.
ML techniques can find all examples in the data where manual investigation may miss some. Also, abnormal cases can indicate a new discovery. As far as early Chinese empires are concerned, ML promises to automate mining and encoding all available biographical data, which allows scholars to change the perspective from one person to a group of persons with shared characteristics, and to shift from analyzing examples to relating a comprehensive history. Therefore, scholars can identify general trends efficiently and present an information-rich picture of historical reality using ML techniques.
Second, the structured data produced by ML techniques revolutionize the questions researchers ask, thereby changing the research landscape. Because of the lack of efficient tools, there are numerous interesting questions scholars would like to ask but cannot. For example, the geographical mobility of historical actors is an intriguing question for early China, the answer to which would show how diversified regions were integrated into a unified empire. Nevertheless, an individual historian cannot efficiently process the massive amount of information preserved in the sources. With ML techniques, we can generate fact tuples to sort out original geographical places of all available historical actors and provide comprehensive data for historians to analyze.
Figure 6.1: The graph presents a visual of the social network of officials who served in the government about 2,000 years ago in China. The network describes their relationships and personal attributes.
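The step from textual patterns to fact tuples, described above, can be sketched with ordinary regular expressions. This is only an illustration under stated assumptions: the project works on classical Chinese with mined patterns and neural sequence labeling, whereas the patterns, the `NAME` shorthand, and the sentences below are invented English stand-ins. The relation names `place_of_birth` and `job_title` follow the glosses in Table 6.1; `taught_by` is a label made up for this example.

```python
import re

# A name here is one or more capitalized words. This is an assumption that
# only suits the invented English sentences below.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical English stand-ins for mined textual patterns.
PATTERNS = {
    "taught_by": re.compile(rf"({NAME}) was taught by ({NAME})"),
    "place_of_birth": re.compile(rf"({NAME}) was born in ({NAME})"),
    "job_title": re.compile(rf"({NAME}) served as ([a-z]+(?: [a-z]+)*)"),
}


def extract_tuples(text):
    """Return (subject, relation, object) tuples for every pattern match."""
    facts = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            facts.append((match.group(1), relation, match.group(2)))
    return facts


# Invented sentences about invented people, for illustration only.
text = ("Scholar Li was taught by Master Wang. "
        "Scholar Li was born in River Town. "
        "Scholar Li served as imperial secretary.")
for fact in extract_tuples(text):
    print(fact)
```

Tuples of this shape can be loaded into a database or used as the edges of a social network like the one in Figure 6.1, with people as nodes and relations as labeled links.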
Patterns Mined by ML Tech (English gloss) | Extracted Relations
$PER_X was taught by $PER_Y on $KLG (knowledge) | [three example tuples; Chinese characters lost in extraction]
$PER_X was taught/mentored by $PER_Y | [two example tuples; Chinese characters lost in extraction]
$PER_X taught $PER_Y | [two example tuples; Chinese characters lost in extraction]
$PER place_of_birth $LOC | [two example tuples; Chinese characters lost in extraction]
$PER job_title $TIT (three different Chinese patterns share this gloss) | [two example tuples per pattern; Chinese characters lost in extraction]
Table 6.1: Examples of Chinese Text Extraction Patterns
Third, the project revolutionizes our reading habits. Large datasets mined from primary sources will allow scholars to combine distant reading with original texts. The macro picture generated from data will aid in-depth analysis of the event against its immediate context. Furthermore, graphics of social networks and common attributes of historical figures will change our reading habits, transforming linear storytelling to accommodate multiple narratives (see Figure 6.1).
Researchers from the two sides develop collaboration through the project step by step, just like developing a relationship for marriage. Ours started at a faculty gathering from some random chat about our research. As the historian is open-minded to ML technologies and the ML expert is willing to create broader impact, we brainstormed ideas that would not have developed without taking care of the five important points:
1. Communication: With our research groups, we started to meet frequently at the beginning. We set up clear goals at the early stage, including expected outcomes, publication venues, and joint proposals for funding agencies, such as the National Endowment for the Humanities (NEH) and Notre Dame seed grant funding. Our research groups met almost twice a week for as long as three weeks.
2. Feel very close to each other: Besides holding meetings, we exchanged our instant messenger accounts so we could communicate faster than email.
We created Google Drive space to share readings, documents, and presentation slides. We found many tools to create “tight relationships” between the groups at the beginning.

3. Discuss their problems well: Whenever we had misunderstandings, we discussed our problems. Historians learned about what a machine does, what a machine can do, and generally how a machine works toward the task. ML people learned what is interesting to historians and what kind of information is valuable. We hold to the principle that if a problem exists, it makes sense; any problem that either side encounters is worth a discussion. We needed to solve problems together from the moment they became our problems.

4. Handle their differences creatively: Historians are among the few who can read and write in classical Chinese. Classical Chinese was used as the written language from over 3,000 years ago to the early 20th century. Since then, mainland China has used either Mandarin (simplified Chinese) or Cantonese, while Taiwan has used traditional Chinese. None of these is at all similar to classical Chinese. In other words, historians work on a language that no ML experts here, even those who speak modern Chinese, can understand. So we handle our language differences “creatively” by using the translated version as the intermediate medium. Historians have translated history books in classical Chinese into simplified Chinese so we can read the simplified version. Here, the idea is to let the machine learning algorithms read both versions. We find that information extraction (i.e., finding relations from text) and machine translation (i.e., from classical Chinese to modern Chinese) can mutually enhance each other, which turns out to be one of our novel technical contributions to the field of natural language processing.

5.
Good balance of time alone and together: After the first month, since the project goal, datasets, background knowledge, and many other aspects were clear in both sides’ minds, we had regular meetings in a less intensive manner. We met two or three times a month so that computer science students could focus on developing machine learning algorithms, and only when significant progress was made or expert evaluation was needed would we schedule a quick appointment with Prof. Liang Cai.

So far, we have published peer-reviewed papers on the topic of information extraction and entity retrieval in classical Chinese history books using ML (Ma et al. 2019; Zeng et al. 2019). We have also submitted joint proposals with the above work as preliminary results to NEH.

Example 2: ML + Psychology

I am working with Drs. Ross Jacobucci and Brooke Ammerman in psychology to apply ML to understand mental health problems and suicidal intentions. Suicide is a serious public health problem; however, suicides are preventable with timely, evidence-based interventions. Social media platforms have been serving users who are experiencing real-time suicidal crises and who hope to receive peer support. To better understand the helpfulness of peer support occurring online, we characterize the content of both a user’s post and the corresponding peer comments on a social media platform and present an empirical example for comparison. We have designed a new topic-model-based approach to finding the topics of user and peer posts in social media forum data. The key advantages include: (i) modeling both the generative process of each type of corpus (i.e., user posts and peer comments) and the associations between them, and (ii) using phrases, which are more informative and less ambiguous than words alone, to represent social media posts and topics. We evaluated the method using data from Reddit’s r/SuicideWatch community.

Figure 6.2: Screenshot of r/SuicideWatch on Reddit.
We examined how the topics of user and peer posts were associated and how this information influenced the perceived helpfulness of peer support. Then, we applied structural topic modeling to data collected from individuals with a history of suicidal crisis as a means to validate findings. Our observations suggest that effective modeling of the association between the two lines of topics can uncover helpful peer responses to online suicidal crises, notably providing the suggestion of pursuing professional help. Our technology can be applied to “paired” corpora in many applications such as tech support forums and question-answering sites.

This project started from a talk I gave at the psychology graduate seminar. The fun thing is that Dr. Jacobucci was not able to attend the talk. Another psychology professor who attended my talk asked constructive questions and mentioned my research to Dr. Jacobucci when they met later. So Dr. Jacobucci dropped me an email, and we had coffee together. Cross-disciplinary research often starts from something that sounds like developing a relationship. Because, again, the psychologists are open-minded to ML technologies and the ML expert is willing to create broader impact, we successfully brainstormed ideas when we had coffee, but this would not have developed into long-term collaboration without the following efforts: (1) Communicate intensively between research groups at the early stage. We had multiple meetings a week to make the goals clear. (2) Get students involved in the process. When my graduate student received more and more advice from the psychology professors and students, the connections between the two groups became stronger. (3) Discuss the challenges in our fields very well. We analyzed together whether machine learning would be capable of addressing the challenges in mental health. We also analyzed whether domain experts could be involved in the loop of machine learning algorithms. (4) Handle our differences.
We separately presented our research and then found times to work together to put sets of slides together based on one common vision and goal. (5) After the first month, only hold meetings when discussion is needed or there is an approaching deadline for either a paper or a proposal.

We have enjoyed our collaboration and the power of cross-disciplinary research. Our joint work is under review at Nature Palgrave Communications. We have also submitted joint proposals to NIH with this work as preliminary results (Jiang et al. 2020).

Conclusions

In this essay, I used a metaphor comparing cross-disciplinary ML research to “happy marriages.” I discussed five characteristics they share. Specifically, I presented the top strengths that produce successful cross-disciplinary ML research: (1) Partners are satisfied with communication. (2) Partners feel very close to each other. (3) Partners discuss their problems well. (4) Partners handle their differences creatively. (5) There is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.). While every project is different and will produce its own challenges, my experience of collaborating with historians and psychologists according to the happy marriage metaphor suggests that it is a simple and strong paradigm that could help other interdisciplinary projects develop into successful, long-term collaborations.

References

Aagaard-Hansen, Jens. 2007. “The Challenges of Cross-Disciplinary Research.” Social Epistemology 21, no. 4 (October-December): 425–38. https://doi.org/10.1080/02691720701746540.

Cai, Liang. 2014. Witchcraft and the Rise of the First Confucian Empire. Albany: SUNY Press.

DeFrain, John, and Sylvia M. Asay. 2007. “Strong Families Around the World: An Introduction to the Family Strengths Perspective.” Marriage & Family Review 41, no. 1–2 (August): 1–10. https://doi.org/10.1300/J002v41n01_01.

Gordon, Cameron L., and Donald H. Baucom. 2009. “Examining the Individual Within Marriage: Personal Strengths and Relationship Satisfaction.” Personal Relationships 16, no. 3 (September): 421–35. https://doi.org/10.1111/j.1475-6811.2009.01231.x.

Jeffrey, Paul. 2003. “Smoothing the Waters: Observations on the Process of Cross-Disciplinary Research Collaboration.” Social Studies of Science 33, no. 4 (August): 539–62.

Jiang, Meng, Brooke A. Ammerman, Qingkai Zeng, Ross Jacobucci, and Alex Brodersen. 2020. “Phrase-Level Pairwise Topic Modeling to Uncover Helpful Peer Responses to Online Suicidal Crises.” Humanities and Social Sciences Communications 7: 1–13.

Karniouchina, Ekaterina V., Liana Victorino, and Rohit Verma. 2006. “Product and Service Innovation: Ideas for Future Cross-Disciplinary Research.” The Journal of Product Innovation Management 23, no. 3 (May): 274–80.

Kohavi, Ron, George John, Richard Long, David Manley, and Karl Pfleger. 1994. “MLC++: A Machine Learning Library in C++.” In Proceedings of the Sixth International Conference on Tools with Artificial Intelligence, 740–3. N.p.: IEEE. https://doi.org/10.1109/TAI.1994.346412.

Kotsiantis, S.B. 2012. “Use of Machine Learning Techniques for Educational Proposes [sic]: a Decision Support System for Forecasting Students’ Grades.” Artificial Intelligence Review 37, no. 4 (May): 331–44. https://doi.org/10.1007/s10462-011-9234-x.

Ma, Yihong, Qingkai Zeng, Tianwen Jiang, Liang Cai, and Meng Jiang. 2019. “A Study of Person Entity Extraction and Profiling from Classical Chinese Historiography.” In Proceedings of the 2nd International Workshop on EntitY REtrieval, edited by Gong Cheng, Kalpa Gunaratna, and Jun Wang, 8–15. N.p.: International Workshop on EntitY REtrieval. http://ceur-ws.org/Vol-2446/.

Miller, Eliza C. and Lisa Leffert. 2018. “Building Cross-Disciplinary Research Collaborations.” Stroke 49, no. 3 (March): e43-e45. https://doi.org/10.1161/strokeaha.117.020437.

Mullainathan, Sendhil, and Jann Spiess. 2017. “Machine learning: an applied econometric approach.” Journal of Economic Perspectives 31, no. 2 (spring): 87–106. https://doi.org/10.1257/jep.31.2.87.

Muratovski, Gjoko. 2011. “Challenges and Opportunities of Cross-Disciplinary Design Education and Research.” In Proceedings from the Australian Council of University Art and Design Schools (ACUADS) Conference: Creativity: Brain—Mind—Body, edited by Gordon Bull. Canberra, Australia: ACUADS Conference. https://acuads.com.au/conference/article/challenges-and-opportunities-of-cross-disciplinary-design-education-and-research/.

Murthy, Sreerama K. 1998. “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey.” Data Mining and Knowledge Discovery 2, no. 4 (December): 345–89. https://doi.org/10.1023/A:1009744630224.

O’Rourke, Michael, Stephen Crowley, and Chad Gonnerman. 2016. “On the Nature of Cross-Disciplinary Integration: A Philosophical Framework.” Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 56 (April): 62–70. https://doi.org/10.1016/j.shpsc.2015.10.003.

Pedregosa, Fabian et al. 2011. “Scikit-learn: Machine Learning in Python.” The Journal of Machine Learning Research 12: 2825–30. http://www.jmlr.org/papers/v12/pedregosa11a.html.

Pettigrew, Simone F. 2000. “Ethnography and Grounded Theory: a Happy Marriage?” In Association for Consumer Research Conference Proceedings, edited by Stephen J. Hoch and Robert J. Meyer, 256–60. Provo, UT: Association for Consumer Research. https://www.acrwebsite.org/volumes/8400/volumes/v27/.

Prepare/Enrich. N.d. “National Survey of Marital Strengths.” Prepare/Enrich (website). Accessed January 17, 2020. https://www.prepare-enrich.com/pe_main_site_content/pdf/research/national_survey.pdf.

Robinson, Linda C. and Priscilla W. Blanton. 1993. “Marital Strengths in Enduring Marriages.” Family Relations: An Interdisciplinary Journal of Applied Family Studies 42, no. 1 (January): 38–45. https://doi.org/10.2307/584919.

Urquhart, R., E. Grunfeld, L. Jackson, J. Sargeant, and G. A. Porter. 2013. “Cross-Disciplinary Research in Cancer: an Opportunity to Narrow the Knowledge–Practice Gap.” Current Oncology 20, no. 6 (December): e512–e521. https://doi.org/10.3747/co.20.1487.

Xu, Anqi, Xiaolin Xie, Wenli Liu, Yan Xia, and Dalin Liu. 2007. “Chinese Family Strengths and Resiliency.” Marriage & Family Review 41, no. 1–2 (August): 143–64. https://doi.org/10.1300/J002v41n01_08.

Zeng, Qingkai, Mengxia Yu, Wenhao Yu, Jinjun Xiong, Yiyu Shi, and Meng Jiang. 2019. “Faceted Hierarchy: A New Graph Type to Organize Scientific Concepts and a Construction Method.” In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), edited by Dmitry Ustalov, Swapna Somasundaran, Peter Jansen, Goran Glavaš, Martin Riedl, Mihai Surdeanu, and Michalis Vazirgiannis, 140–50. Hong Kong: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-5317.
johnson-preface-2021 ---- Preface

This collection of essays is the unexpected culmination of a 2018–2020 grant from the Institute of Museum and Library Services to the Hesburgh Libraries at the University of Notre Dame.[1] The plan called for a survey and a series of workshops hosted across the country to explore, originally, “the national need for library based topic modeling tools in support of cross-disciplinary discovery systems.” As the project developed, however, it became apparent that the scope of research should expand beyond topic modeling and that the scope of output might expand beyond a white paper. The end of the 2010s, we found, was swelling with library-centered investigations of broader machine learning applications across the disciplines, and our workshops demonstrated such a compelling mixture of perspectives on this development that we felt an edited collection of essays from our participants would be an essential witness to the moment in history. With remaining grant funds, we hosted one last workshop at Notre Dame to kick-start writing.

The resulting essays cover a wide ground. Some present a practical, “how-to” approach to the machine learning process for those who wish to explore it at their own institutions. Others present individual projects, examining not just technical components or research findings, but also the social, financial, and political factors involved in working across departments (and in some cases, across the town/gown divide). Others still take a larger panoramic view of the ethics and opportunities of integrating machine learning with cross-disciplinary higher education, veering between optimistic and wary viewpoints.

The multi-disciplinarity of the essayists and the diversity of their research give each chapter a sui generis flavor, though several shared concerns thread through the collection.
Most significantly, the authors suggest that while the technical aspects of machine learning are a challenge, especially when working with collaborators from different backgrounds, many of their key concerns are actually about the ethical and social dimensions of the work. In this sense, the collection is very much of the moment. Two large projects on machine learning, cross-disciplinarity, and libraries ran concurrently with our grant (Cordell 2020 and Padilla 2019, which were commissioned by major players in the field, the Library of Congress and OCLC, respectively), and both took pains to foreground the wider potential effects of machine learning. As Ryan Cordell puts it, “current cultural attention to ML may make it seem necessary for libraries to implement ML quickly. However, it is more important for libraries to implement ML through their existing commitments to responsibility and care” (1).

The voices represented here exhibit a thorough commitment to Cordell’s call for responsibility and care, and they are only a subset of the larger chorus that sounded at the workshops. We editors therefore encourage readers interested in this bigger picture to examine the meta-themes and detailed information that emerged in the course of the workshops and the original survey through the grant’s final report.[2] All of these pieces together capture a fascinating snapshot of an interdisciplinary field in motion.

[1] LG-72-18-0221-18: “Investigating the National Need for Library Based Topic Modeling Discovery Systems.” See https://www.imls.gov/grants/awarded/lg-72-18-0221-18.

We should note that the working methods of the collection’s editorial team were an attempt to extend the grant’s spirit of collaboration.
Through several stages of development, content editors Don Brower, Mark Dehmlow, Eric Morgan, Alex Papson, and John Wang reviewed assigned essays and provided commentary before notifying general editor Daniel Johnson for prose editing, who in turn shared the updated manuscripts with the authors so the cycle could begin again. The submissions, written variously in Microsoft Word or Google Docs format, were ushered through these stages of life in team Google Drive folders and tracked by spreadsheet before eventual conversion by Don Brower into a series of TeX files, provisioned in a version-controlled GitHub repository, for more fine-tuned final editing. Like working with diverse teams in the pursuit of machine learning, editing essays together in this fashion, for publication by the Hesburgh Libraries, was a novel way of collaborating, and we editors thought candor about this book-making process might prove insightful to readers.

Attending to the social dimensions of the work ourselves, we must note that this collection would not have been possible without the generous support of many people and organizations. We would like to thank the IMLS for providing essential funding support for the grant and the Hesburgh Libraries’ Edward H. Arnold University Librarian, Diane Parr Walker, for her organizational support. Thank you to the members of the Notre Dame IMLS grant team who, at its various stages, provided critical support in managing logistics, conducting research, facilitating workshops, and analyzing results. These individuals include John Wang (grant project director), Don Brower, Mark Dehmlow, Nastia Guimaraes, Melissa Harden, Helen Hockx-Yu, Daniel Johnson, Christina Leblang, Rebecca Leneway, Laurie McGowan, Eric Lease Morgan, and Alex Papson. The University of Notre Dame Office of General Counsel provided key publication advice, and the University of Notre Dame Office of Research provided critical support in administering the grant.
Again, many thanks.

We would also like to thank the co-signatories of the IMLS Grant Application for supporting the project’s goals: Mark Graves (then Visiting Research Assistant Professor, Center for Theology, Science, and Human Flourishing, University of Notre Dame), Pamela Graham (Director of Global Studies and Director of the Center for Human Rights Documentation and Research, Columbia University Libraries), and Ed Fox (Professor of Computer Science and Director of the Digital Library Research Laboratory, Virginia Polytechnic Institute and State University). And of course, thanks to the 95 participants in our 2019 IMLS Grant Workshops (too many to enumerate here) and to the essay authors for sharing their expertise and perspectives in growing our collective knowledge of machine learning and its use in research, scholarship, and cultural heritage organizations. Your active engagement continues to shape the field, and we look forward to your next achievements.

[2] See https://doi.org/10.7274/r0-320z-kn58.

References

Cordell, Ryan. 2020. “Machine Learning + Libraries: A Report on the State of the Field.” Commissioned by LC Labs, Library of Congress. https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf.

Padilla, Thomas. 2019. “Responsible Operations: Data Science, Machine Learning, and AI in Libraries.” Dublin, Ohio: OCLC Research. https://www.oclc.org/research/publications/2019/oclcresearch-responsible-operations-data-science-machine-learning-ai.html.
kim-ai-2021 ---- Chapter 7

AI and Its Moral Concerns

Bohyun Kim
University of Rhode Island

Automating Decisions and Actions

The goal of artificial intelligence (AI) as a discipline is to create an artificial system (whether a piece of software or a machine with a physical body) that is as intelligent as a human in its performance, either broadly in all areas of human activities or narrowly in a specific activity, such as playing chess or driving.[1] The actual capability of most AI systems remained far below this ambitious goal for a long time. But with recent successes in machine learning and deep learning, the performance of some AI programs has started surpassing that of humans. In 2016, an AI program developed with the deep learning method, AlphaGo, astonished even its creators by winning four out of five Go matches against the eighteen-time world champion, Sedol Lee.[2] In 2020, Google’s DeepMind unveiled Agent57, a deep reinforcement learning algorithm that reached superhuman levels of play in 57 classic Atari games.[3]

[1] Note that by ‘as intelligent as a human,’ I only mean AI at human-level performance in achieving a particular goal, not general (strong) AI. General AI, also known as ‘artificial general intelligence (AGI)’ and ‘strong AI,’ refers to AI with the ability to adapt to achieve any goals. By contrast, an AI system developed to perform only one or some activities in a specific domain is called a ‘narrow (weak) AI’ system.

[2] AlphaGo can be said to be “as intelligent as humans,” but only in playing Go, where it exceeds human capability. So, it does not qualify as general (strong) AI in spite of its human-level intelligence in Go-playing. It is to be noted that general (strong) AI and narrow (weak) AI signify the difference in the scope of AI capability. General (strong) AI is also a broader concept than human-like intelligence, either with its carbon-based substrate or with human-like understanding that relies on what we regard as uniquely human cognitive states such as consciousness, qualia, emotions, and so on. For more helpful descriptions of common terms in AI, see (Tegmark 2017, 39). For more on the match between AlphaGo and Sedol Lee, see (Koch 2016).

[3] Deep reinforcement learning is a type of deep learning that is goal-oriented and reward-based. See (Heaven 2020).

Early symbolic AI systems determined their outputs based upon given rules and logical inference. AI algorithms in these rule-based systems, also known as good old-fashioned AI (GOFAI), are pre-determined, predictable, and transparent. On the other hand, machine learning, another approach in AI, enables an AI algorithm to evolve to identify a pattern through the so-called ‘training’ process, which relies on a large amount of data and statistics. Deep learning, one of the widely used techniques in machine learning, further refines this training process using a ‘neural network.’[4] Machine learning and deep learning have brought significant improvements to the performance of AI systems in areas such as translation, speech recognition, and detecting objects and predicting their movements. Some people assume that machine learning completely replaced GOFAI, but this is a misunderstanding. Symbolic reasoning and machine learning are two distinct but not mutually exclusive approaches in AI, and they can be used together (Knight 2019a).

With their limited intelligence and fully deterministic nature, early rule-based symbolic AI systems raised few ethical concerns.[5] AI systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. As humans, we are trained to recognize situations that demand moral decision-making.
But how would an AI system be able to do so? Or should it be? With self-driving cars and autonomous weapons systems under active development and testing, these are no longer idle questions.

The Trolley Problem

Recent advances in AI, such as autonomous cars, have brought new interest to the trolley problem, a thought experiment introduced by the British philosopher Philippa Foot in 1967. In the standard version of this problem, a runaway trolley barrels down a track where five unsuspecting people are standing. You happen to be standing next to a lever that switches the trolley onto a different track, where there is only one person. Those who are on either track will be killed if the trolley heads their way. Should you pull the lever, so that the runaway trolley would kill one person instead of five? Unlike a person, a machine does not panic or freeze and simply follows and executes the given instruction. This means that an AI-powered trolley may act morally as long as it is programmed properly.[6] The question itself remains, however. Should the AI-powered trolley be programmed to swerve or stay on course?

Different moral theories, such as virtue ethics, contractarianism, and moral relativism, take different positions. Here, I will consider utilitarianism and deontology. Since their tenets are relatively straightforward, most AI developers are likely to look towards those two moral theories for guidance and insight. Utilitarianism argues that the utility of an action is what makes an action moral. In this view, what generates the greatest amount of good is the most moral thing to do. If one regards five human lives as a greater good than one, then one acts morally by pulling the lever and diverting the trolley to the other track. By contrast, deontology claims that what determines whether an action is morally right or wrong is not its utility but moral rules. If an action is in accordance with those rules, then the action is morally right. Otherwise, it is morally wrong. If not killing another human being is one of those moral rules, then killing someone is morally wrong even if it is to save more lives.

[4] Machine learning and deep learning have gained momentum because the cost of high-performance computing has significantly decreased and large data sets have become more widely available. For example, ImageNet contains more than 14 million hand-annotated images. The ImageNet data were used for the well-known annual AI competition for large-scale object detection and image classification from 2010 to 2017. See http://www.image-net.org/challenges/LSVRC/.

[5] For an excellent history of AI research, see chapter 1, “What is Artificial Intelligence,” of Boden 2016, 1-20.

[6] Programming here does not exclusively refer to a deep learning or machine learning approach.

Note that these are highly simplified accounts of utilitarianism and deontology. The good in utilitarianism can be interpreted in many different ways, and the issue of conflicting moral rules is a perennial problem that deontological ethics grapples with.[7] For our purpose, however, these simplified accounts are sufficient to highlight the aspects in which the utilitarian and the deontological position appeal to and go against our moral intuition at the same time.

If a trolley cannot be stopped, saving five lives over one seems to be the right thing to do. Utilitarianism appears to get things right in this respect. However, it is hard to dispute that killing people is wrong. If killing is morally wrong no matter what, deontology seems to make more sense. With moral theories, things seem to get more confusing.

Furthermore, consider the case in which one freezes and fails to pull the lever. According to utilitarianism, this would be morally wrong because it fails to maximize the good, i.e., human lives.
But how far should one go to maximize the good? Suppose there is a very large person on a footbridge over the trolley track, and one pushes that person off the footbridge onto the track, thus stopping the trolley and saving the five people. Would this count as a right thing to do? Utilitarianism may argue that. But in real life, many would consider throwing a person morally wrong but pulling the lever morally permissible.8 The problem with utilitarianism is that it treats the good as something inherently quantifi- able, comparable, calculable, and additive. But not all considerations that we have to factor into moral decision-making are measurable in numbers. What if the five people on the track are help- less babies or murderers who just escaped from the prison? Would or should that affect our de- cision? Some of us would surely hesitate to save the lives of five murderers by sacrificing one innocent baby. But what if things were different and we were comparing five school children ver- sus one baby or five babies versus one school child? No one can say for sure what is the morally right action in those cases.9 While the utilitarian position appears less persuasive in light of these considerations, deon- tology doesn’t fare too well, either. Deontology emphasizes one’s duty to observe moral rules. But what if those moral rules conflict with one another? Between the two moral rules, “do not kill a person” and “save lives,” which one should trump the other? The conflict among values is common in life, and deontology faces difficulty in guiding how an intelligent agent is to act in a tricky situation such as the trolley problem.10 Understanding What Ethics Has to Offer Now, let us consider AI-powered military robots and autonomous weapons systems since they present the moral dilemma in the trolley problem more convincingly due to the high stakes in- volved. 
Suppose that some engineers, following utilitarianism and interpreting victory as the ultimate good/utility, wish to program an unmanned aerial vehicle (UAV) to autonomously drop bombs in order to maximize the chances of victory. That may result in sacrificing a greater number of civilians than necessary, and many will consider this to be morally wrong. Now imagine different engineers who, adopting deontology and following the moral principle of not killing people, program a UAV to autonomously act in a manner that minimizes casualties. This may lead to defeat on the battlefield, because minimizing casualties may not always be advantageous to winning a war. From these examples, we can see that philosophical insights from utilitarianism and deontology may provide little practical guidance on how to program autonomous AI systems to act morally.

7 For an overview, see Sinnott-Armstrong 2019 and Alexander and Moore 2016.
8 For an empirical study on this, see Cushman, Young, and Hauser 2006. For the results of a similar survey that involves an autonomous car instead of a trolley, see Bonnefon, Shariff, and Rahwan 2016.
9 For an attempt to identify moral principles behind our moral intuition in different versions of the trolley problem and other similar cases, see Thomson 1976.
10 Some moral philosophers doubt the value of our moral intuition in constructing a moral theory. See Singer 2005, for example. But a moral theory that clashes with common moral intuition is unlikely to be sought out as a guide to making an ethical decision.

76 Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 7

Ethicists seek abstract principles that can be generalized. For this reason, they are interested in borderline cases that reveal subtle differences in our moral intuition and varying moral theories. Their goal is to define what is moral and investigate how moral reasoning works or should work.
By contrast, engineers and programmers pursue practical solutions to real-life problems and look for guidelines that will help with implementing those solutions. Their focus is on creating a set of constraints and if-then statements, which will allow a machine to identify and process morally relevant considerations, so that it can determine and execute an action that is not only rational but also ethical in the given situation.11

On the other hand, the goal of military commanders and soldiers is to end a conflict, bring peace, and facilitate restoring and establishing universally recognized human values such as freedom, equality, justice, and self-determination. In order to achieve this goal, they must make the best strategic decisions and take the most appropriate actions. In deciding on those actions, they are also responsible for abiding by the principles of jus in bello and for not abdicating their moral responsibility, protecting civilians and minimizing harm, violence, and destruction as much as possible.12 The goal of military commanders and soldiers, therefore, differs from those of moral philosophers or of the engineers who build autonomous weapons. They are obligated to make quick decisions in a life-or-death situation while working with AI-powered military systems.

These different goals and interests explain why moral philosophers’ discussion on the trolley problem may be disappointing to AI programmers or military commanders and soldiers. Ethics does not provide an easy answer to the question of how one should program moral decision-making into intelligent machines. Nor does it prescribe the right moral decision on a battlefield. But taking this as a shortcoming of ethics is missing the point. The role of moral philosophy is not to make decision-making easier but to highlight and articulate the difficulty and complexity involved in it.
Ethical Challenges from Autonomous AI Systems

The complexity of ethical questions means that dealing with the morality of an action by an autonomous AI system will require more than a clever engineering or programming solution. The fact that ethics does not eliminate the inherent ambiguity in many moral decisions should not lead to the dismissal of ethical challenges from autonomous AI systems. By injecting the capacity for autonomous decision-making into machines, AI can fundamentally transform any given field. For example, AI-powered military robots are not just another kind of weapon. When widely deployed, they can change the nature of war itself. Described below are some of the significant ethical challenges that autonomous AI systems such as military robots present. Note that in spite of these ethical concerns, autonomous AI systems are likely to continue to be developed and adopted in many areas as a way to increase efficiency and lower cost.

11 Note that this moral decision-making process can be modeled with a rule-based symbolic AI approach, a machine learning approach, or a combination of both. See Conitzer et al. 2017.
12 For the principles of jus in bello, see International Committee of the Red Cross 2015.

(a) Moral desensitization

AI-powered military robots are more capable than merely remotely-operated weapons. They can identify a target and initiate an attack on their own. Due to their autonomy, military robots can significantly increase the distance between the party that kills and the party that gets killed (Sharkey 2012). This increase, however, may lead people to surrender their own moral responsibility to a machine, thereby resulting in the loss of humanity, which is a serious moral risk (Davis 2007). The more autonomous military robots become, the less responsibility humans will feel regarding their life-or-death decisions.
(b) Unintended outcome

The side that deploys AI-powered military robots is likely to suffer fewer casualties itself while inflicting more casualties on the enemy side. This may make the military more inclined to start a war. Ironically, when everyone thinks and acts this way, the number of wars and the overall amount of violence and destruction in the world will only increase.13

(c) Surrender of moral agency

AI-powered military robots may fail to distinguish innocents from combatants and kill the former. In such a case, can we be justified in letting robots take the lives of other human beings? Some may argue that only humans should decide to kill other humans, not machines (Davis 2007). Is it permissible for people to delegate such a decision to AI?

(d) Opacity in decision-making

Machine learning is used to build many AI systems today. Instead of prescribing a pre-determined algorithm, a machine learning system goes through a so-called ‘training’ process to produce the final algorithm from a large amount of data. For example, a machine learning system may generate an algorithm that successfully recognizes cats in a photo after going through millions of photos that show cats in many different postures from various angles.14 But the resulting algorithm is a complex mathematical formula and not something that humans can easily decipher. This means that the inner workings of a machine learning AI system and its decision-making process are opaque to human understanding, even to those who built the system itself (Knight 2017). In cases where the actions of an AI system, such as a military robot, can have grave consequences, such opacity becomes a serious problem.15

13 Kahn (2012) also argues that the resulting increase in the number of wars by the use of military robots will be morally bad.
14 Google’s research team created an AI algorithm that learned how to recognize a cat in 2012.
The neural network behind this algorithm had an array of 16,000 processors and more than one billion connections. Unlabeled random thumbnail images from 10 million YouTube videos allowed this algorithm to learn to identify cats by itself. See Markoff 2012 and Clark 2012.
15 This black-box nature of AI systems powered by machine learning has raised great concern among many AI researchers in recent years. This is problematic in all areas where these AI systems are used for decision-making, not just in military operations. The gravity of decisions made in a military operation makes this problem even more troublesome. Fortunately, some AI researchers including those in the US Department of Defense are actively working to make AI systems explainable. But until such research bears fruit and AI systems become fully explainable, their military use means accepting many unknown variables and unforeseeable consequences. See Turek n.d.

AI Applications for Libraries

Do the ethical concerns outlined above apply to libraries? To answer that, let us first take a look at how AI, particularly machine learning, may apply to library services and operations. AI-powered digital assistants are likely to mediate a library user’s information search, discovery, and retrieval activities in the near future.

In recent years, machine learning and deep learning have brought significant improvement to natural language processing (NLP), which deals with analyzing large amounts of natural language data to make the interaction between people and machines in natural languages possible. For instance, Google Assistant’s new feature ‘Duplex’ was shown to successfully make a phone reservation with restaurant staff in 2018 (Welch 2018). Google’s real-time translation capability for 44 different languages was introduced to Google Assistant-enabled Android and iOS phones in 2019 (Rincon 2019).
As digital assistants become capable of handling more sophisticated language tasks, their use as a flexible voice user interface will only increase. Such digital assistants will be able to directly interact with library systems and applications, automatically interpret a query, and return results that they deem to be most relevant. Those digital assistants can also be equipped to handle the library’s traditional reference or readers’ advisory service. Integrated into a humanoid robot body, they may even greet library patrons at the entrance and answer directional questions about the library building.

Cataloging, abstracting, and indexing are other areas where AI will be actively utilized. Currently, those tasks are performed by skilled professionals. But as AI applications become more sophisticated, we may see many of those tasks partially or fully automated and handed over to AI systems. Machine learning and deep learning can be used to extract key information from a large number of documents or from information-rich visual materials, such as maps and video recordings, and generate metadata or a summary.

Since machine learning is new to libraries, there are a relatively small number of machine learning applications developed for libraries’ use. They are likely to grow in number. Yewno, Quartolio, and Iris.ai are examples of the commercial products developed with machine learning and deep learning techniques.16 Yewno Discover displays the connections between different concepts or works in library materials. Quartolio targets researchers looking to discover untapped research opportunities based upon a large amount of data that includes articles, clinical trials, patents, and notes. Similarly, Iris.ai helps researchers identify and review a large amount of research papers and patents and extracts key information from them.
Kira identifies, extracts, and analyzes text in contracts and other legal documents.17 None of these applications performs fully automated decision-making or incorporates the digital assistant feature. But this is an area on which information systems vendors are increasingly focusing their efforts.

Libraries themselves are also experimenting with AI to test its potential for library services and operations. Some are focusing on using AI, particularly the voice user interface aspect of the digital assistant, in order to improve existing services. The University of Oklahoma Libraries have been building an Alexa application to provide basic reference service to their students.18 At the University of Pretoria Library in South Africa, a robot named ‘Libby’ already interacts with patrons by providing guidance, answering questions, conducting surveys, and displaying marketing videos (Mahlangu 2019).

16 See https://www.yewno.com/education, https://quartolio.com/, and https://iris.ai/.
17 See https://kirasystems.com/. Law firms are adopting similar products to automate and expedite their legal work, and law librarians are discussing how the use of AI may change their work. See Marr 2018 and Talley 2016.
18 University of Oklahoma Libraries are building an Alexa application that will provide some basic reference service to their students. Also, their PAIR registry attempts to compile all AI-related projects at libraries. See https://pair.libraries.ou.edu.

Other libraries are applying AI to extract information from digital materials and automate metadata generation to enhance their discovery and use.
The Library of Congress has worked on detecting features, such as railroads in maps, using a convolutional neural network model, and in 2019 issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections.19 Indiana University Libraries, AVP, University of Texas Austin School of Information, and the New York Public Library are jointly developing the Audiovisual Metadata Platform (AMP), using many AI tools in order to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.20

Some libraries are also testing out AI as a tool for evaluating services and operations. The University of Rochester Libraries applied deep learning to the library’s space assessment to determine the optimal staffing level and building hours. The University of Illinois Urbana-Champaign Libraries used machine learning to conduct sentiment analysis on their reference chat log (Blewer, Kim, and Phetteplace 2018).

Ethical Challenges from the Personalized and Automated Information Environment

Do these current and future AI applications for libraries pose ethical challenges similar to those that we discussed earlier? Since information query, discovery, and retrieval rarely involve life-or-death situations, the stakes certainly seem lower. But an AI-driven automated information environment does raise its own distinct ethical challenges.

(i) Intellectual isolation and bigotry hampering civic discourse

Many AI applications that assist with information seeking activities promise a higher level of personalization.
But a highly personalized information environment often traps people in their own so-called ‘filter bubble,’ as we have been increasingly seeing in today’s social media channels, news websites, and commercial search engines, where such personalization is provided by machine learning and deep learning.21 Sophisticated AI algorithms are already curating and pushing information feeds based upon the person’s past search and click behavior. The result is that information seekers are provided with information that conforms to and reinforces their existing beliefs and interests. Views that are novel or contrast with their existing beliefs are suppressed and become invisible without them even realizing it.

Such lack of exposure to opposing views leads information users to intellectual isolation and even bigotry. Highly personalized information environments powered by AI can actively restrict ways in which people develop balanced and informed opinions, thereby intensifying and perpetuating social discord and disrupting civic discourse. Under such conditions, prejudices, discrimination, and other unjust social practices are likely to increase, and this in turn will have more negative impact on those with fewer privileges. Intellectual isolation and bigotry have a distinctly moral impact on society.

19 See Blewer, Kim, and Phetteplace 2018 and Price 2019.
20 The AMP wiki is https://wiki.dlib.indiana.edu/pages/viewpage.action?pageId=531699941. The Audiovisual Metadata Platform Pilot Development (AMPPD) project was presented at Code4Lib 2020 (Averkamp and Hardesty 2020).
21 See Pariser 2012.

(ii) Weakening of cognitive agency and autonomy

We have seen earlier that AI-powered digital assistants are likely to mediate people’s information search, discovery, and retrieval activities in the near future.
As those digital assistants become more capable, they will go beyond listing available information. They will further choose what they deem to be most relevant to users and proceed to recommend or autonomously execute the best course of action.22 Other AI-driven features, such as extracting key information or generating a summary of a large amount of information, are also likely to be included in future information systems, and they may deliver key information or summaries even before the request is made based upon constant monitoring of the user’s activities.

In such a scenario, an information seeker’s cognitive agency is likely to be undermined. Crucial to cognitive agency is the mental capacity to critically review a variety of information, judge what is and is not relevant, and interpret how they relate to other existing beliefs and opinions. If AI assumes those tasks, the opportunities for information seekers to exercise their own cognitive agency will surely decrease. Cognitive deskilling and the subsequent weakening of people’s agency in the AI-powered automated information environment present an ethical challenge because such agency is necessary for a person to be a fully functioning moral agent in society.23

(iii) Social impact of scholarship and research from flawed AI algorithms

Previously, we have seen that deep learning applications are opaque to human understanding. This lack of transparency and explainability raises a question of whether it is moral to rely on AI-powered military robots for life-or-death decisions. Does the AI-powered information environment have a similar problem?

Machine learning applications base their recommendations and predictions upon the patterns in past data. Their predictions and recommendations are in this sense inherently conservative. They also become outdated when they fail to reflect new social views and material conditions that no longer fit the past patterns.
Furthermore, each data set is a social construct that reflects particular values and choices such as who decided to collect the data and for what purpose; who labeled data; what criteria or beliefs guided such labeling; what taxonomies were used and why (Davis 2020). No data set can capture all variables and elements of the phenomenon that it describes. Furthermore, data sets used for training machine learning and deep learning algorithms may not be representative samples for all relevant subgroups. In such a case, an algorithm trained by such a data set will produce skewed results. Creating a large data set is also costly. Consequently, developers often simply take the data sets available to them. Those data sets are likely to come with inherent limitations such as omissions, inaccuracies, errors, and hidden biases.

22 Needless to say, this is a highly simplified scenario. Those features can also be built in the information system itself rather than being delivered by a digital assistant.
23 Outside of the automated information environment, AI has a strong potential to engender moral deskilling. Vallor (2015) points out that automated weapons will lead to soldiers’ moral deskilling in the use of military force; new media practices of multitasking may result in deskilling in moral attention; and social robots can cause moral deskilling in practices of human caregiving.

AI algorithms trained with these flawed data sets can fail unexpectedly, revealing those limitations. For example, it has been reported that the success rate of a facial recognition algorithm plunges from 99% to 35% when the group of subjects changes from white men to dark-skinned women because it was trained mostly with the photographs of white men (Lohr 2018). Adopting such a faulty algorithm for any real-life use at a large scale would be entirely unethical.
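A disparity of this kind only becomes visible when accuracy is reported separately for each subgroup rather than as one overall figure. The following is a minimal sketch of such a per-group check, under stated assumptions: the function names `accuracy_by_group` and `accuracy_gap`, the group labels, and the toy records are all hypothetical illustrations and are not taken from Lohr (2018) or from any system discussed in this chapter.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for (group, predicted, actual) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical toy data: (subgroup, algorithm output, ground truth).
records = [
    ("group_a", "match", "match"),
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_b", "match", "no_match"),
    ("group_b", "no_match", "match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "match"),
]

per_group = accuracy_by_group(records)
gap = accuracy_gap(per_group)
print(per_group)
print(gap)
```

In this toy data the overall accuracy is 5 out of 8, which hides the fact that one group is served far worse than the other. A check like this does not explain why the gap exists; it only makes the gap visible so that it can be investigated.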
In the context of libraries, imagine using such a face-recognition algorithm to generate metadata for digitized historical photographs or a similarly flawed audio transcription algorithm to transcribe archival audio recordings. Just like those faulty algorithms, an AI-powered automated information environment can produce information, recommendations, and predictions affected by similar limitations existing in many data sets. The more seamless such an information environment is, the more invisible those limitations become. Automated information systems from libraries may not be involved in decisions that have a direct and immediate impact on people’s lives, such as setting a bail amount or determining a Medicaid payment.24 But automated information systems that are widely adopted and used for research and scholarship will impact real-life policies and regulations in areas such as healthcare and the economy. Undiscovered flaws will undermine the validity of the scholarly output that utilized those automated information systems and can further inflict serious harm on certain groups of people through those policies and regulations.

Moral Intelligence and Rethinking the Role of AI

In this chapter, I discussed four significant ethical challenges that automating decisions and actions with AI presents: (a) moral desensitization; (b) unintended outcomes; (c) surrender of moral agency; (d) opacity in decision-making.25 I also examined somewhat different but equally significant ethical challenges in relation to the AI-powered automated information environment, which is likely to surround us in the future: (i) intellectual isolation and bigotry hampering civic discourse; (ii) weakening of cognitive agency and autonomy; (iii) social impact of scholarship and research based upon flawed AI algorithms.

In the near future, libraries will be acquiring, building, customizing, and implementing many personalized and automated information systems.
Given this, the challenges related to the AI-powered automated information environment are highly relevant to them. At present, libraries are at an early stage in developing AI applications and applying machine learning and deep learning techniques to improve library services, systems, and operations. But the general issues of hidden biases and the lack of explainability in machine learning and deep learning are already gaining attention in the library community.

As we have seen in the trolley problem, whether a certain action is moral is not a line that can be drawn with absolute clarity. It is entirely possible for fully-functioning moral agents to make different judgements. In addition, there is the matter of morality that our tools and systems display. This is called “machine morality” in relation to AI systems.

24 See Tashea 2017 and Stanley 2017.
25 This is by no means an exhaustive list. User privacy and potential surveillance are examples of other important ethical challenges, which I do not discuss here.

Wallach and Allen (2009) argue that there are three distinct levels of machine morality: operational morality, functional morality, and full moral agency (26). Operational morality is found in systems that are low in both autonomy and ethical sensitivity. At this level of machine morality, a machine or a tool is given a mechanism that prevents its immoral use, but the mechanism is within the full control of the user. Such operational morality exists in a gun with a childproof safety mechanism, for example. A gun with a safety mechanism is neither autonomous nor sensitive to ethical concerns related to its use. By contrast, machines with functional morality do possess a certain level of autonomy and ethical sensitivity. This category includes AI systems with significant autonomy and little ethical sensitivity or those with little autonomy and high ethical sensitivity.
An autonomous drone would fall under the former type, while MedEthEx, an ethical decision-support AI recommendation system for clinicians, would be of the latter. Lastly, Wallach and Allen regard systems with high autonomy and high ethical sensitivity as having full moral agency, as much as humans do. This means that those systems would have a mental representation of values and the capacity for moral reasoning. Such machines can be held morally responsible for their actions. We do not know whether AI will be able to produce such a machine with full moral agency. If the current direction to automate more and more human tasks for cost savings and efficiency at scale continues, however, most of the more sophisticated AI applications to come will be of the kind with functional morality, particularly the kind that combines a relatively high level of autonomy and a lower level of ethical sensitivity.

At the beginning of this chapter, I mentioned that the goal of AI is to create an artificial system (whether it be a piece of software or a machine with a physical body) that is as intelligent as a human in its performance, either broadly in all areas of human activities or narrowly in a specific activity. But what exactly does “as intelligent as a human” mean? If morality is an integral component of human-level intelligence, AI research needs to pay more attention to intelligence not only in accomplishing a goal but also in doing so ethically.26 In that light, it is meaningful to ask what level of autonomy and ethical sensitivity a given AI system is equipped with, and what level of machine morality is appropriate for its purpose.

In designing an AI system, it would be helpful to consider what level of autonomy and ethical sensitivity would be best suited for its purpose and whether it is feasible to provide that level of machine morality for the system in question.
In general, the narrower the function or the do- main of an AI system is, the easier it will be to equip it with an appropriate level of autonomy and ethical sensitivity. In evaluating and designing an AI system, it will be important to test the actual outcome against the anticipated outcome in different types of cases in order to identify potential problems. System-wide audits to detect well-known biases, such as gender discrimina- tion or racism, can serve as an effective strategy.27 Other undetected problems may surface only after the AI system is deployed. Having a mechanism to continually test an AI algorithm to iden- tify those unnoticed problems and feeding the test result back into the algorithm for retraining will be another way to deal with algorithmic biases. Those who build AI systems will also benefit from consulting existing principles and guidelines such as FAT/ML’s “Principles for Accountable Algorithms and a Social Impact Statement for Algorithms.”28 We may also want to rethink how and where we apply AI. We and our society do not have 26Here, I regard intelligence as the ability to accomplish complex goals following Tegmark 2017. For more discussion on intelligence and goals, see Chapter 2 and Chapter 7. 27These audits are far from foolproof, but the detection of hidden biases will be crucial in making AI algorithms more accountable and their decisions more ethical. A debiasing algorithm can also be used during the training stage of an AI algorithm to reduce hidden biases in training data. See Amini et al. 2019, Knight 2019b, and Courtland 2018. 28See ?iiTb,ffrrrX7�iKHXQ`;f`2bQm`+2bfT`BM+BTH2b@7Q`@�++QmMi�#H2@�H;Q`Bi?Kb. Other principles and guidelines include “Ethics Guidelines for Trustworthy AI” (?iiTb,ff2+X2m`QT�X2mf/B;Bi�H@b BM;H2@K�`F2if2MfM2rbf2i?B+b@;mB/2HBM2b@i`mbirQ`i?v@�B) and “Algorithmic Impact Assessments: A Practical Framework For Public Agency Accountability” (?iiTb,ff�BMQrBMbiBimi2XQ`;f�B�`2TQ`ikyR3XT /7). 
https://www.fatml.org/resources/principles-for-accountable-algorithms https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai https://ainowinstitute.org/aiareport2018.pdf https://ainowinstitute.org/aiareport2018.pdf Kim 83 to use AI to equip all our systems and machines with human- or superhuman-level performance. This is particularly so if the pursuit of such human- or superhuman-level performance is likely to increase unethical decisions that negatively impact a significant number of people. We do not have to task AI with always automating away human work and decisions as much as possible. What if we reframe AI’s role as helping people become more intelligent and more capable where they struggle or experience disadvantages, such as critical thinking, civic participation, healthy liv- ing, financial literacy, dyslexia, or hearing loss? What kind of AI-driven information systems and environments would be created if libraries approach AI with such intention from the beginning? References Alexander, Larry, and Michael Moore. 2016. “Deontological Ethics.” In The Stanford Encyclo- pedia of Philosophy, edited by Edward N. Zalta, Winter 2016. Metaphysics Research Lab, Stanford University. ?iiTb,ffTH�iQXbi�M7Q`/X2/mf�`+?Bp2bfrBMkyRef2Mi`B2 bf2i?B+b@/2QMiQHQ;B+�Hf. Amini, Alexander, Ava P. Soleimany, Wilko Schwarting, Sangeeta N. Bhatia, and Daniela Rus. 2019. “Uncovering and Mitigating Algorithmic Bias through Learned Latent Structure.” In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 289–295. AIES ’19. New York, NY, USA: Association for Computing Machinery. ?iiTb,ff/QBXQ`;f RyXRR98fjjyeeR3XjjR9k9j. Averkamp, Shawn, and Julie Hardesty. 2020. “AI Is Such a Tool: Keeping Your Machine Learn- ing Outputs in Check.” Presented at the Code4lib Conference, Pittsburgh, PA, March 11. 
?iiTb,ffkykyX+Q/29HB#XQ`;fi�HFbf�A@Bb@bm+?@�@iQQH@E22TBM;@vQm`@K �+?BM2@H2�`MBM;@QmiTmib@BM@+?2+F. Blewer, Ashley, Bohyun Kim, and Eric Phetteplace. 2018. “Reflections on Code4Lib 2018.” ACRL TechConnect (blog). March 12, 2018. ?iiTb,ff�+`HX�H�XQ`;fi2+?+QMM2+i fTQbif`27H2+iBQMb@QM@+Q/29HB#@kyR3f. Boden, Margaret A. 2016. AI: Its Nature and Future. Oxford: Oxford University Press. Bonnefon, Jean-François, Azim Shariff, and Iyad Rahwan. 2016. “The Social Dilemma of Au- tonomous Vehicles.” Science 352 (6293): 1573–76. ?iiTb,ff/QBXQ`;fRyXRRkefb+B2 M+2X��7ke89. Clark, Liat. 2012. “Google’s Artificial Brain Learns to Find Cat Videos.” Wired, June 26, 2012. ?iiTb,ffrrrXrB`2/X+QKfkyRkfyef;QQ;H2@t@M2m`�H@M2irQ`Ff. Conitzer, Vincent, Walter Sinnott-Armstrong, Jana Schaich Borg, Yuan Deng, and Max Kramer. 2017. “Moral Decision Making Frameworks for Artificial Intelligence.” In Proceedingsofthe Thirty-First AAAI Conference on Artificial Intelligence, 4831–4835. AAAI’17. San Fran- cisco, California, USA: AAAI Press. Courtland, Rachel. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” Nature 558 (7710): 357–60. ?iiTb,ff/QBXQ`;fRyXRyj3f/9R83e@yR3@y89eN@j. Cushman, Fiery, Liane Young, and Marc Hauser. 2006. “The Role of Conscious Reasoning and Intuition in Moral Judgment: Testing Three Principles of Harm.” Psychological Science 17 (12): 1082–89. Davis, Daniel L. 2007. “Who Decides: Man or Machine?” Armed Forces Journal, November. ?iiT,ff�`K2/7Q`+2bDQm`M�HX+QKfr?Q@/2+B/2b@K�M@Q`@K�+?BM2f. Davis, Hannah. 2020. “A Dataset Is a Worldview.” Towards Data Science. March 5, 2020. ?iiT b,ffiQr�`/b/�i�b+B2M+2X+QKf�@/�i�b2i@Bb@�@rQ`H/pB2r@8jk3kRe//99/. 
Foot, Philippa. 1967. “The Problem of Abortion and the Doctrine of Double Effect.” Oxford Review 5: 5–15.
Heaven, Will Douglas. 2020. “DeepMind’s AI Can Now Play All 57 Atari Games—but It’s Still Not Versatile Enough.” MIT Technology Review, April 1, 2020. https://www.technologyreview.com/2020/04/01/974997.
International Committee of the Red Cross. 2015. “What Are Jus Ad Bellum and Jus in Bello?” January 22, 2015. https://www.icrc.org/en/document/what-are-jus-ad-bellum-and-jus-bello-0.
Kahn, Leonard. 2012. “Military Robots and The Likelihood of Armed Combat.” In Robot Ethics: The Ethical and Social Implications of Robotics, edited by Patrick Lin, Keith Abney, and George A. Bekey, 274–92. Intelligent Robotics and Autonomous Agents. Cambridge, Mass.: MIT Press.
Knight, Will. 2017. “The Dark Secret at the Heart of AI.” MIT Technology Review, April 11, 2017. https://www.technologyreview.com/2017/04/11/5113.
Knight, Will. 2019a. “Two Rival AI Approaches Combine to Let Machines Learn about the World like a Child.” MIT Technology Review, April 8, 2019. https://www.technologyreview.com/2019/04/08/103223.
Knight, Will. 2019b. “AI Is Biased. Here’s How Scientists Are Trying to Fix It.” Wired, December 19, 2019. https://www.wired.com/story/ai-biased-how-scientists-trying-fix/.
Koch, Christof. 2016. “How the Computer Beat the Go Master.” Scientific American. March 19, 2016. https://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/.
Lohr, Steve. 2018. “Facial Recognition Is Accurate, If You’re a White Guy.” New York Times, February 9, 2018. https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html.
Mahlangu, Isaac. 2019. “Meet Libby - the New Robot Library Assistant at the University of Pretoria’s Hatfield Campus.” SowetanLIVE. June 4, 2019. https://www.sowetanlive.co.za/news/south-africa/2019-06-04-meet-libby-the-new-robot-library-assistant-at-the-university-of-pretorias-hatfield-campus/.
Markoff, John. 2012. “How Many Computers to Identify a Cat? 16,000.” New York Times, June 25, 2012.
Marr, Bernard. 2018. “How AI And Machine Learning Are Transforming Law Firms And The Legal Sector.” Forbes, May 23, 2018. https://www.forbes.com/sites/bernardmarr/2018/05/23/how-ai-and-machine-learning-are-transforming-law-firms-and-the-legal-sector/.
Pariser, Eli. 2011. The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. New York: Penguin Press.
Price, Gary. 2019. “The Library of Congress Posts Solicitation For a Machine Learning/Deep Learning Pilot Program to ‘Maximize the Use of Its Digital Collection.’ ” LJ InfoDOCKET. June 13, 2019. https://www.infodocket.com/2019/06/13/library-of-congress-posts-solicitation-for-a-machine-learning-deep-learning-pilot-program-to-maximize-the-use-of-its-digital-collection-library-is-looking-for-r/.
Rincon, Lilian. 2019. “Interpreter Mode Brings Real-Time Translation to Your Phone.” Google Blog (blog). December 12, 2019. https://www.blog.google/products/assistant/interpreter-mode-brings-real-time-translation-your-phone/.
Sharkey, Noel. 2012. “Killing Made Easy: From Joysticks to Politics.” In Robot Ethics: The Ethical and Social Implications of Robotics, edited by Patrick Lin, Keith Abney, and George A. Bekey, 111–28. Intelligent Robotics and Autonomous Agents. Cambridge, Mass.: MIT Press.
Singer, Peter. 2005. “Ethics and Intuitions.” The Journal of Ethics 9 (3/4): 331–52.
Sinnott-Armstrong, Walter. 2019. “Consequentialism.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Summer 2019. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/sum2019/entries/consequentialism/.
Stanley, Jay. 2017. “Pitfalls of Artificial Intelligence Decisionmaking Highlighted In Idaho ACLU Case.” American Civil Liberties Union (blog). June 2, 2017. https://www.aclu.org/blog/privacy-technology/pitfalls-artificial-intelligence-decisionmaking-highlighted-idaho-aclu-case.
Talley, Nancy B. 2016. “Imagining the Use of Intelligent Agents and Artificial Intelligence in Academic Law Libraries.” Law Library Journal 108 (3): 383–402.
Tashea, Jason. 2017. “Courts Are Using AI to Sentence Criminals. That Must Stop Now.” Wired, April 17, 2017. https://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/.
Tegmark, Max. 2017. Life 3.0: Being Human in the Age of Artificial Intelligence. New York: Alfred Knopf.
Thomson, Judith Jarvis. 1976. “Killing, Letting Die, and the Trolley Problem.” The Monist 59 (2): 204–17.
Turek, Matt. n.d. “Explainable Artificial Intelligence.” Defense Advanced Research Projects Agency. https://www.darpa.mil/program/explainable-artificial-intelligence.
Vallor, Shannon. 2015. “Moral Deskilling and Upskilling in a New Machine Age: Reflections on the Ambiguous Future of Character.” Philosophy & Technology 28 (1): 107–24. https://doi.org/10.1007/s13347-014-0156-9.
Wallach, Wendell. 2009. Moral Machines: Teaching Robots Right from Wrong. Oxford: Oxford University Press.
Welch, Chris. 2018. “Google Just Gave a Stunning Demo of Assistant Making an Actual Phone Call.” The Verge. May 8, 2018. https://www.theverge.com/2018/5/8/17332070/google-assistant-makes-phone-call-demo-duplex-io-2018.
lesk-fragility-2021 ---- Chapter 9 Fragility and Intelligibility of Deep Learning for Libraries
Michael Lesk, Rutgers University
Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 9

Introduction

On February 7, 2018, Mounir Mahjoubi, then the “digital minister” of France (le secrétariat d’État chargé du Numérique), told the civil service to use only computer methods that could be understood (Mahjoubi 2018). To be precise, what he actually said to l’Assemblée Nationale was:

Aucun algorithme non explicable ne pourra être utilisé.

I gave this to Google Translate and asked for it in English. What I got (on October 13, 2019) was:

No algorithm that can not be explained can not be used.

That’s a long way from fluent English. As I count the “not” words, it’s actually reversed in meaning. But, what if I leave off the final period when I enter it in Google Translate? Then I get:

No non-explainable algorithm can be used

Quite different, and although only barely fluent, now the meaning is right. The difference was only the final punctuation on the sentence.[1]

This is an example of the fragility of an AI algorithm. The point is not that both translations are of doubtful quality. The point is that a seemingly insignificant change in the input produced such a difference in the output. In this case, the fragility was detected by accident.

[1] In the months between my original queries in October 2019 and the final preparations for publication in November 2020, the algorithm has changed to produce the same translation with or without a period: “No non-explicable algorithm can be used.”

Machine learning systems have a set of data for training. For example, if you are interested in translation, and you have a large collection of text in both French and English, you might notice that the word truck in English appears where the word camion appears in French. And the system might “learn” this translation.
It would then apply this in other examples; this is called generalization. Of course if you wish to translate French into British English, a preferred translation of camion is lorry. And if the context of your English truck is a US discussion of the wheels and axles underneath railway vehicles, the better French word is le bogie.

Deep learning enthusiasts believe that with enough examples, machine learning systems will be able to generalize correctly. There can be various kinds of failures: we can discuss both (a) problems in the scope of the training data and (b) problems in the kind of modeling done. If the system has sufficiently general input data so that it learns well enough to produce reliably correct results on examples it has not seen, we call it robust; robustness is the opposite of fragility. Fragility errors here can arise from many sources. For example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). Or, the data may not have the scope of the real problem: if you train for “boat” based on ocean liners, don’t be surprised if the program fails on canoes.

In addition, there are also modeling issues. Suppose you use a very simple model, such as a linear model, for data that is actually perhaps quadratic or exponential. This is called “underfitting” and may often arise when there is not enough training data. The reverse is also possible: there may be a lot of training data, including many noisy points, and the program may decide on a very complex model to cover all the noise in the training data. This is called “overfitting” and gives you an answer too dependent on noise and outliers in your data. For example, 1998 was an unusually warm year, but the decline in world temperature for the next few years suggests it was noise in the data, not a change in the development of climate.
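The contrast between underfitting and overfitting can be made concrete with a small experiment. The sketch below is only an illustration under stated assumptions, not a method from this chapter: it invents noisy data from a quadratic curve, fits a straight line (too simple for the data) and a polynomial forced through every training point (complex enough to absorb all the noise), and reports the error of each on the training points and on held-out points. The function names, the noise level, and the choice of a Lagrange interpolating polynomial as the "very complex model" are all assumptions made for the example.

```python
import random


def mean_squared_error(model, xs, ys):
    """Average squared gap between model(x) and the observed y."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Ordinary least-squares straight line: too simple for curved data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point, noise included."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def compare_fits(seed=0, n_train=12, noise=0.5):
    """Return (train MSE, test MSE) for both models on noisy quadratic data."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    step = 6.0 / (n_train - 1)
    train_x = [-3.0 + i * step for i in range(n_train)]
    # Held-out points sit halfway between neighbouring training points.
    test_x = [x + step / 2 for x in train_x[:-1]]
    train_y = [x * x + rng.gauss(0, noise) for x in train_x]
    test_y = [x * x + rng.gauss(0, noise) for x in test_x]
    results = {}
    for name, fit in (("line", fit_line), ("interpolant", fit_interpolant)):
        model = fit(train_x, train_y)
        results[name] = (mean_squared_error(model, train_x, train_y),
                         mean_squared_error(model, test_x, test_y))
    return results


if __name__ == "__main__":
    for name, (train_err, test_err) in compare_fits().items():
        print(f"{name:12s} train MSE = {train_err:.3f}  test MSE = {test_err:.3f}")
```

The line misses the curvature, so its error is large even on the data it was fitted to. The interpolant reproduces its training points exactly, yet does worse on the points in between, because part of what it learned was noise.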
Fragility is also a problem in image recognition (“AI Recognition” 2017). Currently the most common technique for image recognition research projects is the use of convolutional neural nets. Recently, several papers have looked at how trivial modifications to images may impact image classification. Figure 9.1 shows an image taken from Su, Vargas, and Sakurai (2019). The original image class is in black and the classifier choice (and confidence) after adding a single unusual pixel is shown in blue, with the extraneous pixel in white. The images were deliberately processed at low resolution (hence the pixellation) to match the input requirement of a popular image classification program.

Figure 9.1: Examples of misclassification.

The authors experimented with algorithms to find the quickest single-pixel change that would deceive an image classifier. They were routinely able to fool the recognition software. In this example, the deception was deliberate; the researchers searched for the best place to change the image.

Bias and mistakes

We have seen a major change in the way we do machine learning, and there are real dangers involved. The current enthusiasm for neural nets risks the use of processes which cannot be understood, as Mahjoubi warned, and which can thus conceal methods we would not approve of, such as discrimination in lending or hiring. Cathy O’Neil has described this in her book Weapons of Math Destruction (2016).

There is much research today that seeks methods to explain what neural nets are doing. See Guidotti et al. (2018) for a survey. There is also a 2018 DARPA program on “Explainable AI.” Techniques used can include looking at the results over a range of input data and seeing if the neural net can be modeled by a decision tree, or modifying the input data to see which input elements have the greatest effect on the results, and then showing that to the user. For example, Mariusz Bojarski et al.
describe a self-driving system that highlights what it thinks is important in what it is seeing (2017). However, this is generally research in progress, and it raises the question of whether we can trust the explanation generator.

Many popular magazines have discussed this problem; Forbes, for example, had an explanation of how the choice of datasets can produce a biased result without any deliberate attempt to do so (Taulli 2019). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019). The MIT Media Lab hosts the Algorithmic Justice League, trying to stop organizations from building socially slanted systems. Similar thoughts come from groups like the Data and Society Research Institute or the AI Now Institute.

Again, the problems may be accidental or deliberate. The phrase “data poisoning” has been used to suggest malicious creation of training data or examples of data designed to deceive machine learning systems. There is now a DARPA research program, “Guaranteeing AI Robustness against Deception (GARD),” supporting research to learn how to stop trickery such as a demonstration of converting a traffic stop sign to a 45 mph speed limit with a few stickers (Eykholt et al. 2018). More generally, bias in systems deciding whether to grant loans may be discriminatory but nevertheless profitable.

Even if you want to detect AI mistakes, recognizing such problems is difficult. Often things will be wrong and we won’t know why. And even hypothetical (but perhaps erroneous) explanations can be very convincing; people easily believe plausible stories. I routinely give my students a paper that concludes that prior ownership of a cat prevents fatal myocardial infarctions; its result implies that cats are more protective than statin drugs (Qureshi et al. 2009).
The students are very quick to come up with possibilities like “petting a cat is relaxing, relaxation reduces your blood pressure, and lower blood pressure decreases the risk of heart attacks.” Then I have to explain that the paper evaluates 32 possibilities (prior/current ownership × cats/dogs × 4 medical conditions × fatal/nonfatal) and you shouldn’t be surprised if you evaluate 32 chances and one is significant at the 0.05 level, which is only 1 in 20. If the 32 tests were independent, the chance that at least one of them would reach that level by luck alone would be 1 − 0.95^32, or about 81%. In this example, there is also the question of reverse causality: perhaps someone who is in ill health will decide he is too sick to take care of a pet, so that the poor health is not caused by the lack of a cat, but rather the poor health causes the absence of a cat.

Sometimes explanations can help, as in a machine learning program that was deliberately trained to distinguish images of wolves and dogs but was trained using pictures of wolves that always contained snow and pictures of dogs that never did (Ribeiro, Singh, and Guestrin 2016). Without explaining that, 10 of 27 subjects thought the classifier was trustworthy; after pointing out the snow only 3 of 27 subjects believed the system. Usually you don’t get such a clear presentation of a mis-trained system.

Recognition of problems

Can we tell when something is wrong? Here’s the result of a Google Photo merge of three other photos: two landscapes and a picture of somebody’s friend. The software was told to make a panorama and stitched the images together (Peng 2018). It looks like a joke, and even made it into a list of top jokes on Reddit. The author’s point was that the panorama system didn’t understand basic composition: people are not the same scale as mountains.

Figure 9.2: Panoramic landscape.

Often, machine learning results are overstated. Google Flu Trends was acclaimed for several years and then turned out to be undependable (Lazer et al. 2014).
A study that attempted to compare the performance of machine learning systems for medical diagnosis with actual doctors found that of over 20,000 papers analyzed, only a few dozen had data suitable for an evaluation (Liu et al. 2019). The results claimed comparable accuracy, but virtually none of the papers presented adequate data to support that conclusion.

Unusually promising results are sometimes the result of overfitting (Brownlee 2018); this is what was wrong with Google Flu Trends. A machine learning program can learn a large number of special cases and then find that the results do not generalize. In other cases problems can result when using “clean” data for training, and then encountering messier data in applications. Ideally, training and testing data should be from the same dataset and divided at random, but it can be tempting to start off with examples that are the result of initial and higher quality data collection.

Sometimes in the past we had a choice between modeling and data for predictions. Consider, for example, the problem of guessing what the weather will be tomorrow. We now do this based on a model of the atmosphere that uses the Navier-Stokes equations; we use supercomputers and derive tomorrow’s atmosphere from today’s (Christensen 2015). What did we do before we had supercomputers? Solving those equations by hand is impractical. One of the methods was “prediction by analogy”: find some day in the past whose weather was most similar to today. Suppose that day is October 20, 1970. Then use October 21, 1970 as tomorrow’s prediction. Prediction by analogy doesn’t require you to have a model or use advanced mathematics. In this case, however, it doesn’t work as well, partly because we don’t have enough past days to choose from, and we only get new days at the rate of one per day.
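Prediction by analogy amounts to a nearest-neighbor lookup over past days. The following is a minimal sketch under assumptions of my own, not a description of any historical forecasting procedure: each day is reduced to a short tuple of measurements, similarity is plain Euclidean distance, and the function name and the sample records are hypothetical.

```python
from math import dist


def predict_by_analogy(history, today):
    """Return the observation that followed the past day most like today.

    history: chronological list of equal-length numeric tuples, one per day.
    today:   a tuple of the same measurements for the current day.
    """
    if len(history) < 2:
        raise ValueError("need at least two past days to form an analog")
    # The last recorded day has no known 'next day', so it cannot be an analog.
    candidates = range(len(history) - 1)
    best = min(candidates, key=lambda i: dist(history[i], today))
    return history[best + 1]


# Hypothetical records: (temperature in degrees C, pressure in hPa).
past_days = [(10, 1012), (12, 1008), (18, 1001), (15, 1005), (9, 1015)]
forecast = predict_by_analogy(past_days, today=(12.5, 1007))
print(forecast)
```

Here the closest past day is the second one, so the forecast is whatever followed it, `(18, 1001)`. With an archive this small the "most similar" day may not be very similar at all, which is the weakness of the method.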
In fact, Huug van den Dool estimated the number of days of data needed to make accurate predictions as 10^30 years, which is far more than the age of the universe (Wilks 2008). The underlying problem is that the weather is very random. If your state lottery is properly run, it should be completely pointless to look at past winning numbers and try to guess the next one. The weather is not that random but it has too much variation to be solved easily by analogy. If your problem is very simple (tic-tac-toe) you could indeed write down each position and what the best next move is; there are only about 255,000 games.

To deal with more realistic problems, much of machine learning research is now focused on obtaining larger training sets. Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, “more data beats better algorithms.” In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, “The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets. The basic learning and decoding algorithms have not changed substantially in 40 years” (2014). Nevertheless, speech recognition has gone from frustration to useful products such as dictation software or home appliances.

Lacking a model, however, means that we won’t know the limits of the calculations being done. For example, if you have some data that looks quadratic, but you fit a linear model, any attempt at extrapolation is fraught with error. If you are using a “black box” system, you don’t know when this is happening. And, regrettably, many of the AI software systems are sold as black boxes where the purchasers and users do not have access to the process, even if they are imagined to be able to understand it.

What’s changing

Many AI researchers are sensitive to the risks, especially given the publicity over self-driving cars.
As the hype over “deep learning” built up, writers discussed examples such as a Pittsburgh medical system that proposed to send patients with both pneumonia and asthma home, because the computer had not understood that patients with both problems were actually being sent to the ICU (Bornstein 2016; Caruana et al. 2015).

Many people work on ways of explaining or presenting neural net software (Harley 2015). Most important, perhaps, are new EU regulations that prohibit automated decision making that affects EU citizens, and provide a “right of explanation” (Metz 2016).

We recognize that systems which don’t rely on a mathematical model may be cheaper to build than ones where the coders understand what is going on. More serious is that they may be more accurate. Figure 9.3 is from the same article on understandability (Bornstein 2016).

Figure 9.3: Explainability.

If there really is a tradeoff between what will solve the problem and what can be explained, we know that many system builders will choose to solve the problem. And yet even having explanations may not be an answer; a key paper on interpretability discusses the complexities of meaning related to explanation, causality, and modeling (Lipton 2018).

Arend Hintze has noted that we do not always impose a demand for explanation on people. I can write that the New York Public Library main building is well proportioned and attractive without anyone expecting that I will recite its dimensions or the source of the marble used to construct it. And for some problems that’s fine: I don’t care how my camera decides on the focus distance to the subject. Where it matters, however, we often want explanations; the hard ethical problem, as noted before, is if better performance can be achieved in an inexplicable way.

Recommendations

2017 saw the publication of the “Asilomar AI principles” (2017).
Two of these principles are:

• Safety: AI systems should be safe and secure throughout their operational lifetime, and verifiably so where applicable and feasible.
• Failure Transparency: If an AI system causes harm, it should be possible to ascertain why.

The problem is that the technology used to build many systems does not enable verifiability and explanation. Similarly the World Economic Forum calls for protection against discrimination but notes many ways in which technology can have unanticipated and undesirable effects as a result of machine learning (“How to Prevent” 2018).

Historically there has been and continues to be too much hype. An important image recognition task is distinguishing malignant and benign spots on mammograms. There have been promises for decades that computers would do this better than radiologists. Here are examples from 1995 (“computer-aided diagnosis can improve radiologists’ observational performance”) (Schmidt and Nishikawa) and 2009 (“The Bayesian network significantly exceeded the performance of interpreting radiologists”) (Burnside et al.). A typical recent AI paper to do this with convolutional neural nets reports 90% accuracy (Singh et al. 2020). To put this in perspective, the problem is complex, but some examples are more straightforward, and even pigeons can reach 85% (Levenson et al. 2015). A serious recent review is “Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection” (Lehman et al. 2015). Very recently there was another claim that computers have surpassed radiologists (Walsh 2020); we will have to await evaluation. As with many claims of medical progress, replicability and evaluation are needed before doctors will be willing to believe them.

What should we do? Software testing generally is a decades-old discipline, and many basic principles of regression testing apply here also:

• Test data should cover the full range of expected input.
• Test data should also cover unexpected and even illegal input.
• Test data should include known past failures believed cleared up.
• Test data should exercise all parts of the program, and all important paths (coverage).
• Test data should include a set of data which is representative of the distribution of actual data, to be used for timing purposes.

It is difficult to apply these ideas in parts of the AI world. If the allowed input is speech, there is no exhaustive list of utterances which can be sampled. If a black-box commercial machine learning package is being used, there is no way to ask about coverage of any number of test cases. If a program is constantly learning from new data, there is no list of previously fixed failures to be collected that reflects the constantly changing program.

And obviously the circumstances of use matter. We may well, as a society, decide that forcing banks evaluating loan applications to use decision trees instead of deep learning is appropriate, so that we know whether illegal discrimination is going on, even if this raises the costs to the banks. We might also believe that the safest possible railway operation is important, even if the automated train doesn’t routinely explain how it balanced its choices of acceleration to achieve high punctuality and low risk.

What would I suggest? Organizationally:

• Have teams including both the computer scientists and the users.
• Collaborate with a statistician: they’ve seen a lot of these problems before.
• Work on easier problems.

As examples, I watched a group of zoologists with a group of computer scientists discussing how to improve accuracy at identifying animals in photographs. The discussion indicated that you needed hundreds of training examples at a minimum, if not thousands, since the animals do not typically walk up to the camera and pose for a full-frame shot.
It was important to have both the people who understood the learning systems and the people who knew what the pictures were realistically like. The most amusing contribution by a statistician happened when a computer scientist offered a program that tried to recognize individual giraffes, and a zoologist complained that it only worked if you had a view of the right-hand side of the giraffe. Somebody who knew statistics perked up and said “it’s a 50% chance of recognizing the animal? I can do the math for that.” And it is simpler to do “is there any animal in the picture?” before asking “which animal is it?” and create two easier problems.

Technically:

• Try to interpolate rather than extrapolate: use the algorithm on points “inside” the training set (thinking in multiple dimensions).
• Lean towards feature detection and modeling rather than completely unsupervised learning.
• Emphasize continuous rather than discrete variables.

I suggest using methods that involve feature detection, since that tells you what the algorithm is relying on. For example, consider the Google Flu Trends failure; the public was not told what terms were used. As David Lazer noted, some of them were just “winter” terms (like ‘basketball’). If you know that, you might be skeptical. More significant are decisions like jail sentences or college admissions; that race or religion played no part can be verified by knowing that the program did not use them as features. Knowing what features were used can sometimes help the user: if you know that your loan application was downrated because of your credit score, it may be possible for you to pay off some bill to raise the score.

Sometimes you have to use categorical variables (what county do you live in?)
but if you have a choice of how you phrase a variable, asking something like “how many minutes a day do you spend reading?” is likely to produce a better fit than asking people to choose “how much do you read: never, sometimes, a lot?” A machine learning algorithm may tell you how much of the variance each input variable explains; you can use that information to focus on the variables that are most important to your problem, and decide whether you think you are measuring them well enough.

Why not extrapolate? Sadly, as I write this in early April 2020, we are seeing all sorts of extrapolations of the COVID-19 epidemic, with expected US deaths ranging from 30,000 to 2 million, as people try to fit various functions (Gaussians, logistic regression, or whatever) with inadequately precise data and uncertain models. A simpler example is Mark Twain’s: “In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the ‘Old Oolitic Silurian Period,’ just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen” (1883).

Finally, note the advice of Edgar Allan Poe: “Believe nothing you hear, and only one half that you see.”

References

“AI Recognition Fooled by Single Pixel Change.” BBC News, November 3, 2017. https://www.bbc.com/news/technology-41845878.
“Asilomar AI Principles.” 2017. https://futureoflife.org/ai-principles/.
Bojarski, Mariusz, Larry Jackel, Ben Firner, and Urs Muller. 2017. “Explaining How End-to-End Deep Learning Steers a Self-Driving Car.” NVIDIA Developer Blog. https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.
Bornstein, Aaron. 2016. “Is Artificial Intelligence Permanently Inscrutable?” Nautilus 40 (1). http://nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable.
Brownlee, Jason. 2018. “The Model Performance Mismatch Problem (and What to Do about It).” Machine Learning Mastery. https://machinelearningmastery.com/the-model-performance-mismatch-problem/.
Burnside, Elizabeth S., Jessie Davis, Jagpreet Chhatwal, Oguzhan Alagoz, Mary J. Lindstrom, Berta M. Geller, Benjamin Littenberg, Katherine A. Shaffer, Charles E. Kahn, and C. David Page. 2009. “Probabilistic Computer Model Developed from Clinical Data in National Mammography Database Format to Classify Mammographic Findings.” Radiology 251 (3): 663–72.
Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 1721–30. New York: ACM Press. https://doi.org/10.1145/2783258.2788613.
Christensen, Hannah. 2015. “Banking on better forecasts: the new maths of weather prediction.” The Guardian, 8 Jan 2015. https://www.theguardian.com/science/alexs-adventures-in-numberland/2015/jan/08/banking-forecasts-maths-weather-prediction-stochastic-processes.
Eykholt, Kevin, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramèr, Atul Prakash, Tadayoshi Kohno, and Dawn Song. 2018. “Physical Adversarial Examples for Object Detectors.” 12th USENIX Workshop on Offensive Technologies (WOOT 18).
Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. “A Survey of Methods for Explaining Black Box Models.” ACM Computing Surveys 51 (5): 1–42.
Halevy, Alon, Peter Norvig, and Fernando Pereira. 2009. “The Unreasonable Effectiveness of Data.” IEEE Intelligent Systems 24 (2).
Harley, Adam W. 2015. “An Interactive Node-Link Visualization of Convolutional Neural Networks.” In Advances in Visual Computing, edited by George Bebis et al., 867–77. Lecture Notes in Computer Science. Cham: Springer International Publishing.
“How to Prevent Discriminatory Outcomes in Machine Learning.” 2018. White Paper from the Global Future Council on Human Rights 2016–2018, World Economic Forum. https://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning.
Huang, Xuedong, James Baker, and Raj Reddy. 2014. “A Historical Perspective of Speech Recognition.” Communications of the ACM 57 (1): 94–103.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–1205.
Lehman, Constance, Robert Wellman, Diana Buist, Karl Kerlikowske, Anna Tosteson, and Diana Miglioretti. 2015. “Diagnostic Accuracy of Digital Screening Mammography with and without Computer-Aided Detection.” JAMA Intern Med 175 (11): 1828–1837.
Levenson, Richard M., Elizabeth A. Krupinski, Victor M. Navarro, and Edward A. Wasserman. 2015. “Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images.” PLoS One, November 18, 2015. https://doi.org/10.1371/journal.pone.0141357.
Lipton, Zachary. 2018. “The Mythos of Model Interpretability.” ACM Queue 61 (10): 36–43.
Liu, Xiaoxuan et al. 2019. “A Comparison of Deep Learning Performance against Health-Care Professionals in Detecting Diseases from Medical Imaging: a Systematic Review and Meta-Analysis.” Lancet Digital Health 1 (6): e271–97. https://www.sciencedirect.com/science/article/pii/S2589750019301232.
Mahjoubi, Mounir. 2018. “Assemblée nationale, XVe législature. Session ordinaire de 2017–2018.” Compte rendu intégral, Deuxième séance du mercredi 07 février 2018. http://www.assemblee-nationale.fr/15/cri/2017-2018/20180137.asp.
Metz, Cade. 2016. “Artificial Intelligence Is Setting Up the Internet for a Huge Clash with Europe.” Wired, July 11, 2016. https://www.wired.com/2016/07/artificial-intelligence-setting-internet-huge-clash-europe/.
O’Neil, Cathy. 2016. Weapons of Math Destruction. New York: Crown.
Peng, Tony. 2018. “2018 in review: 10 AI failures.” Medium, December 10, 2018. https://medium.com/syncedreview/2018-in-review-10-ai-failures-c18faadf5983.
Qureshi, A. I., M. Z. Memon, G. Vazquez, and M. F. Suri. 2009. “Cat ownership and the Risk of Fatal Cardiovascular Diseases. Results from the Second National Health and Nutrition Examination Study Mortality Follow-up Study.” Journal of Vascular and Interventional Neurology 2 (1): 132–5. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3317329.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), 1135–1144. New York: ACM Press.
Schmidt, R. A. and R. M. Nishikawa. 1995. “Clinical Use of Digital Mammography: the Present and the Prospects.” Journal of Digital Imaging 8 (1 Suppl 1): 74–9.
Singh, Vivek Kumar et al. 2020. “Breast Tumor Segmentation and Shape Classification in Mammograms Using Generative Adversarial and Convolutional Neural Network.” Expert Systems with Applications 139.
Su, Jiawei, Danilo Vasconcellos Vargas, and Kouichi Sakurai. 2019. “One Pixel Attack for Fooling Deep Neural Networks.” IEEE Transactions on Evolutionary Computation 23 (5): 828–841.
Taulli, Tom. 2019. “How Bias Distorts AI (Artificial Intelligence).” Forbes, August 4, 2019. https://www.forbes.com/sites/tomtaulli/2019/08/04/bias-the-silent-killer-of-ai-artificial-intelligence/#1cc6f35d7d87.
Tugend, Alina. 2019. “The Bias Embedded in Tech.” The New York Times, June 17, 2019, section F, 10.
Twain, Mark. 1883. Life on the Mississippi. Boston: J. R. Osgood & Co.
Walsh, Fergus. 2020. “AI ‘outperforms’ doctors diagnosing breast cancer.” BBC News, January 2, 2020. https://www.bbc.com/news/health-50857759.
Wilks, Daniel S. 2008. Review of Empirical Methods in Short-Term Climate Prediction, by Huug van den Dool. Bulletin of the American Meteorological Society 89 (6): 887–88.
lucic-towards-2021 ---- Chapter 13 Towards a Chicago place name dataset: From back-of-the-book index to a labeled dataset
Ana Lucic, University of Illinois
John Shanahan, DePaul University
Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 13

Introduction

Reading Chicago Reading¹ is a grant-supported digital humanities project that takes as its object the “One Book One Chicago” (OBOC) program² of the Chicago Public Library. Since fall 2001, One Book One Chicago has fostered community through reading and discussion. On its “Big Read” website, the Library of Congress includes information about One Book programs around the United States,³ and the American Library Association (ALA) also provides materials with which a library can build its own One Book program and, in this way, bring members of its community together in a conversation.⁴ While community reading programs are not a new phenomenon and exist in various formats and sizes, the One Book One Chicago program is notable because of its size (the Chicago Public Library has 81 local branches) as well as its history (the program has been in continual existence for nearly 20 years). Although relatively common, book clubs and community-based reading programs are not regularly assessed as other library programming components are, nor are they subjects of long-term quantitative study.

¹ Reading Chicago Reading project (https://dh.depaul.press/reading-chicago/) gratefully acknowledges the support of the National Endowment for the Humanities Office of Digital Humanities, HathiTrust, and Lyrasis.
² See https://www.chipublib.org/one-book-one-chicago/.
³ See http://read.gov/resources/.
⁴ See http://www.ala.org/tools/programming/onebook.
The following research questions have been guiding the Reading Chicago Reading project so far: can we predict the future circulation of a book using a predictive model based on prior circulation, community demographics, and text characteristics? How did different neighborhoods in a diverse but also segregated city respond to particular book choices? Have certain books been more popular than others around the city as measured by branch-level circulation, and can these changes in checkout totals be correlated with CPL outreach work? A related question is the focus of this paper: by associating place names with sentiment scores in Chicago-themed OBOC books, what trends emerge from spatial analysis? Results are still in progress and will be forthcoming in future papers. In the meantime, exploration of these questions, and our attempt to find solutions for some of them, enables us to reflect on some innovative services that libraries can offer. We will discuss this possibility in the last section of this paper.

Chicago as a place name

Thus far, the Reading Chicago Reading project has focused the bulk of its analysis on seven recent OBOC book selections and their respective “seasons” of public outreach programming:
• Fall of 2011: Saul Bellow’s The Adventures of Augie March
• Spring of 2012: Yiyun Li’s Gold Boy, Emerald Girl
• Fall of 2012: Markus Zusak’s The Book Thief
• 2013–2014: Isabel Wilkerson’s The Warmth of Other Suns
• 2014–2015: Michael Chabon’s The Amazing Adventures of Kavalier and Clay
• 2015–2016: Thomas Dyja’s The Third Coast
• 2016–2017: Barbara Kingsolver’s Animal, Vegetable, Miracle: A Year of Food Life

All of the works listed above, spanning categories of fiction and non-fiction, are still in copyright.
Of the seven works, three were categorized as Chicago-themed because they take place in the Chicago area in whole or in substantial part: Saul Bellow’s The Adventures of Augie March, Isabel Wilkerson’s The Warmth of Other Suns, and Thomas Dyja’s The Third Coast.

As part of ongoing work of the Reading Chicago Reading project, we used the secure data portal of the HathiTrust Research Consortium to access and pre-process the in-copyright novels in our set. The HathiTrust research portal permits the extraction of non-consumptive features of the works included in the digital library, even those that are still under copyright. Non-consumptive features do not violate copyright restrictions as they do not allow the regular reading (“consumption”) or digital reconstruction of the full work in question. An example of a non-consumptive feature is the part of speech information extracted in aggregate with or without connection to its source words. Location words (i.e. place names) in the text are another example of a non-consumptive feature as long as we do not aim to extract locations with the surrounding context: that is, while the extraction of a location word alone from a work under copyright will not violate copyright law, the extraction of the location word with its surrounding context (a fixed size “window” of words that surrounds the location word) might do so. Similarly, the sentiment of a sentence also falls under the category of a “non-consumptive” feature as long as we do not extract both the entire sentence and its sentiment score. Using these methods, it was possible to utilize the HathiTrust research portal to access and also extract the location words as well as sentiment of individual sentences from copyrighted works. As later paragraphs will reveal, however, we also needed to verify the accuracy of these extractions, which was done manually by checking the extracted references against the actual text of the work.
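The distinction between consumptive and non-consumptive extraction can be made concrete with a small sketch. The Python below is an illustration only, not the project's pipeline: the gazetteer `PLACE_NAMES`, the word lists `POSITIVE` and `NEGATIVE`, the sample text, and the scorer `toy_sentiment` are all hypothetical stand-ins (the project used named entity recognizers and a sentiment analyzer inside the HathiTrust portal). What it shows is the shape of the output: a place name, a sentence position, and a sentence-level score, with the sentence text itself never leaving the function.

```python
import re
from collections import Counter

# Hypothetical gazetteer and sentiment lexicon, invented for illustration.
PLACE_NAMES = ["Hyde Park", "State Street", "Wrigley Field"]
POSITIVE = {"bright", "lovely", "proud"}
NEGATIVE = {"grim", "crowded", "bleak"}


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_sentiment(sentence):
    """Stand-in scorer: positive lexicon hits minus negative lexicon hits."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def non_consumptive_features(text):
    """Return (place name, sentence index, score) triples.

    The sentence text is used internally but is never part of the output,
    so the surrounding context cannot be reconstructed from the result.
    """
    features = []
    for index, sentence in enumerate(split_sentences(text)):
        score = toy_sentiment(sentence)
        for place in PLACE_NAMES:
            if place in sentence:
                features.append((place, index, score))
    return features


# Invented sample text, not drawn from any OBOC book.
sample = (
    "Hyde Park looked bright and lovely that morning. "
    "State Street was grim and crowded. "
    "Nobody spoke on the train."
)
features = non_consumptive_features(sample)
place_counts = Counter(place for place, _, _ in features)
print(features)
print(place_counts)
```

Because each place name simply takes on the score of the sentence that contains it, the accuracy of such output still has to be checked against the source text, as the manual verification described above does.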
This paper arises from the finding that the three OBOC books that are set largely in or are about Chicago circulated differently than the OBOC books that are not (i.e., Markus Zusak’s The Book Thief, Yiyun Li’s Gold Boy, Emerald Girl, Barbara Kingsolver’s Animal, Vegetable, Miracle, and Michael Chabon’s The Amazing Adventures of Kavalier and Clay). Since one of the findings was that some CPL branches had higher circulation for “Chicago” OBOC books than others in the program, we wanted to (1) determine which place names were featured in the three books and (2) quantify and examine the sentiment associated with these places. Although recognizing a well-defined place name in a text by automated means is no longer a difficult task thanks to the development of named entity recognizers such as the Stanford Named Entity Recognizer,⁵ OpenNLP,⁶ spaCy,⁷ and NLTK,⁸ recognizing whether a place name is a reference to a Chicago location is a harder task. If Chicago is the setting or one of the main topics of the book then we can assume that a number of locations mentioned will also be Chicago place names. However, if information about the topicality or locality of the book is not known in advance or if the plot in the book moves from location to location, then the task of verifying through automated methods whether a place name is a Chicago location is much harder.

With the help of LinkedGeoData⁹ we were able to obtain all of the Chicago place names identified by volunteers through the OpenStreetMap project¹⁰ and then download a listing that included Chicago buildings, theaters, restaurants, streets, and other prominent places. While this is very useful, we also realized that we were missing historical Chicago place names with this approach. At the same time, the way that place names are represented in a text will likely not always correspond to the way a place name is formally represented in a dictionary, database, or knowledge graph.
For example, a sentence might simply use an anaphoric reference such as “that building” or “her home” instead of directly naming the entity known from other sentences. Moreover, there were many examples of generic place names: how many cities in the United States have a State Street, a Madison Street, or a 1st Avenue, and the like? A further hindrance was determining the type of place names we wanted to identify and collect from the text’s total set of location word tokens: it soon became obvious that for the purposes of visualizing a place name on the map, general references to Chicago went beyond the scope of the maps we wanted to create. We became more interested in tracking references to specific Chicago place names that included buildings (historical and present), named areas of the city, monuments, streets, theatres, restaurants, and the like. Given that our total dataset for this task comprised just three books, we were able to manually sift through the automatically identified place names and verify whether they were indeed a Chicago place name or not.

Figure 13.1: Mapping place names associated with positive (top row) and very negative (bottom row) sentiment extracted from three OBOC books.

⁵ See https://nlp.stanford.edu/software/CRF-NER.html.
⁶ See https://opennlp.apache.org/.
⁷ See https://spacy.io/.
⁸ See https://www.nltk.org/book/ch07.html.
⁹ See http://linkedgeodata.org/About.
¹⁰ See https://www.openstreetmap.org/.

We also established the sentiment of each location-bearing sentence in the three books using the Stanford Sentiment Analyzer.¹¹ Our guiding principle was that specific place(s) mentioned in the sentence “inherit” the sentiment score of the entire sentence.
This principle may not always be true, but our manual inspection of the sentiment assigned to sentences, and therefore to locations mentioned in the sentences, established that this was a fairly accurate estimate: the sentiment score of the entire sentence is at the very least connected to or “resonates” with the individual components of the sentence including place names. While we did examine some samples, we did not conduct a qualitative analysis of the accuracy of the sentiment scores assigned to the corpus. Figure 13.1 documents an example of the results of our effort to integrate place names with the sentiment of the sentence.

¹¹ See https://nlp.stanford.edu/sentiment/.

Particularly notable in Figure 13.1 is The Third Coast (right column), which shows a concentration of positively-associated Chicago place names in the northern parts of the city along the shore of Lake Michigan. Negative sentiment, by contrast, appears to be more concentrated in the central part of Chicago and also in the southern parts of the city.

The place names extracted from our three Chicago-setting OBOC books allowed us to focus on particular areas of the city such as Hyde Park on the South Side, which is mentioned in each of them. Larger circles correspond to a greater number of sentences that mention Hyde Park and are associated with a negative sentiment in both The Adventures of Augie March and The Warmth of Other Suns. As the maps in figure 13.2 indicate, on the other hand, The Third Coast features sentences in which Hyde Park is mentioned in both positive and negative contexts.

Figure 13.2: Mapping of sentences that feature “Hyde Park,” and their sentiment, from three OBOC program books.

These results prompt us to continue with this line of research and to procure a larger “control” set of texts with Chicago place names and sentiment scores.
This would allow us to focus on specific places such as “Wrigley Field” or the once-famous but no longer existing “Mecca” apartment building (which stood at the intersection of 34th and State Street on the South Side and was immortalized in a 1968 poetry collection by Gwendolyn Brooks). With a robust place name data set, we could analyze the context in which these place names were mentioned in other literature, in contemporary or historical newspapers (Chicago Tribune, Chicago Sun-Times, Chicago Defender), or in library and archival materials. Promising contextual elements would include the sentiment associated with the place name.

Our interest in creating a dataset of Chicago place names extracted from literature led us to The Chicago of Fiction, a vast annotated bibliography by James A. Kaser. Published in 2011, this work contains entries on more than 1,200 works published between 1852 and 1980 that feature Chicago. Kaser’s book contains several indexes that can serve as sources of labeled data or instances in which Chicago locations are mentioned. Although we are still determining how many of the titles included in the annotated bibliography already exist in digital format or are accessible through the HathiTrust digital library, it is likely that a subset of the total can be accessed electronically. Even if the books do not exist in electronic format presently, it is still possible to use the index as a source of already-labeled data for Chicago place names. We anticipate that such a dataset would be of interest to researchers in Urban Studies, Literature, History, and Geography.
A sufficiently large number of sentences featuring Chicago place names would enable us to proceed in the direction of a Chicago place name recognizer that can “learn” Chicago context or examine how much context is sufficient to establish whether, for instance, a “Madison Street” place name in a text is located in Chicago or elsewhere.

How do libraries innovate? From print index to labeled data

Over the last decade, libraries have pioneered services related to the development and preservation of digital scholarship projects. Librarians frequently assist faculty and students with the development of digital humanities and digital scholarship projects. They point patrons to resources and portals where they can find data and help with licensing. Librarians also procure datasets, and some perform data cleaning and pre-processing tasks. And yet it is still not that common for librarians to participate in the creation of a dataset. A relatively recent initiative, however, Collections as Data,¹² directly tackles the issue of treating research, library, and cultural heritage collections as data and providing access to them. This ongoing initiative aims to create 12 projects that can serve as a model to other libraries for making collections accessible as data.

The data that undergird the mechanisms of library workings (circulation records for physical and digital objects, metadata records, and the like) are not commonly available as datasets open to machine learning tasks. If they were, not only could libraries refer others to the already created and annotated physical and digital objects, but they could also participate in creating objects that are local to their settings. Creation and curation of such datasets could in turn help establish new relationships between area libraries and local communities. One can imagine a “data challenge,” for instance, in which libraries assemble a community by building a dataset relevant to that community.
Such an effort would need to be preceded by assessment of the data needs and interests of that particular community. In the case of a Chicago place name dataset challenge, efforts could revolve around local communities adding sentences to the dataset from literary sources. A second step might involve organizing a crowdsourced data challenge to build a place name recognizer model (e.g. Chicago place name recognizer model) based on the sentences gathered. One can also imagine turning metadata records into curated datasets that are shared with local communities and with teachers and university lecturers for use in the classroom. Once a dataset is built, scenarios can be invented for using it. This kind of work invites conversations with faculty members about their needs and about potential datasets that would be of particular interest. Creation of datasets based on unique materials at their disposal will enrich the palette of services already offered by libraries.

¹² See https://collectionsasdata.github.io/part2whole/.

One of the main goals of the Reading Chicago Reading project was the creation of a model that can predict the circulation of a One Book One Chicago program book selection given parameters such as prior circulation for the book, its text characteristics, and the geographical locality of the work. We are not aware of other predictive models that integrate circulation records with text features extracted from the books in this way. Given that circulation records are not commonly integrated with other data sources when they are analyzed, linking different data sources with circulation records is another challenging opportunity that this paper envisions. Ultimately, libraries can play a dynamic role in both managing and creating data and datasets that can be shared with the members of local communities.
Using back-of-the-book indexes as a source of labeled place name data is an approach that we have begun to prototype but that still requires further exploration and troubleshooting. While organizing a data challenge takes a lot of effort, a data challenge can be an effective way of reaching out to one’s local community and identifying their data needs. To this end, we aim to make freely available our curated list of sentences and associated sentiment scores for Chicago place names in the three OBOC selections centered on Chicago. We will invite scholars and the general public to add more Chicago location sentences extracted from other literature. Our end goal is a labeled training dataset for the creation of a Chicago place name recognizer, which, we hope, will enable new avenues of research.

References

American Library Association. n.d. “One Book One Community.” Programming & Exhibitions (website). Accessed May 31, 2020. http://www.ala.org/tools/programming/onebook.
Bird, Steven, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. Sebastopol, CA: O’Reilly Media Inc.
Chicago Public Library. n.d. “One Book One Chicago.” Accessed May 31, 2020. https://www.chipublib.org/one-book-one-chicago/.
“Collections as Data: Part to Whole.” n.d. Accessed May 31, 2020. https://collectionsasdata.github.io/part2whole/.
Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005. “Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 363–370. https://www.aclweb.org/anthology/P05-1045/.
HathiTrust Digital Library. n.d. Accessed May 31, 2020. https://www.hathitrust.org/.
Kaser, James A. 2011. The Chicago of Fiction: A Resource Guide. Lanham: Scarecrow Press.
Library of Congress. “Local/Community Resources.” n.d. Read.gov. Accessed May 31, 2020. http://read.gov/resources/.
LinkedGeoData. “About / News.” n.d. Accessed May 31, 2020. http://linkedgeodata.org/About.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. “The Stanford CoreNLP Natural Language Processing Toolkit.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60. https://www.aclweb.org/anthology/P14-5010/.
OpenStreetMap. n.d. Accessed May 31, 2020. https://www.openstreetmap.org/.
Reading Chicago Reading. “About Reading Chicago Reading.” n.d. Accessed May 31, 2020. https://dh.depaul.press/reading-chicago/about/.
maceli-what-2015 ---- What Technology Skills Do Developers Need? A Text Analysis of Job Listings in Library and Information Science (LIS) from Jobs.code4lib.org.
Monica Maceli
INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2015

ABSTRACT

Technology plays an indisputably vital role in library and information science (LIS) work; this rapidly moving landscape can create challenges for practitioners and educators seeking to keep pace with such change. In pursuit of building our understanding of currently sought technology competencies in developer-oriented positions within LIS, this paper reports the results of a text analysis of a large collection of job listings culled from the Code4lib jobs website. Beginning more than a decade ago as a popular mailing list covering the intersection of technology and library work, the Code4lib organization's current offerings include a website that collects and organizes LIS-related technology job listings. The results of the text analysis of this dataset suggest the currently vital technology skills and concepts that existing and aspiring practitioners may target in their continuing education as developers.

INTRODUCTION

For those seeking employment in a technology-intensive position within library and information science (LIS), the number and variation of technology skills required can be daunting. The need to understand common technology job requirements is relevant to current students positioning themselves to begin a career within LIS, those currently in the field that wish to enhance their technology skills, and LIS educators. The aim of this short paper is to highlight the skills and combinations of skills currently sought by LIS employers in North America through textual analysis of job listings.
Previous research in this area explored job listings through various perspectives, from categorizing titles to interviewing employers;¹,² the approach taken in this study contributes a new perspective to this ongoing and highly necessary work. This research report seeks a further understanding of the following research questions:
• What are the most common job titles and skills sought in technology-focused LIS positions?
• What technology skills are sought in combination?
• What implications do these findings have for aspiring and current LIS practitioners interested in developer positions?

As detailed in the following research method section, this study addresses these questions through textual analysis of relevant job listings from a novel dataset: the job listings from the Code4lib jobs website (http://jobs.code4lib.org/). Code4lib began more than a decade ago as an electronic discussion list for topics around the intersection of libraries and technology.³ Over time, the Code4lib organization expanded to an annual conference in the United States, the Code4Lib Journal, and most relevant to this work, an associated jobs website that highlights jobs culled from both the discussion list and other job-related sources. Figure 1 illustrates the home page of the Code4lib jobs website; the page presents job listings and associated tags, with the tags facilitating navigation and viewing of other related positions. Users may also view positions geographically or by employer.

Monica Maceli (mmaceli@pratt.edu) is Assistant Professor, School of Information and Library Science, Pratt Institute, New York.
WHAT TECHNOLOGY SKILLS DO DEVELOPERS NEED? | MACELI doi: 10.6017/ital.v34i3.5893

Figure 1. Homepage of the code4lib Jobs Website, Displaying Most-Recently Posted Jobs and the Associated Tags.⁴

In addition to the visible user interface for job exploration, the website includes software to gather the job listings from a variety of sources.
The website incorporates jobs posted to the Code4lib discussion list, American Library Association, Canadian Library Association, Australian Library and Information Association, HigherEd Jobs, Digital Koans, Idealist, and ArchivesGig. This broad incoming set of jobs provides a wide look into new technology-related postings. New job listings are automatically added to a queue to be assessed and tagged by human curators before posting. This allows manual intervention in which a curator assesses whether the job is relevant to technology in the library domain and validates the job listing information and metadata (see figure 2). Curating is done on a volunteer basis, and curators are asked to assess whether the position is relevant to the Code4lib community and whether it is unique, and to ensure that it has an associated employer, set of tags, and descriptive text. Combining software processes and human intervention in the job assessment makes it possible to gather a large number of jobs of high relevance to the Code4lib community.

As mentioned earlier, Code4lib’s origins are in the area of software development and design as applied in LIS contexts. These foci mean that most jobs identified as relevant for inclusion in the Code4lib jobs dataset are oriented toward developer activities. The Code4lib jobs website therefore provides a useful and novel dataset within which to understand current employment opportunities relating to the intersection between technology, particularly developer work, and the LIS field.

Figure 2.
Code4lib Job Curators Interface Where Job Data is Validated and Tags Assigned.5

RESEARCH METHOD

To analyze the job listing data in greater depth, a textual analysis was conducted using the R statistical package, exploring job titles and descriptions.6 First, the job listing data from the most recent complete year (2014) were exported from the database backend of the Code4lib jobs website; this dataset contained 1,135 positions in total. The dataset included the job titles, descriptions, location and employer information, as well as tags associated with the various positions. The text was then cleaned to remove any markup tags or special characters that remained from the scraping of listings. Finally, the tm (text mining) package in R was used to calculate term frequencies and correlations, generate plots, and cluster terms across both job titles and descriptions.7

RESULTS

Job Title Analysis

Of the full set of 1,135 positions, 30 percent were titled as a librarian position; popular specialties included systems librarian and various digital collections and curation-oriented librarian titles. Figures 3 and 4 detail the most common terms used in position titles across librarian and nonlibrarian positions.

Figure 3. Most Common Terms Used in Librarian Position Titles.
Top Title Terms - Librarian Positions (term: count): librarian 345; digital 89; systems 63; services 59; metadata 34; data 29; technologies 25; university 25; technology 23; web 21; electronic 20; resources 20; assistant 18; information 18; emerging 16; scholarship 14; collections 13; library 13; management 13; initiatives 12; sciences 12; cataloging 11; projects 11; research 11; professor 10.

Figure 4. Most Common Terms Used in Nonlibrarian Position Titles.

The most popular job title terms were then clustered using Ward’s agglomerative hierarchical method (dendrogram in figure 5).
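Before turning to the clustering, the cleaning and term-counting steps described in the research method can be illustrated with a minimal sketch. The study itself used the tm package in R; the Python below is not the author's code, and the sample titles, the stop-word list, and the function names `clean` and `term_frequencies` are hypothetical choices made only for illustration.

```python
import re
from collections import Counter

# Hypothetical sample titles; the study's dataset held 1,135 listings.
SAMPLE_TITLES = [
    "Digital Initiatives Librarian",
    "<b>Systems Librarian</b>",
    "Senior Software Developer &amp; Engineer",
    "Web Developer",
    "Metadata Librarian",
]

# A tiny illustrative stop list (an assumption, not the study's list).
STOP_WORDS = {"and", "of", "the", "for"}


def clean(text):
    """Lowercase the text and strip leftover markup tags and HTML entities."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # tags such as <b> ... </b>
    text = re.sub(r"&[a-z]+;", " ", text)  # entities such as &amp;
    return text


def term_frequencies(titles):
    """Count how often each term appears across all cleaned titles."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z]+", clean(title))
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


if __name__ == "__main__":
    for term, n in term_frequencies(SAMPLE_TITLES).most_common(5):
        print(f"{term}: {n}")
```

Counts of this kind are what figures 3 and 4 report; a real analysis would also need decisions about stemming and stop words that the paper does not detail.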
Agglomerative hierarchical clustering, of which Ward’s method is a widely used variant, begins with single-item clusters, then identifies and joins similar clusters until the final stage, in which one large cluster is formed. Commonly used in text analysis, this approach allows the investigator to explore datasets in which the number of clusters is not known before the analysis. The dendrograms generated (e.g., figure 5) allow for visual identification and interpretation of closely related terms representing various common positions, e.g., digital librarian, software engineer, collections management, etc. Given that job titles in listings may include extraneous or infrequent words, such as the organization name, the cluster analysis can provide an additional, more generalized view into common job titles across the full dataset.

Top Title Terms - Non-Librarian Positions (figure 4; term: count): digital 182; developer 141; library 116; manager 90; specialist 86; software 68; web 65; archivist 59; services 59; technology 59; engineer 55; director 52; data 49; systems 49; analyst 40; coordinator 40; information 40; senior 40; metadata 38; administrator 35; lead 34; project 34; head 33; programmer 32; research 24.

Figure 5. Cluster Dendrogram of Terms Used in Job Titles Generated Using Ward's Agglomerative Hierarchical Method.

Tag Analysis

As described earlier, the Code4lib jobs website allows curators to validate and tag jobs before listing. The word cloud in figure 6 displays the most common tags associated with positions, with XML being the most popular tag (178 occurrences). Figure 7 contains the raw frequency counts of common tags observed.

Figure 6. Word Cloud of Most Frequent Tags Associated with Job Listings by Curators.

Figure 7.
Frequency of Commonly Occurring Tags (frequency of fifty occurrences or more) in the 2014 Job Listings.
Frequency of Tags - 2014 Job Listings (tag: count): XML 178; JavaScript 155; PHP 152; Metadata 142; HTML 125; Archive 119; Cascading Style Sheets 114; Python 106; Integrated library system 101; Java 99; MySQL 90; Dublin Core 90; MARC standards 89; Encoded Archival Description 89; Ruby 86; Drupal 82; Project management 79; SQL 78; Metadata Object Description Standard 70; Data management 70; GNU/Linux 69; Digital preservation 69; Perl 66; Digital library 63; XSL Transformations 62; Resource Description and Access 54; Digital repository 53; World Wide Web 51; Management 51; DSpace 50; METS 50.

Job Description Analysis

The job description text was then analyzed to explore commonly co-occurring technology-related terms, focusing on frequent skills required by employers. Figures 8, 9, and 10 plot term correlations and interconnectedness. Terms with correlation coefficients of 0.3 or higher were chosen for plotting; this commonly used threshold broadly included terms whose positive relationship strength ranged from moderate to strong. Plots were created to express correlations around the top five terms identified from the tags: XML, JavaScript, PHP, metadata, and HTML (frequencies in figure 7). Any number of terms and frequencies can be plotted from such a dataset; to orient the findings closely around the job listing text, a focus on the top terms was chosen. These plots illustrate the broader set of skills related to these vital competencies represented in the job listings.

Figure 8. Job Listing Terms Correlated with “XML” (most popular tag).

Figure 9. Job Listing Terms Correlated with “Javascript” (Second Most Popular Tag), including “PHP” and “HTML” (third and fifth most popular tags, respectively).

Figure 10.
Job Listing Terms Correlated with “Metadata” (fourth most popular tag).

Finally, a series of general plots was created to visualize the broad set of skills necessary in fulfilling the positions of interest to the Code4lib community. As detailed in the title analysis (figures 3 and 4), apart from the generic term librarian, the two most common terms across all job titles were digital and developer. Correlation plots were created to detail the specific skills and requirements commonly sought in positions using such terms. Figure 11 illustrates the terms correlated with the general term of developer, while figure 12 displays terms correlated with digital. The implications of these findings will be discussed further in the following discussion section.

Figure 11. Job Listing Terms Correlated with “Developer.”

Figure 12. Job Listing Terms Correlated with “Digital.”

DISCUSSION

Taken as a whole, the job listing dataset covered a strikingly wide range of positions, from highly technical (e.g., senior-level software engineer or web developer) to managerial and leadership roles (e.g., director or department head roles centered on digital services or emerging technologies). These findings support the suggestions of earlier research,8 which advocated for LIS graduate programs to build their offerings not just in technology skills but also in technology management and decision-making. However, the Code4lib jobs dataset is a one-dimensional view into the employment process and is focused largely on the developer perspective. Additional contextual information, including whether suitable candidates were easily identified and whether the position was successfully filled, would provide a more complete view of the employment process.
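As a concrete illustration of the correlation analysis behind figures 8 through 12, the sketch below computes Pearson correlations between per-document term counts and keeps the terms at or above the 0.3 threshold reported in the results. It is a simplified Python stand-in, not the author's R code; the sample descriptions and the function names `term_vector`, `pearson`, and `correlated_terms` are hypothetical, and treating each listing as one row of raw term counts is an assumption about how the document-term matrix was built.

```python
import math
import re

# Hypothetical job descriptions, for illustration only.
SAMPLE_DESCRIPTIONS = [
    "experience with xml xslt and metadata standards",
    "javascript html css and php web development",
    "xml and xslt transformation of metadata records",
    "php and javascript with html templates",
    "project management and grant writing",
]


def tokenize(doc):
    return re.findall(r"[a-z]+", doc.lower())


def term_vector(term, docs):
    """Occurrences of `term` in each document (one column of a document-term matrix)."""
    return [tokenize(d).count(term) for d in docs]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_terms(target, docs, threshold=0.3):
    """Terms whose correlation with `target` is at least `threshold`, strongest first."""
    vocabulary = {t for d in docs for t in tokenize(d)} - {target}
    target_vec = term_vector(target, docs)
    scores = {}
    for term in vocabulary:
        r = pearson(target_vec, term_vector(term, docs))
        if r >= threshold:
            scores[term] = round(r, 2)
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))


if __name__ == "__main__":
    print(correlated_terms("xml", SAMPLE_DESCRIPTIONS))
```

With a corpus this small, many terms clear the threshold by chance; the 1,135 listings in the study give such correlations far more weight than this toy example can.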
Prior research has indicated that many technology-related positions in LIS are in fact difficult to fill with LIS graduates.9 While LIS graduate programs have made great strides in increasing the number of courses and topics covered that address technology, these improvements may not benefit those already in the field or wishing to shift towards a more technology-focused position. In the common tags and terms analysis, experience with specific LIS applications was relatively infrequently required, with the Drupal content management system a notable exception. More generalizable programming languages or concepts, e.g., Python, relational databases, XML, etc., were favored. As with technology positions outside of the LIS domain, employers likely seek those with the ability to flexibly apply their skills across various tools and platforms. This may also relate to the above challenges in filling such positions with LIS graduates, with the goal of opening up the position to a larger technologist applicant base.

Common web technologies popular in the open-source software often favored by LIS organizations continued to dominate, with a clear preference for candidates well versed in HTML, CSS, JavaScript, and PHP. Relating to these skills, web development and design practices were often intertwined, with positions requesting both developer-oriented skill sets and interface design (e.g., figure 7). Technologies supporting modern web application development and workflow management were evident as well, e.g., common requirements for experience with versioning systems such as Git, popular JavaScript libraries, and development frameworks. Also striking was the richness of the terms correlated with metadata (figure 10), including mention of growing areas of expertise, such as linked data. Interestingly, the general correlation plots expressing the common terms sought around “digital” and “developer” positions were quite varied.
While the developer plot (figure 11 above) provided a richly technical view into common technologies broadly applied in web and software development, the terms correlated around digital were notably less technical (figure 12 above). While there was a clear focus on digital preservation activities and common standards in this area, mention of terms such as “grant” indicated that these positions likely have a broad role. The term digital was frequently observed in librarian job titles, so these roles may be tasked with both technical and administrative work.

Finally, there are inherent difficulties in capturing all jobs relating to technology use in the LIS domain that introduce limitations into this study. While the incoming job feeds attempt to broadly capture recent job posts, it is possible that jobs are missed or overlooked by the job curators. Given the lack of one centralized job-posting source regardless of the field, this is a common challenge for research attempting to assess every job posting. And as mentioned above, there is also a lack of corresponding data as to whether these jobs are successfully filled and what candidate backgrounds are ultimately chosen (i.e., from within or outside of LIS).

CONCLUSION

This assessment of the in-demand technology skills provides students, educators, and information professionals with useful direction in pursuing technology education or strengthening their existing skills. There are myriad technology skills, tools, and concepts in today’s information environments. Reorienting the pursuit of knowledge in this area around current employer requirements can be useful in professional development, new course creation, and course revision.
The constellations of correlated skills presented above (figures 8–12) and popular job tags (figure 7) describe key areas of technology competencies in the diverse areas of expertise presently needed, from web design and development to metadata and digital collection management. In addition to the results presented in this paper, the Code4lib job website provides a continuously current view into recent jobs and related tags; this data can help those in the LIS field orient professional and curricular development toward real employer needs.

ACKNOWLEDGEMENTS

The author would like to thank Ed Summers of the Maryland Institute for Technology in the Humanities for generously providing the jobs.code4lib.org dataset for analysis.

REFERENCES

1. Janie M. Mathews and Harold Pardue, “The Presence of IT Skill Sets in Librarian Position Announcements,” College & Research Libraries 70, no. 3 (2009): 250–57, http://dx.doi.org/10.5860/crl.70.3.250.
2. Vandana Singh and Bharat Mehra, “Strengths and Weaknesses of the Information Technology Curriculum in Library and Information Science Graduate Programs,” Journal of Librarianship & Information Science 45, no. 3 (2013): 219–31, http://dx.doi.org/10.1177/0961000612448206.
3. “About,” Code4lib, accessed January 6, 2014, http://jobs.code4lib.org/about/.
4. “code4lib jobs: all jobs,” Code4lib Jobs, accessed January 12, 2015, http://jobs.code4lib.org/.
5. “code4lib jobs: Curate,” Code4lib Jobs, accessed January 17, 2015, http://jobs.code4lib.org/curate/.
6. R Core Team, R: The R Project for Statistical Computing, 2014, http://www.R-project.org/.
7. Ingo Feinerer and Kurt Hornik, “tm: Text Mining Package,” 2014, http://CRAN.R-project.org/package=tm.
8. Meredith G.
Farkas, “Training Librarians for the Future: Integrating Technology into LIS Education,” in Information Tomorrow: Reflections on Technology and the Future of Public & Academic Libraries, edited by Rachel Singer Gordon, 193–201 (Medford, NJ: Information Today, 2007).
9. Mathews and Pardue, “The Presence of IT Skill Sets in Librarian Position Announcements.”
mathews-think-2012 ---- THINK LIKE A STARTUP: A white paper to inspire library entrepreneurialism. Brian Mathews, Associate Dean for Learning & Outreach at Virginia Tech, www.brianmathews.com. April 2012

We don’t just need change, we need breakthrough, paradigm-shifting, transformative, disruptive ideas.

“Don’t think about better vacuum cleaners, think about cleaner floors.” That’s what I frequently remind my staff during our brainstorming sessions. Get beyond what’s familiar. It’s easy to just focus on making small tweaks to existing services, rather than considering the bigger, bolder, broader possibilities. Vacuum-cleaner-thinking is about asking: “How do we make it better?” A stylish new design? Stronger suction? Larger capacity? Attachments? Quieter motors? It’s all about building better features. And there’s nothing wrong with that. In fact, we should definitely strive for incremental improvement; but we have to go beyond that. We have to exceed our imaginations. We can’t just find new ways of doing the same old things. What we really need right now are breakthrough, paradigm-shifting, transformative, and disruptive ideas.

When searching for “what’s next” we can’t focus on building a better vacuum cleaner, but rather, we need to set our minds to maintaining cleaner floors. That’s the real question at hand. It’s not about adding features, but about new processes. It’s not about modifying the reference desk model or purchasing ebooks. That’s just more of the same, but a little different. Instead we ought to consider a more central question: how can libraries support 21st century learners? Follow that thread and you’ll find transformative change.

We have to face the future boldly. We have to peer upwards and outwards through telescopes, not downwards into microscopes. Over the next decade we need to implement big new ideas; otherwise the role of the library will become marginalized in higher education.
We’ll become the keepers of the campus proxy, rather than information authorities. We’ll become just another campus utility like parking, dining services, and IT rather than the intellectual soul of the community. Now is the time to “zoom out” rather than “zoom in.”1 Let’s not pigeonhole ourselves into finite roles, such as print collections, computer labs, or information literacy. These self-imposed limitations will only ensure our vulnerability and gradual decline. We can’t abide by the dictionary definition of “library.” We can’t stay basically the same and only make small changes. Not only will that constrain the library, but it will also hold back scholarship and learning. With or without us the nature of information, knowledge creation, and content sharing is going to evolve. It’s already happening. Which side of the revolution will we be on?

Dyson offers beautiful state-of-the-art vacuum machines. Their tools are top of the line. But ultimately, it’s still a chore to push a vacuum cleaner around the floor. If we’re talking about transformative ideas then iRobot is the place to focus your attention. Their machines are autonomous. Vacuuming isn’t a chore; it’s just something that happens while you sleep, work, or run errands. Their focus isn’t on providing new hardware, but on providing an ingenious system that cleans surfaces for you. Carpets. Tiles. Hardwood. Pools. The Roomba is a revolution! It’s a new way of thinking. It’s solving a problem in a different way. And that’s what we need right now. We need to reinvent not just what we do, but how we think about it.

This document is intended to inspire transformative thinking using insight into startup culture and innovation methodologies. It’s a collection of talking points intended to stir the entrepreneurial spirit in library leaders at every level.

Is Higher Education Too Big to Fail?
Flip through the headlines and you’ll see that there is much to be concerned about: bankruptcy,2 mergers,3 and closures.4 Even Harvard is reducing library hours and laying off staff.5 While state budgets swing between bad and worse, something else is happening, something more than just financial hardship. Higher education is facing increasing public criticism, and it’s possible (perhaps even inevitable) that the bubble is going to burst.6 Of course it won’t vanish; it will just evolve, like everything does, but traditional educational delivery is about to be disrupted.7 New options are emerging such as StraighterLine, UnCollege, and Udacity.

There is no shortage of doom and gloom scenarios for the academic library.8 I hate adding more to the pile, but let’s face it: we’re vulnerable. While many of the services we provide are indeed essential to the academic mission, nothing set in stone says that they must remain under our domain:

• What if Residence Halls and Student Centers managed learning commons spaces?
• What if the Office of Research managed campus-wide electronic database subscriptions and on-demand access to digital scholarly materials?
• What if Facilities managed the off-campus warehouses where books and other print artifacts are stored?
• What if the majority of scholarly information becomes open? Libraries would no longer need to acquire and control access to materials.
• What if all students are given eBook readers and an annual allotment to purchase the books, articles, and other media necessary for their academic pursuits and cultural interests?9 Collections become personalized, on-demand, instantaneous, and lifelong learning resources.
• What if local museums oversaw special collections and preservation?
• What if graduate assistants, teaching fellows, post-docs, and undergraduate peer leaders managed database training, research assistance, and information literacy instruction?
• What if the Office of Information Technology managed computer labs, proxy access, and lending technology and gadgets?

Some of these are real possibilities over the next twenty years. Colleges and universities are highly competitive environments; everyone wants to expand, but funding is limited. If financial resources continue to decrease (as we expect that they will at public institutions) we’re likely to see some large-scale reorganization and reallocations take place.10 In the future you may still work as a librarian, just not in a traditional physical library. Many of the things we currently do could be assimilated elsewhere. This is why we need to be open to the definition of what an academic library is and focus on what people need it to become. How do we help the individuals at our institutions become more successful? That’s the goal. Our jobs are shifting from doing what we’ve always done very well, to always being on the lookout for new opportunities to advance teaching, learning, service, and research.

Of course, this leads to a lot of controversy. Take collections for example. Several years ago it was impossible to imagine a research library without a significantly massive collection in print. Now I can’t envision a future without the majority of scholarly content being digital. But this isn’t just about books; it’s about libraries redefining what a collection is. As information migrates to digital platforms, let’s imagine what’s next: Google-like search capabilities across millions of books, articles, and multimedia. Mobile computing in everyone’s hands. An iTunes-like interface for quickly acquiring and accessing content anytime, anywhere, on any device. Facebook-like communities for students and scholars to discover, build, publish, and share new knowledge. This is what I’m hearing around campus. This is what students, researchers, and administrations expect us to offer. This is the future they want to see. And if we don’t do it someone else will.

Perhaps our future isn’t centered on access to content, but rather, the usage of it. Maybe there is a greater emphasis on community building, connecting people, engaging students, assisting researchers, and advancing knowledge production? Are academic libraries too important to fail? Maybe. If we remain steeped in nostalgia then I think we’re in trouble. At some point we have to take a leap into the future. Our focus can’t just be about adding features, but about redefining and realigning the role and identity of the academic library. We can’t map our value to outdated needs and practices, but instead, must intertwine ourselves with what’s needed next. It’s time to innovate.

Innovators Wanted

Change is going to be difficult, but the good news is that we know it’s necessary. Glance through the academic library job postings and you’ll see what I mean. Over and over again the word innovation pops up. There is a huge demand for librarians who “think different.” In fact, this theme of change has become a part of our landscape. Change is the new normal. Change is the only constant. Here is a sampling from some current ARL job listings:

• ever-changing environment
• an evolving program of research services
• changing user preferences
• receptive to and fostering new ideas
• nimble
• adaptive
• flexible
• self-starter

We’re looking for people who are comfortable with change. We’re looking for people who can innovate. But is that what we really want? Innovation is messy. It takes many wild ideas that flop in order to find transformative gold. Innovation demands leaders who are persistent and who can challenge the status quo.11 Innovation requires organizations to live in liminality. Is your library ready for disruption? We can’t hire a few creative and improvisational individuals and expect them to deliver new service models if the work culture is not ready for new service models. We can’t expect entrepreneurialism to flourish in a tradition-obsessed environment. We can’t just talk about change; it must be embedded in the actions of employees. Innovation is a team sport that has to be practiced regularly. So how do we get there?

Think Like a Startup

To become innovative organizations we need to emulate innovative organizations. Startups are a perfect model for guiding this change. The media and pop culture provide us with romanticized visions of dorm room ideas becoming billion dollar IPOs. And indeed, that does happen sometimes, but startups are more than rags to riches stories. In concise terms: startups are organizations dedicated to creating something new under conditions of extreme uncertainty.12 This sounds exactly like an academic library to me. Not only are we trying to survive, but we’re also trying to transform our organizations into a viable service for 21st century scholars and learners. Here are a few considerations:

Startups condition us for constant change.
It’s not about what’s-now but about what’s-next. Startups probe for new possibilities. They examine what else needs to be done and then launch a path for that destination. Thinking like a startup positions us to think aspirationally about change. It requires and rewards innovation and creativity. It causes us to constantly reevaluate our organization, purpose, and drive: not against what it is or what it has been, but against what it needs to become.

Startups are about building a platform, not necessarily profit.
Obviously for businesses, financial validation is necessary for survival, but the incubation stage is more about trying to develop good ideas into working models. The film The Social Network provides a dramatic representation of this situation. The co-founders of Facebook ponder its future. One of them wants to monetize right away, while the other insists, “We don’t even know what it is yet.” That’s where we are with the future of academic libraries. We’re still in the early stages of our next evolution. It’s too early to know what libraries will become, but we know they’ll never be the same. Rather than getting bogged down with a definition, the time is ideal for launching new products, programs, and partnerships. The library is not a building, a website, or a person; it is a platform for scholars, students, cultural enthusiasts, and others who want to absorb and advance knowledge.

Startups provide us a framework for action.
They give us a way to analyze what we do, why we do it, and how we might implement change. The lean startup methodology accelerates discovering possibilities, addressing needs, and proposing solutions. Whether launching new initiatives or addressing existing ones, the startup mindset challenges us to test and validate our assumptions.

Lastly, startup is a culture.
It bonds us together. It connects us with our users. It forces us beyond satisfaction metrics and into the difficult but rewarding position of needs-based librarianship. Our profession invests a lot of time measuring how well we did, and hardly any time leap-frogging into what is going to be important in the future. Embracing startup culture is embracing a forward-thinking and future-oriented perspective. What can we create today that will be essential tomorrow?

Most Startups Fail; Learn From the Ones That Didn’t

If most startups fail then why should we follow their lead? Indeed, studies suggest that as many as nine out of ten of these companies fall apart.13 But let’s flip that question and ask: what can we learn from the 10% that succeed? What did they do right? How did they think and act differently? The Lean Startup methodology addresses this perspective.14 Here are a few key insights:

Fail Faster, Fail Smarter
Investing too much time on something that doesn’t work is a common startup mistake. The concepts are not viable, but the founders don’t discover that until it is too late. Instead, build “failure” or adjustment into the process. Seek to validate your ideas early on and then expand, edit, and revise them along the way.

Good Enough is Good Enough to Start
New ideas are exciting. You want to launch them as quickly as possible, but often you might feel “it’s just not ready yet.” That’s a surefire way to inhibit success. Instead, distill the concept into a raw form and then go with it. Get it into others’ hands and see what happens. If you are too hung up on creating policies and procedures, workflows and logistics, wordsmithing and committee debates then your idea doesn’t stand a chance. The project will stall out before you can even find out if it’s worth all the effort. When it’s good enough, go with it. Build upon success. That should be your initial objective. In the business lit they call this the minimum viable product. In Web 2.0 the motto is: everything is beta.

Feed the Feedback Loop
Real estate is driven by location, location, location. With innovation it’s iteration, iteration, iteration. Your outlook should be to grow your idea by constantly building feedback into the developmental process. Let potential customers help nurture the concept to make it better. Don’t just cook it up in your office or meeting rooms; test it in the field.

Pivot Toward Success
You might begin traveling along one path but need to change the route in order to reach the destination. In fact, you might even need to change your destination. Successful startups are attuned to this. Facebook moved beyond just a college-oriented social network. Groupon shifted from social activism to social shopping. Realizing when you may need to pivot your idea in a new direction is critical to cultivating innovation. Let it grow naturally. Don’t force it to become something it doesn’t want to be.

Don’t Get Stuck Following Plan A; Instead Get to A Plan That Works15
Who doesn’t love following a great plan? Crossing off completed tasks. Reaching milestones. Launching on deadline. The problem, though, is that while we can follow a plan perfectly, it doesn’t mean it’s a good plan. We can follow a good plan right off a cliff. We can miss out on new opportunities because we’re too busy following the prescribed strategy. Instead, the goal should be to draft a good Plan A with the intention of it helping us get to plans B, C, and D.

Plant Many Seeds16
Instead of focusing on one perfect idea, try lots of decent ideas instead. See what works and what doesn’t. See what gains interest or has a positive impact. Nurture the projects that show the most potential.

Seize the White Space17
What isn’t being done? What opportunities exist to help people in new ways? Don’t limit your innovation to traditional library boundaries, but consider the entire teaching, learning, and research enterprise. What are the areas of untapped potential? Translation services? 3D Printing? Experimental classrooms? An important local collection? How might we fill a new role and not only expand the library’s portfolio, but also empower people by addressing unmet needs?

Build, Measure, Learn: The Methodology

The lean startup method encourages a phased process right from the start.18 Building, measuring, and learning are integrated into the workflow. Changes to the idea, product, or service are expected and required. This is how it works: you take your initial concept and develop it into a shareable format. Test it and analyze the reaction. You then use this insight to build a better prototype. Repeat the process. Iterate forever. The aim isn’t to develop a finished product, but to continuously evaluate and evolve the concept. This cycle of rapid development keeps you on track for constant improvement instead of clinging to services that are no longer needed. While this process is ideal for software development, it also works well in other areas.
For example, the Newman Library at Virginia Tech experimented by hosting writing center tutors at a table in a commons area. Based upon this successful trial the writing center staff left their former location and set up shop in the library full-time. During the incubation period they tested the concept: location, staffing, hours of operation, publicity, perceived value, etc. The resulting insight enabled the library and writing center to flesh out a successful concept before committing money and floor space.

Thinking like a startup means getting your idea out quickly. Test it, improve it, and then try it again. And then repeat the process, refining the concept along the way.

A variation of this model comes from the user experience domain and argues for shifting the order of steps to Learn, Build, Measure.19 This sequence places a greater emphasis on investing a small amount of time upfront engaging people. After learning about any potential problems, address those needs by either tweaking the idea or pivoting the concept. Next measure behaviors or perceptions and gain insights from actual usage. This will then stimulate another round of learning, building, and measuring.

Perhaps you already employ a form of this model. The point is to make it explicit in your operations. Whether launching a new service, developing a new space, or reviewing current workflows, build this continuous feedback loop into your process. The cycles should be more frequent at first and then taper off, but the important thing is to stay focused on constant improvement: growing and pivoting, expanding and contracting. This practice of constant refinement will challenge us to think about what’s next rather than just clinging to what’s worked before.

The NCSU Libraries have long practiced this good entrepreneurial development.20 Let’s look at two examples: During the early stages of their Commons development the library ran into a funding delay and was consequently left with a large open space.
To bridge the gap, the library provided hundreds of beanbags. This temporary solution was fortuitous because it opened their eyes to what the library needed to become. Students were drawn to the open space and started bringing their own accessories and furniture. Watching the way the area was used, the librarians realized their initial plan was flawed; the way that students used the space was completely different than originally anticipated. NCSU had greatly underestimated the desire for social learning and collaboration. The architect was able to adjust the design, and they eventually constructed an environment more attuned to user preferences. The Libraries have since incorporated user-driven insight to inform all subsequent renovations.

NCSU uses a variation of the Build, Measure, Learn method with many of its online projects as well. New digital collections are often rolled out quickly and then enhancements are added over time, making extensive use of web analytics and tracking on individual interfaces to review how the systems are being used. The NCSU Libraries have increasingly taken the approach of developing their applications in such a way that they generate the kind of data necessary to evaluate how the tool, content, or service is being used, so staff can respond to emerging patterns of use. They can grow the initiative according to what their users need it to become.

D. H. Hill Library Learning Commons
Web Initiatives21

“Entrepreneurship is similar to a science experiment; you’re constantly creating and testing new theses and seeing what works.” That’s the advice from Bob Summer, founder of TechPad, a Blacksburg startup co-working office space.22 Bob has been involved with startups from many dimensions, as a founder as well as a venture investor. At TechPad he is more than a property manager, serving as a mentor to several early-stage companies.
He believes that successful ideas can be boiled down to three essential qualities: usability, feasibility, and value.23 If your concept is lacking one of these attributes, it’s less likely to succeed. Some examples:

A library I worked in wanted to offer a flexible, customizable, commons environment. High-end designer tables and chairs were installed that were lightweight, on casters, and very easy to move. From a cost and square footage standpoint this was feasible to make happen. In terms of value, many students enjoyed being able to create the type of space they needed on the fly. However, usability was questionable. While it was easy to move furniture around, the problem was excessive mobility. Students often left the tables and chairs in arrangements that were chaotic, confusing, and unnavigable.

During finals week I often observed small groups cramming together for their last-minute preparations before tests. I wanted to enhance this, especially for large general classes like biology and calculus. My concept: what if you could study with your friends, and a few others, and have the session facilitated by a teaching assistant? There was great value in this venture because many campus units partnered with us, and students turned out to take advantage of the program. It also had great usability because it worked well. Students discovered the program, found the locations, and commented that it helped them prepare for their tests. The issue was feasibility; it couldn’t scale. Some sessions had over 75 students show up but only enough room for 25. We encountered some reliability issues, too. Some teaching assistants didn’t show up and this caused anger, disappointment, and anxiety among the students. While the concept was good, the library was limited in being able to coordinate and scale to the demand.

Char Booth describes her experience with the implementation of Skype reference at Ohio University.
They experimented with setting up a Skype kiosk in various locations, enabling students to interact with librarians. After several iterations of location, signage, and software configuration, they decided to end the project. It was feasible and usable; from a technical standpoint the tools worked well and cost was minimal. The problem was value. Students just didn’t use the service. Maybe Char’s team was too ahead of the curve; Skype has only recently become a standard communications tool. Or maybe students just didn’t want to video chat with librarians.

All three of these are examples of failure. Not epic, million-dollar catastrophes, but great ideas that just didn’t turn out as planned. And that’s okay. Forgiveness has to be built into the experience. We shouldn’t look at failure as finality, but rather as a test bed to help ideas evolve. The library with furniture chaos built table management into someone’s job responsibilities. This person was able to monitor the pulse of student needs and managed the learning space more effectively. The Exam Cram concept spun off from the library into the dining halls and dorms where it was more manageable and linked to the living-learning community. And the library that experimented with Skype gained insight about user preferences and was able to focus service toward anonymous and mobile platforms like instant messaging and texting.

We have to look at our efforts beyond successes or failures, beyond black and white, and be comfortable with gray. We have to give our ideas enough time and room to grow. And we have to learn when to let them go. Building on the core elements of usability, feasibility, and value greatly increases the likelihood of developing ideas that people will adopt.

Three Essential Qualities of Inspiring Products
Usability. Feasibility. Value.
Iteration. Iteration. Iteration.
Open Floor Plans
Exam Cram
Skype a Librarian24

Too Much Assessment, Not Enough Innovation

We invest a lot of time, money, and effort into metrics. Entire journals and conferences are dedicated to library assessment. There are assessment librarian positions and even assessment departments. It’s obviously something we believe in. But does it work? Does it matter? Does it produce something useful? Does it encourage innovation? Does it nurture breakthrough, paradigm-shifting, transformative ideas? Or put another way: if we stopped all of our assessment programs today would our patrons notice anything different tomorrow?

I’ll admit that I’ve grown skeptical of traditional library assessment. After spending time with startup founders and other entrepreneurs, as well as market researchers from Fortune 500 companies, I think it boils down to one central difference: we’re asking the wrong questions.

The problem with traditional library assessment is that it’s predominantly linked to satisfaction and performance. We’re focused on things like: how many articles are downloaded, how many preprints are in the repository, how many classes do we teach, or how our students feel about the library commons. This is all well and good. Obviously we want to measure and learn from how well our current services, processes, and products are performing. But that’s just the tip of the iceberg. We stop short of discovering real transformative insights. We don’t ask big enough questions. We don’t follow the rabbit down the hole. We don’t break out of our comfort zones. We don’t seek out disruption. We’re too focused on trying to please our users rather than trying to anticipate their unarticulated needs. Assessment isn’t about developing breakthrough ideas. In short: we focus on service sustainability rather than revolutionary or evolutionary new services.
As we think about the direction libraries are heading, the focus can’t remain on how well we’re doing right now, but on where we should be heading. It’s not about making our services incrementally better, but about developing completely new services and service models. Instead of assessment, we need to invest in R&D. We need to infuse the entrepreneurial spirit into our local efforts and into our professional conversations. R&D empowers us to move away from our niche and dabble in new arenas.

Let’s take a look at instruction. Instead of continuing the library-centered perspective of infusing information literacy (something that we feel is critical) into the classroom, we could take a more empathic or user-sensitive approach of understanding the common barriers that students face with their assignments and then build instructional support to address these needs. We could take that even further by imagining the types of tools and services that would enable students to be more successful: project management, resource sharing, discovery tools and filters, processes for synthesizing information, and so forth. This more user-focused (as opposed to information-focused) approach moves us closer to addressing actual needs and further associates the library with user perceptions of scholarly achievement.

The need for R&D isn’t new. Skunk works operations, or independent teams working on secret projects, have been proposed for libraries before.25 But we need more than just “the innovation department”; we need a culture of innovation. We need to encourage everyone at every level to be on the lookout for breakthrough, paradigm-shifting, transformative ideas. Innovation needs to happen out in the open. It needs to be in everyone’s job description.

A Strategic Culture (Instead of a Strategic Plan)

Many library strategic plans read more like to-do lists than entrepreneurial visions.
With all the effort that goes into these documents I’m not sure that we’re getting a good return. You can easily pick out who wrote which parts: there is a section for public services, a section for technical services, something about information literacy, something about open access, something about providing service excellence. These are highly predictable documents. They don’t say: we’re going to develop three big ideas that will shift the way we operate. They don’t say: we’re going to delight our patrons by anticipating their needs. They don’t say: we’re going to transform how scholarship happens. They don’t attempt to dent the universe.

A common strategy for innovation is the “copy-and-paste” method: see what others are doing and then follow suit. Alter the name or modify the template, but largely our ideas come from other libraries. I observed this narrow-sightedness when I led a User Experience (UX) unit.29 Numerous librarians and administrators contacted me to inquire about my position. They remarked that they wanted to develop a similar position but didn’t know exactly what I did. UX was a sexy title back then and many libraries felt the need to jump on the bandwagon without understanding what it was. Sadly, over the last few years the user experience librarian trend has evolved into a website design, usability, and analytics role rather than one focused on improving the patron’s total library experience.

Another example is the information/learning commons model. Here is the formula: lots of computers with software + designer furniture + café + research & tech help = a commons. Similar to UX librarians, every academic library had to have a learning commons over the last decade. We’re a copy-and-paste profession. When I’ve asked librarians about their design principles, critical success factors, or cultural and pedagogical outcomes they look at me strangely. We don’t typically link science and psychology to the spaces we develop.
It’s easier to just select from the Steelcase or Herman Miller catalog without having a narrative behind what’s being developed. Too often our renovations are about refreshing the space, instead of revitalizing the way the organization operates.

Being strategic should be about pushing the boundaries. Instead you are more likely to see something like: “embed information literacy into the curriculum” rather than “build a curriculum to prepare students for 21st century literacies.” Stretching not sustaining. A strategic instructional venture isn’t about just training students how to search database interfaces, but about building their fluency with data, visual, spatial, media, information, and technology literacies. This is how we can advance the role of the library. This is how we transform scholarship. Here are some approaches to get you started:

Academic Librarianship by Design. Steven Bell and John Shank adapted the IDEO design-thinking method for the library environment. Innovation is a process: understand, observe, visualize, evaluate, refine, and implement. They argue for a more holistic approach to librarianship with goals such as improving faculty collaboration, connecting with learners, and taking on leadership to integrate the library into the total learning process.

Nancy Foster and Susan Gibbons (and their staff) experimented with ethnographic techniques as a means of better understanding their student population. Anthropological methods of observation and community study have blossomed in our field. This book reflects on involving library personnel in the process.

Joseph Michelli provides insight into what propelled Starbucks to turn ordinary into extraordinary experiences. His vision is based on the process of making a personal connection with people through a framework based on connecting, discovering, and responding. This transforms patrons into people and makes library usage personal.
By focusing on relationship building instead of service excellence, organizations can uncover new needs and be in position to make a stronger impact.

Academic Librarianship by Design26
Studying Students27
The Starbucks Experience28

It’s not about books migrating from print to digital.

Xerox provides us with a great example of strategic thinking.30 After dominating the marketplace with photocopiers and printers, they realized they needed to change. The rise of digital communications was impacting their core business, and instead of just building better hardware they expanded their identity. Xerox evolved from being a photocopy company to one that emphasizes business support services. They developed new areas such as document management, IT outsourcing, HR and accounting support, and data entry. They redefined themselves not by better document reproduction, but by becoming an integral partner in business operations infrastructure.31

We need to undergo a similar transformation. What’s the role for the library beyond providing access to information and a space to study? How can we make an impact on the teaching and learning process? How can we become an integral partner with faculty involved in the business of research? How can we stimulate knowledge production and sharing? These are the important questions that we need to ask. This is the important work that we need to figure out. This is beyond books migrating from print to digital platforms, but rather, it’s about libraries staking a claim in other parts of the scholarly enterprise.

The most vital component to our success and survival is building a culture that inspires a strategic mindset: a culture that embraces and rewards imagination, experimentation, teamwork, and initiative. The best way to do that is to fund it.32 Library administrators should serve as venture capitalists investing in creative concepts that show promise. They should invest in ideas that are usable, feasible, and valuable.
And they should invest in projects that are iterative and adapt to changes along the way. This investment should extend beyond project funding, and also include recruiting and developing talent and skill sets too. Administrators who aspire to be forward-thinking, user-focused, and entrepreneurial should demonstrate to their organizations that they are willing to embrace bold ideas that might not work out as planned.

Startup culture is an attitude. It’s the responsibility of the administration to foster and inspire the entrepreneurial spirit. It’s the role of librarians and staff to push the boundaries, to find what’s next, and to redefine our profession. Libraries need to be a cause, a purpose, and the reason you get out of bed and are excited to get to work.34 Libraries are about people, not books or technology. It’s about the outcome for patrons interacting with everything we do and offer.

If we are seeking breakthrough ideas that change service paradigms, then we need to be ready for disruption. If we’re serious about innovation then we need to go “all in” and can’t only bet on sure things. Entrepreneurialism is a cultural imperative, not something that should only happen in small pockets of your organization. Or as Steve Jobs preached, we need to strive to “dent the universe,” “build the impossible,” and offer “insanely great” services, products, and spaces.35 Until then we’re just building a better vacuum cleaner, rather than building breakthrough ideas.

How innovative is your library?33
Innovators: Experimenting with 3D printing.
Early Adopters: Building visualization services.
Early Majority: Migrating to demand-driven acquisitions.
Late Majority: Offering text reference.
Laggards: Planning a Facebook fan page.

Microscopes & Telescopes

Famed venture capitalist and business writer Guy Kawasaki offers a great metaphor for looking at strategic outlooks: telescopes and microscopes.36 Here is a paraphrase of his description: Microscopes magnify every detail, line item, expenditure, and demand full-blown forecasts. Microscopes are a cry for level-headed thinking, a return to fundamentals, and a “back to basics” approach. Telescopes bring the future closer. They dream up “the next big thing” and seek to change the world. Lots of ideas are tossed around. Some ideas stick and those move forward.

The reality is that you need both perspectives. We can’t focus exclusively on traveling to the future scholarly universe. And at the same time we can’t remain static and nostalgic about what libraries have been. How we manage to pass through this crucible moment will define us.37 This decade before us will shape the future of what academic libraries will become. Change is inevitable and vital. Accepting this reality empowers us. This is change that we have a say in. This is change that we can guide: telescopes and microscopes working to see, plan, and implement the transformation together.

“REAL ARTISTS SHIP!”38

Ideas are the easy part. Coming up with them doesn’t make you an innovator or a game-changer or a change-agent. True innovators get their hands dirty. It means taking ownership of the concept, believing it, advocating for it, fighting for it, shaping it, breathing life into it, and turning it into a reality. If you came up with the idea, then it’s your responsibility to see it through to the end.39 It’s your responsibility to stick it out.

Real entrepreneurs are personally invested. Startup founders are not just in it for fame or fortune, but are driven to develop something new and to make their ideas tangible. The goal is to build something that doesn’t exist and to create something that wasn’t there before that is now absolutely essential.
We in the library world need to feel that way too. That’s the heart and soul of startup culture. That’s what we need to tap into. It’s on our shoulders to find the future. It’s up to us to define what libraries will become. It won’t be easy, but how often do you get to redefine a profession? It’s not the time to do more of the same, arranging the same old blocks in different patterns. We need to change more than the packaging, add more than a shiny new wrapper. This transformation isn’t just about moving collections and services online, it’s about changing the DNA of our organizations.

As Steve Jobs said, “real artists ship.” Real artists get their ideas out there. Real innovators deliver. Real entrepreneurs develop. Real startups launch. This is our time to face the future and redefine what libraries do. What will you invent next? Who will you partner with tomorrow? How will you plant the seeds of entrepreneurialism for the future? The direction academic libraries take is up to us. It’s ours to figure out. So let’s not be satisfied by adding small features, but instead, let’s use our imaginations to dream big and create amazing experiences that transform our users.

Summary

We don’t just need change, we need breakthrough, paradigm-shifting, transformative, disruptive ideas. Startups are organizations dedicated to creating something new under conditions of extreme uncertainty. Now is not the time to find new ways of doing the same old thing. Launching a good idea is always better than not launching an awesome one. Don’t just expand services: solve problems. The library is a platform, not a place, website, or person. Libraries need less assessment and more R&D. Focus on relationship building instead of service excellence and satisfaction. Don’t just copy & paste from other libraries: invent! Grow your ideas: Build, Measure, Learn. Iterate & Prototype. Plant many seeds; nurture the ones that grow. Seize the whitespace.
Good ideas are usable, feasible, and valuable. Give new ideas a place to incubate. Give new ideas enough time to blossom. Give new ideas a way to get funded. Give new ideas the talent they require. Give new ideas room to fail… and then evolve. Give up on a new idea if it just doesn’t work. Innovation happens out in the open, not behind closed doors. Innovation is a team sport. Practice it regularly. Innovation is messy. Innovation is disruptive. Real innovators get their hands dirty. Being strategic is about stretching not sustaining. Stake a claim in other parts of the scholarly enterprise. Build a strategic culture, not a strategic plan. Entrepreneurialism is a cultural imperative, not something that should only happen in small pockets of your organization. Strive to change the profession. Aim for epiphanies.

Notes

1 Jim Collins, Great By Choice: Uncertainty, Chaos, and Luck--Why Some Thrive Despite Them All, 2011.
2 “UNLV faculty warned higher ed system may be forced to declare bankruptcy.” http://www.lvrj.com/news/unlv-faculty-warned-university-system-may-be-forced-to-declare-bankruptcy-116279269.html
3 “University System advances on campus mergers.” http://www.ajc.com/news/university-system-advances-on-1217183.html
4 “Four UCSD libraries to close, consolidate.” http://www.utsandiego.com/news/2011/mar/29/ucsd-libraries-close/
5 “Harvard Libraries Cuts Jobs, Hours.” http://www.thecrimson.com/article/2009/6/26/harvard-libraries-cuts-jobs-hours-harvard/
6 “Our Universities: Why Are They Failing?” http://www.nybooks.com/articles/archives/2011/nov/24/our-universities-why-are-they-failing/
7 Anya Kamenetz, DIY U: Edupunks, Edupreneurs, and the Coming Transformation of Higher Education, 2010.
8 A recent example: “Academic Library Autopsy Report, 2050” http://chronicle.com/article/Academic-Library-Autopsy/125767/
9 Conversation with Steven Bell: http://stevenbell.info/
10 The University of California’s Next Generation Tech Services reports: http://libraries.universityofcalifornia.edu/about/uls/ngts/
11 Malcolm Gladwell, “Creation Myth.” http://www.newyorker.com/reporting/2011/05/16/110516fa_fact_gladwell?currentPage=all
12 Eric Ries, The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses, 2011.
13 Lean Canvas, http://leancanvas.com/
14 Eric Ries, The Lean Startup, 2011.
15 John Mullins & Randy Komisar, Getting to Plan B: Breaking Through to a Better Business Model, 2009.
16 Guy Kawasaki, The Art of the Start: The Time-Tested, Battle-Hardened Guide for Anyone Starting Anything, 2004.
17 A. G. Lafley, Seizing the White Space: Business Model Innovation for Growth and Renewal, 2010.
18 Eric Ries, The Lean Startup, 2011.
19 “More, better, faster: UX design for startups,” http://www.cooper.com/journal/2011/03/more_better_faster_ux_design.html
20 “Building a competitive advantage.” http://americanlibrariesmagazine.org/columns/next-steps/building-competitive-advantage
21 Correspondence with Steve Morris: Head, Digital Library Initiatives and Digital Projects, NCSU.
22 Conversation with Bob Summer, see also http://www.collegiatetimes.com/stories/17767/techpad-opens-in-blacksburg
23 Bob Summer was influenced by Marty Cagan’s book Inspired: How To Create Products Customers Love, 2008.
24 Char Booth, “Hope, Hype and VoIP: Riding the Library Technology Cycle.” http://www.alastore.ala.org/detail.aspx?ID=3037
25 Brian Quinn, “The McDonaldization of Academic Libraries?” College & Research Libraries, May 2000.
26 Steven Bell & John Shank, Academic Librarianship by Design: A Blended Librarian’s Guide to the Tools and Techniques, 2007.
See also The Art of Innovation (Kelley & Littman) and “Spark Innovation Through Empathic Design” HBR (Leonard & Rayport).
27 Nancy Foster & Susan Gibbons, Studying Students: The Undergraduate Research Project at the University of Rochester, 2007. http://docushare.lib.rochester.edu/docushare/dsweb/View/Collection-4436
28 Joseph Michelli, The Starbucks Experience: 5 Principles for Turning Ordinary Into Extraordinary, 2006.
29 Erin Dorney, “The user experience librarian” CRL News, 2009. http://crln.acrl.org/content/70/6/346.full.pdf+html?sid=f29caba6-f126-42da-9bd5-4c595f67da3a
30 “Fresh Copy: How Ursula Burns Reinvented Xerox” Fast Company, 2011. http://www.fastcompany.com/magazine/161/ursula-burns-xerox
31 A cautionary tale about railroads: Levitt, “Marketing Myopia.” HBR 38(4): 45-56.
32 Good example: Microgrants: http://info.lib.uh.edu/about/strategic-directions/microgrants
33 Everett Rogers, Diffusion of Innovations, 5th edition, 2003.
34 Simon Sinek, Start with Why: How Great Leaders Inspire Everyone to Take Action, 2011.
35 Walter Isaacson, Steve Jobs, 2011.
36 Guy Kawasaki, The Art of the Start, 2004.
37 Robert Thomas, Crucibles of Leadership: How to Learn from Experience to Become a Great Leader, 2008.
38 Walter Isaacson, Steve Jobs, 2011.
39 Killer Innovations podcast: http://philmckinney.com/killer-innovations

Paper Layout & Design by Ashley Marlowe

Brian Mathews is an Associate Dean at Virginia Tech. He has previously worked as an Assistant University Librarian at UC Santa Barbara and as User Experience Librarian at Georgia Tech. Brian’s blog, The Ubiquitous Librarian, is hosted by the Chronicle of Higher Education: http://chronicle.com/blognetwork/theubiquitouslibrarian/
morgan-bringing-2021 ---- Chapter 10 Bringing Algorithms and Machine Learning Into Library Collections and Services Eric Lease Morgan University of Notre Dame
Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 10

Seemingly revolutionary changes

At the time of their implementation, some changes in the practice of librarianship were deemed revolutionary, but nowadays some of these same changes are deemed a matter of fact. Take, for example, the catalog. During much of the Middle Ages, a catalog was more akin to a simple acquisitions list. By 1548 the first author, title, subject catalog was created (LOC 2017, 18). These catalogs morphed into books, books which could be mass produced and distributed. But the books were difficult to keep up to date, and they were expensive to print. As a consequence, in the early 1860s, the card catalog was invented by Ezra Abbot, and the catalog eventually became a massive set of drawers (82). Unfortunately, because of the way catalog cards are produced, it is not feasible to assign more than three or four subject headings to any given book. If one does, then the number of catalog cards quickly gets out of hand. In the 1870s, the idea of sharing catalog cards between libraries became common, and the Library of Congress facilitated much of the distribution (LOC 2017, 87). In 1965 and with the advent of computers, the idea of sharing cataloging data as MARC (machine readable cataloging) became prevalent (Crawford 1989, 204). The data structure of a MARC record is indicative of the time. Intended to be distributed on reel-to-reel tape, the MARC record is a sequential data structure designed to be read from beginning to end, complete with checks and balances ensuring the record’s integrity. Despite the apparent flexibility of a digital data structure, the tradition of three or four subject headings per book still holds true.
Nowadays, the data from MARC records is used to fill databases, the databases’ content is indexed, and items from the library collection are located by searching the index. The evolution of the venerable library catalog has spanned centuries, each evolutionary change solving some problems but creating new ones.

With the advent of the Internet, a host of other changes are (still) happening in libraries. Some of them are seen as revolutionary, and only time will tell whether or not these changes will persevere. Examples include but are not limited to:

• the advocacy of alt-metrics and open access publications
• the continuing dichotomy of the virtual library and library as place
• the creation and maintenance of institutional repositories
• the existence of digital scholarship centers
• the increasing tendency to license instead of own content

Many of the traditional roles of libraries are not as important as they used to be. That does not mean the roles are unimportant, just not as important. Like many other professions, librarianship is exploring new ways to remain relevant when many of its core functions are needed by fewer people.

Working smarter, not harder

Beyond automation, librarianship has not exploited computer technology. Despite the fact that libraries have the world of knowledge at their fingertips, libraries do not operate very intelligently, where “intelligently” is an allusion to artificial intelligence.

Let’s enumerate the core functionalities of computers. First of all, computers…compute. They are given some sort of input, assign the input to a variable, apply any number of functions to the variable, and output the result. This process, computing, is akin to solving simple algebraic equations such as the area of a circle or a distance traveled. There are two factors of particular interest here.
First, the input can be as simple as a number or a string (read: “a word”) or the input can be arbitrarily large combinations of both. Examples include:
• 42
• 1776
• xyzzy
• George Washington
• a MARC record
• the circulation history and academic characteristics of an individual
• the full text and bibliographic descriptions of all early American authors

What is really important is the possible scale of a computer’s input. Libraries have not taken advantage of that scale. Imagine how librarianship would change if the profession actively used the full text of its collections to enhance bibliographic description and resulting public service. Imagine how collection policies and patron needs could be better articulated if: 1) students, researchers, or scholars first opted in to have their records analyzed, and 2) the totality of circulation histories and journal usage histories were thoroughly investigated in combination with patron characteristics and data from other libraries.

A second core functionality of computers is their ability to save, organize, and retrieve vast amounts of data. More specifically, computers save “data” (mere numbers and strings). But when the data is given context, such as a number denoted as a date or a string denoted as a name, then the data is transformed into information. An example might include the birth year 1972 and the name of my pet, Blake. Given additional information, which may be compared and contrasted with other information, knowledge can be created (information put to use and understood). For example, Mary, my sister, was born in 1951 and is therefore 21 years older than Blake. Computers excel at saving, organizing, and retrieving data which leads to information and knowledge. The possibility of computers dispensing wisdom (knowledge of a timeless nature) is left for another essay.
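The progression from data to information to knowledge can be made concrete with a few lines of Python. This is only a minimal sketch built on the example values above; the field names (name, birth_year) and the function years_older are invented for illustration.

```python
# Bare data: numbers and strings with no context attached.
data = [1972, "Blake", 1951, "Mary"]

# Information: the same values, each given a context (a field name).
records = [
    {"name": "Blake", "birth_year": 1972},
    {"name": "Mary", "birth_year": 1951},
]


def years_older(older, younger):
    """Compare two pieces of information to derive a new fact."""
    return younger["birth_year"] - older["birth_year"]


# Knowledge: information compared, contrasted, and put to use.
blake, mary = records
difference = years_older(mary, blake)
print(f"{mary['name']} is {difference} years older than {blake['name']}.")
```

Nothing about the list named data says which value is a year and which is a name; only once the values are labeled can they be compared.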
As with the scale of computer input, the library profession has not really exploited computers’ ability to save, organize, and retrieve data; on the whole, the library profession does not understand the concept of a “data structure.” Tab-delimited files, CSV (comma-separated value) files, relational database schemas, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures. Each has its own set of inherent strengths and weaknesses; there is no such thing as “One size fits all.” Through the use of data structures, computers store and retrieve information. Librarianship is about these same kinds of things, yet few librarians would be able to outline the differences between different data structures.

Again, data becomes information when it is given context. In the world of MARC, when a string (one or more “words”) is inserted into the 245 field of a MARC bibliographic record, then the string is denoted as a title. In this case, MARC is a “data structure” because different fields denote different contexts. There are fields for authors, subjects, notes, added entries, etc. This is all very well and good, especially considering that MARC was designed more than fifty years ago. But since then, many more scalable, flexible, and efficient data structures have been designed.

Relational databases are a good example. Relational databases build on a classic data structure known as the “table,” a matrix of rows and columns where each row is a record and each column is a field. Think “spreadsheet.” For example, each row may represent a book, with columns for authors, titles, dates, publishers, etc. The problem comes when a column needs to be repeatable. For example, a book may have multiple authors or, more commonly, multiple subjects. In this case the idea of a table breaks down because it doesn’t make sense to have columns named subject-01, subject-02, and subject-03.
As soon as you do that, you will want subject-04. Relational databases solve this problem. The solution is to first add a “key” (a unique value) to each row. Next, for fields with multiple values, create a new table where one of the columns is the key from the first table and the other column is a value, in this case, a subject heading. There are now two tables and they can be “joined” through the use of the key. Given such a data structure it is possible to add as many subjects as desired to any bibliographic item.

But you say, “MARC can handle multiple subjects.” True, MARC can handle multiple subjects, but underneath, MARC is a data structure designed for when information was disseminated on tape. As such, it is a sequential data structure intended to be read from beginning to end. It is not a random access structure. What’s more, the MARC data structure is really divided into three substructures: 1) the leader, which is always twenty-four characters long, 2) the directory, which denotes where each bibliographic field exists, and 3) the bibliographic section where the bibliographic information is actually stored. It gets more complicated. The first five characters of the leader are expected to be a left-hand, zero-padded integer denoting the length of the record measured in bytes. A typical value may be 01999. Thus, the record is 1999 bytes long. Now, ask yourself, “What is the maximum size of a MARC record?” (With only five digits available, the answer is 99,999 bytes.) Despite the fact that librarianship embraces the idea of MARC, very few librarians really understand the structure of MARC data. MARC is a format for transmitting data from one place to another, not for organization.

Moreover, libraries offer more than bibliographic information. There is information about people and organizations. Information about resource usage. Information about licensing.
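The two-table design described above, a table of books plus a table of subject headings joined through a key, might be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names, column names, and sample rows are hypothetical, and it is not offered as a recommended cataloging schema.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
connection = sqlite3.connect(":memory:")

# First table: one row per book, each with a unique key (book_id).
# Second table: one row per subject heading, pointing back to a book.
connection.executescript("""
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL
    );
    CREATE TABLE subjects (
        book_id INTEGER NOT NULL REFERENCES books (book_id),
        subject TEXT NOT NULL
    );
""")

# Hypothetical sample data, invented for illustration.
connection.execute("INSERT INTO books VALUES (1, 'A Hypothetical Book')")
connection.executemany(
    "INSERT INTO subjects VALUES (?, ?)",
    [
        (1, "Libraries"),
        (1, "Cataloging"),
        (1, "Machine learning"),
        (1, "Metadata"),
        (1, "Information retrieval"),
    ],
)

# "Join" the two tables through the key to list every subject of a book.
rows = connection.execute("""
    SELECT books.title, subjects.subject
    FROM books
    JOIN subjects ON subjects.book_id = books.book_id
    ORDER BY subjects.subject
""").fetchall()

for title, subject in rows:
    print(title, subject, sep="\t")
```

Adding a sixth or a sixtieth subject is just another row in the second table; no subject-04 column is ever needed.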
Information about resources that are not bibliographic, such as images or data sets. Etc. When these types of information present themselves, libraries fall back to the use of simple tables, which are usually not amenable to turning data into information.

There are many different data structures. XML became popular about twenty years ago. Since then JSON has become prevalent. More than twenty years ago the idea of Linked Data was presented. All of these data structures have various strengths and weaknesses. None of them is perfect, and each addresses different needs, but they are all better than MARC when it comes to organizing data. Libraries understand the concept of manifesting data as information, but as a whole, libraries do not manifest the concept using computer technology.

Finally, another core functionality of computers is networking and communication. The advent of the Internet is a relatively recent phenomenon, and the ubiquitous nature of computers combined with other “smart” devices has facilitated literally billions of connections between computers (and people). Consequently the data computed upon and stored in one place can be transmitted almost instantly to another place, and the transmission is an exact copy. Again, like the process of computing and the process of storage, efficient computer communication builds upon itself with unforeseen consequences. For example, who predicted the demise of many centralized information authorities? With the advent of the Internet there is less of a need/desire for travel agents, movie reviewers, or dare I say it, libraries.

Yet again, libraries use the Internet, but do they actually exploit it? How many librarians are able to create a file, put it on the Web, and share the resulting URL? Granted, centralized computing departments and networking administrators put up roadblocks to doing such things, but the sharing of data and information is at the core of librarianship.
Putting a file on the ’Net, even temporarily, is something every librarian ought to know how (and be authorized) to do.

Despite the functionality of computers and their place in libraries over the past fifty to sixty years, computers have mostly been used to automate library tasks. MARC automated the process of printing catalog cards and eventually the creation of “discovery systems.” Libraries have used computers to automate the process of lending materials between themselves as well as to local learners, teachers, and scholars. Libraries use computers to store, organize, preserve, and disseminate the gray literature of our time, and we call these systems “institutional repositories.” In all of these cases, the automation has been a good thing because efficiencies were gained, but the use of computers has not gone far enough nor really evolved. Lending and usage statistics are not routinely harvested nor organized for the purposes of monitoring and predicting library patron needs/desires. The content of institutional repositories is usually born digital, but libraries have not exploited their full text nature nor created services going beyond rudimentary catalogs.

Computers can do so much more for libraries than mere automation. While I will never say computers are “smart,” their fundamental characteristics do appear intelligent, especially when used at scale. The scale of computing has significantly changed in the past ten years, and with this change the concept of “machine learning” has become more feasible. The following sections outline how libraries can go beyond automation, embrace machine learning, and truly evolve their ideas of collections and services.

Machine learning: what it is, possibilities, and use cases

Machine learning is a computing process used to make decisions and predictions.
In the past, computer-aided decision-making and predictions were accomplished by articulating large sets of if-then statements and navigating down decision trees. The applications were extremely domain specific, and they weren’t very scalable. Machine learning turns this process on its head. Instead of navigating down a tree, machine learning takes sets of previously made observations (think “decisions”), identifies patterns and anomalies in the observations, and saves the result as a mathematical model, which is really an n-dimensional array of vectors. Outside observations are then compared to the model and depending on the resulting similarities or differences, decisions or predictions are drawn.

Using such a process, there are really only four different types of machine learning: classification, clustering, regression, and dimension reduction. Classification is a supervised machine learning process used to subdivide a set of observations into smaller sets which have been previously articulated. For example, suppose you had a few categories of restaurants such as American, French, Italian, or Chinese. Given a set of previously classified menus, one could create a model defining each category and then classify new, unseen menus. The classic classification example is the filtering of email. “Is this message ‘spam’ or ‘ham’?” This chapter’s appendix walks a person through the creation of a simplified classification system. It classifies texts based on authorship.

Clustering is almost always an unsupervised machine learning process which also creates smaller sets from a larger one, but clustering is not given a set of previously articulated categories. That is what makes it “unsupervised.” Instead, the categories are created as an end result. Topic modeling is a popular example of clustering.

Regression predicts a numeric value based on sets of independent variables. For example, given independent variables like annual income, education level, size of family, age, gender, religion, and employment status, one might predict how much money a person may spend on a dependent variable such as charity.

Sometimes the number of characteristics of each observation is very large. Many times some of these characteristics do not play a significant role in decision-making or prediction. Dimension reduction is another machine learning process, and it is used to eliminate these less-than-useful characteristics from the observations. This process simplifies classification, clustering, or regression.

Some possible use cases

There are many possible ways to enhance library collections and services through the use of machine learning. I’m not necessarily advocating the implementation of any of the following ideas, but they are possibilities. Each is grouped into the broadest of library functional departments:
• reference and public services
  – given a set of grant proposals, suggest library resources be used in support of the grants
  – given a set of licensed library resources and their usage, suggest other resources for use
  – given a set of previously checked out materials, suggest other materials to be checked out
  – given a set of reference interviews, create a chatbot to supplement reference services
  – given the full text of a set of desirable journal articles, create a search strategy to be applied against any number of bibliographic indexes; answer the proverbial question, “Can you help me find more like this one?”
  – given the full text of articles as well as their bibliographic descriptions, predict and describe the sorts of things a specific journal title accepts or whether a given draft is good enough for publication
  – given the full text of reading materials assigned in a class, suggest library resources to support them
• technical services
  – given a set of multimedia, enumerate characteristics of the media (number of faces, direction of angles, number and types of colors, etc.), and use the results to supplement bibliographic description
  – given a set of previously cataloged items, determine whether or not the cataloging can be improved
  – given full-text content harvested from just about anywhere, analyze the content in terms of natural language processing, and supplement bibliographic description
• collections
  – given circulation histories, articulate more refined circulation patterns, and use the results to refine collection development policies
  – given the full text of sets of theses and dissertations, predict where scholarship at your institution is growing, and use the results to more intelligently build your just-in-case collection; do the same thing with faculty publications

Implementing any of these possible use cases would necessarily be a collaborative effort. Implementation requires an array of expertise. Enumerated in no priority order, this expertise includes: subject/domain expertise (such as cataloging trends, circulation services, collection strategies, etc.), computer programming and data management skills (such as Python, R, relational databases, JSON, etc.), and statistical modeling (an understanding of the strengths and weaknesses of different machine learning algorithms). The team would then need to:
1. articulate and share a common goal for the work
2. amass the data to model
3. employ a feature extraction process (lower case words, extract a value from a database, etc.)
4. vectorize the features
5. create and evaluate the resulting model
6. go to Step #2 until satisfied
7. put the model into practice
8. go to Step #1; this work is never done

For example, to bibliographically connect grant proposals to library resources, try this:
1. use classification to sub-divide each of your bibliographic index descriptions
2. apply the resulting model to the full text of the grants
3. return a percentage score denoting the strength of each resulting classification
4. recommend the use of zero or more bibliographic indexes

To predict scholarship, try this:
1. amass the full text and bibliographic descriptions of all theses and dissertations
2. topic model the full text
3. evaluate the resulting topics
4. go to Step #2 until satisfied
5. augment the model’s matrix of vectors with bibliographic description
6. pivot the matrix on any of the given bibliographic fields
7. plot the results to see possible trends over time, trends within disciplines, etc.
8. use the results to make decisions

The content of the GitHub repository reproduced in this chapter’s appendix describes how to do something very similar in method to the previous example.1

1 See https://github.com/ericleasemorgan/bringing-algorithms.

Some real-world use cases

Here at the University of Notre Dame’s Navari Center for Digital Scholarship, we use machine learning in a number of ways. We cut our teeth on a system called Convocate.2 In this case we obtained a set of literature on the theme of human rights. Half of the set was written by researchers in non-governmental organizations. The other half was written by theologians. While both sets were on the same theme, the language of each was different. An excellent example is the use of the word “child.” In the former set, children were included in documents about fathers and mothers. In the latter set, children often referred to the “Children of God.” Consequently, queries referring to children were often misleading. To rectify this problem, a set of broad themes was articulated, such as Actors, Harms and Violations, Rights and Freedoms, and Principles and Values. We then used topic modeling to subdivide all of the paragraphs of all of the documents into smaller and smaller sets of paragraphs.
We compared the resulting topics to the broad themes, and when we found correlations between the two, we classified the paragraphs accordingly. Because the process required a great deal of human intervention, and thus impeded subsequent updates, this process was not ideal, but we were learning and the resulting index is useful.

On a regular basis we find ourselves using a program called Topic Modeling Tool, which is a GUI/desktop application heavily based on the venerable MALLET suite of software.3 Given a set of plain text files and an integer, Topic Modeling Tool will create a weighted list of latent themes found in a corpus. Each theme is really a list of words which tend to cluster around each other, and these clusters are generated through the use of an algorithm called LDA (Latent Dirichlet Allocation). When it comes to topic modeling, there is no such thing as the correct number of topics. Just as in the traditional process of denoting what a corpus is about, there can be many distinct topics or there can be a few. Moreover, some of the topics may be large and others may be small. When using a topic modeler, it is important to iteratively configure and re-configure the input until the results seem to make sense.

Just like every other machine learning application, Topic Modeling Tool bases its “reasoning” on a matrix of vectors. Each row represents a document, and each column is a topic. At the intersection of a document row and a topic column is a score denoting how much the given document is “about” the calculated topic. It is then possible to sum each topic column and output a pie chart illustrating not only what the topics are, but how much of the corpus is about each topic. Such a chart can be very insightful.

By adding metadata to the matrix of vectors, even more insights can be garnered. Suppose you have a set of plain text files. Suppose also you know the names of the authors of each file.
You can then do topic modeling against your corpus, and when the modeling is complete you can add a new column to the matrix and call it authors. Next, you update the values in the authors column with author names. Finally, you “pivot” the matrix on the authors column to calculate the degree to which each author’s works are “about” the calculated topics. This too can be quite insightful. Suppose you have works by authors A, B, C, and D. Suppose you have calculated topics I, II, III, and IV. By updating the matrix and pivoting the results, you might discover that author A discusses topic I almost exclusively, whereas author B discusses topics I, II, III, and IV in equal parts. This process works for just about any type of metadata: gender, genre, extent, dates, language, etc. What’s more, Topic Modeling Tool makes this process almost trivial. To learn how, see the GitHub repository accompanying this chapter.4

2 See https://convocate.nd.edu.
3 See https://github.com/senderle/topic-modeling-tool for the Topic Modeling Tool. See http://mallet.cs.umass.edu for MALLET.

We have used classification techniques in at least a couple of ways. One project required the classification of press releases. Some press releases are deemed mandatory (declared necessary to publish). Other press releases are considered discretionary (published at the will of a company). The domain expert needed a set of 100,000 press releases classified into either mandatory or discretionary piles. We used a process very similar to the process outlined in this chapter’s Appendix. In the end, the domain expert believes the classification process was 86% correct, and this was good enough for them. In another project, we tried to identify articles about a particular yeast (Cryptococcus neoformans), despite the fact that the articles never mentioned the given yeast.
This project failed because we were unable to generate an accuracy score greater than 70%. This was deemed not good enough.

We are developing a high performance computing system called the Distant Reader, which uses machine learning to do natural language processing against an arbitrarily large volume of text. Given one or more documents of just about any number or type, the Distant Reader will:
1. amass the documents
2. convert the documents into plain text
3. do rudimentary counts and tabulations against the plain text
4. calculate statistically significant keywords against the plain text
5. extract narrative summaries against the plain text
6. use spaCy (a natural language processing library) to classify each and every feature of each and every sentence into parts-of-speech and/or named entities5
7. save the results of Steps #1 through #6 as plain text and tab-delimited files
8. distill the tab-delimited files into an SQLite database
9. create both narrative as well as tabular reports against the database
10. create an archive (.zip file) of everything
11. return the archive to the student, researcher, or scholar

The student, researcher, or scholar can then analyze the contents of the .zip file to get a better understanding of its contents. This analysis (“reading”) ranges from perusing the narrative reports, to using desktop tools to visualize the data, to exploiting command-line tools to investigate the data, to writing software which uses the data as input. The Distant Reader scales to everything between a single scholarly report, hundreds of book-length documents, and thousands of journal articles. Its purpose is to supplement the traditional reading process, and it uses machine learning techniques at its core.
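The document-by-topic matrix and the pivot on authors described earlier in this section can be illustrated without any special software. The following is a minimal sketch in plain Python, not the way Topic Modeling Tool itself is implemented; the scores, file names, and function names are invented, and averaging the scores per author is just one assumed way to summarize how much each author's works are "about" each topic.

```python
from collections import defaultdict

# Hypothetical document-by-topic scores, invented for illustration; a real
# matrix would come from a topic modeler.
topics = ["I", "II", "III", "IV"]
matrix = [
    # (document, author, scores for topics I through IV)
    ("a1.txt", "A", [0.90, 0.05, 0.03, 0.02]),
    ("a2.txt", "A", [0.80, 0.10, 0.05, 0.05]),
    ("b1.txt", "B", [0.25, 0.25, 0.25, 0.25]),
    ("b2.txt", "B", [0.30, 0.20, 0.20, 0.30]),
]


def pivot(rows, size):
    """Average the topic scores of all documents sharing a metadata value."""
    sums = defaultdict(lambda: [0.0] * size)
    counts = defaultdict(int)
    for _document, key, scores in rows:
        counts[key] += 1
        for index, score in enumerate(scores):
            sums[key][index] += score
    return {key: [total / counts[key] for total in sums[key]] for key in sums}


def corpus_share(rows, size):
    """Sum each topic column and express it as a share of the whole corpus."""
    totals = [sum(scores[i] for _d, _k, scores in rows) for i in range(size)]
    grand_total = sum(totals)
    return [total / grand_total for total in totals]


by_author = pivot(matrix, len(topics))
for author, averages in sorted(by_author.items()):
    print(author, [round(value, 2) for value in averages])
print("corpus", [round(value, 2) for value in corpus_share(matrix, len(topics))])
```

The same pivot function works for any metadata column (genre, date, language) placed in the second position of each row; the shares returned by corpus_share are the numbers a pie chart of the corpus would display.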
4 See https://github.com/ericleasemorgan/bringing-algorithms.
5 See https://spacy.io.

Summary and Conclusion

Computers and libraries are a natural fit. They both excel at the collection, organization, and dissemination of data, information, and knowledge. Compared to most professions, the practice of librarianship has used computers for a very long time. But, for the most part, the functionality of computers in libraries has not been fully exploited. Advances in machine learning coupled with the data/information found in libraries present an opportunity for both librarianship and the people whom libraries serve. Machine learning can be used to enhance library collections and services, and with a modest investment of time as well as resources, the profession can make it a reality.

Appendix: Train and Classify

This appendix lists two Python programs. The first (train.py) creates a model for the classification of plain text files. The second (classify.py) uses the output of the first to classify other plain text files. For your convenience, the scripts and some sample data ought to be available in a GitHub repository.6 The purpose of including these two scripts is to help demystify the process of machine learning.

Train

The following Python script is a simple classification training application. Given a file name and a list of directories containing .txt files, this script first reads all of the files’ contents and the names of their directories into sets of data and labels (think “categories”). It then divides the data and labels into training and testing sets. This is a best practice for these types of programs so the models can be evaluated for accuracy. Next, the script counts and tabulates (“vectorizes”) the training data and creates a model using a variation of the Naive Bayes algorithm.
The script then vectorizes the test data, uses the model to classify the test data, and compares the resulting classifications to the originally supplied labels. The result is an accuracy score, and generally speaking, a score greater than 75% is on the road to success. A score of 50% is no better than flipping a coin. Finally, the model is saved to a file for later use.

6 See https://github.com/ericleasemorgan/bringing-algorithms.

```python
# train.py - given a file name and a list of directories
# containing .txt files, create a model for classifying
# similar items

# require the libraries/modules that will do the work
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) < 4 :
    sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <model> <directory> <another directory>\n" )
    quit()

# get the name of the file where the model will be saved
model = sys.argv[ 1 ]

# get the rest of the input, the names of directories to process
directories = []
for i in range( 2, len( sys.argv ) ) : directories.append( sys.argv[ i ] )

# initialize the data to analyze and its associated labels
data   = []
labels = []

# loop through each given directory
for directory in directories :

    # find all the text files and get the directory's name
    files = glob.glob( directory + "/*.txt" )
    label = os.path.basename( directory )

    # process each file
    for file in files :

        # open the file
        with open( file, 'r' ) as handle :

            # add the contents of the file to the data
            data.append( handle.read() )

        # update the list of labels
        labels.append( label )

# divide the data/labels into training sets and testing sets;
# a best practice
data_train, data_test, labels_train, labels_test = train_test_split( data, labels )

# initialize a vectorizer, and then count/tabulate the
# training data
vectorizer = CountVectorizer( stop_words='english' )
data_train = vectorizer.fit_transform( data_train )

# initialize a classification model, and then use Naive Bayes
# to create a model
classifier = MultinomialNB()
classifier.fit( data_train, labels_train )

# count/tabulate the test data, and use the model to classify it
data_test       = vectorizer.transform( data_test )
classifications = classifier.predict( data_test )

# begin to test for accuracy
count = 0

# loop through each test classification
for i in range( len( classifications ) ) :

    # increment, conditionally
    if classifications[ i ] == labels_test[ i ] : count += 1

# calculate and output the accuracy score;
# above 75% begins to achieve success
print ( "Accuracy: %s%% \n" % ( int( ( count * 1.0 ) / len( classifications ) * 100 ) ) )

# save the vectorizer and the classifier (the model)
# for future use, and done
with open( model, 'wb' ) as handle : pickle.dump( ( vectorizer, classifier ), handle )
exit()
```

Classify

The following Python script is a simple classification program. Given the model created by the previous script (train.py) and a directory containing a set of .txt files, this script will output a suggested label (“classification”) and a file name for each file in the given directory. This script automatically classifies a set of plain text files.
```python
# classify.py - given a previously saved classification model and
# a directory of .txt files, classify a set of documents

# require the libraries/modules that will do the work
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) != 3 :
    sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <model> <directory>\n" )
    quit()

# get input; get the model to read and the directory containing
# the .txt files
model     = sys.argv[ 1 ]
directory = sys.argv[ 2 ]

# read the model
with open( model, 'rb' ) as handle :
    ( vectorizer, classifier ) = pickle.load( handle )

# process each .txt file
for file in glob.glob( directory + "/*.txt" ) :

    # open, read, and classify the file
    with open( file, 'r' ) as handle :
        classification = classifier.predict( vectorizer.transform( [ handle.read() ] ) )

    # output the classification and the file's name
    print( "\t".join( ( classification[ 0 ], os.path.basename( file ) ) ) )

# done
exit()
```

References

Crawford, Walt. 1989. MARC for Library Use: Understanding Integrated USMARC. 2nd ed. Boston: G.K. Hall.

LOC (Library of Congress). 2017. The Card Catalog: Books, Cards, and Literary Treasures. San Francisco: Chronicle Books.
narlock-digital-2021 ---- Digital preservation services at digital scholarship centers

The Journal of Academic Librarianship 47 (2021) 102334. Available online 24 February 2021. 0099-1333/© 2021 Elsevier Inc. All rights reserved.

Mikala Narlock a,*, Daniel Johnson b, Julie Vecchio c
a Digital Collection Strategy Librarian, Hesburgh Libraries, University of Notre Dame, United States of America
b English, Film, Television, and Theatre; Digital Humanities Librarian, Hesburgh Libraries, University of Notre Dame, United States of America
c Assistant Director, Navari Family Center for Digital Scholarship, Hesburgh Libraries, University of Notre Dame, United States of America

Keywords: Digital scholarship centers; Digital preservation; Academic libraries; Digital scholarship

Abstract

As academic library support services for digital scholarship activities continue to expand and evolve, large volumes of digital outputs have been created by, and in collaboration with, library and information professionals who are affiliated with digital scholarship centers. Drawing on a literature review and a 2018 pilot study of digital preservation services in digital scholarship centers, we propose future directions for investigation of preservation services for digital scholarship and projects.

Introduction

The proliferation of digital infrastructure, tools, and data sources has facilitated new types of academic exploration and created opportunities for novel collaborations with academic library specialty research support services, such as digital scholarship centers (DSCs) (e.g., Bryson et al., 2011; Johnson & Dehmlow, 2019).
DSCs are described as a “service model in academic libraries that bring faculty and student scholars, technologists, and librarians together to collaboratively develop digital projects supporting scholarship and research” (Tzoc, 2016), and for the purposes of this research, digital scholarship is construed broadly as the use of digital evidence and methods, digital publishing, digital curation and preservation, and digital use and reuse of scholarship, regardless of discipline (Rumsey, 2011). Academic library support for digital scholarship encompasses a broad range of services, including teaching, consultation, outreach, the provision of access to technologies and data sources for creating and sharing new knowledge, and the creation and management of technology-enhanced spaces (e.g., Lippincott, 2017; Locke, 2017). As digital scholarship activities and outputs increase over time, the need for careful planning for the curation and long-term preservation of digital objects and projects is of critical importance (Owens, 2018).

We explore the intersection of academic library digital scholarship centers with digital curation and preservation activities through the lens of a literature review and a 2018 pilot survey, seeking to address the following topics:
1. How do digital scholarship centers provide digital preservation information to their users?
2. What digital preservation support is provided by digital scholarship centers to their users?
3. What kinds of relationships and interactions can we observe between academic libraries, DSCs, and digital preservation activities?

Literature review

The expansive growth of digital scholarship work, along with a concomitant need for data, has resulted in strengthened connections between library and information professionals and digital scholars, especially digital humanists (Johnson & Dehmlow, 2019; Millson-Martula & Gunn, 2017; Sula, 2013).
In particular, digital curation and preservation have been identified as ideal opportunities for collaboration between scholars, librarians, and information professionals, as library organizations tend to focus on lifecycle management with an emphasis on curation and preservation (Lippincott, 2017). While researchers may lack specific training for research data curation or experience with building and applying robust preservation policies, library and information professionals have been developing and utilizing these skills for decades (Poole & Garwood, 2018). Tenopir, Birch, and Allard (2012, 5) argue that there are “powerful reasons for librarians to explore how their academic libraries can better satisfy the needs of researchers in the new data-intensive research atmosphere,” including the curation of research data to facilitate discovery, and advocacy for effective preservation. As Walters and Skinner (2011) note, when “the library embeds the curation and preservation infrastructure and knowledge within its own staffing and digital framework and provides stable, trustworthy, and affordable services to its campus, the library as an institution becomes more secure and influential within its campus setting.” (24) However, the results of Tenopir et al.’s (2012) survey suggest that, while academic libraries and librarians are capable of providing research data services and support, there are often serious limitations in funding, particularly for staffing and repository maintenance. Since then, academic libraries have diverted more resources for research data services (particularly at R1 institutions), staff development, and additional support positions (Tenopir et al., 2019). This increased support coincides with increased collaborative efforts between libraries and DSCs, which academic libraries have leveraged as an opportunity to advocate for their position, funding, and new roles (Cox, 2016). This work has also resulted in increased development of tools to support data and digital project curation, including efforts such as the Preservation Quality Tool (PresQT) and Emulation as a Service Infrastructure (EaaSI), which help harvest and curate data and metadata, and ensure that, regardless of format, data will be accessible into the future, easing the burden on both information professionals and repository managers.
* Corresponding author. E-mail addresses: mnarlock@nd.edu (M. Narlock), djohns27@nd.edu (D. Johnson), jvecchio@nd.edu (J. Vecchio). Journal homepage: www.elsevier.com/locate/jacalib. https://doi.org/10.1016/j.acalib.2021.102334. Received 10 February 2021; Accepted 15 February 2021.
The importance of digital curation, and specifically lifecycle management, has been written about extensively within the context of specific types of disciplines, as well as writ large. In the humanities, digital curation has been supported by grant-funded projects such as the University of Pittsburgh’s “Sustaining DH” NEH Institute for Advanced Topics in the Digital Humanities (https://sites.haa.pitt.edu/sustainabilityinstitute/). The Institute educated librarians and departmental faculty alike on a new “Socio-Technical Sustainability Roadmap,” a framework to assist in “the seemingly daunting task of sustaining … web-based, user-facing, digital humanities project over time” (https://sites.haa.pitt.edu/sustainabilityroadmap/getting-started/). Similar efforts include “The Endings Project,” funded by the Social Sciences and Humanities Research Council of Canada (https://projectendings.github.io/), Katherina Fostano and Laura K. Morreale’s “Digital Documentation Process” for DH scholarship (https://digitalhumanitiesddp.com/), and the Mellon-supported “Digits Project,” which promises to “conduct an environmental scan of the use of software containers in research and publication, as well as a fact-finding mission on the infrastructural needs of scholars who are currently producing non-standard digital research” (https://digits.pub/about/).
Social science data are among the oldest digital media: beginning in the late 1800s, US census data were converted to a digital format for analysis by what were, at the time, brand-new tabulating machines (Gutmann et al., 2009). Text-mining and artificial intelligence technologies available today are further extending the variety of data available for exploration through social science methodologies, shifting “the evidence base of social science” (Walters & Skinner, 2011).
Social science data pose complex and unique challenges for data curation and preservation: documentation may be lacking or inaccessible, data ownership may be in question, data may have rigorous privacy/confidentiality requirements, and data format persistence may be problematic (ICPSR, 2012; Lyle et al., 2014). Repositories, both institutional and disciplinary, are vital to the preservation of social science research assets and outputs, but are bound by their own unique missions and policies. Collaborative projects such as the Data Preservation Alliance for the Social Sciences (Data-PASS: http://www.data-pass.org/) leverage the resources of multiple institutions in support of the identification, acquisition, and curation of social science data that have been deemed “at risk,” whether from legacy research sources or from ongoing or future work (Gutmann et al., 2009). Academic library and information professionals, whether affiliated with DSCs or not, play a variety of critical roles in the preservation of social science data, ranging from acquisition, to educational and outreach services, to hands-on curation work, to name a few (Tammaro et al., 2019; Xia & Wang, 2014).
The ‘hard sciences’ tend to produce data at a larger scale than the social sciences and humanities, especially that which is derived from niche software, tools, and highly advanced equipment. Researchers and information professionals have actively been working to provide persistent and long-term access to research data and other scholarly outputs. Since the early 2000s, librarians and information professionals have been advocating for and documenting research data curation (e.g., Gray et al., 2002), articulating the lifecycle of research data (e.g., Higgins, 2008), and carving space for information professionals to assist in the curation process.
Data curators, discipline experts, and even private companies have developed numerous tools to help scholars and repository managers preserve content and provide consistent access to data and digital objects. The proliferation of disciplinary, institutional, and general repositories for researchers, as well as curatorial tools like wholeTALE, facilitates not only data reuse and reproducibility, but also curation and long-term accessibility to the data. In recent years, the rise of FAIR data (Findable, Accessible, Interoperable, and Reusable; Wilkinson et al., 2016), increasing funder mandates and required data management plans (DMP), and hands-on data sharing workshops and hackathons (e.g., Hildreth & Meyers, 2020) have resulted in an increased awareness around the intricacies of preserving research data and the need to define domain-specific requirements.
Despite the prevalence of digital scholarship activities across academic disciplines, preservation remains a persistent challenge bedeviled by uncertain expectations, uneven work distribution, and inadequate sustainability planning, among other issues. Atkins (2013) found that most organizations, when lacking a dedicated digital preservation program, often left the task of preservation to the library. Li et al. (2020) observed a similar desire for help with managing research data at Wuhan University Library, but found in a quantitative survey that researchers “do not entirely believe librarians can be of significant help in managing research projects, providing data curation and sharing support,” leading them to suggest that libraries should “promote and advertise their effort and abilities” (9). Libraries, however, may lack the funding or technical infrastructure needed to support digital projects adequately in the long term (Owens, 2018).
Moreover, given that effective digital preservation and consistent, long-term access to the content require intense curatorial support, librarians, specifically subject selectors and disciplinary curators, are in the best position to provide feedback on digital scholarship projects (Tallman & Work, 2018). Robert Montoya (2017, 221) even argues that a new category of “boundary staff specifically charged with maintaining … boundary infrastructures and negotiating mismatched practices between departments” is needed to break out of silos and integrate library strengths with cross-disciplinary projects.
Regardless of where a digital object or project originates or concludes, the stakes for digital preservation are high, and project partners benefit from sharing the responsibility and privilege of applying digital preservation considerations to their work. Indeed, increasing the pool of stakeholders should increase preservation options, helping to alleviate the burden of hidden labor on a small group of individuals while also avoiding the temptation to overfit all projects to a one-size-fits-all preservation solution. DSCs, in turn, stand to benefit by learning how their peers are engaging stakeholders in this important endeavor.
Pilot survey
For additional perspective on this landscape, we distributed a pilot survey via list-serv in order to investigate how digital scholarship centers within higher education institutions in the United States currently engage with their stakeholders on digital preservation. In total, the survey received forty-seven (47) responses. Respondents who left all answers blank were eliminated. Duplicate responses were received from three institutions. If there was overlap between responses, the authors looked to see if responses were identical; if so, one entry was kept for the institution, and if not, both entries were removed. Two entries were removed as non-US institutions.
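The screening rules just described can be sketched in a few lines of Python. This is an illustration only, not the authors' actual procedure or data. The record layout (`institution`, `country`, `answers`), the sample rows, and the reading of "identical" as exact equality of answers are all assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical toy records; the real survey data are not reproduced here.
SAMPLE_RESPONSES = [
    {"institution": "A University", "country": "US", "answers": {"q1": "yes", "q2": "consultations"}},
    {"institution": "B College", "country": "US", "answers": {"q1": "", "q2": None}},
    {"institution": "C University", "country": "US", "answers": {"q1": "yes"}},
    {"institution": "C University", "country": "US", "answers": {"q1": "yes"}},
    {"institution": "D University", "country": "US", "answers": {"q1": "yes"}},
    {"institution": "D University", "country": "US", "answers": {"q1": "no"}},
    {"institution": "E University", "country": "CA", "answers": {"q1": "yes"}},
]


def screen_responses(responses):
    """Apply the three screening rules from the text, in the order given there."""
    # Rule 1: eliminate respondents who left every answer blank.
    non_blank = [
        r for r in responses
        if any(str(a).strip() for a in r["answers"].values() if a is not None)
    ]

    # Rule 2: group by institution to find duplicate submissions.
    by_institution = defaultdict(list)
    for r in non_blank:
        by_institution[r["institution"]].append(r)

    deduplicated = []
    for rows in by_institution.values():
        if len(rows) == 1:
            deduplicated.append(rows[0])
        elif all(row["answers"] == rows[0]["answers"] for row in rows):
            # Identical duplicates: keep a single entry for the institution.
            deduplicated.append(rows[0])
        # Conflicting duplicates: drop every entry for that institution.

    # Rule 3: remove non-US institutions (assumed country code "US").
    return [r for r in deduplicated if r["country"] == "US"]


kept = screen_responses(SAMPLE_RESPONSES)
print([r["institution"] for r in kept])
```

Writing the rules down this way makes the order of operations explicit, which matters when a duplicate pair also happens to be blank or non-US.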
In total, twenty-five (25) survey responses were used for analysis. For more information, please visit https://doi.org/10.17605/OSF.IO/3YJ8A.
A key limitation of this survey is the small number of responses received relative to the number of invitations distributed through list-servs; the survey nevertheless provides an instructive starting place for continued exploration of digital preservation patron engagement activities at US digital scholarship centers. In following the format of the survey, the themes that emerged from our data have been divided into two categories: characteristics of the responding DSCs, and patterns of digital preservation practices.
Responding digital scholarship center overview
All responding centers indicated that they provide consultations to patrons (n = 25). Most responding DSCs indicated that they provide instruction (n = 22), cultivate a web presence (n = 21), and provide access to hardware and software for patron use (n = 20). Responding DSCs tended to have a broad range of expertise: while the particulars varied between DSCs, many indicated that they offer expertise in digital publishing, project management, data analysis, and metadata (n = 22, 20, 19, and 19, respectively). Digital preservation was an area of expertise for over half of responding DSCs (n = 16), followed closely by institutional repository support (n = 15).
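For readers who prefer proportions, the counts above convert to shares of the 25 analyzed responses as follows. This is a small sketch: the counts are taken from the text, while the dictionary labels and the helper name `share_of_respondents` are ours.

```python
TOTAL_RESPONSES = 25  # responses used for analysis, per the text

# Counts as reported in the text; the labels are shortened for illustration.
reported_counts = {
    "consultations": 25,
    "instruction": 22,
    "web presence": 21,
    "hardware and software access": 20,
    "digital publishing expertise": 22,
    "project management expertise": 20,
    "data analysis expertise": 19,
    "metadata expertise": 19,
    "digital preservation expertise": 16,
    "institutional repository support": 15,
}


def share_of_respondents(count, total=TOTAL_RESPONSES):
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)


shares = {label: share_of_respondents(n) for label, n in reported_counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

With so few responses, each respondent accounts for four percentage points, so small differences between categories should be read cautiously.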
Areas of DSC expertise may warrant additional exploration, specifically the emphasis on project management and data analysis and how they relate to preserving digital scholarship. The responses here could be indicative of a number of things, including but not limited to: a primary focus on active project development by responding DSCs, which are often on the cutting edge of research and research methods; the possibility that responding DSCs were collaborating with patrons on sustainable projects that need less preservation support; a prevalence of projects that had not yet reached a stage where preservation concerns are imminent; or perhaps a lack of interest in preservation among responding DSC patrons. Additional investigation into these motivations for prioritizing project management and data analysis could help guide future developments in DSC support for curating and preserving digital scholarship outputs.
Physically and organizationally, responding DSCs were linked to libraries, echoing the prevalent themes in the literature about the relationship between the two (Lippincott & Goldenberg-Hart, 2014). Most respondents noted that their DSC is located organizationally with the institution’s library (n = 19/25, 76%), and, when asked about their roles and responsibilities within the DSC, approximately one third of respondents indicated that their primary role was that of “Librarian” (n = 9/25). A responding DSC’s connection with an academic library was not associated with provision of digital preservation support by the responding DSC. This is an area that may warrant additional exploration: Given libraries’ and archives’ legacy of preservation and providing long-term access to materials, the library is the heir-apparent to preserving content created by and with the DSC, whether through curation, storage, metadata/descriptive practices, or other preservation activities.
However, limited funding, overwhelmed staff, and DSCs’ charge to stay at the forefront of digital scholarship may prohibit this collaboration.
Digital preservation practices of responding digital scholarship centers
In terms of audience for digital preservation support, the majority of responding DSCs (n = 19) indicated that they provide support for digital preservation to patrons, with the primary demographic overwhelmingly faculty-oriented and humanities-centric. This could be due to the wide definition of “digital scholarship center” employed by the survey, which included digital humanities centers under the digital scholarship center umbrella. Additional exploration of the core demographics of communities who engage with DSC services could be helpful for guiding the development of additional best practices for engaging users in digital preservation conversations. Overwhelmingly, the digital preservation support provided by responding DSCs tended to take the form of consultations (n = 19), followed by instruction and outreach (n = 8). This suggests an opportunity for developing additional resources for the integration of reusable assets and frameworks into consultative and instructional sessions.
Future explorations and conclusion
The literature review points to ample opportunities for libraries to engage across disciplines in digital preservation, and warns of peril if they don’t. Our pilot survey responses, though limited, suggest specific avenues of research, including the expansion of primary audiences for digital preservation outreach, the development of new (or implementation of existing) resources for engaging faculty and students in digital preservation activities compatible with the time limitations in outreach and consultation, and consideration of the implications of organizational placement of DSCs for the provision of digital preservation support to patrons.
As DSCs continue to evolve, academic library organizations should consider prioritizing digital preservation competencies in continuing education opportunities for their employees. According to King (2018), there are a number of skills useful for DSC faculty and staff, including technical abilities, but also more traditional librarian expertise, including preservation, institutional repository support, and metadata enhancement; however, “Librarians felt overwhelmingly that they needed more, better trained staff to meet this need and that they themselves were in need of skills, knowledge and credentials.” (44) By providing these educational opportunities, funding, or other support to employees in addition to DSC patrons, libraries can continue to serve as active and collaborative partners in supporting the creation and preservation of digital objects and digital scholarship projects.
As a follow-up to this work, more detailed investigation into preservation, through activities such as semi-structured interviews with survey respondents, could provide even more specific information on how DSCs engage patrons. While the pilot survey provides a snapshot in time, the response categories were too broad to learn detailed information at the outset. Additional research could investigate how active subject selectors, curators, or other disciplinary liaisons are in supporting the curation and preservation of DSC projects. Relatedly, we would like to learn whether DSCs are providing rubrics or other tools to support curators in deciding what to preserve, and to see how many DSCs are embracing a benign neglect towards their projects, allowing them to gracefully decline. Since the initial distribution of the survey, the landscape of higher education has changed drastically in the wake of the COVID-19 pandemic. Additional research could examine how remote work has impacted consultations and remote digital preservation work.
Similarly, during the myriad social protests that occurred during the Summer of 2020, did DSCs engage in or support community archiving or preservation?
The results of this pilot survey and related research have uncovered more questions than answers. As libraries and DSCs contend with an ever-increasing proliferation of data and digital objects (especially when considering legacy digital projects from early adopters in the 2000s) and budgets that remain constant at best, effective digital preservation relies on an active collaboration between partners. Knowing how best to support DSCs and library and information professionals in this endeavor ensures time and resources are spent effectively in providing long-term access to digital projects for future scholars. This work can and must be a collaborative effort between institutional and organizational units, and requires more investigation to understand just where to start.
References
Atkins, W. (2013). Staffing for effective digital preservation: An NDSA report: Results of a survey of organizations preserving digital content. National Digital Stewardship Alliance.
Bryson, T., Posner, M., St. Pierre, A., & Varner, S. (2011). Digital humanities, SPEC kit 326 (November 2011). https://publications.arl.org/Digital-Humanities-SPEC-Kit-326/.
Cox, J. (2016). Communicating new library roles to enable digital scholarship: A review article. New Review of Academic Librarianship, 22(2–3), 132–147. https://doi.org/10.1080/13614533.2016.1181665
EaaSI GitLab | Software Preservation Network (SPN). (n.d.). Retrieved from https://www.softwarepreservationnetwork.org/eaasi-gitlab/.
Gray, J., Szalay, A. S., Thakar, A. R., Stoughton, C., & van den Berg, J. (2002). In A. S. Szalay (Ed.), Online scientific data curation, publication, and archiving (pp. 103–107). https://doi.org/10.1117/12.461524
Gutmann, M. P., Abrahamson, M., Adams, M. O., Altman, M., Arms, C., Bollen, K., … King, G., et al. (2009). From preserving the past to preserving the future: The Data-PASS project and the challenges of preserving digital social science data. Library Trends, 57, 315–337.
Higgins, S. (2008). The DCC curation lifecycle model. International Journal of Digital Curation, 3(1), 134–140. https://doi.org/10.2218/ijdc.v3i1.48
Hildreth, M., & Meyers, N. (2020). Final report: FAIR Hackathon workshop for mathematical and physical sciences research communities. https://doi.org/10.7274/R0-RWPP-AS13
Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to Archiving Social Science Data for Institutional Repositories (1st ed.). Ann Arbor, MI.
Johnson, D., & Dehmlow, M. (2019). Digital exhibits to digital humanities: Expanding the digital libraries portfolio. In New top technologies every librarian needs to know: A LITA guide (p. 123).
King, M. (2018). Digital scholarship librarian: What skills and competences are needed to be a collaborative librarian. International Information & Library Review, 50, 40–46. https://doi.org/10.1080/10572317.2017.1422898
Li, B., Song, Y., Lu, X., & Zhou, L. (2020). Making the digital turn: Identifying the user requirements of digital scholarship services in university libraries. The Journal of Academic Librarianship, 46(2), Article 102135. https://doi.org/10.1016/j.acalib.2020.102135
Lippincott, J. K. (2017). Opening keynote: Fulfilling our mission in the digital age. Digital Initiatives Symposium, 17. Retrieved from https://digital.sandiego.edu/cgi/viewcontent.cgi?article=1131&context=symposium.
Lippincott, J. K., & Goldenberg-Hart, D. (2014). CNI workshop report. Digital scholarship centers: Trends and good practice. Retrieved from https://www.cni.org/wp-content/uploads/2014/11/CNI-Digitial-Schol.-Centers-report-2014.web_.pdf.
Locke, B. T. (2017). Digital humanities pedagogy as essential liberal education: A framework for curriculum development. Digital Humanities Quarterly, 011(3).
Lyle, J., Alter, G., & Green, A. (2014). Partnering to curate and archive social science data. Research data management: Practical strategies for information professionals (pp. 203–222).
Millson-Martula, C., & Gunn, K. (2017). The digital humanities: Implications for librarians, libraries, and librarianship. College & Undergraduate Libraries, 24(2–4), 135–139. https://doi.org/10.1080/10691316.2017.1387011
Montoya, R. D. (2017). Boundary objects/boundary staff: Supporting digital scholarship in academic libraries. The Journal of Academic Librarianship, 43(3), 216–223. https://doi.org/10.1016/j.acalib.2017.03.001
Owens, T. (2018). The theory and craft of digital preservation. Johns Hopkins University Press.
Poole, A. H., & Garwood, D. A. (2018). “Natural allies”: Librarians, archivists, and big data in international digital humanities project work. Journal of Documentation, 74(4), 804–826. https://doi.org/10.1108/JD-10-2017-0137
Rumsey, A. S. (2011). Scholarly communication institute 9: New-model scholarly communication: Road map for change. Charlottesville, VA: University of Virginia Library.
Sula, C. A. (2013). Digital humanities and libraries: A conceptual model. Journal of Library Administration, 53(1), 10–26. https://doi.org/10.1080/01930826.2013.756680
Sustaining DH – An NEH Institute for Advanced Topics in the Digital Humanities. (n.d.). Retrieved from https://sites.haa.pitt.edu/sustainabilityinstitute/.
Tallman, N., & Work, L. (2018). Approaching Appraisal. In International Conference on Digital Preservation (Vol. 2018).
Tammaro, A. M., Matusiak, K. K., Sposito, F. A., & Casarosa, V. (2019). Data curator’s roles and responsibilities: An international perspective. Libri, 69(2), 89–104. https://doi.org/10.1515/libri-2018-0090
Team, T. E. P. (n.d.). The Endings Project. Retrieved from https://endings.uvic.ca/.
Tenopir, C., Allard, S., Baird, L., Sandusky, R., Lundeen, A., Hughes, D., & Pollock, D. (2019). Academic librarians and research data services: Attitudes and practices. IT Lib: Information Technology and Libraries Journal, (1). https://trace.tennessee.edu/utk_infosciepubs/99.
Tenopir, C., Birch, B., & Allard, S. (2012). Academic libraries and research data services: Current practices and plans for the future. In An ACRL white paper. https://trace.tennessee.edu/utk_dataone/20.
The Digital Documentation Process. (n.d.). Retrieved from https://digitalhumanitiesddp.com/.
Tzoc, E. (2016). Libraries and faculty collaboration: Four digital scholarship examples. Journal of Web Librarianship, 10(2), 124–136. https://doi.org/10.1080/19322909.2016.1150229
Walters, T., & Skinner, K. (2011). New roles for new times: Digital curation for preservation. Association of Research Libraries. https://vtechworks.lib.vt.edu/handle/10919/10183.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., … Mons, B. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
Xia, J., & Wang, M. (2014). Competencies and responsibilities of social science data librarians: An analysis of job descriptions. College & Research Libraries. https://doi.org/10.5860/crl13-435
oclc-social-2020 ---- Social Interoperability in Research Support: Cross-Campus Partnerships and the University Research Enterprise
Rebecca Bryant, Annette Dortmund, and Brian Lavoie
OCLC Research Report
Rebecca Bryant, Senior Program Officer
Annette Dortmund, Senior Product Manager
Brian Lavoie, Senior Research Scientist
© 2020 OCLC. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/
August 2020
OCLC Research, Dublin, Ohio 43017 USA
www.oclc.org
ISBN: 978-1-55653-157-6
DOI: 10.25333/wyrd-n586
OCLC Control Number: 1184125043
ORCID iDs
Rebecca Bryant http://orcid.org/0000-0002-2753-3881
Annette Dortmund https://orcid.org/0000-0003-1588-9749
Brian Lavoie http://orcid.org/0000-0002-7173-8753
Please direct correspondence to: OCLC Research, oclcresearch@oclc.org
Suggested citation: Bryant, Rebecca, Annette Dortmund, and Brian Lavoie. 2020. Social Interoperability in Research Support: Cross-Campus Partnerships and the University Research Enterprise. Dublin, OH: OCLC Research. https://doi.org/10.25333/wyrd-n586.
Contents
Foreword
Building Intra-Campus Relationships Around Research Support Services
    Introduction
    Scope and Methods
    Limitations
The Campus Environment
    Universities are Complex Adaptive Systems
    Intense Competition for Prestige, Rankings, and Resources
    Leadership Challenges
    Frustration and Isolation in Emerging Roles
A Model for Conceptualizing University Research Support Stakeholders
    Academic Affairs
    Research Administration
    The Library
    Information and Communications Technology (ICT)
    Faculty Affairs and Governance
    Communications
Social Interoperability in Research Support Services
    Research Data Management (RDM)
    Research Information Management (RIM)
        Public researcher profiles
        Faculty Activity Reporting (FAR)
    Research Analytics
    ORCID Adoption
    Comments on the Library as Partner
Cross-Campus Relationship Building: Strategies and Tactics
    Strategies and Directions
        Secure buy-in
        Know your audience
        Speak their language
        Offer concrete solutions to others’ problems
        Timing is essential
    Relationship Building: Practical Advice
        Meeting opportunities
        Shared staff and embedded resources
    Troubleshooting in Relationship Building
        Making connections
        Personalities
        Know your value / be confident
    Challenges: Managing Resistance and Sustaining Energy
        Managing resistance
        Investing the energy
Conclusion
Acknowledgments
Appendix: Interview Protocol
Notes
Figures
FIGURE 1 A conceptual model of campus research support stakeholders
FIGURE 2 Stakeholder interest in research support areas
FIGURE 3 Key takeaways about successful intra-campus social interoperability
Foreword
To develop robust research support services across the entire research life cycle, individuals and units from across the university, including the library, must work across internal silos. Previous OCLC Research publications like The Realities of Research Data Management and Practices and Patterns in Research Information Management: Findings from a Global Survey (2017-18),1 prepared in partnership with euroCRIS, already describe this growing operational convergence. Libraries are increasingly partnering with other campus stakeholders in research support, such as the office of research, campus IT, faculty affairs, and academic affairs units.
This OCLC Research Report, Social Interoperability in Research Support: Cross-campus partnerships and the university research enterprise, recognizes the growing imperative for libraries to work not only in support of the goals of their parent institution, as explored in the 2018 University Futures, Library Futures report,2 but also as a valued member of a cross-institutional team. Social Interoperability in Research Support explores the social and structural norms that can serve either as roadblocks or pathways to cross-institutional collaboration and offers a model for conceptualizing the key university stakeholders in research support. It examines the network of campus units involved in both the provision and consumption of research support services and concludes with recommendations for establishing and maintaining cross-campus relationships, synthesized from interviews conducted with practitioners from all corners of campus.

Social Interoperability in Research Support offers a road map for acquainting librarians with the other research support stakeholders on campus. It additionally offers a resource for acquainting others on campus with the skills and expertise that the library brings to research support activities.

While the interviews informing this publication were conducted prior to the onset of the COVID-19 crisis, I believe the findings are no less relevant. In fact, the need for increasing cross-institutional research support collaboration is likely to be amplified due to the current pandemic and its longer-term effects.
Lorcan Dempsey, Vice President, Membership and Research, OCLC

Building Intra-Campus Relationships Around Research Support Services

Introduction

In early 2020, the University Libraries at the University of Rhode Island publicized a posting for a Library Chief Data Strategist, responsible for “enhancing library-based data services programs.” The job description noted that:

This position will work with the Office of Institutional Research and DataSpark (Library-based data analytics unit) to identify avenues to increase faculty and researcher success. Working with internal (e.g. MakerspaceURI, Launch Lab, Think Lab, and the AI Lab) and external (e.g. the Office of Advancement of Teaching and Learning, the Office of Community, Equity and Diversity, Division of Research and Economic Development and IT) partners, the incumbent will plan and implement experimental and innovative activities to cultivate and expand synergistic relationships.3

This description illustrates the deeply collaborative nature of providing research support services like data management, as well as the importance of developing and sustaining productive cross-campus relationships to make these collaborations work. The academic library is undoubtedly a key figure in the landscape of research support services, but it is not the only one. Successful management of the library’s portfolio of research services requires interaction, coordination, and even direct partnerships with other campus units.

Research support is an increasingly visible and expanding part of the network of services and infrastructure that enable the university’s research enterprise. Definitions of the term “research support service” range from the general to the precise.
For example, North Carolina State University defines research support as “a service that allows a researcher to spend more time, more efficiently in his/her role as a researcher, and contributes positively to the quality of the research.”4 In contrast, Si, Zeng, Guo, and Zhuang suggest that research support services specifically include research data management, open access, scholarly publishing, research impact measurement, research guides, research consultation, and research tools recommendation.5

Because research support services extend over the entire research life cycle, as well as across the entire campus, we offer a relatively expansive definition in this report. Research support services are those that enhance researcher productivity, facilitate analysis of research activity, and/or make research outputs visible and accessible across the scholarly community and beyond.6

The provision of research support services is seldom the responsibility of a single campus unit; nor is the consumption of research support services limited to a single campus cohort. Instead, both provision and consumption are distributed across many stakeholders—from the library to the research office; from faculty to administrators. The wide network of campus stakeholders involved in providing or using research support services underscores the importance of building strong intra-campus relationships to maximize their effectiveness and impact.

In this report, we document the perspectives of individuals representing a wide range of campus stakeholders in research support, either as a provider or user, with the goal of making the stakeholder groups from which they are drawn more distinct, and their potential role as a partner in research support more apparent.
Building robust relationships means moving beyond a “stick figure” view of campus partners to a fleshed-out, three-dimensional understanding of their responsibilities, capacities, goals, and needs that bear on the provision and/or consumption of research support services.

Sheila Corrall observes that “[o]perational convergence (i.e., separate services/departments collaborating to coordinate their activities to improve conference and effectiveness) . . . is arguably more prevalent than ever, with libraries extending and deepening their collaborations and partnerships beyond IT and educational development colleagues to other professional services, such as research offices.”7 Operational convergence in turn is facilitated by social interoperability, which we define as the creation and maintenance of working relationships across individuals and organizational units that promote collaboration, communication, and mutual understanding. While “technical interoperability”—different technical systems working smoothly together—may be a more familiar concept, social interoperability is of growing importance in a landscape where cross-campus partnerships are becoming both more prevalent and more necessary.

While this report is written primarily for academic librarians, we expect and hope that it will prove useful to the many other campus professionals involved in research support activities. Our premise is that cross-campus partnerships are a necessary condition for building effective research support services, and the best chance for developing these relationships is to cultivate a deep understanding of potential campus partners: their responsibilities, pain points, and areas of common interest where engagement can take root and flourish.
The goal of this report is not just to acquaint academic librarians with other campus stakeholders in research support, but to acquaint other campus stakeholders with the library.

The remainder of the report is organized as follows. This section concludes with a brief description of the scope of our study and our data-collecting methods. The next section, “The campus environment,” provides background on the organizational and decision-making environment at US universities. “A model for conceptualizing university research support stakeholders” introduces a model defining campus functional areas relevant to research support, illustrated and contextualized by our informants’ perspectives on their own roles. “Social interoperability in research support services” describes major categories of research support services on campus, and documents—through the lens of our informants’ experiences—the importance of social interoperability in building effective and impactful research support services. The final section draws out some general insights or “lessons learned” from our informants on developing good social interoperability skills that lead to successful cross-campus partnerships.

Scope and Methods

Our study is focused on research support in US universities. In focusing on research support, we see an opportunity to address a gap in existing literature,8 which extensively documents educational support services but is less rich in addressing research support services and intra-institutional research support challenges. Focusing on the United States was a pragmatic choice. Extending the analysis internationally raises significant challenges for meaningful comparison across different higher education systems. Each national higher education context is different, and worthy of separate study.
Data was collected for this study through semi-structured interviews with individuals working in a wide range of research support-related roles across campus. We chose interviews as our strategy for data collection because we sought a more in-depth, personal perspective on cross-campus collaboration than other methods, such as a survey instrument, could afford. A key impetus for our research is that knowledge resides in people: therefore, there is great benefit in gathering and synthesizing what people know. That is the aim of this study and the rationale behind our method.9

Our interviews explored the functions and responsibilities of each individual in the context of their respective campus unit; the importance of their work—and their unit—to the university and its research enterprise; and how mutual research support interests have been or could be advanced through intra-campus relationships. The interviews sought to draw out our informants’ on-the-ground experiences in establishing and sustaining productive, cross-campus relationships. Our interviewees include individuals involved in the provision of research support services, as well as those whose responsibilities require or would benefit from consuming research support services.

In examining research support services, we felt it very important to get the complete campus view. Research support services represent a dynamic service space, with new services emerging and existing services maturing, merging, or being re-defined. Services that are sourced in one campus unit (or units) today may be shifted to other providers (on campus or off) in the future. Given this, it is important to look at the overall campus landscape to better understand the scope and opportunities of the library’s role in this space. Our interviews therefore focused on collaborative experiences in research support regardless of whether the library was involved, rather than focusing strictly on collaborations involving the library.
To identify interview candidates, we used a variety of sources, including personal networks and recommendations from colleagues and contacts. All told, we spoke to 22 individuals from 17 research-intensive universities in the United States. Sixteen of the 17 institutions are public institutions. Our interviewees included individuals with existing intra-campus relationships with the library as well as those with little library engagement; senior leaders as well as early-career staff; technical as well as nontechnical roles; and those with faculty status as well as those with nonfaculty positions. We spoke with academic deans and senior administrators in addition to an array of professionals working in the library, research development, faculty affairs, communications, and beyond. The fact that our informants straddle all of these categories is indicative of the wide impact of research support across the university. Our interviews did not include researchers, as we sought to examine collaborations and relationships between campus units.

We did not enter the interview process with a specific number of interviewees in mind; instead, we halted the interview process when we felt that the relevant parts of the campus had been covered by at least one interviewee, and, more importantly, when we began to detect significant overlap in the perspectives related by later interviewees compared to earlier ones. The result, we hope, is a diverse array of perspectives, highlighting many facets of the intra-campus collaboration story.

In conducting the interviews, we spoke to our informants about their personal perspectives on building intra-campus relationships around research support; we did not ask them to “represent” the campus unit in which they are embedded or to present a summary view detached from their own experiences.
Relationship building is ultimately about people interacting with people; we tried to find out from our interviewees what worked for them—and what did not—as they reached out across the campus.

Our interviews were recorded and transcribed prior to review and analysis. All our interviewees were guaranteed anonymity to remove obstacles to relating their experiences. To preserve their anonymity, therefore, we do not reveal the names of the interviewees, their job titles, nor their institutions. We also use the nongendered pronoun “they” when referring to our informants.

Limitations

Selecting a representative and informative cohort of interviewees required making choices, acknowledging trade-offs, and recognizing the distinct challenges presented by this domain:

• Complexity: many campus units could potentially be stakeholders in the provision or consumption of research support services; moreover, within each unit, there are potentially many different roles relevant to research support. The result is a vast array of individuals with different informative perspectives to offer, far beyond the threshold of our resources to address them all.
• Comparison: the delineation of campus units, or the titles and roles designated within those units, varies from university to university. This makes it difficult to choose a sample from an enumerated set of campus units and associated roles within those units.
• Context: every university is different, so the experiences of an individual at a given campus in building intra-campus relationships in research support will be influenced by local circumstances.
With these challenges in mind, we opted to assemble a collection of interesting and informative perspectives from individuals serving in a variety of roles across the campus, rather than attempting a comprehensive view of campus stakeholders in research support,10 with the goal of comparing and contrasting their experiences in cross-campus collaboration and drawing out general lessons and insights.

The Campus Environment

Being in a decentralized institution, I have to persuade people that it’s in their best interest to do [something]. But if I can do that successfully, it’s much more likely to lead to climate change than mandating.
—Academic Dean

It all takes longer and has more dependencies than you think.
—RIM System Administrator

Social interoperability takes place within the unique environment of the modern university. One key feature of this environment is the diffusion of authority and decision-making responsibility. For example, Deane and Clarke note that “it is rare for [presidents and provosts] to give anything like an order to deans, who enjoy considerable autonomy in leading their schools. This softness of command cascades down the ranks, as department heads have wide latitude in how they lead their departments and individual faculty have considerable discretion in how they conduct their teaching and research.”11 In this section, we discuss some of the organizational attributes of US universities and how they reinforce the importance of social interoperability as a key ingredient for getting things done.

Universities are Complex Adaptive Systems

There is no single model that can illustrate a “typical” research university structure—every institution is distinct, with a dizzying variety of hierarchies, positions, titles, units, and budget models.
However, we find useful the description of universities as “complex adaptive systems” by systems engineering expert and former university leader William B. Rouse.12 He describes universities as similar in complexity to urban systems and as sharing these six main characteristics of complex adaptive systems:

1. Nonlinear, dynamic behavior. The behaviors in the university can appear random and chaotic. Individuals in the system may ignore stimuli, remaining oblivious to activities outside of their immediate purview, reacting infrequently, inconsistently, and perhaps overzealously when they do take notice.
2. Independent agents. Individuals, and especially faculty, have a lot of freedom to be self-directed: in research, teaching and course development, and behaviors. Their behaviors are not dictated by the university, and in fact, the independent agents may feel free to openly resist institutional initiatives.
3. Goals and behaviors that differ or conflict. The interests and needs of the independent agents acting within the university are highly heterogeneous, leading to internal conflicts, professional discourtesy, and sometimes outright competition.
4. Intelligent and learning agents. Not only are people independent agents, they’re also smart independent agents, who can learn how the complex university works and adapt their behaviors to achieve their personal goals. With such heterogeneous goals across the enterprise, individuals can end up working at odds with each other.
5. Self-organization. While universities have established hierarchies (like colleges, schools, and departments), there can also be self-organized interest groups that arise to meet evolving needs. This can also lead to duplication of effort and services, as a group working to address a problem may be unaware of similar efforts and act independently instead.
6. No single point(s) of control.
Universities are characterized by a significant degree of decentralization where units, as well as individuals, operate in a federated manner with a high degree of autonomy.

Our interview informants described this ecosystem as a major pain point. Universities are not sites where mandates usually work; they aren’t characterized by a command and control system. Instead, they work through incentives and inhibitions. Or, as one of our informants told us: “Mandatory is your first and fastest way to fail . . . [because] you aren’t going to dictate anything to anybody.” This can also mean that centralized efforts are more difficult.13 It’s also easy to make mistakes because “units don’t want to give up their autonomy . . . making it easy to step on toes.”

William Rouse’s model offers context for understanding why cross-institutional collaboration can be so difficult. Unlike traditional organizational systems, which rely on command and control management methods, a hierarchical network, contractual relationships, and a focus on efficiency, universities respond poorly to these methods. Instead, the more heterarchical and self-organized network is “better led than managed,” relying upon personal relationships, persuasion, and consideration of the interests, incentives, and inhibitions of others. Developing and stewarding trusted relationships in a decentralized organization is essential.

There are also a few other, interrelated themes that emerged in the course of our interviews that are important for understanding both the imperative of cross-institutional collaboration as well as the challenges of achieving good social interoperability within the system.

Intense Competition for Prestige, Rankings, and Resources

Research universities today are participating in a high stakes reputation race, seeking higher rankings on national and international league tables.
The quest for prestige and rankings—and the promise of greater resources with greater prestige—is driving incentives and activities throughout institutions, particularly as revenue streams decline or become less certain.14 A variety of research support-related activities relevant to institutional reputation management and research competitiveness are emerging, such as the implementation of RIM systems; support for research data management planning, storage, sharing, and preservation; and the desire for improved research analytics and benchmarking tools. These efforts require the buy-in, knowledge, and engagement of numerous campus units; they are also challenging, time-consuming efforts on decentralized campuses.

Within this highly competitive environment, strategic alignment across campus units is more important than ever. Several of our interview informants emphasized this imperative, as well as the importance of senior leadership to signal the most important issues and activities. For instance, one library leader said,

I don’t think that [research data management support] or the [RIM system] would have been successful as library-only initiatives. . . . It’s been absolutely critical that they were backed by the [office of research] because I think that’s also helped keep it to be more of a campus-wide perspective. I do think it’s pretty easy for the library to get sucked into that library world, so it could happen.

This is true not only for research support activities, but also for supporting student learning and success,15 and there is a significant literature addressing the importance of close alignment between the library and the parent institution.16

Leadership Challenges

A major challenge mentioned by several of our informants was the significant amount of leadership instability, or “churn,” as senior leaders enter and exit with regularity.
This leadership discontinuity can particularly hamper progress on enterprise-wide efforts, as executive sponsorship for campus-level projects is essential for forward progress. One informant from campus IT shared,

The change in leadership up and down the chain is so frequent, that we get a strategic direction in place and then no one is in place long enough to actually see it through. Then you spend another year or two kind of rudderless, with everyone kind of doing what they . . . think is best but unless you have the leadership at that level actually focusing resources on a particular effort, you’re not going to get very far on campus with these campus wide efforts. We can do lots of smaller things that you can garner the resources and backing to do, but you can’t do really big things without [senior leadership] aligned.

The lack of sustained leadership and vision can inhibit social interoperability as well, as individuals and units may have no encouragement or leadership to create and maintain cross-institutional relationships in order to work toward a common goal. One of our informants, a senior academic affairs leader, used a tug-of-war metaphor to describe the role of a good leader in focusing attention on shared goals: “You need to make it clear that it’s a rope. That it’s this rope. And this is what pulling on it means.”

Frustration and Isolation in Emerging Roles

Several of the informants we interviewed were professional staff members, without faculty status. In recent years there has been a proliferation of nonfaculty professionals working at US universities, providing student and research support in a variety of areas, such as IT, career advising, counseling, research administration, and more. In fact, many of the people we spoke with were in positions that are relatively new roles within the university, particularly those serving in positions leading campus-wide research development efforts or RIM implementations.
Celia Whitchurch describes these individuals as “Third Space professionals” working in emerging areas, within traditional organizational structures that simultaneously offer security and constraints, and working within and across these hierarchies in ways that are both appreciated and can sow friction.17 Many of our informants reported feeling isolated in their emergent roles, without (yet) a supportive community of practice within and beyond the university. In order to be successful, these professionals must develop trust relationships across campus, which will in turn also develop a socially interoperable community of practices. But this isn’t easy, especially in the university environment where decentralization, administrative churn, and local autonomy are standard.

Sometimes our informants reflected frustration with their inability to lead change on campus, sometimes explicitly stating that they thought they were unable to move things forward because they weren’t faculty, and that they felt implicit bias and were seen as less respected members of an implicit caste system, or as mere “administrators.”18 For example, one informant shared, “One of the reasons it may not have . . . gone anywhere was that it was coming from this staff perspective and that it may have to come through faculty members.”

Leveraging relationships with faculty can be essential in this landscape, including with librarians with faculty status:

We work really well with our library colleagues, because most of them are faculty librarians. They are tenured, or on the track. It’s a lot easier for us at times to hand some things over to them to let them carry it forward, especially around policy.
However, one of our librarian informants cautioned that “even though we are members of the general faculty . . . we’re not always seen at the same level.”

In sum, social interoperability is an essential skill in developing successful, high-impact research support services in the kind of complex adaptive system described by Rouse, and which is complicated further by intense international competition, local leadership discontinuity, and the disconnect that often attends emerging roles such as those associated with many aspects of research support. A staff member (not one of our interviewees) leading the implementation of a campus-wide RIM system half-jokingly referred to this effort as “herding flaming cats” to express the significant challenges of trying to coordinate highly independent individuals with different goals and interests, spread across a large, decentralized organization. Social interoperability is a means of cutting through these complexities and obstacles, promoting mutual understanding, highlighting coincidence of interest, and cultivating buy-in and consensus.

A Model for Conceptualizing University Research Support Stakeholders

Nobody knows what the %*@# a provost does.
—Provost

This section describes a conceptual model of campus stakeholders in research support identified in the course of our interviews with 22 individuals from 17 research-intensive institutions in the United States. The model helps visualize the broad functional areas on campus from which stakeholders in research support services often emerge and places the specific roles represented by our informants in a broader, campus-wide context. Campus stakeholders are not identical across institutions: the functions, responsibilities, and even nomenclature of both individual positions and campus units will differ.
Therefore, the descriptions we offer below are stylized and intended to express the broad sweep of stakeholder interests in research support. These interests will be organized in different ways on different campuses.

FIGURE 1. A conceptual model of campus research support stakeholders. Within the university, the model shows six functional areas: Academic Affairs; Research Administration; The Library; Information & Communications Technology (ICT); Faculty Affairs & Governance; and Communications.

Our informants were associated with a diverse array of campus functional units. We have grouped them into six broad functional areas (figure 1). Note that these are not mutually exclusive; the distinctions across areas are those of focus, rather than clear administrative boundaries. Moreover, this is not a complete model of all the functional units found within a university, but instead is focused on those most relevant to research support services. Finally, we note that this model does not take into account any hierarchical relationships that may exist within and across these areas. The remainder of this section provides brief descriptions, often in the words of our informants, of each functional area represented in the model. In talking with our informants about their roles, we were impressed by the variation and nuance in responsibilities, interests, and institutional circumstances evident across seemingly similar functions or positions located at different universities. While this makes generalization difficult, we did identify a “takeaway message” in each campus area that seemed to resonate across our discussions.

Academic Affairs

Academic Affairs in our model includes individuals responsible for overseeing teaching, learning, and research activities at the university.
Examples include the provost—the university’s chief academic officer—as well as deans and directors of colleges, schools, and institutes; department heads; directors of graduate study; and faculty and staff. It is important to emphasize that while Academic Affairs personnel are perhaps most commonly understood in relation to their oversight of academic programs (e.g., course offerings, teaching assignments, degree requirements) they also have responsibilities concerning research activities at the university. This underscores the need to understand the research interests of those in Academic Affairs positions, and by extension, their potential role as campus stakeholders in research support services. In some cases, academic and research interests may be intertwined, such as in graduate education, where the Graduate School takes a leading role in supporting the interests of early career researchers, including both graduate students and postdoctoral researchers. The functions falling within this area are vast and varied, but a common theme that emerged from our interviews is that individuals working in Academic Affairs often expressed their responsibilities in the language of campus-wide strategic imperatives. We spoke to a provost who described their responsibilities as “operationalizing the institution’s imperatives”—in other words, implementing the university’s strategy and vision. They went on to note the importance of the provost’s voice as a source of leadership in signaling and encouraging engagement with institutional priorities. Advocacy was a central responsibility of a graduate dean we interviewed, motivated by a concern that the interests of graduate students and postdoctoral researchers might be overlooked amidst an institutional focus on undergraduate education. And a dean of arts and sciences remarked on the need to demonstrate research impact and link it to institutional reputation and prestige. 
Moreover, emphasis on strategic imperatives (whether communicating the university vision, advocating for the interests of a student cohort, or enhancing the institutional brand and reputation) is not confined to senior leadership, but filters down, in one form or another, through the various layers of staff underneath. For example, one of our informants stressed the importance of all faculty and staff understanding their unit’s philosophy, its values, and its stance vis-à-vis other units. In working with individuals in Academic Affairs, whether executive or “front-line,” it may be especially important to understand the strategic interests motivating both their needs and the capacities they have developed or are developing. Although this observation was evident from our interviews with Academic Affairs personnel, it can be usefully applied to the other functional areas defined in the conceptual model (figure 1) as well.

Research Administration

Research Administration covers a vast array of services and activities, supporting one of the three great missions of most universities (education, research, and service).19 Broadly speaking, campus units associated with research administration provide services that help advance the university’s research activities, such as securing external funding, developing institutional strategy and policy, and providing oversight of issues having to do with responsible research conduct, ethics, and grant administration. Often, campus units aimed at supporting research administration are collected under a university Office of Research (or similar name) led by a vice president or vice chancellor, with responsibilities that extend over the entire research life cycle.
For example, The Ohio State University Office of Research defines its mission as supporting “the development, submission, management and integrity of Ohio State research.”20 Similarly, the Office of Research Administration at Stanford University provides “an array of high-quality services and expertise to support the research mission and sponsored projects administration at Stanford University.”21 One of our informants in this area remarked that their primary responsibility was

“to help our researchers advance the research. . . . So it also means helping them make their lives easier. I often tell them, ‘You guys don’t . . . realize the disasters I’ve prevented you from seeing.’ . . . So really it’s important because I am passionate about the research mission and we do whatever we can to keep our researchers focused on doing their research so that they’re not doing other things that they shouldn’t have to do.”

One theme that we heard from several informants, occupying different roles and responsibilities, was the importance of managing the competitiveness and growth of the university’s overall research enterprise. One informant described their responsibilities as “increasing the competitiveness of our faculty when they are seeking extramural support.” Another informant explained their unit’s role as “related to strategic planning, strategic investment opportunity for the institution to grow and expand . . . as an institution, where do we invest our dollars in order to expand our research enterprise.” Yet another of our interviewees described their focus as “enterprise-level strategy” for the university’s Research Office. A key message from these responses is that the university research enterprise, while fragmented among many different disciplinary cohorts with different priorities and objectives, is nevertheless also viewed and managed as an enterprise-wide activity.
Understanding campus-wide priorities and objectives regarding the research enterprise is an important aspect of working with this area, as well as a helpful perspective in campus partnerships aimed at providing research support services across a diverse university research community.

The Library

The library is a familiar campus presence, and its traditional mission (broadly speaking, to connect students and faculty with the information resources they need for education and research) is likely familiar to most as well. We spoke to a number of individuals working in the library, or in library-adjacent services, and the diversity of their roles and responsibilities was indicative of the many points of contact between the library and the university research enterprise. For example, one informant manages a university press, while another directs a digital humanities institute. Other informants were involved in activities such as scholarly communication and disciplinary liaison work. As these roles suggest, today’s academic library is deeply embedded in all phases of the research life cycle. Moreover, the library is often seen, as one informant put it, “as a trusted, agnostic partner on campus.” Speaking of an effort to develop academic and research analytics, the informant went on to observe:

“If the provost had implemented these programs, everybody would have assumed it was for some kind of evaluation process, and they wouldn’t have trusted it. . . . Because we’re not doing the evaluation, we can go in and just, ‘Hey, we’re here to help you. Tell us what your story is. We’ll help you find some way to tell that story better.’ So that worked quite well and was really empowering.”

Although the library often deploys a wide range of research support services, it can be burdened by its historical role as a physical repository of print collections.
One informant remarked on this challenge, observing:

“Because so often, librarians are forgotten. Our expertise is completely forgotten, and we’re the last people [to be considered]. So faculty are shocked when they realize, ‘oh, you can help me with my data? Oh, you can help me think through this . . . publishing considerations, whatever it might be.’”

Another informant alluded to similar issues, while at the same time noting the importance of the university librarian’s role in communicating the value of the library to other campus stakeholders, “to make that case to university administrators who previously have had a limited understanding of what things the libraries do.” Effective partnership with library staff involves relinquishing preconceived notions of what the library is and where its expertise lies to understand its role as a key campus player in supporting research activities throughout the research life cycle. The library in turn must communicate clearly to campus partners its full value proposition and expertise, making clear that this value and expertise extends to a broad range of services beyond books.
Information and Communications Technology (ICT)

Information and communications technology (ICT) corresponds to units responsible for supporting a wide array of technology needs on campus, including those related to education (e.g., learning management systems, distance learning), research (e.g., storage and high-performance computing resources, digital collaboration tools, and research software), and general campus technology (e.g., email services, telecommunications, networking, personal computer access and support). ICT also provides technical consultation and support. A key feature of ICT units is their provision of centralized services in a decentralized campus administrative environment. One of our interviewees in this area observed that the “campus IT unit provides a lot of value in that they can offer a lot of centralized services to campus and make them available to everyone, make the experience more uniform across different audiences across the campus.” A similar sentiment was expressed by an IT professional responsible for managing a campus research information management system, who noted that the system was a central hub for a variety of campus-wide needs, such as facilitating cross-campus collaboration, serving as a central registry for research outputs, and providing a consolidated source of metrics and other information for campus administration. And it is important to emphasize that, like Academic Affairs personnel, ICT staff are often deeply connected to broader institutional strategic priorities: an IT director, for example, noted their unit’s prominent role in enhancing the university’s grant proposal success rate.
Although centralization of key services is an important function of ICT, we learned that it is challenging to draw the line between services that are best scaled to a campus-wide level, and those that are best provided at a college or department level. As one interviewee pointed out, “what we hope for is the things that make sense to be run from a central point kind of gravitate and migrate towards the central unit,” while discipline-specific services are managed by the relevant institutional units themselves. Our interviewees also noted that many units on campus such as colleges, research institutes, and departments have their own dedicated ICT capacity and staff; one of our informants emphatically remarked: “We stay out of that. There’re local division level and department level system administrators that have some systems that they spin up and we might guide people to them but it’s those folks who have the role of supporting them.” Given this, an important consideration for research support services is determining at what scale a service should be deployed, which in turn influences who the appropriate campus partners may be. Faculty Affairs and Governance Faculty Affairs and Governance in our model encompasses a wide range of services and functions aimed at supporting faculty members in their careers and scholarly activities, including those usually associated with a faculty affairs unit in the provost’s office, as well as those related to faculty governance, such as the faculty senate or the local American Association of University Professors (AAUP) chapter. A recent Chronicle of Higher Education article catalogs the many areas addressed by specialists in faculty affairs: “pay parity, leaves of absence, merit increases, annual reviews . . . tenure and promotion, contract renewals, sabbaticals, research grants, start-up funds, and faculty searches . . . 
counting faculty members for annual IPEDS and other national surveys.”22 Faculty affairs is an emerging functional area on many campuses, and an important stakeholder in research support, conducting work critical “to facilitate a lot of the research work on campus,” as one informant expressed it. Another informant remarked that their “record-keeping” activities meant that they were “one of the sources of good data about the amazing accomplishments our faculty take part in every year.” However, challenges abound: one informant mentioned that their unit was still in the process of raising its profile across the university and establishing itself as a trusted service provider. Another informant noted that understaffing often led to long and demanding work weeks. The informants we spoke to represent a range of different functions within faculty affairs, but recurrent themes of both concentration and coordination emerged despite the differences across their specific responsibilities. For example, one informant responsible for research analytics observed that their unit was the sole data source for many of the metrics and analytics consumed by other campus units. Another informant highlighted the importance of “the human touch and coordination behind the scenes to make sure that all the units are working together in the way that they should, that all the efforts are strategically aligned.” Faculty governance involves pathways for faculty participation in institutional decision-making: as one former university president (not an interviewee in our study) put it, “While faculty are, by nature, independent actors who are rarely motivated en masse, there are faculty organizations that can play an important and constructive role.
I worked hard to develop close, cooperative relationships with each of these groups, and the effort paid off with the faculty as a whole in gaining their support for what I was trying to accomplish.”23 One of our informants, speaking of their participation in a faculty senate and its role as a forum for raising and discussing issues, noted that the “Senate is very central to campus . . . the Senate has the standing to be able to call those people to actually speak to those things. So I think that’s probably the most important function that it has is that it can bring these things to the surface and make people come and publicly answer questions and speak to us.” A key benefit of working with Faculty Affairs and Governance may be that they often occupy roles that cut across the campus stakeholder network, such as providing centralized data resources, coordinating cross-unit activities, and convening and/or participating in venues for discussion and problem solving. Communications Communications staff are responsible for promoting, marketing, or otherwise raising awareness about university programs, accomplishments, initiatives, and other activities. Communications professionals appear at various levels of the university organizational structure, whether concentrated in a university communications or public affairs office, or being embedded in a wide range of campus functional units, including academic units, corporate relations, the research office, alumni relations, and many more. Communications specialists are also involved in efforts to manage and promote the university’s brand and reputation. The information disseminated by communications staff may be directed at an internal audience (for example, a campus newsletter highlighting news and events associated with the university’s research activities) or an external audience (for example, communications targeted to local and state media, legislators, or potential donors). 
One of our informants summarized their communications work as “telling the story of safe, ethical, productive . . . research . . . and then on the flip side, helping to sell the ideas and the creativity of our researchers to our funding agencies.” An important insight that emerged from our interviews with communications specialists was the importance of building networks and community in communications work. One of our informants remarked on their efforts to promote interdisciplinary communication and, in doing so, to cultivate a sense of community across the diverse cohort of researchers at the university. This individual went on to observe that “that kind of connecting, communicating, developing of networks . . . is probably the most vital thing that I do.” Another informant noted the importance of collaboration in their work:

“So we have to be really collaborative to get our work done and just to rely on each other . . . It’s part of our DNA. . . . So I work very closely with all everyone in strategic communications, from marketing and brands to the media team, to the internal communications folks on a variety of different things.”

Networking is a key ingredient for successful communications work, whether that means building networks with colleagues in other parts of the campus to carry out communication initiatives, or building networks on campus through communication initiatives. Building cross-campus partnerships in research support services would therefore benefit from tapping into the networking and community-building skills of communications specialists, who may also be consumers of research support services. In sum, our interviews helped uncover the wide diversity of roles and functions across the campus that touch on the university’s research activity, and by extension may potentially be stakeholders in research support services.
This diversity is evident not only across the six broad functional areas highlighted in the model above, but also within these areas. It is important to look beyond traditional and/or superficial perceptions of what campus units do to understand how the responsibilities of these units evolve, expand, and re-prioritize over time. One library told us that as part of a strategic planning process, they conducted a ten-question interview with various stakeholders around campus:

“So it started very meta. And it wasn’t until question eight that we talked about libraries. So it narrowed in, went down to their school in that department … and then into the libraries. And actually we got some of the richest information out of those first seven questions when they didn’t know that we’re talking about libraries because they didn’t know that we could do things in areas that they were talking about.”24

The essential first step in building successful campus partnerships is to know your partners: what they do, what they prioritize, and how they see themselves contributing to the university mission.

Social Interoperability in Research Support Services

“Well up front, I would say I can’t get anything done without partnerships. I mean it’s just absolutely essential to partner, whether it’s with centers, institutes, department chairs, academic deans, research deans, all the above.” (Research development professional)

You have to recognize that you’re part of an organization and you want to advance your collective interests. Because advancing your collective interests will almost always roll down to your own benefit.
(Senior university leader)

As discussed earlier in this report, there is increased operational convergence, as units and individuals across the campus must work together to provide support across all phases of the research life cycle: from project ideation, to grant development, to research, to publication and reuse. Increased interoperability across silos is necessary.25 This interoperability must exist in a technical sense, of course, but it is also the social interoperability within the complex adaptive system of the university that is needed to make efforts successful.26 In this section, we examine four research support topical areas in order to see how this interoperability between campus stakeholder groups plays out (figure 2).

FIGURE 2. Stakeholder interest in research support areas
Academic Affairs: RDM, RIM, Research Analytics, ORCID Adoption
Research Administration: RDM, RIM, Research Analytics, ORCID Adoption
The Library: RDM, RIM, Research Analytics, ORCID Adoption
Information & Communications Technology (ICT): RDM, RIM, Research Analytics, ORCID Adoption
Faculty Affairs & Governance: RIM, ORCID Adoption
Communications: RIM, Research Analytics, ORCID Adoption

These areas were frequently discussed in our interviews as the locus of intra-campus research support collaborations and provide rich examples of social interoperability between stakeholder groups on campus:
1. Research Data Management (RDM)
2. Research Information Management (RIM)
3. Research analytics
4. ORCID adoption

Research Data Management (RDM)

Research data management has quickly grown in interest in higher education, with significant investment in services, resources, and infrastructure to support researchers’ data management needs.
External funding agencies like the US National Science Foundation (NSF) require the inclusion of supplemental data management plans (DMPs) in grant proposals, noting that “[i]nvestigators are expected to share with other researchers . . . the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.”27 Institutional support for this type of mandate dovetails with activities related to proposal development, grants administration, active data management, and data curation, sharing, and preservation.28 As a result, resources and support related to research data management are distributed broadly across campus. Research administration, the library, and campus ICT are leading stakeholders in this area, and our informants reported highly synergistic relationships. On one campus, the data librarian is embedded in the research development office, a subunit of the office of research, providing guidance on DMPs, data requirements, and library data curation resources. On another, research development staff offer training for researchers on funding opportunities, proposal writing, and industry collaboration through the library’s research commons, in conjunction with research data management programming. In a third institution, research data management resources are primarily housed in the library, with significant financial support from the office of research. In this case, our informant said,

“I don’t think that either the [research data management services or campus RIM system] would have been successful as library only. It’s been absolutely critical that they were backed by the [office of research] because I think that’s also helped keep it to be more of a campus-wide perspective.”
One of our informants from ICT described how their unit provides direct consulting to researchers, developing long-term relationships and deep knowledge of user needs in order to provide expert support. This includes identifying workflow and data management solutions and even advising faculty on proposal development, particularly on the technology sections of proposals. They avoid answering quick questions via email, instead seeking to deepen relationships and understand the larger context of the researchers’ needs through attendance at laboratory meetings and quiet observation. Our informant remarked that “this is not trying to be an efficient operation,” and emphasized that local provision to researchers is necessary to understand and address researcher needs. Their unit is “joined at the hip with the library” and always looking for new ways to collaborate. While many stakeholders are working synergistically to provide data management support to campus, it can still be difficult for researchers to know which resources are available, as there is rarely a central resource that indexes these services. One of our informants said if they could wave a magic wand to solve any problem there, they would “cultivate a network of . . . research consultants and have a portal or something to point to” to direct researchers to an array of services such as high performance computing resources, DMP development tools, and publishing concerns. Several key stakeholders have a keen interest in RDM service provision:
• Research administration units such as research development are eager to support RDM services.
Pre-award research administrators in sponsored programs work to ensure that grant proposals include all required sections, including data management plans, while post-award administrators work to ensure that required data management policies are documented and followed. Research development professionals are eager to connect researchers with any and all services that will help ensure their productivity and success, making the research development office a natural partner with the library. The vice president for research may provide significant executive and monetary support.
• The library has a significant role to play in the education, expertise, and curation areas of research data management, and libraries may offer individual guidance, monitor agency data curation requirements, and support local deposit and curation of datasets.
• ICT professionals also play a major role in RDM support, supporting access to technology and also potentially providing expert support on workflow solutions.
• Academic affairs units are keen to support research data best practices among their scholars, and the graduate school may also be interested in promoting education and training about RDM practices among graduate students and postdocs.

Research Information Management (RIM)

Research information management (RIM) is the aggregation, curation, and utilization of metadata about research activities.
In other words, it’s a registry of information about research produced rather than the research data generated by researchers and includes information about locally produced scholarly journal articles, monographs, datasets, presentations, and more.29 While national and regional reporting requirements are strong drivers of RIM practices in Europe and Australia, US practices are driven more by competition and reputation management needs, resulting in the emergence of two primary use cases (public profiles and faculty activity reporting (FAR) workflows), both involving an array of stakeholders from across the institution.30 Other RIM use cases in the US, including internal decision support, data reuse, and institutional repository integrations, are currently of secondary relevance. Readers wanting to learn more about these uses are encouraged to consult previous OCLC Research reports.31

PUBLIC RESEARCHER PROFILES

The first primary US use case is the implementation of public profiles of institutional researchers, with the hopes of facilitating the discovery of experts and collaborators and catalyzing business and university relationships. One of our informants emphasized that at research universities, “we build reputation like businesses build profit,” and their institution, with library leadership, has implemented a researcher profiling system that harvests publications metadata on the work of every faculty member at the institution, with search engine optimization to support expertise discovery and boost the reputation of the parent institution. A variety of terms exist to describe these types of platforms, including Research Networking System (RNS) and Research Profiling System (RPS), and in our interviews, we found the campus profile system housed in the library, the office of research, or in campus ICT. In all cases, there was significant cooperation between units.
One informant from research development described working “hand in glove with the library” on their campus profiles, and another informant emphasized the importance of library expertise with publications metadata as well as vendor negotiation. At another institution the profile system was administered by the library, with funding from the office of research. Many campus units are strongly interested in campus profile systems:
• Research administration units are keen to connect researchers, develop strong interdisciplinary scientific research teams, and yield successful grant applications. Sponsored programs and medical center staff within the office of research may also use public profile systems to comply with US National Institutes of Health (NIH) Clinical and Translational Science Awards (CTSA) recommendations, which call for participating institutions to support collaboration among clinical and translational investigators through the provision of tools, training, and technology.32
• The library values these systems for registering the scholarly record of the institution, a manifestation of the “inside out” library, and offers bibliographic expertise.
• ICT professionals may be called upon to provide technical support as well as to support system-to-system interoperability, for instance, through the facilitation of automated data feeds or support for APIs. In one case, we found campus ICT as the home for the institutional profile system.
• Campus communicators value resources that can help support discovery of experts for press requests and public interest stories within academic affairs units as well as research units.
• Other stakeholders in academic affairs and other units in the office of research are interested in how the aggregated content might inform institutional decision support.
They also share the goal of connecting researchers with other potential collaborators within the institution.

FACULTY ACTIVITY REPORTING (FAR)

A second important RIM use case in the United States is annual academic progress reviews of faculty, frequently called Faculty Activity Reporting (FAR).33 Because of the disciplinary expertise required for reviews, these processes have long been administered at the departmental level, with a variety of workflow solutions ranging from Dropbox folders to dedicated FAR platforms. Like the public profile use case, the FAR workflow also captures information about scholarly products like publications, plus additional information about the teaching and service responsibilities of faculty.

With so many research information management stakeholders, duplication of systems and services is possible, even likely, because of a lack of social interoperability. In particular, independent academic affairs units like colleges, departments, or research institutes may develop their own systems, instead of working with others across the institution. This was commented upon by several of our informants, including one who remarked that on their campus “we have six or seven research profiling systems. That is duplication of service, for sure.” In addition to being a duplication of effort, the failure of multiple stakeholders to work together on a unified system can unintentionally dilute the hoped-for impact, as the institution delivers multiple profile discovery platforms instead of a single source of expertise. These are also silos of data that may not be easily combined to provide a broader expertise snapshot of research activity.

For institutions that are centralizing faculty activity workflows, these are often managed by a faculty affairs office.
The annual review of faculty may be mandated by the campus board of trustees, system, or even the state. One institution ties FAR participation to eligibility for merit pay increases, but even so, there are still a few noncompliant faculty. One of our informants reported how the faculty affairs unit at their institution is valued by many stakeholders on campus, including the provost and other senior campus leaders, for the business intelligence and benchmarking their unit provides. FAR workflows, which by definition are annual reviews of faculty activities, are still usually separate from the less frequent promotion and tenure (P&T) review processes, although one of our informants reported that FAR data can be extracted for reuse for P&T. For this use case we observed campus leadership from faculty affairs as well as from academic affairs units. In particular, FAR is of interest to several campus units.

• Academic affairs units, including departments, colleges, and the provost’s office, are interested in faculty activity reporting practices. Colleges and campus level units are particularly interested in both improved workflows and the improved aggregated data that can be used for decision support. However, there is often a great deal of unit autonomy, leading to heterogeneous practices and duplication of effort and systems. The data aggregated in FAR workflows can also be reused for academic program reviews and program accreditation.
• Faculty affairs units at some institutions, usually housed in the office of the provost, may take a leading role in implementing and managing a single FAR system for the institution.
• ICT professionals play a role in supporting FAR workflows at all levels of operation, whether it’s at the departmental or institutional level.
They may also work to provide data from other campus systems to populate the system, such as HR appointment data.
• The library is also a stakeholder, providing expertise related to publications metadata, metadata harvesting workflows, and research impact metrics. The library may also play a role in vendor negotiations.
• There are other stakeholders whose roles are important because their unit or system provides data for the FAR system, such as human resources (for appointment information), the registrar and/or the data warehouse (for course information), and the graduate school (for doctoral mentoring and committee service).

Our informants also emphasized that public profiles and FAR are currently separate workflows and managed in separate systems even though these systems collect a lot of the same information, such as the publications and other scholarly outputs of institutional researchers. But because of a lack of both technical and social interoperability, these systems may exist in duplicate across campus, even requiring repeated manual data entry by faculty into multiple systems. As one of our interviewees emphasized, “There just needs to be the human touch and coordination behind the scenes to make sure that all the units are working together in the way that they should, that all the efforts are strategically aligned.”

Research Analytics

While university offices of institutional research have long collected and reported on educational outcomes, providing information to campus on student enrollment, retention, and career outcomes, US institutions have been slower to aggregate content on research activities. There are good reasons for this difference: institutions have collected their own measures of student progress, while indicators of research productivity, such as journal articles and monographs, have been harder to capture, as they were processed and distributed outside of the organization.
Institutions relied upon proxies of research productivity, measures like the number of research doctorates awarded or extramural funding received, to provide information on research productivity and prestige. With radical changes in digital publishing, persistent identifiers, and big data in the past two decades, as well as the growing influence of international rankings and league tables, there’s growing interest in looking beyond these proxies for a more nuanced view of an institution’s research strengths, weaknesses, and networks of opportunity.

Today administrators across the institution want improved research analytics and decision support tools.34 This was a recurring theme in our interviews. One informant in research administration, when asked what problem they would solve with a magic wand, responded: “Data. Data! I’d have data at my fingertips that I could search!” Instead, they described much of the analysis of research activity on their campus as “ad hoc” and insufficient. Another informant described how the office of research at their institution wants improved “push-button reporting” about grants submitted/received, as well as to support the identification of prospective collaborators. A third informant described how their institution’s new president is “appalled” at the difficulty of understanding institutional research strengths in a data-rich way.35

We observed institutions responding to this need in a variety of ways. One institution is investing resources into a single, centralized data analytics office under the Chief Financial Officer (CFO), which will incorporate traditional institutional research professionals, as well as a reporting and analytics group that can provide expanded expertise on research metrics.
In another institution, the office of research has hired staff to support dedicated decision support and research analytics. This unit maintains its own local data warehouse, pulling data from external sources as well as numerous internal campus systems. Internal data sources include sponsored projects and extramural projects administrative databases, institutional financial data, and Enterprise Data Warehouse (EDW) data on HR appointments, space/room usage, and much more. External tools like SciVal and Pivot are also essential data sources. We also heard informants share how institutions are increasingly investing resources in managing institutional data through the development of campus data lakes and institutional data governance committees.

Libraries are also often supporting the institution with data analysis. For instance, Virginia Tech Libraries (not part of our interview cohort) shared with OCLC Research Library Partnership institutions in April 2020 about their use of data analysis to identify synergies and partnerships between Virginia Tech researchers and their counterparts in industry and government.36 In our interviews, a data analyst in the office of research emphasized that impact librarians have a lot of the knowledge needed by data analysts: an understanding of bibliographic metadata as well as the strengths and limitations of bibliometrics.

Sidebar: Research Development Community at the University of Illinois at Urbana-Champaign35
The Office of the Vice Chancellor for Research and Innovation at the University of Illinois at Urbana-Champaign is working to support research on a highly decentralized campus through an inclusive Research Development Community. This community is open to all members of the campus community interested in advancing research at Illinois, and is intended to:
• Share information about policies, events, and opportunities
• Develop and maintain templates, processes, and best practices
• Build and support member literacy in a range of topics related to research development
• Collaborate across the campus research community to identify research development challenges and support changes that enhance research at Illinois.
In particular, the research development council encourages participation from anyone with connections to research, including individuals and units that might not fully realize they have a relationship with the research enterprise, such as facilities management and corporate relations. The group hosts a campus-wide “research development day” as an opportunity to celebrate research at Illinois and to also bring in all the disparate stakeholders and service providers, including the library, campus ICT, supercomputing center, corporate relations, and institutional research.

There is keen interest in improved research analytics from across campus.

• Research administration units want improved intelligence about research productivity, campus strengths, trends, and opportunities for private research partnerships. Research development officers can also use research intelligence to inform the development of large “grand challenge” grants. Improved research analytics is seen as increasingly important for securing the prestige and competitive advantage of the institution.37
• Academic affairs units likewise want quality data to inform understanding and decision support. These leaders also see the value in aggregating institutional data to streamline existing processes such as academic program review, and quality data can bolster budgetary requests.
• The library is an important stakeholder because of its expertise with bibliographic metadata.
In particular, research impact librarians understand the indexes, tools, and limitations of bibliographic analysis and play leadership roles in advising on the responsible use of metrics.
• Campus communicators also want improved information at their fingertips, data that can offer an improved understanding of institutional strengths, to help them identify stories to tell that will boost institutional reputation.
• ICT professionals are crucial stakeholders in this landscape, playing a role as data stewards, supporting interoperability, and maintaining data warehouses. They are also key players as institutions move toward new data governance structures and develop data lakes for improved and shared analysis.

In the course of our interviews, we thought we would find significant interest and engagement in research analytics from institutional research professionals, who collect, analyze, interpret, and report educational outcomes data, but we did not. One informant offered an opinion on this gap, saying that institutional research units are largely unfamiliar with the research domain, and are instead focused on Department of Education reporting on student outcomes. The informant expects institutional research offices to remain focused on educational assessment. As a result, a variety of stakeholders from across campus must work in increasingly socially interoperable ways to contribute knowledge and skills to develop improved data and analysis about the research enterprise.38

Sidebar: Identifying synergies at the University of California, San Diego37
Campus ICT and the library at the University of California, San Diego, have long partnered to support researchers. In an effort to enrich cross-unit relationships, the two units arranged a working meeting for relevant staff members, to identify a possible joint project or collaboration. Through an icebreaking post-it note exercise they started out collecting all the services and resources offered by both units, audiences served, areas of expertise, and service gaps. Participants suddenly realized how little they knew about the offerings of the other unit. “You have that? We didn’t know you have that!” was a common refrain, and spontaneous peer consulting and planning erupted. While the originally planned project never happened, it didn’t matter. Instead, the greater knowledge and social interoperability gained through this exercise facilitated trusted relationships, collaborations, and ultimately, better support services for researchers at UCSD.

ORCID Adoption

ORCID (Open Researcher and Contributor ID) is an open, nonprofit organization that works to create and maintain a global registry of unique identifiers for individual researchers. ORCID provides a framework for trustworthy identity management by linking research contributions and related activities with their contributors across the scholarly communication ecosystem. The ORCID identifier can be integrated into a number of campus workflows and systems such as institutional repositories, grant administration workflows, RIM systems, HR systems, and institutional identity management systems. Consequently, cross-campus social interoperability is important for optimizing the technical interoperability that ORCID can help support.39

However, our informants reported that ORCID implementation efforts at their institutions were slow. For instance, one institution reported how securing buy-in and making any meaningful progress on campus ORCID adoption had taken years, finally resulting in ORCID integration with the institutional identity management system and campus directory.
Another described the need for significant campus collaboration: “we had an ORCID integration committee that was looking for recommendations and an implementation plan and that was fairly formal because there were folks from the information systems side of the house, HR, graduate school, and the office of research. We had to come up with a plan and kind of make a recommendation of leadership in the libraries.”

There are a multitude of campus stakeholders who must be engaged in ORCID adoption.

• The library is frequently the institutional leader on ORCID planning, as it has the greatest familiarity with scholarly communication practices across disciplines. Libraries frequently assume a role as advocates for ORCID adoption and assume institutional responsibility for the training and outreach to scholars.
• Research administration and faculty affairs are particularly interested in ORCID integration into RIM systems, as ORCID can help disambiguate researchers, improve metadata harvesting workflows and data quality, and reduce the need for manual entry.
• Academic affairs units share this interest in improving workflows and reducing administrative burden on faculty.
• Campus ICT is a key stakeholder because integration of the ORCID identifier into the central campus identity management system is an approach being used at many US institutions40 and can facilitate the more seamless integration of ORCID identifiers into other systems across the institution.
• Campus communicators are eager for information and storytelling opportunities about campus, which improved, disambiguated scholarly communications data can offer.41

Comments on the Library as Partner

Throughout the course of interviews, we heard several accounts from nonlibrary stakeholders on how the library is a valued partner in research support activities. In particular, our informants commented on the expertise of the library in licensing, vendor support and negotiations, and research impact and bibliometrics.
We heard numerous cases of library staff serving on search committees for the hiring of research development staff members and vice versa, with research development staff serving on search committees for library positions in data management, data visualization, and research impact. One informant saw the library as capable of making progress on things like RIM systems in part because “the library was seen as a trusted, agnostic partner on campus,” while another emphasized how the library has an important role to play as a central campus unit that serves as a trusted partner for sustainable services, not just short-term projects.

However, we also heard that there are sometimes senior leaders in research administration or campus ICT who do not always understand how or why the library should be a partner in research support activities, often because these leaders were “coming from the outside [academia] and really have no concept.” In these cases, libraries and their advocates on campus must effectively and regularly communicate their value and offerings.

Our informants described how the library was sometimes seen as less effective than it might be. Communication and scope were big issues, as its services and value proposition could be diluted by a desire to “be everything for everyone” as well as by an overemphasis on values, without appealing to the needs and interests of others. We also heard several comments about how a lack of confidence among librarians hindered their effectiveness. One of our library informants noted that “even though we are members of the general faculty . . . we are not always seen at the same level.” Another interviewee commented that librarians “don’t feel very comfortable. They don’t feel like they’re equals with the rest of campus . . .
[even though] there’s no reason why they shouldn’t feel like equals because they [provide] an amazingly valuable expertise.” A feeling of implicit bias, in the sense of not being perceived as being on an equal footing with faculty, was also reported by nonlibrary administrative professionals, including by one interviewee who recommended confidence in one’s own abilities:

I think the number one ingredient is the understanding that I bring a certain expertise to the table that [a faculty member] might not have. You are a faculty member in your areas, and you’re a leading world expert on it. Great! It doesn’t mean you know how to do data analytics related to your publication citation count.

Another informant thought it imperative that librarians see themselves as “equal partners to make teams of diverse expertise to accomplish significant, important objectives quickly.”

In our interviews, the library was sometimes also seen as “slow,” moving less quickly and with less urgency than other parts of campus: “they absolutely do not move at the same pace that research faculty move.” A couple of informants also commented on the library’s discomfort with financial realities or cost recovery, describing an “unrealistic” desire for everything to be “free,” resulting in the criticism that libraries “don’t focus on the freaking bottom line.”

In sum, our interviews highlighted the importance of cross-campus social interoperability in the successful provision and use of major categories of research support services. In the next section, we will focus on strategies for increasing social interoperability and the success of cross-institutional research support efforts.
Cross-Campus Relationship Building: Strategies and Tactics

When things work well, it’s about people and relationships. When things don’t work well, it’s often also about people and relationships. (Academic Dean)

You can make more friends in two months by becoming interested in other people than you can in two years by trying to get other people interested in you. (Dale Carnegie, How to Win Friends and Influence People)42

Strategies and Directions

Considerable energy is invested in relationships and trusted partnerships in the provision and use of research support services. This investment of time and stewardship is necessary given the complexities of the campus environment, which is characterized by highly heterogeneous interests and needs of smart independent agents, no single point of control, and a high level of self-organization. Strategies and challenges of cross-campus relationship building were discussed repeatedly in our interviews, and some recurrent themes emerged.

SECURE BUY-IN

One of the strongest common themes in our interviews was the need to get people “bought in to what you want to do.” Collaborations work best when everyone “thinks they are getting something they want.” This is especially important when working with independent agents in a decentralized campus environment. Persuading someone that something is in their own best interest to act upon is a powerful tactic in an environment where mandates do not exist or do not work. More than one of our interviewees called this “selling”: selling the idea and the role of the unit in it. “Of course, you’ve got to sell all the time!” Another interviewee explained: “Really, it’s building your services so that they’re meeting the needs that you think need to be met as well as possible so they’re attractive to people to use them.
Kind of just like competing in regular free market.”

Self-interest is a powerful motivator and can be leveraged in mutually beneficial ways. Our interviewees described directly appealing to the other party’s needs and goals as far more powerful than highlighting shared values or noble principles. “So being in a decentralized institution, I have to persuade people that it’s in their best interest to do it. But if I can do that successfully, it’s much more likely to lead to [institutional] climate change than mandating.” Appealing to people’s self-interest requires the ability to offer something that speaks to those needs, in a language that is clearly understood by the other party. This will help the unit to be more successful and to better align with campus goals and perspectives.

People who are promoting their own agenda only, or their unit’s, rather than the entire university’s were seen as counterproductive by our interviewees. One senior university leader said, “that agenda thing is something that really, especially in academia, is the thing that really turns people off. . . . And I don’t even think it has to be a mutual benefit. . . . We can dissolve it if it’s better to be in another unit, and I can go do something else with my life. That’s all okay as long as it’s not for some stupid, frivolous territorial thing that somebody needs to own everything. If it’s truly in the best interest of the campus and our researchers, then it can be okay.”43

KNOW YOUR AUDIENCE

Deeply understanding other stakeholders on campus becomes crucial when appealing to their self-interest is the best way to succeed. Throughout all our interviews we heard a variation of: Do your research on them first and then be “meaningfully relevant” to them. Relationship building at this level goes beyond knowing job titles.
It requires real engagement and a deep understanding of other people’s responsibilities, priorities, and activities. It means being curious and courteous: taking the time to learn about what others do, developing trust, and stewarding the relationship over time.

One of our library interviewees shared how they used a fixed set of ten questions for conversations with stakeholders (see sidebar). These questions were not immediately focused on what the library can bring to the table. Instead, the first seven questions explored larger issues about how the other stakeholder perceived campus priorities and how their unit might be affected by changing priorities. In the course of working through the ten questions, the focus narrowed, until the final questions touched only on library services. The informant noted that they “got some of the richest information out of those first seven questions when they didn’t know that we’re talking about the library because they didn’t know that we could do things in areas that they were talking about.” This strategy increased library awareness of the priorities and challenges of other units and provided the context the library needed to strategically align their work successfully with stakeholders. It also raised awareness among the other stakeholders about the offerings of the library and even provided the spark for some new programs and collaborations.

Sidebar: A script for learning about other units used at Rutgers University–New Brunswick43
1. In what major ways do you see the University’s work and focus changing during the next 2-3 years?
2. How are these changes affecting the work and focus of your school/department/program (unit)?
3. What are your unit’s goals for the next 2-3 years?
4. What about your responsibilities within the unit? What are your top responsibilities now, and how do you see these changing over the next 2-3 years?
5. What challenges must your unit overcome in order to meet its goals?
6. If you were a new hire, what tools and services would you need to be successful?
7. The next several years will not only be all about challenges. What are the opportunities that your unit will be pursuing? What do you see as exciting during the next few years?
8. How do the librarians and libraries contribute to your work now?
9. Considering the University’s goals and your unit’s goals, how could the libraries best contribute to the work of your unit, and to you, during the next few years?
10. What do you want the libraries to give careful consideration to as we craft our strategic plan?

Paying attention to what is happening on campus more broadly is part of this effort to understand your existing or potential audiences. This can mean something as simple as reading emails coming from other units instead of filtering them out as spam, attending events to demonstrate interest in others, and serving on campus committees. Interviewees repeatedly warned against underestimating the importance of “just knowing what other people are up to or having other people know what you’re up to.” Even a small-scale project undertaken by a single campus unit and aimed at a limited audience may be an indicator of an unfulfilled large-scale need that can be identified and addressed through cross-unit cooperation.

Understanding the landscape of one’s institution, including the national landscape, was listed as high priority by many of our interviewees. “If you think you don’t understand, you have even more of an obligation to kind of immerse yourself and understand more.” Conversely, no unit should assume others know what they do, but should actively reach out, make it easy for others to learn about its services and needs, and routinely make the case for what it needs.
SPEAK THEIR LANGUAGE

Different units on campus use different terms for the same things, for historical or cultural reasons. What some call financial viability, others call sustainability. Some talk about profit, others are more comfortable with calling it surplus. Some prefer the term support over subvention. Some units, such as the library, tend to avoid terminology with a corporate or business inflection; other units use that language and can best be served by adopting it too. Metadata is a particularly fraught example; it means something specific to librarians but something very different to IT and/or data warehouse administrators.

Interviewees across the board emphasized the importance of the ability to speak the audience’s language. If a service or project is not understood to be addressing a problem because of the language used, it may not capture the attention of whoever has that problem. Being prepared to deliver an elevator speech, when necessary, is one aspect of this. As one informant put it, “Do you know how long you [have] to make that case? Two minutes. Two minutes. And if you are not successful, the meeting is over.” Some of our informants shared how they help other units package their information in more suitable ways (e.g., by producing one-page information sheets on topics, or key talking points for outreach and engagement with others on campus).

OFFER CONCRETE SOLUTIONS TO OTHERS’ PROBLEMS

Another important theme in our interviews was the importance of understanding others’ pain points and of demonstrating how your offerings can help alleviate them. Interviewees shared how it is useful for them to not go into meetings empty-handed, to anticipate needs, be proactive about building skills, and to offer solutions in advance of demand rather than just vaguely asking how they could help.
One of our interviewees was particularly outspoken on this, recalling a situation in the past when they worked as a faculty member and were asked that question by a library representative:

“What can I do for you?” . . . That’s like the most freaking passive-aggressive crap-ass thing you could ever do to a faculty member because how the hell should I know what he could do for me? I don’t know what he knows. I don’t know what resources he has. I don’t know how much time he has. So it’s not my job to educate him on how he can help me. It’s his job to figure out what my needs are, and to come in with, “Hey, I’ll bet you’re trying to.”

More than once, initiating the first step was mentioned as a tactic on campus: offering concrete assistance or cooperation without expectation of immediate payoffs or advocating for others to get invited to a meeting potentially relevant to them in the hopes that one day they remind others to include you where you should be included.

TIMING IS ESSENTIAL

No pushing will help when the timing is not right, when needs are diffuse and urgency low, or when current priorities differ entirely. One informant said, “Until they need to hear it, they’re not going to hear it.” Creating awareness, informing repeatedly, and patiently waiting for the right moment (or even until the right partner comes into a role) can be the best strategy in an environment of nonlinear dynamic behavior and differing goals.
All of this takes considerable time and effort, our interviewees agreed, as well as patience and perseverance:

I think certainly at a big university like [ours], remembering the information lag factor, it takes people a while to realize you exist and then it takes people a while to remember what you do, and then they’ll remember what you do and then you’ve gone on to do several other things, but they still only remember the first thing that they learned about you. And then if you screw up then that’s the last story they remember, and they might not update their data on you for a while.

Relationship Building: Practical Advice

Building new and maintaining existing relationships on campus requires considerable commitment and investment. We asked our interviewees to share how they made this happen, what opportunities there were to learn about stakeholders on campus, and which ones they found to be more useful than others.

MEETING OPPORTUNITIES

Our interviewees emphasized the importance of making regular contact with other stakeholders to build trust and steward relationships. These contacts existed on a continuum from formal to informal and from scheduled to spontaneous, and there is value in every type of interaction.

Committee work (serving on research committees, the faculty senate, or other bodies) was mentioned repeatedly as an invaluable opportunity for relationship building: to present oneself as a potential partner and to demonstrate good citizenship and support of larger university goals. It is excellent for temperature taking and trust building, and it helps sharpen skills in many ways.
Faculty governance, in particular, was mentioned as something important for library staff to be engaged in, to find out what other people were talking about and how it might impact the library, as well as to increase library staff’s visibility and confidence as faculty members on an equal footing with other faculty: “I think that my work in Senate gave me so much more opportunity and ability to build these relationships with faculty that I really wish my staff had more of. . . . I do feel like participating in governance can really help you to grow those skills that you need to be an effective liaison as a collaborator rather than as a servant.”
Scheduling standing meetings with stakeholders was strongly recommended by several interviewees, both for general knowledge sharing and as a welcome option to raise or discuss topics of relevance without unnecessarily ringing alarm bells. Especially when new staff come on board in other units, creating opportunities to meet them early and regularly was mentioned as good practice. Executive level support can be particularly helpful with creating the right sort of relationships at the right point in time.
Some interviewees saw an opportunity to create communities that cut across campus silos to unite people with shared interests on campus. We learned about examples of open and inclusive groups on campus that regularly convene with the express purpose of facilitating communication and networking, such as the Research Development Community at the University of Illinois (see sidebar, page 22). Such initiatives are good examples of the self-organized interest groups, arising to meet evolving needs, that are so typical of complex adaptive systems.
Finally, informal or “hallway” conversations before or after more formal meetings were highlighted as important ways of engaging.
In these conversations, free of pressure or expectations, real progress can be made. People are less suspicious, and “frankly less guarded,” one interviewee remarked.
SHARED STAFF AND EMBEDDED RESOURCES
Another recurring theme was the benefits that staff movements can bring to the relationships between units, be it shared staff, embedded staff, or staff that moves around when changing roles. A network of former colleagues spread out across campus can be immensely beneficial. Members of staff familiar with one unit and closely working with another can function as trusted “ambassadors,” “allies,” and “champions,” and can effectively “translate” goals, processes, or values between units, as well as connect people. They can help with “cross-pollination,” the cross-unit flow of information and expertise, or simply with “getting a feel for their day-to-day struggles and activities.” And while most staff moves occur organically in the course of natural career progressions, encouraging them can even become a strategy. One of our interviewees told us they purposefully nurture talent in their unit to help them move elsewhere on campus. Based on what we heard in our interviews, the units the library shared staff with or library staff was moved to most often were campus IT or technology and the research office, sometimes as a result of previous project cooperation.44
Troubleshooting in Relationship Building
MAKING CONNECTIONS
A common issue our interviewees reported dealing with was that of making connections with the right people. Referrals and recommendations were often described as being immensely helpful, much more so than any cold calling. One informant shared how through their investment in long-term relationships with faculty members, “sometimes we get faculty who then introduce us to the next faculty member because they say, ‘These folks have helped us.’” In particular, the importance of a “connector” or “hub” person was recognized by several of our interviewees.
The value of someone well-connected on campus, a “hub of hubs” who can help identify partners or recommend connections, people to meet with, and workshops to attend, cannot be overestimated. The best of these people “can see both the details and the whole and bring them together on a campus to talk through the research enterprise. How do we make it better, faster, stronger, easier? How do we identify ways that the system can help support that better?” Some of our interviewees identified with the role of a connector on campus themselves and said their job was “to be a facilitator.” This is also an area where senior leadership support can be very helpful. Several of our informants emphasized that it is important for relationships and conversations to take place within multiple levels of the parent units up and down the organizational hierarchy. And while top-down directed collaborations tend to fail, having executive support behind collaborations can be good to move people along, as we heard some of our informants say.
PERSONALITIES
One of the common issues our informants reported dealing with was that of having to get along with the personalities on campus. Relationship building is all about people. Interviewees often mentioned how their relationships and partnerships depended on the personalities involved, and in some cases failed because of this. Even when the difficulties seemed to lie in the unit or program, interviewees felt that, ultimately, they originated from differences in personalities, rather than disciplinary perspectives. In such cases, it can be helpful to deeply understand not only professional priorities, but also personal sensitivities so you can “sell to” those more personal needs, too. Still, sometimes an individual can prove impossible to work with.
In these cases, walking away for a time and waiting for someone else to fill a role can be the most productive way to deal with a situation. One interviewee clearly recommended not to “spend time trying to work with areas that are less receptive” and instead to work with those you can. “The good news of being at a big university is there’s plenty who are happy to make progress. . . . So in the meantime [while a certain unit is not amenable], we’ll work with those who want to make changes and do these things.” Good relationships cannot be forced but must be stewarded over time. However, more than one interviewee also recommended not to assume malicious intent. Following up to inquire if something potentially offending happened may be all it takes to see it fixed and the relationship maintained. In any case, our informants warned against ever burning bridges.
KNOW YOUR VALUE / BE CONFIDENT
It is important to adapt to one’s audience, but it is equally important to be very clear and confident regarding one’s own role and value, including the scope of one’s work. Being everything to everyone will not work. Stay focused on what you want to achieve: saying no or limiting scope can strengthen your value as a reliable partner.
Challenges: Managing Resistance and Sustaining Energy
In complex adaptive systems it is not uncommon to see differing goals and behaviors result in internal conflicts and outright or perceived competition. Interviewees talked about how they are constantly trying to anticipate negative responses from different corners of campus and, at the same time, avoid losing control over their communications and efforts.
This type of risk management is an important component in developing research support activities in the complex university environment of diffuse interests and conflicting perspectives.
MANAGING RESISTANCE
Independent agents may feel free to openly resist institutional initiatives in a system lacking single points of control. One successful tactic for dealing with the risk of upsetting others is that of consulting early and often with other stakeholders. For example, one informant recommended sharing ideas or drafts early in the process in order to take the temperature and collect preliminary feedback from stakeholders, top to bottom, so they all can feel consulted, concerns are addressed, and buy-in developed upfront. That way, one interviewee said, stakeholders will not feel blindsided by the launch of something new. Another informant emphasized the need to anticipate if and how new collaborations, or collaborative projects, impact business or administrative processes. Process changes often create resistance, and it is important to deal with them early and wisely.
Resistance can also result when units feel their work or autonomy is at risk or initiatives are perceived as competing. One interviewee shared an example where they “ended up stepping on toes across the organization” because their unit offered services in a research support area (impact analysis) that others felt they owned. Departmental units felt their local autonomy was threatened. In such cases, the informant recommended, it is wiser not to try to replace existing services, but rather to find ways to complement and support them (with data, for example) while acknowledging the units’ independence. Earlier consultation with these audiences might have also reduced this friction.
INVESTING THE ENERGY
With risks to manage, relationships to steward, and plenty of work to do, it is not surprising that we heard that people could feel “overwhelmed.” But our informants also emphasized the effort they invest in relationships. “It’s going to take quite a bit of effort to learn and listen about the other person’s perspectives and where they’re coming from.” And the work of relationship building never ends “because if somebody changes, you’ve got a new person in a position, then you’ve lost all that historical agenda in that relationship.” This can be frustrating over time, even grueling. Collaboration can slow down progress: collaboration and speed can end up being trade-offs that must be balanced, potentially resulting in the duplication of systems and services on campus mentioned earlier. But despite this, our informants overwhelmingly agreed that taking the time to build strong cross-institutional relationships was essential for attaining individual and collective goals.
People in emergent roles, especially, also report feeling isolated. They often lack a team to support them, or just to free them up for their mission-critical work. Interviewees mentioned several examples where lack of resources or support (for example, help with marketing tools or assistance with event planning) made it harder for them to do impactful work. Having to attend to work outside their immediate expertise is an additional stress point for staff in emerging roles. We also heard of 80-hour work weeks and talked to informants who felt overworked and tired. In this situation, making an effort at relationship building can seem overly burdensome. But getting out of the office to learn more about what others are doing can also reduce the feeling of isolation and provide opportunities for building community and getting support.
Relationship building is a significant but valuable investment. It is not cost-free, but as our informants made clear, the rate of return is usually quite high.
FIGURE 3. Key takeaways about successful intra-campus social interoperability
[Figure labels. The university: Academic Affairs; Research Administration; The Library; Information & Communications Technology (ICT); Faculty Affairs & Governance; Communications. Key takeaways: Secure buy-in; Know your audience; Speak their language; Offer solutions to problems; Timing is essential; Find opportunities to connect; Leverage shared staff; Find “connectors”; Manage personalities; Be confident in your value; Manage resistance; Invest the energy.]
Conclusion
We undertook this project to explore the role of social interoperability in research support following previous OCLC Research efforts where we observed the need for libraries to work closely with other campus stakeholders to advance resources and services.45 Our goal was to focus entirely on the topic of cross-campus, cross-domain institutional collaboration, and, using the human intelligence offered by our interview subjects, offer guidance for successful social interoperability in the complex adaptive system of the university. Effective social interoperability across campus units is an important, and increasingly necessary, feature of successful research support services, and requires a thorough knowledge of campus partners. In this report, we have gathered information from stakeholders in research support around the university, describing their goals, interests, expertise, and crucially, the importance of cross-campus relationships in their work.
Based on our informants’ experiences, we drew out lessons and good practices on fostering social interoperability in the provision and use of research support services (figure 3). Our key findings include:
• US research universities are highly decentralized, dynamic institutions, filled with heterogeneous, independent agents that sometimes work at cross purposes. This environment creates specific challenges and calls for the creation and maintenance of working relationships across individuals and organizational units that promote collaboration, communication, and mutual understanding: in short, social interoperability. This is of special significance for stakeholders in research support, where roles are often new, responsibilities emerging, and staff often report feeling isolated in the absence of an established community of practice within and beyond the university.
• The essential first step in building successful campus partnerships is to know who the other stakeholders are: what they do, what they prioritize, and how they see themselves contributing to the university mission. In “A model for conceptualizing university research support stakeholders,” we present a conceptual model of key stakeholders in the provision and consumption of research support services: Academic Affairs, Research Administration, Library, Information and Communications Technology, Faculty Affairs and Governance, and Communications.
• In “Social interoperability in research support services,” we document our informants’ experiences in building and maintaining cross-campus relationships in key research support service areas: research data management (RDM), research information management (RIM), research analytics, and ORCID adoption. Our interviews highlight the importance of social interoperability in the successful provision and use of research support services.
But challenges remain; even when stakeholders are working synergistically, it can still be difficult for researchers to know which resources are available if there is no central resource that indexes these services provided by different stakeholders. Duplication of systems and services is common. And progress can be slowed by the necessity of first securing buy-in across stakeholders on campus.
• “Cross-campus relationship building” suggests lessons and best practices from our informants on how to optimize social interoperability in research support. For instance, persuading someone that something is in their own best interest to act upon is a powerful tactic in an environment where mandates do not exist or do not work. In addition, knowing your audience, speaking their language, offering concrete solutions to their problems, and getting the timing right are important strategies. Considerable investment of energy and time is necessary for building and maintaining cross-campus relationships, but as our informants made clear, the rate of return is usually quite high.
ACKNOWLEDGMENTS
The authors extend special thanks to our interview informants who generously shared their expertise and time with us for this investigation. We also thank members of the OCLC Research Library Partnership who tapped into their own campus networks to recommend possible interview informants for this study. Several OCLC colleagues provided guidance and support in the preparation of this report.
Ixchel Faniel provided input on strengthening the interview protocol; Erin Hood and Nick Spence assisted with note-taking and interview project management activities; and Lynn Silipigni Connaway extended resources for interview transcription as well as offered sage guidance throughout. The report could not have been published without the significant efforts of the OCLC Research publishing team, including Erica Melko, Jeanette McNicol, and JD Shipengrover. Finally, our work was made possible by the senior leadership of OCLC; we wish to particularly thank Lorcan Dempsey, Vice President, Membership and Research, OCLC, for his continued support of this effort.
APPENDIX: INTERVIEW PROTOCOL
Institutional Stakeholders in Research Support Project, oc.lc/stakeholders
Date of interview
Informant’s name
Informant’s title
Informant’s unit
Informant’s institution
0. Introductions (5 minutes/xx:00-xx:05)
Thanks for talking with us today. We want to spend 75 minutes with you today, talking about your role at your institution, in order to learn more about your unit’s goals, tasks, challenges, and collaborations. This discussion is part of our information gathering for a project entitled “Institutional Stakeholders in Research Support,” in which we are examining and documenting the numerous campus stakeholders that, as we observe, are increasingly called to work together, to support one or more research activities on the university campus today. The three of us here are the core research team working on this project being conducted by OCLC Research, a leading research institute or think tank investigating issues relevant to the world’s libraries. At the conclusion of our project, we will publish a synthesis of our findings as an OCLC Research Report. I will be leading the discussion while my colleagues take notes.
Introductions [ask each participant to quickly share their name and role]
Your interview today is confidential and your comments will be useful to us as we attempt to synthesize the variety of goals and roles taking place at research universities today. We would like to record our conversation today, but only for our own personal use; we will not share the recordings with others.
Did informant agree to allow recording? (Y/N)
1. Why is the work that you do important? (15 minutes/xx:05-xx:20)
Question purpose: to understand their main goals and how these align with institutional goals. This question should also help us understand the drivers, although the follow-up questions may be necessary to get there.
Follow-up questions:
a. Redirect to focus on research support services. Do you feel that part of what you do is providing research support? [relevant to only some informants]
b. Why is this work valuable to your institution? Your campus unit? Researchers?
c. Who are the main stakeholders who care about the work that you do? These may be people or organizations inside or outside your university. [this is an incentives question: can we maybe use the RDM incentives model?]
2. HOW do you do it? (15 minutes/xx:20-xx:35)
Question purpose: to get them to describe what their unit does (the tasks).
Follow-up questions:
a. What is your unit really good at?
b. What’s most important?
c. Is your unit typical of practices at similar institutions?
d. [for campus IT: you work at systems of scale. Are there differences in how this works for research services vs. educational services?]
e. Research support services have become a much more visible part of the service portfolio on campus. Are you familiar with that term, and if so, what kinds of services come to mind?
[if not familiar, here are some examples: RDM, RIM, bibliometrics support; services that support researchers and also services that support the institutional research enterprise, reporting, and reputation management.]
3. What are the most beneficial relationships for helping you achieve your goals? What are the relationships that are important for achieving your unit’s goals? (20 minutes/xx:35-xx:55)
Question purpose: to understand who they are partnering with.
Follow-up questions:
a. What units are your most common collaborators/partners?
b. Are you trying to build new relationships across campus? Why?
c. Have you tried to collaborate with some units and failed?
d. Have you partnered with the library?
e. What about off-campus collaborations? Professional conferences?
4. If you could wave a magic wand, what would you change or fix? (10 minutes/xx:55-xx:05)
Question purpose: to understand their pain points.
Follow-up questions:
a. What are some new things on your road map that you’d like to accomplish?
b. Can you give us a specific example of something you are trying to do?
c. What are the primary barriers?
5. Is there anything else we should have asked? (5 minutes/xx:05-xy:10)
Comments/Perceptions
NOTES
1 Bryant, Rebecca, Anna Clements, Pablo de Castro, Joanne Cantrell, Annette Dortmund, Jan Fransen, Peggy Gallagher, and Michele Mennielli. 2018. Practices and Patterns in Research Information Management: Findings from a Global Survey. Dublin, OH: OCLC Research. https://doi.org/10.25333/BGFG-D241; Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2018. Sourcing and Scaling University RDM Services. The Realities of Research Data Management, Part 4. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3QW7M; Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2018. Incentives for Building University RDM Services.
The Realities of Research Data Management, Part 3. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3S62F; Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2017. Scoping the University RDM Service Bundle. The Realities of Research Data Management, Part 2. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3Z039; Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2017. A Tour of the Research Data Management (RDM) Service Space. The Realities of Research Data Management, Part 1. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3PG8J.
2 Malpas, Constance, Roger Schonfeld, Rona Stein, Lorcan Dempsey, and Deanna Marcum. 2018. University Futures, Library Futures: Aligning Library Strategies with Institutional Directions. Dublin, OH: OCLC Research. https://doi.org/10.25333/WS5K-DD86.
3 The University of Rhode Island. “Assistant Professor, Library Chief Data Strategist.” Human Resource Administration: Posting Details. (Archived 28 February 2020). https://web.archive.org/web/20200228000245/https://jobs.uri.edu/postings/7102/print_preview.
4 NC State University. “Researcher Support.” North Carolina Training Consortium. Accessed 3 August 2020. https://research.ncsu.edu/nctc/study-guide/project-administration/project-management/researcher-support/.
5 Si, Li, Yueliang Zeng, Sicheng Guo, and Xiaozhe Zhuang. 2019. “Investigation and Analysis of Research Support Services in Academic Libraries.” The Electronic Library 37, no. 2: 281-301. https://doi.org/10.1108/EL-06-2018-0125.
6 We first used the term “social interoperability” in this way in early 2019. See Lavoie, Brian. 2019. “RLP Research Data Management Interest Group: Acquiring RDM Services for Your Institution,” Hanging Together: the OCLC Research blog, 6 February 2019. https://hangingtogether.org/?p=6997.
7 Corrall, Sheila. 2014. “Designing Libraries for Research Collaboration in the Network World: An Exploratory Study,” 37. LIBER Quarterly 24, no. 1: 17-48.
https://www.liberquarterly.eu/article/10.18352/lq.9525/.
8 See Bradley, Cara. 2018. “Research Support Priorities of and Relationships among Librarians and Research Administrators: A Content Analysis of the Professional Literature.” Evidence Based Library & Information Practice 13 (4): 15–30. https://doi.org/10.18438/eblip29478; Bradley, for example, notes that “the importance of collaborating with others on campus (units, students, and faculty) in developing and delivering support for student learning has been well-documented . . . There has been less evidence collected about how academic libraries can best support campus research.” (p. 16). Bradley goes on to observe that collaboration in research support documented in the literature tends to focus on research data management. (p. 17-18)
9 A copy of the interview protocol is provided in the report appendix.
10 Some specific gaps include in-depth discussions about the roles of Technology Transfer, Institutional Research, or Corporate Relations units, which may be stakeholders in research support services on some campuses.
11 Dean, Jr., James W., and Deborah Y. Clarke. 2019.
The Insider’s Guide to Working with Universities: Practical Insights for Board Members, Businesspeople, Entrepreneurs, Philanthropists, Alumni, Parents, and Administrators, 17. Chapel Hill: University of North Carolina Press.
12 Rouse, William B. 2016. Universities as Complex Enterprises: How Academia Works, Why It Works These Ways, and Where the University Enterprise Is Headed, 5-9. New York: Routledge.
13 Ibid.
14 Hazelkorn, Ellen. 2011. Rankings and the Reshaping of Higher Education: The Battle for World-Class Excellence, 5-10. Houndmills, Basingstoke, Hampshire: Palgrave Macmillan. https://doi.org/10.1057/9781137446671. In the past two decades, state support for public higher education has declined by billions of dollars and undergraduate enrollment is also in decline, with larger declines on the horizon in the 2020s; Mitchell, Michael, Michael Leachman, and Kathleen Masterson. 2017. A Lost Decade in Higher Education Funding: State Cuts Have Driven Up Tuition and Reduced Quality. Washington, DC: Center on Budget and Policy Priorities. https://www.cbpp.org/research/state-budget-and-tax/a-lost-decade-in-higher-education-funding; Nadworny, Elissa, and Max Larkin. 2019. “Fewer Students Are Going To College. Here’s Why That Matters.” NPR KQED audio (Education), 16 December 2019, 5:00 AM ET, Morning Edition (6 minutes). https://www.npr.org/2019/12/16/787909495/fewer-students-are-going-to-college-heres-why-that-matters.
15 Connaway, Lynn Silipigni, William Harvey, Vanessa Kitzie, and Stephanie Mikitish. 2017. Academic Library Impact: Improving Practice and Essential Areas to Research. Chicago, Illinois: Association of College & Research Libraries, 31, 40. http://www.ala.org/acrl/sites/ala.org.acrl/files/content/publications/whitepapers/academiclib.pdf.
16 Cox, John. 2018. “Positioning the Academic Library within the Institution: A Literature Review.” New Review of Academic Librarianship 24, no. 3-4: 217–41. https://doi.org/10.1080/13614533.2018.1466342.
17 Whitchurch, Celia. 2015. “The Rise of Third Space Professionals: Paradoxes and Dilemmas.” In: Forming, Recruiting and Managing the Academic Profession, edited by U. Teichler and W. Cummings, vol. 14. The Changing Academy – The Changing Academic Profession in International Comparative Perspective. Switzerland: Springer, Cham. https://doi.org/10.1007/978-3-319-16080-1_5.
18 There’s a significant literature accusing the bloating number of unnecessary, highly paid administrators as the cause of rising college costs. However, there’s much more evidence that drastically reduced state support for public education is the primary factor. Many new positions have been added, and seen as necessary, as institutions have added IT infrastructure, compliance officers, and more student and research support services. As faculty member and author Robert Kelchen says, “Faculty do complain about all the assistant and associate deans out there, but this workload would otherwise fall on faculty. And given the research, teaching, and service expectations that we face, we can’t take on those roles.” See Kelchen, Robert. 2018.
“Is Administrative Bloat Really a Big Problem?” Blog (Kelchen on Education), 10 May 2020. https://robertkelchen.com/2018/05/10/is-administrative-bloat-a-problem/; For a good discussion of these misconceptions, see Dean, Jr., James W., and Deborah Y. Clarke. 2019. The Insider’s Guide to Working with Universities: Practical Insights for Board Members, Businesspeople, Entrepreneurs, Philanthropists, Alumni, Parents, and Administrators, 131-133. Chapel Hill: University of North Carolina Press.
19 Dean, Jr., and Clarke. The Insider’s Guide, 32 (See note 11).
20 The Ohio State University. “Office of Research.” https://research.osu.edu/.
21 Stanford University. “Office of Research Administration.” https://ora.stanford.edu/.
22 Pesce, Jessica R. “Student Affairs Has an Association; Faculty Affairs Needs One, Too,” The Chronicle of Higher Education, 21 August 2018, https://www.chronicle.com/article/Student-Affairs-Has-an/244313.
23 Dean, Jr., and Clarke. The Insider’s Guide, 32 (See note 11).
24 This ten-question interview script is included in the section “Cross-campus relationship building.” (See “A script for learning about other units used at Rutgers University–New Brunswick,” sidebar, p. 27.)
25 Sheila Corrall described the need for greater operational convergence in the provision of research support services, as libraries increasingly partner with other institutional stakeholders, such as the office of research; See Corrall, Sheila. 2014. “Designing Libraries for Research Collaboration in the Network World: An Exploratory Study,” 37. LIBER Quarterly 24 (1): 17-48.
https://doi.org/10.18352/lq.9525; Cara Bradley, in her review of the library and research administration literature, found that even in cases where these two professions engaged in the same topics, they focused largely on different aspects. And, more significantly, the literature of each profession demonstrated little awareness of the activities and interests of the other. See Bradley, Cara. 2018. “Research Support Priorities of and Relationships among Librarians and Research Administrators: A Content Analysis of the Professional Literature.” Evidence Based Library & Information Practice 13 (4): 15–30. https://doi.org/10.18438/eblip29478, 26-28.
26 Lavoie, Brian. “RLP Research Data Management Interest Group: Acquiring RDM Services for Your Institution,” Hanging Together: the OCLC Research blog, 6 February 2019. https://hangingtogether.org/?p=6997.
27 National Science Foundation. “Dissemination and Sharing of Research Results.” https://www.nsf.gov/bfa/dias/policy/dmp.jsp.
28 Research Data Management has been an area of significant interest to OCLC Research, such as the Realities of Research Data Management series published in 2017-2018, as well as many other publications made publicly available on the OCLC web site, https://www.oclc.org/research/areas/research-collections/rdm.html; Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2017. A Tour of the Research Data Management (RDM) Service Space. The Realities of Research Data Management, Part 1.
Dublin, OH: OCLC Research. https://doi.org/10.25333/C3PG8J.
29 RIM is an emerging area of library interest and a subject of much previous OCLC research, such as Bryant, Rebecca, Anna Clements, Carol Feltes, David Groenewegen, Simon Huggard, Holly Mercer, Roxanne Missingham, Maliaca Oxnam, Anne Rauh, and John Wright. 2017. Research Information Management: Defining RIM and the Library’s Role. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3NK88.
30 European systems for collecting research information are typically called Current Research Information Systems (CRIS) and are used for collecting and reporting on institutional research productivity. Usage of the term CRIS is uncommon in the United States. See Wikipedia. “Current Research Information System.” Updated 2 August 2020, at 14:51 (UTC). https://en.wikipedia.org/wiki/Current_research_information_system.
31 Bryant, Lavoie, and Malpas, Research Information Management: Defining (See note 29); Bryant, Rebecca, Anna Clements, Pablo de Castro, Joanne Cantrell, Annette Dortmund, Jan Fransen, Peggy Gallagher, and Michele Mennielli. 2018. Practices and Patterns in Research Information Management: Findings from a Global Survey. Dublin, OH: OCLC Research. https://doi.org/10.25333/BGFG-D241.
32 Traditionally, libraries purchased and licensed materials from external sources, to be made available locally: an “outside-in” collection. In more recent years, there has been movement among research libraries to an “inside-out” model, where institutional outputs (digitized special collections, researcher profiles, etc.) are shared with an external audience. Explained in greater depth in Dempsey, Lorcan, Constance Malpas, and Brian Lavoie. 2014. “Collection Directions: Some Reflections on the Future of Library Collections and Collecting.” Libraries and the Academy 14 (3): 393–423. https://doi.org/10.1353/pla.2014.0013.
33 Rouse, William B. 2016.
Universities as Complex Enterprises: How Academia Works, Why It Works These Ways, and Where the University Enterprise Is Headed, 61. New York: Routledge.
34 Through conversations with OCLC Research Library Partnership institutions, we know that research analytics is a growing area of activity and investment for research libraries. These conversations, and institutional responses, were documented in the OCLC Research Hanging Together blog: Lavoie, Brian. “Making Connections: Research Analytics at Virginia Tech,” Hanging Together: the OCLC Research blog, 13 April 2020. https://hangingtogether.org/?p=7854, and Lavoie, Brian. “Research Analytics: Where Do Libraries Fit In?” Hanging Together: the OCLC Research blog, 2 December 2019. https://hangingtogether.org/?p=7623.
35 See University of Illinois. “Research Development Center.” https://rdc.research.illinois.edu. Institutional permission was given to publicly recognize the institutions highlighted in the sidebar case studies.
36 Lavoie, Brian. “Making Connections: Research analytics at Virginia Tech,” Hanging Together: the OCLC Research blog, 13 April 2020. https://hangingtogether.org/?p=7854.
37 Permission to publicly recognize this institutional activity was provided by a university representative.
38 The Association for Institutional Research (AIR) is the primary professional organization in the United States for institutional research professionals.
It provides an overview of the “Duties and Responsibilities of Institutional Research” professionals on its web site at: https://www.airweb.org/ir-data-professional-overview/duties-and-functions-of-institutional-research.
39 The ORCID US Community, supported and led by LYRASIS in partnership with the Big Ten Academic Alliance, the Greater Western Library Alliance (GWLA), and the NorthEast Research Libraries (NERL) provides resources, training, and community support for ORCID adoption in the United States. https://www.lyrasis.org/Leadership/Pages/orcid-us.aspx.
40 Lyrasis. “ORCID US Exemplars.” https://www.lyrasis.org/Leadership/Pages/ORCID-US-Exemplars.aspx.
41 The ORCID US Community offers guidance to institutions on securing stakeholder support at: Lyrasis. “ORCID US Community Planning Guide for Research Institutions.” https://www.lyrasis.org/Leadership/Pages/orcid-us-planning-guide.aspx.
42 Carnegie, Dale. 2009. How to Win Friends and Influence People. New York: Simon and Schuster.
43 Permission to publicly recognize this institutional activity was provided by a university representative.
44 Moving staff between the library and the research office was also encouraged in a recent symposium held in Washington, DC. “Critical Roles for Libraries in Today’s Research Enterprise. In Symposium Proceedings,” 11 December, 2019, https://library.ucalgary.ca/ld.php?content_id=35088958.
45 The need for library cooperation with multiple stakeholders was particularly documented in the Realities of Research Data Management series as well as the Practices and Patterns report on global RIM practices: Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2017. A Tour of the Research Data Management (RDM) Service Space. The Realities of Research Data Management, Part 1. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3PG8J. Bryant, Rebecca, Anna Clements, Pablo de Castro, Joanne Cantrell, Annette Dortmund, Jan Fransen, Peggy Gallagher, and Michele Mennielli. 2018. Practices and Patterns in Research Information Management: Findings from a Global Survey. Dublin, OH: OCLC Research. https://doi.org/10.25333/BGFG-D241.
OCLC RESEARCH REPORT
For more information about our work related to digitizing library collections, please visit: oc.lc/digitizing
6565 Kilgour Place, Dublin, Ohio 43017-3395
T: 1-800-848-5878 T: +1-614-764-6000 F: +1-614-764-6096
www.oclc.org/research
ISBN: 978-1-55653-157-6 DOI: 10.25333/wyrd-n586 RM-PR-216769-WWAE 2008
defoe-plague-1722 ---- HISTORY OF THE PLAGUE IN LONDON. It was about the beginning of September, 1664, that I, among the rest of my neighbors, heard in ordinary discourse that the plague was returned again in Holland; for it had been very violent there, and particularly at Amsterdam and Rotterdam, in the year 1663, whither, they say, it was brought (some said from Italy, others from the Levant) among some goods which were brought home by their Turkey fleet; others said it was brought from Candia; others, from Cyprus. It mattered not from whence it came; but all agreed it was come into Holland again.[4] We had no such thing as printed newspapers in those days, to spread rumors and reports of things, and to improve them by the invention of men, as I have lived to see practiced since. But such things as those were gathered from the letters of merchants and others who corresponded abroad, and from them was handed about by word of mouth only; so that things did not spread instantly over the whole nation, as they do now. But it seems that the government had a true account of it, and several counsels[5] were held about ways to prevent its coming over; but all was kept very private. Hence it was that this rumor died off again; and people began to forget it, as a thing we were very little concerned in and that we hoped was not true, till the latter end of November or the beginning of December, 1664, when two men, said to be Frenchmen, died of the plague in Longacre, or rather at the upper end of Drury Lane.[6] The family they were in endeavored to conceal it as much as possible; but, as it had gotten some vent in the discourse of the neighborhood, the secretaries of state[7] got knowledge of it. And concerning themselves to inquire about it, in order to be certain of the truth, two physicians and a surgeon were ordered to go to the house, and make inspection. 
This they did, and finding evident tokens[8] of the sickness upon both the bodies that were dead, they gave their opinions publicly that they died of the plague. Whereupon it was given in to the parish clerk,[9] and he also returned them[10] to the hall; and it was printed in the weekly bill of mortality in the usual manner, thus:--

PLAGUE, 2. PARISHES INFECTED, 1.

The people showed a great concern at this, and began to be alarmed all over the town, and the more because in the last week in December, 1664, another man died in the same house and of the same distemper. And then we were easy again for about six weeks, when, none having died with any marks of infection, it was said the distemper was gone; but after that, I think it was about the 12th of February, another died in another house, but in the same parish and in the same manner.

This turned the people's eyes pretty much towards that end of the town; and, the weekly bills showing an increase of burials in St. Giles's Parish more than usual, it began to be suspected that the plague was among the people at that end of the town, and that many had died of it, though they had taken care to keep it as much from the knowledge of the public as possible. This possessed the heads of the people very much; and few cared to go through Drury Lane, or the other streets suspected, unless they had extraordinary business that obliged them to it.

This increase of the bills stood thus: the usual number of burials in a week, in the parishes of St. Giles-in-the-Fields and St. Andrew's, Holborn,[11] were[12] from twelve to seventeen or nineteen each, few more or less; but, from the time that the plague first began in St. Giles's Parish, it was observed that the ordinary burials increased in number considerably. For example:--

Week                  St. Giles's   St. Andrew's
Dec. 27 to Jan. 3     16            17
Jan. 3 to Jan. 10     12            25
Jan. 10 to Jan. 17    18            18
Jan. 17 to Jan. 24    23            16
Jan. 24 to Jan. 31    24            15
Jan. 31 to Feb. 7     21            23
Feb. 7 to Feb. 14     24
Whereof one of the plague.

The like increase of the bills was observed in the parishes of St. Bride's, adjoining on one side of Holborn Parish, and in the parish of St. James's, Clerkenwell, adjoining on the other side of Holborn; in both which parishes the usual numbers that died weekly were from four to six or eight, whereas at that time they were increased as follows:--

Week                  St. Bride's   St. James's
Dec. 20 to Dec. 27    0             8
Dec. 27 to Jan. 3     6             9
Jan. 3 to Jan. 10     11            7
Jan. 10 to Jan. 17    12            9
Jan. 17 to Jan. 24    9             15
Jan. 24 to Jan. 31    8             12
Jan. 31 to Feb. 7     13            5
Feb. 7 to Feb. 14     12            6

Besides this, it was observed, with great uneasiness by the people, that the weekly bills in general increased very much during these weeks, although it was at a time of the year when usually the bills are very moderate. The usual number of burials within the bills of mortality for a week was from about two hundred and forty, or thereabouts, to three hundred. The last was esteemed a pretty high bill; but after this we found the bills successively increasing, as follows:--

Week                  Buried.   Increased.
Dec. 20 to Dec. 27    291       0
Dec. 27 to Jan. 3     349       58
Jan. 3 to Jan. 10     394       45
Jan. 10 to Jan. 17    415       21
Jan. 17 to Jan. 24    474       59

This last bill was really frightful, being a higher number than had been known to have been buried in one week since the preceding visitation of 1656.
However, all this went off again; and the weather proving cold, and the frost, which began in December, still continuing very severe, even till near the end of February, attended with sharp though moderate winds, the bills decreased again, and the city grew healthy; and everybody began to look upon the danger as good as over, only that still the burials in St. Giles's continued high. From the beginning of April, especially, they stood at twenty-five each week, till the week from the 18th to the 25th, when there was[13] buried in St. Giles's Parish thirty, whereof two of the plague, and eight of the spotted fever (which was looked upon as the same thing); likewise the number that died of the spotted fever in the whole increased, being eight the week before, and twelve the week above named. This alarmed us all again; and terrible apprehensions were among the people, especially the weather being now changed and growing warm, and the summer being at hand. However, the next week there seemed to be some hopes again: the bills were low; the number of the dead in all was but 388; there was none of the plague, and but four of the spotted fever. But the following week it returned again, and the distemper was spread into two or three other parishes, viz., St. Andrew's, Holborn, St. Clement's-Danes; and, to the great affliction of the city, one died within the walls, in the parish of St. Mary-Wool-Church, that is to say, in Bearbinder Lane, near Stocks Market: in all, there were nine of the plague, and six of the spotted fever. It was, however, upon inquiry, found that this Frenchman who died in Bearbinder Lane was one who, having lived in Longacre, near the infected houses, had removed for fear of the distemper, not knowing that he was already infected. This was the beginning of May, yet the weather was temperate, variable, and cool enough, and people had still some hopes. That which encouraged them was, that the city was healthy. 
The whole ninety-seven parishes buried but fifty-four, and we began to hope, that, as it was chiefly among the people at that end of the town, it might go no farther; and the rather, because the next week, which was from the 9th of May to the 16th, there died but three, of which not one within the whole city or liberties;[14] and St. Andrew's buried but fifteen, which was very low. It is true, St. Giles's buried two and thirty; but still, as there was but one of the plague, people began to be easy. The whole bill also was very low: for the week before, the bill was but three hundred and forty-seven; and the week above mentioned, but three hundred and forty-three. We continued in these hopes for a few days; but it was but for a few, for the people were no more to be deceived thus. They searched the houses, and found that the plague was really spread every way, and that many died of it every day; so that now all our extenuations[15] abated, and it was no more to be concealed. Nay, it quickly appeared that the infection had spread itself beyond all hopes of abatement; that in the parish of St. Giles's it was gotten into several streets, and several families lay all sick together; and accordingly, in the weekly bill for the next week, the thing began to show itself. There was indeed but fourteen set down of the plague, but this was all knavery and collusion; for St. Giles's Parish, they buried forty in all, whereof it was certain most of them died of the plague, though they were set down of other distempers. And though the number of all the burials were[16] not increased above thirty-two, and the whole bill being but three hundred and eighty-five, yet there was[17] fourteen of the spotted fever, as well as fourteen of the plague; and we took it for granted, upon the whole, that there were fifty died that week of the plague. The next bill was from the 23d of May to the 30th, when the number of the plague was seventeen; but the burials in St. 
Giles's were fifty-three, a frightful number, of whom they set down but nine of the plague. But on an examination more strictly by the justices of the peace, and at the lord mayor's[18] request, it was found there were twenty more who were really dead of the plague in that parish, but had been set down of the spotted fever, or other distempers, besides others concealed. But those were trifling things to what followed immediately after. For now the weather set in hot; and from the first week in June, the infection spread in a dreadful manner, and the bills rise[19] high; the articles of the fever, spotted fever, and teeth, began to swell: for all that could conceal their distempers did it to prevent their neighbors shunning and refusing to converse with them, and also to prevent authority shutting up their houses, which, though it was not yet practiced, yet was threatened; and people were extremely terrified at the thoughts of it. The second week in June, the parish of St. Giles's, where still the weight of the infection lay, buried one hundred and twenty, whereof, though the bills said but sixty-eight of the plague, everybody said there had been a hundred at least, calculating it from the usual number of funerals in that parish as above. Till this week the city continued free, there having never any died except that one Frenchman, who[20] I mentioned before, within the whole ninety-seven parishes. Now, there died four within the city,--one in Wood Street, one in Fenchurch Street, and two in Crooked Lane. Southwark was entirely free, having not one yet died on that side of the water. I lived without Aldgate, about midway between Aldgate Church and Whitechapel Bars, on the left hand, or north side, of the street; and as the distemper had not reached to that side of the city, our neighborhood continued very easy. 
But at the other end of the town their consternation was very great; and the richer sort of people, especially the nobility and gentry from the west part of the city, thronged out of town, with their families and servants, in an unusual manner. And this was more particularly seen in Whitechapel; that is to say, the Broad Street where I lived. Indeed, nothing was to be seen but wagons and carts, with goods, women, servants, children, etc.; coaches filled with people of the better sort, and horsemen attending them, and all hurrying away; then empty wagons and carts appeared, and spare horses with servants, who it was apparent were returning, or sent from the country to fetch more people; besides innumerable numbers of men on horseback, some alone, others with servants, and, generally speaking, all loaded with baggage, and fitted out for traveling, as any one might perceive by their appearance. This was a very terrible and melancholy thing to see, and as it was a sight which I could not but look on from morning to night (for indeed there was nothing else of moment to be seen), it filled me with very serious thoughts of the misery that was coming upon the city, and the unhappy condition of those that would be left in it. This hurry of the people was such for some weeks, that there was no getting at the lord mayor's door without exceeding difficulty; there was such pressing and crowding there to get passes and certificates of health for such as traveled abroad; for, without these, there was no being admitted to pass through the towns upon the road, or to lodge in any inn. Now, as there had none died in the city for all this time, my lord mayor gave certificates of health without any difficulty to all those who lived in the ninety-seven parishes, and to those within the liberties too, for a while. 
This hurry, I say, continued some weeks, that is to say, all the months of May and June; and the more because it was rumored that an order of the government was to be issued out, to place turnpikes[21] and barriers on the road to prevent people's traveling; and that the towns on the road would not suffer people from London to pass, for fear of bringing the infection along with them, though neither of these rumors had any foundation but in the imagination, especially at first. I now began to consider seriously with myself concerning my own case, and how I should dispose of myself; that is to say, whether I should resolve to stay in London, or shut up my house and flee, as many of my neighbors did. I have set this particular down so fully, because I know not but it may be of moment to those who come after me, if they come to be brought to the same distress and to the same manner of making their choice; and therefore I desire this account may pass with them rather for a direction to themselves to act by than a history of my actings, seeing it may not be of one farthing value to them to note what became of me. I had two important things before me: the one was the carrying on my business and shop, which was considerable, and in which was embarked all my effects in the world; and the other was the preservation of my life in so dismal a calamity as I saw apparently was coming upon the whole city, and which, however great it was, my fears perhaps, as well as other people's, represented to be much greater than it could be. The first consideration was of great moment to me. My trade was a saddler, and as my dealings were chiefly not by a shop or chance trade, but among the merchants trading to the English colonies in America, so my effects lay very much in the hands of such. 
I was a single man, it is true; but I had a family of servants, who[22] I kept at my business; had a house, shop, and warehouses filled with goods; and in short to leave them all as things in such a case must be left, that is to say, without any overseer or person fit to be trusted with them, had been to hazard the loss, not only of my trade, but of my goods, and indeed of all I had in the world. I had an elder brother at the same time in London, and not many years before come over from Portugal; and, advising with him, his answer was in the three words, the same that was given in another case[23] quite different, viz., "Master, save thyself." In a word, he was for my retiring into the country, as he resolved to do himself, with his family; telling me, what he had, it seems, heard abroad, that the best preparation for the plague was to run away from it. As to my argument of losing my trade, my goods, or debts, he quite confuted me: he told me the same thing which I argued for my staying, viz., that I would trust God with my safety and health was the strongest repulse[24] to my pretensions of losing my trade and my goods. "For," says he, "is it not as reasonable that you should trust God with the chance or risk of losing your trade, as that you should stay in so eminent a point of danger, and trust him with your life?" I could not argue that I was in any strait as to a place where to go, having several friends and relations in Northamptonshire, whence our family first came from; and particularly, I had an only sister in Lincolnshire, very willing to receive and entertain me. 
My brother, who had already sent his wife and two children into Bedfordshire, and resolved to follow them, pressed my going very earnestly; and I had once resolved to comply with his desires, but at that time could get no horse: for though it is true all the people did not go out of the city of London, yet I may venture to say, that in a manner all the horses did; for there was hardly a horse to be bought or hired in the whole city for some weeks. Once I resolved to travel on foot with one servant, and, as many did, lie at no inn, but carry a soldier's tent with us, and so lie in the fields, the weather being very warm, and no danger from taking cold. I say, as many did, because several did so at last, especially those who had been in the armies, in the war[25] which had not been many years past: and I must needs say, that, speaking of second causes, had most of the people that traveled done so, the plague had not been carried into so many country towns and houses as it was, to the great damage, and indeed to the ruin, of abundance of people. But then my servant who[26] I had intended to take down with me, deceived me, and being frighted at the increase of the distemper, and not knowing when I should go, he took other measures, and left me: so I was put off for that time. And, one way or other, I always found that to appoint to go away was always crossed by some accident or other, so as to disappoint and put it off again. And this brings in a story which otherwise might be thought a needless digression, viz., about these disappointments being from Heaven. It came very warmly into my mind one morning, as I was musing on this particular thing, that as nothing attended us without the direction or permission of Divine Power, so these disappointments must have something in them extraordinary, and I ought to consider whether it did not evidently point out, or intimate to me, that it was the will of Heaven I should not go. 
It immediately followed in my thoughts, that, if it really was from God that I should stay, he was able effectually to preserve me in the midst of all the death and danger that would surround me; and that if I attempted to secure myself by fleeing from my habitation, and acted contrary to these intimations, which I believed to be divine, it was a kind of flying from God, and that he could cause his justice to overtake me when and where he thought fit.[27] These thoughts quite turned my resolutions again; and when I came to discourse with my brother again, I told him that I inclined to stay and take my lot in that station in which God had placed me; and that it seemed to be made more especially my duty, on the account of what I have said. My brother, though a very religious man himself, laughed at all I had suggested about its being an intimation from Heaven, and told me several stories of such foolhardy people, as he called them, as I was; that I ought indeed to submit to it as a work of Heaven if I had been any way disabled by distempers or diseases, and that then, not being able to go, I ought to acquiesce in the direction of Him, who, having been my Maker, had an undisputed right of sovereignty in disposing of me; and that then there had been no difficulty to determine which was the call of his providence, and which was not; but that I should take it as an intimation from Heaven that I should not go out of town, only because I could not hire a horse to go, or my fellow was run away that was to attend me, was ridiculous, since at the same time I had my health and limbs, and other servants, and might with ease travel a day or two on foot, and, having a good certificate of being in perfect health, might either hire a horse, or take post on the road, as I thought fit. 
Then he proceeded to tell me of the mischievous consequences which attend the presumption of the Turks and Mohammedans in Asia, and in other places where he had been (for my brother, being a merchant, was a few years before, as I have already observed, returned from abroad, coming last from Lisbon); and how, presuming upon their professed predestinating[28] notions, and of every man's end being predetermined, and unalterably beforehand decreed, they would go unconcerned into infected places, and converse with infected persons, by which means they died at the rate of ten or fifteen thousand a week, whereas the Europeans, or Christian merchants, who kept themselves retired and reserved, generally escaped the contagion. Upon these arguments my brother changed my resolutions again, and I began to resolve to go, and accordingly made all things ready; for, in short, the infection increased round me, and the bills were risen to almost seven hundred a week, and my brother told me he would venture to stay no longer. I desired him to let me consider of it but till the next day, and I would resolve; and as I had already prepared everything as well as I could, as to my business and who[29] to intrust my affairs with, I had little to do but to resolve. I went home that evening greatly oppressed in my mind, irresolute, and not knowing what to do. I had set the evening wholly apart to consider seriously about it, and was all alone; for already people had, as it were by a general consent, taken up the custom of not going out of doors after sunset: the reasons I shall have occasion to say more of by and by. 
In the retirement of this evening I endeavored to resolve first what was my duty to do, and I stated the arguments with which my brother had pressed me to go into the country, and I set against them the strong impressions which I had on my mind for staying,--the visible call I seemed to have from the particular circumstance of my calling, and the care due from me for the preservation of my effects, which were, as I might say, my estate; also the intimations which I thought I had from Heaven, that to me signified a kind of direction to venture; and it occurred to me, that, if I had what I call a direction to stay, I ought to suppose it contained a promise of being preserved, if I obeyed. This lay close to me;[30] and my mind seemed more and more encouraged to stay than ever, and supported with a secret satisfaction that I should be kept.[31] Add to this, that turning over the Bible which lay before me, and while my thoughts were more than ordinary serious upon the question, I cried out, "Well, I know not what to do, Lord direct me!" and the like. And at that juncture I happened to stop turning over the book at the Ninety-first Psalm, and, casting my eye on the second verse, I read to the seventh verse exclusive, and after that included the tenth, as follows: "I will say of the Lord, He is my refuge and my fortress: my God; in him will I trust. Surely he shall deliver thee from the snare of the fowler, and from the noisome pestilence. He shall cover thee with his feathers, and under his wings shalt thou trust: his truth shall be thy shield and buckler. Thou shalt not be afraid for the terror by night; nor for the arrow that flieth by day; nor for the pestilence that walketh in darkness; nor for the destruction that wasteth at noonday. A thousand shall fall at thy side, and ten thousand at thy right hand; but it shall not come nigh thee. Only with thine eyes shalt thou behold and see the reward of the wicked. 
Because thou hast made the Lord, which is my refuge, even the Most High, thy habitation; there shall no evil befall thee, neither shall any plague come nigh thy dwelling," etc. I scarce need tell the reader that from that moment I resolved that I would stay in the town, and, casting myself entirely upon the goodness and protection of the Almighty, would not seek any other shelter whatever; and that as my times were in his hands,[32] he was as able to keep me in a time of the infection as in a time of health; and if he did not think fit to deliver me, still I was in his hands, and it was meet he should do with me as should seem good to him. With this resolution I went to bed; and I was further confirmed in it the next day by the woman being taken ill with whom I had intended to intrust my house and all my affairs. But I had a further obligation laid on me on the same side: for the next day I found myself very much out of order also; so that, if I would have gone away, I could not. And I continued ill three or four days, and this entirely determined my stay: so I took my leave of my brother, who went away to Dorking in Surrey,[33] and afterwards fetched around farther into Buckinghamshire or Bedfordshire, to a retreat he had found out there for his family. It was a very ill time to be sick in; for if any one complained, it was immediately said he had the plague; and though I had, indeed, no symptoms of that distemper, yet, being very ill both in my head and in my stomach, I was not without apprehension that I really was infected. But in about three days I grew better. The third night I rested well, sweated a little, and was much refreshed. The apprehensions of its being the infection went also quite away with my illness, and I went about my business as usual. These things, however, put off all my thoughts of going into the country; and my brother also being gone, I had no more debate either with him or with myself on that subject. 
It was now mid-July; and the plague, which had chiefly raged at the other end of the town, and, as I said before, in the parishes of St. Giles's, St. Andrew's, Holborn, and towards Westminster, began now to come eastward, towards the part where I lived. It was to be observed, indeed, that it did not come straight on towards us; for the city, that is to say within the walls, was indifferent healthy still. Nor was it got then very much over the water into Southwark; for though there died that week twelve hundred and sixty-eight of all distempers, whereof it might be supposed above nine hundred died of the plague, yet there was but twenty-eight in the whole city, within the walls, and but nineteen in Southwark, Lambeth Parish included; whereas in the parishes of St. Giles and St. Martin's-in-the-Fields alone, there died four hundred and twenty-one. But we perceived the infection kept chiefly in the outparishes, which being very populous and fuller also of poor, the distemper found more to prey upon than in the city, as I shall observe afterwards. We perceived, I say, the distemper to draw our way, viz., by the parishes of Clerkenwell, Cripplegate, Shoreditch, and Bishopsgate; which last two parishes joining to Aldgate, Whitechapel, and Stepney, the infection came at length to spread its utmost rage and violence in those parts, even when it abated at the western parishes where it began. It was very strange to observe that in this particular week (from the 4th to the 11th of July), when, as I have observed, there died near four hundred of the plague in the two parishes of St. Martin's and St. Giles-in-the-Fields[34] only, there died in the parish of Aldgate but four, in the parish of Whitechapel three, in the parish of Stepney but one. Likewise in the next week (from the 11th of July to the 18th), when the week's bill was seventeen hundred and sixty-one, yet there died no more of the plague, on the whole Southwark side of the water, than sixteen. 
But this face of things soon changed, and it began to thicken in Cripplegate Parish especially, and in Clerkenwell; so that by the second week in August, Cripplegate Parish alone buried eight hundred and eighty-six, and Clerkenwell one hundred and fifty-five. Of the first, eight hundred and fifty might well be reckoned to die of the plague; and of the last, the bill itself said one hundred and forty-five were of the plague. During the month of July, and while, as I have observed, our part of the town seemed to be spared in comparison of the west part, I went ordinarily about the streets as my business required, and particularly went generally once in a day, or in two days, into the city, to my brother's house, which he had given me charge of, and to see it was safe; and having the key in my pocket, I used to go into the house, and over most of the rooms, to see that all was well. For though it be something wonderful to tell that any should have hearts so hardened, in the midst of such a calamity, as to rob and steal, yet certain it is that all sorts of villainies, and even levities and debaucheries, were then practiced in the town as openly as ever: I will not say quite as frequently, because the number of people were[35] many ways lessened. But the city itself began now to be visited too, I mean within the walls. But the number of people there were[35] indeed extremely lessened by so great a multitude having been gone into the country; and even all this month of July they continued to flee, though not in such multitudes as formerly. In August, indeed, they fled in such a manner, that I began to think there would be really none but magistrates and servants left in the city. 
As they fled now out of the city, so I should observe that the court[36] removed early, viz., in the month of June, and went to Oxford, where it pleased God to preserve them; and the distemper did not, as I heard of, so much as touch them; for which I cannot say that I ever saw they showed any great token of thankfulness, and hardly anything of reformation, though they did not want being told that their crying vices might, without breach of charity, be said to have gone far in bringing that terrible judgment upon the whole nation. The face of London was now, indeed, strangely altered: I mean the whole mass of buildings, city, liberties, suburbs, Westminster, Southwark, and altogether; for as to the particular part called the city, or within the walls, that was not yet much infected. But in the whole, the face of things, I say, was much altered. Sorrow and sadness sat upon every face, and though some part were not yet overwhelmed, yet all looked deeply concerned; and as we saw it apparently coming on, so every one looked on himself and his family as in the utmost danger. Were it possible to represent those times exactly to those that did not see them, and give the reader due ideas of the horror that everywhere presented itself, it must make just impressions upon their minds, and fill them with surprise. London might well be said to be all in tears. The mourners did not go about the streets,[37] indeed; for nobody put on black, or made a formal dress of mourning for their nearest friends: but the voice of mourning was truly heard in the streets. The shrieks of women and children at the windows and doors of their houses, where their nearest relations were perhaps dying, or just dead, were so frequent to be heard as we passed the streets, that it was enough to pierce the stoutest heart in the world to hear them. 
Tears and lamentations were seen almost in every house, especially in the first part of the visitation; for towards the latter end, men's hearts were hardened, and death was so always before their eyes that they did not so much concern themselves for the loss of their friends, expecting that themselves should be summoned the next hour. Business led me out sometimes to the other end of the town, even when the sickness was chiefly there. And as the thing was new to me, as well as to everybody else, it was a most surprising thing to see those streets, which were usually so thronged, now grown desolate, and so few people to be seen in them, that if I had been a stranger, and at a loss for my way, I might sometimes have gone the length of a whole street, I mean of the by-streets, and see[38] nobody to direct me, except watchmen set at the doors of such houses as were shut up; of which I shall speak presently. One day, being at that part of the town on some special business, curiosity led me to observe things more than usually; and indeed I walked a great way where I had no business. I went up Holborn, and there the street was full of people; but they walked in the middle of the great street, neither on one side or[39] other, because, as I suppose, they would not mingle with anybody that came out of houses, or meet with smells and scents from houses, that might be infected. The inns of court were all shut up, nor were very many of the lawyers in the Temple,[40] or Lincoln's Inn, or Gray's Inn, to be seen there. Everybody was at peace, there was no occasion for lawyers; besides, it being in the time of the vacation too, they were generally gone into the country. Whole rows of houses in some places were shut close up, the inhabitants all fled, and only a watchman or two left. 
When I speak of rows of houses being shut up, I do not mean shut up by the magistrates, but that great numbers of persons followed the court, by the necessity of their employments, and other dependencies; and as others retired, really frighted with the distemper, it was a mere desolating of some of the streets. But the fright was not yet near so great in the city, abstractedly so called,[41] and particularly because, though they were at first in a most inexpressible consternation, yet, as I have observed that the distemper intermitted often at first, so they were, as it were, alarmed and unalarmed again, and this several times, till it began to be familiar to them; and that even when it appeared violent, yet seeing it did not presently spread into the city, or the east or south parts, the people began to take courage, and to be, as I may say, a little hardened. It is true, a vast many people fled, as I have observed; yet they were chiefly from the west end of the town, and from that we call the heart of the city, that is to say, among the wealthiest of the people, and such persons as were unincumbered with trades and business. But of the rest, the generality staid, and seemed to abide the worst; so that in the place we call the liberties, and in the suburbs, in Southwark, and in the east part, such as Wapping, Ratcliff, Stepney, Rotherhithe, and the like, the people generally staid, except here and there a few wealthy families, who, as above, did not depend upon their business. It must not be forgot here that the city and suburbs were prodigiously full of people at the time of this visitation, I mean at the time that it began. 
For though I have lived to see a further increase, and mighty throngs of people settling in London, more than ever; yet we had always a notion that numbers of people which--the wars being over, the armies disbanded, and the royal family and the monarchy being restored--had flocked to London to settle in business, or to depend upon and attend the court for rewards of services, preferments, and the like, was[42] such that the town was computed to have in it above a hundred thousand people more than ever it held before. Nay, some took upon them to say it had twice as many, because all the ruined families of the royal party flocked hither, all the soldiers set up trades here, and abundance of families settled here. Again: the court brought with it a great flux of pride and new fashions; all people were gay and luxurious, and the joy of the restoration had brought a vast many families to London.[43] But I must go back again to the beginning of this surprising time. While the fears of the people were young, they were increased strangely by several odd accidents, which put altogether, it was really a wonder the whole body of the people did not rise as one man, and abandon their dwellings, leaving the place as a space of ground designed by Heaven for an Aceldama,[44] doomed to be destroyed from the face of the earth, and that all that would be found in it would perish with it. I shall name but a few of these things; but sure they were so many, and so many wizards and cunning people propagating them, that I have often wondered there was any (women especially) left behind. In the first place, a blazing star or comet appeared for several months before the plague, as there did, the year after, another a little before the fire. 
The old women, and the phlegmatic hypochondriac[45] part of the other sex (whom I could almost call old women too), remarked, especially afterward, though not till both those judgments were over, that those two comets passed directly over the city, and that so very near the houses that it was plain they imported something peculiar to the city alone; that the comet before the pestilence was of a faint, dull, languid color, and its motion very heavy, solemn, and slow, but that the comet before the fire was bright and sparkling, or, as others said, flaming, and its motion swift and furious; and that, accordingly, one foretold a heavy judgment, slow but severe, terrible, and frightful, as was the plague, but the other foretold a stroke, sudden, swift, and fiery, as was the conflagration. Nay, so particular some people were, that, as they looked upon that comet preceding the fire, they fancied that they not only saw it pass swiftly and fiercely, and could perceive the motion with their eye, but even they heard it; that it made a rushing, mighty noise, fierce and terrible, though at a distance, and but just perceivable. I saw both these stars, and, I must confess, had had so much of the common notion of such things in my head, that I was apt to look upon them as the forerunners and warnings of God's judgments, and, especially when the plague had followed the first, I yet saw another of the like kind, I could not but say, God had not yet sufficiently scourged the city. The apprehensions of the people were likewise strangely increased by the error of the times, in which I think the people, from what principle I cannot imagine, were more addicted to prophecies, and astrological conjurations, dreams, and old wives' tales, than ever they were before or since.[46] Whether this unhappy temper was originally raised by the follies of some people who got money by it, that is to say, by printing predictions and prognostications, I know not. 
But certain it is, books frighted them terribly, such as "Lilly's Almanack,"[47] "Gadbury's Astrological Predictions," "Poor Robin's Almanack,"[48] and the like; also several pretended religious books,--one entitled "Come out of Her, my People, lest ye be Partaker of her Plagues;"[49] another called "Fair Warning;" another, "Britain's Remembrancer;" and many such,--all, or most part of which, foretold directly or covertly the ruin of the city. Nay, some were so enthusiastically bold as to run about the streets with their oral predictions, pretending they were sent to preach to the city; and one in particular, who, like Jonah[50] to Nineveh, cried in the streets, "Yet forty days, and London shall be destroyed." I will not be positive whether he said "yet forty days," or "yet a few days." Another ran about naked, except a pair of drawers about his waist, crying day and night, like a man that Josephus[51] mentions, who cried, "Woe to Jerusalem!" a little before the destruction of that city: so this poor naked creature cried, "Oh, the great and the dreadful God!" and said no more, but repeated those words continually, with a voice and countenance full of horror, a swift pace, and nobody could ever find him to stop, or rest, or take any sustenance, at least that ever I could hear of. I met this poor creature several times in the streets, and would have spoke to him, but he would not enter into speech with me, or any one else, but kept on his dismal cries continually. These things terrified the people to the last degree, and especially when two or three times, as I have mentioned already, they found one or two in the bills dead of the plague at St. Giles's. Next to these public things were the dreams of old women; or, I should say, the interpretation of old women upon other people's dreams; and these put abundance of people even out of their wits. 
Some heard voices warning them to be gone, for that there would be such a plague in London so that the living would not be able to bury the dead; others saw apparitions in the air: and I must be allowed to say of both, I hope without breach of charity, that they heard voices that never spake, and saw sights that never appeared. But the imagination of the people was really turned wayward and possessed; and no wonder if they who were poring continually at the clouds saw shapes and figures, representations and appearances, which had nothing in them but air and vapor. Here they told us they saw a flaming sword held in a hand, coming out of a cloud, with a point hanging directly over the city. There they saw hearses and coffins in the air carrying to be buried. And there again, heaps of dead bodies lying unburied and the like, just as the imagination of the poor terrified people furnished them with matter to work upon.
So hypochondriac fancies represent
Ships, armies, battles in the firmament;
Till steady eyes the exhalations solve,
And all to its first matter, cloud, resolve.
I could fill this account with the strange relations such people give every day of what they have seen; and every one was so positive of their having seen what they pretended to see, that there was no contradicting them, without breach of friendship, or being accounted rude and unmannerly on the one hand, and profane and impenetrable on the other. One time before the plague was begun, otherwise than as I have said in St. Giles's (I think it was in March), seeing a crowd of people in the street, I joined with them to satisfy my curiosity, and found them all staring up into the air to see what a woman told them appeared plain to her, which was an angel clothed in white, with a fiery sword in his hand, waving it or brandishing it over his head.
She described every part of the figure to the life, showed them the motion and the form, and the poor people came into it so eagerly and with so much readiness. "Yes, I see it all plainly," says one: "there's the sword as plain as can be." Another saw the angel; one saw his very face, and cried out what a glorious creature he was. One saw one thing, and one another. I looked as earnestly as the rest, but perhaps not with so much willingness to be imposed upon; and I said, indeed, that I could see nothing but a white cloud, bright on one side, by the shining of the sun upon the other part. The woman endeavored to show it me, but could not make me confess that I saw it; which, indeed, if I had, I must have lied. But the woman, turning to me, looked me in the face, and fancied I laughed, in which her imagination deceived her too, for I really did not laugh, but was seriously reflecting how the poor people were terrified by the force of their own imagination. However, she turned to me, called me profane fellow and a scoffer, told me that it was a time of God's anger, and dreadful judgments were approaching, and that despisers such as I should wander and perish. The people about her seemed disgusted as well as she, and I found there was no persuading them that I did not laugh at them, and that I should be rather mobbed by them than be able to undeceive them. So I left them, and this appearance passed for as real as the blazing star itself. Another encounter I had in the open day also; and this was in going through a narrow passage from Petty France[52] into Bishopsgate churchyard, by a row of almshouses. There are two churchyards to Bishopsgate Church or Parish. 
One we go over to pass from the place called Petty France into Bishopsgate Street, coming out just by the church door; the other is on the side of the narrow passage where the almshouses are on the left, and a dwarf wall with a palisade on it on the right hand, and the city wall on the other side more to the right. In this narrow passage stands a man looking through the palisades into the burying place, and as many people as the narrowness of the place would admit to stop without hindering the passage of others; and he was talking mighty eagerly to them, and pointing, now to one place, then to another, and affirming that he saw a ghost walking upon such a gravestone there. He described the shape, the posture, and the movement of it so exactly, that it was the greatest amazement to him in the world that everybody did not see it as well as he. On a sudden he would cry, "There it is! Now it comes this way!" then, "'Tis turned back!" till at length he persuaded the people into so firm a belief of it, that one fancied he saw it; and thus he came every day, making a strange hubbub, considering it was so narrow a passage, till Bishopsgate clock struck eleven; and then the ghost would seem to start, and, as if he were called away, disappeared on a sudden. I looked earnestly every way, and at the very moment that this man directed, but could not see the least appearance of anything. But so positive was this poor man that he gave them vapors[53] in abundance, and sent them away trembling and frightened, till at length few people that knew of it cared to go through that passage, and hardly anybody by night on any account whatever. This ghost, as the poor man affirmed, made signs to the houses and to the ground and to the people, plainly intimating (or else they so understanding it) that abundance of people should come to be buried in that churchyard, as indeed happened. 
But then he saw such aspects I must acknowledge I never believed, nor could I see anything of it myself, though I looked most earnestly to see it if possible. Some endeavors were used to suppress the printing of such books as terrified the people, and to frighten the dispersers of them, some of whom were taken up, but nothing done in it, as I am informed; the government being unwilling to exasperate the people, who were, as I may say, all out of their wits already. Neither can I acquit those ministers that in their sermons rather sunk than lifted up the hearts of their hearers. Many of them, I doubt not, did it for the strengthening the resolution of the people, and especially for quickening them to repentance; but it certainly answered not their end, at least not in proportion to the injury it did another way. One mischief always introduces another. These terrors and apprehensions of the people led them to a thousand weak, foolish, and wicked things, which they wanted not a sort of people really wicked to encourage them to; and this was running about to fortune tellers, cunning men,[54] and astrologers, to know their fortunes, or, as it is vulgarly expressed, to have their fortunes told them, their nativities[55] calculated, and the like. And this folly presently made the town swarm with a wicked generation of pretenders to magic, to the "black art," as they called it, and I know not what, nay, to a thousand worse dealings with the devil than they were really guilty of. And this trade grew so open and so generally practiced, that it became common to have signs and inscriptions set up at doors, "Here lives a fortune teller," "Here lives an astrologer," "Here you may have your nativity calculated," and the like; and Friar Bacon's brazen head,[56] which was the usual sign of these people's dwellings, was to be seen almost in every street, or else the sign of Mother Shipton,[57] or of Merlin's[58] head, and the like. 
With what blind, absurd, and ridiculous stuff these oracles of the devil pleased and satisfied the people, I really know not; but certain it is, that innumerable attendants crowded about their doors every day: and if but a grave fellow in a velvet jacket, a band,[59] and a black cloak, which was the habit those quack conjurers generally went in, was but seen in the streets, the people would follow them[60] in crowds, and ask them[60] questions as they went along. The case of poor servants was very dismal, as I shall have occasion to mention again by and by; for it was apparent a prodigious number of them would be turned away. And it was so, and of them abundance perished, and particularly those whom these false prophets flattered with hopes that they should be kept in their services, and carried with their masters and mistresses into the country; and had not public charity provided for these poor creatures, whose number was exceeding great (and in all cases of this nature must be so), they would have been in the worst condition of any people in the city. These things agitated the minds of the common people for many months while the first apprehensions were upon them, and while the plague was not, as I may say, yet broken out. But I must also not forget that the more serious part of the inhabitants behaved after another manner. The government encouraged their devotion, and appointed public prayers, and days of fasting and humiliation, to make public confession of sin, and implore the mercy of God to avert the dreadful judgment which hangs over their heads; and it is not to be expressed with what alacrity the people of all persuasions embraced the occasion, how they flocked to the churches and meetings, and they were all so thronged that there was often no coming near, even to the very doors of the largest churches. 
Also there were daily prayers appointed morning and evening at several churches, and days of private praying at other places, at all which the people attended, I say, with an uncommon devotion. Several private families, also, as well of one opinion as another, kept family fasts, to which they admitted their near relations only; so that, in a word, those people who were really serious and religious applied themselves in a truly Christian manner to the proper work of repentance and humiliation, as a Christian people ought to do. Again, the public showed that they would bear their share in these things. The very court, which was then gay and luxurious, put on a face of just concern for the public danger. All the plays and interludes[61] which, after the manner of the French court,[62] had been set up and began to increase among us, were forbid to act;[63] the gaming tables, public dancing rooms, and music houses, which multiplied and began to debauch the manners of the people, were shut up and suppressed; and the jack puddings,[64] merry-andrews,[64] puppet shows, ropedancers, and such like doings, which had bewitched the common people, shut their shops, finding indeed no trade, for the minds of the people were agitated with other things, and a kind of sadness and horror at these things sat upon the countenances even of the common people. Death was before their eyes, and everybody began to think of their graves, not of mirth and diversions. 
But even these wholesome reflections, which, rightly managed, would have most happily led the people to fall upon their knees, make confession of their sins, and look up to their merciful Savior for pardon, imploring his compassion on them in such a time of their distress, by which we might have been as a second Nineveh, had a quite contrary extreme in the common people, who, ignorant and stupid in their reflections as they were brutishly wicked and thoughtless before, were now led by their fright to extremes of folly, and, as I said before, that they ran to conjurers and witches and all sorts of deceivers, to know what should become of them, who fed their fears and kept them always alarmed and awake, on purpose to delude them and pick their pockets: so they were as mad upon their running after quacks and mountebanks, and every practicing old woman for medicines and remedies, storing themselves with such multitudes of pills, potions, and preservatives, as they were called, that they not only spent their money, but poisoned themselves beforehand, for fear of the poison of the infection, and prepared their bodies for the plague, instead of preserving them against it. 
On the other hand, it was incredible, and scarce to be imagined, how the posts of houses and corners of streets were plastered over with doctors' bills, and papers of ignorant fellows quacking and tampering in physic, and inviting people to come to them for remedies, which was generally set off with such flourishes as these; viz., "INFALLIBLE preventitive pills against the plague;" "NEVER-FAILING preservatives against the infection;" "SOVEREIGN cordials against the corruption of air;" "EXACT regulations for the conduct of the body in case of infection;" "Antipestilential pills;" "INCOMPARABLE drink against the plague, never found out before;" "An UNIVERSAL remedy for the plague;" "The ONLY TRUE plague water;" "The ROYAL ANTIDOTE against all kinds of infection;" and such a number more that I cannot reckon up, and, if I could, would fill a book of themselves to set them down. Others set up bills to summon people to their lodgings for direction and advice in the case of infection. These had specious titles also, such as these:--
An eminent High-Dutch physician, newly come over from Holland, where he resided during all the time of the great plague, last year, in Amsterdam, and cured multitudes of people that actually had the plague upon them.
An Italian gentlewoman just arrived from Naples, having a choice secret to prevent infection, which she found out by her great experience, and did wonderful cures with it in the late plague there, wherein there died 20,000 in one day.
An ancient gentlewoman having practiced with great success in the late plague in this city, anno 1636, gives her advice only to the female sex. To be spoken with, etc.
An experienced physician, who has long studied the doctrine of antidotes against all sorts of poison and infection, has, after forty years' practice, arrived at such skill as may, with God's blessing, direct persons how to prevent being touched by any contagious distemper whatsoever. He directs the poor gratis.
I take notice of these by way of specimen. I could give you two or three dozen of the like, and yet have abundance left behind. It is sufficient from these to apprise any one of the humor of those times, and how a set of thieves and pickpockets not only robbed and cheated the poor people of their money, but poisoned their bodies with odious and fatal preparations; some with mercury, and some with other things as bad, perfectly remote from the thing pretended to, and rather hurtful than serviceable to the body in case an infection followed. I cannot omit a subtlety of one of those quack operators with which he gulled the poor people to crowd about him, but did nothing for them without money. He had, it seems, added to his bills, which he gave out in the streets, this advertisement in capital letters; viz., "He gives advice to the poor for nothing." Abundance of people came to him accordingly, to whom he made a great many fine speeches, examined them of the state of their health and of the constitution of their bodies, and told them many good things to do, which were of no great moment. But the issue and conclusion of all was, that he had a preparation which, if they took such a quantity of every morning, he would pawn his life that they should never have the plague, no, though they lived in the house with people that were infected. This made the people all resolve to have it, but then the price of that was so much (I think it was half a crown[65]). "But, sir," says one poor woman, "I am a poor almswoman, and am kept by the parish; and your bills say you give the poor your help for nothing."--"Ay, good woman," says the doctor, "so I do, as I published there. I give my advice, but not my physic!"--"Alas, sir," says she, "that is a snare laid for the poor then, for you give them your advice for nothing; that is to say, you advise them gratis to buy your physic for their money: so does every shopkeeper with his wares." 
Here the woman began to give him ill words, and stood at his door all that day, telling her tale to all the people that came, till the doctor, finding she turned away his customers, was obliged to call her upstairs again and give her his box of physic for nothing, which perhaps, too, was good for nothing when she had it. But to return to the people, whose confusions fitted them to be imposed upon by all sorts of pretenders and by every mountebank. There is no doubt but these quacking sort of fellows raised great gains out of the miserable people; for we daily found the crowds that ran after them were infinitely greater, and their doors were more thronged, than those of Dr. Brooks, Dr. Upton, Dr. Hodges, Dr. Berwick, or any, though the most famous men of the time; and I was told that some of them got five pounds[66] a day by their physic. But there was still another madness beyond all this, which may serve to give an idea of the distracted humor of the poor people at that time, and this was their following a worse sort of deceivers than any of these; for these petty thieves only deluded them to pick their pockets and get their money (in which their wickedness, whatever it was, lay chiefly on the side of the deceiver's deceiving, not upon the deceived); but, in this part I am going to mention, it lay chiefly in the people deceived, or equally in both. 
And this was in wearing charms, philters,[67] exorcisms,[68] amulets,[69] and I know not what preparations to fortify the body against the plague, as if the plague was not the hand of God, but a kind of a possession of an evil spirit, and it was to be kept off with crossings,[70] signs of the zodiac,[71] papers tied up with so many knots, and certain words or figures written on them, as particularly the word "Abracadabra,"[72] formed in triangle or pyramid; thus,--

    A B R A C A D A B R A
     A B R A C A D A B R
      A B R A C A D A B
       A B R A C A D A
        A B R A C A D
         A B R A C A
          A B R A C
           A B R A
            A B R
             A B
              A

Others had the Jesuits' mark in a cross:--

    I H S[73]

Others had nothing but this mark; thus,--

    +

I might spend a great deal of my time in exclamations against the follies, and indeed the wickednesses of those things, in a time of such danger, in a matter of such consequence as this of a national infection; but my memorandums of these things relate rather to take notice of the fact, and mention only that it was so. How the poor people found the insufficiency of those things, and how many of them were afterwards carried away in the dead carts, and thrown into the common graves of every parish with these hellish charms and trumpery hanging about their necks, remains to be spoken of as we go along. All this was the effect of the hurry the people were in, after the first notion of the plague being at hand was among them, and which may be said to be from about Michaelmas,[74] 1664, but more particularly after the two men died in St. 
Giles's, in the beginning of December; and again after another alarm in February, for when the plague evidently spread itself, they soon began to see the folly of trusting to these unperforming creatures who had gulled them of their money; and then their fears worked another way, namely, to amazement and stupidity, not knowing what course to take or what to do, either to help or to relieve themselves; but they ran about from one neighbor's house to another, and even in the streets, from one door to another, with repeated cries of, "Lord, have mercy upon us! What shall we do?" I am supposing, now, the plague to have begun, as I have said, and that the magistrates began to take the condition of the people into their serious consideration. What they did as to the regulation of the inhabitants, and of infected families, I shall speak to[75] by itself; but as to the affair of health, it is proper to mention here my having seen the foolish humor of the people in running after quacks, mountebanks, wizards, and fortune tellers, which they did, as above, even to madness. The lord mayor, a very sober and religious gentleman, appointed physicians and surgeons for the relief of the poor, I mean the diseased poor, and in particular ordered the College of Physicians[76] to publish directions for cheap remedies for the poor in all the circumstances of the distemper. This, indeed, was one of the most charitable and judicious things that could be done at that time; for this drove the people from haunting the doors of every disperser of bills, and from taking down blindly and without consideration, poison for physic, and death instead of life. This direction of the physicians was done by a consultation of the whole college; and as it was particularly calculated for the use of the poor, and for cheap medicines, it was made public, so that everybody might see it, and copies were given gratis to all that desired it. 
But as it is public and to be seen on all occasions, I need not give the reader of this the trouble of it. It remains to be mentioned now what public measures were taken by the magistrates for the general safety and to prevent the spreading of the distemper when it broke out. I shall have frequent occasion to speak of the prudence of the magistrates, their charity, their vigilance for the poor and for preserving good order, furnishing provisions, and the like, when the plague was increased as it afterwards was. But I am now upon the order and regulations which they published for the government of infected families. I mentioned above shutting of houses up, and it is needful to say something particularly to that; for this part of the history of the plague is very melancholy. But the most grievous story must be told. About June, the lord mayor of London, and the court of aldermen, as I have said, began more particularly to concern themselves for the regulation of the city. The justices of the peace for Middlesex,[77] by direction of the secretary of state, had begun to shut up houses in the parishes of St. Giles-in-the-Fields, St. Martin's, St. Clement's-Danes, etc., and it was with good success; for in several streets where the plague broke out, upon strict guarding the houses that were infected, and taking care to bury those that died as soon as they were known to be dead, the plague ceased in those streets. It was also observed that the plague decreased sooner in those parishes after they had been visited to the full than it did in the parishes of Bishopsgate, Shoreditch, Aldgate, Whitechapel, Stepney, and others; the early care taken in that manner being a great means to the putting a check to it. This shutting up of the houses was a method first taken, as I understand, in the plague which happened in 1603, at the coming of King James I. 
to the crown; and the power of shutting people up in their own houses was granted by act of Parliament, entitled "An Act for the Charitable Relief and Ordering of Persons Infected with Plague." On which act of Parliament the lord mayor and aldermen of the city of London founded the order they made at this time, and which took place the 1st of July, 1665, when the numbers of infected within the city were but few; the last bill for the ninety-two parishes being but four, and some houses having been shut up in the city, and some people being removed to the pesthouse beyond Bunhill Fields, in the way to Islington. I say by these means, when there died near one thousand a week in the whole, the number in the city was but twenty-eight; and the city was preserved more healthy, in proportion, than any other place all the time of the infection. These orders of my lord mayor's were published, as I have said, the latter end of June, and took place from the 1st of July, and were as follow: viz.,-- ORDERS CONCEIVED AND PUBLISHED BY THE LORD MAYOR AND ALDERMEN OF THE CITY OF LONDON, CONCERNING THE INFECTION OF THE PLAGUE; 1665. Whereas in the reign of our late sovereign King James, of happy memory, an act was made for the charitable relief and ordering of persons infected with the plague; whereby authority was given to justices of the peace, mayors, bailiffs, and other head officers, to appoint within their several limits examiners, searchers, watchmen, keepers, and buriers, for the persons and places infected, and to minister unto them oaths for the performance of their offices; and the same statute did also authorize the giving of their directions as unto them for other present necessity should seem good in their discretions: it is now, upon special consideration, thought very expedient, for preventing and avoiding of infection of sickness (if it shall please Almighty God), that these officers following be appointed, and these orders hereafter duly observed. 
Examiners to be appointed to every Parish. First, it is thought requisite, and so ordered, that in every parish there be one, two, or more persons of good sort and credit chosen by the alderman, his deputy, and common council of every ward, by the name of examiners, to continue in that office for the space of two months at least: and if any fit person so appointed shall refuse to undertake the same, the said parties so refusing to be committed to prison until they shall conform themselves accordingly. The Examiner's Office. That these examiners be sworn by the aldermen to inquire and learn from time to time what houses in every parish be visited, and what persons be sick, and of what diseases, as near as they can inform themselves, and, upon doubt in that case, to command restraint of access until it appear what the disease shall prove; and if they find any person sick of the infection, to give order to the constable that the house be shut up; and, if the constable shall be found remiss and negligent, to give notice thereof to the alderman of the ward. Watchmen. That to every infected house there be appointed two watchmen,--one for every day, and the other for the night; and that these watchmen have a special care that no person go in or out of such infected houses whereof they have the charge, upon pain of severe punishment. And the said watchmen to do such further offices as the sick house shall need and require; and if the watchman be sent upon any business, to lock up the house and take the key with him; and the watchman by day to attend until ten o'clock at night, and the watchman by night until six in the morning. Searchers. 
That there be a special care to appoint women searchers in every parish, such as are of honest reputation and of the best sort as can be got in this kind; and these to be sworn to make due search and true report, to the utmost of their knowledge, whether the persons whose bodies they are appointed to search do die of the infection, or of what other diseases, as near as they can. And that the physicians who shall be appointed for the cure and prevention of the infection do call before them the said searchers, who are or shall be appointed for the several parishes under their respective cares, to the end they may consider whether they be fitly qualified for that employment, and charge them from time to time, as they shall see cause, if they appear defective in their duties. That no searcher during this time of visitation be permitted to use any public work or employment, or keep a shop or stall, or be employed as a laundress, or in any other common employment whatsoever. Chirurgeons.[78] For better assistance of the searchers, forasmuch as there has been heretofore great abuse in misreporting the disease, to the further spreading of the infection, it is therefore ordered that there be chosen and appointed able and discreet chirurgeons besides those that do already belong to the pesthouse, amongst whom the city and liberties to be quartered as they lie most apt and convenient; and every of these to have one quarter for his limit. And the said chirurgeons in every of their limits to join with the searchers for the view of the body, to the end there may be a true report made of the disease. And further: that the said chirurgeons shall visit and search such like persons as shall either send for them, or be named and directed unto them by the examiners of every parish, and inform themselves of the disease of the said parties. 
And forasmuch as the said chirurgeons are to be sequestered from all other cures,[79] and kept only to this disease of the infection, it is ordered that every of the said chirurgeons shall have twelvepence a body searched by them, to be paid out of the goods of the party searched, if he be able, or otherwise by the parish. Nurse Keepers. If any nurse keeper shall remove herself out of any infected house before twenty-eight days after the decease of any person dying of the infection, the house to which the said nurse keeper doth so remove herself shall be shut up until the said twenty-eight days shall be expired. ORDERS CONCERNING INFECTED HOUSES, AND PERSONS SICK OF THE PLAGUE. Notice to be given of the Sickness. The master of every house, as soon as any one in his house complaineth either of botch, or purple, or swelling in any part of his body, or falleth otherwise dangerously sick without apparent cause of some other disease, shall give notice thereof to the examiner of health, within two hours after the said sign shall appear. Sequestration of the Sick. As soon as any man shall be found by this examiner, chirurgeon, or searcher, to be sick of the plague, he shall the same night be sequestered in the same house; and in case he be so sequestered, then, though he die not, the house wherein he sickened shall be shut up for a month after the use of the due preservatives taken by the rest. Airing the Stuff. For sequestration of the goods and stuff of the infection, their bedding and apparel, and hangings of chambers, must be well aired with fire, and such perfumes as are requisite, within the infected house, before they be taken again to use. This to be done by the appointment of the examiner. Shutting up of the House. If any person shall visit any man known to be infected of the plague, or entereth willingly into any known infected house, being not allowed, the house wherein he inhabiteth shall be shut up for certain days by the examiner's direction. 
None to be removed out of Infected Houses, but, etc. Item, That none be removed out of the house where he falleth sick of the infection into any other house in the city (except it be to the pesthouse or a tent, or unto some such house which the owner of the said house holdeth in his own hands, and occupieth by his own servants), and so as security be given to the said parish whither such remove is made, that the attendance and charge about the said visited persons shall be observed and charged in all the particularities before expressed, without any cost of that parish to which any such remove shall happen to be made, and this remove to be done by night. And it shall be lawful to any person that hath two houses to remove either his sound or his infected people to his spare house at his choice, so as, if he send away first his sound, he do not after send thither the sick; nor again unto the sick, the sound; and that the same which he sendeth be for one week at the least shut up, and secluded from company, for the fear of some infection at first not appearing. Burial of the Dead. That the burial of the dead by this visitation be at most convenient hours, always before sunrising, or after sunsetting, with the privity[80] of the churchwardens, or constable, and not otherwise; and that no neighbors nor friends be suffered to accompany the corpse to church, or to enter the house visited, upon pain of having his house shut up, or be imprisoned. And that no corpse dying of the infection shall be buried, or remain in any church, in time of common prayer, sermon, or lecture. And that no children be suffered, at time of burial of any corpse, in any church, churchyard, or burying place, to come near the corpse, coffin, or grave; and that all graves shall be at least six feet deep. And further, all public assemblies at other burials are to be forborne during the continuance of this visitation. 
No Infected Stuff to be uttered.[81] That no clothes, stuff, bedding, or garments, be suffered to be carried or conveyed out of any infected houses, and that the criers and carriers abroad of bedding or old apparel to be sold or pawned be utterly prohibited and restrained, and no brokers of bedding or old apparel be permitted to make any public show, or hang forth on their stalls, shop boards, or windows towards any street, lane, common way, or passage, any old bedding or apparel to be sold, upon pain of imprisonment. And if any broker or other person shall buy any bedding, apparel, or other stuff out of any infected house, within two months after the infection hath been there, his house shall be shut up as infected, and so shall continue shut up twenty days at the least. No Person to be conveyed out of any Infected House. If any person visited[82] do fortune,[83] by negligent looking unto, or by any other means, to come or be conveyed from a place infected to any other place, the parish from whence such party hath come, or been conveyed, upon notice thereof given, shall, at their charge, cause the said party so visited and escaped to be carried and brought back again by night; and the parties in this case offending to be punished at the direction of the alderman of the ward, and the house of the receiver of such visited person to be shut up for twenty days. Every Visited House to be marked. That every house visited be marked with a red cross of a foot long, in the middle of the door, evident to be seen, and with these usual printed words, that is to say, "Lord have mercy upon us," to be set close over the same cross, there to continue until lawful opening of the same house. Every Visited House to be watched. That the constables see every house shut up, and to be attended with watchmen, which may keep in, and minister necessaries to them at their own charges, if they be able, or at the common charge if they be unable. 
The shutting up to be for the space of four weeks after all be whole. That precise order be taken that the searchers, chirurgeons, keepers, and buriers, are not to pass the streets without holding a red rod or wand of three foot in length in their hands, open and evident to be seen; and are not to go into any other house than into their own, or into that whereunto they are directed or sent for, but to forbear and abstain from company, especially when they have been lately used[84] in any such business or attendance. Inmates. That where several inmates are in one and the same house, and any person in that house happens to be infected, no other person or family of such house shall be suffered to remove him or themselves without a certificate from the examiners of the health of that parish; or, in default thereof, the house whither she or they remove shall be shut up as is in case of visitation. Hackney Coaches. That care be taken of hackney coachmen, that they may not, as some of them have been observed to do after carrying of infected persons to the pesthouse and other places, be admitted to common use till their coaches be well aired, and have stood unemployed by the space of five or six days after such service. ORDERS FOR CLEANSING AND KEEPING OF THE STREETS SWEPT. The Streets to be kept Clean. First, it is thought necessary, and so ordered, that every householder do cause the street to be daily prepared before his door, and so to keep it clean swept all the week long. That Rakers take it from out the Houses. That the sweeping and filth of houses be daily carried away by the rakers, and that the raker shall give notice of his coming by the blowing of a horn, as hitherto hath been done. Laystalls[85] to be made far off from the City. That the laystalls be removed as far as may be out of the city and common passages, and that no nightman or other be suffered to empty a vault into any vault or garden near about the city. 
Care to be had of Unwholesome Fish or Flesh, and of Musty Corn. That special care be taken that no stinking fish, or unwholesome flesh, or musty corn, or other corrupt fruits, of what sort soever, be suffered to be sold about the city or any part of the same. That the brewers and tippling-houses be looked unto for musty and unwholesome casks. That no hogs, dogs, or cats, or tame pigeons, or conies, be suffered to be kept within any part of the city, or any swine to be or stray in the streets or lanes, but that such swine be impounded by the beadle[86] or any other officer, and the owner punished according to the act of common council; and that the dogs be killed by the dog killers appointed for that purpose. ORDERS CONCERNING LOOSE PERSONS AND IDLE ASSEMBLIES. Beggars. Forasmuch as nothing is more complained of than the multitude of rogues and wandering beggars that swarm about in every place about the city, being a great cause of the spreading of the infection, and will not be avoided[87] notwithstanding any orders that have been given to the contrary: it is therefore now ordered that such constables, and others whom this matter may any way concern, take special care that no wandering beggars be suffered in the streets of this city, in any fashion or manner whatsoever, upon the penalty provided by law to be duly and severely executed upon them. Plays. That all plays, bear baitings,[88] games, singing of ballads, buckler play,[89] or such like causes of assemblies of people, be utterly prohibited, and the parties offending severely punished by every alderman in his ward. Feasting prohibited. That all public feasting, and particularly by the companies[90] of this city, and dinners in taverns, alehouses, and other places of public entertainment, be forborne till further order and allowance, and that the money thereby spared be preserved, and employed for the benefit and relief of the poor visited with the infection. Tippling-Houses. 
That disorderly tippling in taverns, alehouses, coffeehouses, and cellars, be severely looked unto as the common sin of the time, and greatest occasion of dispersing the plague. And that no company or person be suffered to remain or come into any tavern, alehouse, or coffeehouse, to drink, after nine of the clock in the evening, according to the ancient law and custom of this city, upon the penalties ordained by law. And for the better execution of these orders, and such other rules and directions as upon further consideration shall be found needful, it is ordered and enjoined that the aldermen, deputies, and common councilmen shall meet together weekly, once, twice, thrice, or oftener, as cause shall require, at some one general place accustomed in their respective wards, being clear from infection of the plague, to consult how the said orders may be put in execution, not intending that any dwelling in or near places infected shall come to the said meeting while their coming may be doubtful. And the said aldermen, deputies, and common councilmen, in their several wards, may put in execution any other orders that by them, at their said meetings, shall be conceived and devised for the preservation of his Majesty's subjects from the infection. Sir JOHN LAWRENCE, Lord Mayor. Sir GEORGE WATERMAN, } Sir CHARLES DOE, } Sheriffs. I need not say that these orders extended only to such places as were within the lord mayor's jurisdiction: so it is requisite to observe that the justices of peace within those parishes and places as were called the "hamlets" and "outparts" took the same method. As I remember, the orders for shutting up of houses did not take place so soon on our side, because, as I said before, the plague did not reach to this eastern part of the town at least, nor begin to be violent till the beginning of August. 
For example, the whole bill from the 11th to the 18th of July was 1,761, yet there died but 71 of the plague in all those parishes we call the Tower Hamlets; and they were as follows:--

                            July 11-18   The next week   To Aug. 1
    Aldgate                     14             34            65
    Stepney                     33             58            76
    Whitechapel                 21             48            79
    St. Kath. Tower[91]          2              4             4
    Trin. Minories[92]           1              1             4
                               ---            ---           ---
                                71            145           228

It was indeed coming on amain, for the burials that same week were, in the next adjoining parishes, thus, the next week prodigiously increased:--

                                July 11-18   The next week   To Aug. 1
    St. L.[93] Shoreditch           64             84           110
    St. Bot.[94] Bishopsg.          65            105           116
    St. Giles's Crippl.[95]        213            431           554
                                   ---            ---           ---
                                   342            620           780

This shutting up of houses was at first counted a very cruel and unchristian method, and the poor people so confined made bitter lamentations. Complaints of the severity of it were also daily brought to my lord mayor, of houses causelessly, and some maliciously, shut up. I cannot say but upon inquiry many that complained so loudly were found in a condition to be continued; and others again, inspection being made upon the sick person, and the sickness not appearing infectious, or, if uncertain, yet, on his being content to be carried to the pesthouse, was[96] released. As I went along Houndsditch one morning, about eight o'clock, there was a great noise. It is true, indeed, there was not much crowd, because the people were not very free to gather together, or to stay long together when they were there, nor did I stay long there; but the outcry was loud enough to prompt my curiosity, and I called to one, who looked out of a window, and asked what was the matter. A watchman, it seems, had been employed to keep his post at the door of a house which was infected, or said to be infected, and was shut up. He had been there all night, for two nights together, as he told his story, and the day watchman had been there one day, and was now come to relieve him. 
All this while no noise had been heard in the house, no light had been seen, they called for nothing, sent him of no errands (which used to be the chief business of the watchmen), neither had they given him any disturbance, as he said, from Monday afternoon, when he heard a great crying and screaming in the house, which, as he supposed, was occasioned by some of the family dying just at that time. It seems the night before, the "dead cart," as it was called, had been stopped there, and a servant maid had been brought down to the door dead; and the "buriers" or "bearers," as they were called, put her into the cart, wrapped only in a green rug, and carried her away. The watchman had knocked at the door, it seems, when he heard that noise and crying, as above, and nobody answered a great while; but at last one looked out and said with an angry, quick tone, and yet a kind of crying voice, or a voice of one that was crying, "What d'ye want, that you make such a knocking?" He answered, "I am the watchman. How do you do? What is the matter?" The person answered, "What is that to you? Stop the dead cart." This, it seems, was about one o'clock. Soon after, as the fellow said, he stopped the dead cart, and then knocked again, but nobody answered; he continued knocking, and the bellman called out several times, "Bring out your dead;" but nobody answered, till the man that drove the cart, being called to other houses, would stay no longer, and drove away. The watchman knew not what to make of all this, so he let them alone till the morning man, or "day watchman," as they called him, came to relieve him. Giving him an account of the particulars, they knocked at the door a great while, but nobody answered; and they observed that the window or casement at which the person looked out who had answered before, continued open, being up two pair of stairs. 
Upon this, the two men, to satisfy their curiosity, got a long ladder, and one of them went up to the window and looked into the room, where he saw a woman lying dead upon the floor, in a dismal manner, having no clothes on her but her shift.[97] But though he called aloud, and, putting in his long staff, knocked hard on the floor, yet nobody stirred or answered, neither could he hear any noise in the house. He came down again upon this, and acquainted his fellow, who went up also; and finding it just so, they resolved to acquaint either the lord mayor or some other magistrate of it, but did not offer to go in at the window. The magistrate, it seems, upon the information of the two men, ordered the house to be broke open, a constable and other persons being appointed to be present, that nothing might be plundered; and accordingly it was so done, when nobody was found in the house but that young woman, who having been infected, and past recovery, the rest had left her to die by herself, and every one gone, having found some way to delude the watchman, and to get open the door, or get out at some back door, or over the tops of the houses, so that he knew nothing of it. And as to those cries and shrieks which he heard, it was supposed they were the passionate cries of the family at this bitter parting, which, to be sure, it was to them all, this being the sister to the mistress of the family; the man of the house, his wife, several children and servants, being all gone and fled: whether sick or sound, that I could never learn, nor, indeed, did I make much inquiry after it. At another house, as I was informed, in the street next within Aldgate, a whole family was shut up and locked in because the maidservant was taken sick. 
The master of the house had complained by his friends to the next alderman, and to the lord mayor, and had consented to have the maid carried to the pesthouse, but was refused: so the door was marked with a red cross, a padlock on the outside, as above, and a watchman set to keep the door, according to public order. After the master of the house found there was no remedy, but that he, his wife, and his children, were locked up with this poor distempered servant, he called to the watchman, and told him he must go then and fetch a nurse for them to attend this poor girl, for that it would be certain death to them all to oblige them to nurse her, and told him plainly that if he would not do this the maid would perish either[98] of the distemper, or be starved for want of food, for he was resolved none of his family should go near her; and she lay in the garret, four story high, where she could not cry out or call to anybody for help. The watchman consented to that, and went and fetched a nurse as he was appointed, and brought her to them the same evening. During this interval, the master of the house took his opportunity to break a large hole through his shop into a bulk or stall, where formerly a cobbler had sat before or under his shop window; but the tenant, as may be supposed, at such a dismal time as that, was dead or removed, and so he had the key in his own keeping. 
Having[99] made his way into this stall, which he could not have done if the man had been at the door, the noise he was obliged to make being such as would have alarmed the watchman,--I say, having made his way into this stall, he sat still till the watchman returned with the nurse, and all the next day also; but the night following, having contrived to send the watchman of another trifling errand (which, as I take it, was to an apothecary's for a plaster for the maid, which he was to stay for the making up, or some other such errand that might secure his staying some time), in that time he conveyed himself and all his family out of the house, and left the nurse and the watchman to bury the poor wench, that is, throw her into the cart, and take care of the house. Not far from the same place they blowed up a watchman with gunpowder, and burned the poor fellow dreadfully; and while he made hideous cries, and nobody would venture to come near to help him, the whole family that were able to stir got out at the windows (one story high), two that were left sick calling out for help. Care was taken to give them nurses to look after them; but the persons fled were never found till, after the plague was abated, they returned. But as nothing could be proved, so nothing could be done to them. In other cases, some had gardens and walls, or pales,[100] between them and their neighbors, or yards and backhouses; and these, by friendship and entreaties, would get leave to get over those walls or pales, and so go out at their neighbors' doors, or, by giving money to their servants, get them to let them through in the night. So that, in short, the shutting up of houses was in no wise to be depended upon; neither did it answer the end at all, serving more to make the people desperate, and drive them to such extremities as that they would break out at all adventures. 
And that which was still worse, those that did thus break out spread the infection farther, by their wandering about with the distemper upon them in their desperate circumstances, than they would otherwise have done; for whoever considers all the particulars in such cases must acknowledge, and cannot doubt, but the severity of those confinements made many people desperate, and made them run out of their houses at all hazards, and with the plague visibly upon them, not knowing either whither to go, or what to do, or indeed what they did. And many that did so were driven to dreadful exigencies and extremities, and perished in the streets or fields for mere want, or dropped down by[101] the raging violence of the fever upon them. Others wandered into the country, and went forward any way, as their desperation guided them, not knowing whither they went or would go, till, faint and tired, and not getting any relief, the houses and villages on the road refusing to admit them to lodge, whether infected or no, they have perished by the roadside, or gotten into barns, and died there, none daring to come to them or relieve them, though perhaps not infected, for nobody would believe them. On the other hand, when the plague at first seized a family, that is to say, when any one body of the family had gone out, and unwarily or otherwise catched[102] the distemper and brought it home, it was certainly known by the family before it was known to the officers, who, as you will see by the order, were appointed to examine into the circumstances of all sick persons, when they heard of their being sick. In this interval, between their being taken sick and the examiners coming, the master of the house had leisure and liberty to remove himself, or all his family, if he knew whither to go; and many did so. 
But the great disaster was, that many did thus after they were really infected themselves, and so carried the disease into the houses of those who were so hospitable as to receive them; which, it must be confessed, was very cruel and ungrateful. I am speaking now of people made desperate by the apprehensions of their being shut up, and their breaking out by stratagem or force, either before or after they were shut up, whose misery was not lessened when they were out, but sadly increased. On the other hand, many who thus got away had retreats to go to, and other houses, where they locked themselves up, and kept hid till the plague was over; and many families, foreseeing the approach of the distemper, laid up stores of provisions sufficient for their whole families, and shut themselves up, and that so entirely, that they were neither seen or heard of till the infection was quite ceased, and then came abroad sound and well. I might recollect several such as these, and give you the particulars of their management; for doubtless it was the most effectual secure step that could be taken for such whose circumstances would not admit them to remove, or who had not retreats abroad proper for the case; for, in being thus shut up, they were as if they had been a hundred miles off. Nor do I remember that any one of those families miscarried.[103] Among these, several Dutch merchants were particularly remarkable, who kept their houses like little garrisons besieged, suffering none to go in or out, or come near them; particularly one in a court in Throckmorton Street, whose house looked into Drapers' Garden. But I come back to the case of families infected, and shut up by the magistrates. 
The misery of those families is not to be expressed; and it was generally in such houses that we heard the most dismal shrieks and outcries of the poor people, terrified, and even frightened to death, by the sight of the condition of their dearest relations, and by the terror of being imprisoned as they were. I remember, and while I am writing this story I think I hear the very sound of it: a certain lady had an only daughter, a young maiden about nineteen years old, and who was possessed of a very considerable fortune. They were only lodgers in the house where they were. The young woman, her mother, and the maid had been abroad on some occasion, I do not remember what, for the house was not shut up; but about two hours after they came home, the young lady complained she was not well; in a quarter of an hour more she vomited, and had a violent pain in her head. "Pray God," says her mother, in a terrible fright, "my child has not the distemper!" The pain in her head increasing, her mother ordered the bed to be warmed, and resolved to put her to bed, and prepared to give her things to sweat, which was the ordinary remedy to be taken when the first apprehensions of the distemper began. While the bed was airing, the mother undressed the young woman, and just as she was laid down in the bed, she, looking upon her body with a candle, immediately discovered the fatal tokens on the inside of her thighs. Her mother, not being able to contain herself, threw down her candle, and screeched out in such a frightful manner, that it was enough to place horror upon the stoutest heart in the world. 
Nor was it one scream, or one cry, but, the fright having seized her spirits, she fainted first, then recovered, then ran all over the house (up the stairs and down the stairs) like one distracted, and indeed really was distracted, and continued screeching and crying out for several hours, void of all sense, or at least government of her senses, and, as I was told, never came thoroughly to herself again. As to the young maiden, she was a dead corpse from that moment: for the gangrene, which occasions the spots, had spread over her whole body, and she died in less than two hours. But still the mother continued crying out, not knowing anything more of her child, several hours after she was dead. It is so long ago that I am not certain, but I think the mother never recovered, but died in two or three weeks after. I have by me a story of two brothers and their kinsman, who, being single men, but that had staid[104] in the city too long to get away, and, indeed, not knowing where to go to have any retreat, nor having wherewith to travel far, took a course for their own preservation, which, though in itself at first desperate, yet was so natural that it may be wondered that no more did so at that time. They were but of mean condition, and yet not so very poor as that they could not furnish themselves with some little conveniences, such as might serve to keep life and soul together; and finding the distemper increasing in a terrible manner, they resolved to shift as well as they could, and to be gone. One of them had been a soldier in the late wars,[105] and before that in the Low Countries;[106] and having been bred to no particular employment but his arms, and besides, being wounded, and not able to work very hard, had for some time been employed at a baker's of sea biscuit, in Wapping. 
The brother of this man was a seaman too, but somehow or other had been hurt of[107] one leg, that he could not go to sea, but had worked for his living at a sailmaker's in Wapping or thereabouts, and, being a good husband,[108] had laid up some money, and was the richest of the three. The third man was a joiner or carpenter by trade, a handy fellow, and he had no wealth but his box or basket of tools, with the help of which he could at any time get his living (such a time as this excepted) wherever he went; and he lived near Shadwell. They all lived in Stepney Parish, which, as I have said, being the last that was infected, or at least violently, they staid there till they evidently saw the plague was abating at the west part of the town, and coming towards the east, where they lived. The story of those three men, if the reader will be content to have me give it in their own persons, without taking upon me to either vouch the particulars or answer for any mistakes, I shall give as distinctly as I can, believing the history will be a very good pattern for any poor man to follow in case the like public desolation should happen here. And if there may be no such occasion, (which God of his infinite mercy grant us!) still the story may have its uses so many ways as that it will, I hope, never be said that the relating has been unprofitable. I say all this previous to the history, having yet, for the present, much more to say before I quit my own part. I went all the first part of the time freely about the streets, though not so freely as to run myself into apparent danger, except when they dug the great pit in the churchyard of our parish of Aldgate. A terrible pit it was, and I could not resist my curiosity to go and see it. As near as I may judge, it was about forty feet in length, and about fifteen or sixteen feet broad, and at the time I first looked at it about nine feet deep. 
But it was said they dug it near twenty feet deep afterwards, in one part of it, till they could go no deeper for the water; for they had, it seems, dug several large pits before this; for, though the plague was long a-coming[109] to our parish, yet, when it did come, there was no parish in or about London where it raged with such violence as in the two parishes of Aldgate and Whitechapel. I say they had dug several pits in another ground when the distemper began to spread in our parish, and especially when the dead carts began to go about, which was not in our parish till the beginning of August. Into these pits they had put perhaps fifty or sixty bodies each; then they made larger holes, wherein they buried all that the cart brought in a week, which, by the middle to the end of August, came to from two hundred to four hundred a week. And they could not well dig them larger, because of the order of the magistrates, confining them to leave no bodies within six feet of the surface; and the water coming on at about seventeen or eighteen feet, they could not well, I say, put more in one pit. But now, at the beginning of September, the plague raging in a dreadful manner, and the number of burials in our parish increasing to more than was[110] ever buried in any parish about London of no larger extent, they ordered this dreadful gulf to be dug, for such it was rather than a pit. They had supposed this pit would have supplied them for a month or more when they dug it; and some blamed the churchwardens for suffering such a frightful thing, telling them they were making preparations to bury the whole parish, and the like. 
But time made it appear, the churchwardens knew the condition of the parish better than they did: for, the pit being finished the 4th of September, I think they began to bury in it the 6th, and by the 20th, which was just two weeks, they had thrown into it eleven hundred and fourteen bodies, when they were obliged to fill it up, the bodies being then come to lie within six feet of the surface. I doubt not but there may be some ancient persons alive in the parish who can justify the fact of this, and are able to show even in what place of the churchyard the pit lay, better than I can: the mark of it also was many years to be seen in the churchyard on the surface, lying in length, parallel with the passage which goes by the west wall of the churchyard out of Houndsditch, and turns east again into Whitechapel, coming out near the Three Nuns Inn. It was about the 10th of September that my curiosity led, or rather drove, me to go and see this pit again, when there had been near four hundred people buried in it. And I was not content to see it in the daytime, as I had done before,--for then there would have been nothing to have been seen but the loose earth, for all the bodies that were thrown in were immediately covered with earth by those they called the "buriers," which at other times were called "bearers,"--but I resolved to go in the night, and see some of them thrown in. There was a strict order to prevent people coming to those pits, and that was only to prevent infection. But after some time that order was more necessary; for people that were infected and near their end, and delirious also, would run to those pits wrapped in blankets, or rugs, and throw themselves in, and, as they said, "bury themselves." 
I cannot say that the officers suffered any willingly to lie there; but I have heard that in a great pit in Finsbury, in the parish of Cripplegate (it lying open then to the fields, for it was not then walled about), many came and threw themselves in, and expired there, before they threw any earth upon them; and that when they came to bury others, and found them there, they were quite dead, though not cold. This may serve a little to describe the dreadful condition of that day, though it is impossible to say anything that is able to give a true idea of it to those who did not see it, other than this: that it was indeed very, very, very dreadful, and such as no tongue can express. I got admittance into the churchyard by being acquainted with the sexton who attended, who, though he did not refuse me at all, yet earnestly persuaded me not to go, telling me very seriously (for he was a good, religious, and sensible man) that it was indeed their business and duty to venture, and to run all hazards, and that in it they might hope to be preserved; but that I had no apparent call to it but my own curiosity, which, he said, he believed I would not pretend was sufficient to justify my running that hazard. I told him I had been pressed in my mind to go, and that perhaps it might be an instructing sight that might not be without its uses. "Nay," says the good man, "if you will venture upon that score, 'name of God,[111] go in; for, depend upon it, it will be a sermon to you, it may be, the best that ever you heard in your life. It is a speaking sight," says he, "and has a voice with it, and a loud one, to call us all to repentance;" and with that he opened the door, and said, "Go, if you will." 
His discourse had shocked my resolution a little, and I stood wavering for a good while; but just at that interval I saw two links[112] come over from the end of the Minories, and heard the bellman, and then appeared a "dead cart," as they called it, coming over the streets: so I could no longer resist my desire of seeing it, and went in. There was nobody, as I could perceive at first, in the churchyard, or going into it, but the buriers, and the fellow that drove the cart, or rather led the horse and cart; but when they came up to the pit, they saw a man go to and again,[113] muffled up in a brown cloak, and making motions with his hands, under his cloak, as if he was[114] in great agony. And the buriers immediately gathered about him, supposing he was one of those poor delirious or desperate creatures that used to pretend, as I have said, to bury themselves. He said nothing as he walked about, but two or three times groaned very deeply and loud, and sighed as[115] he would break his heart. When the buriers came up to him, they soon found he was neither a person infected and desperate, as I have observed above, or a person distempered in mind, but one oppressed with a dreadful weight of grief indeed, having his wife and several of his children all in the cart that was just come in with him; and he followed in an agony and excess of sorrow. He mourned heartily, as it was easy to see, but with a kind of masculine grief, that could not give itself vent by tears, and, calmly desiring the buriers to let him alone, said he would only see the bodies thrown in, and go away. So they left importuning him; but no sooner was the cart turned round, and the bodies shot into the pit promiscuously,--which was a surprise to him, for he at least expected they would have been decently laid in, though, indeed, he was afterwards convinced that was impracticable,--I say, no sooner did he see the sight, but he cried out aloud, unable to contain himself. 
I could not hear what he said, but he went backward two or three steps, and fell down in a swoon. The buriers ran to him and took him up, and in a little while he came to himself, and they led him away to the Pye[116] Tavern, over against the end of Houndsditch, where, it seems, the man was known, and where they took care of him. He looked into the pit again as he went away; but the buriers had covered the bodies so immediately with throwing in earth, that, though there was light enough (for there were lanterns,[117] and candles in them, placed all night round the sides of the pit upon the heaps of earth, seven or eight, or perhaps more), yet nothing could be seen. This was a mournful scene indeed, and affected me almost as much as the rest. But the other was awful, and full of terror: the cart had in it sixteen or seventeen bodies; some were wrapped up in linen sheets, some in rugs, some little other than naked, or so loose that what covering they had fell from them in the shooting out of the cart, and they fell quite naked among the rest; but the matter was not much to them, or the indecency much to any one else, seeing they were all dead, and were to be huddled together into the common grave of mankind, as we may call it; for here was no difference made, but poor and rich went together. There was no other way of burials, neither was it possible there should,[118] for coffins were not to be had for the prodigious numbers that fell in such a calamity as this. 
It was reported, by way of scandal upon the buriers, that if any corpse was delivered to them decently wound up, as we called it then, in a winding sheet tied over the head and feet (which some did, and which was generally of good linen),--I say, it was reported that the buriers were so wicked as to strip them in the cart, and carry them quite naked to the ground; but as I cannot credit anything so vile among Christians, and at a time so filled with terrors as that was, I can only relate it, and leave it undetermined. Innumerable stories also went about of the cruel behavior and practice of nurses who attended the sick, and of their hastening on the fate of those they attended in their sickness. But I shall say more of this in its place. I was indeed shocked with this sight, it almost overwhelmed me; and I went away with my heart most afflicted, and full of afflicting thoughts such as I cannot describe. Just at my going out of the church, and turning up the street towards my own house, I saw another cart, with links, and a bellman going before, coming out of Harrow Alley, in the Butcher Row, on the other side of the way; and being, as I perceived, very full of dead bodies, it went directly over the street, also, towards the church. I stood a while, but I had no stomach[119] to go back again to see the same dismal scene over again: so I went directly home, where I could not but consider with thankfulness the risk I had run, believing I had gotten no injury, as indeed I had not. Here the poor unhappy gentleman's grief came into my head again, and indeed I could not but shed tears in the reflection upon it, perhaps more than he did himself; but his case lay so heavy upon my mind, that I could not prevail with myself but that I must go out again into the street, and go to the Pye Tavern, resolving to inquire what became of him. It was by this time one o'clock in the morning, and yet the poor gentleman was there. 
The truth was, the people of the house, knowing him, had entertained him, and kept him there all the night, notwithstanding the danger of being infected by him, though it appeared the man was perfectly sound himself. It is with regret that I take notice of this tavern. The people were civil, mannerly, and an obliging sort of folks enough, and had till this time kept their house open, and their trade going on, though not so very publicly as formerly. But there was a dreadful set of fellows that used their house, and who, in the middle of all this horror, met there every night, behaving with all the reveling and roaring extravagances as is usual for such people to do at other times, and indeed to such an offensive degree that the very master and mistress of the house grew first ashamed, and then terrified, at them. They sat generally in a room next the street; and as they always kept late hours, so when the dead cart came across the street end to go into Houndsditch, which was in view of the tavern windows, they would frequently open the windows as soon as they heard the bell, and look out at them; and as they might often hear sad lamentations of people in the streets, or at their windows, as the carts went along, they would make their impudent mocks and jeers at them, especially if they heard the poor people call upon God to have mercy upon them, as many would do at those times, in their ordinary passing along the streets. 
These gentlemen, being something disturbed with the clutter of bringing the poor gentleman into the house, as above, were first angry and very high with the master of the house for suffering such a fellow, as they called him, to be brought out of the grave into their house; but being answered that the man was a neighbor, and that he was sound, but overwhelmed with the calamity of his family, and the like, they turned their anger into ridiculing the man and his sorrow for his wife and children, taunting him with want of courage to leap into the great pit, and go to heaven, as they jeeringly expressed it, along with them; adding some very profane and even blasphemous expressions. They were at this vile work when I came back to the house; and as far as I could see, though the man sat still, mute and disconsolate, and their affronts could not divert his sorrow, yet he was both grieved and offended at their discourse. Upon this, I gently reproved them, being well enough acquainted with their characters, and not unknown in person to two of them. They immediately fell upon me with ill language and oaths, asked me what I did out of my grave at such a time, when so many honester men were carried into the churchyard, and why I was not at home saying my prayers, against[120] the dead cart came for me, and the like. I was indeed astonished at the impudence of the men, though not at all discomposed at their treatment of me: however, I kept my temper. 
I told them that though I defied them, or any man in the world, to tax me with any dishonesty, yet I acknowledged, that, in this terrible judgment of God, many better than I were swept away, and carried to their grave; but, to answer their question directly, the case was, that I was mercifully preserved by that great God whose name they had blasphemed and taken in vain by cursing and swearing in a dreadful manner; and that I believed I was preserved in particular, among other ends of his goodness, that I might reprove them for their audacious boldness in behaving in such a manner, and in such an awful time as this was, especially for their jeering and mocking at an honest gentleman and a neighbor, for some of them knew him, who they saw was overwhelmed with sorrow for the breaches which it had pleased God to make upon his family. I cannot call exactly to mind the hellish, abominable raillery which was the return they made to that talk of mine, being provoked, it seems, that I was not at all afraid to be free with them; nor, if I could remember, would I fill my account with any of the words, the horrid oaths, curses, and vile expressions such as, at that time of the day, even the worst and ordinariest people in the street would not use: for, except such hardened creatures as these, the most wicked wretches that could be found had at that time some terror upon their mind of the hand of that Power which could thus in a moment destroy them. But that which was the worst in all their devilish language was, that they were not afraid to blaspheme God and talk atheistically, making a jest at my calling the plague the hand of God, mocking, and even laughing at the word "judgment," as if the providence of God had no concern in the inflicting such a desolating stroke; and that the people calling upon God, as they saw the carts carrying away the dead bodies, was all enthusiastic, absurd, and impertinent. 
I made them some reply, such as I thought proper, but which I found was so far from putting a check to their horrid way of speaking, that it made them rail the more: so that I confess it filled me with horror and a kind of rage; and I came away, as I told them, lest the hand of that Judgment which had visited the whole city should glorify his vengeance upon them and all that were near them. They received all reproof with the utmost contempt, and made the greatest mockery that was possible for them to do at me, giving me all the opprobrious insolent scoffs that they could think of for preaching to them, as they called it, which, indeed, grieved me rather than angered me; and I went away, blessing God, however, in my mind, that I had not spared them, though they had insulted me so much. They continued this wretched course three or four days after this, continually mocking and jeering at all that showed themselves religious or serious, or that were any way touched with the sense of the terrible judgment of God upon us; and I was informed they flouted in the same manner at the good people, who, notwithstanding the contagion, met at the church, fasted, and prayed to God to remove his hand from them. I say they continued this dreadful course three or four days (I think it was no more), when one of them, particularly he who asked the poor gentleman what he did out of his grave, was struck from Heaven with the plague, and died in a most deplorable manner; and, in a word, they were every one of them carried into the great pit, which I have mentioned above, before it was quite filled up, which was not above a fortnight or thereabout. 
These men were guilty of many extravagances, such as one would think human nature should have trembled at the thoughts of, at such a time of general terror as was then upon us, and particularly scoffing and mocking at everything which they happened to see that was religious among the people, especially at their thronging zealously to the place of public worship, to implore mercy from Heaven in such a time of distress; and this tavern where they held their club, being within view of the church door, they had the more particular occasion for their atheistical, profane mirth. But this began to abate a little with them before the accident, which I have related, happened; for the infection increased so violently at this part of the town now, that people began to be afraid to come to the church: at least such numbers did not resort thither as was usual. Many of the clergymen, likewise, were dead, and others gone into the country; for it really required a steady courage and a strong faith, for a man not only to venture being in town at such a time as this, but likewise to venture to come to church, and perform the office of a minister to a congregation of whom he had reason to believe many of them were actually infected with the plague, and to do this every day, or twice a day, as in some places was done. 
It seems they had been checked, for their open insulting religion in this manner, by several good people of every persuasion; and that[121] and the violent raging of the infection, I suppose, was the occasion that they had abated much of their rudeness for some time before, and were only roused by the spirit of ribaldry and atheism at the clamor which was made when the gentleman was first brought in there, and perhaps were agitated by the same devil when I took upon me to reprove them; though I did it at first with all the calmness, temper, and good manners that I could, which, for a while, they insulted me the more for, thinking it had been in fear of their resentment, though afterwards they found the contrary.[122] These things lay upon my mind, and I went home very much grieved and oppressed with the horror of these men's wickedness, and to think that anything could be so vile, so hardened, and so notoriously wicked, as to insult God, and his servants and his worship, in such a manner, and at such a time as this was, when he had, as it were, his sword drawn in his hand, on purpose to take vengeance, not on them only, but on the whole nation. I had indeed been in some passion at first with them, though it was really raised, not by any affront they had offered me personally, but by the horror their blaspheming tongues filled me with. However, I was doubtful in my thoughts whether the resentment I retained was not all upon my own private account; for they had given me a great deal of ill language too, I mean personally: but after some pause, and having a weight of grief upon my mind, I retired myself as soon as I came home (for I slept not that night), and, giving God most humble thanks for my preservation in the imminent danger I had been in, I set my mind seriously and with the utmost earnestness to pray for those desperate wretches, that God would pardon them, open their eyes, and effectually humble them. 
By this I not only did my duty, namely, to pray for those who despitefully used me, but I fully tried my own heart, to my full satisfaction that it was not filled with any spirit of resentment as they had offended me in particular; and I humbly recommend the method to all those that would know, or be certain, how to distinguish between their zeal for the honor of God and the effects of their private passions and resentment. I remember a citizen, who, having broken out of his house in Aldersgate Street or thereabout, went along the road to Islington. He attempted to have gone[123] in at the Angel Inn, and after that at the White Horse, two inns known still by the same signs, but was refused, after which he came to the Pyed[124] Bull, an inn also still continuing the same sign. He asked them for lodging for one night only, pretending to be going into Lincolnshire, and assuring them of his being very sound, and free from the infection, which also at that time had not reached much that way. They told him they had no lodging that they could spare but one bed up in the garret, and that they could spare that bed but for one night, some drovers being expected the next day with cattle: so, if he would accept of that lodging, he might have it, which he did. So a servant was sent up with a candle with him to show him the room. He was very well dressed, and looked like a person not used to lie in a garret; and when he came to the room, he fetched a deep sigh, and said to the servant, "I have seldom lain in such a lodging as this." However, the servant assured him again that they had no better. "Well," says he, "I must make shift.[125] This is a dreadful time, but it is but for one night." So he sat down upon the bedside, and bade the maid, I think it was, fetch him a pint of warm ale. Accordingly the servant went for the ale; but some hurry in the house, which perhaps employed her other ways, put it out of her head, and she went up no more to him. 
The next morning, seeing no appearance of the gentleman, somebody in the house asked the servant that had showed him upstairs what was become of him. She started. "Alas!" says she, "I never thought more of him. He bade me carry him some warm ale, but I forgot." Upon which, not the maid, but some other person, was sent up to see after him, who, coming into the room, found him stark dead, and almost cold, stretched out across the bed. His clothes were pulled off, his jaw fallen, his eyes open in a most frightful posture, the rug of the bed being grasped hard in one of his hands, so that it was plain he died soon after the maid left him; and it is probable, had she gone up with the ale, she had found him dead in a few minutes after he had sat down upon the bed. The alarm was great in the house, as any one may suppose, they having been free from the distemper till that disaster, which, bringing the infection to the house, spread it immediately to other houses round about it. I do not remember how many died in the house itself; but I think the maidservant who went up first with him fell presently ill by the fright, and several others; for, whereas there died but two in Islington of the plague the week before, there died nineteen the week after, whereof fourteen were of the plague. This was in the week from the 11th of July to the 18th. There was one shift[126] that some families had, and that not a few, when their houses happened to be infected, and that was this: the families who in the first breaking out of the distemper fled away into the country, and had retreats among their friends, generally found some or other of their neighbors or relations to commit the charge of those houses to, for the safety of the goods and the like. Some houses were indeed entirely locked up, the doors padlocked, the windows and doors having deal boards nailed over them, and only the inspection of them committed to the ordinary watchmen and parish officers; but these were but few. 
It was thought that there were not less than a thousand houses forsaken of the inhabitants in the city and suburbs, including what was in the outparishes and in Surrey, or the side of the water they called Southwark. This was besides the numbers of lodgers and of particular persons who were fled out of other families; so that in all it was computed that about two hundred thousand people were fled and gone in all.[127] But of this I shall speak again. But I mention it here on this account: namely, that it was a rule with those who had thus two houses in their keeping or care, that, if anybody was taken sick in a family, before the master of the family let the examiners or any other officer know of it, he immediately would send all the rest of his family, whether children or servants as it fell out to be, to such other house which he had not in charge, and then, giving notice of the sick person to the examiner, have a nurse or nurses appointed, and having another person to be shut up in the house with them (which many for money would do), so to take charge of the house in case the person should die. This was in many cases the saving a whole family, who, if they had been shut up with the sick person, would inevitably have perished. But, on the other hand, this was another of the inconveniences of shutting up houses; for the apprehensions and terror of being shut up made many run away with the rest of the family, who, though it was not publicly known, and they were not quite sick, had yet the distemper upon them; and who, by having an uninterrupted liberty to go about, but being obliged still to conceal their circumstances, or perhaps not knowing it themselves, gave the distemper to others, and spread the infection in a dreadful manner, as I shall explain further hereafter. 
I had in my family only an ancient woman that managed the house, a maidservant, two apprentices, and myself; and, the plague beginning to increase about us, I had many sad thoughts about what course I should take and how I should act. The many dismal objects[128] which happened everywhere as I went about the streets had filled my mind with a great deal of horror, for fear of the distemper itself, which was indeed very horrible in itself, and in some more than others. The swellings, which were generally in the neck or groin, when they grew hard, and would not break, grew so painful that it was equal to the most exquisite torture; and some, not able to bear the torment, threw themselves out at windows, or shot themselves, or otherwise made themselves away, and I saw several dismal objects of that kind. Others, unable to contain themselves, vented their pain by incessant roarings; and such loud and lamentable cries were to be heard, as we walked along the streets, that[129] would pierce the very heart to think of, especially when it was to be considered that the same dreadful scourge might be expected every moment to seize upon ourselves. I cannot say but that now I began to faint in my resolutions. My heart failed me very much, and sorely I repented of my rashness, when I had been out, and met with such terrible things as these I have talked of. I say I repented my rashness in venturing to abide in town, and I wished often that I had not taken upon me to stay, but had gone away with my brother and his family. Terrified by those frightful objects, I would retire home sometimes, and resolve to go out no more; and perhaps I would keep those resolutions for three or four days, which time I spent in the most serious thankfulness for my preservation and the preservation of my family, and the constant confession of my sins, giving myself up to God every day, and applying to him with fasting and humiliation and meditation. 
Such intervals as I had, I employed in reading books and in writing down my memorandums of what occurred to me every day, and out of which, afterwards, I took most of this work, as it relates to my observations without doors. What I wrote of my private meditations I reserve for private use, and desire it may not be made public on any account whatever. I also wrote other meditations upon divine subjects, such as occurred to me at that time, and were profitable to myself, but not fit for any other view, and therefore I say no more of that. I had a very good friend, a physician, whose name was Heath, whom I frequently visited during this dismal time, and to whose advice I was very much obliged for many things which he directed me to take by way of preventing the infection when I went out, as he found I frequently did, and to hold in my mouth when I was in the streets. He also came very often to see me; and as he was a good Christian, as well as a good physician, his agreeable conversation was a very great support to me in the worst of this terrible time. It was now the beginning of August, and the plague grew very violent and terrible in the place where I lived; and Dr. Heath coming to visit me, and finding that I ventured so often out in the streets, earnestly persuaded me to lock myself up, and my family, and not to suffer any of us to go out of doors; to keep all our windows fast, shutters and curtains close, and never to open them, but first to make a very strong smoke in the room, where the window or door was to be opened, with rosin[130] and pitch, brimstone and gunpowder, and the like; and we did this for some time. But, as I had not laid in a store of provision for such a retreat, it was impossible that we could keep within doors entirely. 
However, I attempted, though it was so very late, to do something towards it; and first, as I had convenience both for brewing and baking, I went and bought two sacks of meal, and for several weeks, having an oven, we baked all our own bread; also I bought malt, and brewed as much beer as all the casks I had would hold, and which seemed enough to serve my house for five or six weeks; also I laid in a quantity of salt butter and Cheshire cheese; but I had no flesh meat,[131] and the plague raged so violently among the butchers and slaughterhouses on the other side of our street, where they are known to dwell in great numbers, that it was not advisable so much as to go over the street among them. And here I must observe again, that this necessity of going out of our houses to buy provisions was in a great measure the ruin of the whole city; for the people catched the distemper, on these occasions, one of another; and even the provisions themselves were often tainted (at least I have great reason to believe so), and therefore I cannot say with satisfaction, what I know is repeated with great assurance, that the market people, and such as brought provisions to town, were never infected. I am certain the butchers of Whitechapel, where the greatest part of the flesh meat was killed, were dreadfully visited, and that at last to such a degree that few of their shops were kept open; and those that remained of them killed their meat at Mile End, and that way, and brought it to market upon horses. However, the poor people could not lay up provisions, and there was a necessity that they must go to market to buy, and others to send servants or their children; and, as this was a necessity which renewed itself daily, it brought abundance of unsound people to the markets; and a great many that went thither sound brought death home with them. It is true, people used all possible precaution. 
When any one bought a joint of meat in the market, they[132] would not take it out of the butcher's hand, but took it off the hooks themselves.[132] On the other hand, the butcher would not touch the money, but have it put into a pot full of vinegar, which he kept for that purpose. The buyer carried always small money to make up any odd sum, that they might take no change. They carried bottles for scents and perfumes in their hands, and all the means that could be used were employed; but then the poor could not do even these things, and they went at all hazards. Innumerable dismal stories we heard every day on this very account. Sometimes a man or woman dropped down dead in the very markets; for many people that had the plague upon them knew nothing of it till the inward gangrene had affected their vitals, and they died in a few moments. This caused that many died frequently in that manner in the street suddenly, without any warning: others, perhaps, had time to go to the next bulk[133] or stall, or to any door or porch, and just sit down and die, as I have said before. These objects were so frequent in the streets, that when the plague came to be very raging on one side, there was scarce any passing by the streets but that several dead bodies would be lying here and there upon the ground. On the other hand, it is observable, that though at first the people would stop as they went along, and call to the neighbors to come out on such an occasion, yet afterward no notice was taken of them; but that, if at any time we found a corpse lying, go across the way and not come near it; or, if in a narrow lane or passage, go back again, and seek some other way to go on the business we were upon. And in those cases the corpse was always left till the officers had notice to come and take them away, or till night, when the bearers attending the dead cart would take them up and carry them away. 
Nor did those undaunted creatures who performed these offices fail to search their pockets, and sometimes strip off their clothes, if they were well dressed, as sometimes they were, and carry off what they could get. But to return to the markets. The butchers took that care, that, if any person died in the market, they had the officers always at hand to take them up upon handbarrows, and carry them to the next churchyard; and this was so frequent that such were not entered in the weekly bill, found dead in the streets or fields, as is the case now, but they went into the general articles of the great distemper. But now the fury of the distemper increased to such a degree, that even the markets were but very thinly furnished with provisions, or frequented with buyers, compared to what they were before; and the lord mayor caused the country people who brought provisions to be stopped in the streets leading into the town, and to sit down there with their goods, where they sold what they brought, and went immediately away. And this encouraged the country people greatly to do so; for they sold their provisions at the very entrances into the town, and even in the fields, as particularly in the fields beyond Whitechapel, in Spittlefields. Note, those streets now called Spittlefields were then indeed open fields; also in St. George's Fields in Southwark, in Bunhill Fields, and in a great field called Wood's Close, near Islington. Thither the lord mayor, aldermen, and magistrates sent their officers and servants to buy for their families, themselves keeping within doors as much as possible; and the like did many other people. 
And after this method was taken, the country people came with great cheerfulness, and brought provisions of all sorts, and very seldom got any harm, which, I suppose, added also to that report of their being miraculously preserved.[134] As for my little family, having thus, as I have said, laid in a store of bread, butter, cheese, and beer, I took my friend and physician's advice, and locked myself up, and my family, and resolved to suffer the hardship of living a few months without flesh meat rather than to purchase it at the hazard of our lives. But, though I confined my family, I could not prevail upon my unsatisfied curiosity to stay within entirely myself, and, though I generally came frighted and terrified home, yet I could not restrain, only that, indeed, I did not do it so frequently as at first. I had some little obligations, indeed, upon me to go to my brother's house, which was in Coleman Street Parish, and which he had left to my care; and I went at first every day, but afterwards only once or twice a week. In these walks I had many dismal scenes before my eyes, as, particularly, of persons falling dead in the streets, terrible shrieks and screechings of women, who in their agonies would throw open their chamber windows, and cry out in a dismal surprising manner. It is impossible to describe the variety of postures in which the passions of the poor people would express themselves. Passing through Token-House Yard in Lothbury, of a sudden a casement violently opened just over my head, and a woman gave three frightful screeches, and then cried, "O death, death, death!" in a most inimitable tone, and which[135] struck me with horror, and[136] a chillness in my very blood. There was nobody to be seen in the whole street, neither did any other window open, for people had no curiosity now in any case, nor could anybody help one another: so I went on to pass into Bell Alley. 
Just in Bell Alley, on the right hand of the passage, there was a more terrible cry than that, though it was not so directed out at the window. But the whole family was in a terrible fright, and I could hear women and children run screaming about the rooms like distracted, when a garret window opened, and somebody from a window on the other side the alley called, and asked, "What is the matter?" Upon which from the first window it was answered, "O Lord, my old master has hanged himself!" The other asked again, "Is he quite dead?" and the first answered, "Ay, ay, quite dead; quite dead and cold!" This person was a merchant and a deputy alderman, and very rich. I care not to mention his name, though I knew his name too; but that would be a hardship to the family, which is now flourishing again.[137] But this is but one. It is scarce credible what dreadful cases happened in particular families every day,--people, in the rage of the distemper, or in the torment of their swellings, which was indeed intolerable, running out of their own government,[138] raving and distracted, and oftentimes laying violent hands upon themselves, throwing themselves out at their windows, shooting themselves, etc.; mothers murdering their own children in their lunacy; some dying of mere grief as a passion, some of mere fright and surprise without any infection at all; others frighted into idiotism[139] and foolish distractions, some into despair and lunacy, others into melancholy madness. The pain of the swelling was in particular very violent, and to some intolerable. The physicians and surgeons may be said to have tortured many poor creatures even to death. The swellings in some grew hard, and they applied violent drawing plasters, or poultices, to break them; and, if these did not do, they cut and scarified them in a terrible manner. 
In some, those swellings were made hard, partly by the force of the distemper, and partly by their being too violently drawn, and were so hard that no instrument could cut them; and then they burned them with caustics, so that many died raving mad with the torment, and some in the very operation. In these distresses, some, for want of help to hold them down in their beds or to look to them, laid hands upon themselves as above; some broke out into the streets, perhaps naked, and would run directly down to the river, if they were not stopped by the watchmen or other officers, and plunge themselves into the water wherever they found it. It often pierced my very soul to hear the groans and cries of those who were thus tormented. But of the two, this was counted the most promising particular in the whole infection: for if these swellings could be brought to a head, and to break and run, or, as the surgeons call it, to "digest," the patient generally recovered; whereas those who, like the gentlewoman's daughter, were struck with death at the beginning, and had the tokens come out upon them, often went about indifferently easy till a little before they died, and some till the moment they dropped down, as in apoplexies and epilepsies is often the case. Such would be taken suddenly very sick, and would run to a bench or bulk, or any convenient place that offered itself, or to their own houses, if possible, as I mentioned before, and there sit down, grow faint, and die. This kind of dying was much the same as it was with those who die of common mortifications,[140] who die swooning, and, as it were, go away in a dream. Such as died thus had very little notice of their being infected at all till the gangrene was spread through their whole body; nor could physicians themselves know certainly how it was with them till they opened their breasts, or other parts of their body, and saw the tokens. 
We had at this time a great many frightful stories told us of nurses and watchmen who looked after the dying people (that is to say, hired nurses, who attended infected people), using them barbarously, starving them, smothering them, or by other wicked means hastening their end, that is to say, murdering of them. And watchmen being set to guard houses that were shut up, when there has been but one person left, and perhaps that one lying sick, that[141] they have broke in and murdered that body, and immediately thrown them out into the dead cart; and so they have gone scarce cold to the grave. I cannot say but that some such murders were committed, and I think two were sent to prison for it, but died before they could be tried; and I have heard that three others, at several times, were executed for murders of that kind. But I must say I believe nothing of its being so common a crime as some have since been pleased to say; nor did it seem to be so rational, where the people were brought so low as not to be able to help themselves; for such seldom recovered, and there was no temptation to commit a murder, at least not equal to the fact, where they were sure persons would die in so short a time, and could not live. That there were a great many robberies and wicked practices committed even in this dreadful time, I do not deny. The power of avarice was so strong in some, that they would run any hazard to steal and to plunder; and, particularly in houses where all the families or inhabitants have been dead and carried out, they would break in at all hazards, and, without regard to the danger of infection, take even the clothes off the dead bodies, and the bedclothes from others where they lay dead. 
This, I suppose, must be the case of a family in Houndsditch, where a man and his daughter (the rest of the family being, as I suppose, carried away before by the dead cart) were found stark naked, one in one chamber and one in another, lying dead on the floor, and the clothes of the beds (from whence it is supposed they were rolled off by thieves) stolen, and carried quite away. It is indeed to be observed that the women were, in all this calamity, the most rash, fearless, and desperate creatures. And, as there were vast numbers that went about as nurses to tend those that were sick, they committed a great many petty thieveries in the houses where they were employed; and some of them were publicly whipped for it, when perhaps they ought rather to have been hanged for examples,[142] for numbers of houses were robbed on these occasions; till at length the parish officers were sent to recommend nurses to the sick, and always took an account who it was they sent, so as that they might call them to account if the house had been abused where they were placed. But these robberies extended chiefly to wearing-clothes, linen, and what rings or money they could come at, when the person died who was under their care, but not to a general plunder of the houses. And I could give you an account of one of these nurses, who several years after, being on her deathbed, confessed with the utmost horror the robberies she had committed at the time of her being a nurse, and by which she had enriched herself to a great degree. But as for murders, I do not find that there was ever any proofs of the fact in the manner as it has been reported, except as above. 
They did tell me, indeed, of a nurse in one place that laid a wet cloth upon the face of a dying patient whom she tended, and so put an end to his life, who was just expiring before; and another that smothered a young woman she was looking to, when she was in a fainting fit, and would have come to herself; some that killed them by giving them one thing, some another, and some starved them by giving them nothing at all. But these stories had two marks of suspicion that always attended them, which caused me always to slight them, and to look on them as mere stories that people continually frighted one another with: (1) That wherever it was that we heard it, they always placed the scene at the farther end of the town, opposite or most remote from where you were to hear it. If you heard it in Whitechapel, it had happened at St. Giles's, or at Westminster, or Holborn, or that end of the town; if you heard it at that end of the town, then it was done in Whitechapel, or the Minories, or about Cripplegate Parish; if you heard of it in the city, why, then, it happened in Southwark; and, if you heard of it in Southwark, then it was done in the city; and the like. In the next place, of whatsoever part you heard the story, the particulars were always the same, especially that of laying a wet double clout[143] on a dying man's face, and that of smothering a young gentlewoman: so that it was apparent, at least to my judgment, that there was more of tale than of truth in those things. A neighbor and acquaintance of mine, having some money owing to him from a shopkeeper in Whitecross Street or thereabouts, sent his apprentice, a youth about eighteen years of age, to endeavor to get the money. He came to the door, and, finding it shut, knocked pretty hard, and, as he thought, heard somebody answer within, but was not sure: so he waited, and after some stay knocked again, and then a third time, when he heard somebody coming downstairs. 
At length the man of the house came to the door. He had on his breeches, or drawers, and a yellow flannel waistcoat, no stockings, a pair of slip shoes, a white cap on his head, and, as the young man said, death in his face. When he opened the door, says he, "What do you disturb me thus for?" The boy, though a little surprised, replied, "I come from such a one; and my master sent me for the money, which he says you know of."--"Very well, child," returns the living ghost; "call, as you go by, at Cripplegate Church, and bid them ring the bell," and with these words shut the door again, and went up again, and died the same day, nay, perhaps the same hour. This the young man told me himself, and I have reason to believe it. This was while the plague was not come to a height. I think it was in June, towards the latter end of the month. It must have been before the dead carts came about, and while they used the ceremony of ringing the bell for the dead, which was over for certain, in that parish at least, before the month of July; for by the 25th of July there died five hundred and fifty and upwards in a week, and then they could no more bury in form[144] rich or poor. I have mentioned above, that, notwithstanding this dreadful calamity, yet that[145] numbers of thieves were abroad upon all occasions where they had found any prey, and that these were generally women. It was one morning about eleven o'clock, I had walked out to my brother's house in Coleman Street Parish, as I often did, to see that all was safe. My brother's house had a little court before it, and a brick wall and a gate in it, and within that several warehouses, where his goods of several sorts lay. It happened that in one of these warehouses were several packs of women's high-crowned hats, which came out of the country, and were, as I suppose, for exportation, whither I know not. 
I was surprised that when I came near my brother's door, which was in a place they called Swan Alley, I met three or four women with high-crowned hats on their heads; and, as I remembered afterwards, one, if not more, had some hats likewise in their hands. But as I did not see them come out at my brother's door, and not knowing that my brother had any such goods in his warehouse, I did not offer to say anything to them, but went across the way to shun meeting them, as was usual to do at that time, for fear of the plague. But when I came nearer to the gate, I met another woman, with more hats, come out of the gate. "What business, mistress," said I, "have you had there?"--"There are more people there," said she. "I have had no more business there than they." I was hasty to get to the gate then, and said no more to her; by which means she got away. But just as I came to the gate, I saw two more coming across the yard, to come out, with hats also on their heads and under their arms; at which I threw the gate to behind me, which, having a spring lock, fastened itself. And turning to the women, "Forsooth," said I, "what are you doing here?" and seized upon the hats, and took them from them. One of them, who, I confess, did not look like a thief, "Indeed," says she, "we are wrong; but we were told they were goods that had no owner: be pleased to take them again. And look yonder: there are more such customers as we." She cried, and looked pitifully: so I took the hats from her, and opened the gate, and bade them begone, for I pitied the women indeed. But when I looked towards the warehouse, as she directed, there were six or seven more, all women, fitting themselves with hats, as unconcerned and quiet as if they had been at a hatter's shop buying for their money. 
I was surprised, not at the sight of so many thieves only, but at the circumstances I was in; being now to thrust myself in among so many people, who for some weeks I had been so shy of myself, that, if I met anybody in the street, I would cross the way from them. They were equally surprised, though on another account. They all told me they were neighbors; that they had heard any one might take them; that they were nobody's goods; and the like. I talked big to them at first; went back to the gate and took out the key, so that they were all my prisoners; threatened to lock them all into the warehouse, and go and fetch my lord mayor's officers for them. They begged heartily, protested they found the gate open, and the warehouse door open, and that it had no doubt been broken open by some who expected to find goods of greater value; which indeed was reasonable to believe, because the lock was broke, and a padlock that hung to the door on the outside also loose, and not abundance of the hats carried away. At length I considered that this was not a time to be cruel and rigorous; and besides that, it would necessarily oblige me to go much about, to have several people come to me, and I go to several, whose circumstances of health I knew nothing of; and that, even at this time, the plague was so high as that there died four thousand a week; so that, in showing my resentment, or even in seeking justice for my brother's goods, I might lose my own life. So I contented myself with taking the names and places where some of them lived, who were really inhabitants in the neighborhood, and threatening that my brother should call them to an account for it when he returned to his habitation. 
Then I talked a little upon another footing with them, and asked them how they could do such things as these in a time of such general calamity, and, as it were, in the face of God's most dreadful judgments, when the plague was at their very doors, and, it may be, in their very houses, and they did not know but that the dead cart might stop at their doors in a few hours, to carry them to their graves. I could not perceive that my discourse made much impression upon them all that while, till it happened that there came two men of the neighborhood, hearing of the disturbance, and knowing my brother (for they had been both dependents upon his family), and they came to my assistance. These being, as I said, neighbors, presently knew three of the women, and told me who they were, and where they lived, and it seems they had given me a true account of themselves before. This brings these two men to a further remembrance. The name of one was John Hayward, who was at that time under-sexton of the parish of St. Stephen, Coleman Street (by under-sexton was understood at that time gravedigger and bearer of the dead). This man carried, or assisted to carry, all the dead to their graves, which were buried in that large parish, and who were carried in form, and, after that form of burying was stopped, went with the dead cart and the bell to fetch the dead bodies from the houses where they lay, and fetched many of them out of the chambers and houses; for the parish was, and is still, remarkable, particularly above all the parishes in London, for a great number of alleys and thoroughfares, very long, into which no carts could come, and where they were obliged to go and fetch the bodies a very long way, which alleys now remain to witness it; such as White's Alley, Cross Keys Court, Swan Alley, Bell Alley, White Horse Alley, and many more. 
Here they went with a kind of handbarrow, and laid the dead bodies on, and carried them out to the carts; which work he performed, and never had the distemper at all, but lived about twenty years after it, and was sexton of the parish to the time of his death. His wife at the same time was a nurse to infected people, and tended many that died in the parish, being for her honesty recommended by the parish officers; yet she never was infected, neither.[146] He never used any preservative against the infection other than holding garlic and rue[147] in his mouth, and smoking tobacco. This I also had from his own mouth. And his wife's remedy was washing her head in vinegar, and sprinkling her head-clothes so with vinegar as to keep them always moist; and, if the smell of any of those she waited on was more than ordinary offensive, she snuffed vinegar up her nose, and sprinkled vinegar upon her head-clothes, and held a handkerchief wetted with vinegar to her mouth. It must be confessed, that, though the plague was chiefly among the poor, yet were the poor the most venturous and fearless of it, and went about their employment with a sort of brutal courage: I must call it so, for it was founded neither on religion or prudence. Scarce did they use any caution, but ran into any business which they could get any employment in, though it was the most hazardous; such was that of tending the sick, watching houses shut up, carrying infected persons to the pesthouse, and, which was still worse, carrying the dead away to their graves. It was under this John Hayward's care, and within his bounds, that the story of the piper, with which people have made themselves so merry, happened; and he assured me that it was true. It is said that it was a blind piper; but, as John told me, the fellow was not blind, but an ignorant, weak, poor man, and usually went his rounds about ten o'clock at night, and went piping along from door to door. 
And the people usually took him in at public houses where they knew him, and would give him drink and victuals, and sometimes farthings; and he in return would pipe and sing, and talk simply, which diverted the people; and thus he lived. It was but a very bad time for this diversion while things were as I have told; yet the poor fellow went about as usual, but was almost starved: and when anybody asked how he did, he would answer, the dead cart had not taken him yet, but that they had promised to call for him next week. It happened one night that this poor fellow, whether somebody had given him too much drink or no (John Hayward said he had not drink in his house, but that they had given him a little more victuals than ordinary at a public house in Coleman Street), and the poor fellow having not usually had a bellyful, or perhaps not a good while, was laid all along upon the top of a bulk or stall, and fast asleep at a door in the street near London Wall, towards Cripplegate; and that, upon the same bulk or stall, the people of some house in the alley of which the house was a corner, hearing a bell (which they always rung before the cart came), had laid a body really dead of the plague just by him, thinking too that this poor fellow had been a dead body as the other was, and laid there by some of the neighbors. Accordingly, when John Hayward with his bell and the cart came along, finding two dead bodies lie upon the stall, they took them up with the instrument they used, and threw them into the cart; and all this while the piper slept soundly. From hence they passed along, and took in other dead bodies, till, as honest John Hayward told me, they almost buried him alive in the cart; yet all this while he slept soundly. 
At length the cart came to the place where the bodies were to be thrown into the ground, which, as I do remember, was at Mountmill; and, as the cart usually stopped some time before they were ready to shoot out the melancholy load they had in it, as soon as the cart stopped, the fellow awaked, and struggled a little to get his head out from among the dead bodies; when, raising himself up in the cart, he called out, "Hey, where am I?" This frighted the fellow that attended about the work; but, after some pause, John Hayward, recovering himself, said, "Lord bless us! There's somebody in the cart not quite dead!" So another called to him, and said, "Who are you?" The fellow answered, "I am the poor piper. Where am I?"--"Where are you?" says Hayward. "Why, you are in the dead cart, and we are going to bury you."--"But I ain't dead, though, am I?" says the piper; which made them laugh a little, though, as John said, they were heartily frightened at first. So they helped the poor fellow down, and he went about his business. I know the story goes, he set up[148] his pipes in the cart, and frighted the bearers and others, so that they ran away; but John Hayward did not tell the story so, nor say anything of his piping at all. But that he was a poor piper, and that he was carried away as above, I am fully satisfied of the truth of. It is to be noted here that the dead carts in the city were not confined to particular parishes; but one cart went through several parishes, according as the number of dead presented. Nor were they tied[149] to carry the dead to their respective parishes; but many of the dead taken up in the city were carried to the burying ground in the outparts for want of room. 
At the beginning of the plague, when there was now no more hope but that the whole city would be visited; when, as I have said, all that had friends or estates in the country retired with their families; and when, indeed, one would have thought the very city itself was running out of the gates, and that there would be nobody left behind,--you may be sure from that hour all trade, except such as related to immediate subsistence, was, as it were, at a full stop. This is so lively a case, and contains in it so much of the real condition of the people, that I think I cannot be too particular in it, and therefore I descend to the several arrangements or classes of people who fell into immediate distress upon this occasion. For example:--

1. All master workmen in manufactures, especially such as belonged to ornament and the less necessary parts of the people's dress, clothes, and furniture for houses; such as ribbon-weavers and other weavers, gold and silver lacemakers, and gold and silver wire-drawers, seamstresses, milliners, shoemakers, hatmakers, and glovemakers, also upholsterers, joiners, cabinet-makers, looking-glass-makers, and innumerable trades which depend upon such as these,--I say, the master workmen in such stopped their work, dismissed their journeymen and workmen and all their dependents.

2. As merchandising was at a full stop (for very few ships ventured to come up the river, and none at all went out[150]), so all the extraordinary officers of the customs, likewise the watermen, carmen, porters, and all the poor whose labor depended upon the merchants, were at once dismissed, and put out of business.

3. All the tradesmen usually employed in building or repairing of houses were at a full stop; for the people were far from wanting to build houses when so many thousand houses were at once stripped of their inhabitants; so that this one article[151] turned out all the ordinary workmen of that kind of business, such as bricklayers, masons, carpenters, joiners, plasterers, painters, glaziers, smiths, plumbers, and all the laborers depending on such.

4. As navigation was at a stop, our ships neither coming in or going out as before, so the seamen were all out of employment, and many of them in the last and lowest degree of distress. And with the seamen were all the several tradesmen and workmen belonging to and depending upon the building and fitting out of ships; such as ship-carpenters, calkers, ropemakers, dry coopers, sailmakers, anchor-smiths, and other smiths, blockmakers, carvers, gunsmiths, ship-chandlers, ship-carvers, and the like. The masters of those, perhaps, might live upon their substance; but the traders were universally at a stop, and consequently all their workmen discharged. Add to these, that the river was in a manner without boats, and all or most part of the watermen, lighter-men, boat-builders, and lighter-builders, in like manner idle and laid by.

5. All families retrenched their living as much as possible, as well those that fled as those that staid; so that an innumerable multitude of footmen, serving men, shopkeepers, journeymen, merchants' bookkeepers, and such sort of people, and especially poor maidservants, were turned off, and left friendless and helpless, without employment and without habitation; and this was really a dismal article.
I might be more particular as to this part; but it may suffice to mention, in general, all trades being stopped, employment ceased, the labor, and by that the bread of the poor, were cut off; and at first, indeed, the cries of the poor were most lamentable to hear, though, by the distribution of charity, their misery that way was gently[152] abated. Many, indeed, fled into the country; but, thousands of them having staid in London till nothing but desperation sent them away, death overtook them on the road, and they served for no better than the messengers of death: indeed, others carrying the infection along with them, spread it very unhappily into the remotest parts of the kingdom. The women and servants that were turned off from their places were employed as nurses to tend the sick in all places, and this took off a very great number of them. And which,[153] though a melancholy article in itself, yet was a deliverance in its kind, namely, the plague, which raged in a dreadful manner from the middle of August to the middle of October, carried off in that time thirty or forty thousand of these very people, which, had they been left, would certainly have been an insufferable burden by their poverty; that is to say, the whole city could not have supported the expense of them, or have provided food for them, and they would in time have been even driven to the necessity of plundering either the city itself, or the country adjacent, to have subsisted themselves, which would, first or last, have put the whole nation, as well as the city, into the utmost terror and confusion. 
It was observable, then, that this calamity of the people made them very humble; for now, for about nine weeks together, there died near a thousand a day, one day with another, even by the account of the weekly bills, which yet, I have reason to be assured, never gave a full account by many thousands; the confusion being such, and the carts working in the dark when they carried the dead, that in some places no account at all was kept, but they worked on; the clerks and sextons not attending for weeks together, and not knowing what number they carried. This account is verified by the following bills of mortality:--

                          Of All Diseases.   Of the Plague.
  Aug. 8 to Aug. 15            5,319             3,880
  Aug. 15 to Aug. 22           5,668             4,237
  Aug. 22 to Aug. 29           7,496             6,102
  Aug. 29 to Sept. 5           8,252             6,988
  Sept. 5 to Sept. 12          7,690             6,544
  Sept. 12 to Sept. 19         8,297             7,165
  Sept. 19 to Sept. 30         6,400             5,533
  Sept. 27 to Oct. 3           5,728             4,929
  Oct. 3 to Oct. 10            5,068             4,227
                              ------            ------
                              59,918            49,605

So that the gross of the people were carried off in these two months; for, as the whole number which was brought in to die of the plague was but 68,590, here is[154] 50,000 of them, within a trifle, in two months: I say 50,000, because as there wants 395 in the number above, so there wants two days of two months in the account of time.[155] Now, when I say that the parish officers did not give in a full account, or were not to be depended upon for their account, let any one but consider how men could be exact in such a time of dreadful distress, and when many of them were taken sick themselves, and perhaps died in the very time when their accounts were to be given in (I mean the parish clerks, besides inferior officers): for though these poor men ventured at all hazards, yet they were far from being exempt from the common calamity, especially if it be true that the parish of Stepney had within the year one hundred and sixteen sextons, gravediggers, and their assistants; that is to say, bearers, bellmen, and drivers of
carts for carrying off the dead bodies. Indeed, the work was not of such a nature as to allow them leisure to take an exact tale[156] of the dead bodies, which were all huddled together in the dark into a pit; which pit, or trench, no man could come nigh but at the utmost peril. I have observed often that in the parishes of Aldgate, Cripplegate, Whitechapel, and Stepney, there were five, six, seven, and eight hundred in a week in the bills; whereas, if we may believe the opinion of those that lived in the city all the time, as well as I, there died sometimes two thousand a week in those parishes. And I saw it under the hand of one that made as strict an examination as he could, that there really died a hundred thousand people of the plague in it that one year; whereas, in the bills, the article of the plague was but 68,590. If I may be allowed to give my opinion, by what I saw with my eyes, and heard from other people that were eyewitnesses, I do verily believe the same; viz., that there died at least a hundred thousand of the plague only, besides other distempers, and besides those which died in the fields and highways and secret places, out of the compass[157] of the communication, as it was called, and who were not put down in the bills, though they really belonged to the body of the inhabitants. It was known to us all that abundance of poor despairing creatures who had the distemper upon them, and were grown stupid or melancholy by their misery (as many were), wandered away into the fields and woods, and into secret uncouth[158] places, almost anywhere, to creep into a bush or hedge, and die. The inhabitants of the villages adjacent would in pity carry them food, and set it at a distance, that they might fetch it if they were able; and sometimes they were not able. And the next time they went they would find the poor wretches lie[159] dead, and the food untouched. 
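The reckoning in the bills quoted earlier, and the comparison with the yearly figure of 68,590, can be checked mechanically. What follows is only a minimal sketch in Python and no part of the original account: the names WEEKLY_BILLS, YEARLY_PLAGUE_IN_BILLS, ESTIMATED_PLAGUE_DEATHS and summarize are illustrative assumptions, and the figures are transcribed from the text as printed, including the hundred thousand that the narrator gives as his own estimate rather than as a recorded count.

```python
# Weekly figures transcribed from the bills of mortality quoted in the text:
# (period as printed, burials of all diseases, burials of the plague).
WEEKLY_BILLS = [
    ("Aug. 8 to Aug. 15", 5319, 3880),
    ("Aug. 15 to Aug. 22", 5668, 4237),
    ("Aug. 22 to Aug. 29", 7496, 6102),
    ("Aug. 29 to Sept. 5", 8252, 6988),
    ("Sept. 5 to Sept. 12", 7690, 6544),
    ("Sept. 12 to Sept. 19", 8297, 7165),
    ("Sept. 19 to Sept. 30", 6400, 5533),
    ("Sept. 27 to Oct. 3", 5728, 4929),
    ("Oct. 3 to Oct. 10", 5068, 4227),
]

# Yearly plague total as it stood in the bills, per the text.
YEARLY_PLAGUE_IN_BILLS = 68590
# The narrator's own estimate of the true number (an opinion, not a record).
ESTIMATED_PLAGUE_DEATHS = 100000


def summarize(bills):
    """Return (total of all diseases, total of the plague) for the given rows."""
    all_total = sum(all_diseases for _, all_diseases, _ in bills)
    plague_total = sum(plague for _, _, plague in bills)
    return all_total, plague_total


all_total, plague_total = summarize(WEEKLY_BILLS)

# "there wants 395 in the number above" to make up the round 50,000.
shortfall = 50000 - plague_total

# Share of the year's recorded plague burials that fell within these weeks.
share_of_year = plague_total / YEARLY_PLAGUE_IN_BILLS

# Deaths missing from the bills, if the narrator's estimate is accepted.
unrecorded = ESTIMATED_PLAGUE_DEATHS - YEARLY_PLAGUE_IN_BILLS

print(f"All diseases: {all_total}")
print(f"Of the plague: {plague_total}")
print(f"Short of 50,000 by: {shortfall}")
print(f"Share of yearly plague bills: {share_of_year:.1%}")
print(f"Unrecorded under the estimate: {unrecorded}")
```

On these figures the nine bills sum to the totals printed beneath the table, and the gap between the bills and the narrator's estimate for the year comes to somewhat over thirty thousand.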
The number of these miserable objects were[160] many; and I know so many that perished thus, and so exactly where, that I believe I could go to the very place, and dig their bones up still;[161] for the country people would go and dig a hole at a distance from them, and then, with long poles and hooks at the end of them, drag the bodies into these pits, and then throw the earth in form, as far as they could cast it, to cover them, taking notice how the wind blew, and so come on that side which the seamen call "to windward," that the scent of the bodies might blow from them. And thus great numbers went out of the world who were never known, or any account of them taken, as well within the bills of mortality as without. This indeed I had, in the main, only from the relation of others; for I seldom walked into the fields,[162] except towards Bethnal Green and Hackney, or as hereafter. But when I did walk, I always saw a great many poor wanderers at a distance, but I could know little of their cases; for, whether it were in the street or in the fields, if we had seen anybody coming, it was a general method to walk away. Yet I believe the account is exactly true. As this puts me upon mentioning my walking the streets and fields, I cannot omit taking notice what a desolate place the city was at that time. The great street I lived in, which is known to be one of the broadest of all the streets of London (I mean of the suburbs as well as the liberties, all the side where the butchers lived, especially without the bars[163]), was more like a green field than a paved street; and the people generally went in the middle with the horses and carts. It is true that the farthest end, towards Whitechapel Church, was not all paved, but even the part that was paved was full of grass also. 
But this need not seem strange, since the great streets within the city, such as Leadenhall Street, Bishopsgate Street, Cornhill, and even the Exchange itself, had grass growing in them in several places. Neither cart nor coach was seen in the streets from morning to evening, except some country carts to bring roots and beans, or pease, hay, and straw, to the market, and those but very few compared to what was usual. As for coaches, they were scarce used, but to carry sick people to the pesthouse and to other hospitals, and some few to carry physicians to such places as they thought fit to venture to visit; for really coaches were dangerous things, and people did not care to venture into them, because they did not know who might have been carried in them last; and sick infected people were, as I have said, ordinarily carried in them to the pesthouses; and sometimes people expired in them as they went along. It is true, when the infection came to such a height as I have now mentioned, there were very few physicians who cared to stir abroad to sick houses, and very many of the most eminent of the faculty[164] were dead, as well as the surgeons also; for now it was indeed a dismal time, and for about a month together, not taking any notice of the bills of mortality, I believe there did not die less than fifteen or seventeen hundred a day, one day with another. One of the worst days we had in the whole time, as I thought, was in the beginning of September, when, indeed, good people were beginning to think that God was resolved to make a full end of the people in this miserable city. This was at that time when the plague was fully come into the eastern parishes. The parish of Aldgate, if I may give my opinion, buried above one thousand a week for two weeks, though the bills did not say so many; but it[165] surrounded me at so dismal a rate, that there was not a house in twenty uninfected. 
In the Minories, in Houndsditch, and in those parts of Aldgate Parish about the Butcher Row, and the alleys over against me,--I say, in those places death reigned in every corner. Whitechapel Parish was in the same condition, and though much less than the parish I lived in, yet buried near six hundred a week, by the bills, and in my opinion near twice as many. Whole families, and indeed whole streets of families, were swept away together, insomuch that it was frequent for neighbors to call to the bellman to go to such and such houses and fetch out the people, for that they were all dead. And indeed the work of removing the dead bodies by carts was now grown so very odious and dangerous, that it was complained of that the bearers did not take care to clear such houses where all the inhabitants were dead, but that some of the bodies lay unburied till the neighboring families were offended by the stench, and consequently infected. And this neglect of the officers was such, that the churchwardens and constables were summoned to look after it; and even the justices of the hamlets[166] were obliged to venture their lives among them to quicken and encourage them; for innumerable of the bearers died of the distemper, infected by the bodies they were obliged to come so near. And had it not been that the number of people who wanted employment, and wanted bread, as I have said before, was so great that necessity drove them to undertake anything, and venture anything, they would never have found people to be employed; and then the bodies of the dead would have lain above ground, and have perished and rotted in a dreadful manner. 
But the magistrates cannot be enough commended in this, that they kept such good order for the burying of the dead, that as fast as any of those they employed to carry off and bury the dead fell sick or died (as was many times the case), they immediately supplied the places with others; which, by reason of the great number of poor that was left out of business, as above, was not hard to do. This occasioned, that, notwithstanding the infinite number of people which died and were sick, almost all together, yet they were always cleared away, and carried off every night; so that it was never to be said of London that the living were not able to bury the dead. As the desolation was greater during those terrible times, so the amazement of the people increased; and a thousand unaccountable things they would do in the violence of their fright, as others did the same in the agonies of their distemper: and this part was very affecting. Some went roaring, and crying, and wringing their hands, along the street; some would go praying, and lifting up their hands to heaven, calling upon God for mercy. I cannot say, indeed, whether this was not in their distraction; but, be it so, it was still an indication of a more serious mind when they had the use of their senses, and was much better, even as it was, than the frightful yellings and cryings that every day, and especially in the evenings, were heard in some streets. I suppose the world has heard of the famous Solomon Eagle, an enthusiast. He, though not infected at all, but in his head, went about denouncing of judgment upon the city in a frightful manner; sometimes quite naked, and with a pan of burning charcoal on his head. What he said or pretended, indeed, I could not learn. 
I will not say whether that clergyman was distracted or not, or whether he did it out of pure zeal for the poor people, who went every evening through the streets of Whitechapel, and, with his hands lifted up, repeated that part of the liturgy of the church continually, "Spare us, good Lord; spare thy people whom thou hast redeemed with thy most precious blood." I say I cannot speak positively of these things, because these were only the dismal objects which represented themselves to me as I looked through my chamber windows; for I seldom opened the casements while I confined myself within doors during that most violent raging of the pestilence, when indeed many began to think, and even to say, that there would none escape. And indeed I began to think so too, and therefore kept within doors for about a fortnight, and never stirred out. But I could not hold it. Besides, there were some people, who, notwithstanding the danger, did not omit publicly to attend the worship of God, even in the most dangerous times. And though it is true that a great many of the clergy did shut up their churches and fled, as other people did, for the safety of their lives, yet all did not do so. Some ventured to officiate, and to keep up the assemblies of the people by constant prayers, and sometimes sermons, or brief exhortations to repentance and reformation; and this as long as they would hear them. And dissenters[167] did the like also, and even in the very churches where the parish ministers were either dead or fled; nor was there any room for making any difference at such a time as this was. It pleased God that I was still spared, and very hearty and sound in health, but very impatient of being pent up within doors without air, as I had been for fourteen days or thereabouts. And I could not restrain myself, but I would go and carry a letter for my brother to the posthouse; then it was, indeed, that I observed a profound silence in the streets. 
When I came to the posthouse, as I went to put in my letter, I saw a man stand in one corner of the yard, and talking to another at a window; and a third had opened a door belonging to the office. In the middle of the yard lay a small leather purse, with two keys hanging at it, with money in it; but nobody would meddle with it. I asked how long it had lain there. The man at the window said it had lain almost an hour, but they had not meddled with it, because they did not know but the person who dropped it might come back to look for it. I had no such need of money, nor was the sum so big that I had any inclination to meddle with it or to get the money at the hazard it might be attended with: so I seemed to go away, when the man who had opened the door said he would take it up, but so that, if the right owner came for it, he should be sure to have it. So he went in and fetched a pail of water, and set it down hard by the purse, then went again and fetched some gunpowder, and cast a good deal of powder upon the purse, and then made a train from that which he had thrown loose upon the purse (the train reached about two yards); after this he goes in a third time, and fetches out a pair of tongs red hot, and which he had prepared, I suppose, on purpose; and first setting fire to the train of powder, that singed the purse, and also smoked the air sufficiently. But he was not content with that, but he then takes up the purse with the tongs, holding it so long till the tongs burnt through the purse, and then he shook the money out into the pail of water: so he carried it in. 
The money, as I remember, was about thirteen shillings, and some smooth groats[168] and brass farthings.[169] Much about the same time, I walked out into the fields towards Bow; for I had a great mind to see how things were managed in the river and among the ships; and, as I had some concern in shipping, I had a notion that it had been one of the best ways of securing one's self from the infection to have retired into a ship. And, musing how to satisfy my curiosity in that point, I turned away over the fields, from Bow to Bromley, and down to Blackwall, to the stairs that are there for landing, or taking water. Here I saw a poor man walking on the bank, or "sea wall" as they call it, by himself. I walked awhile also about, seeing the houses all shut up. At last I fell into some talk, at a distance, with this poor man. First I asked how people did thereabouts. "Alas, sir!" says he, "almost desolate, all dead or sick; here are very few families in this part, or in that village," pointing at Poplar, "where half of them are not dead already, and the rest sick." Then he, pointing to one house, "They are all dead," said he, "and the house stands open: nobody dares go into it. A poor thief," says he, "ventured in to steal something; but he paid dear for his theft, for he was carried to the churchyard too, last night." Then he pointed to several other houses. "There," says he, "they are all dead, the man and his wife and five children. There," says he, "they are shut up; you see a watchman at the door:" and so of other houses. 
"Why," says I, "what do you here all alone?"--"Why," says he, "I am a poor desolate man: it hath pleased God I am not yet visited, though my family is, and one of my children dead."--"How do you mean, then," said I, "that you are not visited?"--"Why," says he, "that is my house," pointing to a very little low boarded house, "and there my poor wife and two children live," said he, "if they may be said to live; for my wife and one of the children are visited; but I do not come at them." And with that word I saw the tears run very plentifully down his face; and so they did down mine too, I assure you. "But," said I, "why do you not come at them? How can you abandon your own flesh and blood?"--"O sir!" says he, "the Lord forbid! I do not abandon them, I work for them as much as I am able; and, blessed be the Lord! I keep them from want." And with that I observed he lifted up his eyes to heaven with a countenance that presently told me I had happened on a man that was no hypocrite, but a serious, religious, good man; and his ejaculation was an expression of thankfulness, that, in such a condition as he was in, he should be able to say his family did not want. "Well," says I, "honest man, that is a great mercy, as things go now with the poor. But how do you live, then, and how are you kept from the dreadful calamity that is now upon us all?"--"Why, sir," says he, "I am a waterman, and there is my boat," says he, "and the boat serves me for a house; I work in it in the day, and I sleep in it in the night: and what I get I lay it down upon that stone," says he, showing me a broad stone on the other side of the street, a good way from his house; "and then," says he, "I halloo and call to them till I make them hear, and they come and fetch it." "Well, friend," says I, "but how can you get money as a waterman? Does anybody go by water these times?"--"Yes, sir," says he, "in the way I am employed there does. Do you see there," says he, "five ships lie at anchor?" 
pointing down the river a good way below the town; "and do you see," says he, "eight or ten ships lie at the chain there, and at anchor yonder?" pointing above the town. "All those ships have families on board, of their merchants and owners, and such like, who have locked themselves up and live on board, close shut in, for fear of the infection; and I tend on them to fetch things for them, carry letters, and do what is absolutely necessary, that they may not be obliged to come on shore. And every night I fasten my boat on board one of the ship's boats, and there I sleep by myself, and, blessed be God! I am preserved hitherto." "Well," said I, "friend, but will they let you come on board after you have been on shore here, when this has been such a terrible place, and so infected as it is?" "Why, as to that," said he, "I very seldom go up the ship side, but deliver what I bring to their boat, or lie by the side, and they hoist it on board: if I did, I think they are in no danger from me, for I never go into any house on shore, or touch anybody, no, not of my own family; but I fetch provisions for them." "Nay," says I, "but that may be worse; for you must have those provisions of somebody or other; and since all this part of the town is so infected, it is dangerous so much as to speak with anybody; for the village," said I, "is, as it were, the beginning of London, though it be at some distance from it." "That is true," added he; "but you do not understand me right. I do not buy provisions for them here. I row up to Greenwich, and buy fresh meat there, and sometimes I row down the river to Woolwich,[170] and buy there; then I go to single farmhouses on the Kentish side, where I am known, and buy fowls and eggs and butter, and bring to the ships as they direct me, sometimes one, sometimes the other. I seldom come on shore here, and I came only now to call my wife, and hear how my little family do, and give them a little money which I received last night." "Poor man!" 
said I. "And how much hast thou gotten for them?" "I have gotten four shillings," said he, "which is a great sum, as things go now with poor men; but they have given me a bag of bread too, and a salt fish, and some flesh: so all helps out." "Well," said I, "and have you given it them yet?" "No," said he, "but I have called; and my wife has answered that she cannot come out yet, but in half an hour she hopes to come, and I am waiting for her. Poor woman!" says he, "she is brought sadly down; she has had a swelling, and it is broke, and I hope she will recover, but I fear the child will die. But it is the Lord!"--Here he stopped, and wept very much. "Well, honest friend," said I, "thou hast a sure comforter, if thou hast brought thyself to be resigned to the will of God: he is dealing with us all in judgment." "O sir!" says he, "it is infinite mercy if any of us are spared; and who am I to repine!" "Say'st thou so?" said I; "and how much less is my faith than thine!" And here my heart smote me, suggesting how much better this poor man's foundation was, on which he stayed in the danger, than mine: that he had nowhere to fly; that he had a family to bind him to attendance, which I had not; and mine was mere presumption, his a true dependence and a courage resting on God; and yet that he used all possible caution for his safety. I turned a little away from the man while these thoughts engaged me; for, indeed, I could no more refrain from tears than he. At length, after some further talk, the poor woman opened the door, and called, "Robert, Robert!" 
He answered, and bid her stay a few moments and he would come: so he ran down the common stairs to his boat, and fetched up a sack in which was the provisions he had brought from the ships; and when he returned he hallooed again; then he went to the great stone which he showed me, and emptied the sack, and laid all out, everything by themselves, and then retired; and his wife came with a little boy to fetch them away; and he called, and said, such a captain had sent such a thing, and such a captain such a thing, and at the end adds, "God has sent it all: give thanks to him." When the poor woman had taken up all, she was so weak she could not carry it at once in, though the weight was not much, neither: so she left the biscuit, which was in a little bag, and left a little boy to watch it till she came again. "Well, but," says I to him, "did you leave her the four shillings too, which you said was your week's pay?" "Yes, yes," says he; "you shall hear her own it." So he called again, "Rachel, Rachel!" which it seems was her name, "did you take up the money?"--"Yes," said she. "How much was it?" said he. "Four shillings and a groat," said she. "Well, well," says he, "the Lord keep you all;" and so he turned to go away. As I could not refrain from contributing tears to this man's story, so neither could I refrain my charity for his assistance; so I called him. "Hark thee, friend," said I, "come hither, for I believe thou art in health, that I may venture thee:" so I pulled out my hand, which was in my pocket before. "Here," says I, "go and call thy Rachel once more, and give her a little more comfort from me. God will never forsake a family that trusts in him as thou dost." So I gave him four other shillings, and bid him go lay them on the stone, and call his wife. I have not words to express the poor man's thankfulness; neither could he express it himself but by tears running down his face. 
He called his wife, and told her God had moved the heart of a stranger, upon hearing their condition, to give them all that money; and a great deal more such as that he said to her. The woman, too, made signs of the like thankfulness, as well to Heaven as to me, and joyfully picked it up; and I parted with no money all that year that I thought better bestowed. I then asked the poor man if the distemper had not reached to Greenwich. He said it had not till about a fortnight before; but that then he feared it had, but that it was only at that end of the town which lay south towards Deptford[171] Bridge; that he went only to a butcher's shop and a grocer's, where he generally bought such things as they sent him for, but was very careful. I asked him then how it came to pass that those people who had so shut themselves up in the ships had not laid in sufficient stores of all things necessary. He said some of them had; but, on the other hand, some did not come on board till they were frightened into it, and till it was too dangerous for them to go to the proper people to lay in quantities of things; and that he waited on two ships, which he showed me, that had laid in little or nothing but biscuit bread[172] and ship beer, and that he had bought everything else almost for them. I asked him if there were any more ships that had separated themselves as those had done. He told me yes; all the way up from the point, right against Greenwich, to within the shores of Limehouse and Redriff, all the ships that could have room rid[173] two and two in the middle of the stream, and that some of them had several families on board. I asked him if the distemper had not reached them. 
He said he believed it had not, except two or three ships, whose people had not been so watchful as to keep the seamen from going on shore as others had been; and he said it was a very fine sight to see how the ships lay up the Pool.[174] When he said he was going over to Greenwich as soon as the tide began to come in, I asked if he would let me go with him, and bring me back, for that I had a great mind to see how the ships were ranged, as he had told me. He told me if I would assure him, on the word of a Christian and of an honest man, that I had not the distemper, he would. I assured him that I had not; that it had pleased God to preserve me; that I lived in Whitechapel, but was too impatient of being so long within doors, and that I had ventured out so far for the refreshment of a little air, but that none in my house had so much as been touched with it. "Well, sir," says he, "as your charity has been moved to pity me and my poor family, sure you cannot have so little pity left as to put yourself into my boat if you were not sound in health, which would be nothing less than killing me, and ruining my whole family." The poor man troubled me so much when he spoke of his family with such a sensible concern and in such an affectionate manner, that I could not satisfy myself at first to go at all. I told him I would lay aside my curiosity rather than make him uneasy, though I was sure, and very thankful for it, that I had no more distemper upon me than the freshest man in the world. Well, he would not have me put it off neither, but, to let me see how confident he was that I was just to him, he now importuned me to go: so, when the tide came up to his boat, I went in, and he carried me to Greenwich. 
While he bought the things which he had in charge to buy, I walked up to the top of the hill, under which the town stands, and on the east side of the town, to get a prospect of the river; but it was a surprising sight to see the number of ships which lay in rows, two and two, and in some places two or three such lines in the breadth of the river, and this not only up to the town, between the houses which we call Ratcliff and Redriff, which they name the Pool, but even down the whole river, as far as the head of Long Reach, which is as far as the hills give us leave to see it. I cannot guess at the number of ships, but I think there must have been several hundreds of sail; and I could not but applaud the contrivance, for ten thousand people and more who attended ship affairs were certainly sheltered here from the violence of the contagion, and lived very safe and very easy. I returned to my own dwelling very well satisfied with my day's journey, and particularly with the poor man; also I rejoiced to see that such little sanctuaries were provided for so many families on board in a time of such desolation. I observed, also, that, as the violence of the plague had increased, so the ships which had families on board removed and went farther off, till, as I was told, some went quite away to sea, and put into such harbors and safe roads[175] on the north coast as they could best come at. But it was also true, that all the people who thus left the land, and lived on board the ships, were not entirely safe from the infection; for many died, and were thrown overboard into the river, some in coffins, and some, as I heard, without coffins, whose bodies were seen sometimes to drive up and down with the tide in the river. 
But I believe I may venture to say, that, in those ships which were thus infected, it either happened where the people had recourse to them too late, and did not fly to the ship till they had staid too long on shore, and had the distemper upon them, though perhaps they might not perceive it (and so the distemper did not come to them on board the ships, but they really carried it with them), or it was in these ships where the poor waterman said they had not had time to furnish themselves with provisions, but were obliged to send often on shore to buy what they had occasion for, or suffered boats to come to them from the shore; and so the distemper was brought insensibly among them.

And here I cannot but take notice that the strange temper of the people of London at that time contributed extremely to their own destruction. The plague began, as I have observed, at the other end of the town (namely, in Longacre, Drury Lane, etc.), and came on towards the city very gradually and slowly. It was felt at first in December, then again in February, then again in April (and always but a very little at a time), then it stopped till May; and even the last week in May there were but seventeen in all that end of the town. And all this while, even so long as till there died about three thousand a week, yet had the people in Redriff and in Wapping and Ratcliff, on both sides the river, and almost all Southwark side, a mighty fancy that they should not be visited, or at least that it would not be so violent among them. Some people fancied the smell of the pitch and tar, and such other things, as oil and resin and brimstone (which is much used by all trades relating to shipping), would preserve them. Others argued it,[176] because it[177] was in its extremest violence in Westminster and the parish of St. Giles's and St. Andrew's, etc., and began to abate again before it came among them, which was true, indeed, in part. For example:--

Aug. 8 to Aug. 15:
    St. Giles-in-the-Fields      242
    Cripplegate                  886
    Stepney                      197
    St. Mag.[178] Bermondsey      24
    Rotherhithe                    3
    Total this week            4,030

Aug. 15 to Aug. 22:
    St. Giles-in-the-Fields      175
    Cripplegate                  847
    Stepney                      273
    St. Mag. Bermondsey           36
    Rotherhithe                    2
    Total this week            5,319

N.B.[179]--That it was observed that the numbers mentioned in Stepney Parish at that time were generally all on that side where Stepney Parish joined to Shoreditch, which we now call Spittlefields, where the parish of Stepney comes up to the very wall of Shoreditch churchyard. And the plague at this time was abated at St. Giles-in-the-Fields, and raged most violently in Cripplegate, Bishopsgate, and Shoreditch Parishes, but there were not ten people a week that died of it in all that part of Stepney Parish which takes in Limehouse, Ratcliff Highway, and which are now the parishes of Shadwell and Wapping, even to St. Katherine's-by-the-Tower, till after the whole month of August was expired; but they paid for it afterwards, as I shall observe by and by.

This, I say, made the people of Redriff and Wapping, Ratcliff and Limehouse, so secure, and flatter themselves so much with the plague's going off without reaching them, that they took no care either to fly into the country or shut themselves up: nay, so far were they from stirring, that they rather received their friends and relations from the city into their houses; and several from other places really took sanctuary in that part of the town as a place of safety, and as a place which they thought God would pass over, and not visit as the rest was visited. And this was the reason, that, when it came upon them, they were more surprised, more unprovided, and more at a loss what to do, than they were in other places; for when it came among them really and with violence, as it did indeed in September and October, there was then no stirring out into the country.
Nobody would suffer a stranger to come near them, no, nor near the towns where they dwelt; and, as I have been told, several that wandered into the country on the Surrey side were found starved to death in the woods and commons; that country being more open and more woody than any other part so near London, especially about Norwood and the parishes of Camberwell, Dulwich,[180] and Lusum, where it seems nobody durst[181] relieve the poor distressed people for fear of the infection. This notion having, as I said, prevailed with the people in that part of the town, was in part the occasion, as I said before, that they had recourse to ships for their retreat; and where they did this early and with prudence, furnishing themselves so with provisions so that they had no need to go on shore for supplies, or suffer boats to come on board to bring them,--I say, where they did so, they had certainly the safest retreat of any people whatsoever. But the distress was such, that people ran on board in their fright without bread to eat, and some into ships that had no men on board to remove them farther off, or to take the boat and go down the river to buy provisions, where it may be done safely; and these often suffered, and were infected on board as much as on shore. As the richer sort got into ships, so the lower rank got into hoys,[182] smacks, lighters, and fishing boats; and many, especially watermen, lay in their boats: but those made sad work of it, especially the latter; for going about for provision, and perhaps to get their subsistence, the infection got in among them, and made a fearful havoc. Many of the watermen died alone in their wherries as they rid at their roads, as well above bridge[183] as below, and were not found sometimes till they were not in condition for anybody to touch or come near them. Indeed, the distress of the people at this seafaring end of the town was very deplorable, and deserved the greatest commiseration. But, alas! 
this was a time when every one's private safety lay so near them that they had no room to pity the distresses of others; for every one had death, as it were, at his door, and many even in their families, and knew not what to do, or whither to fly. This, I say, took away all compassion. Self-preservation, indeed, appeared here to be the first law: for the children ran away from their parents as they languished in the utmost distress; and in some places, though not so frequent as the other, parents did the like to their children. Nay, some dreadful examples there were, and particularly two in one week, of distressed mothers, raving and distracted, killing their own children; one whereof was not far off from where I dwelt, the poor lunatic creature not living herself long enough to be sensible of the sin of what she had done, much less to be punished for it. It is not, indeed, to be wondered at; for the danger of immediate death to ourselves took away all bowels of love, all concern for one another. I speak in general: for there were many instances of immovable affection, pity, and duty in many, and some that came to my knowledge, that is to say, by hearsay; for I shall not take upon me to vouch the truth of the particulars. I could tell here dismal stories of living infants being found sucking the breasts of their mothers or nurses after they have been dead of the plague; of a mother in the parish where I lived, who, having a child that was not well, sent for an apothecary to view the child, and when he came, as the relation goes, was giving the child suck at her breast, and to all appearance was herself very well; but, when the apothecary came close to her, he saw the tokens upon that breast with which she was suckling the child. 
He was surprised enough, to be sure; but, not willing to fright the poor woman too much, he desired she would give the child into his hand: so he takes the child, and, going to a cradle in the room, lays it in, and, opening its clothes, found the tokens upon the child too; and both died before he could get home to send a preventive medicine to the father of the child, to whom he had told their condition. Whether the child infected the nurse mother, or the mother the child, was not certain, but the last most likely. Likewise of a child brought home to the parents from a nurse that had died of the plague; yet the tender mother would not refuse to take in her child, and laid it in her bosom, by which she was infected and died, with the child in her arms dead also. It would make the hardest heart move at the instances that were frequently found of tender mothers tending and watching with their dear children, and even dying before them, and sometimes taking the distemper from them, and dying, when the child for whom the affectionate heart had been sacrificed has got over it and escaped. I have heard also of some who, on the death of their relations, have grown stupid with the insupportable sorrow; and of one in particular, who was so absolutely overcome with the pressure upon his spirits, that by degrees his head sunk into his body so between his shoulders, that the crown of his head was very little seen above the bone of his shoulders; and by degrees, losing both voice and sense, his face, looking forward, lay against his collar bone, and could not be kept up any otherwise, unless held up by the hands of other people. And the poor man never came to himself again, but languished near a year in that condition, and died. 
Nor was he ever once seen to lift up his eyes, or to look upon any particular object.[184] I cannot undertake to give any other than a summary of such passages as these, because it was not possible to come at the particulars where sometimes the whole families where such things happened were carried off by the distemper; but there were innumerable cases of this kind which presented[185] to the eye and the ear, even in passing along the streets, as I have hinted above. Nor is it easy to give any story of this or that family, which there was not divers parallel stories to be met with of the same kind. But as I am now talking of the time when the plague raged at the easternmost parts of the town; how for a long time the people of those parts had flattered themselves that they should escape, and how they were surprised when it came upon them as it did (for indeed it came upon them like an armed man when it did come),--I say this brings me back to the three poor men who wandered from Wapping, not knowing whither to go or what to do, and whom I mentioned before,--one a biscuit baker, one a sailmaker, and the other a joiner, all of Wapping or thereabouts. The sleepiness and security of that part, as I have observed, was such, that they not only did not shift for themselves as others did, but they boasted of being safe, and of safety being with them. And many people fled out of the city, and out of the infected suburbs, to Wapping, Ratcliff, Limehouse, Poplar, and such places, as to places of security. 
And it is not at all unlikely that their doing this helped to bring the plague that way faster than it might otherwise have come: for though I am much for people's flying away, and emptying such a town as this upon the first appearance of a like visitation, and that all people who have any possible retreat should make use of it in time, and begone, yet I must say, when all that will fly are gone, those that are left, and must stand it, should stand stock-still where they are, and not shift from one end of the town or one part of the town to the other; for that is the bane and mischief of the whole, and they carry the plague from house to house in their very clothes. Wherefore were we ordered to kill all the dogs and cats, but because, as they were domestic animals, and are apt to run from house to house and from street to street, so they are capable of carrying the effluvia or infectious steams of bodies infected, even in their furs and hair? And therefore it was, that, in the beginning of the infection, an order was published by the lord mayor and by the magistrates, according to the advice of the physicians, that all the dogs and cats should be immediately killed; and an officer was appointed for the execution. It is incredible, if their account is to be depended upon, what a prodigious number of those creatures were destroyed. I think they talked of forty thousand dogs and five times as many cats; few houses being without a cat, some having several, sometimes five or six in a house. All possible endeavors were used also to destroy the mice and rats, especially the latter, by laying rats-bane and other poisons for them; and a prodigious multitude of them were also destroyed. 
I often reflected upon the unprovided condition that the whole body of the people were in at the first coming of this calamity upon them; and how it was for want of timely entering into measures and managements, as well public as private, that all the confusions that followed were brought upon us, and that such a prodigious number of people sunk in that disaster which, if proper steps had been taken, might, Providence concurring, have been avoided, and which, if posterity think fit, they may take a caution and warning from. But I shall come to this part again.

I come back to my three men. Their story has a moral in every part of it; and their whole conduct, and that of some whom they joined with, is a pattern for all poor men to follow, or women either, if ever such a time comes again: and if there was no other end in recording it, I think this a very just one, whether my account be exactly according to fact or no. Two of them were said to be brothers, the one an old soldier, but now a biscuit baker; the other a lame sailor, but now a sailmaker; the third a joiner. Says John the biscuit baker, one day, to Thomas, his brother, the sailmaker, "Brother Tom, what will become of us? The plague grows hot in the city, and increases this way. What shall we do?" "Truly," says Thomas, "I am at a great loss what to do; for I find if it comes down into Wapping I shall be turned out of my lodging." And thus they began to talk of it beforehand.

John. Turned out of your lodging, Tom? If you are, I don't know who will take you in; for people are so afraid of one another now, there is no getting a lodging anywhere.

Tho. Why, the people where I lodge are good civil people, and have kindness for me too; but they say I go abroad every day to my work, and it will be dangerous; and they talk of locking themselves up, and letting nobody come near them.

John. Why, they are in the right, to be sure, if they resolve to venture staying in town.

Tho. Nay, I might even resolve to stay within doors too; for, except a suit of sails that my master has in hand, and which I am just finishing, I am like to get no more work a great while. There's no trade stirs now, workmen and servants are turned off everywhere; so that I might be glad to be locked up too. But I do not see that they will be willing to consent to that any more than to the other.

John. Why, what will you do then, brother? And what shall I do? for I am almost as bad as you. The people where I lodge are all gone into the country but a maid, and she is to go next week, and to shut the house quite up; so that I shall be turned adrift to the wide world before you: and I am resolved to go away too, if I knew but where to go.

Tho. We were both distracted we did not go away at first, when we might ha' traveled anywhere: there is no stirring now. We shall be starved if we pretend to go out of town. They won't let us have victuals, no, not for our money, nor let us come into the towns, much less into their houses.

John. And, that which is almost as bad, I have but little money to help myself with, neither.

Tho. As to that, we might make shift. I have a little, though not much; but I tell you there is no stirring on the road. I know a couple of poor honest men in our street have attempted to travel; and at Barnet,[186] or Whetstone, or thereabout, the people offered to fire at them if they pretended to go forward: so they are come back again quite discouraged.

John. I would have ventured their fire, if I had been there. If I had been denied food for my money, they should have seen me take it before their faces; and, if I had tendered money for it, they could not have taken any course with me by the law.

Tho. You talk your old soldier's language, as if you were in the Low Countries[187] now; but this is a serious thing. The people have good reason to keep anybody off that they are not satisfied are sound at such a time as this, and we must not plunder them.

John. No, brother, you mistake the case, and mistake me too: I would plunder nobody. But for any town upon the road to deny me leave to pass through the town in the open highway, and deny me provisions for my money, is to say the town has a right to starve me to death; which cannot be true.

Tho. But they do not deny you liberty to go back again from whence you came, and therefore they do not starve you.

John. But the next town behind me will, by the same rule, deny me leave to go back; and so they do starve me between them. Besides, there is no law to prohibit my traveling wherever I will on the road.

Tho. But there will be so much difficulty in disputing with them at every town on the road, that it is not for poor men to do it, or undertake it, at such a time as this is especially.

John. Why, brother, our condition, at this rate, is worse than anybody's else; for we can neither go away nor stay here. I am of the same mind with the lepers of Samaria.[188] If we stay here, we are sure to die. I mean especially as you and I are situated, without a dwelling house of our own, and without lodging in anybody's else. There is no lying in the street at such a time as this; we had as good[189] go into the dead cart at once. Therefore, I say, if we stay here, we are sure to die; and if we go away, we can but die. I am resolved to be gone.

Tho. You will go away. Whither will you go, and what can you do? I would as willingly go away as you, if I knew whither; but we have no acquaintance, no friends. Here we were born, and here we must die.

John. Look you, Tom, the whole kingdom is my native country as well as this town. You may as well say I must not go out of my house if it is on fire, as that I must not go out of the town I was born in when it is infected with the plague. I was born in England, and have a right to live in it if I can.

Tho. But you know every vagrant person may, by the laws of England, be taken up, and passed back to their last legal settlement.

John. But how shall they make me vagrant? I desire only to travel on upon my lawful occasions.

Tho. What lawful occasions can we pretend to travel, or rather wander, upon? They will not be put off with words.

John. Is not flying to save our lives a lawful occasion? And do they not all know that the fact is true? We cannot be said to dissemble.

Tho. But, suppose they let us pass, whither shall we go?

John. Anywhere to save our lives: it is time enough to consider that when we are got out of this town. If I am once out of this dreadful place, I care not where I go.

Tho. We shall be driven to great extremities. I know not what to think of it.

John. Well, Tom, consider of it a little.

This was about the beginning of July; and though the plague was come forward in the west and north parts of the town, yet all Wapping, as I have observed before, and Redriff and Ratcliff, and Limehouse and Poplar, in short, Deptford and Greenwich, both sides of the river from the Hermitage, and from over against it, quite down to Blackwall, was entirely free. There had not one person died of the plague in all Stepney Parish, and not one on the south side of Whitechapel Road, no, not in any parish; and yet the weekly bill was that very week risen up to 1,006.

It was a fortnight after this before the two brothers met again, and then the case was a little altered, and the plague was exceedingly advanced, and the number greatly increased. The bill was up at 2,785, and prodigiously increasing; though still both sides of the river, as below, kept pretty well. But some began to die in Redriff, and about five or six in Ratcliff Highway, when the sailmaker came to his brother John, express,[190] and in some fright; for he was absolutely warned out of his lodging, and had only a week to provide himself.
His brother John was in as bad a case, for he was quite out, and had only[191] begged leave of his master, the biscuit baker, to lodge in an outhouse belonging to his workhouse, where he only lay upon straw, with some biscuit sacks, or "bread sacks," as they called them, laid upon it, and some of the same sacks to cover him. Here they resolved, seeing all employment being at an end, and no work or wages to be had, they would make the best of their way to get out of the reach of the dreadful infection, and, being as good husbands as they could, would endeavor to live upon what they had as long as it would last, and then work for more, if they could get work anywhere of any kind, let it be what it would. While they were considering to put this resolution in practice in the best manner they could, the third man, who was acquainted very well with the sailmaker, came to know of the design, and got leave to be one of the number; and thus they prepared to set out. It happened that they had not an equal share of money; but as the sailmaker, who had the best stock, was, besides his being lame, the most unfit to expect to get anything by working in the country, so he was content that what money they had should all go into one public stock, on condition that whatever any one of them could gain more than another, it should, without any grudging, be all added to the public stock. They resolved to load themselves with as little baggage as possible, because they resolved at first to travel on foot, and to go a great way, that they might, if possible, be effectually safe. And a great many consultations they had with themselves before they could agree about what way they should travel; which they were so far from adjusting, that, even to the morning they set out, they were not resolved on it. At last the seaman put in a hint that determined it. 
"First," says he, "the weather is very hot; and therefore I am for traveling north, that we may not have the sun upon our faces, and beating upon our breasts, which will heat and suffocate us; and I have been told," says he, "that it is not good to overheat our blood at a time when, for aught we know, the infection may be in the very air. In the next place," says he, "I am for going the way that may be contrary to the wind as it may blow when we set out, that we may not have the wind blow the air of the city on our backs as we go." These two cautions were approved of, if it could be brought so to hit that the wind might not be in the south when they set out to go north. John the baker, who had been a soldier, then put in his opinion. "First," says he, "we none of us expect to get any lodging on the road, and it will be a little too hard to lie just in the open air. Though it may be warm weather, yet it may be wet and damp, and we have a double reason to take care of our healths at such a time as this; and therefore," says he, "you, brother Tom, that are a sailmaker, might easily make us a little tent; and I will undertake to set it up every night and take it down, and a fig for all the inns in England. If we have a good tent over our heads, we shall do well enough." The joiner opposed this, and told them, let them leave that to him: he would undertake to build them a house every night with his hatchet and mallet, though he had no other tools, which should be fully to their satisfaction, and as good as a tent. The soldier and the joiner disputed that point some time; but at last the soldier carried it for a tent: the only objection against it was, that it must be carried with them, and that would increase their baggage too much, the weather being hot. 
But the sailmaker had a piece of good hap[192] fall in, which made that easy; for his master who[193] he worked for, having a ropewalk, as well as sailmaking trade, had a little poor horse that he made no use of then, and, being willing to assist the three honest men, he gave them the horse for the carrying their baggage; also, for a small matter of three days' work that his man did for him before he went, he let him have an old topgallant sail[194] that was worn out, but was sufficient, and more than enough, to make a very good tent. The soldier showed how to shape it, and they soon, by his direction, made their tent, and fitted it with poles or staves for the purpose: and thus they were furnished for their journey; viz., three men, one tent, one horse, one gun for the soldier (who would not go without arms, for now he said he was no more a biscuit baker, but a trooper). The joiner had a small bag of tools, such as might be useful if he should get any work abroad, as well for their subsistence as his own. What money they had they brought all into one public stock, and thus they began their journey. It seems that in the morning when they set out, the wind blew, as the sailor said, by his pocket compass, at N.W. by W., so they directed, or rather resolved to direct, their course N.W. But then a difficulty came in their way, that as they set out from the hither end of Wapping, near the Hermitage, and that the plague was now very violent, especially on the north side of the city, as in Shoreditch and Cripplegate Parish, they did not think it safe for them to go near those parts: so they went away east, through Ratcliff Highway, as far as Ratcliff Cross, and leaving Stepney church still on their left hand, being afraid to come up from Ratcliff Cross to Mile End, because they must come just by the churchyard, and because the wind, that seemed to blow more from the west, blowed directly from the side of the city where the plague was hottest. 
So, I say, leaving Stepney, they fetched a long compass,[195] and, going to Poplar and Bromley, came into the great road just at Bow. Here the watch placed upon Bow Bridge would have questioned them; but they, crossing the road into a narrow way that turns out of the higher end of the town of Bow to Oldford, avoided any inquiry there, and traveled on to Oldford. The constables everywhere were upon their guard, not so much, it seems, to stop people passing by, as to stop them from taking up their abode in their towns; and, withal, because of a report that was newly raised at that time, and that indeed was not very improbable, viz., that the poor people in London, being distressed and starved for want of work, and by that means for want of bread, were up in arms, and had raised a tumult, and that they would come out to all the towns round to plunder for bread. This, I say, was only a rumor, and it was very well it was no more; but it was not so far off from being a reality as it has been thought, for in a few weeks more the poor people became so desperate by the calamity they suffered, that they were with great difficulty kept from running out into the fields and towns, and tearing all in pieces wherever they came. And, as I have observed before, nothing hindered them but that the plague raged so violently, and fell in upon them so furiously, that they rather went to the grave by thousands than into the fields in mobs by thousands; for in the parts about the parishes of St. 
Sepulchre's, Clerkenwell, Cripplegate, Bishopsgate, and Shoreditch, which were the places where the mob began to threaten, the distemper came on so furiously, that there died in those few parishes, even then, before the plague was come to its height, no less than 5,361 people in the first three weeks in August, when at the same time the parts about Wapping, Ratcliff, and Rotherhithe were, as before described, hardly touched, or but very lightly; so that in a word, though, as I said before, the good management of the lord mayor and justices did much to prevent the rage and desperation of the people from breaking out in rabbles and tumults, and, in short, from the poor plundering the rich,--I say, though they did much, the dead cart did more: for as I have said, that, in five parishes only, there died above five thousand in twenty days, so there might be probably three times that number sick all that time; for some recovered, and great numbers fell sick every day, and died afterwards. Besides, I must still be allowed to say, that, if the bills of mortality said five thousand, I always believed it was twice as many in reality, there being no room to believe that the account they gave was right, or that indeed they[196] were, among such confusions as I saw them in, in any condition to keep an exact account. But to return to my travelers. Here they were only examined, and, as they seemed rather coming from the country than from the city, they found the people easier with them; that they talked to them, let them come into a public house where the constable and his warders were, and gave them drink and some victuals, which greatly refreshed and encouraged them. And here it came into their heads to say, when they should be inquired of afterwards, not that they came from London, but that they came out of Essex. 
To forward this little fraud, they obtained so much favor of the constable at Oldford as to give them a certificate of their passing from Essex through that village, and that they had not been at London; which, though false in the common acceptation of London in the country, yet was literally true, Wapping or Ratcliff being no part either of the city or liberty. This certificate, directed to the next constable, that was at Homerton, one of the hamlets of the parish of Hackney, was so serviceable to them, that it procured them, not a free passage there only, but a full certificate of health from a justice of the peace, who, upon the constable's application, granted it without much difficulty. And thus they passed through the long divided town of Hackney (for it lay then in several separated hamlets), and traveled on till they came into the great north road, on the top of Stamford Hill. By this time they began to weary; and so, in the back road from Hackney, a little before it opened into the said great road, they resolved to set up their tent, and encamp for the first night; which they did accordingly, with this addition: that, finding a barn, or a building like a barn, and first searching as well as they could to be sure there was nobody in it, they set up their tent with the head of it against the barn. This they did also because the wind blew that night very high, and they were but young at such a way of lodging, as well as at the managing their tent. Here they went to sleep; but the joiner, a grave and sober man, and not pleased with their lying at this loose rate the first night, could not sleep, and resolved, after trying it to no purpose, that he would get out, and, taking the gun in his hand, stand sentinel, and guard his companions. So, with the gun in his hand, he walked to and again before the barn; for that stood in the field near the road, but within the hedge. 
He had not been long upon the scout, but he heard a noise of people coming on as if it had been a great number; and they came on, as he thought, directly towards the barn. He did not presently awake his companions, but in a few minutes more, their noise growing louder and louder, the biscuit baker called to him and asked him what was the matter, and quickly started out too. The other being the lame sailmaker, and most weary, lay still in the tent. As they expected, so the people whom they had heard came on directly to the barn, when one of our travelers challenged, like soldiers upon the guard, with, "Who comes there?" The people did not answer immediately; but one of them speaking to another that was behind them, "Alas, alas! we are all disappointed," says he; "here are some people before us; the barn is taken up." They all stopped upon that, as under some surprise; and it seems there were about thirteen of them in all, and some women among them. They consulted together what they should do; and by their discourse, our travelers soon found they were poor distressed people too, like themselves, seeking shelter and safety; and besides, our travelers had no need to be afraid of their coming up to disturb them, for as soon as they heard the words, "Who comes there?" they could hear the women say, as if frighted, "Do not go near them; how do you know but they may have the plague?" And when one of the men said, "Let us but speak to them," the women said, "No, don't, by any means; we have escaped thus far by the goodness of God; do not let us run into danger now, we beseech you." Our travelers found by this that they were a good sober sort of people, and flying for their lives as they were; and as they were encouraged by it, so John said to the joiner, his comrade, "Let us encourage them too, as much as we can." So he called to them. "Hark ye, good people," says the joiner; "we find by your talk that you are flying from the same dreadful enemy as we are. 
Do not be afraid of us; we are only three poor men of us. If you are free from the distemper, you shall not be hurt by us. We are not in the barn, but in a little tent here on the outside, and we will remove for you; we can set up our tent again immediately anywhere else." And upon this a parley began between the joiner, whose name was Richard, and one of their men, who said his name was Ford.
Ford. And do you assure us that you are all sound men?
Rich. Nay, we are concerned to tell you of it, that you may not be uneasy, or think yourselves in danger; but you see we do not desire you should put yourselves into any danger, and therefore I tell you that we have not made use of the barn; so we will remove from it, that you may be safe and we also.
Ford. That is very kind and charitable; but if we have reason to be satisfied that you are sound, and free from the visitation, why should we make you remove, now you are settled in your lodging, and, it may be, are laid down to rest? We will go into the barn, if you please, to rest ourselves awhile, and we need not disturb you.
Rich. Well, but you are more than we are. I hope you will assure us that you are all of you sound too, for the danger is as great from you to us as from us to you.
Ford. Blessed be God that some do escape, though it be but few! What may be our portion still, we know not, but hitherto we are preserved.
Rich. What part of the town do you come from? Was the plague come to the places where you lived?
Ford. Ay, ay, in a most frightful and terrible manner, or else we had not fled away as we do; but we believe there will be very few left alive behind us.
Rich. What part do you come from?
Ford. We are most of us from Cripplegate Parish; only two or three of Clerkenwell Parish, but on the hither side.
Rich. How, then, was it that you came away no sooner?
Ford. We have been away some time, and kept together as well as we could at the hither end of Islington, where we got leave to lie in an old uninhabited house, and had some bedding and conveniences of our own, that we brought with us; but the plague is come up into Islington too, and a house next door to our poor dwelling was infected and shut up, and we are come away in a fright.
Rich. And what way are you going?
Ford. As our lot shall cast us, we know not whither; but God will guide those that look up to him.
They parleyed no further at that time, but came all up to the barn, and with some difficulty got into it. There was nothing but hay in the barn, but it was almost full of that, and they accommodated themselves as well as they could, and went to rest; but our travelers observed that before they went to sleep, an ancient man, who, it seems, was the father of one of the women, went to prayer with all the company, recommending themselves to the blessing and protection of Providence before they went to sleep.
It was soon day at that time of the year; and as Richard the joiner had kept guard the first part of the night, so John the soldier relieved him, and he had the post in the morning. And they began to be acquainted with one another. It seems, when they left Islington, they intended to have gone north away to Highgate, but were stopped at Holloway, and there they would not let them pass; so they crossed over the fields and hills to the eastward, and came out at the Boarded River, and so, avoiding the towns, they left Hornsey on the left hand, and Newington on the right hand, and came into the great road about Stamford Hill on that side, as the three travelers had done on the other side. And now they had thoughts of going over the river in the marshes, and make forwards to Epping Forest, where they hoped they should get leave to rest.
It seems they were not poor, at least not so poor as to be in want: at least, they had enough to subsist them moderately for two or three months, when, as they said, they were in hopes the cold weather would check the infection, or at least the violence of it would have spent itself, and would abate, if it were only for want of people left alive to be infected. This was much the fate of our three travelers, only that they seemed to be the better furnished for traveling, and had it in their view to go farther off; for, as to the first, they did not propose to go farther than one day's journey, that so they might have intelligence every two or three days how things were at London. But here our travelers found themselves under an unexpected inconvenience, namely, that of their horse; for, by means of the horse to carry their baggage, they were obliged to keep in the road, whereas the people of this other band went over the fields or roads, path or no path, way or no way, as they pleased. Neither had they any occasion to pass through any town, or come near any town, other than to buy such things as they wanted for their necessary subsistence; and in that, indeed, they were put to much difficulty, of which in its place. But our three travelers were obliged to keep the road, or else they must commit spoil, and do the country a great deal of damage in breaking down fences and gates to go over inclosed fields, which they were loath to do if they could help it. Our three travelers, however, had a great mind to join themselves to this company, and take their lot with them; and, after some discourse, they laid aside their first design, which looked northward, and resolved to follow the other into Essex. So in the morning they took up their tent and loaded their horse, and away they traveled all together. 
They had some difficulty in passing the ferry at the riverside, the ferryman being afraid of them; but, after some parley at a distance, the ferryman was content to bring his boat to a place distant from the usual ferry, and leave it there for them to take it. So, putting themselves over, he directed them to leave the boat, and he, having another boat, said he would fetch it again; which it seems, however, he did not do for above eight days. Here, giving the ferryman money beforehand, they had a supply of victuals and drink, which he brought and left in the boat for them, but not without, as I said, having received the money beforehand. But now our travelers were at a great loss and difficulty how to get the horse over, the boat being small, and not fit for it, and at last could not do it without unloading the baggage and making him swim over. From the river they traveled towards the forest; but when they came to Walthamstow, the people of that town denied[197] to admit them, as was the case everywhere; the constables and their watchmen kept them off at a distance, and parleyed with them. They gave the same account of themselves as before; but these gave no credit to what they said, giving it for a reason, that two or three companies had already come that way and made the like pretenses, but that they had given several people the distemper in the towns where they had passed, and had been afterwards so hardly used by the country, though with justice too, as they had deserved, that about Brentwood[198] or that way, several of them perished in the fields, whether of the plague, or of mere want and distress, they could not tell. 
This was a good reason, indeed, why the people of Walthamstow should be very cautious, and why they should resolve not to entertain anybody that they were not well satisfied of; but as Richard the joiner, and one of the other men who parleyed with them, told them, it was no reason why they should block up the roads and refuse to let the people pass through the town, and who asked nothing of them but to go through the street; that, if their people were afraid of them, they might go into their houses and shut their doors: they would neither show them civility nor incivility, but go on about their business. The constables and attendants, not to be persuaded by reason, continued obstinate, and would hearken to nothing: so the two men that talked with them went back to their fellows to consult what was to be done. It was very discouraging in the whole, and they knew not what to do for a good while; but at last John, the soldier and biscuit baker, considering awhile, "Come," says he, "leave the rest of the parley to me." He had not appeared yet: so he sets the joiner, Richard, to work to cut some poles out of the trees, and shape them as like guns as he could; and in a little time he had five or six fair muskets, which at a distance would not be known; and about the part where the lock of a gun is, he caused them to wrap cloth and rags, such as they had, as soldiers do in wet weather to preserve the locks of their pieces from rust; the rest was discolored with clay or mud, such as they could get; and all this while the rest of them sat under the trees by his direction, in two or three bodies, where they made fires at a good distance from one another. 
While this was doing, he advanced himself, and two or three with him, and set up their tent in the lane, within sight of the barrier which the townsmen had made, and set a sentinel just by it with the real gun, the only one they had, and who[199] walked to and fro with the gun on his shoulder, so as that the people of the town might see them; also he tied the horse to a gate in the hedge just by, and got some dry sticks together and kindled a fire on the other side of the tent, so that the people of the town could see the fire and the smoke, but could not see what they were doing at it.
After the country people had looked upon them very earnestly a great while, and by all that they could see could not but suppose that they were a great many in company, they began to be uneasy, not for their going away, but for staying where they were; and above all, perceiving they had horses and arms (for they had seen one horse and one gun at the tent, and they had seen others of them walk about the field on the inside of the hedge by the side of the lane with their muskets, as they took them to be, shouldered),--I say, upon such a sight as this, you may be assured they were alarmed and terribly frightened; and it seems they went to a justice of the peace to know what they should do. What the justice advised them to, I know not; but towards the evening they called from the barrier, as above, to the sentinel at the tent.
"What do you want?" says John.
"Why, what do you intend to do?" says the constable.
"To do?" says John; "what would you have us to do?"
Const. Why don't you begone? What do you stay there for?
John. Why do you stop us on the King's highway, and pretend to refuse us leave to go on our way?
Const. We are not bound to tell you the reason, though we did let you know it was because of the plague.
John. We told you we were all sound, and free from the plague, which we were not bound to have satisfied you of, and yet you pretend to stop us on the highway.
Const. We have a right to stop it up, and our own safety obliges us to it; besides, this is not the King's highway, it is a way upon sufferance. You see here is a gate, and if we do let people pass here, we make them pay toll.
John. We have a right to seek our own safety as well as you; and you may see we are flying for our lives, and it is very unchristian and unjust in you to stop us.
Const. You may go back from whence you came, we do not hinder you from that.
John. No, it is a stronger enemy than you that keeps us from doing that, or else we should not have come hither.
Const. Well, you may go any other way, then.
John. No, no. I suppose you see we are able to send you going, and all the people of your parish, and come through your town when we will; but, since you have stopped us here, we are content. You see we have encamped here, and here we will live. We hope you will furnish us with victuals.
Const. We furnish you! What mean you by that?
John. Why, you would not have us starve, would you? If you stop us here, you must keep us.
Const. You will be ill kept at our maintenance.
John. If you stint us, we shall make ourselves the better allowance.
Const. Why, you will not pretend to quarter upon us by force, will you?
John. We have offered no violence to you yet, why do you seem to oblige us to it? I am an old soldier, and cannot starve; and, if you think that we shall be obliged to go back for want of provisions, you are mistaken.
Const. Since you threaten us, we shall take care to be strong enough for you. I have orders to raise the county upon you.
John. It is you that threaten, not we; and, since you are for mischief, you cannot blame us if we do not give you time for it. We shall begin our march in a few minutes.
Const. What is it you demand of us?
John. At first we desired nothing of you but leave to go through the town. We should have offered no injury to any of you, neither would you have had any injury or loss by us. We are not thieves, but poor people in distress, and flying from the dreadful plague in London, which devours thousands every week. We wonder how you can be so unmerciful.
Const. Self-preservation obliges us.
John. What! To shut up your compassion, in a case of such distress as this?
Const. Well, if you will pass over the fields on your left hand, and behind that part of the town, I will endeavor to have gates opened for you.
John. Our horsemen cannot pass with our baggage that way. It does not lead into the road that we want to go, and why should you force us out of the road? Besides, you have kept us here all day without any provisions but such as we brought with us. I think you ought to send us some provisions for our relief.
Const. If you will go another way, we will send you some provisions.
John. That is the way to have all the towns in the county stop up the ways against us.
Const. If they all furnish you with food, what will you be the worse? I see you have tents: you want no lodging.
John. Well, what quantity of provisions will you send us?
Const. How many are you?
John. Nay, we do not ask enough for all our company. We are in three companies. If you will send us bread for twenty men and about six or seven women for three days, and show us the way over the field you speak of, we desire not to put your people into any fear for us. We will go out of our way to oblige you, though we are as free from infection as you are.
Const. And will you assure us that your other people shall offer us no new disturbance?
John. No, no; you may depend on it.
Const. You must oblige yourself, too, that none of your people shall come a step nearer than where the provisions we send you shall be set down.
John. I answer for it, we will not.
Here he called to one of his men, and bade him order Captain Richard and his people to march the lower way on the side of the marshes, and meet them in the forest; which was all a sham, for they had no Captain Richard or any such company.
Accordingly, they sent to the place twenty loaves of bread and three or four large pieces of good beef, and opened some gates, through which they passed; but none of them had courage so much as to look out to see them go, and as it was evening, if they had looked, they could not have seen them so as to know how few they were. This was John the soldier's management; but this gave such an alarm to the county, that, had they really been two or three hundred, the whole county would have been raised upon them, and they would have been sent to prison, or perhaps knocked on the head. They were soon made sensible of this, for two days afterwards they found several parties of horsemen and footmen also about, in pursuit of three companies of men armed, as they said, with muskets, who were broke out from London and had the plague upon them, and that were not only spreading the distemper among the people, but plundering the country. As they saw now the consequence of their case, they soon saw the danger they were in: so they resolved, by the advice also of the old soldier, to divide themselves again. John and his two comrades, with the horse, went away as if towards Waltham,[200]--the other in two companies, but all a little asunder,--and went towards Epping.[200] The first night they encamped all in the forest, and not far off from one another, but not setting up the tent for fear that should discover them. On the other hand, Richard went to work with his ax and his hatchet, and, cutting down branches of trees, he built three tents or hovels, in which they all encamped with as much convenience as they could expect. The provisions they had at Walthamstow served them very plentifully this night; and as for the next, they left it to Providence. They had fared so well with the old soldier's conduct, that they now willingly made him their leader, and the first of his conduct appeared to be very good. 
He told them that they were now at a proper distance enough from London; that, as they need not be immediately beholden to the country for relief, they ought to be as careful the country did not infect them as that they did not infect the country; that what little money they had they must be as frugal of as they could; that as he would not have them think of offering the country any violence, so they must endeavor to make the sense of their condition go as far with the country as it could. They all referred themselves to his direction: so they left their three houses standing, and the next day went away towards Epping; the captain also (for so they now called him), and his two fellow travelers, laid aside their design of going to Waltham, and all went together. When they came near Epping, they halted, choosing out a proper place in the open forest, not very near the highway, but not far out of it, on the north side, under a little cluster of low pollard trees.[201] Here they pitched their little camp, which consisted of three large tents or huts made of poles, which their carpenter, and such as were his assistants, cut down, and fixed in the ground in a circle, binding all the small ends together at the top, and thickening the sides with boughs of trees and bushes, so that they were completely close and warm. They had besides this a little tent where the women lay by themselves, and a hut to put the horse in. It happened that the next day, or the next but one, was market day at Epping, when Captain John and one of the other men went to market and bought some provisions, that is to say, bread, and some mutton and beef; and two of the women went separately, as if they had not belonged to the rest, and bought more. John took the horse to bring it home, and the sack which the carpenter carried his tools in, to put it in. The carpenter went to work and made them benches and stools to sit on, such as the wood he could get would afford, and a kind of a table to dine on. 
They were taken no notice of for two or three days; but after that, abundance of people ran out of the town to look at them, and all the country was alarmed about them. The people at first seemed afraid to come near them; and, on the other hand, they desired the people to keep off, for there was a rumor that the plague was at Waltham, and that it had been in Epping two or three days. So John called out to them not to come to them. "For," says he, "we are all whole and sound people here, and we would not have you bring the plague among us, nor pretend we brought it among you." After this, the parish officers came up to them, and parleyed with them at a distance, and desired to know who they were, and by what authority they pretended to fix their stand at that place. John answered very frankly, they were poor distressed people from London, who, foreseeing the misery they should be reduced to if the plague spread into the city, had fled out in time for their lives, and, having no acquaintance or relations to fly to, had first taken up at Islington, but, the plague being come into that town, were fled farther; and, as they supposed that the people of Epping might have refused them coming into their town, they had pitched their tents thus in the open field and in the forest, being willing to bear all the hardships of such a disconsolate lodging rather than have any one think, or be afraid, that they should receive injury by them. At first the Epping people talked roughly to them, and told them they must remove; that this was no place for them; and that they pretended to be sound and well, but that they might be infected with the plague, for aught they knew, and might infect the whole country, and they could not suffer them there. 
John argued very calmly with them a great while, and told them that London was the place by which they, that is, the townsmen of Epping, and all the country round them, subsisted; to whom they sold the produce of their lands, and out of whom they made the rents of their farms; and to be so cruel to the inhabitants of London, or to any of those by whom they gained so much, was very hard; and they would be loath to have it remembered hereafter, and have it told, how barbarous, how inhospitable, and how unkind they were to the people of London when they fled from the face of the most terrible enemy in the world; that it would be enough to make the name of an Epping man hateful throughout all the city, and to have the rabble stone them in the very streets whenever they came so much as to market; that they were not yet secure from being visited themselves, and that, as he heard, Waltham was already; that they would think it very hard, that, when any of them fled for fear before they were touched, they should be denied the liberty of lying so much as in the open fields. 
The Epping men told them again that they, indeed, said they were sound, and free from the infection, but that they had no assurance of it; and that it was reported that there had been a great rabble of people at Walthamstow, who made such pretenses of being sound as they did, but that they threatened to plunder the town, and force their way, whether the parish officers would or no; that there were near two hundred of them, and had arms and tents like Low Country soldiers; that they extorted provisions from the town by threatening them with living upon them at free quarter,[202] showing their arms, and talking in the language of soldiers; and that several of them having gone away towards Rumford and Brentwood, the country had been infected by them, and the plague spread into both those large towns, so that the people durst not go to market there, as usual; that it was very likely they were some of that party, and, if so, they deserved to be sent to the county jail, and be secured till they had made satisfaction for the damage they had done, and for the terror and fright they had put the country into. John answered, that what other people had done was nothing to them; that they assured them they were all of one company; that they had never been more in number than they saw them at that time (which, by the way, was very true); that they came out in two separate companies, but joined by the way, their cases being the same; that they were ready to give what account of themselves anybody desired of them, and to give in their names and places of abode, that so they might be called to an account for any disorder that they might be guilty of; that the townsmen might see they were content to live hardly, and only desired a little room to breathe in on the forest, where it was wholesome (for where it was not, they could not stay, and would decamp if they found it otherwise there). 
"But," said the townsmen, "we have a great charge of poor upon our hands already, and we must take care not to increase it. We suppose you can give us no security against your being chargeable to our parish and to the inhabitants, any more than you can of being dangerous to us as to the infection." "Why, look you," says John, "as to being chargeable to you, we hope we shall not. If you will relieve us with provisions for our present necessity, we will be very thankful. As we all lived without charity when we were at home, so we will oblige ourselves fully to repay you, if God please to bring us back to our own families and houses in safety, and to restore health to the people of London. "As to our dying here, we assure you, if any of us die, we that survive will bury them, and put you to no expense, except it should be that we should all die, and then, indeed, the last man, not being able to bury himself, would put you to that single expense; which I am persuaded," says John, "he would leave enough behind him to pay you for the expense of. "On the other hand," says John, "if you will shut up all bowels of compassion, and not relieve us at all, we shall not extort anything by violence, or steal from any one; but when that little we have is spent, if we perish for want, God's will be done!" John wrought so upon the townsmen by talking thus rationally and smoothly to them, that they went away; and though they did not give any consent to their staying there, yet they did not molest them, and the poor people continued there three or four days longer without any disturbance. In this time they had got some remote acquaintance with a victualing house on the outskirts of the town, to whom they called at a distance to bring some little things that they wanted, and which they caused to be set down at some distance, and always paid for very honestly. 
During this time the younger people of the town came frequently pretty near them, and would stand and look at them, and would sometimes talk with them at some space between; and particularly it was observed that the first sabbath day the poor people kept retired, worshiped God together, and were heard to sing psalms. These things, and a quiet, inoffensive behavior, began to get them the good opinion of the country, and the people began to pity them and speak very well of them; the consequence of which was, that upon the occasion of a very wet, rainy night, a certain gentleman who lived in the neighborhood sent them a little cart with twelve trusses or bundles of straw, as well for them to lodge upon as to cover and thatch their huts, and to keep them dry. The minister of a parish not far off, not knowing of the other, sent them also about two bushels of wheat and half a bushel of white pease. They were very thankful, to be sure, for this relief, and particularly the straw was a very great comfort to them; for though the ingenious carpenter had made them frames to lie in, like troughs, and filled them with leaves of trees and such things as they could get, and had cut all their tent cloth out to make coverlids, yet they lay damp and hard and unwholesome till this straw came, which was to them like feather beds, and, as John said, more welcome than feather beds would have been at another time. This gentleman and the minister having thus begun, and given an example of charity to these wanderers, others quickly followed; and they received every day some benevolence or other from the people, but chiefly from the gentlemen who dwelt in the country round about. Some sent them chairs, stools, tables, and such household things as they gave notice they wanted. Some sent them blankets, rugs, and coverlids; some, earthenware; and some, kitchen ware for ordering[203] their food. 
Encouraged by this good usage, their carpenter, in a few days, built them a large shed or house with rafters, and a roof in form, and an upper floor, in which they lodged warm, for the weather began to be damp and cold in the beginning of September; but this house being very well thatched, and the sides and roof very thick, kept out the cold well enough. He made also an earthen wall at one end, with a chimney in it; and another of the company, with a vast deal of trouble and pains, made a funnel to the chimney to carry out the smoke. Here they lived comfortably, though coarsely, till the beginning of September, when they had the bad news to hear, whether true or not, that the plague, which was very hot at Waltham Abbey on the one side, and Rumford and Brentwood on the other side, was also come to Epping, to Woodford, and to most of the towns upon the forest; and which, as they said, was brought down among them chiefly by the higglers,[204] and such people as went to and from London with provisions. If this was true, it was an evident contradiction to the report which was afterwards spread all over England, but which, as I have said, I cannot confirm of my own knowledge, namely, that the market people carrying provisions to the city never got the infection or carried it back into the country; both which, I have been assured, has been[205] false. It might be that they were preserved even beyond expectation, though not to a miracle;[206] that abundance went and came and were not touched; and that was much encouragement for the poor people of London, who had been completely miserable if the people that brought provisions to the markets had not been many times wonderfully preserved, or at least more preserved than could be reasonably expected. But these new inmates began to be disturbed more effectually, for the towns about them were really infected. 
And they began to be afraid to trust one another so much as to go abroad for such things as they wanted; and this pinched them very hard, for now they had little or nothing but what the charitable gentlemen of the country supplied them with. But, for their encouragement, it happened that other gentlemen of the country, who had not sent them anything before, began to hear of them and supply them. And one sent them a large pig, that is to say, a porker; another, two sheep; and another sent them a calf: in short, they had meat enough, and sometimes had cheese and milk, and such things. They were chiefly put to it[207] for bread; for when the gentlemen sent them corn, they had nowhere to bake it or to grind it. This made them eat the first two bushels of wheat that was sent them, in parched corn, as the Israelites of old did, without grinding or making bread of it.[208] At last they found means to carry their corn to a windmill near Woodford, where they had it ground; and afterwards the biscuit baker made a hearth so hollow and dry, that he could bake biscuit cakes tolerably well, and thus they came into a condition to live without any assistance or supplies from the towns. And it was well they did; for the country was soon after fully infected, and about a hundred and twenty were said to have died of the distemper in the villages near them, which was a terrible thing to them. On this they called a new council, and now the towns had no need to be afraid they should settle near them; but, on the contrary, several families of the poorer sort of the inhabitants quitted their houses, and built huts in the forest, after the same manner as they had done. 
But it was observed that several of these poor people that had so removed had the sickness even in their huts or booths, the reason of which was plain: namely, not because they removed into the air, but[209] (1) because they did not remove time[210] enough, that is to say, not till, by openly conversing with other people, their neighbors, they had the distemper upon them (or, as may be said, among them), and so carried it about with them whither they went; or (2) because they were not careful enough, after they were safely removed out of the towns, not to come in again and mingle with the diseased people. But be it which of these it will, when our travelers began to perceive that the plague was not only in the towns, but even in the tents and huts on the forest near them, they began then not only to be afraid, but to think of decamping and removing; for, had they staid, they would have been in manifest danger of their lives. It is not to be wondered that they were greatly afflicted at being obliged to quit the place where they had been so kindly received, and where they had been treated with so much humanity and charity; but necessity, and the hazard of life which they came out so far to preserve, prevailed with them, and they saw no remedy. John, however, thought of a remedy for their present misfortune; namely, that he would first acquaint that gentleman who was their principal benefactor with the distress they were in, and to[211] crave his assistance and advice. This good charitable gentleman encouraged them to quit the place, for fear they should be cut off from any retreat at all by the violence of the distemper; but whither they should go, that he found very hard to direct them to. At last John asked of him, whether he, being a justice of the peace, would give them certificates of health to other justices who[212] they might come before, that so, whatever might be their lot, they might not be repulsed, now they had been also so long from London.
This his worship immediately granted, and gave them proper letters of health; and from thence they were at liberty to travel whither they pleased. Accordingly they had a full certificate of health, intimating that they had resided in a village in the county of Essex so long; that, being examined and scrutinized sufficiently, and having been retired from all conversation[213] for above forty days, without any appearance of sickness, they were therefore certainly concluded to be sound men, and might be safely entertained anywhere, having at last removed rather for fear of the plague, which was come into such a town, rather[214] than for having any signal of infection upon them, or upon any belonging to them. With this certificate they removed, though with great reluctance; and, John inclining not to go far from home, they removed towards the marshes on the side of Waltham. But here they found a man who, it seems, kept a weir or stop upon the river, made to raise water for the barges which go up and down the river; and he terrified them with dismal stories of the sickness having been spread into all the towns on the river and near the river, on the side of Middlesex and Hertfordshire (that is to say, into Waltham, Waltham Cross, Enfield, and Ware, and all the towns on the road), that they were afraid to go that way; though it seems the man imposed upon them, for that[215] the thing was not really true. 
However, it terrified them, and they resolved to move across the forest towards Rumford and Brentwood; but they heard that there were numbers of people fled out of London that way, who lay up and down in the forest, reaching near Rumford, and who, having no subsistence or habitation, not only lived oddly,[216] and suffered great extremities in the woods and fields for want of relief, but were said to be made so desperate by those extremities, as that they offered many violences to the country, robbed and plundered, and killed cattle, and the like; and others, building huts and hovels by the roadside, begged, and that with an importunity next door to demanding relief: so that the country was very uneasy, and had been obliged to take some of them up. This, in the first place, intimated to them that they would be sure to find the charity and kindness of the county, which they had found here where they were before, hardened and shut up against them; and that, on the other hand, they would be questioned wherever they came, and would be in danger of violence from others in like cases with themselves. Upon all these considerations, John, their captain, in all their names, went back to their good friend and benefactor who had relieved them before, and, laying their case truly before him, humbly asked his advice; and he as kindly advised them to take up their old quarters again, or, if not, to remove but a little farther out of the road, and directed them to a proper place for them. And as they really wanted some house, rather than huts, to shelter them at that time of the year, it growing on towards Michaelmas, they found an old decayed house, which had been formerly some cottage or little habitation, but was so out of repair as[217] scarce habitable; and by consent of a farmer, to whose farm it belonged, they got leave to make what use of it they could. 
The ingenious joiner, and all the rest by his directions, went to work with it, and in a very few days made it capable to shelter them all in case of bad weather; and in which there was an old chimney and an old oven, though both lying in ruins, yet they made them both fit for use; and, raising additions, sheds, and lean-to's[218] on every side, they soon made the house capable to hold them all. They chiefly wanted boards to make window shutters, floors, doors, and several other things; but as the gentleman above favored them, and the country was by that means made easy with them, and, above all, that they were known to be all sound and in good health, everybody helped them with what they could spare. Here they encamped for good and all, and resolved to remove no more. They saw plainly how terribly alarmed that country was everywhere at anybody that came from London, and that they should have no admittance anywhere but with the utmost difficulty; at least no friendly reception and assistance, as they had received here. Now, although they received great assistance and encouragement from the country gentlemen, and from the people round about them, yet they were put to great straits; for the weather grew cold and wet in October and November, and they had not been used to so much hardship, so that they got cold in their limbs, and distempers, but never had the infection. And thus about December they came home to the city again. I give this story thus at large, principally to give an account[219] what became of the great numbers of people which immediately appeared in the city as soon as the sickness abated; for, as I have said, great numbers of those that were able, and had retreats in the country, fled to those retreats. 
So when it[220] was increased to such a frightful extremity as I have related, the middling people[221] who had not friends fled to all parts of the country where they could get shelter, as well those that had money to relieve themselves as those that had not. Those that had money always fled farthest, because they were able to subsist themselves; but those who were empty suffered, as I have said, great hardships, and were often driven by necessity to relieve their wants at the expense of the country. By that means the country was made very uneasy at them, and sometimes took them up, though even then they scarce knew what to do with them, and were always very backward to punish them; but often, too, they forced them from place to place, till they were obliged to come back again to London. I have, since my knowing this story of John and his brother, inquired, and found that there were a great many of the poor disconsolate people, as above, fled into the country every way; and some of them got little sheds and barns and outhouses to live in, where they could obtain so much kindness of the country, and especially where they had any, the least satisfactory account to give of themselves, and particularly that they did not come out of London too late. But others, and that in great numbers, built themselves little huts and retreats in the fields and woods, and lived like hermits in holes and caves, or any place they could find, and where, we may be sure, they suffered great extremities, such that many of them were obliged to come back again, whatever the danger was. 
And so those little huts were often found empty, and the country people supposed the inhabitants lay dead in them of the plague, and would not go near them for fear, no, not in a great while; nor is it unlikely but that some of the unhappy wanderers might die so all alone, even sometimes for want of help, as particularly in one tent or hut was found a man dead, and on the gate of a field just by was cut with his knife, in uneven letters, the following words, by which it may be supposed the other man escaped, or that, one dying first, the other buried him as well as he could:--
O m I s E r Y!
We Bo T H Sh a L L D y E,
W o E, W o E
I have given an account already of what I found to have been the case down the river among the seafaring men, how the ships lay in the "offing," as it is called, in rows or lines, astern of one another, quite down from the Pool as far as I could see. I have been told that they lay in the same manner quite down the river as low as Gravesend,[222] and some far beyond, even everywhere, or in every place where they could ride with safety as to wind and weather. Nor did I ever hear that the plague reached to any of the people on board those ships, except such as lay up in the Pool, or as high as Deptford Reach, although the people went frequently on shore to the country towns and villages, and farmers' houses, to buy fresh provisions (fowls, pigs, calves, and the like) for their supply. Likewise I found that the watermen on the river above the bridge found means to convey themselves away up the river as far as they could go; and that they had, many of them, their whole families in their boats, covered with tilts[223] and bales, as they call them, and furnished with straw within for their lodging; and that they lay thus all along by the shore in the marshes, some of them setting up little tents with their sails, and so lying under them on shore in the day, and going into their boats at night.
And in this manner, as I have heard, the riversides were lined with boats and people as long as they had anything to subsist on, or could get anything of the country; and indeed the country people, as well gentlemen as others, on these and all other occasions, were very forward to relieve them, but they were by no means willing to receive them into their towns and houses, and for that we cannot blame them. There was one unhappy citizen, within my knowledge, who had been visited in a dreadful manner, so that his wife and all his children were dead, and himself and two servants only left, with an elderly woman, a near relation, who had nursed those that were dead as well as she could. This disconsolate man goes to a village near the town, though not within the bills of mortality, and, finding an empty house there, inquires out the owner, and took the house. After a few days he got a cart, and loaded it with goods, and carries them down to the house. The people of the village opposed his driving the cart along, but, with some arguings and some force, the men that drove the cart along got through the street up to the door of the house. There the constable resisted them again, and would not let them be brought in. The man caused the goods to be unloaded and laid at the door, and sent the cart away, upon which they carried the man before a justice of peace; that is to say, they commanded him to go, which he did. The justice ordered him to cause the cart to fetch away the goods again, which he refused to do; upon which the justice ordered the constable to pursue the carters and fetch them back, and make them reload the goods and carry them away, or to set them in the stocks[224] till they[225] came for further orders; and if they could not find them,[226] and the man would not consent to take them[227] away, they[225] should cause them[227] to be drawn with hooks from the house door, and burned in the street. 
The poor distressed man, upon this, fetched the goods again, but with grievous cries and lamentations at the hardship of his case. But there was no remedy: self-preservation obliged the people to those severities which they would not otherwise have been concerned in. Whether this poor man lived or died, I cannot tell, but it was reported that he had the plague upon him at that time, and perhaps the people might report that to justify their usage of him; but it was not unlikely that either he or his goods, or both, were dangerous, when his whole family had been dead of the distemper so little a while before. I know that the inhabitants of the towns adjacent to London were much blamed for cruelty to the poor people that ran from the contagion in their distress, and many very severe things were done, as may be seen from what has been said; but I cannot but say also, that where there was room for charity and assistance to the people, without apparent danger to themselves, they were willing enough to help and relieve them. But as every town were indeed judges in their own case, so the poor people who ran abroad in their extremities were often ill used, and driven back again into the town; and this caused infinite exclamations and outcries against the country towns, and made the clamor very popular. And yet more or less, maugre[228] all the caution, there was not a town of any note within ten (or, I believe, twenty) miles of the city, but what was more or less infected, and had some[229] died among them. I have heard the accounts of several, such as they were reckoned up, as follows:--
Enfield                    32
Hornsey                    58
Newington                  17
Tottenham                  42
Edmonton                   19
Barnet and Hadley          43
St. Albans                121
Watford                    45
Uxbridge                  117
Hertford                   90
Ware                      160
Hodsdon                    30
Waltham Abbey              23
Epping                     26
Deptford                  623
Greenwich                 631
Eltham and Lusum           85
Croydon                    61
Brentwood                  70
Rumford                   109
Barking             about 200
Brandford                 432
Kingston                  122
Staines                    82
Chertsey                   18
Windsor                   103
cum aliis.[230]
Another thing might render the country more strict with respect to the citizens, and especially with respect to the poor, and this was what I hinted at before; namely, that there was a seeming propensity, or a wicked inclination, in those that were infected, to infect others. There have been great debates among our physicians as to the reason of this. Some will have it to be in the nature of the disease, and that it impresses every one that is seized upon by it with a kind of rage and a hatred against their own kind, as if there were a malignity, not only in the distemper to communicate itself, but in the very nature of man, prompting him with evil will, or an evil eye, that as they say in the case of a mad dog, who, though the gentlest creature before of any of his kind, yet then will fly upon and bite any one that comes next him, and those as soon as any, who have been most observed[231] by him before. Others placed it to the account of the corruption of human nature, who[232] cannot bear to see itself more miserable than others of its own species, and has a kind of involuntary wish that all men were as unhappy or in as bad a condition as itself. Others say it was only a kind of desperation, not knowing or regarding what they did, and consequently unconcerned at the danger or safety, not only of anybody near them, but even of themselves also. And indeed, when men are once come to a condition to abandon themselves, and be unconcerned for the safety or at the danger of themselves, it cannot be so much wondered that they should be careless of the safety of other people.
But I choose to give this grave debate quite a different turn, and answer it or resolve it all by saying that I do not grant the fact. On the contrary, I say that the thing is not really so, but that it was a general complaint raised by the people inhabiting the outlying villages against the citizens, to justify, or at least excuse, those hardships and severities so much talked of, and in which complaints both sides may be said to have injured one another; that is to say, the citizens pressing to be received and harbored in time of distress, and with the plague upon them, complain of the cruelty and injustice of the country people in being refused entrance, and forced back again with their goods and families; and the inhabitants, finding themselves so imposed upon, and the citizens breaking in, as it were, upon them, whether they would or no, complain that when they[233] were infected, they were not only regardless of others, but even willing to infect them: neither of which was really true, that is to say, in the colors they[234] were described in. It is true there is something to be said for the frequent alarms which were given to the country, of the resolution of the people of London to come out by force, not only for relief, but to plunder and rob; that they ran about the streets with the distemper upon them without any control; and that no care was taken to shut up houses, and confine the sick people from infecting others; whereas, to do the Londoners justice, they never practiced such things, except in such particular cases as I have mentioned above, and such like. 
On the other hand, everything was managed with so much care, and such excellent order was observed in the whole city and suburbs, by the care of the lord mayor and aldermen, and by the justices of the peace, churchwardens, etc., in the outparts, that London may be a pattern to all the cities in the world for the good government and the excellent order that was everywhere kept, even in the time of the most violent infection, and when the people were in the utmost consternation and distress. But of this I shall speak by itself. One thing, it is to be observed, was owing principally to the prudence of the magistrates, and ought to be mentioned to their honor; viz., the moderation which they used in the great and difficult work of shutting up houses. It is true, as I have mentioned, that the shutting up of houses was a great subject of discontent, and I may say, indeed, the only subject of discontent among the people at that time; for the confining the sound in the same house with the sick was counted very terrible, and the complaints of people so confined were very grievous: they were heard in the very streets, and they were sometimes such that called for resentment, though oftener for compassion. They had no way to converse with any of their friends but out of their windows, where they would make such piteous lamentations as often moved the hearts of those they talked with, and of others who, passing by, heard their story; and as those complaints oftentimes reproached the severity, and sometimes the insolence, of the watchmen placed at their doors, those watchmen would answer saucily enough, and perhaps be apt to affront the people who were in the street talking to the said families; for which, or for their ill treatment of the families, I think seven or eight of them in several places were killed. I know not whether I should say murdered or not, because I cannot enter into the particular cases. 
It is true, the watchmen were on their duty, and acting in the post where they were placed by a lawful authority; and killing any public legal officer in the execution of his office is always, in the language of the law, called "murder." But as they were not authorized by the magistrate's instructions, or by the power they acted under, to be injurious or abusive, either to the people who were under their observation or to any that concerned themselves for them, so that,[235] when they did so, they might be said to act themselves, not their office; to act as private persons, not as persons employed; and consequently, if they brought mischief upon themselves by such an undue behavior, that mischief was upon their own heads. And indeed they had so much the hearty curses of the people, whether they deserved it or not, that, whatever befell them, nobody pitied them; and everybody was apt to say they deserved it, whatever it was. Nor do I remember that anybody was ever punished, at least to any considerable degree, for whatever was done to the watchmen that guarded their houses. What variety of stratagems were used to escape, and get out of houses thus shut up, by which the watchmen were deceived or overpowered, and that[236] the people got away, I have taken notice of already, and shall say no more to that; but I say the magistrates did moderate and ease families upon many occasions in this case, and particularly in that of taking away or suffering to be removed the sick persons out of such houses, when they were willing to be removed, either to a pesthouse or other places, and sometimes giving the well persons in the family so shut up leave to remove, upon information given that they were well, and that they would confine themselves in such houses where they went, so long as should be required of them. 
The concern, also, of the magistrates for the supplying such poor families as were infected,--I say, supplying them with necessaries, as well physic as food,--was very great: and in which they did not content themselves with giving the necessary orders to the officers appointed; but the aldermen, in person and on horseback, frequently rode to such houses, and caused the people to be asked at their windows whether they were duly attended or not, also whether they wanted anything that was necessary, and if the officers had constantly carried their messages, and fetched them such things as they wanted, or not. And if they answered in the affirmative, all was well; but if they complained that they were ill supplied, and that the officer did not do his duty, or did not treat them civilly, they (the officers) were generally removed, and others placed in their stead. It is true, such complaint might be unjust; and if the officer had such arguments to use as would convince the magistrate that he was right, and that the people had injured him, he was continued, and they reproved. But this part could not well bear a particular inquiry, for the parties could very ill be well heard and answered in the street from the windows, as was the case then. The magistrates, therefore, generally chose to favor the people, and remove the man, as what seemed to be the least wrong and of the least ill consequence; seeing, if the watchman was injured, yet they could easily make him amends by giving him another post of a like nature; but, if the family was injured, there was no satisfaction could be made to them, the damage, perhaps, being irreparable, as it concerned their lives. A great variety of these cases frequently happened between the watchmen and the poor people shut up, besides those I formerly mentioned about escaping. 
Sometimes the watchmen were absent, sometimes drunk, sometimes asleep, when the people wanted them; and such never failed to be punished severely, as indeed they deserved. But, after all that was or could be done in these cases, the shutting up of houses, so as to confine those that were well with those that were sick, had very great inconveniences in it, and some that were very tragical, and which merited to have been considered, if there had been room for it: but it was authorized by a law, it had the public good in view as the end chiefly aimed at; and all the private injuries that were done by the putting it in execution must be put to the account of the public benefit. It is doubtful whether, in the whole, it contributed anything to the stop of the infection; and indeed I cannot say it did, for nothing could run with greater fury and rage than the infection did when it was in its chief violence, though the houses infected were shut up as exactly and effectually as it was possible. Certain it is, that, if all the infected persons were effectually shut in, no sound person could have been infected by them, because they could not have come near them.[237] But the case was this (and I shall only touch it here); namely, that the infection was propagated insensibly, and by such persons as were not visibly infected, who neither knew whom they infected, nor whom they were infected by. A house in Whitechapel was shut up for the sake of one infected maid, who had only spots, not the tokens, come out upon her, and recovered; yet these people obtained no liberty to stir, neither for air or exercise, forty days. Want of breath, fear, anger, vexation, and all the other griefs attending such an injurious treatment, cast the mistress of the family into a fever; and visitors came into the house and said it was the plague, though the physicians declared it was not. 
However, the family were obliged to begin their quarantine anew, on the report of the visitor or examiner, though their former quarantine wanted but a few days of being finished. This oppressed them so with anger and grief, and, as before, straitened them also so much as to room, and for want of breathing and free air, that most of the family fell sick, one of one distemper, one of another, chiefly scorbutic[238] ailments, only one a violent cholic; until, after several prolongings of their confinement, some or other of those that came in with the visitors to inspect the persons that were ill, in hopes of releasing them, brought the distemper with them, and infected the whole house; and all or most of them died, not of the plague as really upon them before, but of the plague that those people brought them, who should have been careful to have protected them from it. And this was a thing which frequently happened, and was indeed one of the worst consequences of shutting houses up. I had about this time a little hardship put upon me, which I was at first greatly afflicted at, and very much disturbed about, though, as it proved, it did not expose me to any disaster; and this was, being appointed, by the alderman of Portsoken Ward, one of the examiners of the houses in the precinct where I lived. We had a large parish, and had no less than eighteen examiners, as the order called us: the people called us visitors. I endeavored with all my might to be excused from such an employment, and used many arguments with the alderman's deputy to be excused; particularly, I alleged that I was against shutting up houses at all, and that it would be very hard to oblige me to be an instrument in that which was against my judgment, and which I did verily believe would not answer the end it was intended for. 
But all the abatement I could get was only, that whereas the officer was appointed by my lord mayor to continue two months, I should be obliged to hold it but three weeks, on condition, nevertheless, that I could then get some other sufficient housekeeper to serve the rest of the time for me; which was, in short, but a very small favor, it being very difficult to get any man to accept of such an employment that was fit to be intrusted with it. It is true that shutting up of houses had one effect which I am sensible was of moment; namely, it confined the distempered people, who would otherwise have been both very troublesome and very dangerous in their running about streets with the distemper upon them, which, when they were delirious, they would have done in a most frightful manner, as, indeed, they began to do at first very much until they were restrained; nay, so very open they were, that the poor would go about and beg at people's doors, and say they had the plague upon them, and beg rags for their sores, or both, or anything that delirious nature happened to think of. A poor unhappy gentlewoman, a substantial citizen's wife, was, if the story be true, murdered by one of these creatures in Aldersgate Street, or that way. He was going along the street, raving mad, to be sure, and singing. The people only said he was drunk; but he himself said he had the plague upon him, which, it seems, was true; and, meeting this gentlewoman, he would kiss her. She was terribly frightened, as he was a rude fellow, and she run from him; but, the street being very thin of people, there was nobody near enough to help her. 
When she saw he would overtake her, she turned and gave him a thrust so forcibly, he being but weak, as pushed him down backward; but very unhappily, she being so near, he caught hold of her and pulled her down also, and, getting up first, mastered her and kissed her, and, which was worst of all, when he had done, told her he had the plague, and why should not she have it as well as he. She was frightened enough before; but when she heard him say he had the plague, she screamed out, and fell down into a swoon, or in a fit, which, though she recovered a little, yet killed her in a very few days; and I never heard whether she had the plague or no. Another infected person came and knocked at the door of a citizen's house where they knew him very well. The servant let him in, and, being told the master of the house was above, he ran up, and came into the room to them as the whole family were at supper. They began to rise up a little surprised, not knowing what the matter was; but he bid them sit still, he only come to take his leave of them. They asked him, "Why, Mr. ----, where are you going?"--"Going?" says he; "I have got the sickness, and shall die to-morrow night." It is easy to believe, though not to describe, the consternation they were all in. The women and the man's daughters, which[239] were but little girls, were frightened almost to death, and got up, one running out at one door and one at another, some downstairs and some upstairs, and, getting together as well as they could, locked themselves into their chambers, and screamed out at the windows for help, as if they had been frightened out of their wits. The master, more composed than they, though both frightened and provoked, was going to lay hands on him and throw him downstairs, being in a passion; but then, considering a little the condition of the man and the danger of touching him, horror seized his mind, and he stood still like one astonished. 
The poor distempered man, all this while, being as well diseased in his brain as in his body, stood still like one amazed. At length he turns round. "Ay!" says he with all the seeming calmness imaginable, "is it so with you all? Are you all disturbed at me? Why, then, I'll e'en go home and die there." And so he goes immediately downstairs. The servant that had let him in goes down after him with a candle, but was afraid to go past him and open the door; so he stood on the stairs to see what he would do. The man went and opened the door, and went out and flung[240] the door after him. It was some while before the family recovered the fright; but, as no ill consequence attended, they have had occasion since to speak of it, you may be sure, with great satisfaction. Though the man was gone, it was some time, nay, as I heard, some days, before they recovered themselves of the hurry they were in; nor did they go up and down the house with any assurance till they had burned a great variety of fumes and perfumes in all the rooms, and made a great many smokes of pitch, of gunpowder, and of sulphur. All separately shifted,[241] and washed their clothes, and the like. As to the poor man, whether he lived or died, I do not remember. It is most certain, that if, by the shutting up of houses, the sick had not been confined, multitudes, who in the height of their fever were delirious and distracted, would have been continually running up and down the streets; and even as it was, a very great number did so, and offered all sorts of violence to those they met, even just as a mad dog runs on and bites at every one he meets. Nor can I doubt but that, should one of those infected diseased creatures have bitten any man or woman while the frenzy of the distemper was upon them, they (I mean the person so wounded) would as certainly have been incurably infected as one that was sick before and had the tokens upon him. 
I heard of one infected creature, who, running out of his bed in his shirt, in the anguish and agony of his swellings (of which he had three upon him), got his shoes on, and went to put on his coat; but the nurse resisting, and snatching the coat from him, he threw her down, run over her, ran downstairs and into the street directly to the Thames, in his shirt, the nurse running after him, and calling to the watch to stop him. But the watchman, frightened at the man, and afraid to touch him, let him go on; upon which he ran down to the Still-Yard Stairs, threw away his shirt, and plunged into the Thames, and, being a good swimmer, swam quite over the river; and the tide being "coming in," as they call it (that is, running westward), he reached the land not till he came about the Falcon Stairs, where, landing and finding no people there, it being in the night, he ran about the streets there, naked as he was, for a good while, when, it being by that time high water, he takes the river again, and swam back to the Still Yard, landed, ran up the streets to his own house, knocking at the door, went up the stairs, and into his bed again; and[242] that this terrible experiment cured him of the plague, that is to say, that the violent motion of his arms and legs stretched the parts where the swellings he had upon him were (that is to say, under his arms and in his groin), and caused them to ripen and break; and that the cold of the water abated the fever in his blood. 
I have only to add, that I do not relate this, any more than some of the other, as a fact within my own knowledge, so as that I can vouch the truth of them; and especially that of the man being cured by the extravagant adventure, which I confess I do not think very possible, but it may serve to confirm the many desperate things which the distressed people, falling into deliriums and what we call light-headedness, were frequently run upon at that time, and how infinitely more such there would have been if such people had not been confined by the shutting up of houses; and this I take to be the best, if not the only good thing, which was performed by that severe method. On the other hand, the complaints and the murmurings were very bitter against the thing itself. It would pierce the hearts of all that came by, to hear the piteous cries of those infected people, who, being thus out of their understandings by the violence of their pain or the heat of their blood, were either shut in, or perhaps tied in their beds and chairs, to prevent their doing themselves hurt, and who would make a dreadful outcry at their being confined, and at their being not permitted to "die at large," as they called it, and as they would have done before. This running of distempered people about the streets was very dismal, and the magistrates did their utmost to prevent it; but as it was generally in the night, and always sudden, when such attempts were made, the officers could not be at hand to prevent it; and even when they got out in the day, the officers appointed did not care to meddle with them, because, as they were all grievously infected, to be sure, when they were come to that height, so they were more than ordinarily infectious, and it was one of the most dangerous things that could be to touch them. 
On the other hand, they generally ran on, not knowing what they did, till they dropped down stark dead, or till they had exhausted their spirits so as that they would fall and then die in perhaps half an hour or an hour; and, which was most piteous to hear, they were sure to come to themselves entirely in that half hour or hour, and then to make most grievous and piercing cries and lamentations, in the deep afflicting sense of the condition they were in. There was much of it before the order for shutting up of houses was strictly put into execution; for at first the watchmen were not so rigorous and severe as they were afterwards in the keeping the people in; that is to say, before they were (I mean some of them) severely punished for their neglect, failing in their duty, and letting people who were under their care slip away, or conniving at their going abroad, whether sick or well. But after they saw the officers appointed to examine into their conduct were resolved to have them do their duty, or be punished for the omission, they were more exact, and the people were strictly restrained; which was a thing they took so ill, and bore so impatiently, that their discontents can hardly be described; but there was an absolute necessity for it, that must be confessed, unless some other measures had been timely entered upon, and it was too late for that. Had not this particular of the sick being restrained as above been our case at that time, London would have been the most dreadful place that ever was in the world. There would, for aught I know, have as many people died in the streets as died in their houses: for when the distemper was at its height, it generally made them raving and delirious; and when they were so, they would never be persuaded to keep in their beds but by force; and many who were not tied threw themselves out of windows when they found they could not get leave to go out of their doors. 
It was for want of people conversing one with another in this time of calamity, that it was impossible any particular person could come at the knowledge of all the extraordinary cases that occurred in different families; and particularly, I believe it was never known to this day how many people in their deliriums drowned themselves in the Thames, and in the river which runs from the marshes by Hackney, which we generally called Ware River or Hackney River. As to those which were set down in the weekly bill, they were indeed few. Nor could it be known of any of those, whether they drowned themselves by accident or not; but I believe I might reckon up more who, within the compass of my knowledge or observation, really drowned themselves in that year than are put down in the bill of all put together, for many of the bodies were never found who yet were known to be lost; and the like in other methods of self-destruction. There was also one man in or about Whitecross Street burnt himself to death in his bed. Some said it was done by himself, others that it was by the treachery of the nurse that attended him; but that he had the plague upon him, was agreed by all. It was a merciful disposition of Providence, also, and which I have many times thought of at that time, that no fires, or no considerable ones at least, happened in the city during that year, which, if it had been otherwise, would have been very dreadful; and either the people must have let them alone unquenched, or have come together in great crowds and throngs, unconcerned at the danger of the infection, not concerned at the houses they went into, at the goods they handled, or at the persons or the people they came among. But so it was, that excepting that in Cripplegate Parish, and two or three little eruptions of fires, which were presently extinguished, there was no disaster of that kind happened in the whole year. 
They told us a story of a house in a place called Swan Alley, passing from Goswell Street near the end of Old Street into St. John Street, that a family was infected there in so terrible a manner that every one of the house died. The last person lay dead on the floor, and, as it is supposed, had laid herself all along to die just before the fire. The fire, it seems, had fallen from its place, being of wood, and had taken hold of the boards and the joists they lay on, and burned as far as just to the body, but had not taken hold of the dead body, though she had little more than her shift on, and had gone out of itself, not hurting the rest of the house, though it was a slight timber house. How true this might be, I do not determine; but the city being to suffer severely the next year by fire, this year it felt very little of that calamity. Indeed, considering the deliriums which the agony threw people into, and how I have mentioned in their madness, when they were alone, they did many desperate things, it was very strange there were no more disasters of that kind. It has been frequently asked me, and I cannot say that I ever knew how to give a direct answer to it, how it came to pass that so many infected people appeared abroad in the streets at the same time that the houses which were infected were so vigilantly searched, and all of them shut up and guarded as they were. I confess I know not what answer to give to this, unless it be this: that, in so great and populous a city as this is, it was impossible to discover every house that was infected as soon as it was so, or to shut up all the houses that were infected; so that people had the liberty of going about the streets, even where they pleased, unless they were known to belong to such and such infected houses. 
It is true, that, as the several physicians told my lord mayor, the fury of the contagion was such at some particular times, and people sickened so fast and died so soon, that it was impossible, and indeed to no purpose, to go about to inquire who was sick and who was well, or to shut them up with such exactness as the thing required, almost every house in a whole street being infected, and in many places every person in some of the houses. And, that which was still worse, by the time that the houses were known to be infected, most of the persons infected would be stone dead, and the rest run away for fear of being shut up; so that it was to very small purpose to call them infected houses and shut them up, the infection having ravaged and taken its leave of the house before it was really known that the family was any way touched. This might be sufficient to convince any reasonable person, that as it was not in the power of the magistrates, or of any human methods or policy, to prevent the spreading the infection, so that this way of shutting up of houses was perfectly insufficient for that end. Indeed, it seemed to have no manner of public good in it equal or proportionable to the grievous burthen that it was to the particular families that were so shut up; and, as far as I was employed by the public in directing that severity, I frequently found occasion to see that it was incapable of answering the end. For example, as I was desired as a visitor or examiner to inquire into the particulars of several families which were infected, we scarce came to any house where the plague had visibly appeared in the family but that some of the family were fled and gone. The magistrates would resent this, and charge the examiners with being remiss in their examination or inspection; but by that means houses were long infected before it was known. 
Now, as I was in this dangerous office but half the appointed time, which was two months, it was long enough to inform myself that we were no way capable of coming at the knowledge of the true state of any family but by inquiring at the door or of the neighbors. As for going into every house to search, that was a part no authority would offer to impose on the inhabitants, or any citizen would undertake; for it would have been exposing us to certain infection and death, and to the ruin of our own families as well as of ourselves. Nor would any citizen of probity, and that could be depended upon, have staid in the town if they had been made liable to such a severity. Seeing, then, that we could come at the certainty of things by no method but that of inquiry of the neighbors or of the family (and on that we could not justly depend), it was not possible but that the uncertainty of this matter would remain as above. It is true, masters of families were bound by the order to give notice to the examiner of the place wherein he lived, within two hours after he should discover it, of any person being sick in his house, that is to say, having signs of the infection; but they found so many ways to evade this, and excuse their negligence, that they seldom gave that notice till they had taken measures to have every one escape out of the house who had a mind to escape, whether they were sick or sound. 
And while this was so, it was easy to see that the shutting up of houses was no way to be depended upon as a sufficient method for putting a stop to the infection, because, as I have said elsewhere, many of those that so went out of those infected houses had the plague really upon them, though they might really think themselves sound; and some of these were the people that walked the streets till they fell down dead: not that they were suddenly struck with the distemper, as with a bullet that killed with the stroke, but that they really had the infection in their blood long before, only that, as it preyed secretly on their vitals, it appeared not till it seized the heart with a mortal power, and the patient died in a moment, as with a sudden fainting or an apoplectic fit. I know that some, even of our physicians, thought for a time that those people that so died in the streets were seized but that moment they fell, as if they had been touched by a stroke from heaven, as men are killed by a flash of lightning; but they found reason to alter their opinion afterward, for, upon examining the bodies of such after they were dead, they always either had tokens upon them, or other evident proofs of the distemper having been longer upon them than they had otherwise expected. This often was the reason that, as I have said, we that were examiners were not able to come at the knowledge of the infection being entered into a house till it was too late to shut it up, and sometimes not till the people that were left were all dead. In Petticoat Lane two houses together were infected, and several people sick; but the distemper was so well concealed, the examiner, who was my neighbor, got no knowledge of it till notice was sent him that the people were all dead, and that the carts should call there to fetch them away. 
The two heads of the families concerted their measures, and so ordered their matters as that, when the examiner was in the neighborhood, they appeared generally at a time, and answered, that is, lied for one another, or got some of the neighborhood to say they were all in health, and perhaps knew no better; till, death making it impossible to keep it any longer as a secret, the dead carts were called in the night to both the houses, and so it became public. But when the examiner ordered the constable to shut up the houses, there was nobody left in them but three people (two in one house, and one in the other), just dying, and a nurse in each house, who acknowledged that they had buried five before, that the houses had been infected nine or ten days, and that for all the rest of the two families, which were many, they were gone, some sick, some well, or, whether sick or well, could not be known. In like manner, at another house in the same lane, a man, having his family infected, but very unwilling to be shut up, when he could conceal it no longer, shut up himself; that is to say, he set the great red cross upon the door, with the words, "LORD, HAVE MERCY UPON US!" and so deluded the examiner, who supposed it had been done by the constable, by order of the other examiner (for there were two examiners to every district or precinct). By this means he had free egress and regress into his house again and out of it, as he pleased, notwithstanding it was infected, till at length his stratagem was found out, and then he, with the sound part of his family and servants, made off and escaped; so they were not shut up at all. 
These things made it very hard, if not impossible, as I have said, to prevent the spreading of an infection by the shutting up of houses, unless the people would think the shutting up of their houses no grievance, and be so willing to have it done as that they would give notice duly and faithfully to the magistrates of their being infected, as soon as it was known by themselves; but as that cannot be expected from them, and the examiners cannot be supposed, as above, to go into their houses to visit and search, all the good of shutting up houses will be defeated, and few houses will be shut up in time, except those of the poor, who cannot conceal it, and of some people who will be discovered by the terror and consternation which the thing put them into. I got myself discharged of the dangerous office I was in as soon as I could get another admitted, whom I had obtained for a little money to accept of it; and so, instead of serving the two months, which was directed, I was not above three weeks in it; and a great while too, considering it was in the month of August, at which time the distemper began to rage with great violence at our end of the town. In the execution of this office, I could not refrain speaking my opinion among my neighbors as to the shutting up the people in their houses, in which we saw most evidently the severities that were used, though grievous in themselves, had also this particular objection against them; namely, that they did not answer the end, as I have said, but that the distempered people went day by day about the streets. And it was our united opinion that a method to have removed the sound from the sick, in case of a particular house being visited, would have been much more reasonable on many accounts, leaving nobody with the sick persons but such as should, on such occasions, request to stay, and declare themselves content to be shut up with them. 
Our scheme for removing those that were sound from those that were sick was only in such houses as were infected; and confining the sick was no confinement: those that could not stir would not complain while they were in their senses, and while they had the power of judging. Indeed, when they came to be delirious and light-headed, then they would cry out of[243] the cruelty of being confined; but, for the removal of those that were well, we thought it highly reasonable and just, for their own sakes, they should be removed from the sick, and that, for other people's safety, they should keep retired for a while, to see that they were sound, and might not infect others; and we thought twenty or thirty days enough for this. Now, certainly, if houses had been provided on purpose for those that were sound, to perform this demiquarantine in, they would have much less reason to think themselves injured in such a restraint than in being confined with infected people in the houses where they lived. It is here, however, to be observed, that after the funerals became so many that people could not toll the bell, mourn or weep, or wear black for one another, as they did before, no, nor so much as make coffins for those that died, so, after a while, the fury of the infection appeared to be so increased, that, in short, they shut up no houses at all. It seemed enough that all the remedies of that kind had been used till they were found fruitless, and that the plague spread itself with an irresistible fury; so that, as the fire the succeeding year spread itself and burnt with such violence that the citizens in despair gave over their endeavors to extinguish it, so in the plague it came at last to such violence, that the people sat still looking at one another, and seemed quite abandoned to despair. 
Whole streets seemed to be desolated, and not to be shut up only, but to be emptied of their inhabitants: doors were left open, windows stood shattering with the wind in empty houses, for want of people to shut them. In a word, people began to give up themselves to their fears, and to think that all regulations and methods were in vain, and that there was nothing to be hoped for but an universal desolation. And it was even in the height of this general despair that it pleased God to stay his hand, and to slacken the fury of the contagion in such a manner as was even surprising, like its beginning, and demonstrated it to be his own particular hand; and that above, if not without the agency of means, as I shall take notice of in its proper place. But I must still speak of the plague as in its height, raging even to desolation, and the people under the most dreadful consternation, even, as I have said, to despair. It is hardly credible to what excesses the passions of men carried them in this extremity of the distemper; and this part, I think, was as moving as the rest. What could affect a man in his full power of reflection, and what could make deeper impressions on the soul, than to see a man almost naked, and got out of his house or perhaps out of his bed into the street, come out of Harrow Alley, a populous conjunction or collection of alleys, courts, and passages, in the Butcher Row in Whitechapel,--I say, what could be more affecting than to see this poor man come out into the open street, run, dancing and singing, and making a thousand antic gestures, with five or six women and children running after him, crying and calling upon him for the Lord's sake to come back, and entreating the help of others to bring him back, but all in vain, nobody daring to lay a hand upon him, or to come near him? 
This was a most grievous and afflicting thing to me, who saw it all from my own windows; for all this while the poor afflicted man was, as I observed it, even then in the utmost agony of pain, having, as they said, two swellings upon him, which could not be brought to break or to suppurate; but by laying strong caustics on them the surgeons had, it seems, hopes to break them, which caustics were then upon him, burning his flesh as with a hot iron. I cannot say what became of this poor man, but I think he continued roving about in that manner till he fell down and died. No wonder the aspect of the city itself was frightful. The usual concourse of the people in the streets, and which used to be supplied from our end of the town, was abated. The Exchange was not kept shut, indeed, but it was no more frequented. The fires were lost: they had been almost extinguished for some days by a very smart and hasty rain. But that was not all. Some of the physicians insisted that they were not only no benefit, but injurious to the health of the people. This they made a loud clamor about, and complained to the lord mayor about it. On the other hand, others of the same faculty, and eminent too, opposed them, and gave their reasons why the fires were and must be useful to assuage the violence of the distemper. I cannot give a full account of their arguments on both sides; only this I remember, that they caviled very much with one another. Some were for fires, but that they must be made of wood and not coal, and of particular sorts of wood too, such as fir, in particular, or cedar, because of the strong effluvia of turpentine; others were for coal and not wood, because of the sulphur and bitumen; and others were neither for one or other. 
Upon the whole, the lord mayor ordered no more fires, and especially on this account, namely, that the plague was so fierce that they saw evidently it defied all means, and rather seemed to increase than decrease upon any application to check and abate it; and yet this amazement of the magistrates proceeded rather from want of being able to apply any means successfully than from any unwillingness either to expose themselves or undertake the care and weight of business; for, to do them justice, they neither spared their pains nor their persons. But nothing answered. The infection raged, and the people were now terrified to the last degree, so that, as I may say, they gave themselves up, and, as I mentioned above, abandoned themselves to their despair. But let me observe here, that when I say the people abandoned themselves to despair, I do not mean to what men call a religious despair, or a despair of their eternal state; but I mean a despair of their being able to escape the infection, or to outlive the plague, which they saw was so raging, and so irresistible in its force, that indeed few people that were touched with it in its height, about August and September, escaped; and, which is very particular, contrary to its ordinary operation in June and July and the beginning of August, when, as I have observed, many were infected, and continued so many days, and then went off, after having had the poison in their blood a long time. But now, on the contrary, most of the people who were taken during the last two weeks in August, and in the first three weeks in September, generally died in two or three days at the farthest, and many the very same day they were taken. 
Whether the dog days[244] (as our astrologers pretended to express themselves, the influence of the Dog Star) had that malignant effect, or all those who had the seeds of infection before in them brought it up to a maturity at that time altogether, I know not; but this was the time when it was reported that above three thousand people died in one night; and they that would have us believe they more critically observed it pretend to say that they all died within the space of two hours, viz., between the hours of one and three in the morning. As to the suddenness of people dying at this time, more than before, there were innumerable instances of it, and I could name several in my neighborhood. One family without the bars, and not far from me, were all seemingly well on the Monday, being ten in family. That evening one maid and one apprentice were taken ill, and died the next morning, when the other apprentice and two children were touched, whereof one died the same evening and the other two on Wednesday. In a word, by Saturday at noon the master, mistress, four children, and four servants were all gone, and the house left entirely empty, except an ancient woman, who came to take charge of the goods for the master of the family's brother, who lived not far off, and who had not been sick. 
Many houses were then left desolate, all the people being carried away dead; and especially in an alley farther on the same side beyond the bars, going in at the sign of Moses and Aaron.[245] There were several houses together, which they said had not one person left alive in them; and some that died last in several of those houses were left a little too long before they were fetched out to be buried, the reason of which was not, as some have written very untruly, that the living were not sufficient to bury the dead, but that the mortality was so great in the yard or alley that there was nobody left to give notice to the buriers or sextons that there were any dead bodies there to be buried. It was said, how true I know not, that some of those bodies were so corrupted and so rotten, that it was with difficulty they were carried; and, as the carts could not come any nearer than to the alley gate in the High Street, it was so much the more difficult to bring them along. But I am not certain how many bodies were then left: I am sure that ordinarily it was not so. As I have mentioned how the people were brought into a condition to despair of life, and abandoned themselves, so this very thing had a strange effect among us for three or four weeks; that is, it made them bold and venturous. They were no more shy of one another, or restrained within doors, but went anywhere and everywhere, and began to converse. One would say to another, "I do not ask you how you are, or say how I am. It is certain we shall all go: so 'tis no matter who is sick or who is sound." And so they ran desperately into any place or company. As it brought the people into public company, so it was surprising how it brought them to crowd into the churches. 
They inquired no more into who[246] they sat near to or far from, what offensive smells they met with, or what condition the people seemed to be in; but, looking upon themselves all as so many dead corpses, they came to the churches without the least caution, and crowded together as if their lives were of no consequence compared to the work which they came about there. Indeed, the zeal which they showed in coming, and the earnestness and affection they showed in their attention to what they heard, made it manifest what a value people would all put upon the worship of God if they thought every day they attended at the church that it would be their last. Nor was it without other strange effects, for it took away all manner of prejudice at, or scruple about, the person whom they found in the pulpit when they came to the churches. It cannot be doubted but that many of the ministers of the parish churches were cut off among others in so common and dreadful a calamity; and others had not courage enough to stand it, but removed into the country as they found means for escape. As then some parish churches were quite vacant and forsaken, the people made no scruple of desiring such dissenters as had been a few years before deprived of their livings, by virtue of an act of Parliament called the "Act of Uniformity,"[247] to preach in the churches, nor did the church ministers in that case make any difficulty in accepting their assistance; so that many of those whom they called silent ministers had their mouths opened on this occasion, and preached publicly to the people. 
Here we may observe, and I hope it will not be amiss to take notice of it, that a near view of death would soon reconcile men of good principles one to another, and that it is chiefly owing to our easy situation in life, and our putting these things far from us, that our breaches are fomented, ill blood continued, prejudices, breach of charity and of Christian union so much kept and so far carried on among us as it is. Another plague year would reconcile all these differences; a close conversing with death, or with diseases that threaten death, would scum off the gall from our tempers, remove the animosities among us, and bring us to see with differing eyes than those which we looked on things with before. As the people who had been used to join with the church were reconciled at this time with the admitting the dissenters to preach to them, so the dissenters, who, with an uncommon prejudice, had broken off from the communion of the Church of England, were now content to come to their parish churches, and to conform to the worship which they did not approve of before. But, as the terror of the infection abated, those things all returned again to their less desirable channel, and to the course they were in before. I mention this but historically: I have no mind to enter into arguments to move either or both sides to a more charitable compliance one with another. I do not see that it is probable such a discourse would be either suitable or successful; the breaches seem rather to widen, and tend to a widening farther, than to closing: and who am I, that I should think myself able to influence either one side or other? But this I may repeat again, that it is evident death will reconcile us all: on the other side the grave we shall be all brethren again. In heaven, whither I hope we may come from all parties and persuasions, we shall find neither prejudice nor scruple: there we shall be of one principle and of one opinion. 
Why we cannot be content to go hand in hand to the place where we shall join heart and hand without the least hesitation, and with the most complete harmony and affection,--I say, why we cannot do so here, I can say nothing to; neither shall I say anything more of it, but that it remains to be lamented. I could dwell a great while upon the calamities of this dreadful time, and go on to describe the objects that appeared among us every day,--the dreadful extravagances which the distraction of sick people drove them into; how the streets began now to be fuller of frightful objects, and families to be made even a terror to themselves. But after I have told you, as I have above, that one man being tied in his bed, and finding no other way to deliver himself, set the bed on fire with his candle (which unhappily stood within his reach), and burned himself in bed; and how another, by the insufferable torment he bore, danced and sung naked in the streets, not knowing one ecstasy[248] from another,--I say, after I have mentioned these things, what can be added more? What can be said to represent the misery of these times more lively to the reader, or to give him a perfect idea of a more complicated distress? I must acknowledge that this time was so terrible that I was sometimes at the end of all my resolutions, and that I had not the courage that I had at the beginning. As the extremity brought other people abroad, it drove me home; and, except having made my voyage down to Blackwall and Greenwich, as I have related, which was an excursion, I kept afterwards very much within doors, as I had for about a fortnight before. I have said already that I repented several times that I had ventured to stay in town, and had not gone away with my brother and his family; but it was too late for that now. 
And after I had retreated and staid within doors a good while before my impatience led me abroad, then they called me, as I have said, to an ugly and dangerous office, which brought me out again; but as that was expired, while the height of the distemper lasted I retired again, and continued close ten or twelve days more, during which many dismal spectacles represented themselves in my view,[249] out of my own windows, and in our own street, as that particularly, from Harrow Alley, of the poor outrageous creature who danced and sung in his agony; and many others there were. Scarce a day or a night passed over but some dismal thing or other happened at the end of that Harrow Alley, which was a place full of poor people, most of them belonging to the butchers, or to employments depending upon the butchery. Sometimes heaps and throngs of people would burst out of the alley, most of them women, making a dreadful clamor, mixed or compounded of screeches, cryings, and calling one another, that we could not conceive what to make of it. Almost all the dead part of the night,[250] the dead cart stood at the end of that alley; for if it went in, it could not well turn again, and could go in but a little way. There, I say, it stood to receive dead bodies; and, as the churchyard was but a little way off, if it went away full, it would soon be back again. It is impossible to describe the most horrible cries and noise the poor people would make at their bringing the dead bodies of their children and friends out to the cart; and, by the number, one would have thought there had been none left behind, or that there were people enough for a small city living in those places. Several times they cried murder, sometimes fire; but it was easy to perceive that it was all distraction and the complaints of distressed and distempered people. 
I believe it was everywhere thus at that time, for the plague raged for six or seven weeks beyond all that I have expressed, and came even to such a height, that, in the extremity, they began to break into that excellent order of which I have spoken so much in behalf of the magistrates, namely, that no dead bodies were seen in the streets, or burials in the daytime; for there was a necessity in this extremity to bear with its being otherwise for a little while. One thing I cannot omit here, and indeed I thought it was extraordinary, at least it seemed a remarkable hand of divine justice; viz., that all the predictors, astrologers, fortune tellers, and what they called cunning men, conjurers, and the like, calculators of nativities, and dreamers of dreams, and such people, were gone and vanished; not one of them was to be found. I am verily persuaded that a great number of them fell in the heat of the calamity, having ventured to stay upon the prospect of getting great estates; and indeed their gain was but too great for a time, through the madness and folly of the people: but now they were silent; many of them went to their long home, not able to foretell their own fate, or to calculate their own nativities. Some have been critical enough to say[251] that every one of them died. I dare not affirm that; but this I must own, that I never heard of one of them that ever appeared after the calamity was over. But to return to my particular observations during this dreadful part of the visitation. I am now come, as I have said, to the month of September, which was the most dreadful of its kind, I believe, that ever London saw; for, by all the accounts which I have seen of the preceding visitations which have been in London, nothing has been like it, the number in the weekly bill amounting to almost forty thousands from the 22d of August to the 26th of September, being but five weeks. The particulars of the bills are as follows: viz.,--

    Aug. 22 to Aug. 29       7,496
    Aug. 29 to Sept. 5       8,252
    Sept. 5 to Sept. 12      7,690
    Sept. 12 to Sept. 19     8,297
    Sept. 19 to Sept. 26     6,460
                            ------
                            38,195

This was a prodigious number of itself; but if I should add the reasons which I have to believe that this account was deficient, and how deficient it was, you would with me make no scruple to believe that there died above ten thousand a week for all those weeks, one week with another, and a proportion for several weeks, both before and after. The confusion among the people, especially within the city, at that time was inexpressible. The terror was so great at last, that the courage of the people appointed to carry away the dead began to fail them; nay, several of them died, although they had the distemper before, and were recovered; and some of them dropped down when they have been carrying the bodies even at the pitside, and just ready to throw them in. And this confusion was greater in the city, because they had flattered themselves with hopes of escaping, and thought the bitterness of death was past. One cart, they told us, going up Shoreditch, was forsaken by the drivers, or, being left to one man to drive, he died in the street; and the horses, going on, overthrew the cart, and left the bodies, some thrown here, some there, in a dismal manner. Another cart was, it seems, found in the great pit in Finsbury Fields, the driver being dead, or having been gone and abandoned it; and the horses running too near it, the cart fell in, and drew the horses in also. It was suggested that the driver was thrown in with it, and that the cart fell upon him, by reason his whip was seen to be in the pit among the bodies; but that, I suppose, could not be certain. In our parish of Aldgate the dead carts were several times, as I have heard, found standing at the churchyard gate full of dead bodies, but neither bellman, or driver, or any one else, with it.
Neither in these or many other cases did they know what bodies they had in their cart, for sometimes they were let down with ropes out of balconies and out of windows, and sometimes the bearers brought them to the cart, sometimes other people; nor, as the men themselves said, did they trouble themselves to keep any account of the numbers. The vigilance of the magistrate was now put to the utmost trial, and, it must be confessed, can never be enough acknowledged on this occasion; also, whatever expense or trouble they were at, two things were never neglected in the city or suburbs either:--

1. Provisions were always to be had in full plenty, and the price not much raised neither, hardly worth speaking.

2. No dead bodies lay unburied or uncovered; and if any one walked from one end of the city to another, no funeral, or sign of it, was to be seen in the daytime, except a little, as I have said, in the first three weeks in September.

This last article, perhaps, will hardly be believed when some accounts which others have published since that shall be seen, wherein they say that the dead lay unburied, which I am sure was utterly false; at least, if it had been anywhere so, it must have been in houses where the living were gone from the dead, having found means, as I have observed, to escape, and where no notice was given to the officers. All which amounts to nothing at all in the case in hand; for this I am positive in, having myself been employed a little in the direction of that part of the parish in which I lived, and where as great a desolation was made, in proportion to the number of the inhabitants, as was anywhere. I say, I am sure that there were no dead bodies remained unburied; that is to say, none that the proper officers knew of, none for want of people to carry them off, and buriers to put them into the ground and cover them.
And this is sufficient to the argument; for what might lie in houses and holes, as in Moses and Aaron Alley, is nothing, for it is most certain they were buried as soon as they were found. As to the first article, namely, of provisions, the scarcity or dearness, though I have mentioned it before, and shall speak of it again, yet I must observe here.

1. The price of bread in particular was not much raised; for in the beginning of the year, viz., in the first week in March, the penny wheaten loaf was ten ounces and a half, and in the height of the contagion it was to be had at nine ounces and a half, and never dearer, no, not all that season; and about the beginning of November it was sold at ten ounces and a half again, the like of which, I believe, was never heard of, in any city under so dreadful a visitation, before.

2. Neither was there, which I wondered much at, any want of bakers or ovens kept open to supply the people with bread; but this was indeed alleged by some families, viz., that their maidservants, going to the bakehouses with their dough to be baked, which was then the custom, sometimes came home with the sickness, that is to say, the plague, upon them.

In all this dreadful visitation there were, as I have said before, but two pesthouses made use of; viz., one in the fields beyond Old Street, and one in Westminster. Neither was there any compulsion used in carrying people thither.
Indeed, there was no need of compulsion in the case, for there were thousands of poor distressed people, who having no help, or conveniences, or supplies, but of charity, would have been very glad to have been carried thither and been taken care of; which, indeed, was the only thing that, I think, was wanting in the whole public management of the city, seeing nobody was here allowed to be brought to the pesthouse but where money was given, or security for money, either at their introducing,[252] or upon their being cured and sent out; for very many were sent out again whole, and very good physicians were appointed to those places; so that many people did very well there, of which I shall make mention again. The principal sort of people sent thither were, as I have said, servants, who got the distemper by going of errands to fetch necessaries for the families where they lived, and who, in that case, if they came home sick, were removed to preserve the rest of the house; and they were so well looked after there, in all the time of the visitation, that there was but one hundred and fifty-six buried in all at the London pesthouse, and one hundred and fifty-nine at that of Westminster. By having more pesthouses, I am far from meaning a forcing all people into such places. Had the shutting up of houses been omitted, and the sick hurried out of their dwellings to pesthouses, as some proposed it seems at that time as well as since, it[253] would certainly have been much worse than it was. The very removing the sick would have been a spreading of the infection, and the rather because that removing could not effectually clear the house where the sick person was of the distemper; and the rest of the family, being then left at liberty, would certainly spread it among others. 
The methods, also, in private families which would have been universally used to have concealed the distemper, and to have concealed the persons being sick, would have been such that the distemper would sometimes have seized a whole family before any visitors or examiners could have known of it. On the other hand, the prodigious numbers which would have been sick at a time would have exceeded all the capacity of public pesthouses to receive them, or of public officers to discover and remove them. This was well considered in those days, and I have heard them talk of it often. The magistrates had enough to do to bring people to submit to having their houses shut up; and many ways they deceived the watchmen, and got out, as I observed. But that difficulty made it apparent that they would have found it impracticable to have gone the other way to work; for they could never have forced the sick people out of their beds and out of their dwellings: it must not have been my lord mayor's officers, but an army of officers, that must have attempted it. And the people, on the other hand, would have been enraged and desperate, and would have killed those that should have offered to have meddled with them or with their children and relations, whatever had befallen them for it; so that they would have made the people (who, as it was, were in the most terrible distraction imaginable), I say, they would have made them stark mad: whereas the magistrates found it proper on several occasions to treat them with lenity and compassion, and not with violence and terror, such as dragging the sick out of their houses, or obliging them to remove themselves, would have been. This leads me again to mention the time when the plague first began,[254] that is to say, when it became certain that it would spread over the whole town, when, as I have said, the better sort of people first took the alarm, and began to hurry themselves out of town. 
It was true, as I observed in its place, that the throng was so great, and the coaches, horses, wagons, and carts were so many, driving and dragging the people away, that it looked as if all the city was running away; and had any regulations been published that had been terrifying at that time, especially such as would pretend to dispose of the people otherwise than they would dispose of themselves, it would have put both the city and suburbs into the utmost confusion. The magistrates wisely caused the people to be encouraged, made very good by-laws[255] for the regulating the citizens, keeping good order in the streets, and making everything as eligible as possible to all sorts of people. In the first place, the lord mayor and the sheriffs,[256] the court of aldermen, and a certain number of the common councilmen, or their deputies, came to a resolution, and published it; viz., that they would not quit the city themselves, but that they would be always at hand for the preserving good order in every place, and for doing justice on all occasions, as also for the distributing the public charity to the poor, and, in a word, for the doing the duty and discharging the trust reposed in them by the citizens, to the utmost of their power. In pursuance of these orders, the lord mayor, sheriffs, etc., held councils every day, more or less, for making such dispositions as they found needful for preserving the civil peace; and though they used the people with all possible gentleness and clemency, yet all manner of presumptuous rogues, such as thieves, housebreakers, plunderers of the dead or of the sick, were duly punished; and several declarations were continually published by the lord mayor and court of aldermen against such. 
Also all constables and churchwardens were enjoined to stay in the city upon severe penalties, or to depute such able and sufficient housekeepers as the deputy aldermen or common councilmen of the precinct should approve, and for whom they should give security, and also security, in case of mortality, that they would forthwith constitute other constables in their stead. These things reëstablished the minds of the people very much, especially in the first of their fright, when they talked of making so universal a flight that the city would have been in danger of being entirely deserted of its inhabitants, except the poor, and the country of being plundered and laid waste by the multitude. Nor were the magistrates deficient in performing their part as boldly as they promised it; for my lord mayor and the sheriffs were continually in the streets and at places of the greatest danger; and though they did not care for having too great a resort of people crowding about them, yet in emergent cases they never denied the people access to them, and heard with patience all their grievances and complaints. My lord mayor had a low gallery built on purpose in his hall, where he stood, a little removed from the crowd, when any complaint came to be heard, that he might appear with as much safety as possible. Likewise the proper officers, called my lord mayor's officers, constantly attended in their turns, as they were in waiting; and if any of them were sick or infected, as some of them were, others were instantly employed to fill up, and officiate in their places till it was known whether the other should live or die. In like manner the sheriffs and aldermen did,[257] in their several stations and wards, where they were placed by office; and the sheriff's officers or sergeants were appointed to receive orders from the respective aldermen in their turn; so that justice was executed in all cases without interruption. 
In the next place, it was one of their particular cares to see the orders for the freedom of the markets observed; and in this part either the lord mayor, or one or both of the sheriffs, were every market day on horseback to see their orders executed, and to see that the country people had all possible encouragement and freedom in their coming to the markets and going back again, and that no nuisance or frightful object should be seen in the streets to terrify them, or make them unwilling to come. Also the bakers were taken under particular order, and the master of the Bakers' Company was, with his court of assistants, directed to see the order of my lord mayor for their regulation put in execution, and the due assize[258] of bread, which was weekly appointed by my lord mayor, observed; and all the bakers were obliged to keep their ovens going constantly, on pain of losing the privileges of a freeman of the city of London. By this means, bread was always to be had in plenty, and as cheap as usual, as I said above; and provisions were never wanting in the markets, even to such a degree that I often wondered at it, and reproached myself with being so timorous and cautious in stirring abroad, when the country people came freely and boldly to market, as if there had been no manner of infection in the city, or danger of catching it. It was indeed one admirable piece of conduct in the said magistrates, that the streets were kept constantly clear and free from all manner of frightful objects, dead bodies, or any such things as were indecent or unpleasant; unless where anybody fell down suddenly, or died in the streets, as I have said above, and these were generally covered with some cloth or blanket, or removed into the next churchyard till night. All the needful works that carried terror with them, that were both dismal and dangerous, were done in the night. 
If any diseased bodies were removed, or dead bodies buried, or infected clothes burned, it was done in the night; and all the bodies which were thrown into the great pits in the several churchyards or burying grounds, as has been observed, were so removed in the night, and everything was covered and closed before day. So that in the daytime there was not the least signal of the calamity to be seen or heard of, except what was to be observed from the emptiness of the streets, and sometimes from the passionate outcries and lamentations of the people, out at their windows, and from the numbers of houses and shops shut up. Nor was the silence and emptiness of the streets so much in the city as in the outparts, except just at one particular time, when, as I have mentioned, the plague came east, and spread over all the city. It was indeed a merciful disposition of God, that as the plague began at one end of the town first, as has been observed at large, so it proceeded progressively to other parts, and did not come on this way, or eastward, till it had spent its fury in the west part of the town; and so as it came on one way it abated another. For example:-- It began at St. Giles's and the Westminster end of the town, and it was in its height in all that part by about the middle of July, viz., in St. Giles-in-the-Fields, St. Andrew's, Holborn, St. Clement's-Danes, St. Martin's-in-the-Fields, and in Westminster. The latter end of July it decreased in those parishes, and, coming east, it increased prodigiously in Cripplegate, St. Sepulchre's, St. James's, Clerkenwell, and St. Bride's and Aldersgate. 
While it was in all these parishes, the city and all the parishes of the Southwark side of the water, and all Stepney, Whitechapel, Aldgate, Wapping, and Ratcliff, were very little touched; so that people went about their business unconcerned, carried on their trades, kept open their shops, and conversed freely with one another in all the city, the east and northeast suburbs, and in Southwark, almost as if the plague had not been among us. Even when the north and northwest suburbs were fully infected, viz., Cripplegate, Clerkenwell, Bishopsgate, and Shoreditch, yet still all the rest were tolerably well. For example:--

From the 25th of July to the 1st of August the bill stood thus of all diseases:--

    St. Giles's, Cripplegate                554
    St. Sepulchre's                         250
    Clerkenwell                             103
    Bishopsgate                             116
    Shoreditch                              110
    Stepney Parish                          127
    Aldgate                                  92
    Whitechapel                             104
    All the 97 parishes within the walls    228
    All the parishes in Southwark           205
                                          -----
                                          1,889

So that, in short, there died more that week in the two parishes of Cripplegate and St. Sepulchre's by forty-eight than all the city, all the east suburbs, and all the Southwark parishes put together. This caused the reputation of the city's health to continue all over England, and especially in the counties and markets adjacent, from whence our supply of provisions chiefly came, even much longer than that health itself continued; for when the people came into the streets from the country by Shoreditch and Bishopsgate, or by Old Street and Smithfield, they would see the outstreets empty, and the houses and shops shut, and the few people that were stirring there walk in the middle of the streets; but when they came within the city, there things looked better, and the markets and shops were open, and the people walking about the streets as usual, though not quite so many; and this continued till the latter end of August and the beginning of September.
But then the case altered quite; the distemper abated in the west and northwest parishes, and the weight of the infection lay on the city and the eastern suburbs, and the Southwark side, and this in a frightful manner. Then indeed the city began to look dismal, shops to be shut, and the streets desolate. In the High Street, indeed, necessity made people stir abroad on many occasions; and there would be in the middle of the day a pretty many[259] people, but in the mornings and evenings scarce any to be seen even there, no, not in Cornhill and Cheapside. These observations of mine were abundantly confirmed by the weekly bills of mortality for those weeks, an abstract of which, as they respect the parishes which I have mentioned, and as they make the calculations I speak of very evident, take as follows. The weekly bill which makes out this decrease of the burials in the west and north side of the city stands thus:--

    St. Giles's, Cripplegate                456
    St. Giles-in-the-Fields                 140
    Clerkenwell                              77
    St. Sepulchre's                         214
    St. Leonard, Shoreditch                 183
    Stepney Parish                          716
    Aldgate                                 629
    Whitechapel                             532
    In the 97 parishes within the walls   1,493
    In the 8 parishes on Southwark side   1,636
                                          -----
                                          6,076

Here is a strange change of things indeed, and a sad change it was; and, had it held for two months more than it did, very few people would have been left alive; but then such, I say, was the merciful disposition of God, that when it was thus, the west and north part, which had been so dreadfully visited at first, grew, as you see, much better; and, as the people disappeared here, they began to look abroad again there; and the next week or two altered it still more, that is, more to the encouragement of the other part of the town. For example:--

Sept. 19-26.

    St. Giles's, Cripplegate                277
    St. Giles-in-the-Fields                 119
    Clerkenwell                              76
    St. Sepulchre's                         193
    St. Leonard, Shoreditch                 146
    Stepney Parish                          616
    Aldgate                                 496
    Whitechapel                             346
    In the 97 parishes within the walls   1,268
    In the 8 parishes on Southwark side   1,390
                                          -----
                                          4,927

Sept. 26-Oct. 3.

    St. Giles's, Cripplegate                196
    St. Giles-in-the-Fields                  95
    Clerkenwell                              48
    St. Sepulchre's                         137
    St. Leonard, Shoreditch                 128
    Stepney Parish                          674
    Aldgate                                 372
    Whitechapel                             328
    In the 97 parishes within the walls   1,149
    In the 8 parishes on Southwark side   1,201
                                          -----
                                          4,328

And now the misery of the city, and of the said east and south parts, was complete indeed; for, as you see, the weight of the distemper lay upon those parts, that is to say, the city, the eight parishes over the river, with the parishes of Aldgate, Whitechapel, and Stepney, and this was the time that the bills came up to such a monstrous height as that I mentioned before, and that eight or nine, and, as I believe, ten or twelve thousand a week died; for it is my settled opinion that they[260] never could come at any just account of the numbers, for the reasons which I have given already. Nay, one of the most eminent physicians, who has since published in Latin an account of those times and of his observations, says that in one week there died twelve thousand people, and that particularly there died four thousand in one night; though I do not remember that there ever was any such particular night so remarkably fatal as that such a number died in it. However, all this confirms what I have said above of the uncertainty of the bills of mortality, etc., of which I shall say more hereafter. And here let me take leave to enter again, though it may seem a repetition of circumstances, into a description of the miserable condition of the city itself, and of those parts where I lived, at this particular time.
The city, and those other parts, notwithstanding the great numbers of people that were gone into the country, was[261] vastly full of people; and perhaps the fuller because people had for a long time a strong belief that the plague would not come into the city, nor into Southwark, no, nor into Wapping or Ratcliff at all; nay, such was the assurance of the people on that head, that many removed from the suburbs on the west and north sides into those eastern and south sides as for safety, and, as I verily believe, carried the plague amongst them there, perhaps sooner than they would otherwise have had it. Here, also, I ought to leave a further remark for the use of posterity, concerning the manner of people's infecting one another; namely, that it was not the sick people only from whom the plague was immediately received by others that were sound, but the well. To explain myself: by the sick people, I mean those who were known to be sick, had taken their beds, had been under cure, or had swellings or tumors upon them, and the like. These everybody could beware of: they were either in their beds, or in such condition as could not be concealed. By the well, I mean such as had received the contagion, and had it really upon them and in their blood, yet did not show the consequences of it in their countenances; nay, even were not sensible of it themselves, as many were not for several days. These breathed death in every place, and upon everybody who came near them; nay, their very clothes retained the infection; their hands would infect the things they touched, especially if they were warm and sweaty, and they were generally apt to sweat, too. Now, it was impossible to know these people, nor did they sometimes, as I have said, know themselves, to be infected. These were the people that so often dropped down and fainted in the streets; for oftentimes they would go about the streets to the last, till on a sudden they would sweat, grow faint, sit down at a door, and die. 
It is true, finding themselves thus, they would struggle hard to get home to their own doors, or at other times would be just able to go into their houses, and die instantly. Other times they would go about till they had the very tokens come out upon them, and yet not know it, and would die in an hour or two after they came home, but be well as long as they were abroad. These were the dangerous people; these were the people of whom the well people ought to have been afraid: but then, on the other side, it was impossible to know them. And this is the reason why it is impossible in a visitation to prevent the spreading of the plague by the utmost human vigilance; viz., that it is impossible to know the infected people from the sound, or that the infected people should perfectly know themselves. I knew a man who conversed freely in London all the season of the plague in 1665, and kept about him an antidote or cordial, on purpose to take when he thought himself in any danger; and he had such a rule to know, or have warning of the danger by, as indeed I never met with before or since: how far it may be depended on, I know not. He had a wound in his leg; and whenever he came among any people that were not sound, and the infection began to affect him, he said he could know it by that signal, viz., that the wound in his leg would smart, and look pale and white: so as soon as ever he felt it smart it was time for him to withdraw, or to take care of himself, taking his drink, which he always carried about him for that purpose. Now, it seems he found his wound would smart many times when he was in company with such who thought themselves to be sound, and who appeared so to one another; but he would presently rise up, and say publicly, "Friends, here is somebody in the room that has the plague," and so would immediately break up the company. 
This was, indeed, a faithful monitor to all people, that the plague is not to be avoided by those that converse promiscuously in a town infected, and people have it when they know it not, and that they likewise give it to others when they know not that they have it themselves; and in this case, shutting up the well or removing the sick will not do it, unless they can go back and shut up all those that the sick had conversed with, even before they knew themselves to be sick; and none knows how far to carry that back, or where to stop, for none knows when, or where, or how, they may have received the infection, or from whom. This I take to be the reason which makes so many people talk of the air being corrupted and infected, and that they need not be cautious of whom they converse with, for that the contagion was in the air. I have seen them in strange agitations and surprises on this account. "I have never come near any infected body," says the disturbed person; "I have conversed with none but sound healthy people, and yet I have gotten the distemper." "I am sure I am struck from Heaven," says another, and he falls to the serious part.[262] Again the first goes on exclaiming, "I have come near no infection, or any infected person; I am sure it is in the air; we draw in death when we breathe, and therefore it is the hand of God: there is no withstanding it." And this at last made many people, being hardened to the danger, grow less concerned at it, and less cautious towards the latter end of the time, and when it was come to its height, than they were at first. Then, with a kind of a Turkish predestinarianism,[263] they would say, if it pleased God to strike them, it was all one whether they went abroad, or staid at home: they could not escape it. And therefore they went boldly about, even into infected houses and infected company, visited sick people, and, in short, lay in the beds with their wives or relations when they were infected. 
And what was the consequence but the same that is the consequence in Turkey, and in those countries where they do those things, namely, that they were infected too, and died by hundreds and thousands? I would be far from lessening the awe of the judgments of God, and the reverence to his providence, which ought always to be on our minds on such occasions as these. Doubtless the visitation itself is a stroke from Heaven upon a city, or country, or nation, where it falls; a messenger of his vengeance, and a loud call to that nation, or country, or city, to humiliation and repentance, according to that of the prophet Jeremiah (xviii. 7, 8): "At what instant I shall speak concerning a nation, and concerning a kingdom, to pluck up, and to pull down, and to destroy it; if that nation, against whom I have pronounced, turn from their evil, I will repent of the evil that I thought to do unto them." Now, to prompt due impressions of the awe of God on the minds of men on such occasions, and not to lessen them, it is that I have left those minutes upon record. I say, therefore, I reflect upon no man for putting the reason of those things upon the immediate hand of God and the appointment and direction of his providence; nay, on the contrary, there were many wonderful deliverances of persons from infection, and deliverances of persons when infected, which intimate singular and remarkable providence in the particular instances to which they refer; and I esteem my own deliverance to be one next to miraculous, and do record it with thankfulness. But when I am speaking of the plague as a distemper arising from natural causes, we must consider it as it was really propagated by natural means. 
Nor is it at all the less a judgment for its being under the conduct of human causes and effects; for as the Divine Power has formed the whole scheme of nature, and maintains nature in its course, so the same Power thinks fit to let his own actings with men, whether of mercy or judgment, to go on in the ordinary course of natural causes, and he is pleased to act by those natural causes as the ordinary means, excepting and reserving to himself, nevertheless, a power to act in a supernatural way when he sees occasion. Now it is evident, that, in the case of an infection, there is no apparent extraordinary occasion for supernatural operation; but the ordinary course of things appears sufficiently armed, and made capable of all the effects that Heaven usually directs by a contagion. Among these causes and effects, this of the secret conveyance of infection, imperceptible and unavoidable, is more than sufficient to execute the fierceness of divine vengeance, without putting it upon supernaturals and miracles. The acute, penetrating nature of the disease itself was such, and the infection was received so imperceptibly, that the most exact caution could not secure us while in the place; but I must be allowed to believe--and I have so many examples fresh in my memory to convince me of it, that I think none can resist their evidence,--I say, I must be allowed to believe that no one in this whole nation ever received the sickness or infection, but who received it in the ordinary way of infection from somebody, or the clothes, or touch, or stench of somebody, that was infected before. 
The manner of its first coming to London proves this also, viz., by goods brought over from Holland, and brought thither from the Levant; the first breaking of it out in a house in Longacre where those goods were carried and first opened; its spreading from that house to other houses by the visible unwary conversing with those who were sick, and the infecting the parish officers who were employed about persons dead; and the like. These are known authorities for this great foundation point, that it went on and proceeded from person to person, and from house to house, and no otherwise. In the first house that was infected, there died four persons. A neighbor, hearing the mistress of the first house was sick, went to visit her, and went home and gave the distemper to her family, and died, and all her household. A minister called to pray with the first sick person in the second house was said to sicken immediately, and die, with several more in his house. Then the physicians began to consider, for they did not at first dream of a general contagion; but the physicians being sent to inspect the bodies, they assured the people that it was neither more or less than the plague, with all its terrifying particulars, and that it threatened an universal infection; so many people having already conversed with the sick or distempered, and having, as might be supposed, received infection from them, that it would be impossible to put a stop to it. 
Here the opinion of the physicians agreed with my observation afterwards, namely, that the danger was spreading insensibly: for the sick could infect none but those that came within reach of the sick person; but that one man, who may have really received the infection, and knows it not, but goes abroad and about as a sound person, may give the plague to a thousand people, and they to greater numbers in proportion, and neither the person giving the infection, nor the persons receiving it, know anything of it, and perhaps not feel the effects of it for several days after. For example:-- Many persons, in the time of this visitation, never perceived that they were infected till they found, to their unspeakable surprise, the tokens come out upon them, after which they seldom lived six hours; for those spots they called the tokens were really gangrene spots, or mortified flesh, in small knobs as broad as a little silver penny, and hard as a piece of callus[264] or horn; so that when the disease was come up to that length, there was nothing could follow but certain death. And yet, as I said, they knew nothing of their being infected, nor found themselves so much as out of order, till those mortal marks were upon them. But everybody must allow that they were infected in a high degree before, and must have been so some time; and consequently their breath, their sweat, their very clothes, were contagious for many days before. This occasioned a vast variety of cases, which physicians would have much more opportunity to remember than I; but some came within the compass of my observation or hearing, of which I shall name a few. A certain citizen who had lived safe and untouched till the month of September, when the weight of the distemper lay more in the city than it had done before, was mighty cheerful, and something too bold, as I think it was, in his talk of how secure he was, how cautious he had been, and how he had never come near any sick body. 
Says another citizen, a neighbor of his, to him one day, "Do not be too confident, Mr. ----: it is hard to say who is sick and who is well; for we see men alive and well to outward appearance one hour, and dead the next."--"That is true," says the first man (for he was not a man presumptuously secure, but had escaped a long while; and men, as I have said above, especially in the city, began to be overeasy on that score),--"that is true," says he. "I do not think myself secure; but I hope I have not been in company with any person that there has been any danger in."--"No!" says his neighbor. "Was not you at the Bull Head Tavern in Gracechurch Street, with Mr. ----, the night before last?"--"Yes," says the first, "I was; but there was nobody there that we had any reason to think dangerous." Upon which his neighbor said no more, being unwilling to surprise him. But this made him more inquisitive, and, as his neighbor appeared backward, he was the more impatient; and in a kind of warmth says he aloud, "Why, he is not dead, is he?" Upon which his neighbor still was silent, but cast up his eyes, and said something to himself; at which the first citizen turned pale, and said no more but this, "Then I am a dead man too!" and went home immediately, and sent for a neighboring apothecary to give him something preventive, for he had not yet found himself ill. But the apothecary, opening his breast, fetched a sigh, and said no more but this, "Look up to God." And the man died in a few hours. Now, let any man judge from a case like this if it is possible for the regulations of magistrates, either by shutting up the sick or removing them, to stop an infection which spreads itself from man to man even while they are perfectly well, and insensible of its approach, and may be so for many days. 
It may be proper to ask here how long it may be supposed men might have the seeds of the contagion in them before it discovered[265] itself in this fatal manner, and how long they might go about seemingly whole, and yet be contagious to all those that came near them. I believe the most experienced physicians cannot answer this question directly any more than I can; and something an ordinary observer may take notice of which may pass their observation. The opinion of physicians abroad seems to be, that it may lie dormant in the spirits, or in the blood vessels, a very considerable time: why else do they exact a quarantine of those who come into their harbors and ports from suspected places? Forty days is, one would think, too long for nature to struggle with such an enemy as this, and not conquer it or yield to it; but I could not think by my own observation that they can be infected, so as to be contagious to others, above fifteen or sixteen days at farthest; and on that score it was, that when a house was shut up in the city, and any one had died of the plague, but nobody appeared to be ill in the family for sixteen or eighteen days after, they were not so strict but that they[266] would connive at their going privately abroad; nor would people be much afraid of them afterwards, but rather think they were fortified the better, having not been vulnerable when the enemy was in their house: but we sometimes found it had lain much longer concealed. Upon the foot of all these observations I must say, that, though Providence seemed to direct my conduct to be otherwise, it is my opinion, and I must leave it as a prescription, viz., that the best physic against the plague is to run away from it. 
I know people encourage themselves by saying, "God is able to keep us in the midst of danger, and able to overtake us when we think ourselves out of danger;" and this kept thousands in the town whose carcasses went into the great pits by cartloads, and who, if they had fled from the danger, had, I believe, been safe from the disaster: at least, 'tis probable they had been safe. And were this very fundamental[267] only duly considered by the people on any future occasion of this or the like nature, I am persuaded it would put them upon quite different measures for managing the people from those that they took in 1665, or than any that have been taken abroad that I have heard of: in a word, they would consider of separating the people into smaller bodies, and removing them in time farther from one another, and not let such a contagion as this, which is indeed chiefly dangerous to collected bodies of people, find a million of people in a body together, as was very near the case before, and would certainly be the case if it should ever appear again. The plague, like a great fire, if a few houses only are contiguous where it happens, can only[268] burn a few houses; or if it begins in a single, or, as we call it, a lone house, can only burn that lone house where it begins; but if it begins in a close-built town or city, and gets ahead, there its fury increases, it rages over the whole place, and consumes all it can reach. I could propose many schemes on the foot of which the government of this city, if ever they should be under the apprehension of such another enemy, (God forbid they should!) 
might ease themselves of the greatest part of the dangerous people that belong to them: I mean such as the begging, starving, laboring poor, and among them chiefly those who, in a case of siege, are called the useless mouths; who, being then prudently, and to their own advantage, disposed of, and the wealthy inhabitants disposing of themselves, and of their servants and children, the city and its adjacent parts would be so effectually evacuated that there would not be above a tenth part of its people left together for the disease to take hold upon. But suppose them to be a fifth part, and that two hundred and fifty thousand people were left; and if it did seize upon them, they would, by their living so much at large, be much better prepared to defend themselves against the infection, and be less liable to the effects of it, than if the same number of people lived close together in one smaller city, such as Dublin, or Amsterdam, or the like. It is true, hundreds, yea thousands, of families fled away at this last plague; but then of them many fled too late, and not only died in their flight, but carried the distemper with them into the countries where they went, and infected those whom they went among for safety; which confounded[269] the thing, and made that be a propagation of the distemper which was the best means to prevent it. And this, too, is evident of it, and brings me back to what I only hinted at before, but must speak more fully to here, namely, that men went about apparently well many days after they had the taint of the disease in their vitals, and after their spirits were so seized as that they could never escape it; and that, all the while they did so, they were dangerous to others. 
I say, this proves that so it was; for such people infected the very towns they went through, as well as the families they went among; and it was by that means that almost all the great towns in England had the distemper among them more or less, and always they would tell you such a Londoner or such a Londoner brought it down. It must not be omitted,[270] that when I speak of those people who were really thus dangerous, I suppose them to be utterly ignorant of their own condition; for if they really knew their circumstances to be such as indeed they were, they must have been a kind of willful murderers if they would have gone abroad among healthy people, and it would have verified indeed the suggestion which I mentioned above, and which I thought seemed untrue, viz., that the infected people were utterly careless as to giving the infection to others, and rather forward to do it than not; and I believe it was partly from this very thing that they raised that suggestion, which I hope was not really true in fact. I confess no particular case is sufficient to prove a general; but I could name several people, within the knowledge of some of their neighbors and families yet living, who showed the contrary to an extreme. One man, the master of a family in my neighborhood, having had the distemper, he thought he had it given him by a poor workman whom he employed, and whom he went to his house to see, or went for some work that he wanted to have finished; and he had some apprehensions even while he was at the poor workman's door, but did not discover it[271] fully; but the next day it discovered itself, and he was taken very ill, upon which he immediately caused himself to be carried into an outbuilding which he had in his yard, and where there was a chamber over a workhouse, the man being a brazier. 
Here he lay, and here he died, and would be tended by none of his neighbors but by a nurse from abroad, and would not suffer his wife, nor children, nor servants, to come up into the room, lest they should be infected, but sent them his blessing and prayers for them by the nurse, who spoke it to them at a distance; and all this for fear of giving them the distemper, and without which, he knew, as they were kept up, they could not have it. And here I must observe also that the plague, as I suppose all distempers do, operated in a different manner on differing constitutions. Some were immediately overwhelmed with it, and it came to violent fevers, vomitings, insufferable headaches, pains in the back, and so up to ravings and ragings with those pains; others with swellings and tumors in the neck or groin, or armpits, which, till they could be broke, put them into insufferable agonies and torment; while others, as I have observed, were silently infected, the fever preying upon their spirits insensibly, and they seeing little of it till they fell into swooning and faintings, and death without pain. I am not physician enough to enter into the particular reasons and manner of these differing effects of one and the same distemper, and of its differing operation in several bodies; nor is it my business here to record the observations which I really made, because the doctors themselves have done that part much more effectually than I can do, and because my opinion may in some things differ from theirs. 
I am only relating what I know, or have heard, or believe, of the particular cases, and what fell within the compass of my view, and the different nature of the infection as it appeared in the particular cases which I have related; but this may be added too, that though the former sort of those cases, namely, those openly visited, were the worst for themselves as to pain (I mean those that had such fevers, vomitings, headaches, pains, and swellings), because they died in such a dreadful manner, yet the latter had the worst state of the disease; for in the former they frequently recovered, especially if the swellings broke; but the latter was inevitable death. No cure, no help, could be possible; nothing could follow but death. And it was worse, also, to others; because, as above, it secretly and unperceived by others or by themselves, communicated death to those they conversed with, the penetrating poison insinuating itself into their blood in a manner which it was impossible to describe, or indeed conceive. This infecting and being infected without so much as its being known to either person is evident from two sorts of cases which frequently happened at that time; and there is hardly anybody living, who was in London during the infection, but must have known several of the cases of both sorts. 1. Fathers and mothers have gone about as if they had been well, and have believed themselves to be so, till they have insensibly infected and been the destruction of their whole families; which they would have been far from doing if they had had the least apprehensions of their being unsound and dangerous themselves. 
A family, whose story I have heard, was thus infected by the father, and the distemper began to appear upon some of them even before he found it upon himself; but, searching more narrowly, it appeared he had been infected some time, and, as soon as he found that his family had been poisoned by himself, he went distracted, and would have laid violent hands upon himself, but was kept from that by those who looked to him; and in a few days he died. 2. The other particular is, that many people, having been well to the best of their own judgment, or by the best observation which they could make of themselves for several days, and only finding a decay of appetite, or a light sickness upon their stomachs,--nay, some whose appetite has been strong, and even craving, and only a light pain in their heads,--have sent for physicians to know what ailed them, and have been found, to their great surprise, at the brink of death, the tokens upon them, or the plague grown up to an incurable height. It was very sad to reflect how such a person as this last mentioned above had been a walking destroyer, perhaps for a week or fortnight before that; how he had ruined those that he would have hazarded his life to save, and had been breathing death upon them, even perhaps in his tender kissing and embracings of his own children. Yet thus certainly it was, and often has been, and I could give many particular cases where it has been so. If, then, the blow is thus insensibly striking; if the arrow flies thus unseen, and cannot be discovered,--to what purpose are all the schemes for shutting up or removing the sick people? Those schemes cannot take place but upon those that appear to be sick or to be infected; whereas there are among them at the same time thousands of people who seem to be well, but are all that while carrying death with them into all companies which they come into. 
This frequently puzzled our physicians, and especially the apothecaries and surgeons, who knew not how to discover the sick from the sound. They all allowed that it was really so; that many people had the plague in their very blood, and preying upon their spirits, and were in themselves but walking putrefied carcasses, whose breath was infectious, and their sweat poison, and yet were as well to look on as other people, and even knew it not themselves,--I say they all allowed that it was really true in fact, but they knew not how to propose a discovery.[272] My friend Dr. Heath was of opinion that it might be known by the smell of their breath; but then, as he said, who durst smell to that breath for his information, since to know it he must draw the stench of the plague up into his own brain in order to distinguish the smell? I have heard it was the opinion of others that it might be distinguished by the party's breathing upon a piece of glass, where, the breath condensing, there might living creatures be seen by a microscope, of strange, monstrous, and frightful shapes, such as dragons, snakes, serpents, and devils, horrible to behold. But this I very much question the truth of, and we had no microscopes at that time, as I remember, to make the experiment with.[273] It was the opinion, also, of another learned man that the breath of such a person would poison and instantly kill a bird, not only a small bird, but even a cock or hen; and that, if it did not immediately kill the latter, it would cause them to be roupy,[274] as they call it; particularly that, if they had laid any eggs at that time, they would be all rotten. But those are opinions which I never found supported by any experiments, or heard of others that had seen it,[275] so I leave them as I find them, only with this remark, namely, that I think the probabilities are very strong for them. 
Some have proposed that such persons should breathe hard upon warm water, and that they would leave an unusual scum upon it, or upon several other things, especially such as are of a glutinous substance, and are apt to receive a scum, and support it. But, from the whole, I found that the nature of this contagion was such that it was impossible to discover it at all, or to prevent it spreading from one to another by any human skill. Here was indeed one difficulty, which I could never thoroughly get over to this time, and which there is but one way of answering that I know of, and it is this; viz., the first person that died of the plague was on December 20th, or thereabouts, 1664, and in or about Longacre: whence the first person had the infection was generally said to be from a parcel of silks imported from Holland, and first opened in that house. But after this we heard no more of any person dying of the plague, or of the distemper being in that place, till the 9th of February, which was about seven weeks after, and then one more was buried out of the same house. Then it was hushed, and we were perfectly easy as to the public for a great while; for there were no more entered in the weekly bill to be dead of the plague till the 22d of April, when there were two more buried, not out of the same house, but out of the same street; and, as near as I can remember, it was out of the next house to the first. This was nine weeks asunder; and after this we had no more till a fortnight, and then it broke out in several streets, and spread every way. Now, the question seems to lie thus: Where lay the seeds of the infection all this while? how came it to stop so long, and not stop any longer? 
Either the distemper did not come immediately by contagion from body to body, or, if it did, then a body may be capable to continue infected, without the disease discovering itself, many days, nay, weeks together; even not a quarantine[276] of days only, but a soixantine,[277]--not only forty days, but sixty days, or longer. It is true there was, as I observed at first, and is well known to many yet living, a very cold winter and a long frost, which continued three months; and this, the doctors say, might check the infection. But then the learned must allow me to say, that if, according to their notion, the disease was, as I may say, only frozen up, it would, like a frozen river, have returned to its usual force and current when it thawed; whereas the principal recess of this infection, which was from February to April, was after the frost was broken and the weather mild and warm. But there is another way of solving all this difficulty, which I think my own remembrance of the thing will supply; and that is, the fact is not granted, namely, that there died none in those long intervals, viz., from the 20th of December to the 9th of February, and from thence to the 22d of April. 
The weekly bills are the only evidence on the other side, and those bills were not of credit enough, at least with me, to support an hypothesis, or determine a question of such importance as this; for it was our received opinion at that time, and I believe upon very good grounds, that the fraud lay in the parish officers, searchers, and persons appointed to give account of the dead, and what diseases they died of; and as people were very loath at first to have the neighbors believe their houses were infected, so they gave money to procure, or otherwise procured, the dead persons to be returned as dying of other distempers; and this I know was practiced afterwards in many places, I believe I might say in all places where the distemper came, as will be seen by the vast increase of the numbers placed in the weekly bills under other articles[278] of diseases during the time of the infection. For example, in the months of July and August, when the plague was coming on to its highest pitch, it was very ordinary to have from a thousand to twelve hundred, nay, to almost fifteen hundred, a week, of other distempers. Not that the numbers of those distempers were really increased to such a degree; but the great number of families and houses where really the infection was, obtained the favor to have their dead be returned of other distempers, to prevent the shutting up their houses. For example:--

        Dead of other Diseases besides the Plague.

    From the 18th to the 25th of July      942
    To the 1st of August                 1,004
    To the 8th                           1,213
    To the 15th                          1,439
    To the 22d                           1,331
    To the 29th                          1,394
    To the 5th of September              1,264
    To the 12th                          1,056
    To the 19th                          1,132
    To the 26th                            927

Now, it was not doubted but the greatest part of these, or a great part of them, were dead of the plague; but the officers were prevailed with to return them as above, and the numbers of some particular articles of distempers discovered is as follows:--

                        Aug. 1-8.   Aug. 8-15.  Aug. 15-22.  Aug. 22-29.
    Fever                     314          353          348          383
    Spotted fever             174          190          166          165
    Surfeit                    85           87           74           99
    Teeth                      90          113          111          133
                              ---          ---          ---          ---
                              663          743          699          780

                      Aug. 29-Sept. 5.   Sept. 5-12.  Sept. 12-19.  Sept. 19-26.
    Fever                          364           332           309           268
    Spotted fever                  157            97           101            65
    Surfeit                         68            45            49            36
    Teeth                          138           128           121           112
                                   ---           ---           ---           ---
                                   727           602           580           481

There were several other articles which bore a proportion to these, and which it is easy to perceive were increased on the same account; as aged,[279] consumptions, vomitings, imposthumes,[280] gripes, and the like, many of which were not doubted to be infected people; but as it was of the utmost consequence to families not to be known to be infected, if it was possible to avoid it, so they took all the measures they could to have it not believed, and if any died in their houses, to get them returned to the examiners, and by the searchers, as having died of other distempers. This, I say, will account for the long interval which, as I have said, was between the dying of the first persons that were returned in the bills to be dead of the plague, and the time when the distemper spread openly, and could not be concealed. Besides, the weekly bills themselves at that time evidently discover this truth; for while there was no mention of the plague, and no increase after it had been mentioned, yet it was apparent that there was an increase of those distempers which bordered nearest upon it. For example, there were eight, twelve, seventeen, of the spotted fever in a week when there were none or but very few of the plague; whereas before, one, three, or four were the ordinary weekly numbers of that distemper.
Likewise, as I observed before, the burials increased weekly in that particular parish and the parishes adjacent, more than in any other parish, although there were none set down of the plague; all which tell us that the infection was handed on, and the succession of the distemper really preserved, though it seemed to us at that time to be ceased, and to come again in a manner surprising. It might be, also, that the infection might remain in other parts of the same parcel of goods which at first it came in, and which might not be, perhaps, opened, or at least not fully, or in the clothes of the first infected person; for I cannot think that anybody could be seized with the contagion in a fatal and mortal degree for nine weeks together, and support his state of health so well as even not to discover it to themselves:[281] yet, if it were so, the argument is the stronger in favor of what I am saying, namely, that the infection is retained in bodies apparently well, and conveyed from them to those they converse with, while it is known to neither the one nor the other. Great were the confusions at that time upon this very account; and when people began to be convinced that the infection was received in this surprising manner from persons apparently well, they began to be exceeding shy and jealous of every one that came near them. Once, on a public day, whether a sabbath day or not I do not remember, in Aldgate Church, in a pew full of people, on a sudden one fancied she smelt an ill smell. Immediately she fancies the plague was in the pew, whispers her notion or suspicion to the next, then rises and goes out of the pew. It immediately took with the next, and so with them all; and every one of them, and of the two or three adjoining pews, got up and went out of the church, nobody knowing what it was offended them, or from whom. 
This immediately filled everybody's mouths with one preparation or other, such as the old women directed, and some, perhaps, as physicians directed, in order to prevent infection by the breath of others; insomuch, that if we came to go into a church when it was anything full of people, there would be such a mixture of smells at the entrance, that it was much more strong, though perhaps not so wholesome, than if you were going into an apothecary's or druggist's shop: in a word, the whole church was like a smelling bottle. In one corner it was all perfumes; in another, aromatics,[282] balsamics,[283] and a variety of drugs and herbs; in another, salts and spirits, as every one was furnished for their own preservation. Yet I observed that after people were possessed, as I have said, with the belief, or rather assurance, of the infection being thus carried on by persons apparently in health, the churches and meetinghouses were much thinner of people than at other times, before that, they used to be; for this is to be said of the people of London, that, during the whole time of the pestilence, the churches or meetings were never wholly shut up, nor did the people decline coming out to the public worship of God, except only in some parishes, when the violence of the distemper was more particularly in that parish at that time, and even then[284] no longer than it[285] continued to be so. Indeed, nothing was more strange than to see with what courage the people went to the public service of God, even at that time when they were afraid to stir out of their own houses upon any other occasion (this I mean before the time of desperation which I have mentioned already). This was a proof of the exceeding populousness of the city at the time of the infection, notwithstanding the great numbers that were gone into the country at the first alarm, and that fled out into the forests and woods when they were further terrified with the extraordinary increase of it. 
For when we came to see the crowds and throngs of people which appeared on the sabbath days at the churches, and especially in those parts of the town where the plague was abated, or where it was not yet come to its height, it was amazing. But of this I shall speak again presently. I return, in the mean time, to the article of infecting one another at first. Before people came to right notions of the infection and of infecting one another, people were only shy of those that were really sick. A man with a cap upon his head, or with cloths round his neck (which was the case of those that had swellings there),--such was indeed frightful; but when we saw a gentleman dressed, with his band[286] on, and his gloves in his hand, his hat upon his head, and his hair combed,--of such we had not the least apprehensions; and people conversed a great while freely, especially with their neighbors and such as they knew. But when the physicians assured us that the danger was as well from the sound (that is, the seemingly sound) as the sick, and that those people that thought themselves entirely free were oftentimes the most fatal; and that it came to be generally understood that people were sensible of it, and of the reason of it,--then, I say, they began to be jealous of everybody; and a vast number of people locked themselves up, so as not to come abroad into any company at all, nor suffer any that had been abroad in promiscuous company to come into their houses, or near them (at least not so near them as to be within the reach of their breath, or of any smell from them); and when they were obliged to converse at a distance with strangers, they would always have preservatives in their mouths and about their clothes, to repel and keep off the infection. 
It must be acknowledged that when people began to use these cautions they were less exposed to danger, and the infection did not break into such houses so furiously as it did into others before; and thousands of families were preserved, speaking with due reserve to the direction of Divine Providence, by that means. But it was impossible to beat anything into the heads of the poor. They went on with the usual impetuosity of their tempers, full of outcries and lamentations when taken, but madly careless of themselves, foolhardy, and obstinate, while they were well. Where they could get employment, they pushed into any kind of business, the most dangerous and the most liable to infection; and if they were spoken to, their answer would be, "I must trust to God for that. If I am taken, then I am provided for, and there is an end of me;" and the like. Or thus, "Why, what must I do? I cannot starve. I had as good have the plague as perish for want. I have no work: what could I do? I must do this, or beg." Suppose it was burying the dead, or attending the sick, or watching infected houses, which were all terrible hazards; but their tale was generally the same. It is true, necessity was a justifiable, warrantable plea, and nothing could be better; but their way of talk was much the same where the necessities were not the same. This adventurous conduct of the poor was that which brought the plague among them in a most furious manner; and this, joined to the distress of their circumstances when taken, was the reason why they died so by heaps; for I cannot say I could observe one jot of better husbandry[287] among them (I mean the laboring poor) while they were all well and getting money than there was before; but[288] as lavish, as extravagant, and as thoughtless for to-morrow as ever; so that when they came to be taken sick, they were immediately in the utmost distress, as well for want as for sickness, as well for lack of food as lack of health. 
The misery of the poor I had many occasions to be an eyewitness of, and sometimes, also, of the charitable assistance that some pious people daily gave to such, sending them relief and supplies, both of food, physic, and other help, as they found they wanted. And indeed it is a debt of justice due to the temper of the people of that day, to take notice here, that not only great sums, very great sums of money, were charitably sent to the lord mayor and aldermen for the assistance and support of the poor distempered people, but abundance of private people daily distributed large sums of money for their relief, and sent people about to inquire into the condition of particular distressed and visited families, and relieved them. Nay, some pious ladies were transported with zeal in so good a work, and so confident in the protection of Providence in discharge of the great duty of charity, that they went about in person distributing alms to the poor, and even visiting poor families, though sick and infected, in their very houses, appointing nurses to attend those that wanted attending, and ordering apothecaries and surgeons, the first to supply them with drugs or plasters, and such things as they wanted, and the last to lance and dress the swellings and tumors, where such were wanting; giving their blessing to the poor in substantial relief to them, as well as hearty prayers for them. I will not undertake to say, as some do, that none of those charitable people were suffered to fall under the calamity itself; but this I may say, that I never knew any one of them that miscarried, which I mention for the encouragement of others in case of the like distress; and doubtless if they that give to the poor lend to the Lord, and he will repay them, those that hazard their lives to give to the poor, and to comfort and assist the poor in such misery as this, may hope to be protected in the work. 
Nor was this charity so extraordinary eminent only in a few; but (for I cannot lightly quit this point) the charity of the rich, as well in the city and suburbs as from the country, was so great, that in a word a prodigious number of people, who must otherwise have perished for want as well as sickness, were supported and subsisted by it; and though I could never, nor I believe any one else, come to a full knowledge of what was so contributed, yet I do believe, that, as I heard one say that was a critical observer of that part,[289] there was not only many thousand pounds contributed, but many hundred thousand pounds, to the relief of the poor of this distressed, afflicted city. Nay, one man affirmed to me that he could reckon up above one hundred thousand pounds a week which was distributed by the churchwardens at the several parish vestries, by the lord mayor and the aldermen in the several wards and precincts, and by the particular direction of the court and of the justices respectively in the parts where they resided, over and above the private charity distributed by pious hands in the manner I speak of; and this continued for many weeks together. I confess this is a very great sum; but if it be true that there was distributed, in the parish of Cripplegate only, seventeen thousand eight hundred pounds in one week to the relief of the poor, as I heard reported, and which I really believe was true, the other may not be improbable. It was doubtless to be reckoned among the many signal good providences which attended this great city, and of which there were many other worth recording. 
I say, this was a very remarkable one, that it pleased God thus to move the hearts of the people in all parts of the kingdom so cheerfully to contribute to the relief and support of the poor at London; the good consequences of which were felt many ways, and particularly in preserving the lives and recovering the health of so many thousands, and keeping so many thousands of families from perishing and starving. And now I am talking of the merciful disposition of Providence in this time of calamity, I cannot but mention again, though I have spoken several times of it already on other accounts (I mean that of the progression of the distemper), how it began at one end of the town, and proceeded gradually and slowly from one part to another, and like a dark cloud that passes over our heads, which, as it thickens and overcasts the air at one end, clears up at the other end: so, while the plague went on raging from west to east, as it went forwards east, it abated in the west; by which means those parts of the town which were not seized, or who[290] were left, and where it had spent its fury, were (as it were) spared to help and assist the other: whereas, had the distemper spread itself over the whole city and suburbs at once, raging in all places alike, as it has done since in some places abroad, the whole body of the people must have been overwhelmed, and there would have died twenty thousand a day, as they say there did at Naples, nor would the people have been able to have helped or assisted one another. For it must be observed that where the plague was in its full force, there indeed the people were very miserable, and the consternation was inexpressible; but a little before it reached even to that place, or presently after it was gone, they were quite another sort of people; and I cannot but acknowledge that there was too much of that common temper of mankind to be found among us all at that time, namely, to forget the deliverance when the danger is past. 
But I shall come to speak of that part again. It must not be forgot here to take some notice of the state of trade during the time of this common calamity; and this with respect to foreign trade, as also to our home trade. As to foreign trade, there needs little to be said. The trading nations of Europe were all afraid of us. No port of France, or Holland, or Spain, or Italy, would admit our ships, or correspond with us. Indeed, we stood on ill terms with the Dutch, and were in a furious war with them, though in a bad condition to fight abroad, who had such dreadful enemies to struggle with at home. Our merchants were accordingly at a full stop. Their ships could go nowhere; that is to say, to no place abroad. Their manufactures and merchandise, that is to say, of our growth, would not be touched abroad. They were as much afraid of our goods as they were of our people; and indeed they had reason, for our woolen manufactures are as retentive of infection as human bodies, and, if packed up by persons infected, would receive the infection, and be as dangerous to the touch as a man would be that was infected; and therefore when any English vessel arrived in foreign countries, if they did take the goods on shore, they always caused the bales to be opened and aired in places appointed for that purpose. But from London they would not suffer them to come into port, much less to unload their goods, upon any terms whatever; and this strictness was especially used with them in Spain and Italy. In Turkey and the islands of the Arches,[291] indeed, as they are called, as well those belonging to the Turks as to the Venetians, they were not so very rigid. 
In the first there was no obstruction at all, and four ships which were then in the river loading for Italy (that is, for Leghorn and Naples) being denied product, as they call it, went on to Turkey, and were freely admitted to unlade their cargo without any difficulty, only that when they arrived there, some of their cargo was not fit for sale in that country, and other parts of it being consigned to merchants at Leghorn, the captains of the ships had no right nor any orders to dispose of the goods; so that great inconveniences followed to the merchants. But this was nothing but what the necessity of affairs required; and the merchants at Leghorn and Naples, having notice given them, sent again from thence to take care of the effects which were particularly consigned to those ports, and to bring back in other ships such as were improper for the markets at Smyrna[292] and Scanderoon.[293] The inconveniences in Spain and Portugal were still greater; for they would by no means suffer our ships, especially those from London, to come into any of their ports, much less to unlade. There was a report that one of our ships having by stealth delivered her cargo, among which were some bales of English cloth, cotton, kerseys, and such like goods, the Spaniards caused all the goods to be burned, and punished the men with death who were concerned in carrying them on shore. This I believe was in part true, though I do not affirm it; but it is not at all unlikely, seeing the danger was really very great, the infection being so violent in London. I heard likewise that the plague was carried into those countries by some of our ships, and particularly to the port of Faro, in the kingdom of Algarve,[294] belonging to the King of Portugal, and that several persons died of it there; but it was not confirmed. 
On the other hand, though the Spaniards and Portuguese were so shy of us, it is most certain that the plague, as has been said, keeping at first much at that end of the town next Westminster, the merchandising part of the town, such as the city and the waterside, was perfectly sound till at least the beginning of July, and the ships in the river till the beginning of August; for to the 1st of July there had died but seven within the whole city, and but sixty within the liberties; but one in all the parishes of Stepney, Aldgate, and Whitechapel, and but two in all the eight parishes of Southwark. But it was the same thing abroad, for the bad news was gone over the whole world, that the city of London was infected with the plague; and there was no inquiring there how the infection proceeded, or at which part of the town it was begun or was reached to. Besides, after it began to spread, it increased so fast, and the bills grew so high all on a sudden, that it was to no purpose to lessen the report of it, or endeavor to make the people abroad think it better than it was. The account which the weekly bills gave in was sufficient; and that there died two thousand to three or four thousand a week was sufficient to alarm the whole trading part of the world: and the following time being so dreadful also in the very city itself, put the whole world, I say, upon their guard against it. You may be sure also that the report of these things lost nothing in the carriage. 
The plague was itself very terrible, and the distress of the people very great, as you may observe of what I have said, but the rumor was infinitely greater; and it must not be wondered that our friends abroad, as my brother's correspondents in particular, were told there (namely, in Portugal and Italy, where he chiefly traded), that in London there died twenty thousand in a week; that the dead bodies lay unburied by heaps; that the living were not sufficient to bury the dead, or the sound to look after the sick; that all the kingdom was infected likewise, so that it was an universal malady such as was never heard of in those parts of the world. And they could hardly believe us when we gave them an account how things really were; and how there was not above one tenth part of the people dead; that there were five hundred thousand left that lived all the time in the town; that now the people began to walk the streets again, and those who were fled to return; there was no miss of the usual throng of people in the streets, except as every family might miss their relations and neighbors; and the like. I say, they could not believe these things; and if inquiry were now to be made in Naples, or in other cities on the coast of Italy, they would tell you there was a dreadful infection in London so many years ago, in which, as above, there died twenty thousand in a week, etc., just as we have had it reported in London that there was a plague in the city of Naples in the year 1656, in which there died twenty thousand people in a day, of which I have had very good satisfaction that it was utterly false. 
But these extravagant reports were very prejudicial to our trade, as well as unjust and injurious in themselves; for it was a long time after the plague was quite over before our trade could recover itself in those parts of the world; and the Flemings[295] and Dutch, but especially the last, made very great advantages of it, having all the market to themselves, and even buying our manufactures in the several parts of England where the plague was not, and carrying them to Holland and Flanders, and from thence transporting them to Spain and to Italy, as if they had been of their own making. But they were detected sometimes, and punished, that is to say, their goods confiscated, and ships also; for if it was true that our manufactures as well as our people were infected, and that it was dangerous to touch or to open and receive the smell of them, then those people ran the hazard, by that clandestine trade, not only of carrying the contagion into their own country, but also of infecting the nations to whom they traded with those goods; which, considering how many lives might be lost in consequence of such an action, must be a trade that no men of conscience could suffer themselves to be concerned in. I do not take upon me to say that any harm was done, I mean of that kind, by those people; but I doubt I need not make any such proviso in the case of our own country; for either by our people of London, or by the commerce, which made their conversing with all sorts of people in every county, and of every considerable town, necessary,--I say, by this means the plague was first or last spread all over the kingdom, as well in London as in all the cities and great towns, especially in the trading manufacturing towns and seaports: so that first or last all the considerable places in England were visited more or less, and the kingdom of Ireland in some places, but not so universally. How it fared with the people in Scotland, I had no opportunity to inquire. 
It is to be observed, that, while the plague continued so violent in London, the outports, as they are called, enjoyed a very great trade, especially to the adjacent countries and to our own plantations.[296] For example, the towns of Colchester, Yarmouth, and Hull, on that side[297] of England, exported to Holland and Hamburg the manufactures of the adjacent counties for several months after the trade with London was, as it were, entirely shut up. Likewise the cities of Bristol[298] and Exeter, with the port of Plymouth, had the like advantage to Spain, to the Canaries, to Guinea, and to the West Indies, and particularly to Ireland. But as the plague spread itself every way after it had been in London to such a degree as it was in August and September, so all or most of those cities and towns were infected first or last, and then trade was, as it were, under a general embargo, or at a full stop, as I shall observe further when I speak of our home trade. One thing, however, must be observed, that as to ships coming in from abroad (as many, you may be sure, did), some who were out in all parts of the world a considerable while before, and some who, when they went out, knew nothing of an infection, or at least of one so terrible,--these came up the river boldly, and delivered their cargoes as they were obliged to do, except just in the two months of August and September, when, the weight of the infection lying, as I may say, all below bridge, nobody durst appear in business for a while. But as this continued but for a few weeks, the homeward-bound ships, especially such whose cargoes were not liable to spoil, came to an anchor, for a time, short of the Pool, or freshwater part of the river, even as low as the river Medway, where several of them ran in; and others lay at the Nore, and in the Hope below Gravesend: so that by the latter end of October there was a very great fleet of homeward-bound ships to come up, such as the like had not been known for many years. 
Two particular trades were carried on by water carriage all the while of the infection, and that with little or no interruption, very much to the advantage and comfort of the poor distressed people of the city; and those were the coasting trade for corn, and the Newcastle trade for coals. The first of these was particularly carried on by small vessels from the port of Hull, and other places in the Humber, by which great quantities of corn were brought in from Yorkshire and Lincolnshire; the other part of this corn trade was from Lynn in Norfolk, from Wells, and Burnham, and from Yarmouth, all in the same county; and the third branch was from the river Medway, and from Milton, Feversham, Margate, and Sandwich, and all the other little places and ports round the coast of Kent and Essex.[299] There was also a very good trade from the coast of Suffolk, with corn, butter, and cheese. These vessels kept a constant course of trade, and without interruption came up to that market known still by the name of Bear Key, where they supplied the city plentifully with corn when land carriage began to fail, and when the people began to be sick of coming from many places in the country. This also was much of it owing to the prudence and conduct of the lord mayor, who took such care to keep the masters and seamen from danger when they came up, causing their corn to be bought off at any time they wanted a market (which, however, was very seldom), and causing the cornfactors[300] immediately to unlade and deliver the vessels laden with corn, that they had very little occasion to come out of their ships or vessels, the money being always carried on board to them, and put into a pail of vinegar before it was carried. 
The second trade was that of coals from Newcastle-upon-Tyne, without which the city would have been greatly distressed; for not in the streets only, but in private houses and families, great quantities of coal were then burnt, even all the summer long, and when the weather was hottest, which was done by the advice of the physicians. Some, indeed, opposed it, and insisted that to keep the houses and rooms hot was a means to propagate the distemper, which was a fermentation and heat already in the blood; that it was known to spread and increase in hot weather, and abate in cold; and therefore they alleged that all contagious distempers are the worst for heat, because the contagion was nourished, and gained strength, in hot weather, and was, as it were, propagated in heat. Others said they granted that heat in the climate might propagate infection, as sultry hot weather fills the air with vermin, and nourishes innumerable numbers and kinds of venomous creatures, which breed in our food, in the plants, and even in our bodies, by the very stench of which infection may be propagated; also that heat in the air, or heat of weather, as we ordinarily call it, makes bodies relax and faint, exhausts the spirits, opens the pores, and makes us more apt to receive infection or any evil influence, be it from noxious, pestilential vapors, or any other thing in the air; but that the heat of fire, and especially of coal fires, kept in our houses or near us, had quite a different operation, the heat being not of the same kind, but quick and fierce, tending not to nourish, but to consume and dissipate, all those noxious fumes which the other kind of heat rather exhaled, and stagnated than separated, and burnt up. 
Besides, it was alleged that the sulphureous and nitrous particles that are often found to be in the coal, with that bituminous substance which burns, are all assisting to clear and purge the air, and render it wholesome and safe to breathe in, after the noxious particles (as above) are dispersed and burnt up. The latter opinion prevailed at that time, and, as I must confess, I think with good reason; and the experience of the citizens confirmed it, many houses which had constant fires kept in the rooms having never been infected at all; and I must join my experience to it, for I found the keeping of good fires kept our rooms sweet and wholesome, and I do verily believe made our whole family so, more than would otherwise have been. But I return to the coals as a trade. It was with no little difficulty that this trade was kept open, and particularly because, as we were in an open war with the Dutch at that time, the Dutch capers[301] at first took a great many of our collier ships, which made the rest cautious, and made them to stay to come in fleets together. But after some time the capers were either afraid to take them, or their masters, the States, were afraid they should, and forbade them, lest the plague should be among them, which made them fare the better. For the security of those northern traders, the coal ships were ordered by my lord mayor not to come up into the Pool above a certain number at a time; and[302] ordered lighters and other vessels, such as the woodmongers (that is, the wharf keepers) or coal sellers furnished, to go down and take out the coals as low as Deptford and Greenwich, and some farther down. 
Others delivered great quantities of coals in particular places where the ships could come to the shore, as at Greenwich, Blackwall, and other places, in vast heaps, as if to be kept for sale; but[303] were then fetched away after the ships which brought them were gone; so that the seamen had no communication with the river men, nor so much as came near one another.[304] Yet all this caution could not effectually prevent the distemper getting among the colliery, that is to say, among the ships, by which a great many seamen died of it; and that which was still worse was, that they carried it down to Ipswich and Yarmouth, to Newcastle-upon-Tyne, and other places on the coast, where, especially at Newcastle and at Sunderland, it carried off a great number of people. The making so many fires as above did indeed consume an unusual quantity of coals; and that upon one or two stops of the ships coming up (whether by contrary weather or by the interruption of enemies, I do not remember); but the price of coals was exceedingly dear, even as high as four pounds a chaldron;[305] but it soon abated when the ships came in, and, as afterwards they had a freer passage, the price was very reasonable all the rest of that year. The public fires which were made on these occasions, as I have calculated it, must necessarily have cost the city about two hundred chaldron of coals a week, if they had continued, which was indeed a very great quantity; but as it was thought necessary, nothing was spared. However, as some of the physicians cried them down, they were not kept alight above four or five days. 
The fires were ordered thus:-- One at the Custom House; one at Billingsgate; one at Queenhithe, and one at the Three Cranes; one in Blackfriars, and one at the gate of Bridewell; one at the corner of Leadenhall Street and Gracechurch; one at the north and one at the south gate of the Royal Exchange; one at Guildhall, and one at Blackwell Hall gate; one at the lord mayor's door in St. Helen's; one at the west entrance into St. Paul's; and one at the entrance into Bow Church. I do not remember whether there was any at the city gates, but one at the bridge foot there was, just by St. Magnus Church. I know some have quarreled since that at the experiment, and said that there died the more people because of those fires; but I am persuaded those that say so offer no evidence to prove it, neither can I believe it on any account whatever. It remains to give some account of the state of trade at home in England during this dreadful time, and particularly as it relates to the manufactures and the trade in the city. At the first breaking out of the infection there was, as it is easy to suppose, a very great fright among the people, and consequently a general stop of trade, except in provisions and necessaries of life; and even in those things, as there was a vast number of people fled and a very great number always sick, besides the number which died, so there could not be above two thirds, if above one half, of the consumption of provisions in the city as used to be. It pleased God to send a very plentiful year of corn and fruit, and not of hay or grass, by which means bread was cheap by reason of the plenty of corn, flesh was cheap by reason of the scarcity of grass, but butter and cheese were dear for the same reason; and hay in the market, just beyond Whitechapel Bars, was sold at four pounds per load; but that affected not the poor. There was a most excessive plenty of all sorts of fruit, such as apples, pears, plums, cherries, grapes; and they were the cheaper because of the wants of the people; but this made the poor eat them to excess, and this brought them into surfeits and the like, which often precipitated them into the plague. But to come to matters of trade. 
First, foreign exportation being stopped, or at least very much interrupted and rendered difficult, a general stop of all those manufactures followed of course, which were usually brought for exportation; and, though sometimes merchants abroad were importunate for goods, yet little was sent, the passages being so generally stopped that the English ships would not be admitted, as is said already, into their port. This put a stop to the manufactures that were for exportation in most parts of England, except in some outports; and even that was soon stopped, for they all had the plague in their turn. But though this was felt all over England, yet, what was still worse, all intercourse of trade for home consumption of manufactures, especially those which usually circulated through the Londoners' hands, was stopped at once, the trade of the city being stopped. All kinds of handicrafts in the city, etc., tradesmen and mechanics, were, as I have said before, out of employ; and this occasioned the putting off and dismissing an innumerable number of journeymen and workmen of all sorts, seeing nothing was done relating to such trades but what might be said to be absolutely necessary. This caused the multitude of single people in London to be unprovided for, as also of families whose living depended upon the labor of the heads of those families. I say, this reduced them to extreme misery; and I must confess it is for the honor of the city of London, and will be for many ages, as long as this is to be spoken of, that they were able to supply with charitable provision the wants of so many thousands of those as afterwards fell sick and were distressed; so that it may be safely averred that nobody perished for want, at least that the magistrates had any notice given them of. 
This stagnation of our manufacturing trade in the country would have put the people there to much greater difficulties, but that the master workmen, clothiers, and others, to the uttermost of their stocks and strength, kept on making their goods to keep the poor at work, believing that, as soon as the sickness should abate, they would have a quick demand in proportion to the decay of their trade at that time; but as none but those masters that were rich could do thus, and that many were poor and not able, the manufacturing trade in England suffered greatly, and the poor were pinched all over England by the calamity of the city of London only. It is true that the next year made them full amends by another terrible calamity upon the city; so that the city by one calamity impoverished and weakened the country, and by another calamity (even terrible, too, of its kind) enriched the country, and made them again amends: for an infinite quantity of household stuff, wearing apparel, and other things, besides whole warehouses filled with merchandise and manufactures, such as come from all parts of England, were consumed in the fire of London the next year after this terrible visitation. It is incredible what a trade this made all over the whole kingdom, to make good the want, and to supply that loss; so that, in short, all the manufacturing hands in the nation were set on work, and were little enough for several years to supply the market, and answer the demands. All foreign markets also were empty of our goods, by the stop which had been occasioned by the plague, and before an open trade was allowed again; and the prodigious demand at home falling in, joined to make a quick vent[306] for all sorts of goods; so that there never was known such a trade all over England, for the time, as was in the first seven years after the plague, and after the fire of London. It remains now that I should say something of the merciful part of this terrible judgment. 
The last week in September, the plague being come to its crisis, its fury began to assuage. I remember my friend Dr. Heath, coming to see me the week before, told me he was sure the violence of it would assuage in a few days; but when I saw the weekly bill of that week, which was the highest of the whole year, being 8,297 of all diseases, I upbraided him with it, and asked him what he had made his judgment from. His answer, however, was not so much to seek[307] as I thought it would have been. "Look you," says he: "by the number which are at this time sick and infected, there should have been twenty thousand dead the last week, instead of eight thousand, if the inveterate mortal contagion had been as it was two weeks ago; for then it ordinarily killed in two or three days, now not under eight or ten; and then not above one in five recovered, whereas I have observed that now not above two in five miscarry. And observe it from me, the next bill will decrease, and you will see many more people recover than used to do; for though a vast multitude are now everywhere infected, and as many every day fall sick, yet there will not so many die as there did, for the malignity of the distemper is abated;" adding that he began now to hope, nay, more than hope, that the infection had passed its crisis, and was going off. And accordingly so it was; for the next week being, as I said, the last in September, the bill decreased almost two thousand. It is true, the plague was still at a frightful height, and the next bill was no less than 6,460, and the next to that 5,720; but still my friend's observation was just, and it did appear the people did recover faster, and more in number, than they used to do; and indeed if it had not been so, what had been the condition of the city of London? 
For, according to my friend, there were not fewer than 60,000 people at that time infected, whereof, as above, 20,477 died, and near 40,000 recovered; whereas, had it been as it was before, 50,000 of that number would very probably have died, if not more, and 50,000 more would have sickened; for in a word the whole mass of people began to sicken, and it looked as if none would escape. But this remark of my friend's appeared more evident in a few weeks more; for the decrease went on, and another week in October it decreased 1,843, so that the number dead of the plague was but 2,665; and the next week it decreased 1,413 more, and yet it was seen plainly that there was abundance of people sick, nay, abundance more than ordinary, and abundance fell sick every day; but, as above, the malignity of the disease abated. Such is the precipitant disposition of our people (whether it is so or not all over the world, that is none of my particular business to inquire; but I saw it apparently here), that, as upon the first sight of the infection they shunned one another, and fled from one another's houses and from the city with an unaccountable, and, as I thought, unnecessary fright, so now, upon this notion spreading, viz., that the distemper was not so catching as formerly, and that if it was catched it was not so mortal, and seeing abundance of people who really fell sick recover again daily, they took to such a precipitant courage, and grew so entirely regardless of themselves and of the infection, that they made no more of the plague than of an ordinary fever, nor indeed so much. They not only went boldly into company with those who had tumors and carbuncles upon them that were running, and consequently contagious, but eat and drank with them, nay, into their houses to visit them, and even, as I was told, into their very chambers where they lay sick. This I could not see rational.
My friend Dr. Heath allowed, and it was plain to experience, that the distemper was as catching as ever, and as many fell sick, but only he alleged that so many of those that fell sick did not die; but I think that while many did die, and that at best the distemper itself was very terrible, the sores and swellings very tormenting, and the danger of death not left out of the circumstance of sickness, though not so frequent as before,--all those things, together with the exceeding tediousness of the cure, the loathsomeness of the disease, and many other articles, were enough to deter any man living from a dangerous mixture[308] with the sick people, and make them[309] as anxious almost to avoid the infection as before. Nay, there was another thing which made the mere catching of the distemper frightful, and that was the terrible burning of the caustics which the surgeons laid on the swellings to bring them to break and to run; without which the danger of death was very great, even to the last; also the insufferable torment of the swellings, which, though it might not make people raving and distracted, as they were before, and as I have given several instances of already, yet they put the patient to inexpressible torment; and those that fell into it, though they did escape with life, yet they made bitter complaints of those that had told them there was no danger, and sadly repented their rashness and folly in venturing to run into the reach of it.
Nor did this unwary conduct of the people end here; for a great many that thus cast off their cautions suffered more deeply still, and though many escaped, yet many died; and at least it[310] had this public mischief attending it, that it made the decrease of burials slower than it would otherwise have been; for, as this notion ran like lightning through the city, and the people's heads were possessed with it, even as soon as the first great decrease in the bills appeared, we found that the two next bills did not decrease in proportion: the reason I take to be the people's running so rashly into danger, giving up all their former cautions and care, and all shyness which they used to practice, depending that the sickness would not reach them, or that, if it did, they should not die. The physicians opposed this thoughtless humor of the people with all their might, and gave out printed directions, spreading them all over the city and suburbs, advising the people to continue reserved, and to use still the utmost caution in their ordinary conduct, notwithstanding the decrease of the distemper; terrifying them with the danger of bringing a relapse upon the whole city, and telling them how such a relapse might be more fatal and dangerous than the whole visitation that had been already; with many arguments and reasons to explain and prove that part to them, and which are too long to repeat here. But it was all to no purpose. 
The audacious creatures were so possessed with the first joy, and so surprised with the satisfaction of seeing a vast decrease in the weekly bills, that they were impenetrable by any new terrors, and would not be persuaded but that the bitterness of death was passed; and it was to no more purpose to talk to them than to an east wind; but they opened shops, went about streets, did business, and conversed with anybody that came in their way to converse with, whether with business or without, neither inquiring of their health, or so much as being apprehensive of any danger from them, though they knew them not to be sound. This imprudent, rash conduct cost a great many their lives who had with great care and caution shut themselves up, and kept retired, as it were, from all mankind, and had by that means, under God's providence, been preserved through all the heat of that infection. This rash and foolish conduct of the people went so far, that the ministers took notice to them of it, and laid before them both the folly and danger of it; and this checked it a little, so that they grew more cautious. But it had another effect, which they could not check: for as the first rumor had spread, not over the city only, but into the country, it had the like effect; and the people were so tired with being so long from London, and so eager to come back, that they flocked to town without fear or forecast, and began to show themselves in the streets as if all the danger was over. It was indeed surprising to see it; for though there died still from a thousand to eighteen hundred a week, yet the people flocked to town as if all had been well. The consequence of this was, that the bills increased again four hundred the very first week in November; and, if I might believe the physicians, there were above three thousand fell sick that week, most of them newcomers too.
One John Cock, a barber in St. Martin's-le-Grand, was an eminent example of this (I mean of the hasty return of the people when the plague was abated). This John Cock had left the town with his whole family, and locked up his house, and was gone into the country, as many others did; and, finding the plague so decreased in November that there died but 905 per week of all diseases, he ventured home again. He had in his family ten persons; that is to say, himself and wife, five children, two apprentices, and a maidservant. He had not been returned to his house above a week, and began to open his shop and carry on his trade, but the distemper broke out in his family, and within about five days they all died except one: that is to say, himself, his wife, all his five children, and his two apprentices; and only the maid remained alive. But the mercy of God was greater to the rest than we had reason to expect; for the malignity, as I have said, of the distemper was spent, the contagion was exhausted, and also the wintry weather came on apace, and the air was clear and cold, with some sharp frosts; and this increasing still, most of those that had fallen sick recovered, and the health of the city began to return. There were indeed some returns of the distemper, even in the month of December, and the bills increased near a hundred; but it went off again, and so in a short while things began to return to their own channel. And wonderful it was to see how populous the city was again all on a sudden; so that a stranger could not miss the numbers that were lost, neither was there any miss of the inhabitants as to their dwellings. Few or no empty houses were to be seen, or, if there were some, there was no want of tenants for them. I wish I could say, that, as the city had a new face, so the manners of the people had a new appearance.
I doubt not but there were many that retained a sincere sense of their deliverance, and that were heartily thankful to that Sovereign Hand that had protected them in so dangerous a time. It would be very uncharitable to judge otherwise in a city so populous, and where the people were so devout as they were here in the time of the visitation itself; but, except what of this was to be found in particular families and faces, it must be acknowledged that the general practice of the people was just as it was before, and very little difference was to be seen. Some, indeed, said things were worse; that the morals of the people declined from this very time; that the people, hardened by the danger they had been in, like seamen after a storm is over, were more wicked and more stupid, more bold and hardened in their vices and immoralities, than they were before; but I will not carry it so far, neither. It would take up a history of no small length to give a particular of all the gradations by which the course of things in this city came to be restored again, and to run in their own channel as they did before. Some parts of England were now infected as violently as London had been. The cities of Norwich, Peterborough, Lincoln, Colchester, and other places, were now visited, and the magistrates of London began to set rules for our conduct as to corresponding with those cities. It is true, we could not pretend to forbid their people coming to London, because it was impossible to know them asunder; so, after many consultations, the lord mayor and court of aldermen were obliged to drop it. All they could do was to warn and caution the people not to entertain in their houses, or converse with, any people who they knew came from such infected places. But they might as well have talked to the air; for the people of London thought themselves so plague-free now, that they were past all admonitions. 
They seemed to depend upon it that the air was restored, and that the air was like a man that had had the smallpox,--not capable of being infected again. This revived that notion that the infection was all in the air; that there was no such thing as contagion from the sick people to the sound; and so strongly did this whimsey prevail among people, that they run altogether promiscuously, sick and well. Not the Mohammedans, who, prepossessed with the principle of predestination, value[311] nothing of contagion, let it be in what it will, could be more obstinate than the people of London. They that were perfectly sound, and came out of the wholesome air, as we call it, into the city, made nothing of going into the same houses and chambers, nay, even into the same beds, with those that had the distemper upon them, and were not recovered. Some, indeed, paid for their audacious boldness with the price of their lives. An infinite number fell sick, and the physicians had more work than ever, only with this difference, that more of their patients recovered, that is to say, they generally recovered; but certainly there were more people infected and fell sick now, when there did not die above a thousand or twelve hundred a week, than there was[312] when there died five or six thousand a week, so entirely negligent were the people at that time in the great and dangerous case of health and infection, and so ill were they able to take or except[313] of the advice of those who cautioned them for their good. The people being thus returned, as it were, in general, it was very strange to find, that, in their inquiring after their friends, some whole families were so entirely swept away that there was no remembrance of them left. Neither was anybody to be found to possess or show any title to that little they had left; for in such cases what was to be found was generally embezzled and purloined, some gone one way, some another. 
It was said such abandoned effects came to the King as the universal heir; upon which we are told, and I suppose it was in part true, that the King granted all such as deodands[314] to the lord mayor and court of aldermen of London, to be applied to the use of the poor, of whom there were very many. For it is to be observed, that though the occasions of relief and the objects of distress were very many more in the time of the violence of the plague than now, after all was over, yet the distress of the poor was more now a great deal than it was then, because all the sluices of general charity were shut. People supposed the main occasion to be over, and so stopped their hands; whereas particular objects were still very moving, and the distress of those that were poor was very great indeed. Though the health of the city was now very much restored, yet foreign trade did not begin to stir; neither would foreigners admit our ships into their ports for a great while. As for the Dutch, the misunderstandings between our court and them had broken out into a war the year before, so that our trade that way was wholly interrupted; but Spain and Portugal, Italy and Barbary,[315] as also Hamburg, and all the ports in the Baltic,--these were all shy of us a great while, and would not restore trade with us for many months. The distemper sweeping away such multitudes, as I have observed, many if not all of the outparishes were obliged to make new burying grounds, besides that I have mentioned in Bunhill Fields, some of which were continued, and remain in use to this day; but others were left off, and, which I confess I mention with some reflection,[316] being converted into other uses, or built upon afterwards, the dead bodies were disturbed, abused, dug up again, some even before the flesh of them was perished from the bones, and removed like dung or rubbish to other places. 
Some of those which came within the reach of my observations are as follows:-- First, A piece of ground beyond Goswell Street, near Mountmill, being some of the remains of the old lines or fortifications of the city, where abundance were buried promiscuously from the parishes of Aldersgate, Clerkenwell, and even out of the city. This ground, as I take it, was since[317] made a physic garden,[318] and, after[319] that, has been built upon. Second, A piece of ground just over the Black Ditch, as it was then called, at the end of Holloway Lane, in Shoreditch Parish. It has been since made a yard for keeping hogs and for other ordinary uses, but is quite out of use as a burying ground. Third, The upper end of Hand Alley, in Bishopsgate Street, which was then a green field, and was taken in particularly for Bishopsgate Parish, though many of the carts out of the city brought their dead thither also, particularly out of the parish of St. Allhallows-on-the-Wall. This place I cannot mention without much regret. It was, as I remember, about two or three years after the plague was ceased, that Sir Robert Clayton[320] came to be possessed of the ground. It was reported, how true I know not, that it fell to the King for want of heirs (all those who had any right to it being carried off by the pestilence), and that Sir Robert Clayton obtained a grant of it from King Charles II. But however he came by it, certain it is the ground was let out to build on, or built upon by his order. The first house built upon it was a large fair house, still standing, which faces the street or way now called Hand Alley, which, though called an alley, is as wide as a street. 
The houses in the same row with that house northward are built on the very same ground where the poor people were buried; and the bodies, on opening the ground for the foundations, were dug up, some of them remaining so plain to be seen, that the women's skulls were distinguished by their long hair, and of others the flesh was not quite perished; so that the people began to exclaim loudly against it, and some suggested that it might endanger a return of the contagion; after which the bones and bodies, as fast as they[321] came at them, were carried to another part of the same ground, and thrown altogether into a deep pit, dug on purpose, which now is to be known[322] in that it is not built on, but is a passage to another house at the upper end of Rose Alley, just against the door of a meetinghouse, which has been built there many years since; and the ground is palisadoed[323] off from the rest of the passage in a little square. There lie the bones and remains of near two thousand bodies, carried by the dead carts to their grave in that one year. Fourth, Besides this, there was a piece of ground in Moorfields, by the going into the street which is now called Old Bethlem, which was enlarged much, though not wholly taken in, on the same occasion. N.B. The author of this journal lies buried in that very ground, being at his own desire, his sister having been buried there a few years before. Fifth, Stepney Parish, extending itself from the east part of London to the north, even to the very edge of Shoreditch churchyard, had a piece of ground taken in to bury their dead, close to the said churchyard; and which, for that very reason, was left open, and is since, I suppose, taken into the same churchyard. And they had also two other burying places in Spittlefields,--one where since a chapel or tabernacle has been built for ease to this great parish, and another in Petticoat Lane. 
There were no less than five other grounds made use of for the parish of Stepney at that time; one where now stands the parish church of St. Paul, Shadwell, and the other where now stands the parish church of St. John, at Wapping, both which had not the names of parishes at that time, but were belonging to Stepney Parish. I could name many more; but these coming within my particular knowledge, the circumstance, I thought, made it of use to record them. From the whole, it may be observed that they were obliged in this time of distress to take in new burying grounds in most of the outparishes for laying the prodigious numbers of people which died in so short a space of time; but why care was not taken to keep those places separate from ordinary uses, that so the bodies might rest undisturbed, that I cannot answer for, and must confess I think it was wrong. Who were to blame, I know not. I should have mentioned that the Quakers[324] had at that time also a burying ground set apart to their use, and which they still make use of; and they had also a particular dead cart to fetch their dead from their houses. And the famous Solomon Eagle, who, as I mentioned before,[325] had predicted the plague as a judgment, and run naked through the streets, telling the people that it was come upon them to punish them for their sins, had his own wife died[326] the very next day of the plague, and was carried, one of the first, in the Quakers' dead cart to their new burying ground. 
I might have thronged this account with many more remarkable things which occurred in the time of the infection, and particularly what passed between the lord mayor and the court, which was then at Oxford, and what directions were from time to time received from the government for their conduct on this critical occasion; but really the court concerned themselves so little, and that little they did was of so small import, that I do not see it of much moment to mention any part of it here, except that of appointing a monthly fast in the city, and the sending the royal charity to the relief of the poor, both which I have mentioned before. Great was the reproach thrown upon those physicians who left their patients during the sickness; and, now they came to town again, nobody cared to employ them. They were called deserters, and frequently bills were set up on their doors, and written, "Here is a doctor to be let!" So that several of those physicians were fain for a while to sit still and look about them, or at least remove their dwellings and set up in new places and among new acquaintance. The like was the case with the clergy, whom the people were indeed very abusive to, writing verses and scandalous reflections upon them; setting upon the church door, "Here is a pulpit to be let," or sometimes "to be sold," which was worse. It was not the least of our misfortunes, that with our infection, when it ceased, there did not cease the spirit of strife and contention, slander and reproach, which was really the great troubler of the nation's peace before. It was said to be the remains of the old animosities which had so lately involved us all in blood and disorder;[327] but as the late act of indemnity[328] had lain asleep the quarrel itself, so the government had recommended family and personal peace, upon all occasions, to the whole nation. 
But it[329] could not be obtained; and particularly after the ceasing of the plague in London, when any one had seen the condition which the people had been in, and how they caressed one another at that time, promised to have more charity for the future, and to raise no more reproaches,--I say, any one that had seen them then would have thought they would have come together with another spirit at last. But, I say, it could not be obtained. The quarrel remained, the Church[330] and the Presbyterians were incompatible. As soon as the plague was removed, the dissenting ousted ministers who had supplied the pulpits which were deserted by the incumbents, retired. They[331] could expect no other but that they[332] should immediately fall upon them[331] and harass them with their penal laws; accept their[331] preaching while they[332] were sick, and persecute them[331] as soon as they[332] were recovered again. This even we that were of the Church thought was hard, and could by no means approve of it. But it was the government, and we could say nothing to hinder it. We could only say it was not our doing, and we could not answer for it. On the other hand, the dissenters reproaching those ministers of the Church with going away, and deserting their charge, abandoning the people in their danger, and when they had most need of comfort, and the like,--this we could by no means approve; for all men have not the same faith and the same courage, and the Scripture commands us to judge the most favorably, and according to charity. A plague is a formidable enemy, and is armed with terrors that every man is not sufficiently fortified to resist, or prepared to stand the shock against.[333] It is very certain that a great many of the clergy who were in circumstances to do it withdrew, and fled for the safety of their lives; but it is true, also, that a great many of them staid, and many of them fell in the calamity, and in the discharge of their duty. 
It is true, some of the dissenting turned-out ministers staid, and their courage is to be commended and highly valued; but these were not abundance. It cannot be said that they all staid, and that none retired into the country, any more than it can be said of the Church clergy that they all went away. Neither did all those that went away go without substituting curates[334] and others in their places, to do the offices needful, and to visit the sick as far as it was practicable. So that, upon the whole, an allowance of charity might have been made on both sides, and we should have considered that such a time as this of 1665 is not to be paralleled in history, and that it is not the stoutest courage that will always support men in such cases. I had not said this, but had rather chosen[335] to record the courage and religious zeal of those of both sides who did hazard themselves for the service of the poor people in their distress, without remembering that any failed in their duty on either side; but the want of temper among us has made the contrary to this necessary: some that staid, not only boasting too much of themselves, but reviling those that fled, branding them with cowardice, deserting their flocks, and acting the part of the hireling, and the like. I recommend it to the charity of all good people to look back and reflect duly upon the terrors of the time; and whoever does so will see that it is not an ordinary strength that could support it. 
It was not like appearing in the head of an army, or charging a body of horse in the field; but it was charging death itself on his pale horse.[336] To stay was indeed to die; and it could be esteemed nothing less, especially as things appeared at the latter end of August and the beginning of September, and as there was reason to expect them at that time; for no man expected, and I dare say believed, that the distemper would take so sudden a turn as it did, and fall immediately two thousand in a week, when there was such a prodigious number of people sick at that time as it was known there was; and then it was that many shifted[337] away that had staid most of the time before. Besides, if God gave strength to some more than to others, was it to boast of their ability to abide the stroke, and upbraid those that had not the same gift and support, or ought they not rather to have been humble and thankful if they were rendered more useful than their brethren? I think it ought to be recorded to the honor of such men, as well clergy as physicians, surgeons, apothecaries, magistrates, and officers of every kind, as also all useful people, who ventured their lives in discharge of their duty, as most certainly all such as staid did to the last degree; and several of these kinds did not only venture, but lost their lives on that sad occasion. I was once making a list of all such (I mean of all those professions and employments who thus died, as I call it, in the way of their duty), but it was impossible for a private man to come at a certainty in the particulars. I only remember that there died sixteen clergymen, two aldermen, five physicians, thirteen surgeons, within the city and liberties, before the beginning of September. But this being, as I said before, the crisis and extremity of the infection, it can be no complete list. 
As to inferior people, I think there died six and forty constables and headboroughs[338] in the two parishes of Stepney and Whitechapel; but I could not carry my list on, for when the violent rage of the distemper, in September, came upon us, it drove us out of all measure. Men did then no more die by tale[339] and by number: they might put out a weekly bill, and call them seven or eight thousand, or what they pleased. It is certain they died by heaps, and were buried by heaps; that is to say, without account. And if I might believe some people who were more abroad and more conversant with those things than I (though I was public enough for one that had no more business to do than I had),--I say, if we may believe them, there was not many less buried those first three weeks in September than twenty thousand per week. However the others aver the truth of it, yet I rather choose to keep to the public account. Seven or eight thousand per week is enough to make good all that I have said of the terror of those times; and it is much to the satisfaction of me that write, as well as those that read, to be able to say that everything is set down with moderation, and rather within compass than beyond it. Upon all these accounts, I say, I could wish, when we were recovered, our conduct had been more distinguished for charity and kindness, in remembrance of the past calamity, and not so much in valuing ourselves upon our boldness in staying; as if all men were cowards that fly from the hand of God, or that those who stay do not sometimes owe their courage to their ignorance, and despising the hand of their Maker, which is a criminal kind of desperation, and not a true courage. 
I cannot but leave it upon record, that the civil officers, such as constables, headboroughs, lord mayor's and sheriff's men, also parish officers, whose business it was to take charge of the poor, did their duties, in general, with as much courage as any, and perhaps with more; because their work was attended with more hazards, and lay more among the poor, who were more subject to be infected, and in the most pitiful plight when they were taken with the infection. But then it must be added, too, that a great number of them died; indeed, it was scarcely possible it should be otherwise. I have not said one word here about the physic or preparations that were ordinarily made use of on this terrible occasion (I mean we that frequently went abroad up and down the streets, as I did). Much of this was talked of in the books and bills of our quack doctors, of whom I have said enough already. It may, however, be added, that the College of Physicians were daily publishing several preparations, which they had considered of in the process of their practice; and which, being to be had in print, I avoid repeating them for that reason. One thing I could not help observing,--what befell one of the quacks, who published that he had a most excellent preservative against the plague, which whoever kept about them should never be infected, or liable to infection. This man, who, we may reasonably suppose, did not go abroad without some of this excellent preservative in his pocket, yet was taken by the distemper, and carried off in two or three days. I am not of the number of the physic haters or physic despisers (on the contrary, I have often mentioned the regard I had to the dictates of my particular friend Dr. Heath); but yet I must acknowledge I made use of little or nothing, except, as I have observed, to keep a preparation of strong scent, to have ready in case I met with anything of offensive smells, or went too near any burying place or dead body. 
Neither did I do, what I know some did, keep the spirits high and hot with cordials and wine, and such things, and which, as I observed, one learned physician used himself so much to, as that he could not leave them off when the infection was quite gone, and so became a sot for all his life after. I remember my friend the doctor used to say that there was a certain set of drugs and preparations which were all certainly good and useful in the case of an infection, out of which or with which physicians might make an infinite variety of medicines, as the ringers of bells make several hundred different rounds of music by the changing and order of sound but in six bells; and that all these preparations shall[340] be really very good. "Therefore," said he, "I do not wonder that so vast a throng of medicines is offered in the present calamity, and almost every physician prescribes or prepares a different thing, as his judgment or experience guides him; but," says my friend, "let all the prescriptions of all the physicians in London be examined, and it will be found that they are all compounded of the same things, with such variations only as the particular fancy of the doctor leads him to; so that," says he, "every man, judging a little of his own constitution and manner of his living, and circumstances of his being infected, may direct his own medicines out of the ordinary drugs and preparations. Only that," says he, "some recommend one thing as most sovereign, and some another. Some," says he, "think that Pill. Ruff., which is called itself the antipestilential pill, is the best preparation that can be made; others think that Venice treacle[341] is sufficient of itself to resist the contagion; and I," says he, "think as both these think, viz., that the first is good to take beforehand to prevent it, and the last, if touched, to expel it." 
According to this opinion, I several times took Venice treacle, and a sound sweat upon it, and thought myself as well fortified against the infection as any one could be fortified by the power of physic. As for quackery and mountebank, of which the town was so full, I listened to none of them, and observed often since, with some wonder, that for two years after the plague I scarcely ever heard one of them about the town. Some fancied they were all swept away in the infection to a man, and were for calling it a particular mark of God's vengeance upon them for leading the poor people into the pit of destruction merely for the lucre of a little money they got by them; but I cannot go that length, neither. That abundance of them died is certain (many of them came within the reach of my own knowledge); but that all of them were swept off, I much question. I believe, rather, they fled into the country, and tried their practices upon the people there, who were in apprehension of the infection before it came among them. This, however, is certain, not a man of them appeared for a great while in or about London. There were indeed several doctors who published bills recommending their several physical preparations for cleansing the body, as they call it, after the plague, and needful, as they said, for such people to take who had been visited and had been cured; whereas, I must own, I believe that it was the opinion of the most eminent physicians of that time, that the plague was itself a sufficient purge, and that those who escaped the infection needed no physic to cleanse their bodies of any other things (the running sores, the tumors, etc., which were broken and kept open by the direction of the physicians, having sufficiently cleansed them); and that all other distempers, and causes of distempers, were effectually carried off that way. And as the physicians gave this as their opinion wherever they came, the quacks got little business. 
There were indeed several little hurries which happened after the decrease of the plague, and which, whether they were contrived to fright and disorder the people, as some imagined, I cannot say; but sometimes we were told the plague would return by such a time; and the famous Solomon Eagle, the naked Quaker I have mentioned, prophesied evil tidings every day, and several others, telling us that London had not been sufficiently scourged, and the sorer and severer strokes were yet behind. Had they stopped there, or had they descended to particulars, and told us that the city should be the next year destroyed by fire, then, indeed, when we had seen it come to pass, we should not have been to blame to have paid more than common respect to their prophetic spirits (at least, we should have wondered at them, and have been more serious in our inquiries after the meaning of it, and whence they had the foreknowledge); but as they generally told us of a relapse into the plague, we have had no concern since that about them. Yet by those frequent clamors we were all kept with some kind of apprehensions constantly upon us; and if any died suddenly, or if the spotted fevers at any time increased, we were presently alarmed; much more if the number of the plague increased, for to the end of the year there were always between two and three hundred[342] of the plague. On any of these occasions, I say, we were alarmed anew. 
Those who remember the city of London before the fire must remember that there was then no such place as that we now call Newgate Market; but in the middle of the street, which is now called Blow Bladder Street, and which had its name from the butchers, who used to kill and dress their sheep there (and who, it seems, had a custom to blow up their meat with pipes, to make it look thicker and fatter than it was, and were punished there for it by the lord mayor),--I say, from the end of the street towards Newgate there stood two long rows of shambles for the selling[343] meat. It was in those shambles that two persons falling down dead as they were buying meat, gave rise to a rumor that the meat was all infected; which though it might affright the people, and spoiled the market for two or three days, yet it appeared plainly afterwards that there was nothing of truth in the suggestion: but nobody can account for the possession of fear when it takes hold of the mind. However, it pleased God, by the continuing of the winter weather, so to restore the health of the city, that by February following we reckoned the distemper quite ceased, and then we were not easily frighted again. There was still a question among the learned, and[344] at first perplexed the people a little; and that was, in what manner to purge the houses and goods where the plague had been, and how to render them[345] habitable again which had been left empty during the time of the plague. 
Abundance of perfumes and preparations were prescribed by physicians, some of one kind, some of another, in which the people who listened to them put themselves to a great, and indeed in my opinion to an unnecessary, expense; and the poorer people, who only set open their windows night and day, burnt brimstone, pitch, and gunpowder, and such things, in their rooms, did as well as the best; nay, the eager people who, as I said above, came home in haste and at all hazards, found little or no inconvenience in their houses, nor in their goods, and did little or nothing to them. However, in general, prudent, cautious people did enter into some measures for airing and sweetening their houses, and burnt perfumes, incense, benjamin,[346] resin, and sulphur in their rooms, close shut up, and then let the air carry it all out with a blast of gunpowder; others caused large fires to be made all day and all night for several days and nights. By the same token that[347] two or three were pleased to set their houses on fire, and so effectually sweetened them by burning them down to the ground (as particularly one at Ratcliff, one in Holborn, and one at Westminster, besides two or three that were set on fire; but the fire was happily got out again before it went far enough to burn down the houses); and one citizen's servant, I think it was in Thames Street, carried so much gunpowder into his master's house, for clearing it of the infection, and managed it so foolishly, that he blew up part of the roof of the house. 
But the time was not fully come that the city was to be purged with fire, nor was it far off; for within nine months more I saw it all lying in ashes, when, as some of our quaking philosophers pretend, the seeds of the plague were entirely destroyed, and not before,--a notion too ridiculous to speak of here, since, had the seeds of the plague remained in the houses, not to be destroyed but by fire, how has it been that they have not since broken out, seeing all those buildings in the suburbs and liberties, all in the great parishes of Stepney, Whitechapel, Aldgate, Bishopsgate, Shoreditch, Cripplegate, and St. Giles's, where the fire never came, and where the plague raged with the greatest violence, remain still in the same condition they were in before? But to leave these things just as I found them, it was certain that those people who were more than ordinarily cautious of their health did take particular directions for what they called seasoning of their houses; and abundance of costly things were consumed on that account, which I cannot but say not only seasoned those houses as they desired, but filled the air with very grateful and wholesome smells, which others had the share of the benefit of, as well as those who were at the expenses of them. Though the poor came to town very precipitantly, as I have said, yet, I must say, the rich made no such haste. The men of business, indeed, came up, but many of them did not bring their families to town till the spring came on, and that they saw reason to depend upon it that the plague would not return. The court, indeed, came up soon after Christmas; but the nobility and gentry, except such as depended upon and had employment under the administration, did not come so soon. 
I should have taken notice here, that notwithstanding the violence of the plague in London and other places, yet it was very observable that it was never on board the fleet; and yet for some time there was a strange press[348] in the river, and even in the streets, for seamen to man the fleet. But it was in the beginning of the year, when the plague was scarce begun, and not at all come down to that part of the city where they usually press for seamen; and though a war with the Dutch was not at all grateful to the people at that time, and the seamen went with a kind of reluctancy into the service, and many complained of being dragged into it by force, yet it proved, in the event, a happy violence to several of them, who had probably perished in the general calamity, and who, after the summer service was over, though they had cause to lament the desolation of their families (who, when they came back, were many of them in their graves), yet they had room to be thankful that they were carried out of the reach of it, though so much against their wills. We, indeed, had a hot war with the Dutch that year, and one very great engagement[349] at sea, in which the Dutch were worsted; but we lost a great many men and some ships. But, as I observed, the plague was not in the fleet; and when they came to lay up the ships in the river, the violent part of it began to abate. I would be glad if I could close the account of this melancholy year with some particular examples historically, I mean of the thankfulness to God, our Preserver, for our being delivered from this dreadful calamity. Certainly the circumstances of the deliverance, as well as the terrible enemy we were delivered from, called upon the whole nation for it. 
The circumstances of the deliverance were indeed very remarkable, as I have in part mentioned already; and particularly the dreadful condition which we were all in, when we were, to the surprise of the whole town, made joyful with the hope of a stop to the infection. Nothing but the immediate finger of God, nothing but omnipotent power, could have done it. The contagion despised all medicine, death raged in every corner; and, had it gone on as it did then, a few weeks more would have cleared the town of all and everything that had a soul. Men everywhere began to despair; every heart failed them for fear; people were made desperate through the anguish of their souls; and the terrors of death sat in the very faces and countenances of the people. In that very moment, when we might very well say, "Vain was the help of man,"[350]--I say, in that very moment it pleased God, with a most agreeable surprise, to cause the fury of it to abate, even of itself; and the malignity declining, as I have said, though infinite numbers were sick, yet fewer died; and the very first week's bill decreased 1,843, a vast number indeed. It is impossible to express the change that appeared in the very countenances of the people that Thursday morning when the weekly bill came out. It might have been perceived in their countenances that a secret surprise and smile of joy sat on everybody's face. They shook one another by the hands in the streets, who would hardly go on the same side of the way with one another before. Where the streets were not too broad, they would open their windows and call from one house to another, and asked how they did, and if they had heard the good news that the plague was abated. Some would return, when they said good news, and ask, "What good news?" And when they answered that the plague was abated, and the bills decreased almost two thousand, they would cry out, "God be praised!" 
and would weep aloud for joy, telling them they had heard nothing of it; and such was the joy of the people, that it was, as it were, life to them from the grave. I could almost set down as many extravagant things done in the excess of their joy as of their grief; but that would be to lessen the value of it. I must confess myself to have been very much dejected just before this happened; for the prodigious numbers that were taken sick the week or two before, besides those that died, was[351] such, and the lamentations were so great everywhere, that a man must have seemed to have acted even against his reason if he had so much as expected to escape; and as there was hardly a house but mine in all my neighborhood but what was infected, so, had it gone on, it would not have been long that there would have been any more neighbors to be infected. Indeed, it is hardly credible what dreadful havoc the last three weeks had made: for, if I might believe the person whose calculations I always found very well grounded, there were not less than thirty thousand people dead, and near one hundred thousand fallen sick, in the three weeks I speak of; for the number that sickened was surprising, indeed it was astonishing, and those whose courage upheld them all the time before, sunk under it now. In the middle of their distress, when the condition of the city of London was so truly calamitous, just then it pleased God, as it were, by his immediate hand, to disarm this enemy: the poison was taken out of the sting. It was wonderful. Even the physicians themselves were surprised at it. Wherever they visited, they found their patients better,--either they had sweated kindly, or the tumors were broke, or the carbuncles went down and the inflammations round them changed color, or the fever was gone, or the violent headache was assuaged, or some good symptom was in the case,--so that in a few days everybody was recovering. 
Whole families that were infected and down, that had ministers praying with them, and expected death every hour, were revived and healed, and none died at all out of them. Nor was this by any new medicine found out, or new method of cure discovered, or by any experience in the operation which the physicians or surgeons attained to; but it was evidently from the secret invisible hand of Him that had at first sent this disease as a judgment upon us. And let the atheistic part of mankind call my saying what they please, it is no enthusiasm: it was acknowledged at that time by all mankind. The disease was enervated, and its malignity spent; and let it proceed from whencesoever it will, let the philosophers search for reasons in nature to account for it by, and labor as much as they will to lessen the debt they owe to their Maker, those physicians who had the least share of religion in them were obliged to acknowledge that it was all supernatural, that it was extraordinary, and that no account could be given of it. If I should say that this is a visible summons to us all to thankfulness, especially we that were under the terror of its increase, perhaps it may be thought by some, after the sense of the thing was over, an officious canting of religious things, preaching a sermon instead of writing a history, making myself a teacher instead of giving my observations of things (and this restrains me very much from going on here, as I might otherwise do); but if ten lepers were healed, and but one returned to give thanks, I desire to be as that one, and to be thankful for myself. Nor will I deny but there were abundance of people who, to all appearance, were very thankful at that time: for their mouths were stopped, even the mouths of those whose hearts were not extraordinarily long affected with it; but the impression was so strong at that time, that it could not be resisted, no, not by the worst of the people. 
It was a common thing to meet people in the street that were strangers, and that we knew nothing at all of, expressing their surprise. Going one day through Aldgate, and a pretty many people being passing and repassing, there comes a man out of the end of the Minories; and, looking a little up the street and down, he throws his hands abroad: "Lord, what an alteration is here! Why, last week I came along here, and hardly anybody was to be seen." Another man (I heard him) adds to his words, "'Tis all wonderful; 'tis all a dream."--"Blessed be God!" says a third man; "and let us give thanks to him, for 'tis all his own doing." Human help and human skill were at an end. These were all strangers to one another, but such salutations as these were frequent in the street every day; and, in spite of a loose behavior, the very common people went along the streets, giving God thanks for their deliverance. It was now, as I said before, the people had cast off all apprehensions, and that too fast. Indeed, we were no more afraid now to pass by a man with a white cap upon his head, or with a cloth wrapped round his neck, or with his leg limping, occasioned by the sores in his groin,--all which were frightful to the last degree but the week before. But now the street was full of them, and these poor recovering creatures, give them their due, appeared very sensible of their unexpected deliverance, and I should wrong them very much if I should not acknowledge that I believe many of them were really thankful; but I must own that for the generality of the people it might too justly be said of them, as was said of the children of Israel after their being delivered from the host of Pharaoh, when they passed the Red Sea, and looked back and saw the Egyptians overwhelmed in the water, viz., "that they sang his praise, but they soon forgot his works."[352] I can go no further here. 
I should be counted censorious, and perhaps unjust, if I should enter into the unpleasing work of reflecting, whatever cause there was for it, upon the unthankfulness and return of all manner of wickedness among us, which I was so much an eyewitness of myself. I shall conclude the account of this calamitous year, therefore, with a coarse but a sincere stanza of my own, which I placed at the end of my ordinary memorandums the same year they were written:--

    A dreadful plague in London was,
      In the year sixty-five,
    Which swept an hundred thousand souls
      Away, yet I alive.

                                  H.F.[353]

FOOTNOTES:

[4] It was popularly believed in London that the plague came from Holland; but the sanitary (or rather unsanitary) conditions of London itself were quite sufficient to account for the plague's originating there. Andrew D. White tells us, that it is difficult to decide to-day between Constantinople and New York as candidates for the distinction of being the dirtiest city in the world. [5] Incorrectly used for "councils." [6] In April, 1663, the first Drury Lane Theater had been opened. The present Drury Lane Theater (the fourth) stands on the same site. [7] The King's ministers. At this time they held office during the pleasure of the Crown, not, as now, during the pleasure of a parliamentary majority. [8] Gangrene spots (see text, pp. 197, 198). [9] The local government of London at this time was chiefly in the hands of the vestries of the different parishes. It is only of recent years that the power of these vestries has been seriously curtailed, and transferred to district councils. [10] The report. [11] Pronounced H[=o]´burn. {Transcriber's note: [=o] indicates o-macron} [12] Was. [13] Were. [14] Outlying districts; so called because they enjoyed certain municipal immunities, or liberties. Until recent years, a portion of Philadelphia was known as the "Northern Liberties." [15] Attempts to believe the evil lessened. [16] Was. [17] Were. 
[18] The chief executive officer of the city of London still bears this title. [19] One of the many instances in which Defoe mixes his tenses. [20] Whom. We shall find many more instances of Defoe's misuse of this form, as also of others (see Introduction, p. 15). [21] Used almost in its original sense of a military barrier. [22] Whom. [23] See Matt. xxvii. 40; Mark xv. 30; Luke xxiii. 35. [24] Denial. [25] The civil war between the Royalists and the Parliamentarians, 1642-51. [26] Whom. [27] This argument is neatly introduced to account for the narrator's staying in the city at all, when he could easily have escaped. [28] Explained by the two following phrases. [29] Whom. [30] "Lay close to me," i.e., was constantly in my mind. [31] Kept safe from the plague. [32] "My times are in thy hand" (Ps. xxxi. 15). [33] Dorking is about twenty miles southwest of London. [34] Rather St. Martin's-in-the-Fields and St. Giles's. [35] Was. [36] Charles II. and his courtiers. The immunity of Oxford was doubtless due to good drainage and general cleanliness. [37] Eccl. xii. 5. [38] Have seen. [39] Nor. This misuse of "or" for "nor" is frequent with Defoe. [40] The four inns of court in London which have the exclusive right of calling to the bar, are the Inner Temple, the Middle Temple, Lincoln's Inn, and Gray's Inn. The Temple is so called because it was once the home of the Knights Templars. [41] The city proper, i.e., the part within the walls, as distinguished from that without. [42] Were. [43] The population of London at this time was probably about half a million. It is now about six millions. (See Macaulay's History, chap. iii.) [44] Acel´dama, the field of blood (see Matt. xxvii. 8). [45] Phlegmatic hypochondriac is a contradiction in terms; for "phlegmatic" means "impassive, self-restrained," while "hypochondriac" means "morbidly anxious" (about one's health). Defoe's lack of scholarship was a common jest among his more learned adversaries, such as Swift and Pope. 
[46] It was in this very plague year that Newton formulated his theory of gravitation. Incredible as it may seem, at this same date even such men as Dryden held to a belief in astrology. [47] William Lilly was the most famous astrologer and almanac maker of the time. In Butler's Hudibras he is satirized under the name of Sidrophel. [48] Poor Robin's Almanack was first published in 1661 or 1662, and was ascribed to Robert Herrick, the poet. [49] See Rev. xviii. 4. [50] Jonah iii. 4. [51] Flavius Josephus, the author of the History of the Jewish Wars. He is supposed to have died in the last decade of the first century A.D. [52] So called because many Frenchmen lived there. In Westminster there was another district with this same name. [53] "Gave them vapors," i.e., put them into a state of nervous excitement. [54] Soothsayers. [55] In astrology, the scheme or figure of the heavens at the moment of a person's birth. From this the astrologers pretended to foretell a man's destiny. [56] Roger Bacon, a Franciscan friar of the thirteenth century, had a knowledge of mechanics and optics far in advance of his age: hence he was commonly regarded as a wizard. The brazen head which he manufactured was supposed to assist him in his necromantic feats; it is so introduced by Greene in his play of Friar Bacon and Friar Bungay (1594). [57] A fortune teller who lived in the reign of Henry VIII., and was famous for her prophecies. [58] The most celebrated magician of mediæval times (see Spenser's Faërie Queene and Tennyson's Merlin and Vivien). [59] Linen collar or ruff. [60] Him. [61] The interlude was originally a short, humorous play acted in the midst of a morality play to relieve the tedium of that very tedious performance. From the interlude was developed farce; and from farce, comedy. [62] Charles II. and his courtiers, from their long exile in France, brought back to England with them French fashions in literature and in art. [63] To be acted. [64] Buffoons, clowns. 
[65] About 62½ cents. [66] About twenty-five dollars; but the purchasing power of money was then seven or eight times what it is now. [67] Strictly speaking, this word means "love potions." [68] Exorcism is the act of expelling evil spirits, or the formula used in the act. Defoe's use of the word here is careless and inaccurate. [69] Bits of metal, parchment, etc., worn as charms. [70] Making the sign of the cross. [71] Paper on which were marked the signs of the zodiac,--a superstition from astrology. [72] A meaningless word used in incantations. Originally the name of a Syrian deity. [73] Iesus Hominum Salvator ("Jesus, Savior of Men"). The order of the Jesuits was founded by Ignatius de Loyola in 1534. [74] The Feast of St. Michael, Sept. 29. [75] This use of "to" for "of" is frequent with Defoe. [76] The Royal College of Physicians was founded by Thomas Linacre, physician to Henry VIII. Nearly every London physician of prominence is a member. [77] The city of London proper lies entirely in the county of Middlesex. [78] Literally, "hand workers;" now contracted into "surgeons." [79] Cares, duties. [80] Consenting knowledge. [81] Disposed of to the public, put in circulation. [82] That is, by the disease. [83] Happen. [84] Engaged. [85] Heaps of rubbish. [86] A kind of parish constable. [87] The writer seems to mean that the beggars are so importunate, there is no avoiding them. [88] Fights between dogs and bears. This was not declared a criminal offense in England until 1835. [89] Contests with sword and shield. [90] The guilds or organizations of tradesmen, such as the goldsmiths, the fishmongers, the merchant tailors. [91] St. Katherine's by the Tower. [92] Trinity (east of the) Minories. The Minories (a street running north from the Tower) was so designated from an abbey of St. Clare nuns called Minoresses. They took their name from that of the Franciscan Order, Fratres Minores, or Lesser Brethren. [93] St. Luke's. [94] St. Botolph's, Bishopsgate. [95] St. 
Giles's, Cripplegate. [96] Were. [97] Chemise. [98] This word is misplaced; it should go before "perish." [99] Before "having," supply "the master." [100] Fences. [101] From. [102] This old form for "caught" is used frequently by Defoe. [103] Came to grief. [104] "Who, being," etc., i.e., who, although single men, had yet staid. [105] The wars of the Commonwealth or of the Puritan Revolution, 1640-52. [106] Holland and Belgium. [107] "Hurt of," a common form of expression used in Defoe's time. [108] Manager, economist. This meaning of "husband" is obsolete. [109] A participial form of expression very common in Old English, the "a" being a corruption of "in" or "on." [110] Were. [111] "'Name of God," i.e., in the name of God. [112] Torches. [113] "To and again," i.e., to and fro. [114] Were. [115] As if. [116] Magpie. [117] This word is from the same root as "lamp." The old form "lanthorn" crept in from the custom of making the sides of a lantern of horn. [118] Supply "be." [119] Inclination. [120] In expectation of the time when. [121] Their being checked. [122] This paragraph could hardly have been more clumsily expressed. It will be found a useful exercise to rewrite it. [123] "To have gone," i.e., to go. [124] Spotted. [125] "Make shift," i.e., endure it. [126] Device, expedient. [127] "In all" is evidently a repetition. [128] Objects cannot very well happen. Defoe must mean, "the many dismal sights I saw as I went about the streets." [129] As. [130] "Rosin" is a long-established misspelling for "resin." Resin exudes from pine trees, and from it the oil of turpentine is separated by distillation. [131] As distinguished from fish meat. [132] Defoe uses these pronouns in the wrong number, as in numerous other instances. [133] The projecting part of a building. [134] Their miraculous preservation was wrought by their keeping in the fresh air of the open fields. 
It seems curious that after this object lesson the physicians persisted in their absurd policy of shutting up infected houses, thus practically condemning to death their inmates. [135] Used here for "this," as also in many other places. [136] Supply "with." [137] Such touches as this created a widespread and long-enduring belief that Defoe's fictitious diary was an authentic history. [138] "Running out," etc., i.e., losing their self-control. [139] Idiocy. In modern English, "idiotism" is the same as "idiom." [140] Gangrene, death of the soft tissues. [141] Before "that" supply "we have been told." [142] Hanging was at this time a common punishment for theft. In his novel Moll Flanders, Defoe has a vivid picture of the mental and physical sufferings of a woman who was sent to Newgate, and condemned to death, for stealing two pieces of silk. [143] Cloth, rag. [144] They could no longer give them regular funerals, but had to bury them promiscuously in pits. [145] Evidently a repetition. [146] In old and middle English two negatives did not make an affirmative, as they do in modern English. [147] It is now well known that rue has no qualities that are useful for warding off contagion. [148] "Set up," i.e., began to play upon. [149] Constrained. [150] Because they would have been refused admission to other ports. [151] Matter. So used by Sheridan in The Rivals, act iii. sc. 2. [152] Probably a misprint for "greatly." [153] This. [154] Are. [155] He has really given two days more than two months. [156] A count. [157] Range, limits. [158] Unknown. [159] Lying. [160] Was. [161] Notice this skillful touch to give verisimilitude to the narrative. [162] Country. [163] "Without the bars," i.e., outside the old city limits. [164] Profession. [165] The plague. [166] The legal meaning of "hamlet" in England is a village without a church of its own: ecclesiastically, therefore, it belongs to the parish of some other village. 
[167] All Protestant sects other than the Established Church of England. [168] A groat equals fourpence, about eight cents. It is not coined now. [169] A farthing equals one quarter of a penny. [170] About ten miles down the Thames. [171] The t is silent in this word. [172] Hard-tack, pilot bread. [173] Old form for "rode." [174] See the last sentence of the next paragraph but one. [175] Roadstead, an anchoring ground less sheltered than a harbor. [176] Substitute "that they would not be visited." [177] The plague. [178] St. Margaret's. [179] Nota bene, note well. [180] Dul´ich. All these places are southward from London. Norwood is six miles distant. [181] Old form of "dared." [182] Small vessels, generally schooner-rigged, used for carrying heavy freight on rivers and harbors. [183] London Bridge. [184] This incident is so overdone, that it fails to be pathetic, and rather excites our laughter. [185] Supply "themselves." [186] Barnet was about eleven miles north-northwest of London. [187] Holland and Belgium. [188] See Luke xvii. 11-19. [189] Well. [190] With speed, in haste. [191] This word is misplaced. It should go immediately before "to lodge." [192] Luck. [193] Whom. [194] A small sail set high upon the mast. [195] "Fetched a long compass," i.e., went by a circuitous route. [196] The officers. [197] Refused. [198] Nearly twenty miles northeast of London. [199] He. This pleonastic use of a conjunction with the relative is common among illiterate writers and speakers to-day. [200] Waltham and Epping, towns two or three miles apart, at a distance of ten or twelve miles almost directly north of London. [201] Pollard trees are trees cut back nearly to the trunk, and so caused to grow into a thick head (poll) of branches. [202] Entertainment. In this sense, the plural, "quarters," is the commoner form. [203] Preparing. [204] Peddlers. [205] "Has been," an atrocious solecism for "were." [206] To a miraculous extent. [207] "Put to it," i.e., hard pressed. 
[208] There are numerous references in the Hebrew Scriptures to parched corn as an article of food (see, among others, Lev. xxiii. 14, Ruth ii. 14, 2 Sam. xvii. 28). [209] Supply "(1)." [210] Soon. [211] Substitute "would." [212] Whom. [213] Familiar intercourse. [214] Evidently a repetition. [215] "For that," i.e., because. [216] Singly. [217] Supply "to be." [218] Buildings the rafters of which lean against or rest upon the outer wall of another building. [219] Supply "of." [220] The plague. [221] "Middling people," i.e., people of the middle class. [222] At the mouth of the Thames. [223] Awnings. [224] Two heavy timbers placed horizontally, the upper one of which can be raised. When lowered, it is held in place by a padlock. Notches in the timbers form holes, through which the prisoner's legs are thrust, and held securely. [225] The constables. [226] The carters. [227] The goods. [228] In spite of, notwithstanding. [229] Supply "who." [230] "Cum aliis," i.e., with others. Most of the places mentioned in this list are several miles distant from London: for example, Enfield is ten miles northeast; Hadley, over fifty miles northeast; Hertford, twenty miles north; Kingston, ten miles southwest; St. Albans, twenty miles northwest; Uxbridge, fifteen miles west; Windsor, twenty miles west; etc. [231] Kindly regarded. [232] Which. [233] The citizens. [234] Such statements. [235] For "so that," substitute "so." [236] How. [237] It was not known in Defoe's time that minute disease germs may be carried along by a current of air. [238] Affected with scurvy. [239] "Which," as applied to persons, is a good Old English idiom, and was in common use as late as 1711 (see Spectator No. 78; and Matt. vi. 9, version of 1611). [240] Flung to. [241] Changed their garments. [242] Supply "I heard." [243] At. [244] Various periods are assigned for the duration of the dog days: perhaps July 3 to Aug. 11 is that most commonly accepted. 
The dog days were so called because they coincided with the heliacal rising of Sirius or Canicula (the little dog). [245] An inn with this title (and probably a picture of the brothers) painted on its signboard. [246] Whom. [247] The Act of Uniformity was passed in 1661. It required all municipal officers and all ministers to take the communion according to the ritual of the Church of England, and to sign a document declaring that arms must never be borne against the King. For refusing obedience to this tyrannical measure, some two thousand Presbyterian ministers were deprived of their livings. [248] Madness, as in Hamlet, act iii. sc. 1. [249] "Represented themselves," etc., i.e., presented themselves to my sight. [250] "Dead part of the night," i.e., from midnight to dawn. Compare, "In the dead waste and middle of the night." Hamlet, act i. sc 2. [251] "Have been critical," etc., i.e., have claimed to have knowledge enough to say. [252] Being introduced. [253] The plague. [254] "First began" is a solecism common in the newspaper writing of to-day. [255] Literally, laws of the by (town). In modern usage, "by-law" is used to designate a rule less general and less easily amended than a constitutional provision. [256] "Sheriff" is equivalent to shire-reeve (magistrate of the county or shire). London had, and still has, two sheriffs. [257] Acted. [258] The inspection, according to ordinance, of weights, measures, and prices. [259] "Pretty many," i.e., a fair number of. [260] The officers. [261] Were. [262] "Falls to the serious part," i.e., begins to discourse on serious matters. [263] See note, p. 28. The Mohammedans are fatalists. {Transcriber's note: The reference is to footnote 28.} [264] A growth of osseous tissue uniting the extremities of fractured bones. [265] Disclosed. [266] The officers. [267] Leading principle. [268] Defoe means, "can burn only a few houses." In the next line he again misplaces "only." [269] Put to confusion. 
[270] Left out of consideration. [271] The distemper. [272] A means for discovering whether the person were infected or not. [273] Defoe's ignorance of microscopes was not shared by Robert Hooke, whose Micrographia (published in 1664) records numerous discoveries made with that instrument. [274] Roup is a kind of chicken's catarrh. [275] Them, i.e., such experiments. [276] From the Latin quadraginta ("forty"). [277] From the Latin sexaginta ("sixty"). [278] Kinds, species. [279] Old age. [280] Abscesses. [281] Himself. [282] The essential oils of lavender, cloves, and camphor, added to acetic acid. [283] In chemistry, balsams are vegetable juices consisting of resins mixed with gums or volatile oils. [284] Supply "they declined coming to public worship." [285] This condition of affairs. [286] Collar. [287] Economy. [288] Supply "they were." [289] Action (obsolete in this sense). See this word as used in 2 Henry IV., act iv. sc. 4. [290] Which. [291] Sailors' slang for "Archipelagoes." [292] An important city in Asia Minor. [293] A city in northern Syria, better known as Iskanderoon or Alexandretta. The town was named in honor of Alexander the Great, the Turkish form of Alexander being Iskander. [294] Though called a kingdom, Algarve was nothing but a province of Portugal. It is known now as Faro. [295] The natives of Flanders, a mediæval countship now divided among Holland, Belgium, and France. [296] Colonies. In the reign of Charles II., the English colonies were governed by a committee (of the Privy Council) known as the "Council of Plantations." [297] The east side. [298] On the west side. [299] See map of England for all these places. Feversham is in Kent, forty-five miles southeast of London; Margate is on the Isle of Thanet, eighty miles southeast. [300] Commission merchants. [301] Privateers. Capers is a Dutch word. [302] Supply "he." [303] Supply "the coals." [304] "One another," by a confusion of constructions, has been used here for "them." 
[305] By a statute of Charles II. a chaldron was fixed at 36 coal bushels. In the United States, it is generally 26¼ hundredweight. [306] Opening. [307] "To seek," i.e., without judgment or knowledge. [308] Mixing. [309] Him. [310] This unwary conduct. [311] Think. [312] Were. [313] Accept. [314] Personal chattels that had occasioned the death of a human being, and were therefore given to God (Deo, "to God"; dandum, "a thing given"); i.e., forfeited to the King, and by him distributed in alms. This curious law of deodands was not abolished in England until 1846. [315] The southern coast of the Mediterranean, from Egypt to the Atlantic. [316] Censure. [317] Afterward. [318] "Physic garden," i.e., a garden for growing medicinal herbs. [319] Since. [320] Lord mayor of London, 1679-80, and for many years member of Parliament for the city. [321] The workmen. [322] Recognized. [323] Fenced. [324] Members of the Society of Friends, a religious organization founded by George Fox about 1650. William Penn was one of the early members. The society condemns a paid ministry, the taking of oaths, and the making of war. [325] See p. 105, next to the last paragraph. [326] Die. "Of the plague" should immediately follow "died." [327] See Note 3, p. 26. {Transcriber's note: the reference is to footnote 26.} [328] The act of indemnity passed at the restoration of Charles II. (1660). In spite of the King's promise of justice, the Parliamentarians were largely despoiled of their property, and ten of those concerned in the execution of Charles I. were put to death. [329] Family and personal peace. [330] The Established Church of England, nearly all of whose ministers were Royalists. The Presbyterians were nearly all Republicans. [331] The dissenting ministers. [332] The Churchmen. [333] Of. [334] What we should call an assistant minister is still called a curate in the Church of England. 
[335] "I had not said this," etc., i.e., I would not have said this, but would rather have chosen, etc. [336] See Rev. vi. 8. [337] Moved away (into the country). [338] The duties of headboroughs differed little from those of the constables. The title is now obsolete. [339] Count. [340] "Must." In this sense common in Chaucer. The past tense, "should," retains something of this force. Compare the German sollen. [341] Otherwise known as theriac (from the Greek [Greek: thêriakos], "pertaining to a wild beast," since it was supposed to be an antidote for poisonous bites). This medicine was compounded of sixty or seventy drugs, and was mixed with honey. [342] Supply "died." [343] Supply "of." [344] Substitute "which." [345] Those. [346] A corruption of "benzoin," a resinous juice obtained from a tree that flourishes in Siam and the Malay Archipelago. When heated, it gives off a pleasant odor. It is one of the ingredients used in court-plaster. [347] This word should be omitted. [348] The "press gang" was a naval detachment under the command of an officer, empowered to seize men and carry them off for service on men-of-war. [349] Off Lowestoft, in 1665. Though the Dutch were beaten, they made good their retreat, and heavily defeated the English the next year in the battle of The Downs. [350] See Ps. lx. 11; cviii. 12. [351] Were. [352] See Exod. xiv., xv., and xvi. 1-3. [353] "H.F." is of course fictitious.
oclc-transitioning-2020 ---- Transitioning to the Next Generation of Metadata

OCLC RESEARCH REPORT

Transitioning to the Next Generation of Metadata
Karen Smith-Yoshimura, Senior Program Officer

© 2020 OCLC. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/

September 2020
OCLC Research, Dublin, Ohio 43017 USA
www.oclc.org
ISBN: 978-1-55653-167-5
DOI: 10.25333/rqgd-b343
OCLC Control Number: 1197990500
ORCID iD: Karen Smith-Yoshimura https://orcid.org/0000-0002-8757-2962
Please direct correspondence to: OCLC Research, oclcresearch@oclc.org
Suggested citation: Smith-Yoshimura, Karen. 2020. Transitioning to the Next Generation of Metadata. Dublin, OH: OCLC Research. https://doi.org/10.25333/rqgd-b343.

CONTENTS

Executive Summary
Introduction
The Transition to Linked Data and Identifiers
  Expanding the use of persistent identifiers
  Moving from “authority control” to “identity management”
  Addressing the need for multiple vocabularies and equity, diversity, and inclusion
  Linked data challenges
Describing “Inside-Out” and “Facilitated” Collections
  Archival collections
  Archived websites
  Audio and video collections
  Image collections
  Research data
Evolution of “Metadata as a Service”
  Metrics
  Consultancy
  New applications
  Bibliometrics
  Semantic indexing
Preparing for Future Staffing Requirements
  The culture shift
  Learning opportunities
  New tools and skills
  Self-education
  Addressing staff turnover
Impact
Acknowledgments
Appendix
Notes

FIGURES

FIGURE 1 “Changing Resource Description Workflows” by OCLC Research
FIGURE 2 Some 300 abbreviated author names for a five-page article in Physical Review Letters
FIGURE 3 Examples of some DOI and ARK identifiers
FIGURE 4 One Wikidata identifier links to other identifiers and labels in different languages
FIGURE 5 Excerpt from the survey results from the 2017 EDI survey of the Research Library Partnership
FIGURE 6 Responses to 2019 survey on challenges related to managing A/V collections
FIGURE 7 The OCLC ResearchWorks IIIF Explorer retrieves images about “Paris Maps” across CONTENTdm collections
FIGURE 8 Distribution of 465 Indigenous language codes in the Australian National Bibliographic Database
FIGURE 9 UK Hachette’s “River of Authors” generated from the British Library’s catalog metadata

EXECUTIVE SUMMARY

The OCLC Research Library Partners Metadata Managers Focus Group, first established in 1993, is one of the longest-standing groups within the OCLC Research Library Partnership (RLP), a transnational network of research libraries. The Focus Group provides a forum for administrators responsible for creating and managing metadata in their institutions to share information about topics of common concern and to identify metadata management issues. The issues raised by the Focus Group are pursued by OCLC Research in support of the RLP and inform OCLC products and services. This report, Transitioning to the Next Generation of Metadata, synthesizes six years (2015-2020) of OCLC Research Library Partners Metadata Managers Focus Group discussions and what they may foretell for the “next generation of metadata.” The firm belief that metadata underlies all discovery regardless of format, now and in the future, permeates all Focus Group discussions. Yet metadata is changing. Format-specific metadata management based on curated text strings in bibliographic records understood only by library systems is nearing obsolescence, both conceptually and technically. Innovations in librarianship are exerting pressure on metadata management practices to evolve as librarians are required to provide metadata for far more resources of various types and to collaborate on institutional or multi-institutional projects with fewer staff. This report traces how metadata is evolving and considers the impact this transition may have on library services, posing such questions as:
• Why is metadata changing?
• How is the creation process changing?
• How is the metadata itself changing?
• What impact will these changes have on future staffing requirements, and how can libraries prepare?
The future of linked data is tied to the future of metadata: the metadata that libraries, archives, and other cultural heritage institutions have created and will create will provide the context for future linked data innovations as “statements” associated with those links. The impact will be global, affecting how librarians and archivists will describe the inside-out and facilitated collections, inspiring new offerings of “metadata as a service,” and influencing future staffing requirements. Transitioning to the next generation of metadata is an evolving process, intertwined with changing standards, infrastructures, and tools. Together, Focus Group members came to a common understanding of the challenges, shared possible approaches to address them, and inoculated these ideas into other communities that they interact with.

INTRODUCTION

The OCLC Research Library Partners Metadata Managers Focus Group (hereafter referenced as the Focus Group),1 first established in 1993, is one of the longest-standing groups within the OCLC Research Library Partnership (RLP),2 a transnational network of research libraries. The Focus Group provides a forum for administrators responsible for creating and managing metadata in their institutions to share information about topics of common concern and to identify metadata management issues. The issues raised by the Focus Group are pursued by OCLC Research in support of the RLP and inform OCLC products and services. The firm belief that metadata underlies all discovery regardless of format, now and in the future, permeates all Focus Group discussions. Metadata provides the research infrastructure necessary for all libraries’ “value delivery systems,” fulfilling their community’s requests for information and resources. Metadata is crucial for transitioning to next generations of library and discovery systems.
Good metadata created today can easily be reused in a linked data environment in the future.3 As noted in the British Library’s Foundations for the Future: “Our vision is that by 2023 the Library’s collection metadata assets will be unified on a single, sustainable, standards-based infrastructure offering improved options for access, collaboration and open reuse.”4 Format-specific metadata management based on curated text strings in bibliographic records understood only by library systems is nearing obsolescence, both conceptually and technically. Innovations in librarianship are exerting pressure on metadata management practices to evolve as librarians are required to provide metadata for far more resources of various types and to collaborate on institutional or multi-institutional projects with fewer staff. “Traditional methods of metadata generation, management and dissemination,” suggests the British Library’s Collection Management Strategy, “are not scalable or appropriate to an era of rapid digital change, rising audience expectations and diminishing resources.”5 Focus Group members are eager to unleash the power of metadata in legacy records for different interactions and uses by both machines and end-users in the future. Consistent metadata created according to past rules or standards needs to be transformed into new structures.

Why is metadata changing? Traditional library metadata was and is made by librarians conforming to rules that are mainly used and understood by librarians. It is record-centered, expensive to produce, and has historic size limitations. Metadata is limited in its coverage, notably not including articles within scholarly journals or other scholarly outputs.
The infrastructure has been inadequate for managing corrections and enhancements, inducing an emphasis on perfection that has exacerbated the slowness of metadata creation. In short, the metadata could be better, there is not enough of it, and the metadata that does exist is not used widely outside the library domain.

How is the creation process changing? Metadata is no longer created by library staff alone. Today, publishers, authors, and other interested parties are equally involved in metadata creation. Metadata creation has also been pushed forward in the scholarly life cycle, with publishers creating metadata records earlier than in the traditional cataloging process. Metadata can now be enhanced or corrected by machines or by crowdsourcing.

How is the metadata itself changing? Machine-readable cataloging (MARC) was created to replicate the metadata traditionally found on library catalog cards. We are transitioning from MARC records to assemblages of well-coded and shareable, linkable components, with an emphasis on references, and we are eliminating anachronistic abbreviations not understood by machines. Instead of relying only on library vocabularies such as subject headings and coded lists, the developing assemblages can accommodate vocabularies created for specific domains, expanding the metadata’s potential audiences.

The Focus Group’s composition has fluctuated over time, and currently comprises representatives from 63 RLP Partners in 11 countries spanning four continents.6 The group includes both past and incoming chairs of the Program for Cooperative Cataloging (PCC),7 providing cross-fertilization between the two.
Topics for group discussions can be proposed by any Focus Group member and are selected by an eight-member Planning Group (see appendix), who then write “context statements” explaining why the topic is considered timely and important and then develop question sets that delve into the topic. Context statements and question sets are then distributed to all Focus Group members who are given three to five weeks to submit their responses. Compilations of the Focus Group’s responses inform face-to-face discussions held in conjunction with the American Library Association conferences8 and in subsequent virtual meetings. As the Focus Group facilitator, I have summarized and synthesized these discussions in a series of OCLC Research Hanging Together Blog publications.9 Nearly 40 blog posts on a wide range of metadata-related topics have been published on this forum over the past six years. The Metadata Managers Focus Group is just one activity within the broader OCLC Research Library Partnership, which is devoted to extensive professional development opportunities for library staff. Focus Group members value their affiliation with the Research Library Partnership as a channel to becoming the “change agents” of future metadata management.10 Focus Group members’ responses to question sets have facilitated intra-institutional discussions and helped metadata managers understand how their institutions’ situation compares with peers within the Partnership. These Focus Group discussions identified a broad range of metadata-related issues, documented in this report. Transitioning to the next generation of metadata is an evolving process, intertwined with changing standards, infrastructures, and tools. Together, Focus Group members came to a common understanding of the challenges, shared possible approaches to address them, and inoculated these ideas into other communities that they interact with.
Collectively, Focus Group members command a wide range of experiences with linked data. The Focus Group’s keen interest in linked data implementations sparked the series of OCLC Research’s International Linked Data Surveys for Implementers.11 A subset of Focus Group members have participated in various linked data projects, including the OCLC Research Project Passage and CONTENTdm Linked Data pilot, OCLC’s Shared Entity Management Infrastructure, Library of Congress’ Bibliographic Framework Initiative (BIBFRAME), the Mellon-funded Linked Data for Production (LD4P) project, the Share-VDE initiative, and the IMLS planning grant Shareable Local Name Authorities, which exposed issues raised by identifier hubs in the linked data environment.12 In addition, Focus Group members contribute to the PCC task groups addressing aspects of linked data work, including the PCC Task Group on Linked Data Best Practices, Task Group on Identity Management, Task Group on URIs in MARC, and the PCC Linked Data Advisory Committee.13 This cross-fertilization has prompted the Focus Group to examine issues around the entities represented in institutional resources. 
This report synthesizes six years (2015-2020) of OCLC Research Library Partners Metadata Managers Focus Group discussions and what they may foretell for the “next generation of metadata.” The document is organized in the following sections, each representing an emerging trend identified in the Focus Group’s discussions:
• The transition to linked data and identifiers: expanding the use of persistent identifiers as part of the shift from “authority control” to “identity management”
• Describing the “inside-out” and “facilitated” collections: challenges in creating and managing metadata for unique resources created or curated by institutions in various formats and shared with consortia
• Evolution of “metadata as a service”: increased involvement with metadata creation beyond the traditional library catalog
• Preparing for future staffing requirements: the changing landscape calls for new skill sets needed by both new professionals entering the field and seasoned catalogers
The document concludes with some observations on the forecasted impact of the next generation of metadata on the wider library community.

The Transition to Linked Data and Identifiers

Linked data offers the ability to take advantage of structured data with an emphasis on context. It relies on language-neutral identifiers pointing to objects, with a focus on “things” replacing the “strings” inherent in current authority and catalog records. These identifiers can then be connected to related data, vocabularies, and terms in other languages, disciplines, and domains, including nonlibrary domains. Linked data applications can consume others’ contributions and thus free metadata specialists from having to re-describe things already described elsewhere, allowing them instead to focus on providing access to their institutions’ unique and distinctive collections.
This promises a richer user experience and increased discoverability with more contextual relationships than is possible with our current systems. Furthermore, linked data offers an opportunity to go beyond the library domain by drawing on information about entities from diverse sources.14

FIGURE 1. “Changing Resource Description Workflows” by OCLC Research15 (https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-overview.html)

The hope is that linked data will allow libraries to offer new, value-added services that current models cannot support, that outside parties will be able to make better use of library resource descriptions, and that the data will be richer because more parties share in its creation. Moving to a linked data environment portends changes to resource description workflows, as shown in figure 1. The drive to move metadata operations to linked data depends on the availability of tools, access to linked data sources for reuse, documented best practices on identifiers and the metadata descriptions associated with them (“statements”), and a critical mass of implementations on a network level.

EXPANDING THE USE OF PERSISTENT IDENTIFIERS

The Focus Group discussed the “future-proofing” of cataloging, which refers to the opportunities to unleash the power of metadata in legacy records for different interactions and uses in the future. Persistent identifiers were viewed as crucial to transitioning from current metadata to future applications.16 Identifiers, in the form of language-neutral alphanumeric strings, serve as a shorthand for assembling the elements required to uniquely describe an object or resource. They can be resolved over networks with specific protocols for finding, identifying, and using that object or resource. In the nonlibrary domain, Social Security and employee numbers are examples of such identifiers.
In the library and academic domains, Focus Group members pointed to ORCID (Open Researcher and Contributor ID)17 as a “glue” that holds together the four arms of scholarly work: publishing, repository, library catalog, and researchers. ORCID, however, is limited to living researchers. ORCID is increasingly used in STEM (science, technology, engineering, mathematics) journals for all authors and contributors18 and included in institutions’ Research Information Management systems. ISNI (International Standard Name Identifier)19 uniquely identifies persons and organizations involved in creative activities used by libraries, publishers, databases, and rights management organizations, and it covers nonliving creators. Persistent identifiers are used by parties such as Google and HathiTrust for service integration.20 More institutions are using geospatial coordinates in metadata or URIs (Uniform Resource Identifiers) pointing to geospatial coordinates that support API (Application Programming Interface) calls to GeoNames,21 enabling map visualizations. Research institutions are also adopting person identifiers such as ORCID to streamline the collection of the institutional research record, usually through a Research Information Management system, as documented in the 2017 OCLC Research Report Convenience and Compliance: Case Studies on Persistent Identifiers in European Research Information Management.22 While publishers are key players in the metadata workstream, publisher data does not always meet library requirements. For example, publisher data for monographs usually does not include identifiers. The British Library is working with five UK publishers to add ISNIs23 to their metadata as a promising proof-of-concept for publishers and libraries working together earlier in the supply chain.
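As a small illustration of what makes an identifier such as ORCID machine-checkable, the sketch below validates the final check character of an ORCID iD. It assumes the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers; the function names are hypothetical and are not part of any ORCID library. The sample iD is the author's own, as printed on the title page of this report.

```python
import re


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD has 16 characters and a matching check character."""
    compact = orcid.replace("-", "").upper()
    # 15 ASCII digits followed by a digit or 'X' (hypothetical strictness choice).
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


if __name__ == "__main__":
    print(is_valid_orcid("0000-0002-8757-2962"))  # iD printed in this report
    print(is_valid_orcid("0000-0002-8757-2963"))  # last character altered
```

A syntactic check of this kind only catches typing or transcription errors; it says nothing about whether the iD has been assigned or to whom, which still requires resolving the identifier against the registry.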
The ability to batch load or algorithmically add identifiers in the future is on Focus Group members’ wish list. No single person identifier covers all use cases. Researchers’ names have been only partially represented in national name authority files that identify persons both living and dead. A sizable quantity of legacy names are represented only by text strings in bibliographic records. Authority records are created only by institutions involved in the PCC’s Name Authority Cooperative Program (NACO)24 or in national library programs. Even then, authority records are created selectively for certain headings or sometimes only when references are involved. The LC/NACO name authority file contained only 30% of the total names reflected in WorldCat’s bibliographic record access points (9 million LC/NACO records compared to the 30 million total names reported on the WorldCat Identities project page as of 2012).25 By 2020, this percentage decreased to 18%: 11 million LC/NACO authority records compared to 62 million in WorldCat Identities. These statistics illustrate that the number of names represented in bibliographic records is increasing more quickly than the number under authority control. Authority files focus on the “preferred form” of a name, which can vary depending on language, discipline, context, and time period. Scholars have objected to the very concept of a “preferred form,” as the name may be referred to differently depending on the context.26 When a name has multiple forms, historians need to know the provenance of each name following the citation practices commonly used in their field. An identifier linked to different forms of names, each associated with the provenance and context, could resolve this conundrum. Researcher names are just one example of a need unmet by current identifier systems. Institutions have been minting their own “local identifiers” to meet this need.
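The authority-coverage percentages quoted above (30% in 2012, 18% in 2020) follow directly from the record counts given in the text. The short sketch below reproduces that arithmetic and also compares the growth of the two files; the function and variable names are illustrative only, and the counts are the rounded figures from the report.

```python
def coverage_percent(authority_records: int, total_names: int) -> float:
    """Share of names in bibliographic records that have an authority record."""
    return 100 * authority_records / total_names


# Rounded counts quoted in the text.
naco_2012, names_2012 = 9_000_000, 30_000_000
naco_2020, names_2020 = 11_000_000, 62_000_000

coverage_2012 = coverage_percent(naco_2012, names_2012)
coverage_2020 = coverage_percent(naco_2020, names_2020)

# Relative growth of each file between the two dates.
naco_growth = 100 * (naco_2020 / naco_2012 - 1)
names_growth = 100 * (names_2020 / names_2012 - 1)

if __name__ == "__main__":
    print(f"Coverage 2012: {coverage_2012:.1f}%")
    print(f"Coverage 2020: {coverage_2020:.1f}%")
    print(f"Authority records grew {naco_growth:.0f}%; names grew {names_growth:.0f}%")
```

On these rounded figures the authority file grew by roughly a fifth while the pool of names roughly doubled, which is why coverage fell even though authority work continued.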
Use cases for local identifiers include registering all researchers on campus; representing entities that are underrepresented in national authority files such as authors of electronic dissertations and theses, performers, events, local place names, and campus buildings; identifying entities in digital library projects and institutional repositories; reflecting multilingual needs of the community; and supporting “housekeeping” tasks such as recording archival collection titles.27 Focus Group members’ consistent need to disambiguate names across disciplines and formats spurred the creation of the OCLC Research working group on Registering Researchers in Authority Files.28 The need to accurately record researchers’ institutional affiliations to reflect the institution’s scholarly output, to promote cross-institutional collaborations, and to lead to more successful recruitment and funding led to another working group on Addressing the Challenges with Organizational Identifiers and ISNI,29 which presented new data modeling of organizations that others could adapt for their own uses. The Research Organization Registry (ROR) has since been launched to develop an open, sustainable, usable, and unique identifier for every research organization in the world.30 Disambiguating names is the most labor-intensive part of authority work and will still be a prerequisite for assigning unique identifiers. Given the different name identifier systems already in use, libraries need a name reconciliation service. Authority work and algorithms based on text string matching have limits; the results will still need manual expert review. Tapping the expertise in user communities to verify if two identifiers represent the same person may help. Disambiguation is particularly difficult for authors or contributors listed in journal articles, where names are often abbreviated and there may be dozens or even hundreds of contributors.
For example, an article in Physical Review Letters—Precision Measurement of the Top Quark Mass in Lepton + Jets Final State—has approximately 300 abbreviated author names for a five-page article (figure 2).31 This exemplifies the different practices among disciplines. By contrast, other objects with many contributors such as feature films and orchestral recordings are usually represented by only a relative handful of the associated names in library legacy metadata.32 Such differences make creating metadata that is uniform, understandable, and widely reusable a challenge.

FIGURE 2. Some 300 abbreviated author names for a five-page article in Physical Review Letters
[Image of the article's first page: "Precision measurement of the top-quark mass in lepton+jets final states" (FERMILAB-PUB-14-123-E; arXiv:1405.1756v2 [hep-ex], 16 Jun 2014), with an author list running from V.M. Abazov to R. Yamada. Source: https://arxiv.org/pdf/1405.1756.pdf]

Abbreviated forms of author names on journal articles make it difficult—and often impossible—to match them to the correct authority form or an identifier, if it exists. Associating ORCIDs with article authors makes it easier to differentiate authors with the same abbreviated forms. Research Information Management (RIM) systems apply identity management for local researchers so that they are correctly associated with the articles they have written. Their articles are displayed as part of their profiles. (See for example, Experts@Minnesota or University of Illinois at Urbana-Champaign’s Experts research profiles.)33 For researcher identity management to work, individuals must create and maintain their own ORCIDs. Institutions have been encouraging their researchers to include an ORCID in their profiles. Researchers have greater incentives to adopt ORCID to meet national and funder requirements such as those of the National Science Foundation and the National Institutes of Health in the United States.34 Research Information Management Systems harvest metadata from abstract and indexing databases such as Scopus, Web of Science, and PubMed, each of which has its own person identifiers that help with disambiguation; they may also be linked to an author’s ORCID. Linked data could access information across many environments, including those in Research Information Systems, but would require accurately linking multiple identifiers for the same person to each other.
Some Focus Group members are performing metadata reconciliation work, such as searching for matching terms in linked data sources and adding their URIs to metadata records, as a necessary first step toward a linked data environment or as part of metadata enhancement work.35 Improving the quality of the data improves users’ experiences in the short term and will help with the transition to linked data later. Most metadata reconciliation is done on personal names, subjects, and geographic names. Sources used for such reconciliation include OCLC’s Virtual International Authority File (VIAF), the Library of Congress’s linked data service (id.loc.gov), ISNI, the Getty’s Union List of Artist Names (ULAN), Art and Architecture Thesaurus (AAT), and Thesaurus of Geographic Names (TGN), OCLC’s Faceted Application of Subject Terminology (FAST), and various national authority files. Selection of a source depends on the trustworthiness of the organization responsible, subject matter, and richness of the information. Such metadata reconciliation work is labor intensive and does not scale well. Some members of the Focus Group have experimented with obtaining identifiers (persistent URIs from linked data sources) to eventually replace their current reliance on text strings. Institutions concluded that it is more efficient to create URIs in authority records at the outset rather than reconcile them later. The University of Michigan has developed an LCNAF Named Entity Reconciliation program36 using OpenRefine (formerly Google Refine) that searches the VIAF file with the VIAF API for matches, looks for Library of Congress source records within a VIAF cluster, and extracts the authorized heading. This results in a dataset pairing the authorized LC Name Authority File heading with the original heading and a link to the URI of the LCNAF linked data service. This service could be modified to bring in the VIAF identifier instead; it gets fair results even though it uses string matching.
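The reconciliation steps described above (match a heading against a cluster, look for the Library of Congress source record in that cluster, extract the authorized heading and its URI) can be illustrated with a minimal Python sketch. It is not the University of Michigan program and does not call the VIAF API: the cluster data, the identifiers, the example.org URIs, and the function names `normalize` and `reconcile` are all hypothetical stand-ins, and the matching is plain normalized string comparison.

```python
import unicodedata

# Hypothetical, hand-made stand-in for VIAF clusters. A real workflow would
# query VIAF; these identifiers and URIs are placeholders, not real records.
SAMPLE_CLUSTERS = [
    {
        "viaf_id": "viaf-0001",
        "headings": [
            {"source": "LC", "heading": "Example, Ada, 1900-1980",
             "uri": "https://example.org/lc/n0001"},
            {"source": "DNB", "heading": "Example, Ada",
             "uri": "https://example.org/dnb/0001"},
        ],
    },
    {
        "viaf_id": "viaf-0002",
        "headings": [
            {"source": "BNF", "heading": "Sample, Jean",
             "uri": "https://example.org/bnf/0002"},
        ],
    },
]


def normalize(heading):
    """Fold case, strip diacritics and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", heading)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)
    return " ".join(kept.casefold().split())


def reconcile(original_heading, clusters):
    """Pair an original heading with the LC heading of the first matching cluster.

    Returns None when no cluster matches or the matching cluster has no LC record.
    """
    key = normalize(original_heading)
    for cluster in clusters:
        if not any(normalize(h["heading"]) == key for h in cluster["headings"]):
            continue
        for h in cluster["headings"]:
            if h["source"] == "LC":
                return {
                    "original": original_heading,
                    "authorized": h["heading"],
                    "uri": h["uri"],
                    "viaf_id": cluster["viaf_id"],
                }
    return None


if __name__ == "__main__":
    print(reconcile("example, ada", SAMPLE_CLUSTERS))
    print(reconcile("Sample, Jean", SAMPLE_CLUSTERS))  # no LC record in cluster
```

Because the match is purely string based, two different people with the same normalized heading would be conflated, which is why results from this kind of process still need expert review.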
A long list of nonlibrary sources that could enhance current authority data or could be valuable to link to in certain contexts has been identified. Wikidata and Wikipedia led the list. Other sources include: AllMusic, author and fan sites, Discogs, EAC-CPF (Encoded Archival Context for Corporate Bodies, Persons, and Families), EAD (Encoded Archival Description), family trees, GeoNames, Goodreads, IMDb (Internet Movie Database), Internet Archive, LibraryThing, LinkedIn, MusicBrainz, ONIX (ONline Information eXchange), Open Library, ORCID, and Scopus ID. The document from the PCC’s Task Group on URIs in MARC, Formulating and Obtaining URIs: A Guide to Commonly Used Vocabularies and Reference Sources,37 provides valuable guidance for collecting data from these other sources. Wikidata is viewed as an important source for expanding the language range and providing multilingual metadata more easily than with current library systems.38

Identifiers for “works” represent a particular challenge, as there is no consensus on what represents a “distinctive work.”39 Local work identifiers cannot be shared or reused. Focus Group members voiced concern that differing interpretations of what a “work” is could hamper the ability to reuse data created elsewhere and look to a central trusted repository like OCLC to publish persistent Work Identifiers that could be used throughout the community. Identifiers need to be both unchanging over time and independent of where the digital object is or will be stored. For instance, identifiers for data sets such as digital resources and collections in institutional repositories include system-generated IDs, locally minted identifiers, PURL handles, DOIs (Digital Object Identifiers), URIs, URNs, and ARKs (Archival Resource Keys). A few examples of DOI and ARK Identifiers are shown in figure 3. Resources can have both multiple copies and versions that change over time.
Institutional repositories used as collaborative spaces can lead to multiple publications from the same data sets, a problem compounded by self-deposits from coauthors at different institutions into different repositories. Furthermore, libraries (as well as funders and national assessment efforts) want to be able to link related pieces (such as preprints, supplementary data, and images) with the publication. Multiple DOIs pointing to the same object pose a problem. Some libraries use DataCite or Crossref to mint and publish unique, long-term identifiers and thus minimize the potential for broken citation links.40 Ideally, libraries would contribute to a hub for the metadata describing their researchers’ data sets regardless of where the data sets are stored.

FIGURE 3. Examples of some DOI (left) and ARK (right) identifiers 41

MOVING FROM “AUTHORITY CONTROL” TO “IDENTITY MANAGEMENT”

The emphasis in authority work is shifting from construction of text strings to identity management—differentiating entities, creating identifiers, and establishing relationships among entities.42 The intellectual work required to differentiate names is the same for both current authority work and identity management. Focus Group members agree that the future is in identity management and getting away from “managing text strings” as the basis of controlling headings in bibliographic records.43 But identity management poses a change in focus, from providing access points in resource descriptions to describing the entities in the resource (work, persons, corporate bodies, places, events) and establishing the relationships and links among them.
The transition from “authority control” and “authorized access points” in our legacy systems to identity management requires us to separate identifiers from their associated labels. A unique identifier could be associated with an aggregate of attributes that would enable users to distinguish one entity from another.44 Ideally, libraries could take advantage of the identifiers and attributes from other, nonlibrary sources. Wikidata, for example, aggregates a variety of identifiers as well as labels in different languages, as shown in figure 4.

FIGURE 4. One Wikidata identifier links to other identifiers and labels in different languages (Wikidata identifier Q19526, https://www.wikidata.org/wiki/Q19526)

Providing contextual information is more important than providing one unique label. Labels could differ depending on communities—such as various spellings of names and terms, different languages and writing systems, and different disciplines—without requiring that one form be preferred over another. Label preference becomes localized rather than homogenized for global use. A key barrier to moving from text strings to identity management is the lack of technology and infrastructure to support it. New tools are needed to index and display information about the entities described with links to the sources of the identifiers. Since multiple identifiers may point to the same entity, tools to reconcile them will also be needed. Some systems index only the controlled access points, which is a problem when dealing with names represented in different languages.
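One way to picture the separation of identifiers from labels is the following minimal Python sketch. It is an illustration under stated assumptions, not a description of any existing library system: the `Entity` class, the `ex:E1` identifier, the external identifier values, and the function names are hypothetical. Every label, in any language, is indexed to the same identifier, and the label to display is chosen locally from an ordered list of preferred languages.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity keyed by an identifier, with no single preferred label."""
    identifier: str
    labels: dict = field(default_factory=dict)   # language tag -> label
    same_as: dict = field(default_factory=dict)  # scheme -> external identifier

    def label_for(self, preferred_languages):
        """Return the first available label; fall back to the identifier."""
        for lang in preferred_languages:
            if lang in self.labels:
                return self.labels[lang]
        return self.identifier


def build_label_index(entities):
    """Map every label, in every language, to the identifiers it can denote."""
    index = {}
    for entity in entities:
        for label in entity.labels.values():
            index.setdefault(label.casefold(), set()).add(entity.identifier)
    return index


# Hypothetical entity; the identifiers below are placeholders.
city = Entity(
    identifier="ex:E1",
    labels={"en": "Vienna", "de": "Wien", "fr": "Vienne"},
    same_as={"scheme-a": "a-123", "scheme-b": "b-456"},
)

if __name__ == "__main__":
    index = build_label_index([city])
    print(index["wien"])                  # the same entity as index["vienna"]
    print(city.label_for(["de", "en"]))   # a German-first catalog shows "Wien"
    print(city.label_for(["ja"]))         # no Japanese label: show the identifier
```

Because the identifier is the match point, indexing all labels rather than only one controlled access point addresses the problem noted above with names represented in different languages.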
Can library systems be reconfigured to deal with identifiers as the match point, collocation point, and the key to whatever associated labels are displayed and indexed?45 Some Focus Group members are experimenting with Wikidata as another option to assign identifiers for names not represented in authority files, which would broaden the potential pool of contributors.46 Many libraries are looking toward Wikidata and Wikibase—the software platform underlying Wikidata—to solve some of the long-standing issues faced by technical services departments, archival units, and others.47 Wikidata/Wikibase are viewed as a possible alternative to traditional authority control and have other potential benefits such as embedded multilingual support and bridging the silos describing the institution’s resources. Focus Group members’ experimentations with Wikidata and OCLC projects using the Wikibase platform indicate that Wikibase is a plausible framework for realizing linked data implementations. This infrastructure could enable the Focus Group and the wider bibliographic and archival communities to focus on the entities that need to be created, their relationship with each other, and how best they can increase discoverability by end-users. Because Wikidata was originally seeded by drawing data from Wikipedia, representation of books in Wikidata has a focus on “works” and their authors. This focus on works and authors could be viewed as an alternate version of the traditional author/title entries in authority files. Books that are “notable” are more likely to be represented in Wikidata.
WikiCite,48 a recent effort to support citations in Wikipedia articles, demonstrates a need to register and support the identifiers that make up those citations, including information about a specific edition or document. One of the most practical—and powerful—aspects of identity management is to reduce the amount of copying/pasting in library metadata workflows when an identifier is stewarded in an external location. Identifiers could provide a bridge between MARC and non-MARC environments and to nonlibrary resources. Librarians would not have to be the experts in all domains.49 Many resources curated or managed by libraries are not under authority control, such as digital and archival collections, institutional repositories, and research data. Identifiers could provide links to these resources. Identity management could also bridge the variations of names found in journal articles, scholarly profile services, and library catalogs, transcending these now siloed domains. This bridge is a requirement to fulfill the promises of linked data.

ADDRESSING THE NEED FOR MULTIPLE VOCABULARIES AND EQUITY, DIVERSITY, AND INCLUSION

Concepts or subject headings are particularly thorny as terminology can differ depending on the time period and discipline. In some cases, terms may be considered pejorative, harmful, or even racist by some communities. Addressing language issues is important as libraries seek to develop relationships and build trust with marginalized communities. The issues around equity, diversity, and inclusion are complex; the vocabulary used in subject headings is just one aspect, and language-neutral identifiers represent one approach. The issue of supporting “alternate” subject headings came to the fore when the Library of Congress’ initial solution to change the LC subject heading for “Illegal aliens” to “Undocumented immigrants” failed to be implemented.
This prompted one Focus Group member to comment, “Being held hostage to a national system slow to change in the face of changing semantics is damaging to libraries, as generally we pride ourselves on being welcoming and inclusive.” End-users hold their libraries accountable for what appears in their catalogs. Although LCSH is the Library of Congress Subject Headings, it is used worldwide, sometimes losing its context.50 Some see Faceted Application of Subject Terminology (FAST)51 as a means to engage the community to mitigate the issues that have driven attempts to develop alternate subject headings for LCSH. A subset of the Focus Group has been applying FAST to records that would otherwise lack any subjects. FAST was originally developed by OCLC as a middle ground between totally-uncontrolled keywords at one end of the spectrum and difficult-to-learn-and-apply precoordinated subject strings at the other end.52 FAST headings provide an easy transition to a linked data environment, since each FAST heading has a unique identifier. As FAST headings are generated from Library of Congress precoordinated subject headings, they can also include the same terminology that some consider inappropriate or disrespectful. The recently launched FAST Policy and Outreach Committee53 represents FAST users to oversee community engagement, term contributions, and procedures and to recommend improvements.
Its vision statement reads: FAST will be a fully supported, widely adopted and community developed general subject vocabulary derived from LCSH with tools and services that serve the needs of diverse communities and contexts.54 Multiple overlapping and sometimes conflicting vocabularies already exist in legacy library data.55 For example, Focus Group members in New Zealand add terms from the Māori Subject Headings thesaurus (Ngā Upoko Tukutuku) to the same records as LC subject headings; Focus Group members in Australia add terms authorized in the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) Thesauri.56 There may be no satisfactory equivalences across languages. Different concepts in national library vocabularies cannot always be mapped unequivocally to English concepts. The multiyear MACS (Multilingual Access to Subjects)57 built relationships across three subject vocabularies: Library of Congress Subject Headings, the German GND integrated authority file, and the French RAMEAU (Répertoire d’autorité-matière encyclopédique et alphabétique unifié). It has been a labor-intensive process and is not known to be widely implemented.58 A growing percentage of data in institutions’ discovery layers comes from non-MARC, nonlibrary sources. Metadata describing universities’ research data and materials in Institutional Repositories is usually treated differently—and separately. How should institutions provide normalization and access to the entities described so users do not experience the “collision of name spaces” and ambiguous terms (or terms meaning different things depending on the source)? Synaptica Knowledge Solutions’ Ontology Management – Graphite tool59 to create and manage various types of controlled vocabularies seems promising in this context.
Focus Group members cited examples of established vocabularies or datasets that have become outdated or do not provide for local needs or sensibilities. Slow or unresponsive maintenance models for established vocabularies have tempted some to consider distributed models. High training thresholds to participate in current models have contributed to a desire for alternatives.60 Linked data could provide the means for local communities to prefer a different label for an established vocabulary’s preferred term for a concept or entity. One might reference a local description of a concept or entity not represented—or not represented satisfactorily—in established vocabularies or linked data sources. If these kinds of amendments and additions are made possible in a linked data environment, others could agree (or disagree) with the point of view by linking to the new resource. Such a distributed model for managing both terminology and entity description raises issues around metadata stability expectations, metadata interoperability, and metadata maintenance. How could a distributed model avoid people duplicating work on the same entity or concept? How would a distributed model record the trustworthiness of the contributors, or determine who would be allowed to contribute? Stability and permanence issues have been highlighted by the numerous vocabularies created for specific projects that, once funding ended, remain frozen in time. As one Focus Group member noted, “Nothing is sadder than a vocabulary that someone invented that was left to go stale.” Such examples provide a major reason for librarians wanting to rely on international authority files rather than on local solutions. They also exemplify the value of the Library of Congress taking on the entire cost of creating and maintaining LCSH.
The OCLC Research report on the findings from a 2017 survey of the Research Library Partnership on equity, diversity, and inclusion (EDI)61 spurred discussions on the complexity of embedding equity, diversity, and inclusion in controlled vocabularies in library catalogs.62 Educational institutions and libraries have undertaken EDI initiatives, and metadata departments have been struggling to support them. The excerpt from the EDI survey in figure 5 shows that metadata in library catalogs lags behind other areas in support of the institution’s EDI goals and principles.

FIGURE 5. Excerpt from the survey results from the 2017 EDI survey of the Research Library Partnership 63
[Bar chart, 0% to 100%: “What Areas Have You Changed or Plan to Change Due to Your Institution’s EDI Goals and Principles?” Areas: collection building; selecting materials for digitization; metadata in library catalogs; metadata in archival collections; metadata in digital collections; terminologies or vocabularies. Series: Changed; Plan to change.]

Focus Group members are eager to provide more detailed subject access than is currently offered by national subject heading systems, such as LCSH, which has more granularity for Western European places than for Southeast Asia and Africa. They see the need to offer more accurate and current terms and replace terms that reflect bias or are considered offensive with more neutral terms.

Challenges that Focus Group members identified in offering more respectful terminology in subject access for users:

• Discovery: Using other, less-offensive vocabularies locally can split collections in the library catalog, thus hampering discovery of all relevant materials.
• Lack of consensus: Focus Group members doubt that there can ever be complete consensus about any given text string. Terms that may be offensive to one community may not always be clear to others. (For example, “Dissident art” rather than “Non-conformist art.”)
• Speed: The process of changing standard subject headings can be very slow.
• Capacity: Changing headings in existing records can require a massive undertaking. Targeted access point maintenance occurs in the context of access point maintenance generally. For example, the Library of Congress recently changed the heading “Mentally handicapped” to “People with mental disabilities.” Implementing such changes in the catalog can involve a mix of automated, vended, and manual remediation methods, as well as decisions about resource allocation.64 Some noted it would be less labor-intensive to present a “cultural sensitivity” message as part of the search interface to alert users that terms and annotations they find in a catalog may reflect the creator’s attitude or the period in which the item was created and may be considered inappropriate today in some contexts.
• Sharing: Local vocabularies cannot be shared with other systems.
• Maintenance: Some who have tried to use local vocabularies more suitable for their context and communities found them too burdensome to maintain and abandoned them.
• Language barriers: The language of our controlled vocabularies may be exclusive to audiences who do not read that language. The Ohio State University Libraries has tried to address this by developing some non-Latin script equivalents of English subject terms.
• Classification: Current classification systems are apt to segregate ethnic groups. Rather than include them as part of an overall concept like history, education, or literature, they tend to be grouped together as one lump. As institutions store more publications off-site, the need to shelve materials together and have just one classification in a record has subsided, but few apply multiple classifications in one record.
Requirements for a distributed system that accommodates multiple vocabularies and could also support EDI converged around the need to support semantic relationships among different vocabularies. Communities of practice need a hub to aggregate and reconcile terms within their own domains. It was noted that different communities of practice might use terms that conflict with others’ terminologies or mean different things. The PCC Linked Data Advisory Committee’s Linked Data Infrastructure Models: Areas of Focus for PCC Strategies65 describes high-level functional requirements and a spectrum of models anticipated as cultural heritage institutions adopt linked data as a strategy for data sharing. The model must be both scalable and extensible, with the ability to accommodate the proliferation of new topics and terms symptomatic of the humanities and sciences and facilitate contributions by the researchers themselves. It needs to be flexible enough to coexist with other vocabularies. Replacing text strings with stable, persistent identifiers would facilitate using different labels depending on context. This would accommodate both different languages and scripts (and different spellings within a language, such as American vs. British English), as well as terms that are more respectful to marginalized communities. The 19 October 2017 OCLC Research Works in Progress webinar on “Decolonizing Descriptions: Finding, Naming and Changing the Relationship between Indigenous People, Libraries, and Archives”66 described the process launched by the Association for Manitoba Archives and the University of Alberta Libraries to examine subject headings and classification schemes and consider how they might be more respectful and inclusive of the experiences of Indigenous peoples. Expanding vocabularies to include those used in other communities requires building trust relationships. A model of “community contribution” for new terms and community voting could be more inclusive. 
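The requirement to support semantic relationships among different vocabularies can be sketched, very roughly, as a small mapping hub. The Python below is only an illustration: the vocabulary names, term identifiers, and the two relation types ("exact" and "close", loosely inspired by SKOS mapping properties, which the discussion above does not name) are assumptions. Terms linked by chains of exact mappings are treated as interchangeable, while close mappings are followed only when a caller explicitly asks for them.

```python
from collections import defaultdict, deque

# Hypothetical mappings contributed by different communities of practice.
# Each entry is (term identifier, relation, term identifier).
MAPPINGS = [
    ("vocabA:001", "exact", "vocabB:x17"),
    ("vocabB:x17", "exact", "vocabC:river"),
    ("vocabA:002", "close", "vocabB:x99"),
]


def build_graph(mappings, relations=("exact",)):
    """Build an undirected graph that uses only the chosen relation types."""
    graph = defaultdict(set)
    for left, relation, right in mappings:
        if relation in relations:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def equivalents(term, graph):
    """Return every term reachable from `term` through the graph's mappings."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(term)
    return seen


if __name__ == "__main__":
    strict = build_graph(MAPPINGS)
    loose = build_graph(MAPPINGS, relations=("exact", "close"))
    print(equivalents("vocabA:001", strict))  # terms linked by exact mappings
    print(equivalents("vocabA:002", strict))  # empty: only a close mapping exists
    print(equivalents("vocabA:002", loose))
```

Keeping the relation type on each mapping lets a hub record that two communities' terms are related without asserting that they mean the same thing.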
Libraries’ current “consensus environment” excludes a lot of people. Much metadata is currently created according to Western knowledge constructs, and systems have been designed around them. Communicating the history of changes and the provenance of each new or modified term would provide transparency that could contribute to the trustworthiness of the source. The edit history and discussion pages that are part of each Wikidata entity description are a possible model to follow. Requiring provenance as part of a distributed vocabulary model may help in creating an alternative environment that is more equitable, diverse, and inclusive.

LINKED DATA CHALLENGES

Identifiers and vocabularies are just two components required in the transition to linked data. A vital part of describing entities is the set of statements made about them. How will libraries resolve or reconcile conflicts between statements?67 New types of inconsistencies may appear that do not arise now, for example, different birthdates for the same person. The provenance of each statement becomes more critical. Even in the current environment, certain sources are more trusted and give catalogers confidence in their accuracy. Libraries often have a list of “preferred sources.”68 OCLC Research explored how libraries might apply Google’s “Knowledge Vault” to identify statements that may be more “truthful” than others in the 2015 “Works in Progress Webinar: Looking Inside the Library Knowledge Vault.”69 Focus Group members posited that aggregations such as WorldCat, the Virtual International Authority File (VIAF), and Wikidata may allow the library community to view statements from these sources with more confidence than others. Librarians could share their expertise by establishing the relationships between and among statements from different sources. But good linked data requires good metadata.
Administrators are well aware of the tension between delivering access to library collections in a timely manner and providing good quality description. The metadata descriptions must be full enough to allow libraries to manage their collections and to support accessibility and discoverability for the end-user. Many libraries need to compromise, choosing speed over accuracy, speed over depth, or brevity over nothing. These compromises are reflected by using inadequate vendor records, by creating minimal or less-than-full level descriptions for certain types of resources, and by limiting authority work. Minimal-level cataloging is commonly used as an alternative to leaving materials uncatalogued, often because of the large volume of materials and insufficient staff resources.70 These less-than-full descriptions will result in fewer and less accurate linked data statements.

The transition period from legacy cataloging systems reliant on MARC to a new linked data environment with entities and statements has many challenges since both standards and practices are moving targets. It is unclear how libraries will share statements rather than records in a linked data environment. Focus Group members were divided on whether a centralized linked data store would be needed to provide “trustworthy provenance” or whether data should be distributed with peer-to-peer sharing.71 Different statements might be correct in their own contexts. “Conflicting statements” might represent different world views. Selecting statements based on provenance could be challenging to our principles of equity, diversity, and inclusion. The Focus Group members wondered how to involve the many vendors that supply or process MARC records in the transition to linked data.
In the United Kingdom, the Jisc initiative “Plan M” (where “M” stands for “metadata”) seeks to streamline the metadata supply among libraries, publishers, data suppliers, and infrastructure providers.72 Among the implications cited by stakeholders in the UK’s National Bibliographic Knowledgebase (NBK) in Plan M’s 10-year vision: “Linked data instances of the NBK will need to be created and maintained requiring convincing business-cases around the impact this could have on research.”73 Working with others in the linked data environment involves people unfamiliar with the library environment, requiring metadata specialists to explain what their needs are in terms nonlibrarians can understand.

Describing “Inside-Out” and “Facilitated” Collections

OCLC Vice President and Chief Strategist, Lorcan Dempsey, refers to the shifting emphasis of libraries to support the creation, curation, and discoverability of institutional resources as the “inside-out collection” (in contrast to the “outside-in collection,” in which the library buys or licenses materials from external providers to make them accessible to a local audience). Providing access to a broader range of local, external, and collaborative resources around user needs is the “facilitated collection.”74 Focus Group members’ activities have increasingly focused on metadata that will provide access to the resources unique to their institutions as well as those in their consortia or national networks. All resources collected, created, and curated by libraries require metadata to make them discoverable. However, Focus Group members concentrated on the challenges and issues related to specific formats:
• Archival collections
• Archived websites
• Audio and video collections
• Image collections
• Research data

All these content types can be categorized as belonging to “inside-out” collections and present different challenges.
For example, Focus Group members described efforts to retrieve metadata from completely different systems as “super challenging.” In addition, many of these resources are not under any authority control. Reconciling access points from various thesauri and metadata mapping work requires technical services expertise and skills.75 This reconciliation also will be needed in the previously discussed linked data environment. This section summarizes the discussions on these format types.

ARCHIVAL COLLECTIONS

Archival collections are in many ways the crown jewels of collections as they are unique research resources providing insights into the world across many centuries and places, providing the primary sources for new knowledge creation. Increasing visibility for these collections reaps significant benefits for both scholars and libraries and archives. Archives are, however, complex and present different metadata issues compared with traditional library collections. As institutions turn to ArchivesSpace and other content management systems to provide infrastructures for structured archival metadata, various issues are emerging.76 Archives have had more autonomy than libraries within their institutions because they have unique collections with their own population of users, their own metadata standards, and their own systems. While some institutions have integrated archival processing within technical services, most maintain a separate unit. Archivists do not have the tradition of creating authority records and sharing identifiers for the same entity as is common among librarians. They also tend to use the fullest form of a name based on the information found in collections, while librarians focus on the “preferred” form found in publications. Even so, a significant shift from artisanal archival approaches to metadata standardization has been occurring.
So how can archivists and librarians better integrate their metadata and name authority practices? The number of personal names in archival collections can be so large that most are uncontrolled and without identifiers. However, the contextual information that archivists provide for person and organization entities could enrich the information provided in authority files—a use case that was explored in the 2017-2018 Project Passage pilot77 and examined in more detail in 2019-2020 by the OCLC Research Library Partners Archives and Special Collections Linked Data Review Group.78 The increased reliance on electronic and digital resources during the COVID-19 pandemic will likely accelerate institutions digitizing their archival and distinctive collections that have been available only in physical form.79 More metadata may be created from digitized versions of these resources.

ARCHIVED WEBSITES

For some years, archives and libraries have been archiving web resources of scholarly or institutional interest to ensure their continuing access and long-term survival. Some websites are ephemeral or intentionally temporary, such as those created for a specific event. Institutions would like to archive and preserve the content of their websites as part of their historical record. A large majority of web content is harvested by web crawlers, but the metadata generated by harvesting alone is considered insufficient to support discovery.80 Some archived websites are institutional, theme-based collections supporting a specific research area such as Columbia University’s Human Rights, Historic Preservation and Urban Planning, and New York City Religions.81 National libraries archive websites within their national domain.
For example, the National Library of Australia’s Archived websites (1996-now)82 collects websites in partnership with cultural institutions around Australia, government websites formerly accessible through the Australian Government Web Archive, and websites from the .au domain collected annually through large scale crawl harvests. These curated collections by subject provide snapshots of Australian cultural and social history. Examples of consortia-based archived websites include the Ivy Plus Libraries Confederation’s Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY) and Contemporary Composers Web Archive (CCWA) and the New York Art Resources Consortium (NYARC), which captures dynamic web-based versions of auction catalogs and artist, gallery, and museum websites.83

The Focus Group discussed the challenges for creating and managing the metadata needed to enhance machine-harvested metadata from websites. Some of the challenges identified:
• Type of website matters. Descriptive metadata requirements may depend on the type of website archived (e.g., transient sites, research data, social media, or organizational sites). Sometimes only the content of the sites is archived when the user experience of the site (its “look-and-feel”) is not considered significant.
• Practices vary. Some characteristics of websites are not addressed by existing descriptive rules such as RDA (Resource Description and Access) and DACS (Describing Archives: A Content Standard). Metadata tends to follow bibliographic description traditions or archival practice depending on who creates the metadata.
• Consider scale and projected use. Metadata requirements may differ depending on the scale of material being archived and its projected use. For example, digital humanists look at web content as data and analyze it for purposes such as identifying trends, while other users merely need individual pages. The level of metadata granularity (collection, seed/URL, document) may also vary based on anticipated user needs, scale of material being crawled, and available staffing.
• Update frequency. Many websites are updated repeatedly, requiring re-crawling when the content has changed. Some types of change can result in capture failures.
• Multi-institutional websites. Some websites are archived by multiple institutions. Each may have captured the same site on different dates and with varying crawl specifications. How can they be searched and used in conjunction with one another?

A 2015 survey of the OCLC Research Library Partnership revealed the “lack of descriptive metadata guidelines” as the biggest challenge related to website archiving, leading to the formation of the OCLC Research Library Partnership Web Archiving Metadata Working Group.84 The challenges that the Focus Group identified were explored in depth by this working group, which issued a report of its recommendations in 2018, Descriptive Metadata for Web Archiving.85

AUDIO AND VIDEO COLLECTIONS

Focus Group members reported that their institutions had repositories filled with large amounts of audiovisual (A/V) materials, which often represent unique, local collections.86 However, as Chela Scott Weber states in the publication Research and Learning Agenda for Archives, Special, and Distinctive Collections in Research Libraries, “For decades, A/V materials in our collections were largely either separated from related manuscript material (often shunted away to be dealt with at a later date) or treated at the item level. Both have served to create sizeable backlogs of un-quantified and un-described A/V materials.”87 Much of this audiovisual material urgently requires preservation, digitization, clarification of conditions of use, and description. In addition, the needed skill sets and stakeholders across institutions are complex.
Managing A/V resources requires knowledge of the use context as well as of technical metadata issues, which makes for a complex environment in which to think through requirements for description and access. Further, libraries must deal with current time-based media, whether produced locally as part of research and learning or streamed under commercial license. Focus Group discussions focused on the A/V resources within archival collections—often in deteriorating formats, in large backlogs, and sometimes requiring rare and expensive equipment to access and assess the files. For locally generated content, institutions prefer that the creators describe their own resources. The overarching challenge was how much effort needs to be invested in describing these A/V materials because they are unique. Institutions have used hierarchical structures to aggregate similar materials with finding aids that are marked up in the Encoded Archival Description standard,88 which provides useful contextual information for individual items within a specific collection. But often an aggregated approach to description can lack important details about individual items needed for discovery, such as transcribed title and date broadcast. This is a particularly acute issue for legacy data describing recordings from years past. Metadata describing the same A/V materials may differ across library, archival, and digital asset management systems. Some hope that better discovery layers will alleviate the need to repeat the same information across databases, but presenting the information to users would require using consistent access points across systems. The same will be true in a linked data environment.
But the challenge of linking items to the finding aid, and of maintaining those links over time despite changes in systems, will remain. Metadata for A/V materials needs to include important technical information, such as details about the A/V capture and digitization process like compression, year digitized, the technology used, and file compatibility. This data is critical to ensure perpetual access for such enormous files and mercurial playback formats. Some Focus Group members have implemented PREMIS (Preservation Metadata: Implementation Strategies),89 the international standard for metadata to support the preservation of digital objects and ensure their long-term usability, for some of their A/V materials. OCLC Senior Program Officer Chela Scott Weber continues working with the Research Library Partnership on the needs and challenges of managing A/V collections, summarized in OCLC Research Hanging Together Blog posts: “Assessing Needs of AV in Special Collections” and “Scale & Risk: Discussing Challenges to Managing A/V Collections in the RLP.”90 A subset of the Focus Group members responded to Weber’s 2019 survey to assess the needs of audiovisual materials in special collections within the Research Library Partnership; incorporating A/V collections into archival and digital collections workflows were two of the challenges that most interested respondents, as shown in figure 6.

FIGURE 6. Responses to 2019 survey on challenges related to managing A/V collections. Survey question: “What Challenges Related to Managing A/V Collections Would You Be Interested in the RLP Addressing?” (n=137). Categories charted: resource allocation, assessment, and prioritization; digital asset management and preservation; physical collection management; digitization and preservation reformatting; incorporating into digital collection workflows; incorporating into archival workflows; selection, appraisal, and collection development. Response scale: very interested, interested, somewhat interested, not interested.

IMAGE COLLECTIONS

Focus Group members manage a wide variety of image collections presenting challenges for metadata management. In some cases, image collections that developed outside the library and its data models need to be integrated with other collections or into new search environments. Depending on the nature of the collection and its users, questions arise concerning identification of works, depiction of entities, chronology, geography, provenance, genre, subjects (“of-ness” and “about-ness”). Image collections also offer opportunities for crowdsourcing and interdisciplinary research.91 Many libraries describe their digital image resources on the collection level while selectively describing items. As much as possible, enhancements are done in batch. Some do authority work, depending on the quality of the accompanying metadata. Some libraries have disseminated metadata guidelines to help bring more consistency to the data. Among the challenges discussed by the Focus Group:
• Variety of systems and schemas: Image collections created in different parts of the institution such as art or anthropology departments serve different purposes and use different systems and schemas than those used by the library. The metadata often comes in spreadsheets or unstructured accompanying data. Often, the metadata created by other departments requires much editing, massaging, and manual review. The situation is simpler when all digitization is handled through one centralized location and the library does all the metadata creation.
Some libraries are using Dublin Core for their image collections’ metadata and others are using MODS (Metadata Object Description Schema).92 Some wrap the metadata records in METS (Metadata Encoding and Transmission Standard),93 a schema maintained by the Library of Congress designed to express the hierarchical nature of digital library objects, the names and locations of the files that comprise those objects, and the associated metadata. Some suggested that MODS be used in conjunction with MADS (Metadata Authority Description Schema).94
• Duplicate metadata for different objects: Metadata for a scanned set of drawings may be identical, even though there are slight differences in those drawings. Duplicating the metadata across similar objects is likely due to limited staff. Possibly the faculty or the photographers could add more details.
• Lack of provenance: A common challenge is receiving image collections with scanty metadata and with no information regarding their provenance. For example, metadata staff at one institution were given OCR’ed text retrieved by a researcher from HathiTrust. Millions of images lacked the location of the original source material, which limited—if not discredited—any further use.
• Maintaining links between metadata and images: How should libraries store images and keep them in sync with the metadata? There may be rights issues from relying on a specific platform to maintain links between metadata and images. Where should thumbnails live?
• Relating multiple views and versions of same object: Multiple versions of the same object taken over time can be very useful for disciplines like forensics. For example, Brown University decided to describe a “blob” of various images of the same thing in different formats and then describe the specific versions included.
This work was done even though there is no system yet that displays relationships among images, such as components of a piece, even when the metadata in records are wrapped and stored in METS.
• Managing relationships with faculty and curators: It is important to ensure that faculty feel their needs are met. Collaboration is necessary among holders of the materials, metadata specialists, and developers as all come from different perspectives. The challenge is to support both a specific purpose and groups of people as well as large-scale discovery.
• Aggregating digital collections: Institutions have been sharing the metadata for their digital collections with both national and international discovery services. Within individual organizations, librarians create and recreate metadata for digital and digitized resources in a plethora of systems—the library catalog, archive management, digital asset and preservation systems, the institutional repository, research management systems, and external subscription-based repositories. Targets for sharing this metadata range from tailored topic-based digital discovery services to national and international aggregations such as Google Scholar, HathiTrust, Digital Public Library of America (DPLA), Internet Archive, Trove, and WorldCat to online exhibitions such as Google Arts and Culture or image banks such as Flickr or Unsplash. Such aggregations can help inform an institution’s own collection development, as librarians can see their contributions in the context of others’ content and identify gaps that they may wish to fill locally.95 Aggregators often have different guidelines and input formats. Aggregators’ very reasonable contention that they cannot support many variations in submitted metadata conflicts with contributors’ very reasonable contention that they cannot support the different needs of a wide range of aggregators.
Disseminating corrections or updates between the source and the aggregation can be problematic. Information that may have been corrected in the chain leading to incorporation in the aggregation may not be pushed back to the source, so that the same errors must be corrected repeatedly. It is often not clear what data elements have been updated, when, or by whom. Aggregating images and bringing together different images or versions of the same object was the goal of the 2012-2013 OCLC Research Europeana Innovation Pilots,96 which developed a method for hierarchically structuring cultural objects at different similarity levels to find “semantic clusters”— those that include terms with a similar meaning. In 2017, OCLC implemented the International Image Interoperability Framework (IIIF)97 Presentation Manifest protocol in its CONTENTdm digital content management system, an aggregation containing more than 70 million digital records contributed by over 2,500 libraries worldwide. In 2019 OCLC Research developed an IIIF Explorer experimental prototype for testing and evaluation that searches across all the CONTENTdm images using the IIIF Presentation Manifest protocol,98 as shown in figure 7. Aggregating content across IIIF-compliant systems may facilitate discovery across the plethora of platforms containing digital content mentioned above. In 2020, OCLC Research launched the CONTENTdm Linked Data Pilot,99 focused on developing scalable methods and approaches to produce machine-readable representations of entities and relationships and make visible the connections formerly invisible. Existing record-based metadata is being converted to linked data by replacing strings of characters with identifiers from known authority files and local library-defined vocabularies; the resulting graphs of entities and relationships can retrieve contextual information from sources such as GeoNames and Wikidata.
This pilot (to be completed by August 2020) is addressing many of the above challenges identified by the Focus Group.

FIGURE 7. The OCLC ResearchWorks IIIF Explorer retrieves images about “Paris Maps” across CONTENTdm collections (https://researchworks.oclc.org/iiif-explorer/search?q=paris%20maps)

RESEARCH DATA

Research funders expect that the research data resulting from research they support will be archived and made available to others. Institutions have allotted more resources to collecting and curating this scholarly resource for reuse within the scholarly record. OCLC Research Scientist Ixchel Faniel’s two-part blog entry “Data Management and Curation in 21st Century Archives” (Sept 2015)100 prompted the discussion among Focus Group members on the metadata needed for research data management.101 To maximize the chances that metadata for research data are shareable (that is, sufficiently comparable) and helpful to those considering reusing the data, our communities would benefit from sharing ideas and discussing plans to meet emerging discovery needs. Metadata is important for both discovery and reuse of datasets. The 2016 OCLC Research report Building Blocks: Laying the Foundation for a Research Data Management Program noted:

“Datasets are useful only when they can be understood. Encourage researchers to provide structured information about their data, providing context and meaning and allowing others to find, use and properly cite the data. At minimum, advise researchers to clearly tell the story of how they gathered and used the data and for what purpose. This information is best placed in a readme.txt file that includes project information and project-level metadata, as well as metadata about the data itself (e.g., file names, file formats and software used, title, author, date, funder, copyright holder, description, keywords, observation unit, kind of data, type of data and language).”102

All four of the 2017-2018 The Realities of Research Data Management series webinars103 led by OCLC Senior Program Officer Rebecca Bryant mention the importance of metadata. Research information infrastructure calls on many of the key strengths of the library profession. Metadata is fundamental to our complex research environment—beginning with the planning our researchers do before and during the creation of data; to managing the data; then to disseminating the knowledge gained; finally through to understanding the impact, engagement, and the resulting reputation of our home institutions.104 Libraries’ expertise in metadata standards, identifiers, linked data, and data sharing systems as well as technical systems can be invaluable to the research life cycle. Faniel highlighted this value in the November 2019 Next blog post “Let’s Cook Up Some Metadata Consistency”:

“[C]ataloging for discovery using terms and definitions that are consistent across repositories is critical, if we want the data and their associated metadata to be discoverable for reuse in any way imaginable. . . . Librarians and archivists can help create consistencies in metadata that build bridges between researchers and repositories, thus greatly increasing the discovery, reuse, and value of their institutions’ research investments.”105

National contexts differ.
For example, our Australian colleagues can take advantage of Australia’s National Computational Infrastructure for big data and the Australian Data Archive for the social sciences.106 Canada has launched a national network called Portage for the “shared stewardship of research data.”107 Some institutions have developed templates to capture metadata in a structured form. Some Focus Group members noted the need to keep such forms as simple as possible as it can be difficult to get researchers to fill them in. All agreed data creators needed to be the main source of metadata. But what will inspire data creators to produce quality metadata? New ways of training and outreach are needed, an area of exploration within Metadata 2020’s Research Communications project.108 Focus Group members generally agreed on the data elements required to support reuse: licenses, processing steps, tools, data documentation, data definitions, data steward, grant numbers, and geospatial and temporal data (where relevant). Metadata schemas used include Dublin Core, MODS (Metadata Object Description Schema), and DDI (Data Documentation Initiative’s metadata standard). The Digital Curation Centre in the UK provides a linked catalog of metadata standards.109 The Research Data Alliance’s Metadata Standards Directory Working Group has set up a community-maintained directory of metadata standards for different disciplines.110 The disparity of metadata schemas across disciplines represents a hurdle in institutions’ discovery layers.

The importance of identifiers for both the research data and the data creator(s) has become more widely acknowledged. DOIs, Handles, and ARKs (Archival Resource Keys) have been used to provide persistent access to datasets.
Identifiers are available at the full data set level and for component parts, and they can be used to track downloads and potentially help measure impact. Both ORCID and ISNI are in use to identify data creators uniquely, and work is continuing on the Research Organization Registry to address institutional affiliations. Among the most critical issues identified by Focus Group members is that metadata specialists need to be more involved in the early stages of the research life cycle. Researchers need to understand the importance of metadata in their data management plans. The lack of “metadata governance” across an institution makes integrating workflows between repositories and discovery layers problematic. Some Focus Group members have started to analyze the metadata requirements for the research data life cycle, not just the final product, asking questions like: Who are the collaborators?111 How do various projects use different data files? What kind of analysis tools do they use? What are the relationships of data files across a project, between related projects, and to other scholarly output such as related journal articles? Research support services such as those offered at the University of Michigan112 are being developed to assist researchers during all phases of the research data life cycle, often through collaboration with other campus units. Some libraries have started to provide research data management support in a variety of ways.
For example, metadata specialists work with their institutions’ Scholarly Communications and Publishing Division, which also manages the Institutional Repository. These institutional repositories may have only the “citation” or “metadata-only” records with a link to the full text or data set deposited in a disciplinary repository. “Metadata consultation services” may be provided to advise on the data management plan, which includes appropriate metadata standards and controlled vocabularies, a strategy to effectively organize their data, and an approach that will facilitate reuse of the data years after the research is completed. The OCLC Research The Realities of Research Data Management report series classifies metadata support as part of the “expertise” function, and flags some variations in its case studies.113 At the University of Illinois at Urbana-Champaign, metadata consultants help researchers with metadata regardless of where the research data is deposited; Monash University supports metadata curation only for local deposits.114

Communication is key for researchers to understand the importance of metadata throughout the research life cycle. Some universities offer “research sprints” where researchers partner with a team of expert librarians whose support may include metadata creation, management, analysis, and preservation. The “Shared BigData Gateway for Research Libraries,” hosted by Indiana University and partially funded by the Institute of Museum and Library Services, is developing a cloud-based platform to share data and expertise across institutions, including datasets such as records from the US Patent and Trademark Office and the Microsoft Academic Graph.115 Curation of research data as part of the evolving scholarly record requires new skill sets, including deeper domain knowledge and experience with data modeling and ontology development.
Libraries are investing more effort in becoming part of their faculty’s research process and are offering services that help ensure that their research data will be accessible if not also preserved. Good metadata will help guide other researchers to the research data they need for their own projects, and the data creators will have the satisfaction of knowing that their data has benefitted others.116

Evolution of “Metadata as a Service”

Metadata underlies the ability to discover all resources in the inside-out and facilitated collections. Focus Group members anticipate more involvement with metadata creation beyond the traditional library catalog and new services that leverage both legacy and future metadata.

METRICS

Library strategic goals often include key phrases such as “foster discovery and use,” “enrich the user experience,” and “explore new ways to support the whole life cycle of scholarship,” all of which are predicated on quality metadata. Usage metrics—such as how frequently items have been borrowed, cited, downloaded, or requested—could be used to build a wide range of library services and activities.
Focus Group members identified some possible services: informing collection management decisions about weeding projects and identifying materials for offsite storage; evaluating subscriptions; comparing citations for researchers’ publications with what the library is not purchasing; and improving relevancy ranking, personalizing search results, offering recommendation services in the discovery layer, and measuring the impact of library usage on research or student success or learning analytics.117 The University of Minnesota conducted a study to investigate the relationships between first-year undergraduate students’ use of the academic library, academic achievement, and retention.118 The results suggest a strong correlation between using academic library services and resources (particularly database logins, book loans, electronic journal logins, and library workstation logins) and higher grade point averages. In the United Kingdom, the Jisc Library Impact Data Project found a similar correlation.119
CONSULTANCY
Metadata’s value is demonstrated by integrating it into the fabric of both the library and other units across the campus. For example, metadata specialists can provide “metadata as a service,” that is, consultancy in the earliest stages of both library and research projects.120 An emerging trend is for digital humanities departments to request advice from metadata specialists on metadata standards and how to use controlled vocabularies. This metadata consultant role is becoming more visible in recent library job postings.
In one Metadata Librarian job posting at Cornell,121 20% of the duties were assigned to “metadata outreach and consultation”: “Maintains strong working relationships and communicates regularly with staff across Cornell, fostering collaborative efforts between Metadata Services and the greater Cornell community.” Georgia Tech is recruiting a metadata librarian who will “serve as a metadata consultant to larger library projects/initiatives. Work closely with other Library departments, Emory University Libraries, GALILEO, University System of Georgia Libraries, and other partners involved in joint projects.”122
NEW APPLICATIONS
The shared and consistent use of MARC fields supports new applications. Libraries currently use identifiers in bibliographic records to fetch tables of contents, abstracts, reviews, and cover images and to generate floor maps of where to locate resources in a specific classification range (such as in OCLC’s integration with StackMap).123 Bibliographic metadata is used to populate Digital Asset Management Systems and Institutional Repositories and, with tools such as Tableau and OpenRefine, can enable a richer analysis and view of collections. MARC metadata is connecting scholars with the bibliographic data for their projects and can generate relationships to related resources with applications such as Yewno.124 MARC metadata is also being used to inform institutional output measures and affiliation tracking and serves as a source to build organization histories. The provenance implicit in an institution’s bibliographic metadata has proven helpful in documenting theft cases. Data mining of catalog data can also be used to enrich the metadata, such as generating language codes missing in related records or identifying the original titles of translated works.
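The enrichment idea mentioned above, generating language codes that are missing in related records, can be illustrated with a minimal sketch. It is not taken from the report: the record layout, the `cluster_id` field (standing in for whatever links related records, such as a shared identifier), and the `lang_inferred` flag are all hypothetical, and a real workflow would read these values from MARC fields with a dedicated library.

```python
from collections import defaultdict


def fill_missing_language_codes(records):
    """Return copies of the records with missing 'lang' values filled in.

    Assumption: records that share a 'cluster_id' are related closely
    enough to share a language. A code is inferred only when every
    related record that has a code agrees on it; ambiguous clusters
    are left untouched for human review.
    """
    # Collect the language codes already present in each cluster.
    codes_by_cluster = defaultdict(set)
    for rec in records:
        if rec.get("lang"):
            codes_by_cluster[rec["cluster_id"]].add(rec["lang"])

    enriched = []
    for rec in records:
        updated = dict(rec)  # never mutate the caller's data
        known = codes_by_cluster.get(rec["cluster_id"], set())
        if not rec.get("lang") and len(known) == 1:
            updated["lang"] = next(iter(known))
            updated["lang_inferred"] = True  # keep provenance of the guess
        enriched.append(updated)
    return enriched


if __name__ == "__main__":
    sample = [
        {"id": "r1", "cluster_id": "c1", "lang": "fre"},
        {"id": "r2", "cluster_id": "c1", "lang": ""},
        {"id": "r3", "cluster_id": "c2", "lang": "eng"},
        {"id": "r4", "cluster_id": "c2", "lang": "ger"},
        {"id": "r5", "cluster_id": "c2", "lang": ""},
    ]
    for rec in fill_missing_language_codes(sample):
        print(rec)
```

Marking inferred values separately keeps machine-generated codes distinguishable from cataloger-supplied ones, so they can be reviewed or rolled back later.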
MARC data has also supported generating subject maps to discover relationships otherwise not explicit in the cataloging metadata.125 Visualizations represent another type of metadata service. A striking example is from the Austlang national code-a-thon held in 2019 to identify items in Indigenous Australian languages, a collaboration among the National Library of Australia, the Australian Institute of Aboriginal and Torres Strait Islander Studies, Trove, Libraries Australia, and the State and Territory libraries.126 Figure 8 shows the results: a map indicating the 465 Indigenous languages in the Australian National Bibliographic Database tagged as a result of the code-a-thon, and an example of involving the community to enhance bibliographic metadata.
FIGURE 8. Distribution of 465 Indigenous language codes in the Australian National Bibliographic Database. https://www.nla.gov.au/our-collections/processing-and-describing-the-collections/Austlang-national-codeathon
BIBLIOMETRICS
Library metadata is also being used for bibliometrics, the application of statistical methods to analyze books, articles, and other publications. Using library metadata for Digital Humanities research projects has much potential. For example, a Library of Congress researcher used bibliographic metadata to trace the history of publishing and copyright; UCLA researchers have used cataloging metadata to track the commercialization of inventions such as insulin.
A novel use of cataloging metadata was by Hachette UK, the United Kingdom’s second largest bookseller, which commissioned the Graphic History Company to unlock the histories of all nine of Hachette’s publishing houses and weave them into a cohesive story by asking the British Library for every author and book title published by those houses over 250 years. The British Library provided a list of over 55,000 authors, from which 5,000 of the most prominent were selected to create perhaps the most beautiful example of metadata use: a giant mural spanning eight floors featuring all 5,000 authors in chronological order. (Figure 9 shows one part of the mural; for more images of the mural, see Hachette’s River of Authors.)127
FIGURE 9. Hachette UK’s “River of Authors” generated from the British Library’s catalog metadata
SEMANTIC INDEXING
When controlled vocabularies and thesauri are converted into linked open data and shared publicly, their traditional role of facilitating collection browsing will fade but could find a renewed purpose within web-based knowledge organization systems (KOS).128 As Marcia Zeng points out in Knowledge Organization Systems (KOS) in the Semantic Web: a multi-dimensional review, a KOS vocabulary is more than just the source of values to be used in metadata descriptions: by modeling the underlying semantic structures of domains, KOS act as semantic road maps and make possible a common orientation by indexers and future users, whether human or machine.129 Good examples of such repurposing are the Getty Vocabularies, which allow browsing of Getty’s representation of knowledge and also help users generate their own SPARQL queries that can be embedded in external applications.
Another example is Social Networks and Archival Context (SNAC),130 which enables browsing of entities and relationships independently of their collections of origin. In such cases, the discovery tool pivots to being person-centric (or family-centric, or topic-centric), rather than (only) collection-centric. Rather than one “global domain,” metadata specialists could provide added value by adding bridges from the metadata in library domain databases to other domains. Wikidata is an example of a platform aggregating entities from different sources and linking to more details in various language Wikipedias. Some institutions have employed Wikimedians in Residence to accelerate this process. Focus Group members hope that Artificial Intelligence, or at least machine learning, could reduce the manual effort currently needed to link names and concepts in research data. Perhaps algorithms could be used to match names based on related metadata or sources, relate topics to each other based on context, disambiguate names based on other metadata available, and analyze datasets to identify possible biases in a collection.131 A few Research Library Partners participate in Artificial Intelligence for Libraries, Archives & Museums (AI4LAM),132 an “international, participatory community focused on advancing the use of artificial intelligence in, for and by libraries, archives, and museums.”133 Some high-level recommendations on enhancing descriptions at scale and improving discovery are noted in Thomas Padilla’s OCLC Research 2019 position paper Responsible Operations: Data Science, Machine Learning, and AI in Libraries.134
Preparing for Future Staffing Requirements
The anticipated changes from transitioning to the next generation of metadata will also shift staffing requirements.
Focus Group members identified new skill sets needed for both professionals entering the field and seasoned catalogers, driven by the changing information technology landscape and increasing staff attrition. Focus Group members characterized professionals as those who “trail-blaze innovations,” which are then routinized for nonprofessionals. These discussions reinforce Padilla’s recommendations on investigating core competencies, committing to internal talent, and expanding evidence-based training.135
THE CULTURE SHIFT
Focus Group members reported a delicate balance between allocating staff to “traditional cataloging activities” (such as original and copy cataloging, authority work) and more exploratory R&D projects, such as linked data projects, exploring new data models and technologies such as Wikidata, and learning about emerging standards and identifiers. A culture shift is needed: from pride in production alone to valuing opportunities to learn, explore, and try new approaches to metadata work. Metadata specialists must understand that improving all metadata is more important than any individual’s productivity numbers. This culture shift requires buy-in from administrators to support training programs for staff to learn new workflows for processing multiple formats and to view metadata specialists as more than just “production machines.” Metadata managers faced with staff reductions while still being expected to maintain production levels must justify allocating staff time for R&D (or “play time”) to explore such questions as: What can we stop doing? What is the one thing you learned that we all need to do more of? What do you need to move forward? What open source software could help us do the work more efficiently? What new methods could enhance discoverability, access, and use of our facilitated collections?
Managers must incorporate goals for success that are not based solely on numbers.136 Indications of this culture shift include institutions outsourcing some metadata work or training support staff to create metadata for the “easier stuff” while mandating that catalogers only do what well-trained humans can do. Metadata managers could scope the materials requiring metadata that support staff or students can handle, providing templates where possible. Once these tasks are removed, the majority of what remains requires highly skilled metadata specialists with expertise in languages, physical formats, and disambiguating and describing persons, organizations, and other entities.
LEARNING OPPORTUNITIES
To encourage metadata specialists to change their mindsets about how they work and to stimulate interest in learning opportunities, Focus Group members have used several approaches:
• Identify who on your team has the aptitude to acquire new skills. At one institution, the staff member shared what she learned and the whole unit became “lively” because she brought her colleagues along. It created appreciation for “continuous learning” and staff presented their activities at national conferences.
• Convene cross-team group discussions to look at problem metadata and come up with solutions, encouraging staff to move forward together. Staff less interested in new skills can pick up some of the production from those learning new skills and producing less.
• Launch “reading clubs” where staff all read an article and respond to three discussion questions to inspire metadata specialists to think about broader metadata issues outside of their daily work.
• Hold weekly group “video-viewing brown-bag lunches” for staff on new developments such as linked data so staff can “watch and learn” together.
• Participate in multi-institutional projects to collaborate with peers to solve problems and cross-pollinate ideas.
• Encourage participation in professional conferences and standards development.
Educating and training catalogers has been at the forefront of many discussions in the metadata community. Both new professionals and seasoned catalogers need new skills to successfully transition to the emerging linked data environment. Catalogers are learning about and experimenting with BIBFRAME while remaining responsible for traditional bibliographic control of collections. Metadata specialists utilize tools for metadata mapping, remediation, and enhancement. They identify and map semantic relationships among assorted taxonomies to make multiple thesauri intelligible to end users. For the more technical aspects of metadata management, competition for talent from other industries has been increasing. This may intensify as metadata becomes more central to various areas of government, nonprofit, and private enterprise.137
NEW TOOLS AND SKILLS
The extent of metadata specialists’ collaboration with IT or systems staff varies among institutions. Such collaboration is necessary for many reasons, including managing data that is outside the library’s control. Some noted that “cultural differences” exist between the professions: developers tend to be more dynamic and focus on quick prototyping and iteration, while librarians focus first on documenting what is needed and are more “schematic.” Which is more likely to be successful: teaching metadata specialists IT skills or teaching IT staff metadata principles? The “holy grail” is to recruit someone with an IT background interested in metadata services. Retaining staff with IT skills is difficult because they are in demand for higher-paying jobs in the private sector.
Focus Group members’ experiences have shown that it is easier for librarians to learn programming skills than it is to hire IT specialists to learn the “technical services mindset.” Ideally, Focus Group members would like a few staff who have the technical skills to take batch actions on data, or at least who know how to use the external tools available to automate as many tasks as possible. For many years, Focus Group members have been using MarcEdit and/or other tools such as OpenRefine, scripts (e.g., Python, Ruby, or Perl), and macros for metadata reconciliation and batch processing.138 MarcEdit is the most popular tool, and has a large, global, and active user community as indicated in its 2017 Usage Snapshot.139 Terry Reese, MarcEdit’s developer, estimates that about one-third of all users work in non-MARC environments and two-thirds of the most active users are OCLC members. Focus Group members reported that they use MarcEdit for data transformation, enhancing vendor records, building MARC records from spreadsheets, linked data reconciliation, de-duplicating records within a file, merging two or more records into one, Z39.50 harvesting, and reconciling metadata before sending records to other systems. The 2017 release of MarcEdit 7 includes new features such as lightweight clustering functionality, providing a powerful way to find relationships between data without introducing a large learning curve. It also has mechanisms that support linked data.140 Reese has created a series of YouTube tutorials available on his MarcEdit Playlist.141 Managers want to focus less on specific schema and more on metadata principles that can be applied to a range of different formats and environments. Desirable soft skills include problem-solving, effective collaboration, willingness (even eagerness) to try new things, understanding researchers’ needs, and advocacy.
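As a rough illustration of the scripted batch work described above (de-duplicating records within a file and merging two or more records into one), the following Python sketch may help. It is not MarcEdit’s algorithm and does not use any MarcEdit API: the dictionary record layout, the `match_key` function, and the choice of normalized title plus year as the match point are assumptions made purely for this example.

```python
import re


def match_key(record):
    """Build a crude match key from a normalized title and the year.

    Assumption: title plus year is a good enough match point for this
    illustration. Production workflows usually rely on stronger keys,
    such as standard identifiers.
    """
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return (title, record.get("year"))


def deduplicate(records):
    """Collapse records sharing a match key, keeping the first record
    seen and filling its empty fields from later duplicates."""
    merged = {}
    for rec in records:
        key = match_key(rec)
        if key not in merged:
            merged[key] = dict(rec)  # copy so the input is not modified
            continue
        kept = merged[key]
        for field, value in rec.items():
            # Only fill gaps; never overwrite data already present.
            if value and not kept.get(field):
                kept[field] = value
    return list(merged.values())


if __name__ == "__main__":
    batch = [
        {"title": "Moby-Dick", "year": "1851", "isbn": ""},
        {"title": "Moby Dick.", "year": "1851", "isbn": "9780000000002"},
        {"title": "Typee", "year": "1846", "isbn": ""},
    ]
    for rec in deduplicate(batch):
        print(rec)
```

The first record in each group is treated as the one to keep, so the order of the input file decides which title form survives; a different merge policy would be an equally reasonable design choice.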
Although some metadata specialists have always enjoyed experimenting with new approaches, often they lack the time to learn new tools or methodologies while keeping up with their routine work assignments. Libraries should promote metadata as an exciting career option to new professionals in venues such as library schools and ALA’s New Members Roundtable. Emphasizing that metadata encompasses much more than library cataloging (entity identification; descriptive standards used in various academic disciplines; describing born-digital, archival, and research data that can interact with the semantic Web) can increase its appeal. As one Focus Group member noted, “We bring order out of a vacuum.”142
SELF-EDUCATION
Metadata increasingly is being created outside the library by academics and students with minimal training, leading to a need for more catalogers with record maintenance skills. Focus Group members noted the need for technical skills such as simple scripting, data remediation, and identity management to reconcile equivalents across multiple registries. Frequently mentioned sources of instruction include Library Juice Academy, MarcEdit tutorials, LinkedIn Learning (which acquired Lynda.com), Library of Congress Training Webinars, ALCTS Webinars, Codecademy, Software Carpentry, and conferences such as Code4Lib and Mashcat.143 W3C’s Data on the Web Best Practices and Semantic Web for the Working Ontologist were recommended reading.144 Crucial to the success of such training is the ability to quickly apply what has been learned. If new skills are not applied, people forget what they have learned. Staff feel frustrated when they have invested the time to learn something that they cannot use in their daily work. Focus Group members have seen a big shift from relying on Library of Congress instructions to self-education from multiple sources.
Some approaches mentioned by participants:
• Emphasize continuity of metadata principles when introducing an expanded scope of work.
• Take advantage of the Library Workflow Exchange,145 a site designed to help librarians share workflows and best practices across institutions, including scripts.
• From the 2017 Electronic Resources and Libraries Conference: “Don’t wait; iterate!” In other words, rather than waiting until staff have all the required skills, let them do tasks iteratively, learning as they go, so they are ready for new tasks when the time comes.
• Have small groups of metadata specialists take programming courses together, after which they can continue to meet and discuss ways to apply their new skills to automate routine tasks.
• Encourage staff to participate in events such as OCLC’s DevConnect Webinars146 to learn from libraries using OCLC APIs to enhance their library operations and services.
• Create reading and study groups that include cross-campus or cross-divisional staff.
• Expand the scope of current work to enable metadata specialists to apply their skills to new domains or terminology, such as using Dublin Core for digital collections. Involve staff in digital projects from the conceptual stage to developing project specifications, quality assurance practices and tool selection. As a bonus, this fosters collaborative teamwork relationships.
• Hire graduate students in computer science for short-term tasks such as creating scripts.
ADDRESSING STAFF TURNOVER
Turnover in a professional position within a cataloging or metadata unit now comes with the significant risk that it may be impossible to convince administrators to retain the position in the unit and repost it. This is particularly true when the outgoing incumbent performed a high proportion of “traditional” work, such as original cataloging in MARC.
The odds of retaining the position are much greater if careful thought goes into how the position could be reconfigured or re-purposed to meet emerging needs.147 Most Focus Group members have had to address varying amounts of turnover, either from retirements or staff leaving for other positions. Half of them needed to reconfigure the positions of outgoing librarians. Looking at what other institutions are advertising helps in creating an attractive position description. Many cataloging positions do not require an MLS degree, so recruiting professionals has focused on adaptability, aligning new positions with university priorities, and on eagerness to learn and take initiative in areas such as metadata for research output, open access, digital collections, and linked data. Mapping out future strategies and designing ways of making metadata interoperate across systems have been components of recent recruitments. New staff with programming skills are sought after, as they can apply batch techniques to metadata that can compensate for the loss of staff. Using technology in support of library services helps catalogers “do more with less.” Focus Group members want new staff to be aware of both the shared cataloging community and the overlaps with other cultural heritage organizations such as archives and museums. The library environment keeps evolving, and librarians have had to reflect on their priorities moving forward. Metadata managers need to rethink the roles of metadata specialists beyond “traditional” cataloging work. Potential candidates with more flexible skill sets have become more attractive than those with a traditional cataloging background who may not adapt well to working in new environments. Many cataloging roles and descriptions may need to be rewritten and retooled.
Perhaps the only activities that will perennially remain professional tasks are those like management, scouting new trends, strategizing, participating in new international standards, leading and implementing changes, and thinking about the big picture.
Impact
The next generation of metadata will become even more focused on entities rather than record-based descriptions of an institution’s collections. Focus Group members’ linked data activities, including their participation in OCLC Research’s Project Passage and CONTENTdm Linked Data pilots, contributed to OCLC obtaining Andrew W. Mellon funding for its two-year Shared Entity Management Infrastructure project,148 launched in January 2020. Eleven of the Shared Entity Management Infrastructure Advisory Group members are also Focus Group members. The project builds on OCLC Research’s linked data work, and will provide a production infrastructure with persistent, authoritative identifiers for persons and works. It will be largely API-based, allowing librarians to customize their workflows around linked data infrastructure. This infrastructure has long been desired by Focus Group members as it will address many of the challenges documented above around persistent identifiers, especially identifiers for “works.” Authoritative, persistent identifiers provided by the Shared Entity Management Infrastructure will supply the needed language-neutral links to trustworthy sources. The metadata that libraries, archives, and other cultural heritage institutions have created and will create will provide the context for these entities, as “statements” associated with those links.
The impact will be global, affecting how librarians and archivists will describe the inside-out and facilitated collections, inspiring new offerings of “metadata as a service,” and influencing future staffing requirements.
ACKNOWLEDGMENTS
OCLC Research wishes to thank all Research Library Partners Metadata Managers Focus Group members who have shared their experiences and thoughts summarized here. Additionally, we extend thanks to the dedicated Metadata Managers Planning Group, which initiated the topics and provided the context statements and question sets, the responses to which served as the basis of our discussions. In addition, we particularly appreciate the insightful comments from the following Focus Group members who reviewed an earlier version of this document; their comments improved this synthesis.
• Charlene Chou, New York University
• Suzanne Pilsk, Smithsonian Institution
• Greg Reeve, Brigham Young University
• Alexander Whelan, Columbia University
• Helen K. R. Williams, London School of Economics
I also extend thanks to current and former OCLC colleagues: Rebecca Bryant, Jody DeRidder, Annette Dortmund, Rachel Frick, Janifer Gatenby, Jean Godby, Shane Huddleston, Andrew Pace, Merrilee Proffitt, Nathan Putnam, Stephan Schindehette, and Chela Weber for their careful review of all or parts of earlier versions of this document. Thank you to Erica Melko for her editing, Jeanette McNicol for the design of this report, and JD Shipengrover for the cover artwork. On a personal note, I have greatly benefited from my interactions with the OCLC Research Partners Metadata Managers Focus Group and have been delighted to play a small part in this transition to the next generation of metadata.
APPENDIX
OCLC Research Library Partners Metadata Managers Planning Group 2015-2020
Planning Group members selected the topics for the OCLC Research Library Partners Metadata Managers discussions, wrote up the context statements explaining why each topic was important and timely, and developed the question sets that Focus Group members responded to. The Planning Group initiators for each topic also reviewed draft summaries that were later posted on the OCLC Research Hanging Together blog. Current Planning Group members are listed in bold; institutional affiliations are given for the time when they served on the Planning Group:
• Jennifer Baxmeyer, Princeton University
• Sharon Farnel, University of Alberta
• Steven Folsom, Harvard University and Cornell University
• Erin Grant, University of Washington
• Dawn Hale, Johns Hopkins University
• Myung-Ja Han, University of Illinois, Urbana-Champaign
• Kate Harcourt, Columbia University
• Corey Harper, New York University
• Stephen Hearn, University of Minnesota
• Daniel Lovins, Yale University
• Roxanne Missingham, Australian National University
• Chew Chiat Naun, Cornell University and Harvard University
• Suzanne Pilsk, Smithsonian
• John Riemer, University of California, Los Angeles
• Carlen Ruschoff, University of Maryland
• Philip Schreur, Stanford University
• Jackie Shieh, George Washington University
• Melanie Wacker, Columbia University
NOTES
1. OCLC Research Library Partnership Metadata Managers Focus Group. https://www.oclc.org/research/areas/data-science/metadata-managers.html.
2. OCLC Research. “The OCLC Research Library Partnership.” https://www.oclc.org/research/partnership.html.
3. Smith-Yoshimura. 2017. “Metadata Advocacy.” Hanging Together: The OCLC Research Blog, 17 October 2017. https://hangingtogether.org/?p=6282.
4. British Library. 2019.
Foundations for the Future: The British Library’s Collection Metadata Strategy 2019-2023. London: British Library. https://www.bl.uk/bibliographic/pdfs/british-library-collection-metadata-strategy-2019-2023.pdf.
5. Ibid, 4.
6. Statistics as of 1 June 2020.
7. Library of Congress. “Program for Cooperative Cataloging.” https://www.loc.gov/aba/pcc/.
8. Except for June 2020, when all discussions were held virtually only because of the COVID-19 pandemic.
9. See Hanging Together: The OCLC Research Blog, search-category Metadata. https://hangingtogether.org/?cat=81.
10. Benefits from affiliating with the RLP are cited in Smith-Yoshimura. 2018. “What Metadata Managers Expect from and Value about the Research Library Partnership,” Hanging Together: The OCLC Research Blog, 16 April 2018. https://hangingtogether.org/?p=6683.
11. Analyses of the three International Linked Data Surveys for Implementers 2014-2018 and the spreadsheet of all responses to the surveys are available. See OCLC Research. 2020. “Linked Data.” International Linked Data Survey. https://www.oclc.org/research/themes/data-science/linkeddata/linked-data-survey.html.
12. Godby, Jean, Karen Smith-Yoshimura, Bruce Washburn, Kalan Davis, Karen Detling, Christine Fernsebner Eslao, Steven Folsom, Xiaoli Li, Marc McGee, Karen Miller, Honor Moody, Holly Tomren, and Craig Thomas. 2019. Creating Library Linked Data with Wikibase: Lessons Learned from Project Passage. Dublin, OH: OCLC Research. https://doi.org/10.25333/faq3-ax08; OCLC Research. 2020. “CONTENTdm Linked Data pilot.” https://www.oclc.org/research/themes/data-science/linkeddata/contentdm-linked-data-pilot.html; OCLC. 2020. “WorldCat®: OCLC and Linked Data.” Shared Entity Management Infrastructure.
https://www.oclc.org/en/worldcat/linked-data/shared-entity-management-infrastructure.html; Library of Congress. “BIBFRAME.” Bibliographic Framework Initiative. https://www.loc.gov/bibframe/; Futornick, Michelle. 2019. “LD4P2 Linked Data for Production: Pathway to Implementation.” LD4P2 Project Background and Goals. Lyrasis. Posted 14 January 2019. https://wiki.lyrasis.org/display/LD4P2/LD4P2+Project+Background+and+Goals; Share-VDE (Share Virtual Discovery Environment). “An Effective Environment for the Use of Linked Data by Libraries.” Accessed 17 September 2019. https://www.share-vde.org/sharevde/clusters?l=en; Casalini, Michele, Chiat Naun Chew, Chad Cluff, Michelle Durocher, Steven Folsom, Paul Frank, Janifer Gatenby, Jean Godby, Jason Kovari, Nancy Lorimer, Clifford Lynch, Peter Murray, Jeremy Myntti, Anna Neatrour, Cory Nimer, Suzanne Pilsk, Daniel Pitti, Isabel Quintana, Jing Wang, and Simeon Warner. 2018. National Strategy for Shareable Local Name Authorities National Forum: White Paper.
Ithaca, New York: Cornell University Library eCommons digital repository. https://hdl.handle.net/1813/56343.
13. Library of Congress. 2019. PCC (Program for Cooperative Cataloging) Task Group on Linked Data Best Practices. 2019. PCC Task Group on Linked Data Best Practices Final Report: Submitted to PCC Policy Committee 12 September 2019. Washington DC: Library of Congress. https://www.loc.gov/aba/pcc/taskgroup/linked-data-best-practices-final-report.pdf; Library of Congress. 2018. “Charge for PCC Task Group on Identity Management in NACO,” 5. American Bar Association, Program for Cooperative Cataloging, revised 22 May 2018. https://www.loc.gov/aba/pcc/taskgroup/PCC-TG-Identity-Management-in-NACO-rev2018-05-22.pdf; Library of Congress. 2020. “PCC Task Group on URIs in MARC.” Programs of the PCC. Charge. Accessed 19 September 2020. https://www.loc.gov/aba/pcc/bibframe/TaskGroups/URI-TaskGroup.html; Library of Congress. 2018. “PCC Linked Data Advisory Committee: Linked Data Advisory Committee Charge.” PCC Task Groups 2018. Task Groups. Revised 24 July 2018. [Word doc; 28KB]. https://www.loc.gov/aba/pcc/taskgroup/task-groups.html.
14. Smith-Yoshimura, Karen. 2015. “Shift to Linked Data for Production.” OCLC Research Hanging Together Blog, 13 May 2015. https://hangingtogether.org/?p=5195.
15. OCLC Research. 2020. “Linked Data.” Linked Data Overview. https://www.oclc.org/research/areas/data-science/linkeddata/linked-data-overview.html. [All figures CC BY 4.0]
16. Smith-Yoshimura, Karen. 2019. “‘Future Proofing’ of Cataloging.” OCLC Research Hanging Together Blog, 10 November 2019. https://hangingtogether.org/?p=7526.
17. ORCID: Connecting Research and Researchers. “What is Orcid.” Our Vision. Accessed 19 September 2020. https://orcid.org/about/what-is-orcid/mission.
18. See for example the list of signatories of journal publishers requiring ORCID IDs for authors. ORCID. “ORCID Open Letter - Publishers.” Accessed 19 September 2020.
https://orcid.org/content/requiring-orcid-publication-workflows-open-letter.
19. ISNI. “What is ISNI.” Accessed 19 September 2020. https://isni.org/page/what-is-isni/.
20. HathiTrust is a not-for-profit collaborative of academic and research libraries preserving more than 17 million digitized items. See: HathiTrust Digital Library. “Welcome to HathiTrust.” Accessed 19 September 2020. https://www.hathitrust.org/about.
21. GeoNames. “Browse the Names.” Accessed 19 September 2020. https://www.geonames.org/.
22. Bryant, Rebecca, Annette Dortmund, and Constance Malpas. 2017. Convenience and Compliance: Case Studies on Persistent Identifiers in European Research Information. Dublin, OH: OCLC Research. https://doi.org/10.25333/C32K7M.
23. ISNI currently holds 11.02 million identities: 10.26 million individuals (of which 2.91 million are researchers) and 933,039 organizations.
Statistics retrieved from ISNI. See ISNI. “Key Statistics.” Accessed 5 May 2020. https://isni.org/.
24. Library of Congress. 2020. “NACO – Name Authority Cooperative Program.” Documents and Updates. Programs for Cataloging and Acquisitions (PCC). Accessed 19 September 2020. http://www.loc.gov/aba/pcc/naco/index.html.
25. Smith-Yoshimura, Karen. 2015. “Getting Identifiers Created for Legacy Names.” Hanging Together: The OCLC Research Blog, 30 October 2015. https://hangingtogether.org/?p=5463.
26. Smith-Yoshimura, Karen. 2013. “Irreconcilable Differences? Name Authority Control & Humanities Scholarship.” Hanging Together: The OCLC Research Blog, 27 March 2013. https://hangingtogether.org/?p=2621.
27. Smith-Yoshimura, Karen. 2017. “Use Cases for Local Identifiers.” Hanging Together: The OCLC Research Blog, 5 May 2017. https://hangingtogether.org/?p=5938.
28. OCLC Research. 2020. “Registering Researchers in Authority Files.” https://www.oclc.org/research/themes/research-collections/registering-researchers.html.
29. Smith-Yoshimura, Karen, Janifer Gatenby, Grace Agnew, Christopher Brown, Kate Byrne, Matt Carruthers, Peter Fletcher, Stephen Hearn, Xiaoli Li, Marina Muilwijk, Chew Chiat Naun, John Riemer, Roderick Sadler, Jing Wang, Glen Wiley, and Kayla Willey. 2016. Addressing the Challenges with Organizational Identifiers and ISNI. Dublin, Ohio: OCLC Research. https://doi.org/10.25333/C3FC9Q.
30. Research Organization Registry (ROR). “About.” https://ror.org/about/.
31. V. M. Abazov, B. Abbott, B. S. Acharya, M. Adams, T. Adams, J. P. Agnew, G. D. Alexeev et al. (2014) 2020. “Precision Measurement of the Top-Quark Mass in Lepton+jets Final States.” (Archived 24 February 2020) ArXiv.org: 1501.07912. https://arxiv.org/pdf/1405.1756.
32. Smith-Yoshimura, Karen. 2017. “How Much Metadata Is Practical?” Hanging Together: The OCLC Research Blog, 14 November 2017. https://hangingtogether.org/?p=6328.
33. University of Minnesota. 2020. “Experts@Minnesota.” Find Profiles.
https://experts.umn.edu/en/persons/ or University of Illinois at Urbana-Champaign. 2020. “Illinois Experts.” Find U of I Research, View Scholarly Works, and Discover New Collaborators. https://experts.illinois.edu/.
34. The National Institutes of Health (NIH): National Institute of Allergy and Infectious Diseases (NIAID) on 7 April 2020 mandates ORCIDs for training, fellowship, education, or career development awards in FY20. See NIH: NIAID. 2019. “ORCID iD: Required for Some, Encouraged for All.” NIAID Funding News. Last reviewed 7 August 2019. https://www.niaid.nih.gov/grants-contracts/orcid-id-required-some-encouraged-all; See also Lyrasis. 2020. “SciENcv and ORCID to Streamline NIH and NSF Grant Applications.” LyrasisNow (blog), 8 April 2020. https://lyrasisnow.org/sciencv-and-orcid-to-streamline-nih-and-nsf-grant-applications/.
35. Smith-Yoshimura, Karen. 2016. “Metadata Reconciliation.” Hanging Together: The OCLC Research Blog, 28 September 2016. https://hangingtogether.org/?p=5710.
36. Carruthers, Matt. (2014) 2020. mcarruthers/LCNAF-Named-Entity-Reconciliation. GitHub Repository. https://github.com/mcarruthers/LCNAF-Named-Entity-Reconciliation.
37. Deliot, Corine, Steven Folsom, Myung-Ja Han, Nancy Lorimer, Terry Reese, and Adam Schiff. 2019. Formulating and Obtaining URIs: A Guide to Commonly used Vocabularies and Reference Sources.
Library of Congress PCC Task Group on URIs in MARC. https://www.loc.gov/aba/pcc/bibframe/TaskGroups/formulate_obtain_URI_guide.pdf.
38. Smith-Yoshimura, Karen. 2019. “New Ways of Using and Enhancing Cataloging and Authority Records.” Hanging Together: The OCLC Research Blog, 2 April 2019. https://hangingtogether.org/?p=5710.
39. Smith-Yoshimura, Karen. 2015. “Persistent Identifiers for Local Collections.” Hanging Together: The OCLC Research Blog, 27 October 2015. https://hangingtogether.org/?p=5445.
40. DataCite. “Assign DOIs.” https://datacite.org/dois.html; Wilkinson, Laura J. 2020. “Constructing your DOIs.” Crossref: The Crossref Curriculum. Last updated 8 April 2020. https://www.crossref.org/education/member-setup/constructing-your-dois/.
41. See DOI examples in detail from: DOI. 2020. “DOI System Examples.” Accessed 20 September 2020. https://www.doi.org/demos.html; and see ARK examples in detail from: Department, Dallas (Tex.) Police. 1963. “[Photographs of Identification Cards].” Collection. University of North Texas. The Portal to Texas History digital repository. https://texashistory.unt.edu/ark:/67531/metapth346793/.
42.
“Identity management” here reflects its usage among metadata specialists (See, for example, Library of Congress. 2018. “Charge for PCC Task Group on Identity Management in NACO,” 5. American Bar Association, Program for Cooperative Cataloging. Revised 22 May 2018. https://www.loc.gov/aba/pcc/taskgroup/PCC-TG-Identity-Management-in-NACO-rev2018-05-22.pdf.) But the term has other meanings depending on the audience; for example, identity access management, as described in: Wikiwand. “Identity Management.” https://www.wikiwand.com/en/Identity_management.
43. Smith-Yoshimura, Karen. 2018. “The Coverage of Identity Management Work.” Hanging Together: The OCLC Research Blog, 8 October 2018. https://hangingtogether.org/?p=6805.
44. Smith-Yoshimura, Karen. 2017. “Beyond the Authorized Access Point?” Hanging Together: The OCLC Research Blog, 10 October 2017. https://hangingtogether.org/?p=6271.
45. Smith-Yoshimura, “Coverage of Identity Management.” (See note 43.)
46. Watch the highly-rated Webinar by Andrew Lih and Robert Fernandez. 2018. “Works in Progress Webinar: Introduction to Wikidata for Librarians: Structuring Wikipedia and Beyond.” Produced by OCLC Research, 12 June 2018. MP4 video presentation, 1:1:51. https://www.oclc.org/research/events/2018/06-12.html.
47. Smith-Yoshimura, Karen. 2020. “Experimentations with Wikidata/Wikibase.” Hanging Together: The OCLC Research Blog, 18 June 2020. https://hangingtogether.org/?p=8002.
48. Wikimedia. “WikiCite.” Home. https://meta.wikimedia.org/wiki/WikiCite.
49. Smith-Yoshimura, Karen. 2016. “Impact of Identifiers on Authority Workflows.” Hanging Together: The OCLC Research Blog, 22 March 2016. https://hangingtogether.org/?p=5603.
50. Smith-Yoshimura, Karen. 2019. “Strategies for Alternate Subject Headings and Maintaining Subject Headings.” Hanging Together: The OCLC Research Blog, 29 October 2019. https://hangingtogether.org/?p=7591.
51. OCLC 2020.
“FAST (Faceted Application of Subject Terminology).” https://www.oclc.org/en/fast.html.
52. Smith-Yoshimura, Karen. 2016. “Faceted Vocabularies.” Hanging Together: The OCLC Research Blog, 31 October 2016. https://hangingtogether.org/?p=5739.
53. OCLC 2020. “FAST.” (See note 51.)
54. OCLC 2020. “FAST (Faceted Application of Subject Terminology).” Heading #3, FAST Policy and Outreach (FPOC) Committee. https://www.oclc.org/en/fast.html.
55. Smith-Yoshimura, Karen. 2017. “Vocabulary Control Data in Discovery Environments.” Hanging Together: The OCLC Research Blog, 5 October 2017. https://hangingtogether.org/?p=6264.
56. National Library, New Zealand Government. “Ngā Upoko Tukutuku / Māori Subject Headings.” http://mshupoko.natlib.govt.nz/mshupoko/; AIATSIS Pathways: Gateway to the AIATSIS Thesauri. “Pathways.” http://www1.aiatsis.gov.au/.
57. Deutsche Nationalbibliothek. 2019. “MACS - Multilingual Access to Subjects.” (Archived 13 Jan 2019.) https://web.archive.org/web/20190113003823/https:/www.dnb.de/EN/Wir/Kooperation/MACS/macs_node.html.
58. Smith-Yoshimura, Karen. 2019. “Knowledge Organization Systems.” Hanging Together: The OCLC Research Blog, 17 March 2019. https://hangingtogether.org/?p=7135.
59. Synaptica. “Ontology Management – Graphite.” https://www.synaptica.com/graphite/.
60. Smith-Yoshimura, Karen. 2018. “Are Distributed Models for Vocabulary Maintenance Viable?” Hanging Together: The OCLC Research Blog, 12 April 2018. https://hangingtogether.org/?p=6672.
61. OCLC Research. 2020. “Equity, Diversity, and Inclusion in the OCLC Research Library Partnership Survey.” Overview. Accessed 20 September 2020. https://www.oclc.org/research/areas/community-catalysts/rlp-edi.html.
62. Smith-Yoshimura, Karen. 2018. “Creating Metadata for Equity, Diversity, and Inclusion.” Hanging Together: The OCLC Research Blog, 7 November 2018. https://hangingtogether.org/?p=6833.
63. Smith-Yoshimura. “Distributed Models.” (See note 60.)
64. Smith-Yoshimura, Karen. 2019. “Strategies for Alternate Subject Headings and Maintaining Subject Headings.” Hanging Together: The OCLC Research Blog, 29 October 2019. https://hangingtogether.org/?p=7591.
65. Baxmeyer, Jennifer, Karen Coyle, Joanna Dyla, MJ Han, Steven Folsom, Phil Schreur, and Tim Thompson. 2017. Linked Data Infrastructure Models: Areas of Focus for PCC Strategies. Library of Congress PCC Linked Data Advisory Committee. https://www.loc.gov/aba/pcc/documents/LinkedDataInfrastructureModels.pdf.
66. Bone, Christine, Sharon Farnel, Sheila Laroque, and Brett Lougheed. 2017. “Works in Progress Webinar: Decolonizing Descriptions: Finding, Naming and Changing the Relationship between Indigenous People, Libraries and Archives.” Produced by OCLC Research, 19 October 2017. MP4 video presentation, 54:35.00. https://www.oclc.org/research/events/2017/10-19.html.
67. Smith-Yoshimura, Karen. 2015. “Shift to Linked Data for Production.” Hanging Together: The OCLC Research Blog, 13 May 2015. https://hangingtogether.org/?p=5195.
68. Smith-Yoshimura, Karen. 2015. “Working in Shared Files.” Hanging Together: The OCLC Research Blog, 7 April 2015. https://hangingtogether.org/?p=5091.
69. Bruce Washburn and Jeff Mixter, 2018.
“Works in Progress Webinar: Looking Inside the Library Knowledge Vault.” Produced by OCLC Research, 12 August 2018. MP4 video presentation, 57:45:00. https://www.oclc.org/research/events/2015/08-12.html.
70. Smith-Yoshimura, Karen. 2019. “Systematic Reviews of Our Metadata.” Hanging Together: The OCLC Research Blog, 10 April 2019. https://hangingtogether.org/?p=7117.
71. Smith-Yoshimura, Karen. 2015. “Working in Shared Files.” Hanging Together: The OCLC Research Blog, 7 April 2015. https://hangingtogether.org/?p=5091.
72. Jisc Library Services. n.d. “What Is ‘Plan M’?” Accessed 21 September 2020. https://libraryservices.jiscinvolve.org/wp/2019/12/plan-m/; Smith-Yoshimura, Karen. 2020. “Knowledge Management and Metadata.” Hanging Together: The OCLC Research Blog, 9 April 2020. https://hangingtogether.org/?p=7845; For more information about the current phase of “Plan M” (May–November 2020), see Grindley, Neil. “Moving Plan M Forwards – We Need Your Help!” Library Services (PlanM) (blog), Jisc, 6 May 2020.
https://libraryservices.jiscinvolve.org/wp/2020/05/planm_nextphase/.
73. Grindley, Neil. 2019. “Plan M: Definition, Principles and Direction.” Jisc. (Word docx.) http://libraryservices.jiscinvolve.org/wp/files/2019/12/Plan-M-Definition-and-Direction-1.docx.
74. Dempsey, Lorcan. 2016. “Library Collections in the Life of the User: Two Directions.” LIBER Quarterly 26(4): 338–359. http://doi.org/10.18352/lq.10170.
75. Smith-Yoshimura, Karen. 2019. “Presenting Metadata from Different Sources in Discovery Layers.” Hanging Together: The OCLC Research Blog, 16 April 2019. https://hangingtogether.org/?p=7880.
76. Smith-Yoshimura, Karen. 2017. “Metadata for Archival Collections.” Hanging Together: The OCLC Research Blog, 30 May 2017. https://hangingtogether.org/?p=5903.
77. Godby, Jean, Karen Smith-Yoshimura, Bruce Washburn, Kalan Knudson Davis, Karen Detling, Christine Fernsebner Eslao, Steven Folsom, Xiaoli Li, Marc McGee, Karen Miller, Honor Moody, Craig Thomas, and Holly Tomren. 2019. Creating Library Linked Data with Wikibase: Lessons Learned from Project Passage, 49-51. Dublin, OH: OCLC Research. https://doi.org/10.25333/faq3-ax08.
78. The OCLC Research Library Partnership Archives and Special Collections Linked Data Review Group is described at https://www.oclc.org/research/partnership/working-groups/archives-special-collections-linked-data-review.html.
79. Smith-Yoshimura, Karen. 2020. “Metadata Management in Times of Uncertainty.” Hanging Together: The OCLC Research Blog, 15 June 2020. https://hangingtogether.org/?p=7998.
80. Smith-Yoshimura, Karen. 2016. “Metadata for Archived Websites.” Hanging Together: The OCLC Research Blog, 14 March 2016. https://hangingtogether.org/?p=5591.
81. Archive-It. 2008. “Human Rights.” Columbia University Libraries Collection. (Archived May 2008). https://archive-it.org/collections/1068; Archive-It. 2010. “New York City Places and Spaces.” Columbia University Libraries Collection. (Archived January 2010).
https://archive-it.org/collections/1757; Archive-It. 2010. “Burke Library New York City Religions.” Columbia University Libraries Collection. (Archived May 2010). https://archive-it.org/collections/1945.
82. NLA. “Trove.” Archived Websites. Sub Collections. Accessed 20 September 2020. https://trove.nla.gov.au/website.
83. Archive-It. 2014. “Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY).” Ivy Plus Libraries Confederation Collection. (Archived June 2014.) https://archive-it.org/collections/4638; Archive-It. 2013. “Contemporary Composers Web Archive (CCWA).” Ivy Plus Libraries Confederation Collection. (Archived October 2013.) https://archive-it.org/collections/4019; NYARC: New York Art Resources Consortium. “Web Archiving.” http://www.nyarc.org/content/web-archiving.
84. OCLC Research. 2020. “Web Archiving Metadata Working Group.” The Problem, Addressing the Problem, Outputs.
https://www.oclc.org/research/themes/research-collections/wam.html.
85. Dooley, Jackie, and Kate Bowers. 2018. Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3005C.
86. Smith-Yoshimura, Karen. 2018. “Metadata for Audio and Videos.” Hanging Together: The OCLC Research Blog, 29 October 2018. https://hangingtogether.org/?p=6814.
87. Weber, Chela Scott. 2017. Research and Learning Agenda for Archives, Special, and Distinctive Collections in Research Libraries. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3C34F.
88. Library of Congress. “Standards.” Encoded Archival Description (EAD) Official Site. Accessed 21 September 2020. https://www.loc.gov/ead/.
89. Library of Congress. “Standards.” Preservation Metadata Maintenance Activity (PREMIS). Accessed 21 September 2020. https://www.loc.gov/standards/premis/.
90. Weber, Chela Scott. 2019. “Assessing Needs of AV in Special Collections.” Hanging Together: The OCLC Research Blog, 23 July 2019. https://hangingtogether.org/?p=7405; Weber, Chela Scott. 2019. “Scale & Risk: Discussing Challenges to Managing A/V Collections in the RLP.” Hanging Together: The OCLC Research Blog, 1 October 2019. https://hangingtogether.org/?p=7479.
91. Smith-Yoshimura, Karen. 2015. “Managing Metadata for Image Collections.” Hanging Together: The OCLC Research Blog, 9 April 2015. https://hangingtogether.org/?p=5130.
92. Library of Congress. “Standards.” Metadata Object Description Schema (MODS). Accessed 21 September 2020. http://www.loc.gov/standards/mods/.
93. Ibid.
94. Library of Congress. “Standards.” Metadata Authority Description Schema (MADS). Accessed 21 September 2020. http://www.loc.gov/standards/mads/.
95. Smith-Yoshimura, Karen. 2016. “Sharing Digital Collections Workflows.” Hanging Together: The OCLC Research Blog, 2 November 2016. https://hangingtogether.org/?p=5744.
96. OCLC Research. 2020. “Europeana Innovation Pilots.” Accessed 20 September 2020. http://www.oclc.org/research/themes/data-science/europeana.html?urlm=168921.
97. IIIF (International Image Interoperability Framework): Enabling Richer Access to the World’s Images. “Home.” Accessed 20 September 2020. https://iiif.io/.
98. OCLC Research. 2020. “OCLC ResearchWorks IIIF Explorer.” https://www.oclc.org/research/themes/data-science/iiif/iiifexplorer.html.
99. OCLC Research. 2020. “CONTENTdm Linked Data Pilot.” Introduction. https://www.oclc.org/research/themes/data-science/linkeddata/contentdm-linked-data-pilot.html.
100. Smith-Yoshimura, Karen. 2015. “Data Management and Curation in 21st Century Archives – Part 1.” 21 September 2015. http://hangingtogether.org/?p=5375.
101. Smith-Yoshimura, Karen. 2016. “Metadata for Research Data Management.” Hanging Together: The OCLC Research Blog, 18 April 2016. https://hangingtogether.org/?p=5616.
102. Erway, Ricky, Laurence Horton, Amy Nurnberger, Reid Otsuji, and Amy Rushing. 2015. Building Blocks: Laying the Foundation for a Research Data Management Program, 8. Dublin, OH: OCLC Research. https://doi.org/10.25333/C39P86.
103. See the OCLC Research Data Management Planning Guide at https://www.oclc.org/research/areas/research-collections/rdm/guide.html.
104. Smith-Yoshimura, Karen. 2020. “Knowledge Management and Metadata.” Hanging Together: The OCLC Research Blog, 9 April 2020. https://hangingtogether.org/?p=7845.
105. Faniel, Ixchel M. 2019. “Let’s Cook Up Some Metadata Consistency.” Next (blog), OCLC, 21 November 2019. http://www.oclc.org/blog/main/lets-cook-up-some-metadata-consistency/.
106. NCI (National Computational Infrastructure): Australia. “Home.” Accessed 21 September 2020. http://nci.org.au/; ADA (Australian Data Archive). “Home.” Accessed 21 September 2020. https://www.ada.edu.au/.
107. Portage Network. “Home.” Accessed 21 September 2020. https://portagenetwork.ca/.
108. Metadata 2020 is a “collaboration advocating richer, connected, reusable, open metadata for all research outputs” (http://www.metadata2020.org/). The Metadata 2020 Researcher Communications project is outlined here: http://www.metadata2020.org/projects/researcher-communications/.
109. Digital Curation Centre. “Disciplinary Metadata.” List of Metadata Standards. Accessed 21 September 2020. http://www.dcc.ac.uk/resources/metadata-standards/list.
110. RDA Metadata Directory. “Metadata Standards Directory Working Group.” GitHub Repository. Accessed 21 September 2020. http://rd-alliance.github.io/metadata-directory/.
111. NISO is about to make CRediT (Contributor Roles Taxonomy), which identifies 14 roles describing each contributor’s specific contribution to the scholarly output, a standard. CRediT was developed by CASRAI, the Consortia Advancing Standards in Research Administration Information. See CASRAI. “CRediT – Contributor Roles Taxonomy.” Accessed 21 September 2020. https://casrai.org/credit/.
112. University of Michigan Library. 2020. “Data Services.” http://www.lib.umich.edu/research-data-services.
113.
OCLC Research. 2020. “The Realities of Research Data Management.” Overview. https://www.oclc.org/research/publications/2017/oclcresearch-research-data-management.html.
114. Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2017. Scoping the University RDM Service Bundle. The Realities of Research Data Management, Part 2, pp. 16, 21. Dublin, OH: OCLC Research. https://doi.org/10.25333/C3Z039.
115. Indiana University. 2018. “IU will Lead $2 Million Partnership to Expand Access to Research Data: IU Libraries and IU Network Science Institute Are Leading a Public-Private Partnership to Create the Shared BigData Gateway for Research Libraries.” News at IU (Science and Technology). Indiana University, 18 October 2018. https://news.iu.edu/stories/2018/10/iu/releases/18-shared-bigdata-gateway-for-research-networks.html; Microsoft. 2020. “Microsoft Academic Graph.” Established 5 June 2015. https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/; For more details, watch the August 2019 recording of “Democratizing Access to Large Datasets through Shared Infrastructure.” See Wittenberg, Jamie, and Valentin Pentchev. “Works in Progress Webinar: Democratizing Access to Large Datasets through Shared Infrastructure.” Produced by OCLC Research, 8 August 2019. MP4 video presentation, 58:34:00. https://www.oclc.org/research/events/2019/080819-democratizing-access-large-datasets-shared-infrastructure.html.
116. NISO’s Reproducibility Badging and Definitions, now out for public comment, may also help researchers extend the benefit of their research to others. See “Taxonomy, Definitions, and Recognition Badging Scheme Working Group | NISO Website.” n.d. Accessed 22 September 2020. https://www.niso.org/standards-committees/reproducibility-badging.
117. Smith-Yoshimura, Karen. 2015. “Services Built on Usage Metrics.” Hanging Together: The OCLC Research Blog, 30 September 2015. https://hangingtogether.org/?p=5430.
118. Krista M. Soria, Jan Fransen, Shane Nackerud.
2014. “Stacks, Serials, Search Engines, and Students’ Success: First-Year Undergraduate Students’ Library Use, Academic Achievement, and Retention.” Journal of Academic Librarianship 40: 84-91. https://doi.org/10.1016/j.acalib.2013.12.002.
119. See Jisc. “Library Impact Data Project (LIDP).” Accessed 21 September 2020. http://www.activitydata.org/LIDP.html.
120. Smith-Yoshimura, Karen. 2019. “Alternatives to Statistics for Measuring Success and Value of Cataloging.” Hanging Together: The OCLC Research Blog, 15 April 2019. https://hangingtogether.org/?p=7122.
121. DLF (Digital Library Federation). 2015. “Metadata Librarian, Cornell University Library.” DLF (blog), 11 June 2015. https://www.diglib.org/metadata-librarian-cornell-university-library/.
122. Salary.com. (2019) 2020.
“Metadata Librarian.” Posted by Georgia Tech University 13 November 2019. (Archived 2 September 2020) https://web.archive.org/web/20200903061830/https://www.salary.com/job/gt-library/metadata-librarian/e5644ece-c847-4cfb-994f-c4c80fa81e3d.
123. OCLC. 2020. “Locate Items in the Library with StackMap.” https://help.oclc.org/Discovery_and_Reference/WorldCat_Discovery/Search_results/Locate_items_in_the_library_with_StackMap.
124. Yewno: Transforming Information into Knowledge. 2020. “Home.” https://www.yewno.com/.
125. Smith-Yoshimura, Karen. 2019. “New Ways of Using and Enhancing Cataloging and Authority Records.” Hanging Together: The OCLC Research Blog, 2 April 2019. https://hangingtogether.org/?p=7805.
126. National Library of Australia (NLA). “Austlang National Codeathon.” Accessed 21 September 2020. https://www.nla.gov.au/our-collections/processing-and-describing-the-collections/Austlang-national-codeathon [Map of Australia. 2020 HERE, Bing, Microsoft Corporation]; NLA. “Trove.” Search. Uncover. Australia. Accessed 21 September 2020. https://trove.nla.gov.au/.
127. The Graphic History Company – Hachette UK. “River of Authors.” Accessed 21 September 2020. http://theghc.co/project.php?project=hachette-uk-a-river-of-authors.
128. Smith-Yoshimura, Karen. 2019. “Knowledge Organization Systems.” Hanging Together: The OCLC Research Blog, 17 April 2019. https://hangingtogether.org/?p=7135.
129. Zeng, Marcia Lei, and Philipp Mayr. 2019. “Knowledge Organization Systems (KOS) in the Semantic Web: A Multi-dimensional Review.” International Journal on Digital Libraries 20: 209-230. https://doi.org/10.1007/s00799-018-0241-2.
130. SNAC (Social Networks and Archival Context). “About SNAC.” What is SNAC? https://portal.snaccooperative.org/about.
131. Smith-Yoshimura, Karen. 2020. “Knowledge Management and Metadata.” Hanging Together: The OCLC Research Blog, 9 April 2020. https://hangingtogether.org/?p=7845.
132.
AI4LAM (Artificial Intelligence for Libraries, Archives & Museums). Updated 18 May 2020 https://sites.google.com/view/ai4lam/home. http://www.activitydata.org/LIDP.html https://hangingtogether.org/?p=7122 https://www.diglib.org/metadata-librarian-cornell-university-library/ https://web.archive.org/web/20200903061830/https://www.salary.com/job/gt-library/metadata-librarian https://web.archive.org/web/20200903061830/https://www.salary.com/job/gt-library/metadata-librarian https://help.oclc.org/Discovery_and_Reference/WorldCat_Discovery/Search_results/Locate_items_in_the_library_with_StackMap https://help.oclc.org/Discovery_and_Reference/WorldCat_Discovery/Search_results/Locate_items_in_the_library_with_StackMap https://www.yewno.com/ https://hangingtogether.org/?p=7805 https://www.nla.gov.au/our-collections/processing-and-describing-the-collections/Austlang-national-codeathon https://www.nla.gov.au/our-collections/processing-and-describing-the-collections/Austlang-national-codeathon https://trove.nla.gov.au/ http://theghc.co/project.php?project=hachette-uk-a-river-of-authors https://hangingtogether.org/?p=7135 https://doi.org/10.1007/s00799-018-0241-2 https://portal.snaccooperative.org/about https://hangingtogether.org/?p=7845 https://sites.google.com/view/ai4lam/home 46 Transitioning to the Next Generation of Metadata 133. AI4LAM’s mission is to organize, share, and elevate knowledge about and use of artificial intelligence by libraries, archives, and museums. It was founded in 2018, inspired by the success of the International Image Interoperability Framework (IIIF) in coordinating large scale collaboration on interoperable technology to advance LAMs. See AI4LAM. “About.” Our Mission. https://sites.google.com/view/ai4lam/about. 134. Padilla, Thomas. 2019. Responsible Operations: Data Science, Machine Learning, and AI in Libraries. Dublin, OH: OCLC Research. https://doi.org/10.25333/xk7z-9g97. 135. Ibid, 17-19. 136. Smith-Yoshimura, Karen. 2019. 
“Alternatives to Statistics for Measuring Success and Value of Cataloging.” Hanging Together: The OCLC Research Blog, 15 April 2019. https://hangingtogether.org/?p=7122. 137. Smith-Yoshimura, Karen. 2017. “New Skill Sets for Metadata Management.” Hanging Together: The OCLC Research blog, 17 April 2017. https://hangingtogether.org/?p=5929. 138. Smith-Yoshimura, Karen. 2018. “MarcEdit and Other Tools for Batch Processing and Metadata Reconciliation.” Hanging Together: The OCLC Research Blog, 26 March 2018. https://hangingtogether.org/?p=6646. 139. Reese, Terry. 2018 “MarcEdit 2017 Usage Information.“ Terry’s Worklog (blog), 9 September 2020. http://blog.reeset.net/archives/2572. 140. Reese, Terry. 2020. “Working with Linked Data In MarcEdit.” MarcEdit Development (blog). Accessed 21 September 2020. https://marcedit.reeset.net/working-with-linked-data-in- marcedit. 141. Reese, Terry. 2018. “MarcEdit Playlist.” 139 YouTube videos. Last updated 26 December 2018. https://www.youtube.com/playlist?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg. 142. Smith-Yoshimura, Karen. 2017. “New Skill Sets for Metadata Management.” Hanging Together: The OCLC Research blog, 17 April 2017. https://hangingtogether.org/?p=5929. 143. “XML and RDF-Based Systems Archives.” n.d. Library Juice Academy (blog). Accessed 22 September 2020. https://libraryjuiceacademy.com/certificate/xml-and-rdf-based-systems/; Reese, Terry. 2013. “Tutorials.” YouTube (selected). MarcEdit Development (blog). 14 March 2013. http://marcedit.reeset.net/tutorials; “Lynda: Online Courses, Classes, Training, Tutorials.” n.d. Lynda.com - from LinkedIn Learning. Accessed 22 September 2020. https://www.lynda.com/; “Learn to Code - for Free.” n.d. Codecademy. Accessed 22 September 2020. 
https://www.codecademy.com/; https://sites.google.com/view/ai4lam/about https://doi.org/10.25333/xk7z-9g97 https://hangingtogether.org/?p=7122 https://hangingtogether.org/?p=5929 https://hangingtogether.org/?p=6646 http://blog.reeset.net/archives/2572 https://marcedit.reeset.net/working-with-linked-data-in-marcedit https://marcedit.reeset.net/working-with-linked-data-in-marcedit https://www.youtube.com/playlist?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg https://hangingtogether.org/?p=5929 https://libraryjuiceacademy.com/certificate/xml-and-rdf-based-systems/ http://marcedit.reeset.net/tutorials https://www.lynda.com/ https://www.codecademy.com/ Transitioning to the Next Generation of Metadata 47 Software Carpentry. “Teaching Basic Lab Skills for Research Computing.” Upcoming Workshops. Accessed 22 September 2020. https://software-carpentry.org/. 144. “Data on the Web Best Practices.” n.d. Accessed 22 September 2020. https://www.w3.org/TR/dwbp/; Semantic Web for the Working Ontologist. (2008) 2020. http://workingontologist.org/. 145. Library Workflow Exchange. n.d. “About.” Accessed 21 September 2020. http://www.libraryworkflowexchange.org/about/. 146. OCLC Developer Network. 2020. “DevConnect Webinars. https://www.oclc.org/developer/ events/devconnect-workshops.en.html. 147. Smith-Yoshimura, Karen. 2019. “Stewardship of Professional FTEs In Metadata Work and Turnover.” Hanging Together: The OCLC Research Blog, 18 October 2019. https://hangingtogether.org/?p=7580. 148. OCLC. 2020. “WorldCat®: OCLC and Linked Data.” Shared Entity Management Infrastructure. https://www.oclc.org/en/worldcat/linked-data/shared-entity-management-infrastructure.html. 
For more information about our work related to digitizing library collections, please visit: oc.lc/digitizing
6565 Kilgour Place, Dublin, Ohio 43017-3395
T: 1-800-848-5878 T: +1-614-764-6000 F: +1-614-764-6096
www.oclc.org/research
ISBN: 978-1-55653-167-5 DOI: 10.25333/rqgd-b343 RM-PR-216787-WWAE 2009
OCLC RESEARCH REPORT

Executive Summary
Introduction
The Transition to Linked Data and Identifiers
Expanding the use of persistent identifiers
Moving from “authority control” to “identity management”
Addressing the need for multiple vocabularies and equity, diversity, and inclusion
Linked data challenges
Describing “Inside-Out” and “Facilitated” Collections
Archival collections
Archived websites
Audio and video collections
Image collections
Research data
Evolution of “Metadata as a Service”
Metrics
Consultancy
New applications
Bibliometrics
Semantic indexing
Preparing for Future Staffing Requirements
The culture shift
Learning opportunities
New tools and skills
Self-education
Addressing staff turnover
Impact
Acknowledgments
Appendix
Notes

FIGURE 1. “Changing Resource Description Workflows” by OCLC Research
FIGURE 2. Some 300 abbreviated author names for a five-page article in Physical Review Letters
FIGURE 3. Examples of some DOI (left) and ARK (right) identifiers
FIGURE 4. One Wikidata identifier links to other identifiers and labels in different languages
FIGURE 5. Excerpt from the survey results from the 2017 EDI survey of the Research Library Partnership
FIGURE 6. Responses to 2019 survey on challenges related to managing A/V collections
FIGURE 7. The OCLC ResearchWorks IIIF Explorer retrieves images about “Paris Maps” across CONTENTdm collections
FIGURE 8. Distribution of 465 Indigenous language codes in the Australian National Bibliographic Database
FIGURE 9. UK Hachette’s “River of Authors” generated from the British Library’s catalog metadata
orr-bootleg-2020 ---- Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation
Laurel Orr†, Megan Leszczynski†, Simran Arora†, Sen Wu†, Neel Guha†, Xiao Ling‡, and Christopher Ré†
†Stanford University ‡Apple
{lorr1,mleszczy,simran,senwu,nguha,chrismre}@cs.stanford.edu, xiaoling@apple.com

Abstract
A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on knowledge of entity facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system that is explicitly grounded in reasoning patterns for disambiguation. We define core reasoning patterns for disambiguation, create a learning procedure to encourage the self-supervised model to learn the patterns, and show how to use weak supervision to enhance the signals in the training data. Encoding the reasoning patterns in a simple Transformer architecture, Bootleg meets or exceeds state-of-the-art on three NED benchmarks. We further show that the learned representations from Bootleg successfully transfer to other non-disambiguation tasks that require entity-based knowledge: we set a new state-of-the-art in the popular TACRED relation extraction task by 1.0 F1 points and demonstrate up to 8% performance lift in highly optimized production search and assistant tasks at a major technology company.

1 Introduction
Knowledge-aware deep learning models have recently led to significant progress in fields ranging from natural language understanding [38, 41] to computer vision [56]. Incorporating explicit knowledge allows for models to better recall factual information about specific entities [38].
Despite these successes, a persistent challenge that recent works continue to identify is how to leverage knowledge for low-resource regimes, such as tail examples that appear rarely (if at all) in the training data [16]. In this work, we study knowledge incorporation in the context of named entity disambiguation (NED) to better disambiguate the long tail of entities that occur infrequently during training.¹

Humans disambiguate by leveraging subtle reasoning over entity-based knowledge to map strings to entities in a knowledge base. For example, in the sentence “Where is Lincoln in Logan County?”, resolving the mention “Lincoln” to “Lincoln, IL” requires reasoning about relations because “Lincoln, IL” (not “Lincoln, NE” or “Abraham Lincoln”) is the capital of Logan County. Previous NED systems disambiguate by memorizing co-occurrences between entities and textual context in a self-supervised manner [16, 51]. The self-supervision is critical to building a model that is easy to maintain and does not require expensive hand-curated features. However, these approaches struggle to handle tail entities: a baseline SotA model from [16] achieves less than 28 F1 points over the tail, compared to 86 F1 points over all entities.

Despite their rarity in training data, many real-world entities are tail entities: 89% of entities in the Wikidata knowledge base do not have Wikipedia pages to serve as a source of textual training data. However, to achieve 60 F1 points on disambiguation, we find that the prior SotA baseline model should see an entity on the order of 100 times during training (Figure 1 (right)). This presents a scalability challenge as there are 15x more entities in Wikidata than in Wikipedia, the majority of which are tail entities. For the model to observe each of these tail entities 100x, the training data would need to be scaled by 1,500x the size of Wikipedia. Prior approaches struggle with the tail, yet industry applications such as search and voice assistants are known to be tail-heavy [4, 20]. Given the requirement for high quality tail disambiguation, major technology companies continue to press on this challenge [29, 39].

¹ In this work, we define tail entities as those occurring 10 or fewer times in the training data.

arXiv:2010.10363v3 [cs.CL] 23 Oct 2020

[Figure 1 panels. Left, “The Core Reasoning Patterns of NED”, in order of increasing generality: Entity Memorization (“Where is Lincoln Nebraska?”, co-occurrence with “Nebraska”), Type Consistency (“Is a Lincoln or Ford more expensive?”, consistent “car” types), KG Relations (“Where is Lincoln in Logan County?”, “capital-of” relation), and Type Affordance (“How tall is Lincoln?”, people have “heights”). Right, “Overcoming the Long Tail of NED”: F1 (0 to 1.0) against the number of entity occurrences in training (1 to 10^6) for Baseline and Bootleg over unseen, tail, torso, and head entities, annotated “Up to 100x more data needed to recover performance of Bootleg over the tail”.]
Figure 1: (Left) shows the four reasoning patterns for disambiguation. The correct entity is bolded. (Right) shows F1 versus number of times an entity was seen in training data for a baseline NED model compared to Bootleg across the head, torso, tail, and unseen.

Instead of scaling the training data until co-occurrences between tail entities and text can be memorized, we define a principled set of reasoning patterns for entity disambiguation across the head and tail. When humans disambiguate entities, they leverage signals from context as well as from entity relations and types.
For example, resolving “Lincoln” in the text “How tall is Lincoln?” to “Abraham Lincoln” requires reasoning that people, not locations or car companies, have heights: a type affordance pattern. These core patterns apply to both head and tail examples with high coverage and involve reasoning over entity facts, relations, and types, information which is available for both head and tail in structured data sources.² Thus, we hypothesize that these patterns assembled from the structured resources can be learned over training data and generalize to the tail.

In this work, we introduce Bootleg, an open-source, self-supervised NED system designed to succeed on head and tail entities.³ Bootleg encodes the entity, relation, and type signals as embedding inputs to a simple stacked Transformer architecture. The key challenges we face are understanding how to use knowledge for NED, designing a model that learns those patterns, and fully extracting the useful knowledge signals from the training data:
• Tail Reasoning: Humans use subtle reasoning patterns to disambiguate different entities, especially unfamiliar tail entities. The first challenge is characterizing these reasoning patterns and understanding their coverage over the tail.
• Poor Tail Generalization: We find that a model trained using standard regularization and a combination of entity, type and relation information performs 10 F1 points worse on disambiguating unseen entities compared to the two models which respectively use only type and only relation information. We find this performance drop is due to the model’s over-reliance on discriminative textual and entity features compared to more general type and relation features.
• Underutilized Data: Self-supervised models improve with more training data [7]. However, only a limited portion of the standard NED training dataset, Wikipedia, is useful: Wikipedia lacks labels [19] and we find that an estimated 68% of entities in the dataset are not labeled.⁴

² We find that type affordance patterns apply to over 84% of all examples, including tail examples, while KG relation patterns apply to over 27% of all examples and type consistency applies to over 8% of all examples. In Wikidata, 75% of entities that are not in Wikipedia have type or knowledge graph connectivity signals, and among tail entities, 88% are in non-tail type categories and 90% are in non-tail relation categories.
³ Bootleg is open-source at http://hazyresearch.stanford.edu/bootleg

Bootleg addresses these challenges through three contributions:
• Reasoning Patterns for Disambiguation: We contribute a principled set of core disambiguation patterns for NED (Figure 1 (left)), namely entity memorization, type consistency, KG relation, and type affordance, and show that on slices of Wikipedia examples exemplifying each pattern, Bootleg provides a lift over the baseline SotA model on tail examples by 18 F1, 56 F1, 62 F1, and 45 F1 points respectively. Overall, using these patterns, Bootleg meets or exceeds state-of-the-art performance on three NED benchmarks and outperforms the prior SotA by more than 40 F1 points on the tail of Wikipedia.
• Generalizing Learning to the Tail: Our key insight is that there are distinct entity-, type-, and relation-tails. Over tail entities (based on entity count in the training data), 88% have non-tail types and 90% have non-tail relations. The model should balance these signals differently depending on the particular entity being disambiguated. We thus contribute a new 2D regularization scheme to combine the entity, type, and relation signals and achieve a lift of 13.6 F1 points on unseen entities compared to the model using standard regularization techniques.
We conduct extensive ablation studies to verify the effectiveness of our approach.
• Weak Labeling of Data: Our insight is that because Wikipedia is highly structured (most sentences on an entity’s Wikipedia page refer to that entity via pronouns or alternative names), we can weakly label our training data to label mentions. Through weak labeling, we increase the number of labeled mentions in the training data by 1.7x, and find this provides a 2.6 F1 point lift on unseen entities.

With these three contributions, Bootleg achieves SotA on three NED benchmarks. We further show that embeddings from Bootleg are useful for downstream applications that require the knowledge of entities. We show the reasoning patterns learned in Bootleg transfer to tasks beyond NED by extracting Bootleg’s learned embeddings and using them to set a new SotA by 1.0 F1 points on the TACRED relation extraction task [2, 53], where the prior SotA model also uses entity-based knowledge [38]. Bootleg representations further provide an 8% performance lift on highly optimized industrial search and assistant tasks at a major technology company. For Bootleg’s embeddings to be viable for production, it is critical that these models are space-efficient: the models using only Bootleg relation and type embeddings each achieve 3.3x the performance of the prior SotA baseline over unseen entities using 1% of the space.

2 NED Overview and Reasoning Patterns
We now define the task of named entity disambiguation (NED), the four core reasoning patterns, and the structural resources required for learning the patterns.

Task Definition Given a knowledge base of entities E and an input sentence, the goal of named entity disambiguation is to determine the entities e ∈ E referenced in each sentence. Specifically, the input is a sequence of N tokens $W = \{w_1, \ldots, w_N\}$ and a set of M non-overlapping spans in the sequence W, termed mentions, to be disambiguated, $M = \{m_1, \ldots, m_M\}$.
The output is the most likely entity for each mention.

The Tail of NED We define the tail, torso, and head of NED as entities occurring fewer than 11 times, between 11 and 1,000 times, and more than 1,000 times in training, respectively. Following Figure 1 (right), the head represents those entities a simple language-based baseline model can easily resolve, as shown by a baseline SotA model from [16] achieving 86 F1 over all entities. These entities were seen enough times during training to memorize distinguishing contextual cues. The tail represents the entities these models struggle to resolve due to their rarity in training data, as shown by the same baseline model achieving less than 28 F1 on the tail.

⁴ We computed this statistic by computing the number of proper nouns and the number of pronouns/known aliases for an entity on that entity’s page that were not already linked.

2.1 Four Reasoning Patterns
When humans disambiguate entities in text, they conceptually leverage signals over entities, relationships, and types. Our empirical analysis reveals a set of desirable reasoning patterns for NED. The patterns operate at different levels of granularity (see Figure 1 (left)), from patterns which are highly specific to an entity to patterns which apply to categories of entities, and are defined as follows.
• Entity Memorization: We define entity memorization as the factual knowledge associated with a specific entity. Disambiguating “Lincoln” in the text “Where is Lincoln, Nebraska?” requires memorizing that “Lincoln, Nebraska”, not “Abraham Lincoln”, frequently occurs with the text “Nebraska” (Figure 1 (left)). This pattern is easily learned by now-standard Transformer-based language models. As this pattern is at the entity-level, it is the least general pattern.
• Type Consistency: Type consistency is the pattern that certain textual signals indicate that the types of entities in a collection are likely similar.
For example, when disambiguating “Lincoln” in the text “Is a Lincoln or Ford more expensive?”, the keyword “or” indicates that the entities in the pair (or sequence) are likely of the same Wikidata type, “car company”. Type consistency is a more general pattern than entity memorization, covering 12% of the tail examples in a sample of Wikipedia.⁵
• KG Relations: We define the knowledge graph (KG) relation pattern as when two candidates have a known KG relationship and textual signals indicate that the relation is discussed in the sentence. For example, when disambiguating “Lincoln” in the sentence “Where is Lincoln in Logan County?”, “Lincoln, IL” has the KG relationship “capital of” with Logan County while “Lincoln, NE” does not. The keyword “in” is associated with the relation “capital of” between two location entities, indicating that “Lincoln, IL” is correct, despite being the less popular candidate entity associated with “Lincoln”. As patterns over pairs of entities with KG relations cover 23% of the tail examples, this is a more general reasoning pattern than type consistency.
• Type Affordance: We define type affordance as the textual signals associated with a specific entity-type in natural language. For example, “Manhattan” is likely resolved to the cocktail rather than the borough in the sentence “He ordered a Manhattan.” due to the affordance that drinks, not locations, are “ordered”. As affordance signals cover 76% of the tail examples, it is the most general reasoning pattern.

Required Structural Resources An NED system requires entity, relation, and type knowledge signals to learn these reasoning patterns. Entity knowledge is captured in unstructured text, while relation signals and type signals are readily available in structured knowledge bases such as Wikidata: from a sample of Wikipedia, 27% of all mentions and 23% of tail mentions participate in a relation, and 97% of all mentions and 92% of tail mentions are assigned some type in Wikidata.
As these structural resources are readily available for all entities, they are useful for generalizing to the tail. A rare entity with a particular type or relation can leverage textual patterns learned from every other entity with that type or relation. Given the input signals and reasoning patterns, the next key challenge is ensuring that the model combines the discriminative entity and more general relation and type signals that are useful for disambiguation.

⁵ Coverage numbers are calculated from representative slices of Wikidata that require each reasoning pattern. Additional details in Section 5.

3 Bootleg Architecture for Tail Disambiguation
We now describe our approach to leverage the reasoning patterns based on entity, relation, and type signals. We then present our new regularization scheme to inject inductive bias of when to use general versus discriminative reasoning patterns and our weak labeling technique to extract more signal from the self-supervision training data.

[Figure 2 diagram: for the input “Where is Lincoln in Logan County?”, each candidate (Lincoln, IL; Lincoln, NE; Abraham Lincoln; Logan County, IL; Logan County, OK; Logan County, OH) has an entity embedding $u_e$, type embeddings pooled by AddAttn into $t_e$, and relation embeddings pooled by AddAttn into $r_e$. These are concatenated and projected into E. E and the BERT word embeddings W feed the Phrase2Ent, Ent2Ent, and KG2Ent modules of a single layer, followed by a softmax that selects, for example, Lincoln, IL and Logan County, IL.]
Figure 2: Bootleg’s neural model. The entity, type, and relation embeddings are generated for each candidate and concatenated to form our entity representation matrix E. This, together with our word embedding matrix W, are inputs to Bootleg’s Ent2Ent, Phrase2Ent, and KG2Ent modules which aim to encode the four reasoning patterns. The most likely candidate for each mention is returned.

3.1 Encoding the Signals
We first encode the structural signals (entities, KG relations and types) by mapping each to a set of embeddings.
• Entity Embedding: Each entity e is represented by a unique embedding $u_e$.
• Type Embedding: Let T be the set of possible entity types.
Given a known mapping from an entity e to its set $\{t_{e,1}, \ldots, t_{e,T} \mid t_{e,i} \in T\}$ of T possible types, Bootleg assigns an embedding $t_{e,i}$ to each type. Because an entity can have multiple types, we use an additive attention [3], AddAttn, to create a single type embedding $t_e = \mathrm{AddAttn}([t_{e,1}, \ldots, t_{e,T}])$. We further allow the model to leverage coarse named entity recognition types through a mention-type prediction module (see Appendix A for details). This coarse predicted type is concatenated with the assigned type to form $t_e$.
• Relation Embedding: Let R represent the set of possible relationships any entity can participate in. Similar to types, given a mapping from an entity e to its set $\{r_{e,1}, \ldots, r_{e,R} \mid r_{e,i} \in R\}$ of R relationships, Bootleg assigns an embedding $r_{e,i}$ to each relation. Because an entity can participate in multiple relations, we use the additive attention to compute $r_e = \mathrm{AddAttn}([r_{e,1}, \ldots, r_{e,R}])$.

As in existing work [16, 40], given the input sentence of length N and set of M mentions, Bootleg generates for each mention $m_i$ a set $\Gamma(m_i) = \{e_i^1, \ldots, e_i^K\}$ of K possible entity candidates that could be referred to by $m_i$. For each candidate and its associated types and relations, Bootleg uses a multi-layer perceptron $e = \mathrm{MLP}([u_e, t_e, r_e])$ to generate a vector representation for each candidate entity, for each mention. We denote this entity matrix as $E \in \mathbb{R}^{M \times K \times H}$, where H is the hidden dimension. We use BERT to generate contextual embeddings for each token in the input sentence. We denote this sentence embedding as $W \in \mathbb{R}^{N \times H}$. W and E are passed to Bootleg’s model architecture, described next.
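The encoding step of Section 3.1 can be illustrated with a minimal pure-Python sketch: additive-attention pooling of an entity's type and relation embeddings, followed by an MLP over the concatenation $[u_e, t_e, r_e]$. This is not Bootleg's implementation. The width `H`, the parameter names in `params`, the exact attention scoring form (`v . tanh(W x)`), the two-layer ReLU MLP, and the random weights are all assumptions made for illustration.

```python
import math
import random

H = 4  # hypothetical hidden width, chosen only to keep the example small


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_attn(vectors, w_att, v_att):
    """Additive-attention pooling: score_i = v . tanh(W x_i), then a weighted sum."""
    scores = [
        sum(v * math.tanh(h) for v, h in zip(v_att, matvec(w_att, x)))
        for x in vectors
    ]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(wt * x[d] for wt, x in zip(weights, vectors)) for d in range(dim)]


def encode_candidate(u_e, type_embs, rel_embs, params):
    """Build e = MLP([u_e, t_e, r_e]) for one candidate entity."""
    t_e = add_attn(type_embs, params["w_type"], params["v_type"])
    r_e = add_attn(rel_embs, params["w_rel"], params["v_rel"])
    concat = u_e + t_e + r_e  # list concatenation, length 3H
    hidden = [max(0.0, h) for h in matvec(params["w1"], concat)]  # ReLU (assumed)
    return matvec(params["w2"], hidden)  # projected back to width H


rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


# Randomly initialised stand-ins for learned parameters (hypothetical names).
params = {
    "w_type": rand_mat(H, H), "v_type": rand_vec(H),
    "w_rel": rand_mat(H, H), "v_rel": rand_vec(H),
    "w1": rand_mat(H, 3 * H), "w2": rand_mat(H, H),
}
candidate = encode_candidate(
    u_e=rand_vec(H),
    type_embs=[rand_vec(H), rand_vec(H)],  # an entity with two types
    rel_embs=[rand_vec(H)],                # an entity with one relation
    params=params,
)
print(len(candidate))  # 4, the hidden width H
```

Stacking one such vector per candidate and per mention would give the M × K × H entity matrix E described above.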
3.2 Bootleg Model Architecture
The design goal of Bootleg is to capture the reasoning patterns by modeling textual signals associated with entities (for entity memorization), co-occurrences between entity types (for type consistency), textual signals associated with relations along with which entities are explicitly linked in the KG (for KG relations), and textual signals associated with types (for type affordance). We design three modules to capture these design goals: a phrase memorization module, a co-occurrence memorization module, and a knowledge graph connection module. The model architecture is shown in Figure 2. We describe each module next.

Phrase Memorization Module We design the phrase memorization module, Phrase2Ent, to encode the dependencies between the input text and the entity, relation, and type embeddings. The purpose of this module is to learn textual cues for the entity memorization and type affordance patterns. It should also learn relation context for the KG relation pattern. It will, for example, allow the person type embedding to encode the association with the keyword “height”. The module accepts as input E and W and outputs $E_p = \mathrm{MHA}(E, W)$, where MHA is the standard multi-headed attention with a feed-forward layer and skip connections [48].

Co-occurrence Memorization Module We design the co-occurrence memorization module, Ent2Ent, to encode the dependencies between entities. The purpose of the Ent2Ent module is to learn textual cues for the type consistency pattern. The module accepts E and computes $E_c = \mathrm{MHA}(E)$ using self-attention.

Knowledge Graph (KG) Connection Module We design the KG module, KG2Ent, to collectively resolve entities based on pairwise connectivity features. Let K represent the adjacency matrix of a (possibly weighted) graph where the nodes are entities and an edge between $e_i$ and $e_j$ signifies that the two entities share some pairwise feature.
Given E, KG2Ent computes $E_k = \mathrm{softmax}(K + wI)E + E$, where I is the identity and w is a learned scalar weight that allows Bootleg to learn to balance the original entity and its connections. This module allows for representation transfer between two related entities, meaning entities with a high-scoring representation will boost the score of related entities. The added E term acts as a skip connection between the input and output. In Bootleg, we allow the user to specify multiple KG2Ent modules: one for each adjacency matrix. The purpose of KG2Ent along with Phrase2Ent is to learn the KG relation pattern.

End-to-End The computations for one layer of Bootleg include:
$$E' = \mathrm{MHA}(E, W) + \mathrm{MHA}(E)$$
$$E_k = \mathrm{softmax}(K + wI)E' + E'$$
where $E_k$ is passed as the entity matrix to the next layer. After the final layer, Bootleg scores each entity by computing $S_{dis} = \max(E_k v^T, E' v^T)$ with $S_{dis} \in \mathbb{R}^{M \times K}$ and learned scoring vector $v \in \mathbb{R}^H$. Bootleg then outputs the highest scoring candidate for each mention. This scoring treats $E_k$ and $E'$ as two separate predictions in an ensemble method, allowing the model to use collective reasoning from $E_k$ when it achieves the highest scoring representation. If there are multiple KG2Ent modules, we use the average of their outputs as input to the next layer and, for scoring, take the maximum score across all outputs. For training, we use the cross-entropy loss of $S_{dis}$ to compute the disambiguation loss $L_{dis}$.

3.3 Improving Tail Generalization
Regularization is the standard technique to encourage models to generalize, as models will naturally fit to discriminative features. However, we demonstrate that standard regularization is not effective when we want to leverage a combination of general and discriminative signals. We then present two techniques, regularization and weak labeling, to encourage Bootleg to incorporate general structural signals and learn general reasoning patterns.
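As a rough illustration of the KG2Ent step and the scoring rule from Section 3.2, the sketch below treats $E'$ as given (the two attention modules are omitted), flattens all candidates in a sentence into a list of rows, and applies a row-wise softmax to K + wI. The toy embeddings, adjacency matrix, value of w, and scoring vector v are hypothetical, and the flattening of the M × K candidates into one list is an assumption about how the adjacency matrix is indexed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kg2ent(e_prime, adjacency, w):
    """E_k = softmax(K + wI) E' + E', with the softmax taken over each row."""
    n, dim = len(e_prime), len(e_prime[0])
    e_k = []
    for i in range(n):
        logits = [adjacency[i][j] + (w if i == j else 0.0) for j in range(n)]
        attn = softmax(logits)
        mixed = [sum(attn[j] * e_prime[j][d] for j in range(n)) for d in range(dim)]
        e_k.append([m + x for m, x in zip(mixed, e_prime[i])])  # skip connection
    return e_k


def disambiguation_scores(e_k, e_prime, v):
    """S_dis = max(E_k v^T, E' v^T), taken per candidate."""
    return [max(dot(a, v), dot(b, v)) for a, b in zip(e_k, e_prime)]


def pick_candidates(scores, num_candidates):
    """Index of the best candidate inside each mention's block of K scores."""
    picks = []
    for start in range(0, len(scores), num_candidates):
        block = scores[start:start + num_candidates]
        picks.append(block.index(max(block)))
    return picks


# Toy sentence with two mentions and K = 2 candidates each (hypothetical values):
# rows 0-1 are candidates for "Lincoln", rows 2-3 for "Logan County".
e_prime = [[0.9, 0.1], [1.0, 0.0], [0.2, 0.8], [0.1, 0.7]]
adjacency = [[0.0] * 4 for _ in range(4)]
adjacency[0][2] = adjacency[2][0] = 1.0  # a KG edge between candidates 0 and 2
e_k = kg2ent(e_prime, adjacency, w=1.0)
scores = disambiguation_scores(e_k, e_prime, v=[1.0, 1.0])
print(pick_candidates(scores, num_candidates=2))  # one index per mention
```

Taking the maximum of the two scores means a candidate is never scored lower than its pre-KG representation alone would score it, which matches the ensemble reading given in the text.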
3.3.1 Regularization
We hypothesize that Bootleg will over-rely on the more discriminative entity features compared to the more general type and relation features to lower training loss. However, tail disambiguation requires Bootleg to leverage the general features. Using standard regularization techniques, we evaluate three models which respectively use only type embeddings, only relation embeddings, and a combination of type, relation, and entity embeddings. Bootleg’s performance on unseen entities is 10 F1 points worse on the latter than each of the former two, suggesting that standard regularization is not sufficient when the signals operate at different granularities (details in Table 9 in Appendix B).

We can improve tail performance if Bootleg leverages memorized discriminative features for popular entities and general features for rare entities. We achieve this by designing a new regularization scheme for the entity-specific embedding u, which has two key properties: it is 2-dimensional and more popular entities are regularized less than less popular ones.
• 2-dimensional: In contrast to 1-dimensional dropout, 2-dimensional regularization involves masking the full embedding. With probability p(e), we set u = 0 before the MLP layer; i.e., $e = \mathrm{MLP}([0, t_e, r_e])$. Because the entity embedding is entirely masked in these cases, the model learns to disambiguate using the type and relation patterns, without entity knowledge.
• Inverse Popularity: We find in ablations (Appendix B) that setting p(e) proportional to the power of the inverse of the entity e’s popularity in the training data (i.e., the more popular the less regularized) gives us the best performance and improves by 13.6 F1 on unseen entities over standard regularization. In contrast, fixing p(e) at 80% improves performance by over 11.3 F1 over standard regularization, and regularizing proportional to the power of popularity only improves performance by 3.8 F1 (details in Section 4).
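The two properties above can be sketched in a few lines of Python. The text only states that p(e) is proportional to a power of the inverse popularity, so the particular schedule below (`scale * count ** (-power)`, clamped to a floor) and its constants are hypothetical choices for illustration, not the values used in the paper.

```python
import random


def mask_probability(count, scale=0.95, power=0.3, floor=0.05):
    """Hypothetical schedule: p(e) = scale * count**(-power), clamped to [floor, scale]."""
    if count < 1:
        return scale  # unseen entities get the maximum masking rate
    return max(floor, min(scale, scale * count ** (-power)))


def regularize(u_e, count, rng):
    """2D regularization: zero the whole entity embedding with probability p(e)."""
    if rng.random() < mask_probability(count):
        return [0.0] * len(u_e)  # the model must rely on type and relation signals
    return list(u_e)


rng = random.Random(0)
u_e = [0.3, -0.7, 1.1]  # toy entity embedding
zeros = [0.0, 0.0, 0.0]
rare = sum(regularize(u_e, 2, rng) == zeros for _ in range(2000))
popular = sum(regularize(u_e, 50000, rng) == zeros for _ in range(2000))
print(rare, popular)  # the rare entity is masked far more often
```

Unlike element-wise dropout, the mask here is all-or-nothing per entity, so on masked steps the downstream MLP sees $[0, t_e, r_e]$ exactly as in the 2-dimensional bullet.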
The regularization scheme encourages Bootleg to use entity-specific knowledge when the entity is seen enough times to memorize entity patterns and encourages the use of generalizable patterns over the rare, highly masked, entities.

3.3.2 Weakly Supervised Data Labeling
We use Wikipedia to train Bootleg: we define a self-supervision task in which the internal links in Wikipedia are the gold entity labels for mentions during training. Although this dataset is large and widely used, it is often incomplete with an estimated 68% of named entities being unlabeled. Given the scale and the requirement that Bootleg be self-supervised, it is not feasible to hand-label the data. Our insight is that because Wikipedia is highly structured (most sentences on an entity’s Wikipedia page refer to that entity via pronouns or alternative names), we can weakly label our training data [44] to label mentions. We use two heuristics for weak labeling: the first labels pronouns that match the gender of a person’s Wikipedia page as references to that person, and the second labels known alternative names for an entity if the alternative name appears in sentences on the entity’s Wikipedia page. Through weak labeling, we increase the number of labeled mentions in the training data by 1.7x across Wikipedia, and find this provides a 2.6 F1 lift on unseen entities (full results in Appendix B Table 11).

4 Experiments
We demonstrate that Bootleg (1) nearly matches or exceeds state-of-the-art performance on three standard NED benchmarks and (2) outperforms a BERT-based NED baseline on the tail. As NED is critical for downstream tasks that require the knowledge of entities, we (3) verify Bootleg’s learned reasoning patterns can transfer by using them for a downstream task: using Bootleg’s learned representations, we achieve a new SotA on the TACRED relation extraction task and improve performance on a production task at a major technology company by 8%.
Finally, we (4) demonstrate that Bootleg can be sample-efficient by using only a fraction of its learned entity embeddings without sacrificing performance. We (5) ablate Bootleg to understand the impact of the structural signals and the regularization scheme on improved tail performance.

4.1 Experimental Setup

Wikipedia Data
We define our knowledge base as the set of entities with mentions in Wikipedia (for a total of 5.3M entities). We allow each mention to have up to $K = 30$ possible candidates. As Bootleg is a sentence disambiguation system, we train on individual sentences from Wikipedia, where the anchor links and our weak labeling (Section 3.3) serve as mention labels. Our candidate lists Γ are mined from Wikipedia anchor links and the “also known as” field in Wikidata. For each person, we further add their first and last name as aliases linking to that person. We use the mention boundaries provided in the Wikipedia data and generate candidates by performing a direct lookup in Γ. We use Wikidata and YAGO knowledge graphs and Wikipedia to extract structural data about entity types and relations as input for Bootleg. Further details about data are in Appendix B.

Table 1: We compare Bootleg to the best published numbers on three NED benchmarks. “-” indicates that the metric was not reported. Bolded numbers indicate the best value for each metric on each benchmark.

Benchmark   Model                Precision   Recall   F1
KORE50      Hu et al. [24]^7     80.0        79.8     79.9
KORE50      Bootleg              86.0        85.4     85.7
RSS500      Phan et al. [40]     82.3        82.3     82.3
RSS500      Bootleg              82.5        82.5     82.5
AIDA        Févry et al. [16]    -           96.7     -
AIDA        Bootleg              96.9        96.7     96.8

Metrics
We report micro-average F1 scores for all metrics over true anchor links in Wikipedia (not weak labels). We measure the torso and tail sets based on the number of times that an entity is the gold entity across Wikipedia anchors and weak labels, as this represents the number of times an entity is seen by Bootleg.
For benchmarks, we also report precision and recall using the number of mentions extracted by Bootleg and the number of mentions defined in the data as denominators, respectively. The numerator is the number of correctly disambiguated mentions. For Wikipedia data experiments, we filter mentions such that (a) the gold entity is in the candidate set and (b) they have more than one possible candidate. The former is to decouple candidate generation from model performance for ablations.^6 The latter is to not inflate a model’s performance, as all models are trivially correct when there is a single candidate.

Footnote 6: We drop only 1% of mentions from this filter.

Training
For our main Bootleg model, we train for two epochs on Wikipedia sentences with a maximum sentence length of 100. For our benchmark model, we train for one epoch and additionally add a title embedding feature, a sentence co-occurrence KG matrix as another KG module, and a Wikipedia page co-occurrence statistical feature. Additional details about the models and training procedure are in Appendix B.

4.2 Bootleg Performance

Benchmark Performance
To understand the overall performance of Bootleg, we compare against reported state-of-the-art numbers on two standard sentence benchmarks (KORE50, RSS500) and the standard document benchmark (AIDA CoNLL-YAGO). Benchmark details are in Appendix B. For AIDA, we first convert each document into a set of sentences where a sentence is the document title, a BERT SEP token, and the sentence. We find this is sufficient to encode document context into Bootleg. We fine-tune the pretrained Bootleg model on the AIDA training set with a learning rate of 0.00007, 2 epochs, and a batch size of 16, evaluating every 25 steps. We choose the test score associated with the best validation score.^8 In Table 1, we show that Bootleg achieves up to 5.8 F1 points higher than prior reported numbers on benchmarks.

Footnote 8: We use the standard candidate list from Pershina et al. [36] when comparing to existing systems for fine-tuning and inference for AIDA CoNLL-YAGO.

Tail Performance
To validate that Bootleg improves tail disambiguation, we compare against a baseline model from Févry et al. [16], which we refer to as NED-Base.^9 NED-Base learns entity embeddings by maximizing the dot product between the entity candidates and fine-tuned BERT-contextual representations of the mention. NED-Base is successful overall on the validation set, achieving 85.9 F1 points, which is within 5.4 F1 points of Bootleg (Table 2). However, when we examine performance over the torso and tail, we see that Bootleg outperforms NED-Base by 8 and 41.2 F1 points, respectively. Finally, on unseen entities, Bootleg outperforms NED-Base by 50 F1 points. Note that NED-Base only has access to textual data, indicating that text is often sufficient for popular entities, but not for rare entities.

Footnote 9: As code for the model from Févry et al. [16] is not publicly available, we re-implemented the model. We used our candidate generators and fine-tuned a pretrained BERT encoder rather than training a BERT encoder from scratch, as is done in Févry et al. [16]. We trained NED-Base on the same weakly labeled data as Bootleg for 2 epochs.

Table 2: (top) We compare Bootleg to a BERT-based NED baseline (NED-Base) on validation sets of a Wikipedia dataset. We report micro-average F1 scores. All torso, tail, and unseen validation sets are filtered by the number of entity occurrences in the training data and such that the mention has more than one candidate.

Model                 All Entities   Torso Entities   Tail Entities   Unseen Entities
NED-Base              85.9           79.3             27.8            18.5
Bootleg               91.3           87.3             69.0            68.5
Bootleg (Ent-only)    85.8           79.0             37.9            14.9
Bootleg (Type-only)   88.0           81.6             62.9            61.6
Bootleg (KG-only)     87.1           79.4             64.0            64.7
# Mentions            4,065,778      1,911,590        162,761         9,626

4.3 Downstream Evaluation

Relation Extraction
Using the learned representations from Bootleg, we achieve the new state-of-the-art on TACRED, a standard relation extraction benchmark. TACRED involves identifying the relationship between a specified subject and object in an example sentence as one of 41 relation types (e.g., spouse) or no relation. Relation extraction is well suited for evaluating Bootleg because the substrings in the text can refer to many different entities, and the disambiguated entities impact the set of likely relations. Given an example, we run inference with the Bootleg model to disambiguate named entities and generate the contextual Bootleg entity embedding matrix, which we feed to a simple Transformer architecture that uses SpanBERT [27] (details in Appendix C). We achieve a micro-average test F1 score of 80.3, which improves upon the prior state of the art, KnowBERT [38] (which also uses entity-based knowledge), by 1.0 F1 points and upon the baseline SpanBERT model by 2.3 F1 points on TACRED-Revisited data (Table 3) ([53], Alt et al. [2]). We find that the Bootleg downstream model corrects errors made by the SpanBERT baseline, for example by leveraging entity, type, and relation information or recognizing that different textual aliases refer to the same entity (see Table 4).

Table 3: Test micro-average F1 score on revised TACRED dataset.

Model           F1
Bootleg Model   80.3
KnowBERT        79.3
SpanBERT        78.0

Table 4: The following are examples of how the contextual entity representation from Bootleg, generated from entity, relation, and type signals, can help our downstream model. We provide the TACRED example, the signals provided by Bootleg, and the predictions of our model and of the baseline SpanBERT model.

Example 1
  TACRED Example: Vincent Astor, like Marshall (subj), died unexpectedly of a heart attack (obj) in 1959 ...
  Gold relation: Cause of Death
  Bootleg Signals: Disambiguates “Marshall” to Thomas Riley Marshall and “heart attack” to myocardial infarction, which have the Wikidata relation “cause of death”
  Our Prediction: Cause of Death
  SpanBERT Prediction: No Relation

Example 2
  TACRED Example: The International Water Management (obj) Institute or IWMI (subj) study said both ...
  Gold relation: Alternate Names
  Bootleg Signals: Disambiguates alias “International Water Management Institute” and its acronym, the alias “IWMI”, to the same Wikidata entity
  Our Prediction: Alternate Names
  SpanBERT Prediction: No Relation

In studying the slices for which the Bootleg downstream model improves upon the baseline SpanBERT model, we rank TACRED examples in three ways: by the proportion of words that Bootleg disambiguates as an entity, for which it leverages Wikidata relations for the embedding, and for which it leverages Wikidata types for the embedding. For each of these three, we report the gap between the SpanBERT model and Bootleg model’s error rates on the examples with above-median proportion (more Bootleg signal) relative to the below-median proportion (less Bootleg signal). We find that the relative gap between the baseline and Bootleg error rates is larger on the slice above the median (with more Bootleg information) than below it by 1.10x, 4.67x, and 1.35x respectively: with more Bootleg information, the improvement our SotA model provides over SpanBERT increases (more details in Appendix C).

Industry Use Case
We additionally demonstrate how the learned entity embeddings from Bootleg provide useful information to a system at a large technology company that answers factoid queries such as “How tall is the president of the United States?”. We use Bootleg’s embeddings in the Overton [45] system and compare to the same system without Bootleg embeddings as the baseline. We measure the overall test quality (F1) on an in-house entity disambiguation task as well as the quality over the tail slices which include unseen entities.
Per company policy, we report relative to the baseline rather than raw F1 score; for example, if the baseline F1 score is 80.0 and the subject F1 is 88.0, the relative quality is 88.0/80.0 = 1.1. Table 5 shows that the use of Bootleg’s embeddings consistently results in a relative quality of at least 1.0, even over Spanish, French, and German, where improvements are most visible over tail entities.

Table 5: Relative F1 quality of an Overton [45] model with Bootleg embeddings over one without in four languages.

Validation Set   English   Spanish   French   German
All Entities     1.08      1.03      1.02     1.00
Tail Entities    1.08      1.17      1.05     1.03

4.4 Memory Usage

[Figure 3 plot: F1 (y-axis, 0.6 to 1.0) versus compression ratio (x-axis, 0 to 100) for All, Torso, Tail, and Unseen entities.]
Figure 3: We show the error across all entities, torso entities, tail entities, and unseen entities as we decrease the number of embeddings we use during inference, assigning the non-popular entities to a fixed unseen entity embedding. For example, a compression ratio of 80 means only the top 20% of entity embeddings are used, ranked by entity popularity.

We explore the memory usage of Bootleg during inference and demonstrate that by only using the entity embeddings for the top 5% of entities, ranked by popularity in the training data, Bootleg reduces its embedding memory consumption by 95%, while sacrificing only 0.8 F1 points over all entities. We find that the 5.3M entity embeddings used in Bootleg consume the most memory, taking 5.2 GB of space while the attention network only consumes 39 MB (1.37B updated model parameters in total, 1.36B from embeddings). As Bootleg’s representations must be used in a variety of downstream tasks, the representations must be memory-efficient: we thus study the effect of reducing Bootleg’s memory footprint by only using the most popular entity embeddings. Specifically, for the top k% of entities ranked by the number of occurrences in training data, we keep the learned entity embedding intact.
For the remaining entities, we instead use the embedding of a randomly chosen unseen entity. Instead of storing 5.3M entity embeddings, we thus store only $(k/100) \times 5.3\text{M}$ of them, which gives a compression ratio of $(100 - k)$. Figure 3 shows performance for k of 100, 50, 20, 10, 5, 1, and 0.1. We see that when just the top 5% of entity embeddings are used, we only sacrifice 0.8 F1 points overall and in fact score 2 F1 points higher over the tail. We hypothesize that the increase in tail performance is due to the fact that the majority of mention candidates all have the same learned embedding, decreasing the amount of conflict among candidates from textual patterns.

4.5 Ablation Study

Bootleg
To better understand the performance gains of Bootleg, we perform an ablation study over a subset of Wikipedia (data details explained in Appendix B). We train Bootleg with: (1) only learned entity embeddings (Ent-only), (2) only type information from type embeddings (Type-only), and (3) only knowledge graph information from relation embeddings and knowledge graph connections (KG-only). All model sizes are reported in Appendix B Table 10. In Table 2, we see that just using type or knowledge graph information leads to improvements on the tail of at least 25 F1 points and on the unseen entities of over 46 F1 points compared to the Ent-only model. However, neither the Type-only nor KG-only model performs as well on any of the validation sets as the full Bootleg model. An interesting comparison is between Ent-only and NED-Base. NED-Base overall outperforms Ent-only due to the fine-tuning of BERT word embeddings. We attribute the high performance of Ent-only on the tail compared to NED-Base to our Ent2Ent module, which allows for memorizing co-occurrence patterns over entities.

Regularization
To understand the impact of our entity regularization function $p(e)$ on overall performance, we perform an ablation study on a sample of Wikipedia (explained in Appendix B).
We apply (1) a fixed regularization set to a constant percent of 0, 20, 50 and 80, (2) a regularization function proportional to the power of the inverse popularity, and (3) the inverse of (2). Table 6 shows results over unseen entities (full results and details in Appendix B). We see that the fixed regularization of 80% achieves the highest F1 among the fixed regularizations in (1). The method that regularizes by inverse popularity achieves the highest overall F1. We further see that the scheme where popular entities are more regularized sees a drop of 9.8 F1 points in performance compared to the inverse popularity scheme.

Table 6: We show the micro F1 score over unseen entities for a Wikipedia sample as we vary the entity regularization scheme $p(e)$. A scalar percent means a fixed regularization. InvPop (inverse popularity scheme) applies less regularization for more popular entities and Pop applies more regularization for more popular entities.

p(e)              0%     20%    50%    80%    Pop    InvPop
Unseen Entities   48.6   52.5   57.7   59.9   52.4   62.2

5 Analysis
We have shown that Bootleg excels on benchmark tasks and that Bootleg’s learned patterns can transfer to non-NED tasks. We now verify whether the defined entity, type consistency, KG relation, and affordance reasoning patterns are responsible for these results. We evaluate each over a representative slice of the Wikipedia validation set that exemplifies one of the reasoning patterns and present the results from each ablated model (Table 7).

• Entity: To evaluate whether Bootleg captures factual knowledge about entities in the form of textual entity cues, we consider the slice of 28K overall and 5K tail examples where the gold entity has no relation or type signals available.
• Type Consistency: To evaluate whether Bootleg captures consistency patterns, we consider the slice of 312K overall and 19K tail examples that contain a list of three or more sequential distinct gold entities, where all items in the list share at least one type.
• KG Relation: To evaluate whether Bootleg captures KG relation patterns, we consider the slice of 1.1M overall and 37K tail examples for which the gold entities are connected by a known relation in the Wikidata knowledge graph.
• Type Affordance: To evaluate whether Bootleg captures affordance patterns, we consider a slice where the sentence contains keywords that are afforded by the type of the gold entity. We mine the keywords afforded by a type by taking the 15 keywords that receive the highest TF-IDF scores over training examples with that type. This slice has 3.4M overall and 124K tail examples.

Pattern Analysis
For the slice representing each reasoning pattern, we find that Bootleg provides a lift over the Entity-only and NED-Base models, especially over the tail. We find that Bootleg generally combines the entity, relation, and type signals effectively, performing better than the individual Entity-only, KG-only, and Type-only models, although the KG-only model performs well on the KG relation slice. The lift from Bootleg across slices indicates the model’s ability to capture the reasoning required for the slice. We provide additional details in Appendix D.

Table 7: We report the Overall/Tail F1 scores across each ablation model for a slice of data that exemplifies a reasoning pattern. Each slice is representative but may not cover every example that contains the reasoning pattern.

Model                 Entity      Type Consistency   KG Relation   Type Affordance
NED-Base              59/29       84/29              91/30         87/28
Bootleg               66/47       95/85              98/92         93/73
Bootleg (Ent-only)    59/31       87/45              90/42         87/39
Bootleg (Type-only)   53/44       93/80              93/69         90/66
Bootleg (KG-only)     40/29       92/79              97/93         89/68
% Coverage            0.7%/3.3%   8%/12%             27%/23%       84%/76%

Error Analysis
We next study the errors made by Bootleg and find four key error buckets.

• Granularity: Bootleg struggles with granularity, predicting an entity that is too general or too specific compared to the gold entity (example in Table 8). Considering the set of examples where the predicted entity is a Wikidata subclass of the gold entity or vice versa, Bootleg predicts a too general or too specific entity in 12% of overall and 7% of tail errors.
• Numerical: Bootleg struggles with entities containing numerical tokens, which may be due to the fact that the BERT model represents some numbers with sub-word tokens and is known to not perform as well for numbers as other language models [49] (example in Table 8). To evaluate examples requiring reasoning over numbers, we consider the slice of data where the entity title contains a year, as this is the most common numerical feature in a title. This slice covers 14% of overall and 25% of tail errors.
• Multi-Hop: There is room for improvement in multi-hop reasoning. In the example shown in Table 8, none of the present gold entities (Stillwater Santa Fe Depot, Citizens Bank Building (Stillwater, Oklahoma), Hoke Building (Stillwater, Oklahoma), or Walker Building (Stillwater, Oklahoma)) are directly connected in Wikidata; however, they share connections to the entity “Oklahoma”. This indicates that the correct disambiguation is Citizens Bank Building (Stillwater, Oklahoma), not Citizens Bank Building (Burnsville, North Carolina). To evaluate examples requiring 2-hop reasoning, we consider examples where none of the present entities are directly linked in the KG, but a present pair connects to a different entity that is not present in the sentence. We find this occurs in 6% of overall and 7% of tail errors. This type of error represents a fundamental limitation of Bootleg as we do not encode any form of multi-hop reasoning over a KG in Bootleg. Our KG information only encodes single-hop patterns (i.e., direct connections).
• Exact Match: Bootleg struggles on several examples in which the exact entity title is present in the text.
Considering examples where the BERT Baseline is correct but Bootleg is incorrect, in 28% of the examples, the textual mention is an exact match of the entity title. Further, 32% of the examples contain a keyword from the entity title that Bootleg misses (example in Table 8). We attribute this decrease in performance to Bootleg’s regularization. This mention-to-entity similarity would need to be encoded in Bootleg’s entity embedding, but the regularization encourages Bootleg to not use entity-level information.

6 Related Work
We discuss related work in terms of both NED and the broader picture of self-supervised models and tail data. Standard, pre-deep-learning approaches to NED have been rule-based [1] or leverage statistical techniques and manual feature engineering to filter and rank candidates [50]. For example, link counts and similarity scores between entity titles and mentions are two such features [12]. These systems tend to be hard to maintain over time, with the work of Petasis et al. [37] building a model to detect when a rule-based NED system needs to be retrained and updated. In recent years, deep learning systems have become the new standard (see Mudgal et al. [32] for a high-level overview of deep learning approaches to entity disambiguation and entity matching problems). The most recent state-of-the-art models generally rely on deep contextual word embeddings with entity embeddings [16, 46, 51]. As we showed in Table 2, these models perform well over popular entities, but struggle to resolve the tail. Jin et al. [26] and Hoffart et al. [23] study disambiguation at the tail, and both rely on phrase-based language models for feature extraction. Unlike our work, they do not fuse type or knowledge graph information for disambiguation.

Table 8: We identify four key error buckets for Bootleg: granularity, numerical errors, multi-hop reasoning, and missed exact matches. We provide a Wikipedia example, the gold entity, and Bootleg’s predicted entity for each example.

Granularity
  Wikipedia Example: Posey is the recipient of a Golden Globe Award nomination, a Satellite Award nomination and two Independent Spirit Award nominations.
  Bootleg Prediction: Satellite Awards
  Gold Entity: Satellite Award for Best Actress – Motion Picture

Numerical
  Wikipedia Example: He competed in the individual road race and team time trial events at the 1976 Summer Olympics.
  Bootleg Prediction: Cycling at the 1960 Summer Olympics – 1960 Men’s Road Race
  Gold Entity: Cycling at the 1976 Summer Olympics – 1976 Men’s Road Race

Multi-hop
  Wikipedia Example: Other nearby historic buildings include the Santa Fe Depot, the Citizens Bank Building, the Hoke Building, the Walker Building, and the Courthouse
  Bootleg Prediction: Citizens Bank Building (Burnsville, North Carolina)
  Gold Entity: Citizens Bank Building (Stillwater, Oklahoma)

Exact Match
  Wikipedia Example: According to the Nielsen Media Research, the episode was watched by 469 million viewers...
  Bootleg Prediction: Nielsen ratings
  Gold Entity: Nielsen Media Research

Disambiguation with Types
Similar to our work, recent approaches have found that type information can be useful for entity disambiguation [9, 14, 21, 31, 43, 55]. Dredze et al. [14] use predicted coarse-grained types as entity features in an SVM classifier. Chen et al. [9] model type information as local context and integrate a BERT contextual embedding into the model from [17]. Raiman and Raiman [43] learn their own type system and perform disambiguation through type prediction alone (essentially capturing the type affordance pattern). Ling et al. [31] demonstrate that the 112-type FIGER type ontology could improve entity disambiguation, and the LATTE framework [55] uses multi-task learning to jointly perform type classification and entity disambiguation on biomedical data. Gupta et al. [21] add both an entity-level and a mention-level type objective using type embeddings.
We build on these works using fine and coarse-grained entity-level type embeddings and a mention-level type prediction task.

Disambiguation with Knowledge Graphs
Several recent works have also incorporated (knowledge) graph information through graph embeddings [35], co-occurrences in the Wikipedia hyperlink graph [42], and the incorporation of latent relation variables [30] to aid disambiguation. Cetoli et al. [8] and Mulang et al. [33] incorporate Wikidata triples as context into entity disambiguation by encoding triples as textual phrases (e.g., “