Chapter 4

Machine Learning in Digital Scholarship

Andrew Janco
Haverford College

Introduction

We are entering an exciting time, when research and innovation in machine learning no longer require background knowledge in programming, mathematics, or data science. Tools like RunwayML, the Teachable Machine, and Google AutoML allow researchers to train project-specific classification and object detection models. Other tools, such as Prodigy or INCEpTION, provide the means to train custom named entity recognition and named entity linking models. Yet without a clear way to communicate the value and potential of these solutions to humanities scholars, they are unlikely to incorporate them into their research practices.

Since 2014, dramatic innovations in machine learning have provided new capabilities in computer vision, natural language processing, and other areas of applied artificial intelligence. Scholars in the humanities, however, are often skeptical. They are eager to realize the potential of these new methods in their research and scholarship, but they do not yet have the means to do so. They need to make connections between machine capabilities, research in the sciences, and tangible outcomes for humanities scholarship, but very often, drawing these connections is more a matter of chance than deliberate action. Is it possible to make such connections deliberately and identify how machine learning methods can benefit a scholar’s research?

This article outlines a method for connecting the technical possibilities of machine learning with the intellectual goals of academic researchers in the humanities. It argues for a reframing of the problem. Rather than appropriating innovations from computer science and artificial intelligence, this approach starts from humanities-based methods and practices. This shift allows us to work from the needs of humanities scholars, in terms that are familiar and have recognized value to their peers.
Machines can augment scholars’ tasks with greater scale, precision, and reproducibility than are possible for a single scholar alone. However, only relatively basic and repetitive tasks can presently be delegated to machines.

This article argues that John Unsworth’s concept of “scholarly primitives” is an effective tool for identifying basic tasks that can be completed by computers in ways that advance humanities research (2000). As Unsworth writes, primitives are “basic functions common to scholarly activity across disciplines, over time, and independent of theoretical orientation.” They are the building blocks of research and analysis. As the roots and foundations of our work, “primitives” provide an effective starting point for the augmentation of scholarly tasks. Here it is important to note that the end goal is not the automation of scholarship, but rather the delegation of appropriate tasks to machines. As François Chollet recently noted,

Our field isn’t quite “artificial intelligence” — it’s “cognitive automation”: the encoding and operationalization of human-generated abstractions / behaviors / skills. The “intelligence” label is a category error. (2020)

This view shifts our focus from the potential intelligence of machines toward their ability to complete useful tasks for human ends. Specifically, they can augment scholars’ work by performing repetitive tasks at scale with superhuman speed and precision. I proceed from this understanding to argue for an experimental and interpretive approach to machine learning, one that highlights the value of the interaction between scholar and machine rather than what machines can produce.

***

Unsworth’s notion of the “scholarly primitive” takes its meaning from programming, where primitives are the most basic operations and data types of a programming language.
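A quick illustration of primitives in this programming sense, using Python (any language would serve equally well):

```python
# Python's built-in "primitive" types: the atoms from which programs are built.
number = 42         # int
text = "scholarly"  # str (in Python, strings are a built-in type)
flag = True         # bool

print(type(number).__name__, type(text).__name__, type(flag).__name__)  # int str bool
```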
Primitives form the building blocks for all other components and operations of the language. This borrowing of terminology also suggests that primitives are not universal. A sequence of characters called a string is a primitive in Python, but not in Java or C. The architecture of a language’s primitives also changes over time and evolves with community needs. The Python community, for example, has embraced Unicode as a standard so that strings can represent every human language (and emoji besides). Other language communities continue to work with a range of character encodings, which grants greater flexibility to the individual programmer and avoids the notion that there should be a single common standard.

For scholarship, the term offers a metaphor and point of departure. It poses a question: what are the most basic elements of scholarly research and analysis? Unsworth offers several initial examples of primitives to illustrate their value, without claiming that they are comprehensive: discovering, annotating, comparing, referring, sampling, illustrating, and representing. These terms offer a “list of functions (recursive functions) that could be the basis for a manageable but also useful tool-building enterprise in humanities computing.” Primitives can thus guide us in the creation of computational tools for scholarship.

For example, with the primitive of comparison, a scholar might study different editions of a text, searching for similarities and differences that often lead to new insights or highlight ideas that would otherwise be taken for granted. As a tool, comparison can (but does not always) reveal new information. For an assignment in graduate school, I compared a historical calendar that showed the days of the week against entries in Stalin’s appointment book. The simple juxtaposition revealed that none of Stalin’s appointments fell on a Sunday. This example raises questions for further investigation and interpretation.
If Stalin was an atheist who worked at all times of the day and night, why wouldn’t he schedule meetings on Sundays? Perhaps it was a legacy of Stalin’s youth spent in seminary? Is there a similar pattern in other periods of Stalin’s life? The craft of humanities research relies on many such simple initial queries. It should be noted that these little experiments are just the beginning of a research project. Nonetheless, the utility of comparison is clear. If anything, it seems so basic as to go unnoticed. This particular comparison offered an insight and new knowledge that led to further research questions.

Such beginnings are often a matter of luck. However, machine learning offers an opportunity to increase the dimensionality of comparisons. The similarities and differences between two editions of a text can easily be quantified using Levenshtein distance.[1] That measure, however, only captures differences at the level of characters on a page. With machine learning, we can train embeddings that account for semantics, authors, time periods, genders, and other features of a text and its contents simultaneously. We can quantify similarity in new ways that facilitate new forms of comparison. This approach builds on the original meaning and purpose of comparison as a “scholarly primitive,” but opens additional directions for research and opportunities for insight. Rather than relying on happenstance or intuition to find productive comparisons, we can systematically search and compare research materials.

The second “scholarly primitive” that lends itself well to augmentation is annotation. This activity takes different forms across disciplines. A literary scholar might underline notable sections of a text or write a note in the margins. A historian transcribes information from an archival source into a notebook. At their core, these actions add observations and associations to the original materials.
These are the first and most basic steps in the research process, connecting information in a source to a larger set of research materials. We add context and meaning to materials, making them part of a larger collection.

When working with texts or images, machine learning models are presently capable of making simple annotations and associations. For example, named entity recognition (NER) models are able to recognize personal names, place names, and other keywords in text. Each label is an annotation that makes a claim about the content of the text: “Steamboat Springs” or “New York City” is linked to an entity called PLACE. Once again, we are speaking about the most basic first steps that scholars perform during research. I know that Steamboat Springs is a place. It’s where I grew up. However, another scholar, one less versed in small mountain towns in Colorado, might not recognize the town name. They might identify it as a spring or a ski resort, or perhaps a volcanic field in Nevada. The idea of “scholarly primitives” forces us to confront the importance of domain knowledge and the role it plays in the interpretation of materials. To teach a machine to find entities, we must first explain everything in very specific terms. We can train the machine to use surrounding contextual information in order to predict correctly whether “Steamboat Springs” refers to the town, a spring, or a ski resort.

As part of a project with Philip Gleissner, I trained a model that correctly identifies Soviet journal names in diary entries. For instance, the machine uses contextual clues to identify when the term Volga refers to the journal by that name and not to the river or the automobile. When is a mention of “October” a journal name and not a month, a factory name, or the revolution? The trained model makes it possible to identify references to journals in a corpus of over 400,000 diary entries.
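In outline, a trained NER model maps spans of text to labels. A toy stand-in for such a model (a hand-built gazetteer rather than a trained statistical model, and purely illustrative) shows the shape of the output:

```python
# Toy gazetteer-based tagger: a hand-built stand-in for a trained NER model.
# A real model would use surrounding context; this sketch only matches known names.
GAZETTEER = {
    "Steamboat Springs": "PLACE",
    "New York City": "PLACE",
    "Volga": "JOURNAL",
}

def annotate(text):
    """Return (span, label) pairs for every gazetteer entry found in the text."""
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(annotate("I grew up in Steamboat Springs."))  # [('Steamboat Springs', 'PLACE')]
```

Real projects train such models with tools like spaCy or Prodigy, supplying labeled examples so the model can generalize from context rather than relying on a fixed list, which is exactly what the fixed list above cannot do.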
This in turn makes it possible to research the diaries with a focus on reader reception. Normally, this would be a laborious and time-consuming task. Each time the machine predicts an entity in the text, it adds an annotation. What was simply text is now marked as an entity. As part of this project, we had to define the relevant entities, create training data, and train the model to accomplish a specific task. This process has tangible value for scholarship because it forces us to break down complicated research processes into their most basic tasks and procedures.

[1] Named after the Soviet mathematician Vladimir Levenshtein, Levenshtein distance uses the number of changes that would be needed to make two objects identical as a measure of their similarity.

As noted before, annotation can be an act of association and linking. Natural language processing is capable not only of recognizing entities in a text, but also of associating that text with a record in a knowledge base. This capability is called named entity linking. Using embeddings, a statistical language model can predict not only that “Steamboat Springs” is a town, but that it is a specific town, with the record Q984721 in Wikidata. This association opens a wealth of contextual information about the place, including its population, latitude and longitude, and elevation. A scholar might have ample knowledge and experience reading literature — specifically, Milton. A machine does not, but it has access to contextual information that enriches analysis and permits associations. The result is a reading of a literary work that accounts for contextual knowledge. To be sure, named entity linking is not a replacement for domain knowledge. It is, however, able to augment a scholar’s contextual knowledge of materials and make that information available for study during research.
At this point, we are asking the machine not only to sort or filter data, but to reason actively about its contents. Machine learning offers the potential to automate humanities annotation tasks at scale. This is true of basic tasks, such as recognizing that a given text is a letter. It is also true of object recognition tasks, such as identifying a state seal in a letterhead or other visual attributes. A Haverford College student was doing research on documents in a digital archive that we are building with the Grupo de Apoyo Mutuo (GAM), comprising more than three thousand case investigations of persons disappeared during the Guatemalan Civil War. They noticed that many of the documents were signed with a thumbprint. The student and I trained an image classification model to identify those documents, thus providing the capability to search the entire collection for this visual attribute. The thumbprints provided a proxy for literacy and allowed the student to study the collection in new ways. Similarly, documents containing the state seal of Guatemala are typically letters from the government in reply to GAM’s requests for information about disappeared persons.

At present, several excellent tools exist to facilitate machine annotation of images and texts. Google’s Teachable Machine offers an intuitive web application that humanities faculty and students can use to train classification models for images, sounds, and poses. To take the example above, the user would upload images of correspondence. They would then upload images of documents that are not letters.[2] Once training begins, a base model is loaded and trained on the new categories. Because the model already has existing training on image categories, it is able to learn the new categories with only a few examples. This process is called transfer learning.
For more advanced tasks, Google offers AutoML Vision and Natural Language, which are able to process large collections of text or images and to deploy trained models on Google’s cloud infrastructure. Similar products are available from Amazon, IBM, and other companies. RunwayML offers a locally installed program with more advanced capabilities than the Teachable Machine. RunwayML works with a wide range of machine learning models and is an excellent way for scholars to explore their capabilities without having to write code.[3] The accessibility of tools like Runway allows for low-stakes experimentation and exploration. It is also a particularly good way for scholars to explore new methods and discover new materials.

For Unsworth, discovery is largely the process of identifying new resources. We can find new sources in a library catalog, on the shelf, or in a conversation. These activities require a human in the loop because it is the person’s incomplete knowledge of a source that makes it a “discovery” when found. Given that machines reason about the content of text and images in ways that are quite unlike those of humans, machine learning opens new possibilities for discovery.

[2] The Google Cloud Terms of Service give specific assurance that your data will not be shared or used for any purpose other than training the model. More expert analysis may find concerns, and caution is always warranted. At present, there seems to be no more risk in using cloud services for machine learning tasks than there is in using cloud services generally. See https://cloud.google.com/terms/.

[3] Teachable Machine, https://teachablemachine.withgoogle.com/; Google AutoML, https://cloud.google.com/automl/; RunwayML, https://runwayml.com/.
When it comes to the differences between our own habits of mind and the computational processes of artificial networks, we may speak of “neurodiversity.” Scholars can benefit from these differences, since the strengths of machine thinking complement our needs.

Machine learning models offer a variety of ways to identify similarity and difference in research materials. Yale’s PixPlot, for example, uses a convolutional network to train image embeddings, which are then plotted relative to one another in two-dimensional space using t-distributed stochastic neighbor embedding (t-SNE) (Duhaime n.d.).[4] PixPlot creates a striking visualization of hundreds or thousands of images, organized and clustered by their relative visual similarity. As a research tool, PixPlot and similar projects offer a quick means of identifying statistically relevant similarities and clusters. The visualization reveals which patterns are most evident to the machine and provides a discovery tool for associations that might not be evident to a human researcher. Ben Schmidt has applied a comparable process to “machine read” and visualize fourteen million texts in the HathiTrust (n.d., 2018).[5] Using the relative co-occurrence of words in a book, Schmidt is able to train book embeddings. Schmidt’s vectors provide an original way to organize and label texts based purely on the machine’s “reading” of a book. These machine-generated labels and clusters can be compared against human-generated metadata. The value of this work lies in the human investigation of what machine models find significant in a collection of research materials. For example, with topic modeling, a scholar must interpret what a particular algorithm has identified as a statistically significant topic by reading a cryptic chain of words: the topic “menu, platter, coffee, ashtray” is likely related to a diner. In these efforts, Scattertext offers an effective tool for visualizing which terms are most distinctive of a text category.
In a given corpus of text, I can identify which words are most exemplary of poetry and which are most exemplary of prose. Scattertext creates a striking and useful visualization, or it can be run from the terminal to process large collections of text.

[4] See also https://artsexperiments.withgoogle.com/tsnemap/.

[5] At the time of writing, Schmidt’s digital monograph Creating Data (n.d.) is a work in progress, with most sections empty until official publication.

Conclusion

As a conceptual tool, “scholarly primitives” has considerable promise to connect the intellectual goals of academic researchers in the humanities with the technical possibilities of machine learning. Rather than focusing on the capabilities of machine learning methods and the priorities of machine learning researchers, this method offers a means to build from the existing research practices of humanities scholars. It allows us to identify what kinds of tasks would benefit from being augmented. Using “primitives” shifts the focus away from large abstract goals, such as research findings and interpretive methods, to the micro-methods and actions of humanities research. By augmenting these activities, we are able to benefit from the scale and precision afforded by computational methods, as well as from the valuable interplay between scholars and machines as humanities research practices are made explicit and reproducible.

References

Chollet, François. 2020. “Our Field Isn’t Quite ‘Artificial Intelligence’ — It’s ‘Cognitive Automation’: The Encoding and Operationalization of Human-Generated Abstractions / Behaviors / Skills. The ‘Intelligence’ Label Is a Category Error.” Twitter, January 6, 2020, 10:45 p.m. https://twitter.com/fchollet/status/1214392496375025664.

Duhaime, Douglas. n.d. “PixPlot.” Yale DHLab. Accessed July 12, 2020. https://dhlab.yale.edu/projects/pixplot/.
Schmidt, Benjamin. n.d. “A Guided Tour of the Digital Library.” In Creating Data: The Invention of Information in the American State, 1850–1950. http://creatingdata.us/datasets/hathi-features/.

———. 2018. “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries.” Journal of Cultural Analytics, October. https://doi.org/10.22148/16.025.

Unsworth, John. 2000. “Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?” Paper presented at the Symposium on Humanities Computing: Formal Methods, Experimental Practice, King’s College, London, May 2000. http://www.people.virginia.edu/~jmu2m/Kings.5-00/primitives.html.