Chapter 1
Artificial Intelligence in the Humanities: Wolf in Disguise, or Digital Revolution?
Arend Hintze, Dalarna University
Jorden Schossau, Michigan State University

Introduction

Artificial intelligence, with its machine learning abilities coupled with an almost human-like understanding, sounds like the ideal tool for the humanities. Instead of using primitive quantitative methods to count words or catalogue books, current advancements promise to reveal insights that otherwise could only be obtained by years of dedicated scholarship. But are these technologies imbued with intuition or understanding, and do they learn like humans? Are they capable of developing their own perspective, and can they aid in qualitative research?

In the 80s and 90s, as home computers were becoming more common, Hollywood was sensationalizing the idea of smart or human-like artificially intelligent machines (AI) through movies such as Terminator, Blade Runner, Short Circuit, and Bicentennial Man. At the same time, the home experience of personal computing highlighted the difference between Hollywood's intelligent machines and the reality of how "dumb" machines really were. Home, or even industry, machines could not answer natural-language questions of anything but the simplest complexity. Instead, users or programmers needed to painstakingly implement an algorithm to address their question. Then, the user was required to wait for the machine to slavishly follow each programmed instruction while hoping that whoever entered the instructions did not make a mistake. Despite the sensation of Hollywood's intelligent machines, people understood that computers did not and could not think like humans, but that they excel at performing repetitive tasks with extreme speed and fidelity. This shaped the expectations for interacting with computers. Computers became efficient tools that required specific instructions in order to achieve a desired outcome.

Computational technology and user experience drastically changed over the next 20 years. Technology became much more intuitive to use while it also became much more powerful at handling large data sets. For instance, Google can return search results for websites as a response to even the silliest or sparsest request, with a decent chance that the results are relevant to the question asked. Did you read a manual before you used your smartphone, or did you, like everyone else, just "figure it out"? And children raised on modern on-demand media services now ask to skip a song playing on broadcast radio. The older technologies quickly feel archaic.

These technological advancements go hand in hand with developments in the field of machine learning and artificial intelligence. The automotive industry is on the cusp of fully self-driving cars. Electronic assistants are not only keeping track of our schedules and responding to spoken language, they will also soon start making our appointments by speaking to other humans on our behalf.
Databases are getting new voice-controlled, intuitive interfaces, changing a typical incomprehensible "SELECT AVG(salary) FROM employeeList WHERE yearHired > 2012;" to a spoken "Average salary of our employees hired after 2012?"

Another phenomenon is the trend in many disciplines to go from "qualitative" to "quantitative" research, or to think about the "system" rather than the "components." The field that probably experienced this trend first was biology. While obviously descriptive about species of organisms, biologists have also always wanted to understand the mechanisms that drive life on Earth, spanning micro to macro scales. Consequently, a lot is known about the individual chemical components that constitute our metabolism, the components that drive cell division and DNA replication, and which genes are involved in, for example, developmental processes. However, in many cases, our scientific knowledge only covers single functions of single components. In the context of the cell, the state of the organism and how other components interact matter a great deal. Cancer, for example, cannot be explained by a single mutation on a single gene but involves many complex interactions (Hanahan and Weinberg 2011). Ecosystems don't collapse because a single insect dies, but because indirect changes in the food chain interact in complex ways (for a review of the different theories, see Tilman 1996). As a result, systems biology emerged. Systems biologists use large data sets and are often dependent on computer models to understand phenomena on the systems level.

The field of bioinformatics is one example of an entire field that emerged as a result of using computers to study entire systems that were otherwise humanly intractable. The Human Genome Project to sequence the complete human genome finished in 2003, at a time when consumer data storage was limited by the amount of data that fit on a DVD (4.7 GB). While the human genome fits on a DVD, the data that came from the sequencing machines was much larger. Short repetitive sequences first needed assembly, which at that time was a high-performance computing task.

Other fields have since undergone their own computational revolutions, and now the humanities begin theirs. Computers have been a part of core library infrastructure and experience for some time now, cataloging entries in a database and allowing intuitive user exploration of that database. However, the digital humanities go beyond this (Fitzpatrick 2012). The ability to analyze (crawl) extremely large corpora from different sources, monitor the internet using the Internet of Things as a large sensor array, and detect patterns by using sophisticated algorithms can each produce a treasure trove of quantitative data. Until this point, these tasks could only be described or analyzed qualitatively.

Additionally, artificial intelligence promises models of the human mind (Yampolskiy and Fox 2012). Machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain will eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). This would ultimately close the gap between quantitative and qualitative approaches by allowing an inspection of experience.
However, this bridging between quantitative and qualitative methods creates a possible tension for the humanities, which historically define themselves by qualitative methodologies. When qualitative experiences or responses can be finely quantified, such as the sadness caused by reading a particular passage, or the curiosity caused by viewing certain works of art, then the field will undergo a revolution. When this happens, we will be able to quantify and discuss how sadness was learned by reading, or how much surprise was generated by viewing an artwork.

This is exactly the point where the metaphors break down. Current computational models of the mind are not sophisticated enough to allow these kinds of inferences. Machine learning algorithms work well for what they do, but they have nothing to do with what a person would call learning. Artificial intelligence is a broad, encompassing field. It includes methods that might have appeared to be magic only a couple of years ago (such as generative adversarial networks). The algorithmic finesse resulting from these advances is capable of beating humans in chess (Campbell, Hoane Jr, and Hsu 2002), but it is only a very specialized algorithm that has nothing to do with the way humans play or learn chess. This means we are back to the problem we had in the 80s. Instead of being disappointed by the difference between modern technology and Hollywood technology, we are disappointed by the difference between modern technology and the experience implied by the labels given to those technologies. Applying misnomers such as "smart," "intelligent," "search," and "learning" to modern technologies that have little to do with those terms is misleading. It is possible that such technology was deliberately branded with these terms for improved marketing and sales, effectively redefining them and obscuring their original meaning. Consequently, we are again disappointed by the mismatch between our expectations of our computing infrastructure and the reality of our experiences.

The following sections will explore current machine learning and artificial intelligence technologies, explain how quantitative or qualitative they really are, and consider the possible implications for the future digital humanities.

Learning: Phenomenon versus Mechanism

Learning is an electrochemical process that involves cells, their genetic makeup, and how they are interconnected. Some interplay between external stimuli and receptor proteins in specialized sensor neurons leads to electrochemical signals propagating over a network of interconnected cells, which themselves respond with physical and genetic changes to said stimuli, probably also dependent on previous stimuli (Kandel, Schwartz, and Jessell 2000). This concoction of elaborate terms might suggest that we know in principle which parts are involved and where they are, but we are far from an understanding of the learning mechanism. The description above is as generic as saying that a city functions because cars drive on streets. Even though we might know a lot about long-term potentiation or the mechanism by which neurons that fire together wire together (aka Hebbian learning), neither of these processes actually mechanistically explains how learning works.
Neuroscience, neurophysiology, and cognitive science have not been able to discover this complete process in such a way that we can replicate it, though some inroads are being made (El-Boustani et al. 2018). Similarly, we find promising new interdisciplinary efforts like "cognitive computational neuroscience" that try to bridge the gap between neuro- and cognitive science and computation (Kriegeskorte and Douglas 2018). So, unfortunately, while the components involved can be identified, the question of "how learning works" cannot be answered mechanistically.

However, a lot is known about the phenomenon of learning. It happens during the lifetime of an organism. What happens between the lifetimes of related organisms is an adaptive process called evolution: inheritance, variation, and natural selection over many generations, spanning up to 3.5 billion years here on Earth, enabled populations of organisms to succeed in their environments in any way they could. Evolutionary forces found ways for organisms to adapt to their environment during their own lifetimes. While this can take many forms, such as storing energy, seeking shelter, or having a fight-or-flight response, it has led to the phenomenon we now call learning. Instead of discussing the diversity of learning in the animal kingdom, we will discuss the richest example: human learning. Here, learning is defined as the cognitive adaptation to external stimulus.

The phenomenon of learning can be observed as an increase in performance over time. Learning makes the organism better at doing something. In humans, because we have language and a much higher degree of abstract thinking, an improvement in performance can be facilitated very quickly. While it takes time to learn how to juggle, the ability to find the mean of a series of samples can be quickly communicated by reading Wikipedia. Both types of lifetime adaptation are called learning. However, these lifetime adaptations are facilitated by two different cognitive processes: explicit or implicit learning.1

1 There are more than these two mechanisms, but these are the two major ones.

Explicit learning—or episodic memory—is fact-based memory. What you did yesterday, what happened in your childhood, or the list of things you should buy when you go shopping are all explicit memories. Currently, the engram theory best explains this mechanism (Poo et al. 2016 elaborates on the origins of the term). Explicit memory can be retrieved relatively easily and then used to inform future decisions: "Press the green button if the capital of Italy is Paris, otherwise press the red." The rate of learning for explicit memory can be much higher than for implicit memory, and it can also be communicated more quickly. Abstract communication, such as "I saw a wolf," allows us to transfer the experience of seeing a wolf quickly to other individuals, even though their evoked explicit memory might not be identical to ours.

Learning by using implicit memory—sometimes called procedural memory—is facilitated by much slower processes (Schacter, Chiu, and Ochsner 1993). It is generally based on the idea that learning is a combination of expectation, observation or action, and internal model changes. For example, a recovering hospital patient who has suffered a stroke is handed an apple. In this exchange, the patient forms an expectation of where his hand will be to accept the apple. He engages his muscles to move his forearm and hand to accept the apple, which is his action. Then the patient observes that his arm did not arrive at the correct position (due to neurological damage).
This discrepancy between expectation and action-outcome drives internal changes so that the patient's brain learns how to adequately control his arm. Presumably, everything considered a skill is based on this process. While very flexible, this form of memory is not easily communicated nor fast to acquire. For instance, while juggling can be described, it cannot be communicated in such a way that it enables the recipient to juggle without additional training.

This description of explicit and implicit learning is an amalgamation of many different hypotheses and observations. Also, these processes are not as well segregated in practice as outlined here. What is important is what these two learning mechanisms are based on: observations lead to memory, and internal predictions together with exploration lead to improved models about the world. Lastly, these learning processes only exist in organisms because they previously conferred an evolutionary advantage: organisms that could memorize and then act on those memories had more offspring than those that did not. This interaction of learning and evolution is called the Baldwin effect (Weber and Depew 2003). Organisms that could explore the environment, make predictions about it, and use observations to optimize their internal models were similarly more capable than organisms that could not.

Machines do not Learn; They are Trained

Now prepared with a proper intuition about learning, we can turn our attention to machine learning. After all, our intuitions should be meaningful in the computational domain as well, if learning always follows the same pattern. One might be disappointed when looking over the table of contents of a machine learning book and finding only methods for creating static transformation functions (see Russell and Norvig 2016, one of the putative foundations of machine learning and AI). There will typically be a distinction between supervised and unsupervised learning, between categorical and continuous data, and maybe a section about other "smart" algorithms. You will not find a discussion of implicit and explicit memory, let alone methods for implementing these concepts. So, if these important sections in our imaginary machine learning book do not discuss the mechanisms of learning, then what are they discussing?

Unsupervised learning describes algorithms that report information based on associations within the data. Clustering algorithms are a popular example of unsupervised learning. These use similarity between data points to form and report on distinct groups of data. Clustering is a very important method, but it is only a well-designed algorithm that is not adaptive.

Supervised learning describes algorithms that refine a transformation function to convert a certain input into a certain output. The idea is to balance specific and general refinement such that the transformation function correctly transforms all known examples but generalizes enough to work well on new variations. For example, we would like the machine to transform image data into textual labels, such as "house" or "car." The input is an image and the output is a label. The input image data are provided to the machine, and small adjustments to the machine's function are made depending on how well it provided the correct output. Many iterations later, the result is ideally a machine that can transform all image data into correct labels, and even operate correctly on new variations of images not provided before.
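To make that iterative adjustment concrete, the following is a minimal sketch (not from the chapter) of a supervised learner in Python; the toy two-feature data, learning rate, and variable names are illustrative assumptions, standing in for images and labels.

```python
import numpy as np

# Toy supervised learning: learn to map 2-feature inputs to binary labels.
# Real image classifiers run the same loop with far more parameters and data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # 100 examples, 2 features each
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # the known "correct" labels

w = np.zeros(2)                              # the machine's adjustable function
b = 0.0
lr = 0.1                                     # size of each small adjustment

for epoch in range(200):
    pred = 1 / (1 + np.exp(-(X @ w + b)))    # predicted probability of label 1
    error = pred - y                         # how wrong each output was
    w -= lr * (X.T @ error) / len(y)         # nudge weights toward correctness
    b -= lr * error.mean()

accuracy = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Note that the loop does nothing but nudge a static transformation function toward correctness on known examples; nothing in it forms memories, expectations, or predictions in the sense described above.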
Supervised learning is extremely powerful and is yet to be fully explored. However, supervised learning is quite dissimilar to actual learning. A common argument is that supervised learning uses feedback in a "student-teacher" paradigm of making changes with feedback until proper behavior is achieved, so it could be considered learning. But this feedback is external, objective, and not at all similar to our prediction-and-comparison model, which, for instance, operates without an all-knowing oracle whispering "good" or "bad" into our ears. Humans and other organisms instead compare predictions with outcomes, and their choices are driven by an intersection of desire and prediction.

What seems astonishing is the diverse and specialized capabilities that these two rather simple types of computation, clustering and classification, can produce. Their economic impact is enormous, and we are still finding new ways to combine neural networks and exploit deep learning techniques to create amazing data transformations, such as deep fake videos. But so far, each astounding example of AI, through machine learning or some other method, showcases these capabilities not as one machine, but as an independently achieved computational marvel. Each of these examples does only exactly what it was trained to do in a narrow domain and no more. Siri, or any other voice assistant for that matter, does not drive a car (López, Quesada, and Guerrero 2017), Watson does not play chess (Ferrucci et al. 2013), and Google's AlphaGo cannot understand spoken language (Gibney 2016). Even hybrid approaches, such as combining speech recognition, chess playing, and autonomous driving, would only be a combination of specialty strategies, not an entity trained from the ground up.

Modern machine learning gives us an amazing collection of very applicable, but extremely specialized, computational tools that may be customized to particular data sets, but the resulting machines do not learn autonomously as you or I do. There are cutting-edge technologies, such as so-called neuromorphic chips (Nawrocki, Voyles, and Shaheen 2016) and other computational brain models that more closely mimic brain function, but they are not what has been sensationalized in the media as machine learning or AI, and they have yet to showcase competence on difficult problems competitive with standard supervised learning.

Curiously, many people in the machine learning community defend the term "learning," arguing there is no difference between learning and training. In traditional machine learning, the trained algorithm is deployed as a service, after which it no longer improves. If the data set ever changes, then a new training set including correct labels needs to be generated and a new training phase initiated. However, if the teacher can be forever bundled with the learner and training continued during the deployment phase, even on new never-before-seen data, then indeed the delineation between learning and training is far less clear. Approaches to such lifelong learning exist, but they struggle with what is called catastrophic forgetting—the phenomenon that only the most recent experiences are learned, at the expense of older ones (French 1999).
This kind of continual retraining is also the objective of continuous delivery for machine learning. Unfortunately, creating a new training set is typically the most expensive endeavor in standard supervised machine learning development. Adequate training then becomes difficult or impossible without involving thousands or millions of human inputs to keep up with training and using the online machine on an ever-evolving data set. Some have tried to use such "human-in-the-loop" methods, but the resulting machine then becomes only a slight extension of the humans who are forever caught in the loop. Is it an intelligent machine, or a human trapped in a machine?

To combat this problem of generating the training set, researchers altered the standard supervised learning paradigm of flexible learner and rigid teacher to make the teacher likewise flexible, generating new data that continually probes the bounds of the student machine. This is the method of Generative Adversarial Networks, or GANs (Goodfellow et al. 2014). The teacher generates training examples, and the student discerns between those generated examples and the original labeled training data. After many iterations, the teacher is improved to better fool the student, and the student is improved to better discern generated training data. As amazing as they are, GANs only partially mitigate the problematic requirement for human-labeled training data, because GANs can only mimic a known labeled distribution. If that distribution ever changes, then new labeled data must be generated, and again we have the same problem as before. Unfortunately, GANs have been sensationalized as magic, and the public and hobbyist expectation is that GANs are a way toward much better artificial intelligence. Disappointment is inevitable, because GANs only allow us to explore what it would be like to have more training data from the same data sets we were using before.

These expectations are important for machine learning and AI. We are very familiar with learning, to the point where our whole identity as humans could be generously defined as the result of being a monkey with an exceptional proclivity for learning. If we now approach AI and machine learning with expectations that these technologies learn as we do, or are an equally general-purpose intelligence, then we will be bitterly disappointed. The best example of such discrepancy is how easily neural networks trained by deep learning can be fooled. Images that are seemingly identical and differ only by a few pixels are grossly misclassified, a mistake no human would make (Nguyen, Yosinski, and Clune 2015). Fortunately, we know about these biases and the possible shortcomings of these methods. As long as we have the right expectations, we can take their flaws into account and still enjoy the prospects they provide.

Trained Machines: Tool or Provocation?

On one side we have the natural sciences, characterized by hypothesis-driven experimentation reducing reality to an abstract model of causal interactions. This approach can inform us about the consequences of our possible actions, but only as far into the future as the model can adequately predict. With machine learning and AI, we can move this temporal horizon of prediction farther into the future. While weather models might still struggle to predict precipitation 7 days in advance, global climate models predict in detail the effects of global warming in 100 years.
But these models are nihilistic, void of values, and cannot themselves answer the question of whether humans would prefer to live in one possible future or another. Is sunshine better than rain? The humanities, on the other hand, are home to exactly these problems. What are our values? How do we understand what is essential? Now that we know the facts, how should we choose? Do we speak for everyone? The questions seem to be endless, but they are what makes our human experience so special, and what separates the humanities from the sciences.

Labels—such as learning or intelligence—are too easily anthropomorphized. A technology branded in this way suggests human-like properties: intelligence, common sense, or even subjective opinion. From a name like "deep learning" we expect a system that develops a deep and intuitive understanding, with insights more profound than our own. However, these systems do not provide an alternative perspective; as explained above, they are only as good, or as biased, as the scientist selecting their training data. Just because humans and machine learning are both black boxes, in the sense that their inner workings are opaque, does not mean they share other qualities. For instance, having labeled the ML training process as "learning" does not imply that ML algorithms are curious and learn from observations. While these new computerized quantitative measures might be welcomed by some scholars, others will view them as an existential threat to the very nature of the humanities. Are these quantitative methods sneaking into the humanities disguised by anthropomorphic terms, like a wolf shrouded in a sheep's fleece? From this viewpoint, having the wrong expectations not only provokes disappointment, but floods the humanities with sophisticated technologies that dilute and muddy the nature of the qualitative research that makes the humanities special.

However, this imminent clash between quantitative and qualitative research also provides a unique opportunity. Suppose there is a question that can only be answered subjectively and qualitatively. If so, it would define a hard boundary against the aforementioned reductionism of the purely causal quantitative approach. At the same time, such a boundary presents the perfect target for an artificially intelligent system to prove its utility. If a computational human analog can be created, then it must be capable of performing the same tasks as a humanities researcher. In other words, it must be able to answer subjective and qualitative questions, regardless of its computational and quantitative construction. Failing at such a task would be equivalent to failing the famous Turing test, thereby proving the AI is not yet human-like enough. In this way, the qualitative nature of the humanities poses a challenge—and maybe a threat—to artificially intelligent systems. While some might say the threat is mutual, past successes of interdisciplinary research suggest otherwise: the digital humanities could become the forefront of AI research.

Beyond machine training, towards general purpose intelligence

Currently, machines do not learn but must be trained, typically with human-labeled data. ML algorithms are not smart as we are, but they can solve specific tasks in sophisticated ways. Perhaps sentience will only be a product of enough time and training data, but the path to sentience probably requires more than time and data.
The process that gave rise to human intelligence was evolution. This opportunistic process optimized brains over endless generations to perform ever-changing tasks, and it is the only known example of a process that resulted in such complex intelligence. None of the earlier described computational methods even remotely follow this paradigm: researchers designed ad hoc algorithms that solved well-defined problems. The next iteration of these methods is either an incremental improvement of existing code, a new methodological invention, or an application to a new data set. These improvements do not compound to make AI tools better generalists, but instead contribute to the diversity of the existing tools.

One approach that does not suffer from these shortcomings is neuroevolution (Floreano, Dürr, and Mattiussi 2008). Currently, the field of neuroevolution is in its infancy, but finding new and creative solutions to otherwise unsolved problems, such as controlling robots and driving cars, is a popular area of focus (Lehman et al. 2020). At the same time, memory formation (Marstaller, Hintze, and Adami 2013), information integration in the brain (Tononi 2004), and how systems evolve the ability to learn (Sheneman, Schossau, and Hintze 2019) are also being researched, as they are building blocks of general purpose intelligence. While it is not clear how thinking machines will ultimately emerge, they are on the horizon. The dualism of a quantitative system that can be subjective and understand the qualitative nature of existence makes it a strange artifact that cannot be ignored.

References

Campbell, Murray, A Joseph Hoane Jr, and Feng-hsiung Hsu. 2002. "Deep Blue." Artificial Intelligence 134 (1–2): 57–83.
El-Boustani, Sami, Jacque P K Ip, Vincent Breton-Provencher, Graham W Knott, Hiroyuki Okuno, Haruhiko Bito, and Mriganka Sur. 2018. "Locally Coordinated Synaptic Plasticity of Visual Cortex Neurons in Vivo." Science 360 (6395): 1349–54.
Ferrucci, David, Anthony Levas, Sugato Bagchi, David Gondek, and Erik T Mueller. 2013. "Watson: Beyond Jeopardy!" Artificial Intelligence 199: 93–105.
Fitzpatrick, Kathleen. 2012. "The Humanities, Done Digitally." In Debates in the Digital Humanities, edited by Matthew K. Gold, 12–15. Minneapolis: University of Minnesota Press.
Floreano, Dario, Peter Dürr, and Claudio Mattiussi. 2008. "Neuroevolution: From Architectures to Learning." Evolutionary Intelligence 1 (1): 47–62.
French, Robert M. 1999. "Catastrophic Forgetting in Connectionist Networks." Trends in Cognitive Sciences 3 (4): 128–35.
Gibney, Elizabeth. 2016. "Google AI Algorithm Masters Ancient Game of Go." Nature News 529 (7587): 445.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. "Generative Adversarial Nets." In Advances in Neural Information Processing Systems 27 (NIPS 2014), edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, 2672–80. N.p.: Neural Information Processing Systems Foundation.
Hanahan, Douglas, and Robert A Weinberg. 2011. "Hallmarks of Cancer: The Next Generation." Cell 144 (5): 646–74.
Kandel, Eric R, James H Schwartz, and Thomas M Jessell. 2000. Principles of Neural Science. 4th ed. New York: McGraw-Hill.
Kriegeskorte, Nikolaus, and Pamela K Douglas. 2018. "Cognitive Computational Neuroscience." Nature Neuroscience 21: 1148–60.
Lehman, Joel et al. 2020. "The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities." Artificial Life 26 (2): 274–306.
López, Gustavo, Luis Quesada, and Luis A Guerrero. 2017. "Alexa vs. Siri vs. Cortana vs. Google Assistant: A Comparison of Speech-Based Natural User Interfaces." In International Conference on Applied Human Factors and Ergonomics, edited by Isabel L. Nunes, 241–50. Cham: Springer.
Marstaller, Lars, Arend Hintze, and Christoph Adami. 2013. "The Evolution of Representation in Simple Cognitive Networks." Neural Computation 25 (8): 2079–2107.
Nawrocki, Robert A, Richard M Voyles, and Sean E Shaheen. 2016. "A Mini Review of Neuromorphic Architectures and Implementations." IEEE Transactions on Electron Devices 63 (10): 3819–29.
Nguyen, Anh, Jason Yosinski, and Jeff Clune. 2015. "Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 427–36. N.p.: IEEE.
Poo, Mu-ming et al. 2016. "What Is Memory? The Present State of the Engram." BMC Biology 14: 1–18.
Russell, Stuart J, and Peter Norvig. 2016. Artificial Intelligence: A Modern Approach. Malaysia: Pearson Education Limited.
Schacter, Daniel L, C-Y Peter Chiu, and Kevin N Ochsner. 1993. "Implicit Memory: A Selective Review." Annual Review of Neuroscience 16 (1): 159–82.
Sheneman, Leigh, Jory Schossau, and Arend Hintze. 2019. "The Evolution of Neuroplasticity and the Effect on Integrated Information." Entropy 21 (5): 1–15.
Tilman, David. 1996. "Biodiversity: Population versus Ecosystem Stability." Ecology 77 (2): 350–63.
Tononi, Giulio. 2004. "An Information Integration Theory of Consciousness." BMC Neuroscience 5: 1–22.
Weber, Bruce H, and David J Depew. 2003. Evolution and Learning: The Baldwin Effect Reconsidered. Cambridge, MA: MIT Press.
Yampolskiy, Roman V, and Joshua Fox. 2012. "Artificial General Intelligence and the Human Mental Model." In Singularity Hypotheses: A Scientific and Philosophical Assessment, edited by Ammon H. Eden, James H. Moor, Johnny H. Søraker, and Erik Steinhart, 129–45. Heidelberg: Springer.

Preface

This collection of essays is the unexpected culmination of a 2018–2020 grant from the Institute of Museum and Library Services to the Hesburgh Libraries at the University of Notre Dame.1 The plan called for a survey and a series of workshops hosted across the country to explore, originally, "the national need for library based topic modeling tools in support of cross-disciplinary discovery systems." As the project developed, however, it became apparent that the scope of research should expand beyond topic modeling and that the scope of output might expand beyond a white paper. The end of the 2010s, we found, was swelling with library-centered investigations of broader machine learning applications across the disciplines, and our workshops demonstrated such a compelling mixture of perspectives on this development that we felt an edited collection of essays from our participants would be an essential witness to the moment in history. With remaining grant funds, we hosted one last workshop at Notre Dame to kick-start writing.

1 LG-72-18-0221-18: "Investigating the National Need for Library Based Topic Modeling Discovery Systems." See https://www.imls.gov/grants/awarded/lg-72-18-0221-18.

The resulting essays cover a wide ground. Some present a practical, "how-to" approach to the machine learning process for those who wish to explore it at their own institutions.
Others present individual projects, examining not just technical components or research findings, but also the social, financial, and political factors involved in working across departments (and in some cases, across the town/gown divide). Others still take a larger panoramic view of the ethics and opportunities of integrating machine learning with cross-disciplinary higher education, veering between optimistic and wary viewpoints.

The multi-disciplinarity of the essayists and the diversity of their research give each chapter a sui generis flavor, though several shared concerns thread through the collection. Most significantly, the authors suggest that while the technical aspects of machine learning are a challenge, especially when working with collaborators from different backgrounds, many of their key concerns are actually about the ethical and social dimensions of the work. In this sense, the collection is very much of the moment. Two large projects on machine learning, cross-disciplinarity, and libraries ran concurrently with our grant — Cordell 2020 and Padilla 2019, which were commissioned by major players in the field, the Library of Congress and OCLC, respectively — and both took pains to foreground the wider potential effects of machine learning. As Ryan Cordell puts it, "current cultural attention to ML may make it seem necessary for libraries to implement ML quickly. However, it is more important for libraries to implement ML through their existing commitments to responsibility and care" (1).

The voices represented here exhibit a thorough commitment to Cordell's call for responsibility and care, and they are only a subset of the larger chorus that sounded at the workshops. We editors therefore encourage readers interested in this bigger picture to examine the meta-themes and detailed information that emerged in the course of the workshops and the original survey through the grant's final report.2 All of these pieces together capture a fascinating snapshot of an interdisciplinary field in motion.

2 See https://doi.org/10.7274/r0-320z-kn58.

We should note that the working methods of the collection's editorial team were an attempt to extend the grant's spirit of collaboration. Through several stages of development, content editors Don Brower, Mark Dehmlow, Eric Morgan, Alex Papson, and John Wang reviewed assigned essays and provided commentary before notifying general editor Daniel Johnson for prose editing, who in turn shared the updated manuscripts with the authors so the cycle could begin again. The submissions, written variously in Microsoft Word or Google Docs format, were ushered through these stages of life in team Google Drive folders and tracked by spreadsheet before eventual conversion by Don Brower into a series of TeX files, provisioned in a version-controlled GitHub repository, for more fine-tuned final editing. Like working with diverse teams in the pursuit of machine learning, editing essays together in this fashion, for publication by the Hesburgh Libraries, was a novel way of collaborating, and we editors thought candor about this book-making process might prove insightful to readers.
Attending to the social dimensions of the work ourselves, we must note that this collection would not have been possible without the generous support of many people and organizations. We would like to thank the IMLS for providing essential funding support for the grant and the Hesburgh Libraries' Edward H. Arnold University Librarian, Diane Parr Walker, for her organizational support. Thank you to the members of the Notre Dame IMLS grant team who, at its various stages, provided critical support in managing logistics, conducting research, facilitating workshops, and analyzing results. These individuals include John Wang (grant project director), Don Brower, Mark Dehmlow, Nastia Guimaraes, Melissa Harden, Helen Hockx-Yu, Daniel Johnson, Christina Leblang, Rebecca Leneway, Laurie McGowan, Eric Lease Morgan, and Alex Papson. The University of Notre Dame Office of General Counsel provided key publication advice, and the University of Notre Dame Office of Research provided critical support in administering the grant. Again, many thanks.

We would also like to thank the co-signatories of the IMLS grant application for supporting the project's goals: Mark Graves (then Visiting Research Assistant Professor, Center for Theology, Science, and Human Flourishing, University of Notre Dame), Pamela Graham (Director of Global Studies and Director of the Center for Human Rights Documentation and Research, Columbia University Libraries), and Ed Fox (Professor of Computer Science and Director of the Digital Library Research Laboratory, Virginia Polytechnic Institute and State University). And of course, thanks to the 95 participants in our 2019 IMLS grant workshops (too many to enumerate here) and to the essay authors for sharing their expertise and perspectives in growing our collective knowledge of machine learning and its use in research, scholarship, and cultural heritage organizations. Your active engagement continues to shape the field, and we look forward to your next achievements.

References

Cordell, Ryan. 2020. "Machine Learning + Libraries: A Report on the State of the Field." Commissioned by LC Labs, Library of Congress. https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf.
Padilla, Thomas. 2019. "Responsible Operations: Data Science, Machine Learning, and AI in Libraries." Dublin, Ohio: OCLC Research. https://www.oclc.org/research/publications/2019/oclcresearch-responsible-operations-data-science-machine-learning-ai.html.

Chapter 2
Generative Machine Learning
Charlie Harper, PhD
Case Western Reserve University

Introduction

Generative machine learning is a hot topic. With the 2020 election approaching, Facebook and Reddit have each issued their own bans on the category of machine-generated or -altered content that is commonly termed "deep fakes" (Cohen 2020; Romm, Harwell, and Stanley-Becker 2020).
Calls for regulation of the broader, and very nebulous, category of fake news are now part of US political debates, too. Although well known and often discussed in newspapers and on TV because of their dystopian implications, deep fakes are just one application of generative machine learning. There is a remarkable need for others, especially humanists and social scientists, to become involved in discussions about the future uses of this technology, but this first requires a broader awareness of generative machine learning's functioning and power. Many articles on the subject of generative machine learning exist in specialized, highly technical literature, but there is little that covers this topic for a broader audience while retaining important high-level information on how the technology actually operates.

This chapter presents an overview of generative machine learning with particular focus on generative adversarial networks (GANs). GANs are largely responsible for the revolution in machine-generated content that has occurred in the past few years, and their impact on our future extends well beyond that of producing purposefully deceptive fakes. After covering generative learning and the working of GANs, this chapter touches on some interesting and significant applications of GANs that are not likely to be familiar to the reader. The hope is that this will serve as the start of a larger discussion on generative learning outside of the confines of technical literature and sensational news stories.

What is Generative Machine Learning?

Machine learning, which is a subdomain of artificial intelligence, is roughly divided into three paradigms that rely on different methods of learning: supervised, unsupervised, and reinforcement learning (Murphy 2012, 1–15; Burkov 2019, 1–8). These differ in the types of datasets used for learning and the desired applications. Supervised and unsupervised machine learning use labeled and unlabeled datasets, respectively, to assign unseen data to human-generated labels or statistically constructed groups. Both supervised and unsupervised approaches are commonly used for classification and regression problems, where we wish to predict categorical or quantitative information about new data. There is also a combined form of these two paradigms, called semi-supervised learning, which mixes labeled and unlabeled data. Reinforcement learning, on the other hand, is a paradigm in which an agent learns how to function in a specific environment by being rewarded or penalized for its behavior. For example, reinforcement learning can be used to train a robot to successfully navigate around obstacles in a physical space.

Generative machine learning, rather than being a specific learning paradigm, encompasses an ever-growing variety of techniques that are capable of generating new data based on learned patterns. The process of learning these patterns can engage both supervised and unsupervised learning. A simple, statistical example of one type of generative learning is a Markov chain.
From a given set of data, a Markov chain calculates and stores the probabilities of a following state based on a current state. For example, a Markov chain can be trained on a list of English words to store the probabilities of any one letter occurring after another letter. These probabilities chain together to represent the chance of moving from the current letter state (e.g. the letter q) to a succeeding letter state (e.g. the letter u) based on the data from which it has learned. If another Markov chain were trained on Italian words instead of English, the probabilities would change, and for this reason, Markov chains can capture important high-level information about datasets (Figure 2.1). They can then be sampled to generate new data by starting from a random state and probabilistically moving to succeeding states.

Figure 2.1: The three most-common letters following "F" in two Markov chains trained on an English and Italian dictionary. Three examples of generated words are given for each Markov chain that show how the Markov chain captures high-level information about letter arrangements in the different languages.

In Figure 2.1, you can see the probability that the letter "F" transitions to the three most common succeeding letters in English and Italian. A few examples of "words" generated by two Markov chains trained on an English and an Italian dictionary are also given. The example words are generated by sampling the probability distributions of the Markov chain, letter by letter, so that the generated words are statistically random, but guided by the learned probability of one letter following another. The different probabilities of letter combinations in English and Italian result in distinctly different generated words.

This exemplifies how a generative model can capture specific aspects of a dataset to create new data. The letter combinations are nonsense, but they still reflect the high-level structure of Italian and English words in the way letters join together, such as the different utilization of vowels in each language. These basic Markov chains demonstrate the essence of generative learning: a generative approach learns a distribution over a dataset, or in other words, a mathematical representation of a dataset, which can then be sampled to generate new data that exists within the learned structure of that dataset. How convincing the generated data appears to a human observer depends on the type and tuning of the machine learning model chosen and the data upon which the model has been trained.
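The mechanics are simple enough to fit in a few lines of code. Below is a minimal sketch (not from the chapter) of a first-order, letter-level Markov chain in Python; the tiny word list and names like generate_word are illustrative stand-ins for a real dictionary.

```python
import random
from collections import defaultdict

# Build a first-order, letter-level Markov chain from a word list.
# "^" marks the start of a word and "$" the end, so the chain also
# learns which letters begin and end words.
words = ["learning", "library", "language", "letter", "model"]  # stand-in corpus

transitions = defaultdict(list)
for word in words:
    padded = "^" + word + "$"
    for current, following in zip(padded, padded[1:]):
        transitions[current].append(following)   # record each observed successor

def generate_word():
    """Sample the chain letter by letter until the end state is reached."""
    letter, out = "^", []
    while True:
        letter = random.choice(transitions[letter])  # probabilistic next state
        if letter == "$":
            return "".join(out)
        out.append(letter)

print([generate_word() for _ in range(5)])
```

Training the same code on an Italian word list would change nothing but the stored transition counts, which is exactly the difference between the two chains in Figure 2.1. A second-order chain would key its transitions on the two preceding letters instead of one.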
So, what happens if we build a comparable Markov chain with image data1 instead of words, and then sample, pixel by pixel, from it to generate new images? The results are just noise, and the generated images reveal no hint of a wine bottle or circle to the human eye (Figure 2.2).

Figure 2.2: Images generated with a simple statistical model appear as noise, as the model is insufficient to capture the structure of the real data (Markov chains trained using wine bottles and circles from Google's QuickDraw dataset).

The very simple generative statistical model we have chosen to use is incapable of capturing the distribution of the underlying images well enough to produce realistic new images. Other types of generative statistical models, like Naive Bayes or a higher-order Markov chain,2 could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this.3 Image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in English and Italian. Capturing the intricate and often-inscrutable distributions that underlie real-world media, like full-sized photographs of people, is where deep (i.e. using neural networks) generative learning shines and where generative adversarial networks have revolutionized machine-generated content.

1 In many examples, I have used the Google QuickDraw dataset to highlight features of generative machine learning. The dataset is freely available (https://github.com/googlecreativelab/quickdraw-dataset) and licensed under CC BY 4.0.
2 The order of a Markov chain reflects how many preceding states are taken into account. For example, a 2nd-order Markov chain would look at the preceding two letters to calculate the probability of a succeeding letter. Rudimentary autocomplete is a good example of Markov chains in application.
3 This is not to imply that these models do not have immense practical applications in other areas of machine learning.

Generative Adversarial Networks

The problem of capturing the complexity of an image so that a computer can generate new images leads directly to the emergence of generative adversarial networks, which are a neural-network-based model architecture within the broader sphere of generative machine learning. Although prior deep learning approaches to generating data, particularly variational autoencoders, already existed, it was a breakthrough in 2014 that changed the fabric and power of generative machine learning. Like every big development, it has an origin story that has moved into legend with its many retellings. According to the handed-down tale (Giles 2018), in 2014 doctoral student Ian Goodfellow was at a bar with friends when the topic of generating photos arose. His friends were working out a method to create realistic images by using complex statistical analyses of existing images. Goodfellow countered that it would not work; there were too many variables at play within such data. Instead, he put forth the idea of pairing two neural networks against each other in a type of zero-sum game where the goal was to generate believable fake images. According to the story, he developed this idea into working code that night, and his paired neural network architecture produced results the very first time. This was the birth of generative adversarial networks, or GANs. Goodfellow's work was quickly disseminated in what is one of the most influential papers in the recent history of machine learning (Goodfellow et al. 2014).

GANs have progressed in almost miraculous ways since 2014, but the crux of their architecture remains the coupling of two neural networks. Each neural network has a specific function in the pairing. The first network, called the generator, is tasked with generating fake examples of some dataset. To produce this data, it randomly samples from an n-dimensional latent space, often labeled Z. In simple terms, the generator takes random noise (really a random list of n numbers, where n is the dimensionality of the latent space) as its input and outputs its attempt at a fake piece of data, such as an image, clip of audio, or row of tabular information. The second neural network, called the discriminator, takes both fake and real data as input. Its role is to correctly discriminate between fake and real examples.4 The generator and discriminator networks are then coupled together as adversaries, hence "adversarial" in the name.
The output from the generator flows into the discriminator, and information on the success or failure of the discriminator to identify fakes (i.e. the discriminator's loss) flows back through the network so that the generator and discriminator each knows how well it is performing compared to the other. All of this happens automatically, without any need for human supervision. When the generator finds it is doing poorly, it learns to produce better examples by updating its weights and biases through traditional backpropagation (see especially Langr and Bok 2019, 3–16 for a more detailed summary of this). As backpropagation updates the generator network's weights and biases, the generator inherently begins to map regions of the randomly sampled Z space to characteristics found in the real dataset. Contrarily, as the discriminator finds that it is not identifying better fakes accurately, it learns to separate these out in new ways.

Figure 2.3: At the heart of a GAN are two neural networks, the generator and the discriminator. As the generator learns to produce fake data, the discriminator learns to separate it out. The pairing of the two in an adversarial structure forces each to improve at its given task.

At first, the generator outputs random data and the discriminator easily catches these fakes (Figure 2.4). As the results of the discriminator feed back into the generator, however, the generator learns to trick its foe by creating more convincing fakes. The discriminator consecutively learns to better separate out these more convincing fakes. Turn after turn, the two networks drive one another to become better at their specialized tasks, and the generated data becomes increasingly like the real data.5 At the end of training, ideally, it will not be possible to distinguish between real and fake (Figure 2.5).

Figure 2.4: A GAN being trained on wine bottle sketches from Google's QuickDraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. Moving from left to right, the generator begins by outputting random noise and progressively generates better sketches as it tries to trick the discriminator.

Figure 2.5: The fully trained generator from Figure 2.4 produces examples that are not readily distinguishable from real-world data. The top row of sketches were produced by the GAN and the bottom row were drawn by humans.

4 Its function is exactly that of any other binary classifier found in machine learning.
5 See https://poloclub.github.io/ganlab/ (accessed Jan 17, 2020) (Kahng et al. 2019).
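For readers who want to see the adversarial loop itself, here is a minimal sketch, assuming PyTorch is available; the tiny fully connected networks and the two-dimensional "real" data (points on a circle standing in for images) are illustrative assumptions, not the architecture of any published GAN.

```python
import torch
import torch.nn as nn

# Generator: latent vector z -> fake sample; discriminator: sample -> P(real).
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Real batch: points on the unit circle stand in for real images.
    theta = torch.rand(64, 1) * 6.2832
    real = torch.cat([theta.cos(), theta.sin()], dim=1)
    z = torch.randn(64, 8)                  # random samples from the latent space Z
    fake = G(z)

    # Train the discriminator to call real data 1 and generated data 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: its loss is the discriminator catching its fakes,
    # so it improves by making fakes the discriminator labels as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The two optimizers embody the zero-sum game described above: d_loss falls as the discriminator improves at catching fakes, and g_loss falls as the generator improves at slipping them past it.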
In the original publication, the first GANs were trained on sets of small images, like the Toronto Face Dataset, which contains 32 × 32 pixel grayscale photos of faces and facial expressions (Goodfellow et al. 2014). Although the generator's results were convincing when compared to the originals, the fake images were still small, colorless, and pixelated. Since then, an explosion of research into GANs and increased computational power has led to strikingly realistic images. The most recent milestone was reached in 2019 by researchers with NVIDIA, who built a GAN that generates high-quality photo-realistic images of people (Karras, Laine, and Aila 2019). When contrasted with the results of 2014 (Figure 2.6), the stunning progression of GANs is self-evident, and it is difficult to believe that the person on the right does not exist.

Figure 2.6: An image of a generated face from the original GAN publication (left) and the 2019 milestone (right) shows how the ability of GANs to produce photo-realistic images has evolved since 2014.

Some Applications of Generative Adversarial Networks

Over the past five years, many papers on implementations of GANs have been released by researchers (Alqahtani, Kavakli-Thorne, and Kumar 2019; Wang, She, and Ward 2019). The list of applications is extensive and ever growing, but it is worth pointing out some of the major examples as of 2019 and why they are significant. These examples highlight the vast power of GANs and underscore the importance of understanding and carefully scrutinizing this type of machine learning.

Data Augmentation

One major problem in machine learning has always been the lack of labeled datasets, which are required by supervised learning approaches. Labeling data is time consuming and expensive. Without good labeled data, trained models are limited in their power to learn and in their ability to generalize to real-world problems. Services such as Amazon's Mechanical Turk have attempted to crowdsource the tedious process of manually assigning labels to data, but labeling has remained a bottleneck in machine learning. GANs are helping to alleviate this bottleneck by generating new labeled data that is indistinguishable from the real data. This process can grow a small labeled dataset into one that is larger and more useful for training purposes. In the area of medical imaging and diagnostics, this may have profound effects (Yi, Walia, and Babyn 2019). For example, GANs can produce photorealistic images of skin lesions that expert dermatologists are able to separate from real images only slightly over 50% of the time (Baur, Albarqouni, and Navab 2018), and they can synthesize high-resolution mammograms for training better cancer detection algorithms (Korkinof et al. 2018).

A corollary effect of these developments in medical imaging is the potential to publicly release large medical datasets and thereby expand researchers' access to important data. Whereas the dissemination of traditional medical images is constrained by strict health privacy laws, generated images may not be governed by such rules. I qualify this statement with "may," because any restrictions or ethical guidelines for the use of medical data that is generated from real patient data require extensive discussion and legal reviews that have not yet happened. Under certain conditions, it may also be possible to infer original data from a GAN (Mukherjee et al. 2019). How institutional review boards, professional medical organizations, and courts weigh in on this topic will be seen in the coming years.

In addition to generating entirely new data, a GAN can augment datasets by expanding their coverage to new domains. For example, autonomous vehicles must cope with an array of road and weather conditions that are unpredictable. Training a model to identify pedestrians, street signs, road lines, and so on with images taken on a sunny day will not translate well to variable real-world conditions. Using one dataset, in a process known as style transfer, GANs can translate one image to other domains (Figure 2.7). This can include creating night road scenes from day scenes (Romera et al. 2019) and producing images of street signs under varying lighting conditions (Chowdhury et al. 2019). This added data permits models to account for greater variability under operating conditions without the high cost of photographing all possible conditions and manually labeling them. Beyond medicine and autonomous vehicles, generative data augmentation will progressively impact other imaging-heavy fields (Shorten and Khoshgoftaar 2019) like remote sensing (L. Ma et al. 2019; D. Ma, Tang, and Zhao 2019).

Figure 2.7: The images on the left are originals and the images on the right have been modified by a GAN with the ability to translate images between the domains of "dirty lens" and "clean lens" on a vehicle (from Uřičář et al. 2019, fig. 11).
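In code, the augmentation step itself is little more than sampling a trained generator and attaching the appropriate label. The sketch below, again assuming PyTorch, reuses the toy generator shape from the earlier GAN sketch; all names and sizes are illustrative, and a real pipeline would of course train G before sampling it.

```python
import torch
import torch.nn as nn

# Sketch of GAN-based data augmentation. `G` stands in for a generator
# already trained on a small labeled set of one class (untrained here,
# so the samples are only structurally illustrative).
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

real_x = torch.randn(40, 2)                        # small labeled dataset (stand-in)
real_y = torch.zeros(40, dtype=torch.long)         # all examples share one label

with torch.no_grad():
    synthetic_x = G(torch.randn(500, 8))           # mint 500 new samples by sampling Z
synthetic_y = torch.zeros(500, dtype=torch.long)   # each inherits the trained class label

# The augmented set is simply the concatenation; downstream classifiers train on it.
aug_x = torch.cat([real_x, synthetic_x])
aug_y = torch.cat([real_y, synthetic_y])
print(aug_x.shape)  # torch.Size([540, 2])
```

The synthetic rows cost nothing to label because they inherit the label of the distribution the generator was trained on, which is precisely what makes this attractive where labeling is expensive.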
2019) and producing images of street signs under varying lighting conditions (Chowdhury et al. 2019). This added data permits models to account for greater variability under operating conditions without the high cost of photographing all possible conditions and manually labeling them. Beyond medicine and autonomous vehicles, generative data augmentation will progressively impact other imaging-heavy fields (Shorten and Khoshgoftaar 2019) like remote sensing (L. Ma et al. 2019; D. Ma, Tang, and Zhao 2019).

Figure 2.7: The images on the left are originals and the images on the right have been modified by a GAN with the ability to translate images between the domains of "dirty lens" and "clean lens" on a vehicle (from Uřičář et al. 2019, fig. 11).

Creativity and Design

The question of whether machines can possess creativity or artistic ability is philosophically difficult to answer (Mazzone and Elgammal 2019; McCormack, Gifford, and Hutchings 2019). Still, in 2018, Christie's auctioned off its first piece of GAN art for $432,500 (Cohn 2018), and GANs are increasingly assisting humans in the creative process for all forms of media. Simple models, like CycleGAN, are already able to stylize images in the manner of Van Gogh or Monet (Zhu et al. 2017), and more varied stylistic GANs are emerging.

GauGAN, a beta tool released by NVIDIA, is a great example of GAN-assisted creativity in action. GauGAN allows you to rough out a scene using a paintbrush for different categories, like clouds, flowers, and houses (Figure 2.8). It then converts this into a photo reflecting what you have drawn. The online demo6 remains limited, but the underlying model is powerful and has massive potential (Park et al. 2019). Recently, Martin Scorsese's The Irishman made headlines for its digital de-aging of Robert De Niro and other actors. Although this process did not involve GANs, it is highly likely that in the future, GANs will become a major part of cinematic post-production (Giardina 2019) through assistive tools like GauGAN.

Figure 2.8: This example of GauGAN in action shows a sketched out scene on the left turned into a photo-realistic landscape on the right. *If any representatives of Christie's are reading, the author would be happy to auction this piece.

6 See http://nvidia-research-mingyuliu.com/gaugan/ (last accessed January 12, 2019).

Fashion and product design are also being impacted by the use of GANs. Text-to-image synthesis, which can take free text or categories as input to generate a photo-realistic image, has promising potential (Rostamzadeh et al. 2018). By accepting text as input, GANs can let designers rapidly generate new ideas or visualize concepts for products at the start of the design process. For example, a recently published GAN for clothing design accepts basic text and outputs modeled images of the described clothing (Banerjee et al. 2019; Figure 2.9). In an example of automotive design, a single sketch can be used to generate realistic photos of multiple perspectives of a vehicle (Radhakrishnan et al. 2018). The many fields that rely on quick sketching or visual prototyping, such as architecture or web design, are likely to be influenced by the use of GAN-assisted design software in coming years.

In a similar vein, GANs have an upcoming role in the creation of new medicines, chemicals, and materials (Zhavoronkov 2018).
By training a GAN on existing chemical and material structures, research is showing that novel chemicals and materials can be designed with particular properties (Gómez-Bombarelli et al. 2018; Sanchez-Lengeling and Aspuru-Guzik 2018). This is facilitated by how information is encoded in the GAN's latent space (the n-dimensional space from which the generator samples; see "Z" in Figure 2.3). As the generator learns to produce realistic examples, certain aspects of the original data become encoded in regions of the latent space. By moving through this latent space or sampling particular areas, new data with desired properties can then be generated. This can be seen by periodically sampling the latent space and generating an image as one moves between two generated images (Figure 2.10). In the same way, by moving in certain directions or sampling from particular areas of the latent space, new chemicals or medicines with specific properties can be generated.7

7 This is also relevant to the facial manipulation discussed below.

Figure 2.9: Text-to-image synthesis can generate images of new fashions based on a description. From the input "maroon round neck mini print a-line bodycon short sleeves" a GAN has produced these three photos (from Banerjee et al. 2019, fig. 11).

Figure 2.10: Two examples of linearly-spaced mappings across the latent space between generated images A and B. Note that by taking one image and moving closer to another, you can alter properties in the image, such as adding steam, removing a cup handle, or changing the angle of view. These characteristics of the dataset are learned by the generator during training and encoded in the latent space. (GAN built on coffee cup sketches from Google's QuickDraw dataset)

Impersonation and the Invisible

I have reserved some of the more dystopian, and likely more widely known, applications of GANs for last. This is the area where GANs' ability to generate convincing media is challenging our perceptions of reality and raising extreme ethical questions (Harper 2018). Deep fakes are, of course, the most well known of these. This can include the creation of fake images, videos, and audio of an individual, or the modification of any media to alter what someone appears to be doing or saying. In images and video in particular, GANs make it possible to swap the identity of an individual and manipulate facial attributes or expressions (Tolosana et al. 2020). A large portion of the technical literature is, in fact, now devoted to detecting faked and altered media (see Tolosana et al. 2020, Tables IV and V). It remains to be seen how successful any approaches will be. From a theoretical perspective, anything that can detect fakes can also be used to train a better generator, since the training process of a GAN is founded on outsmarting a detector (i.e., the discriminator network).

Figure 2.11: GANs are providing a method to reconstruct hidden images of people and objects. Images 1–3 show reconstructions as compared to an input occluded image (OCC) and a ground truth image (GT) (from Fulgeri et al. 2019, fig. 6).

One shocking extension of deep fakes that has emerged is transcript-to-video creation, which generates a video of someone speaking from a written text.
If you want to see this at work, you can view clips of Nixon giving the speech written in the case of an Apollo 11 disaster.8 As of now, deep fakes like this remain choppy and are largely limited to politicians and celebrities because they require large datasets and additional manipulation, but this limitation is not likely to last. If the evolution of GANs for images is any predictor, the entire emerging field of video generation is likely to progress rapidly. One can imagine the incorporation of text-to-image and deep fakes enabling someone to produce an image of, say, "politician X doing action Y," simply by typing it.

8 See http://news.mit.edu/2019/mit-apollo-deepfake-art-installation-aims-to-empower-more-discerning-public-1125.

An application of GANs that parallels deep fakes, and is likely more menacing in the short term, is the infilling or adding of hidden, invisible, or predicted information to existing media. One nascent use is video prediction from an image. For example, in 2017, researchers were able to build a GAN that produced 1-second video clips from a single starting frame (Vondrick and Torralba 2017). This may not seem impressive, but video is notoriously difficult to work with because the content of a succeeding frame can vary so drastically from the preceding frame (for other examples of ongoing research into video prediction, see Cai et al. 2018; Wen et al. 2019). For still images, occluded object reconstruction, in which a GAN is trained to produce a full image of a person or object that is partially hidden behind something else, is progressing (Fulgeri et al. 2019; see Figure 2.11). For some applications, like autonomous driving, this could save lives, as it would help to pick out when a partially-occluded pedestrian is about to emerge from behind a parked car. On the other hand, for surveillance technology, it can further undermine anonymity. Indeed, such GANs are already being explicitly studied for surveillance purposes (Fabbri, Calderara, and Cucchiara 2017). Lastly, I would be remiss if I did not mention that researchers have designed a GAN that can generate an image of what you are thinking about, using EEG signals (Tirupattur et al. 2018).

GANs and the Future

The tension between the creation of more realistic generated data and the technology to detect maliciously generated information is only beginning. The machine learning and data science platform Kaggle is replete with publicly-accessible Python code for building GANs and detecting fake data. Money, too, is freely flowing in this domain of research; the 2019 Deepfake Detection Challenge sponsored by Facebook, AWS, and Microsoft boasted one million dollars in prizes (https://www.kaggle.com/c/deepfake-detection-challenge, accessed April 20, 2020). Meanwhile, industry leaders, such as NVIDIA, continue to fund the training of better and more convincing GANs. The structure of a GAN, with its generator and detector paired adversarially, is now being mirrored in society as groups of researchers competitively work to create and discern generated data.
The path that this machine-learning arms race will take is unpredictable, and, therefore, it is all the more important to scrutinize it and make it comprehensible to the broader publics whom it will affect.

References

Alqahtani, Hamed, Manolya Kavakli-Thorne, and Gulshan Kumar. 2019. "Applications of Generative Adversarial Networks (GANs): An Updated Review." Archives of Computational Methods in Engineering, December. https://doi.org/10.1007/s11831-019-09388-y.

Banerjee, Rajdeep H., Anoop Rajagopal, Nilpa Jha, Arun Patro, and Aruna Rajan. 2019. "Let AI Clothe You: Diversified Fashion Generation." In Computer Vision—ACCV 2018 Workshops, edited by Gustavo Carneiro and Shaodi You, 75–87. Cham: Springer International Publishing.

Baur, Christoph, Shadi Albarqouni, and Nassir Navab. 2018. "Generating Highly Realistic Images of Skin Lesions with GANs." September. https://arxiv.org/abs/1809.01410.

Burkov, Andriy. 2019. The Hundred-Page Machine Learning Book. Self-published, Amazon.

Cai, Haoye, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. 2018. "Deep Video Generation, Prediction and Completion of Human Action Sequences." In Computer Vision—ECCV 2018, edited by Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, 374–90. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-01216-8_23.

Chowdhury, Sohini Roy et al. 2019. "Automated Augmentation with Reinforcement Learning and GANs for Robust Identification of Traffic Signs Using Front Camera Images." In 53rd Asilomar Conference on Signals, Systems & Computers, 79–83. N.p.: IEEE. https://doi.org/10.1109/IEEECONF44664.2019.9049005.

Cohen, Libby. 2020. "Reddit Bans Deepfakes with 'Malicious' Intent." The Daily Dot, January 10, 2020. https://www.dailydot.com/layer8/reddit-deepfakes-ban/.

Cohn, Gabe. 2018. "AI Art at Christie's Sells for $432,500." The New York Times, October 25, 2018. https://www.nytimes.com/2018/10/25/arts/design/ai-art-sold-christies.html.

Fabbri, Matteo, Simone Calderara, and Rita Cucchiara. 2017. "Generative Adversarial Models for People Attribute Recognition in Surveillance." In 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). N.p.: IEEE. https://doi.org/10.1109/AVSS.2017.8078521.

Fulgeri, Federico, Matteo Fabbri, Stefano Alletto, Simone Calderara, and Rita Cucchiara. 2019. "Can Adversarial Networks Hallucinate Occluded People With a Plausible Aspect?" Computer Vision and Image Understanding 182 (May): 71–80.

Giardina, Carolyn. 2019. "Will Smith, Robert De Niro and the Rise of the All-Digital Actor." The Hollywood Reporter, August 10, 2019. https://www.hollywoodreporter.com/behind-screen/rise-all-digital-actor-1229783.

Giles, Martin. 2018. "The GANfather: The Man Who's Given Machines the Gift of Imagination." MIT Technology Review 121, no. 2 (March/April): 48–53.

Gómez-Bombarelli, Rafael, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. 2018.
"Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules." ACS Central Science 4, no. 2 (February): 268–76. https://doi.org/10.1021/acscentsci.7b00572.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. "Generative Adversarial Nets." In Advances in Neural Information Processing Systems, edited by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, 27:2672–2680. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Harper, Charlie. 2018. "Machine Learning and the Library or: How I Learned to Stop Worrying and Love My Robot Overlords." Code4Lib Journal, no. 41 (August). https://journal.code4lib.org/articles/13671.

Kahng, Minsuk, Nikhil Thorat, Duen Horng Polo Chau, Fernanda B. Viegas, and Martin Wattenberg. 2019. "GAN Lab: Understanding Complex Deep Generative Models Using Interactive Visual Experimentation." IEEE Transactions on Visualization and Computer Graphics 25, no. 1 (January): 310–320. https://doi.org/10.1109/tvcg.2018.2864500.

Karras, Tero, Samuli Laine, and Timo Aila. 2019. "A Style-Based Generator Architecture for Generative Adversarial Networks." In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4396–4405. N.p.: IEEE. https://doi.org/10.1109/CVPR.2019.00453.

Korkinof, Dimitrios, Tobias Rijken, Michael O'Neill, Joseph Yearsley, Hugh Harvey, and Ben Glocker. 2018. "High-Resolution Mammogram Synthesis Using Progressive Generative Adversarial Networks." Preprint, submitted July 9, 2018. https://arxiv.org/abs/1807.03401.

Langr, Jakub, and Vladimir Bok. 2019. GANs in Action: Deep Learning with Generative Adversarial Networks. Shelter Island, NY: Manning Publications.

Ma, Dongao, Ping Tang, and Lijun Zhao. 2019. "SiftingGAN: Generating and Sifting Labeled Samples to Improve the Remote Sensing Image Scene Classification Baseline In Vitro." IEEE Geoscience and Remote Sensing Letters 16, no. 7 (July): 1046–1050. https://doi.org/10.1109/lgrs.2018.2890413.

Ma, Lei, Yu Liu, Xueliang Zhang, Yuanxin Ye, Gaofei Yin, and Brian Alan Johnson. 2019. "Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review." ISPRS Journal of Photogrammetry and Remote Sensing 152 (June): 166–77. https://doi.org/10.1016/j.isprsjprs.2019.04.015.

Mazzone, Marian, and Ahmed Elgammal.
2019. "Art, Creativity, and the Potential of Artificial Intelligence." Arts 8, no. 1 (March): 1–9. https://doi.org/10.3390/arts8010026.

McCormack, Jon, Toby Gifford, and Patrick Hutchings. 2019. "Autonomy, Authenticity, Authorship and Intention in Computer Generated Art." In Computational Intelligence in Music, Sound, Art and Design, edited by Anikó Ekárt, Antonios Liapis, and María Luz Castro Pena, 35–50. Cham: Springer International Publishing.

Mukherjee, Sumit, Yixi Xu, Anusua Trivedi, and Juan Lavista Ferres. 2019. "Protecting GANs against Privacy Attacks by Preventing Overfitting." Preprint, submitted December 31, 2019. https://arxiv.org/abs/2001.00071v1.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. Cambridge, Mass.: MIT Press.

Park, Taesung, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. "Semantic Image Synthesis with Spatially-Adaptive Normalization." In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2332–2341. N.p.: IEEE. https://doi.org/10.1109/CVPR.2019.00244.

Radhakrishnan, Sreedhar, Varun Bharadwaj, Varun Manjunath, and Ramamoorthy Srinath. 2018. "Creative Intelligence – Automating Car Design Studio with Generative Adversarial Networks (GAN)." In Machine Learning and Knowledge Extraction, edited by Andreas Holzinger, Peter Kieseberg, A Min Tjoa, and Edgar Weippl, 160–75. Cham: Springer International Publishing.

Romera, Eduardo, Luis M. Bergasa, Kailun Yang, Jose M. Alvarez, and Rafael Barea. 2019. "Bridging the Day and Night Domain Gap for Semantic Segmentation." In 2019 IEEE Intelligent Vehicles Symposium (IV), 1312–18. N.p.: IEEE. https://doi.org/10.1109/IVS.2019.8813888.

Romm, Tony, Drew Harwell, and Isaac Stanley-Becker. 2020. "Facebook Bans Deepfakes, but New Policy May Not Cover Controversial Pelosi Video." The Washington Post, January 7, 2020. https://www.washingtonpost.com/technology/2020/01/06/facebook-ban-deepfakes-sources-say-new-policy-may-not-cover-controversial-pelosi-video/.

Rostamzadeh, Negar, Seyedarian Hosseini, Thomas Boquet, Wojciech Stokowiec, Ying Zhang, Christian Jauvin, and Chris Pal. 2018. "Fashion-Gen: The Generative Fashion Dataset and Challenge." Preprint, submitted June 21, 2018. https://arxiv.org/abs/1806.08317.

Sanchez-Lengeling, Benjamin, and Alán Aspuru-Guzik. 2018. "Inverse Molecular Design Using Machine Learning: Generative Models for Matter Engineering." Science 361, no. 6400 (July): 360–365. https://doi.org/10.1126/science.aat2663.

Shorten, Connor, and Taghi M. Khoshgoftaar. 2019. "A Survey on Image Data Augmentation for Deep Learning." Journal of Big Data 6 (60): 1–48. https://doi.org/10.1186/s40537-019-0197-0.
Tirupattur, Praveen, Yogesh Singh Rawat, Concetto Spampinato, and Mubarak Shah. 2018. "Thoughtviz: Visualizing Human Thoughts Using Generative Adversarial Network." In Proceedings of the 26th ACM International Conference on Multimedia, 950–958. New York: Association for Computing Machinery. https://doi.org/10.1145/3240508.3240641.

Tolosana, Ruben, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. "DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection." Preprint, submitted January 1, 2020. https://arxiv.org/abs/2001.00179.

Uřičář, Michal, Pavel Křížek, David Hurych, Ibrahim Sobh, Senthil Yogamani, and Patrick Denny. 2019. "Yes, We GAN: Applying Adversarial Techniques for Autonomous Driving." In IS&T International Symposium on Electronic Imaging, 1–16. Springfield, VA: Society for Imaging Science and Technology. https://doi.org/10.2352/ISSN.2470-1173.2019.15.AVM-048.

Vondrick, Carl, and Antonio Torralba. 2017. "Generating the Future with Adversarial Transformers." In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2992–3000. N.p.: IEEE. https://doi.org/10.1109/CVPR.2017.319.

Wang, Zhengwei, Qi She, and Tomas E. Ward. 2019. "Generative Adversarial Networks: A Survey and Taxonomy." Preprint, submitted June 4, 2019. https://arxiv.org/abs/1906.01529.

Wen, Shiping, Weiwei Liu, Yin Yang, Tingwen Huang, and Zhigang Zeng. 2019. "Generating Realistic Videos From Keyframes With Concatenated GANs." IEEE Transactions on Circuits and Systems for Video Technology 29 (8): 2337–48. https://doi.org/10.1109/TCSVT.2018.2867934.

Yi, Xin, Ekta Walia, and Paul Babyn. 2019. "Generative Adversarial Network in Medical Imaging: A Review." Medical Image Analysis 58 (December): 1–20. https://doi.org/10.1016/j.media.2019.101552.

Zhavoronkov, Alex. 2018. "Artificial Intelligence for Drug Discovery, Biomarker Development, and Generation of Novel Chemistry." Molecular Pharmaceutics 15, no. 10 (October): 4311–13. https://doi.org/10.1021/acs.molpharmaceut.8b00930.

Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks." In 2017 IEEE International Conference on Computer Vision (ICCV), 2242–2251. N.p.: IEEE. https://doi.org/10.1109/ICCV.2017.244.
Chapter 3
Humanities and Social Science Reading through Machine Learning

Marisa Plumb
San Jose State University

Introduction

The purposes of computational literary studies have evolved and diversified a great deal over the last half century. Within this dynamic and often contentious space, a set of fundamental questions deserve our collective attention: does the computation and digitization of language recast the ways we read, value, and receive words? In what ways can research and scholarship on literature become a more meaningful part of the future development of computer systems? As the theory and practice of computational literary studies evolve, their potential to play a direct role in revising historical narratives and framing new research questions poses cross-disciplinary implications.

It's worthwhile to anchor these questions in the origin stories that today's digital humanists tell, from the work of Josephine Miles at Berkeley in the 1930s (Buurma and Heffernan 2018) to Roberto Busa's work in the 1940s to work that links Structuralism and Russian Formalism at the turn of the 19th century (Algee-Hewitt 2015) to today's systemized explorations of texts. The sciences and humanities have a shared history in their desire to solve the patterns and systems that make language functional and impactful, and there have long been linguistic and computational tools that help advance this work. What's more challenging to unravel and articulate from these origin stories are the mathematical concepts behind the tools that humanists wield. Ideally, one would navigate this historical landscape when assessing the fitness of any given computational technique for addressing a specific humanities research question, but often researchers choose tools because they are powerful and popular, without a robust understanding of the conceptual assumptions they embody, which are defined by the mathematical and statistical principles they are based on. This can make it difficult to generate reproducible results that contribute to a tool's methodological development.

This is related to a set of issues that drive debates among computationally-minded scholars, which regularly appear in digital humanities forums. In 2019, for instance, Nan Da issued a harsh critique of humanists' implementation of statistical methods in their research.1 Her claim is that computational methods are not a good match for literary research, and she systematically shows how the results from several computational humanities studies are not only difficult to reproduce, but can be easily skewed with minor changes to how an algorithm is implemented.

1 Da's critique of statistical model usage in computational humanities work sparked a forum of responses in Critical Inquiry.
Although this debate about digital methods points to a necessary evolution in the field (in which researchers become more accountable to the computational laws that they are utilizing), her essay's broader mission is to question the appropriateness of using computational tools to investigate literary objects and ideas.

Refutations to this claim were swift and abundant (Critical Inquiry 2019), and highlight a number of concepts central to my concern here with future intersections of machine learning and literary research. Respondents such as Mark Algee-Hewitt pointed out that literary scholars employ computational statistical models in order to reveal something about texts that human readers could not. In doing so, literary scholars are at liberty to note where computation reaches its useful limit2 and take up more traditional forms of literary analysis (Algee-Hewitt 2019). Katherine Bode explores the promise and pitfalls of this hybrid "close and distant reading" approach in her 2020 article on the intersection of topic modeling and bias. Imperfect as the hybrid method is, stressing the value of familiar interpretive methods remains important, politically and practically, when bringing computation into humanities departments.

2 This limit typically exists for a combination of three reasons: computer programs can only generate models based on the data we give them, a tool isn't fully understood and so not robustly explored, and many algorithms and tools are being used in experimental ways.

This essay extends the argument that computational tools do more than turn big data into novel close reading opportunities. Machine learning, and word embedding algorithms in particular, may have a unique ability to shift this conversation into new territory, where scholars begin to ask how historical research can contribute more sophisticated approaches to treating words as data. With historically-minded approaches to dataset creation for machine learning, issues emerge that engender new theoretical frameworks for evaluating the ability of statistical models of information to reveal cultural and artistic dimensions of language. I will first contextualize what word embeddings do, and then show a few of the mathematical concepts that have driven their development.

Of the many available machine learning algorithms, word embedding algorithms have shown particular promise in capturing contextual meanings (of words or other units of textual data) more accurately than previous techniques in natural language processing. Word embeddings encompass a set of language modeling techniques where words or phrases from a large set of texts (i.e., "corpus") are analyzed through the use of a neural network architecture. For each vocabulary term in the corpus, the neural network algorithm uses the term's proximity to other words to assign it values that become a vector of real numbers; one high-dimensional vector is generated for each word. (The term "embedding" refers to the mathematics that turns a space with many dimensions per word into a continuous vector space with a much lower dimension.)3 Word embeddings raise three critical issues for this essay: How do word embeddings reflect the contexts of words in order to capture their relative meanings? If word embeddings approximate word meanings, do they also reflect culture? How can literary history and cultural studies inform how scholars use them?

3 See Koehrsen 2018 for a fuller explanation of the process.
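Before taking up those questions, it may help to see the mechanics at their smallest scale. The following is a minimal, hedged sketch of training and querying a word embedding model with the gensim library's word2vec implementation; the toy corpus and parameter values are placeholder assumptions, not a recipe from the scholarship discussed here.

    from gensim.models import Word2Vec

    # A toy corpus: in practice, thousands of tokenized sentences drawn
    # from the texts under study.
    corpus = [
        ["the", "poet", "praised", "her", "wit", "and", "eloquence"],
        ["his", "wit", "was", "sharp", "in", "argument"],
        ["the", "annual", "printed", "poetry", "and", "prose"],
    ]

    # vector_size, window, and min_count are illustrative parameter choices.
    model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50)

    # Each vocabulary term now has a dense vector of real numbers...
    print(model.wv["wit"][:5])
    # ...and cosine similarity over those vectors ranks a term's nearest
    # neighbors, as in the "words most similar to wit" example discussed below.
    print(model.wv.most_similar("wit", topn=5))

Even in this sketch, choices such as the window size quietly encode assumptions about what "context" means, which is one reason the parameters of the task matter as much as the corpus.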
Word embeddings are powerful because they calculate semantic similarities between words based on their distributional properties in large samples of language data. As computational linguist Jussi Karlgren puts it:

Language is a general-purpose representation of human knowledge, and models to process it vary in the degree they are bound to some task or some specific usage. Currently, the trend is to learn regularities and representations with as little explicit knowledge-based linguistic processing as possible, and recent advances in such general models for end-to-end learning to address linguistics tasks have been quite successful. Most of those approaches make little use of information beyond the occurrence or co-occurrence of words in the linguistic signal and take the single word to be the atomary unit.

This is notable because it highlights the power of word embeddings to assign values to words in order to represent their relative meanings, simply based on unstructured language data, without a system of linguistic rules or a labelling system. It also highlights the fact that a word embedding model's success is based on the parameters of the task it is designed to address. So while the accuracy and power of word vector algorithms might be recognizable in general-purpose applications that improve with larger training corpora (for instance, Google News and Wikipedia), they can be equally powerful representation learning systems for specific historical research tasks that use different benchmarks for success. Humanists using these machine learning methods are learning to think differently about corpora size, corpora content, and the utility of a successfully-trained model for analysis and interpretation.

No matter what the application, the success of machine learning applications is predicated on creating good datasets. As a recent paper in IEEE Transactions on Knowledge and Data Engineering notes, "the majority of the time for running machine learning end-to-end is spent on preparing the data, which includes collecting, cleaning, analyzing, visualizing, and feature engineering" (Roh et al. 2019, 1). Acknowledging this helps contextualize machine learning algorithms for text analysis tasks in the humanities, but also highlights data curation challenges that can be taken up in new ways by humanists. This naturally raises questions about how machine learning algorithms like word embeddings are implemented for text analysis, and how they should be modified for historical research—they require different computational priorities and frameworks.

In parallel to the corpora considerations that computational humanities scholars ponder, there is an abundance of work, across disciplines such as cognitive science and psychology (Griffiths et al. 2007), that attempts to refine the problems and limits of using large collections of text for training embeddings. These large collections tend to reflect the biases that exist in society and history, and in turn, systems based on these datasets can make troubling inferences, now well documented as algorithmic bias.4 Computer science researchers need to evaluate the social dimensions of their applications in diverse societies and find ways to fairly represent all populations.

4 As investigated, for instance, in Noble 2018.

Digital humanities practices can implicitly help address these issues.
Literary studies, as it evolves towards multivocality and canon expansion, makes explicit a link between methods of literary analysis and digital practices that are deliberately inclusive, less-biased, and diachronic (rather than ahistorical). Emerging literary scholarship uses computational methods to question hegemonic practices in the history of the field, through the now-familiar practice of data curation (Poole 2013). But this work can also help combat algorithmic bias more broadly, and expand beyond corpus development into algorithmic design. As digital literary scholarship continues to deepen its exchanges with Sociology, History, and Information Science, stronger methodologies for using fair and representative data will become pervasive throughout these disciplines, as well as in commercial applications. Interdisciplinary methodologies are foundational to future computational literary research that can make sophisticated contributions to text analysis.

The Bengal Annual: A Case Study

Complex relationships between words cannot be fully assessed with one flat application of a powerful tool to a set of texts. But this does not mean that the usefulness of machine learning for literature is limited: rather, scholars can wield it to control how machines learn sets of relationships between concepts. Choosing which texts to include in a corpus is coupled to decisions about whether and how to label its contents, and how to tune the parameters of the algorithms. For the purposes of literary analysis, these should be embraced as interpretive, biased acts—ones that deepen understanding of commonly-employed computational methods—and folded into emerging methodologies. Because humanities scholars are not generating models to serve applications with thousands of end-users who primarily expect accuracy, they can exploit the fallacies of machine learning in order to improve how dataset management and feature engineering are conducted. Working with big data in order to generate models isn't valuable because it reveals history's "true" cultural patterns, but because it demonstrates how machines already circulate those "truths." A scholar's deep knowledge of the historical content and formalities of language can determine how corpora are compared, how we experiment with known biases, and how we move towards a future landscape of literary analysis that is inclusive of marginalized texts and the latest cultural theory.

Roopika Risam, for instance, advocates for both a theoretical and practice-based decolonization of the digital humanities, noting ways that postcolonial digital archives can intervene in knowledge production in society (2018, 79). Corpora created from periods of revolution, then, might reveal especially useful vector relationships and lead to better understanding of semantic changes during those times. Those word embeddings might be useful for teaching computers racialized language over timelines, so that machine learning applications do not only "read" history as a flat set of relationships, and inevitably reflect the worst of its biases.

To begin to unpack this process, I will present a case study on the 1830 Bengal Annual and a corpus of similarly-situated texts. Our team, made up of students in Katherine D.
Harris's graduate seminar on decolonizing Romantic Literature at San Jose State University, asked: can we operationalize questions that arise from close readings of texts to turn problematic quantitative evaluations of words into more complex methods of interpretation? A computer cannot interpret complex cultural concepts, but it can be instructed to weigh time period, narrative perspective, and publication venue, much as a literary scholar would.

With the explosion of print culture in England in the first half of the nineteenth century, publishers began introducing new forms of serialized print materials, which included serialized publications known as literary annuals (Harris 2015). These multi-author texts were commonly produced as high-quality volumes that could be purchased as gifts in the months leading up to the holiday season. As a genre, the annual included poetry, prose, and engravings, among other varieties of content, very often from well-known authors. Literary annuals represent a significant shift in the economics surrounding the production of print materials for mass consumption—for instance, contributors were typically paid. And annuals, though a luxury item, were more affordable than books sold before the mechanization of the printing press (Harris 2015, 1–29).

Literary annuals and other periodicals are interesting sites of literary study because they can be read as reinforcing or resisting the British Empire. London-based periodicals were eventually distributed to all of Britain's colonial holdings, including India (Harris 2019). As The Bengal Annual was written in India and contains a small representation of Indian authors, our project investigates it as a variation on British-centric reading materials of the time, which perhaps offered a provisional voice to a wider community of writers (though not without claims of superiority over the colonized territory it exploits). Some of the contents invoke themes that are affiliated with major Romantic writers such as William Wordsworth and Samuel T. Coleridge, but editor D.L. Richardson included short stories and fiction, which were not held in the same regard as poetry. He also employed local native Indian engravers and writers.

To explore the thesis that the concepts and genres typically associated with British Romantic Literature are represented differently in a text that was written and produced in a different space with a set of contributors who were not exclusively British natives, we experimented with word embeddings on semantic similarity tasks, comparing the annual to texts like Lyrical Ballads. Such a task is within the scope of traditional literary analysis, but my agenda was to probe the idea that we need large-scale representations of marginalized voices in order to show real differences from the ideas of the dominant race, class, and gender.5

The project team first used statistical tools to find out if the Annual's poetry, non-fiction, and fiction contained interesting relationships between vocabularies about body parts, social class, and gender. We gathered information about terms that might reveal how different parts of the body were referenced depending on sex. These differences were validated by traditional close-reading knowledge about British Romantic Literature and its historical contexts,6 and signaled the need to read and analyze the Annual's passages about body parts, especially ones by writers of different genders and social backgrounds.
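As a purely illustrative sketch of this kind of first-pass counting (the term lists and file name here are hypothetical placeholders, not the project's actual vocabularies), one might tally how often body-part terms occur near gendered pronouns:

    import re
    from collections import Counter

    # Hypothetical term lists; the project's actual vocabularies were richer.
    body_terms = {"eye", "eyes", "hand", "hands", "cheek", "brow", "heart"}
    male_markers, female_markers = {"he", "his", "him"}, {"she", "her", "hers"}

    counts = {"male": Counter(), "female": Counter()}
    window = 10  # tokens of context on each side of a body term

    tokens = re.findall(r"[a-z']+", open("bengal_annual.txt").read().lower())
    for i, tok in enumerate(tokens):
        if tok in body_terms:
            context = set(tokens[max(0, i - window):i + window + 1])
            if context & male_markers:
                counts["male"][tok] += 1
            if context & female_markers:
                counts["female"][tok] += 1

    for sex, counter in counts.items():
        print(sex, counter.most_common(5))

Counts like these prove nothing on their own; their role was to flag passages worth reading closely.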
These simple methods allowed us to take a streamlined approach to confirming that an author's perspective indeed altered his or her word choices and other aspects of their references to male vs. female bodies.

Collecting and mapping those references, however, was not enough to build a larger argument about how discourse on bodies might be different in non-canonical British Romantic Literature. Based on the potential for word embeddings to model semantic spaces for different corpora and compare the distribution of terms, the next step was to build a corpus of non-canonical texts of similar scope to a corpus of canonical works, so that models for each could be legitimately compared. This work, currently in progress, faces challenges that are becoming more familiar to digital historians: the digitization of rare texts, the review of digitization processes for accuracy, and the cleaning of data.

The primary challenge is to find the correct works to include: this requires historical expertise, but also raises the question of how to uncover unknown authors. Manu Chander's Brown Romantics calls for a global assessment of Romantic Literature's impact by "calling attention to its genuinely unacknowledged legislators" (Chander 2017, 11). But he contends that even the authors he was able to study were already individuals who aspired to assimilate with British culture and ideologies in some ways, and perhaps don't represent political resistance or views entirely antithetical to the British Empire.

Guided by Chander's questions about how to locate dissent in contexts of colonization, we documented instances in the text that highlight the dynamics of colonialism, race, and nationalism, and compared them to a set of statistical explorations of the text's vocabulary (particularly terms related to national identity, gender, and bodies). Chander's call for a more globally-comprehensive study of Romanticism speaks to the politics of corpora curation discussed above, but also suggests that corpus comparison can benefit from formal methodological guidelines. Puzzling out how to best combine traditional close readings with quantitative inquiries, and then map that work to a machine-learning research framework, revealed several shortcomings in methodological standardization. It also revealed several opportunities for rethinking the way algorithms could be implemented, by adopting and systematizing familiar comparative research practices. Ideas about such methodologies are emerging in many disciplines, which I highlight later in this essay.

5 Such textual repositories are important outside of literature departments, too. We need data to represent all voices in training machines to represent any social arena.
6 Some of these findings are illustrated in the project's Scalar site: http://scalar.usc.edu/works/the-bengal-annual/bodies-in-the-annual.
Disciplinary directions for word vector research

The potential of word embedding techniques for projects such as our Bengal Annual analysis can be seen in the new computational research directions that have emerged in humanities research.7 Vector-space representations are based on high-dimensional vectors8 of real numbers.9 Those vectors' values are assigned using a word's relationship to the words near it in a text, based on the likelihood that a word will appear in proximity to other words it is told to "look" at. For example, the visualization in figure 3.1 demonstrates an embedding space for a historical corpus (1640–1699) using the values assigned to word vectors.

Figure 3.1: A visualized space with reduced dimensions of a neighborhood around wit (Gavin et al. 2019, Figure 21.2).

In a visualized space (with reduced dimensions) such as the one in figure 3.1, distances among vectors can be assessed, for example, to articulate the forty words most similar to wit. This particular model (trained using the word2vec algorithm), published in the 2019 Debates in the Digital Humanities,10 allowed the authors to visualize the term wit with synonyms on the left side, and terms related to argumentation on the right, such as indeed, argues, and consequently. This initial exploration prompted Gavin and his co-authors to look at a vector space model for a single author (John Dryden), in order to both validate the model against their subject matter expertise and explore the model's results. Although word vectors are often employed for machine translation tasks11 or to project analogistic relationships between concepts,12 they can also be used to question concepts that are traditionally associated with particular literary periods and evaluate those associations with new kinds of evidence.

7 See Kirschenbaum 2007 and Argamon and Olsen 2009.
8 A word vector may have hundreds or even thousands of dimensions.
9 Word embedding algorithms are modelled on the linguistic concept that context is a primary way that word meanings are produced. Their usefulness is dependent on the breadth and domain-relevance of the corpus they are trained on, meaning that a corpus of medical research vs. a corpus of 1980s television guides vs. a corpus of family law proceedings would generate models that show different relationships between words like "family," "health," "heart," etc.
10 See Goldstone 2019.
11 Software used to translate text or speech from one language to a target language. Machine translation is a subfield of computational linguistics that can now allow for domain-based (i.e., specialized subject matter) customizations of translations, making translated word choices more context-specific.
12 Although word embeddings aren't explicitly trained to learn analogies, the vectors exhibit seemingly linear behavior (such as "woman is to queen as man is to king"), which approximately describes a parallelogram. This phenomenon is explored in Allen and Hospedales 2019.

What this type of study suggests is that we can look at cultural concepts like wit in new ways. These results can also facilitate a comparison of historical models of wit to contemporary ones—to show how its meaning may have shifted, using its changing relationship to other words as evidence. This is a growing area of research in the social sciences, computational linguistics, and other disciplines (Kutuzov et al. 2019). In a survey paper on current work in diachronic word embeddings and semantic shifts, Kutuzov et al.
note that the surge of interest points to its importance for natural language processing, but that it currently lacks "cohesion, common terminology and shared practices."

Some of this cohesion might be generated by putting the usefulness of word vectors in the context of the history of information retrieval and the history of distributed representation. Word embeddings emerged in the 1960s, with data modeled as a matrix, and a user's query of a database represented as a vector. Simple vector operations could be used to locate relevant data or documents. Gerald Salton is generally credited as one of the first to do this, based on the idea that he could represent a document as a vector of keywords and use measures like cosine similarity and dimensionality reduction to compare documents.13 Since the 1990s, vector space models have been used in distributional semantics. In a paper on the history of vector space models, which examines the trajectory of Gerald Salton's work, David Dubin notes that these mathematical models can be defined as "a consistent mathematical structure designed to correspond to some physical, biological, social, psychological, or conceptual entity" (2004). In the case of word vectors, word context and collocations give us quantifiable information about a word's meaning.

13 Algorithms like word2vec take as input the linguistic context of words in a given corpus of text, and output an N-dimensional space of those words—each word is represented as a vector of dimension N in that Euclidean space. Word vectors with thousands of values are transformed to lower-dimensional spaces in which the directionality of two vectors can be measured using cosine similarity—words that exist in similar contexts would be expected to have a similar cosine measurement and map to like clusters in the distributed space.

But research in cognitive science has long questioned the property of linguistic similarity in spatial representations, because such representations don't align with important aspects of human semantic processing (Tversky 1977). Tversky shows, for example, that people's interpretation of semantic similarity does not always obey the triangle inequality, i.e., the words w1 and w3 are not necessarily similar when both pairs of (w1, w2) and (w2, w3) are similar. While "asteroid" is very similar to "belt" and "belt" is very similar to "buckle", "asteroid" and "buckle" are not similar (Griffiths et al. 2007). One reason this violation arises is because a word is represented as a single vector even when it has multiple meanings. This has led to research that attempts new methods to capture different senses of words in embedding applications. In a paper surveying techniques for differentiating words at the "sense" level, Jose Camacho-Collados and Mohammad Taher Pilehvar show that these efforts fall in two camps: "Unsupervised models directly learn word senses from text corpora, while knowledge-based techniques exploit the sense inventories of lexical resources as their main source for representing meanings" (2018, 744). The first method, an unsupervised model, induces different meanings of a word: it is trained to analyze and represent each word sense based on statistical knowledge derived from the contexts within a corpus. The second method for disambiguation relies on information contained in other databases or sources.
WordNet, for instance, associates multiple words with concepts, providing a sense inventory for terms. It is made up of synsets, which represent unique concepts that can be expressed through nouns, verbs, adjectives, or adverbs. The synset of a concept such as "a business where patrons can purchase coffee and use WiFi" might be "cafe, coffeeshop, internet cafe," etc. Camacho-Collados and Pilehvar review different ways to process word embedding results using WordNet and similar resources, which essentially provide synonyms that share a common meaning.

There exists a relationship between work that addresses word disambiguation and work that addresses the biases that word vector algorithms produce. Just as researchers can modify general word embedding models to capture a word's multiple meanings, they can also modify them according to a word's usage over time. These evolving methods begin to account for the social, historical, and psychological dimensions of language. If one can show that applying word embedding algorithms to diachronic corpora or corpora of different domains produces different biases, this would suggest that nuanced shifts in vocabulary and word usage can be used to impact data curation practices that seek to isolate and remove historical bias from other word embedding models.

Biases, one might say, persist despite contextual changes. Or, one might say that the shortcomings of word embeddings don't account for changes in bias that are present in context. This is where the domain expertise of literary scholars also becomes essential. Historians' domain expertise and natural interest in comparative corpora (from different time periods or containing different types of documents) situates their ability to curate datasets that tend to both data ethics and computational innovation. Such work could have impact beyond historical research, and result in data-level corrections to biases that emerge in more general-purpose embedding applications. This could be more effective and reproducible than correcting them superficially (Gonen and Goldberg 2019). For instance, if novel cultural biases can be traced to an origin period, texts from that period could constitute a sub-corpus. Embedding models specific to that corpus might be subtracted from the vectors generated from a broader dataset.

Examining a methodology's history is an essential way in which scholars can strengthen the validity of computationally-driven research and its integration into literary departments—this type of scholarship reconstitutes literary insights after the risky move of flattening literary texts with the rigor of machines. But as Lauren Klein (2019) and others reveal, scholars have begun to apply interpretation and imagination in both the computational and the "close reading" aspects of their research. This reinforces that computational shifts in the study of literature are more than just the adoption of useful tools for the sake of locating a novel pattern in data. An increasingly important branch of digital literary research demonstrates the efficacy of engaging the interdisciplinary complexity of computational tools in relation to the complexity of literary analysis.

New ideas for close readings and analysis can serve as windows into defining secondary computational research questions that emerge from an initial statistical exploration.
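For a small, concrete illustration of such a knowledge-based sense inventory, here is a sketch of querying WordNet through the NLTK interface (the word chosen is arbitrary; this is only one of the resources Camacho-Collados and Pilehvar survey):

    import nltk
    nltk.download("wordnet", quiet=True)  # fetch the WordNet data on first use
    from nltk.corpus import wordnet as wn

    # Each synset is one sense of the word, with a gloss and synonymous lemmas.
    for synset in wn.synsets("bank"):
        print(synset.name(), "-", synset.definition())
        print("  lemmas:", [lemma.name() for lemma in synset.lemmas()])

Knowledge-based post-processing, in the broad sense surveyed above, maps embedding results onto sense entries like these rather than leaving each word as a single undifferentiated vector.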
As in the work reviewed by Camacho-Collados and Pilehvar, outside knowledge of word senses can be used for post-processing word embeddings in ways that address theoretical issues. Implementing this type of process for humanities research, one might begin with the question: can I generate word vector models that attend to both author gender and word context if I train them in innovative ways? Does this require a corpus of male authors and one of female authors? Or would this be better accomplished with an outside lexical source that has already associated word senses with genders?

Multi-disciplinary scholars are experimenting with a variety of methods to use word vector algorithms to track semantic complexities, and humanities researchers need an awareness of the technical innovations across a range of these disciplines because they are in a position to bring important domain knowledge to these efforts. Ideally, the questions that unite these disciplinary efforts might be: how do we make word contexts and distributional semantics more useful for both historians, who need reproducible results that lead to new interpretation, and technologists, who need historical interpretation to play a larger role in language generalization? Modeling language histories depends on how deeply humanists can understand word embedding models, so that they can augment their inherent shortcomings. Cross-disciplinary collaborations help scholars return to fundamental issues that arise when we treat words as data, and help bring more cohesive methodological standards to language modeling.

New directions in cross-disciplinary machine learning frameworks

Literary scholars set up computational inquiries with attention to cultural complexity, and seek out instances of language that convey historical context. So while they aren't likely to lead the charge in correcting fundamental shortcomings of language representation algorithms, they can increasingly impact social assessments of those algorithms, provide methodologies for those algorithms to locate anomalies in language usage, and assess whether those algorithms embody socially just practices (D'Ignazio and Klein 2020). Some literary scholars also critique the non-neutral ideologies that are in place in both computing and the humanities (Rhody 2017, 660).

These efforts not only make the field of literary studies (and its history) more relevant to a digitally and computationally-driven future, but also help literary scholars create meaningful intersections between their computational tools and theoretical training. That training includes frameworks for reading and analysis that computers cannot yet perform, but should aspire to—from close reading, Semiotic Criticism, and Formalism to Post-structuralism, Cultural Studies, and Feminist Theory. The varied systems literary scholars have developed for thinking about signs, words, and symbols should not be seen as irreconcilable with computational tools for text analysis. Instead, they should become the foundation for new methodologies that tackle the shortcomings of machine learning algorithms and project future directions for text analysis.

Linguists and scientists interested in natural language processing have often looked to the humanities for methods that assign rules to the production of meaning. Such methods exist within the history of literary criticism, some of which are being newly explored as concepts for language modeling algorithms.
Multi-disciplinary scholars are experimenting with a variety of methods to use word vector algorithms to track semantic complexities, and humanities researchers need an awareness of the technical innovations across a range of these disciplines because they are in a position to bring important domain knowledge to these efforts. Ideally, the questions that unite these disciplinary efforts might be: how do we make word contexts and distributional semantics more useful for both historians, who need reproducible results that lead to new interpretation, and technologists, who need historical interpretation to play a larger role in language generalization? Modeling language histories depends on how deeply humanists can understand word embedding models, so that they can compensate for the models' inherent shortcomings. Cross-disciplinary collaborations help scholars return to fundamental issues that arise when we treat words as data, and help bring more cohesive methodological standards to language modeling.

New directions in cross-disciplinary machine learning frameworks

Literary scholars set up computational inquiries with attention to cultural complexity, and seek out instances of language that convey historical context. So while they aren't likely to lead the charge in correcting fundamental shortcomings of language representation algorithms, they can increasingly impact social assessments of those algorithms, provide methodologies for those algorithms to locate anomalies in language usage, and assess whether those algorithms embody socially just practices (D'Ignazio and Klein 2020). Some literary scholars also critique the non-neutral ideologies that are in place in both computing and the humanities (Rhody 2017, 660). These efforts not only make the field of literary studies (and its history) more relevant to a digitally and computationally-driven future, but also help literary scholars create meaningful intersections between their computational tools and theoretical training. That training includes frameworks for reading and analysis that computers cannot yet perform, but should aspire to—from close reading, Semiotic Criticism, and Formalism to Post-structuralism, Cultural Studies, and Feminist Theory. The varied systems literary scholars have developed for thinking about signs, words, and symbols should not be seen as irreconcilable with computational tools for text analysis. Instead, they should become the foundation for new methodologies that tackle the shortcomings of machine learning algorithms and project future directions for text analysis.

Linguists and scientists interested in natural language processing have often looked to the humanities for methods that assign rules to the production of meaning. Such methods exist within the history of literary criticism, some of which are being newly explored as concepts for language modeling algorithms. For instance, data curation takes inspiration from cultural studies, which empowers literary scholars to correct for bias and underrepresentation in literature by expanding the canon. Subsequent literary findings from that research need not only be literary ones: they have the potential to serve as models for best practices for computational tools and datasets more broadly. While the rift between society's most progressive ideas and its technological advancement is not unique to the rise of machine learning, practical opportunities exist to repair the rift with a blend of literary criticism and computational skills, and there are many recent examples14 of the growing importance of combining rich technical explanations, interdisciplinary theories, and original computational work in corpus linguistics and beyond. A desire to wield social and computational concerns simultaneously is evident also in recent work in Linguistics,15 Sociology,16 and History.17

Studies in computational Sociology by Laura K. Nelson, Austin C. Kozlowski, Matt Taddy, James A. Evans, Peter McMahan, and Kenneth Benoit contain important parallels for machine learning-driven text analysis. Nelson, for instance, calls for a new three-step methodology for computational sociology, one that "combines expert human knowledge and hermeneutic skills with the processing power and pattern recognition of computers, producing a more methodologically rigorous but interpretive approach to content analysis" (2020, 1). She describes a framework that can aid in reproducibility, which was noted as a problem by Da. Kozlowski, Taddy, and Evans, who study relationships between attention and knowledge, use a vector space model in a September 2019 paper on the "geometry of culture" to analyze a century of books. They show "that the markers of class continuously shifted amidst the economic transformations of the twentieth century, yet the basic cultural dimensions of class remained remarkably stable. The notable exception is education, which became tightly linked to affluence independent of its association with cultivated taste" (1). This implies that disciplinary expertise can be used to isolate sub-corpora for use in secondary word embedding research problems. Resulting word similarity findings could aid in both validating the initial research finding and defining domain-specific datasets that are reusable for future research.

14 See Whitt 2018 for a state-of-the-art overview of the intersecting fields of corpus linguistics, historical linguistics, and genre-based studies of language usage.
15 A special issue in the journal Language from the Linguistic Society of America published responses to a call to reconcile the unproductive rift between generative linguistics and neural network models. Christopher Potts's response (2019) advocates the integration of deep learning and traditional linguistic semantics.
16 Sociologist Laura K. Nelson (2020) calls for a three-step methodological framework called computational grounded theory.
17 Another special issue, this one from Isis, a journal from the History of Science Society, suggests that "the history of knowledge can act as a bridge between the world of the humanities, with its tradition of close reading and detailed understanding of individual cases, and the world of big data and computational analysis" (Laubichler, Maienschein, and Renn 2019, 502).
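Returning to the "geometry of culture" study discussed above: a rough sketch of its vector-space technique (my illustration, not the authors' code) builds a class dimension from antonym pairs in a small pretrained embedding model and projects other words onto it.

```python
# A rough sketch of building an affluence dimension from antonym pairs
# and projecting words onto it, in the spirit of Kozlowski, Taddy, and
# Evans (2019). The pairs and probe words here are my own choices.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # small pretrained vectors

pairs = [("rich", "poor"), ("affluent", "impoverished"),
         ("expensive", "cheap")]
# The class dimension is the average difference of the pair vectors.
axis = np.mean([wv[a] - wv[b] for a, b in pairs], axis=0)
axis /= np.linalg.norm(axis)

for word in ["education", "opera", "yacht", "soccer"]:
    v = wv[word] / np.linalg.norm(wv[word])
    print(f"{word:10s} projection on class axis: {v @ axis:+.3f}")
```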
The idea of using humanities methodologies to inform model architectures for machine learning is part of a wider history of computational scientists drawing inspiration from other fields to make AI systems better. Designing humanities research with novel word embedding models stands to widen the territory where machine learning engineers look for concepts to inspire strategies for improving the performance of artificial language understanding. Many computer scientists are investigating the figurative (Gagliano et al. 2019) and the metaphorical (Mao et al. 2018) in language. As machines get better at reading and interpreting texts, literary studies and theories will become more applicable to how those machines are programmed to look at multiple layers and dimensions of language. Ted Underwood, Andrew Piper, Katherine Bode, James Dobson, and others make connections between computational literary research and social dimensions of the history of vector space model research. Since vector models are based on the 1950s linguistic notion of similarity (Firth 1957), researchers working to show superior algorithmic performance focus on different aspects of why similarity is important than do researchers seeking cultural insights within their data. But Underwood points out that a word vector can also be seen as a way to quantitatively account for more aspects of meaning (2019). Already, cross-disciplinary scholarship draws on computational linguistics,18 information science,19 and semantic linguistics, and the imperative to understand concepts from all of these fields is growing. As better methods are developed for using word embeddings to better understand texts from different domains and time periods, more sophisticated tools and paradigms emerge that echo the complexity of traditional literary and historical interpretation.

Systematic data curation, combined with word embedding algorithms, represents a new interpretive system for literary scholars. The potential of machine learning methods for text analysis goes beyond historical literary text analysis, and the methods for literary text analysis using machine learning also go beyond literature departments. The corpora they model and the way they frame their research questions reframe the potential to use systems like word vectors to understand aspects of historical language and could have broader ramifications for how other applications model word meanings. Because such literary research generates novel frameworks for using machine learning to represent language, it's imperative to explore the question: Are there ways that humanities methodologies and research goals can exert greater influence in the computational sciences, make the history of literary studies more relevant in the evolution of machine learning techniques, and better serve our shared social values?

18 Linguistics scholars are also adopting computational models to make progress with theories related to semantic similarity. For instance, see Potts 2019.
19 See Lin 1998, for example.

References

Algee-Hewitt, Mark. 2015. "The Order of Poetry: Information, Aesthetics and Jakobson's Theory of Literary Communication." Presented at the Russian Formalism & the Digital Humanities Conference, April 13, Stanford University, Palo Alto, CA. https://digitalhumanities.stanford.edu/russian-formalism-digital-humanities.
Algee-Hewitt, Mark. 2019. "Criticism, Augmented." In the Moment (blog). April 1, 2019. https://critinq.wordpress.com/2019/04/01/computational-literary-studies-participant-forum-responses/.
Allen, Carl, and Timothy Hospedales. 2019. "Analogies Explained: Towards Understanding Word Embeddings." In International Conference on Machine Learning, 223–31. PMLR. http://proceedings.mlr.press/v97/allen19a.html.
Argamon, Shlomo, and Mark Olsen. 2009. "Words, Patterns and Documents: Experiments in Machine Learning and Text Analysis." Digital Humanities Quarterly 3 (2). http://www.digitalhumanities.org/dhq/vol/3/2/000041/000041.html.
Bode, Katherine. 2020. "Why You Can't Model Away Bias." Modern Language Quarterly 81 (1): 95–124. https://doi.org/10.1215/00267929-7933102.
Buurma, Rachel Sagner, and Laura Heffernan. 2018. "Search and Replace: Josephine Miles and the Origins of Distant Reading." Modernism/Modernity Print+ 3, Cycle 1 (April). https://modernismmodernity.org/forums/posts/search-and-replace.
Camacho-Collados, Jose, and Mohammad Taher Pilehvar. 2018. "From Word To Sense Embeddings: A Survey on Vector Representations of Meaning." Journal of Artificial Intelligence Research 63 (December): 743–88. https://doi.org/10.1613/jair.1.11259.
Chander, Manu Samriti. 2017. Brown Romantics: Poetry and Nationalism in the Global Nineteenth Century. Lewisburg, PA: Bucknell University Press.
Critical Inquiry. 2019. "Computational Literary Studies: A Critical Inquiry Online Forum." In the Moment (blog). March 31, 2019. https://critinq.wordpress.com/2019/03/31/computational-literary-studies-a-critical-inquiry-online-forum/.
Da, Nan Z. 2019. "The Computational Case against Computational Literary Studies." Critical Inquiry 45 (3): 601–39. https://doi.org/10.1086/702594.
D'Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Cambridge: MIT Press.
Douglas, Samantha, Dan Dirilo, Taylor-Dawn Francis, Keith Giles, and Marisa Plumb. n.d. "The Bengal Annual: A Digital Exploration of Non-Canonical British Romantic Literature." https://scalar.usc.edu/works/the-bengal-annual/index.
Dubin, David. 2004. "The Most Influential Paper Gerard Salton Never Wrote." Library Trends 52 (4): 748–64. https://www.ideals.illinois.edu/bitstream/handle/2142/1697/Dubin748764.pdf?sequence=2.
Firth, J.R. 1957. "A Synopsis of Linguistic Theory." In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.
Gagliano, Andrea, Emily Paul, Kyle Booten, and Marti A. Hearst. 2019. "Intersecting Word Vectors to Take Figurative Language to New Heights." In Proceedings of the Fifth Workshop on Computational Linguistics for Literature, 20–31. San Diego, California: Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-0203.
Gavin, Michael, Collin Jennings, Lauren Kersey, and Brad Pasanek. 2019. "Spaces of Meaning: Conceptual History, Vector Semantics, and Close Reading." In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein, 243–267. Minneapolis: University of Minnesota Press.
Goldstone, Andrew. 2019. "Teaching Quantitative Methods: What Makes It Hard (in Literary Studies)." In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein. Minneapolis: University of Minnesota Press. https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/620caf9f-08a8-485e-a496-51400296ebcd#ch19.
Gonen, Hila, and Yoav Goldberg. 2019. "Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them." ArXiv:1903.03862, September. https://arxiv.org/abs/1903.03862.
Griffiths, Thomas L., Mark Steyvers, and Joshua B. Tenenbaum. 2007. "Topics in Semantic Representation." Psychological Review 114 (2): 211–44. https://doi.org/10.1037/0033-295X.114.2.211.
Harris, Katherine D. 2015. Forget Me Not: The Rise of the British Literary Annual, 1823–1835. Athens: Ohio University Press.
Harris, Katherine D. 2019. "The Bengal Annual and #bigger6." Keats-Shelley Journal 68: 117–18. https://muse.jhu.edu/article/771132.
Kirschenbaum, Matthew. 2007. "The Remaking of Reading: Data Mining and the Digital Humanities." Presented at the National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation, Baltimore, MD, October 11. https://www.csee.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf.
Klein, Lauren F. 2019. "What the New Computational Rigor Should Be." In the Moment (blog). April 1, 2019. https://critinq.wordpress.com/2019/04/01/computational-literary-studies-participant-forum-responses-5/.
Koehrsen, Will. 2018. "Neural Network Embeddings Explained." Towards Data Science, October 2, 2018. https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. "The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings." American Sociological Review 84 (5): 905–949. https://doi.org/10.1177/0003122419877135.
Kutuzov, Andrey, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. "Diachronic word embeddings and semantic shifts: a survey." In Proceedings of the 27th International Conference on Computational Linguistics, 1384–1397. Santa Fe, New Mexico: Association for Computational Linguistics. https://www.aclweb.org/anthology/C18-1117.
Laubichler, Manfred D., Jane Maienschein, and Jürgen Renn. 2019. "Computational History of Knowledge: Challenges and Opportunities." Isis 110 (3): 502–512.
Lin, Dekang. 1998. "An Information-Theoretic Definition of Similarity." In Proceedings of the Fifteenth International Conference on Machine Learning, 296–304. San Francisco, California: Morgan Kaufmann Publishers Inc.
Mao, Rui, Chenghua Lin, and Frank Guerin. 2018. "Word Embedding and WordNet Based Metaphor Identification and Interpretation." In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1222–31. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1113.
Nelson, Laura K. 2020. "Computational Grounded Theory: A Methodological Framework." Sociological Methods & Research 49 (1): 3–42. https://doi.org/10.1177/0049124117729703.
Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: New York University Press.
Poole, Alex H. 2013. "Now Is the Future Now? The Urgency of Digital Curation in the Digital Humanities." Digital Humanities Quarterly 7 (2). http://www.digitalhumanities.org/dhq/vol/7/2/000163/000163.html.
Potts, Christopher. 2019. "A Case for Deep Learning in Semantics: Response to Pater." Language 95 (1): e115–24. https://doi.org/10.1353/lan.2019.0019.
Rhody, Lisa. 2017. "Beyond Darwinian Distance: Situating Distant Reading in a Feminist Ut Pictura Poesis Tradition." PMLA 132 (3): 659–667.
Risam, Roopika. 2018. "Decolonizing the Digital Humanities in Theory and Practice." In The Routledge Companion to Media Studies and Digital Humanities, edited by Jentery Sayers, 78–86. New York: Routledge.
Roh, Yuji, Geon Heo, and Steven Euijong Whang. 2019. "A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective." IEEE Transactions on Knowledge and Data Engineering Early Access: 1–20. https://doi.org/10.1109/TKDE.2019.2946162.
Tversky, Amos. 1977. "Features of Similarity." Psychological Review 84 (4): 327–52. https://doi.org/10.1037/0033-295X.84.4.327.
Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press.
Whitt, Richard J., ed. 2018. Diachronic Corpora, Genre, and Language Change. John Benjamins Publishing Company.
Chapter 4
Machine Learning in Digital Scholarship

Andrew Janco
Haverford College

Introduction

We are entering an exciting time when research on machine learning and innovation no longer requires background knowledge in programming, mathematics, or data science. Tools like RunwayML, the Teachable Machine, and Google AutoML allow researchers to train project-specific classification and object detection models. Other tools such as Prodigy or INCEpTION provide the means to train custom named entity recognition and named entity linking models. Yet without a clear way to communicate the value and potential of these solutions to humanities scholars, they are unlikely to incorporate them into their research practices.

Since 2014, dramatic innovations in machine learning have occurred, providing new capabilities in computer vision, natural language processing, and other areas of applied artificial intelligence. Scholars in the humanities, however, are often skeptical. They are eager to realize the potential of these new methods in their research and scholarship, but they do not yet have the means to do so. They need to make connections between machine capabilities, research in the sciences, and tangible outcomes for humanities scholarship, but very often, drawing these connections is more a matter of chance than deliberate action. Is it possible to make such connections deliberately and identify how machine learning methods can benefit a scholar's research?

This article outlines a method for connecting the technical possibilities of machine learning with the intellectual goals of academic researchers in the humanities. It argues for a reframing of the problem. Rather than appropriating innovations from computer science and artificial intelligence, this approach starts from humanities-based methods and practices. This shift allows us to work from the needs of humanities scholars in terms that are familiar and have recognized value to their peers. Machines can augment scholars' tasks with greater scale, precision, and reproducibility than are possible for a single scholar alone. However, only relatively basic and repetitive tasks can presently be delegated to machines.

This article argues that John Unsworth's concept of "scholarly primitives" is an effective tool for identifying basic tasks that can be completed by computers in ways that advance humanities research (2000). As Unsworth writes, primitives are "basic functions common to scholarly activity across disciplines, over time, and independent of theoretical orientation." They are the building blocks of research and analysis. As the roots and foundations of our work, "primitives" provide an effective starting point for the augmentation of scholarly tasks. Here it is important to note that the end goal is not the automation of scholarship, but rather the delegation of appropriate tasks to machines. As François Chollet recently noted,

Our field isn't quite "artificial intelligence" — it's "cognitive automation": the encoding and operationalization of human-generated abstractions / behaviors / skills. The "intelligence" label is a category error.
(2020)

This view shifts our focus from the potential intelligence of machines towards their ability to complete useful tasks for human ends. Specifically, they can augment scholars' work by performing repetitive tasks at scale with superhuman speed and precision. I proceed from this understanding to argue for an experimental and interpretive approach to machine learning that highlights the value of the interaction between the scholar and machine rather than what machines can produce.

***

Unsworth's notion of the "scholarly primitive" takes its meaning from programming and refers to the most basic operations and data types of a programming language. Primitives form the building blocks for all other components and operations of the language. This borrowing of terminology also suggests that primitives are not universal. A sequence of characters called a string is a primitive in Python, but not in Java or C. The architecture of a language's primitives changes over time and evolves with community needs. The Python and C communities, for example, have embraced Unicode as a standard to allow strings in every human language (including emojis). Other communities continue to use a range of character encodings, which grants greater flexibility to the individual programmer and avoids the notion that there should be a common standard. For scholarship, the term offers a metaphor and point of departure. It poses a question: What are the most basic elements of scholarly research and analysis? Unsworth offers several initial examples of primitives to illustrate their value without a claim that they are comprehensive, including discovering, annotating, comparing, referring, sampling, illustrating, and representing. These terms offer a "list of functions (recursive functions) that could be the basis for a manageable but also useful tool-building enterprise in humanities computing." Primitives can thus guide us in the creation of computational tools for scholarship.

For example, with the primitive of comparison, a scholar might study different editions of a text, searching for similarities and differences that often lead to new insights or highlight ideas that would otherwise be taken for granted. As a tool, comparison can (but does not always) reveal new information. For an assignment in graduate school, I compared a historical calendar that showed the days of the week against entries in Stalin's appointment book. The simple juxtaposition revealed that none of Stalin's appointments were on a Sunday. This example raises questions for further investigation and interpretation. If Stalin was an atheist who worked at all times of the day and night, why wouldn't he schedule meetings on Sundays? Perhaps it was a legacy from Stalin's youth spent in seminary? Is there a similar pattern in other periods of Stalin's life? The craft of humanities research relies on many such simple initial queries. It should be noted that these little experiments are just the beginning of a research project. Nonetheless, the utility of comparison is clear. If anything, it seems so basic as to go unnoticed. This particular comparison offered an insight and new knowledge that led to further research questions.

Such beginnings are often a matter of luck. However, machine learning offers an opportunity to increase the dimensionality of comparisons. The similarities and differences between two editions of a text can easily be quantified using Levenshtein distance.1 However, that will only capture the differences at the level of characters on a page.

1 Named after the Soviet mathematician Vladimir Levenshtein, Levenshtein distance uses the number of changes that would be needed to make two objects identical as a measure of their similarity.
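For readers who want to see the measure itself, a compact sketch follows; libraries such as python-Levenshtein provide optimized versions, and the example strings are mine.

```python
# A compact implementation of Levenshtein distance (see note 1):
# counting the single-character edits needed to turn one string
# into another, via the standard dynamic-programming recurrence.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```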
With machine learning, we can train embeddings that account for semantics, authors, time periods, genders, and other features of a text and its contents simultaneously. We can quantify similarity in new ways that facilitate new forms of comparison. This approach builds on the original meaning and purpose of comparison as a form of "scholarly primitive," but opens additional directions for research and opportunities for insights. Rather than relying on happenstance or intuition to find productive comparisons, we can systematically search and compare research materials.

The second "scholarly primitive" that lends itself well to augmentation is annotation. This activity takes different forms across disciplines. A literary scholar might underline notable sections of a text or write a note in the margins. A historian transcribes information from an archival source into a notebook. At their core, these actions add observations and associations to the original materials. Those steps in the research process are the first, most basic step that connects information in a source to a larger set of research materials. We add context and meaning to materials that make them part of a larger collection.

When working with texts or images, machine learning models are presently capable of making simple annotations and associations. For example, named entity recognition (NER) models are able to recognize person names, place names, and other key words in text. Each label is an annotation that makes a claim about the content of the text. "Steamboat Springs" or "New York City" are linked to an entity called PLACE. Once again, we are speaking about the most basic first steps that scholars perform during research. I know that Steamboat Springs is a place. It's where I grew up. However, another scholar, one less versed in small mountain towns in Colorado, might not recognize the town name. They might identify it as a spring or a ski resort; perhaps a volcanic field in Nevada. The idea of "scholarly primitives" forces us to confront the importance of domain knowledge and the role that it plays in the interpretation of materials. To teach a machine to find entities, we must first explain everything in very specific terms. We can train the machine to use surrounding contextual information in order to predict — correctly — that "Steamboat Springs" refers to a town, a spring, or a ski resort.

As part of a project with Philip Gleissner, I trained a model that correctly identifies Soviet journal names in diary entries. For instance, the machine uses contextual clues to identify when the term Volga refers to the journal by that name and not to the river or the automobile. Where is the mention of "October" a journal name and not a month, a factory name, or the revolution? The trained model makes it possible to identify references to journals in a corpus of over 400,000 diary entries. This in turn makes it possible to research the diaries with a focus on reader reception. Normally, this would be a laborious and time-consuming task. Each time the machine predicts an entity in the text, it adds annotations. What was simply text is now marked as an entity.
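The mechanics of machine annotation can be seen with spaCy's stock English pipeline. The diary project described above required a custom-trained model; this off-the-shelf sketch only knows generic entity types and would not recognize Soviet journal names.

```python
# A quick sketch of predicting and reading entity annotations with
# spaCy's small pretrained English model (not the custom model
# described in the chapter).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I grew up in Steamboat Springs, far from New York City.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g., Steamboat Springs -> GPE
```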
As part of this project, we had to define the relevant entities, create training data, and train the model to accomplish a specific task. This process has tangible value for scholarship because it forces us to break down complicated research processes into their most basic tasks and processes.

As noted before, annotation can be an act of association and linking. Natural language processing is capable of not only recognizing entities in a text, but also associating that text with a record in a knowledge base. This capability is called named entity linking. Using embeddings, a statistical language model can not only predict that "Steamboat Springs" is a town, but that it is a specific town with the record Q984721 in Wikidata. This association opens a wealth of contextual information about the place, including its population, latitude and longitude, and elevation. A scholar might have ample knowledge and experience reading literature — specifically, Milton. A machine does not, but it has access to context information that enriches analysis and permits associations. The result is a reading of a literary work that accounts for contextual knowledge. To be sure, named entity linking is not a replacement for domain knowledge. However, it is able to augment a scholar's contextual knowledge of materials and make that information available for study during research. At this point, we are asking the machine not only to sort or filter data, but to reason actively about its contents.

Machine learning offers the potential to automate humanities annotation tasks at scale. This is true of basic tasks, such as recognizing that a given text is a letter. It is also true of object recognition tasks, such as identifying a state seal in a letterhead or other visual attributes. A Haverford College student was doing research on documents in a digital archive of more than three thousand case investigations of disappeared persons during the Guatemalan Civil War, which we are building with the Grupo de Apoyo Mutuo (GAM). They noticed that many of the documents were signed with a thumbprint. The student and I trained an image classification model to identify those documents, thus providing the capability to search the entire collection of documents for this visual attribute. The thumbprints provided a proxy for literacy and allowed the student to study the collection in new ways. Similarly, documents containing the state seal of Guatemala are typically letters from the government in reply to GAM's requests for information about disappeared persons.

At present, several excellent tools exist to facilitate machine annotation of images and texts. Google's Teachable Machine offers an intuitive web application that humanities faculty and students can use to train classification models for images, sounds, and poses. To take the example above, the user would upload images of correspondence. They would then upload images of documents that are not letters.2 Once training begins, a base model is loaded and trained on the new categories. Because the model already has existing training on image categories, it is able to learn the new category with only a few examples. This process is called transfer learning.
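A compact sketch of that recipe in Keras follows; the folder layout, image size, and choice of MobileNetV2 as the pretrained base are placeholder assumptions of mine, not details from the Teachable Machine or the GAM project.

```python
# A sketch of transfer learning for a binary "letter vs. not letter"
# classifier: freeze a pretrained base, train only a small new head.
# Expects documents/letter/*.jpg and documents/other/*.jpg.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pretrained visual features frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # to [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # letter or not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

data = tf.keras.utils.image_dataset_from_directory(
    "documents", image_size=(224, 224), batch_size=16)
model.fit(data, epochs=3)
```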
For more advanced tasks, Google offers AutoML Vision and Natural Language, which are able to process large collections of text or images and to deploy trained models using Google cloud infrastructure. Similar products are available from Amazon, IBM, and other companies. Runway ML offers a locally installed program with more advanced capabilities than the Teachable Machine. Runway ML works with a wide range of machine learning models and is an excellent way for scholars to explore their capabilities without having to write code.3 The accessibility of tools like Runway allows for low-stakes experimentation and exploration. It is also a particularly good way for scholars to explore new methods and discover new materials.

For Unsworth, discovery is largely the process of identifying new resources. We can find new sources in a library catalog, on the shelf, or in a conversation. These activities require a human in the loop because it is the person's incomplete knowledge of a source that makes it a "discovery" when found. Given that machines reason about the content of text and images in ways that are quite unlike those of humans, machine learning opens new possibilities for discovery. When it comes to the differences in our own habits of mind and the computational processes of artificial networks, we may speak of "neurodiversity." Scholars can benefit from these differences, since the strengths of machine thinking complement our needs.

Machine learning models offer a variety of ways to identify similarity and difference within research materials. Yale's PixPlot, for example, uses a convolutional network to train image embeddings which are then plotted relative to one another in two-dimensional space with a stochastic neighbor embedding algorithm (t-SNE) (Duhaime n.d.).4 PixPlot creates a striking visualization of hundreds or thousands of images, which are organized and clustered by their relative visual similarity. As a research tool, PixPlot and similar projects offer a quick means to identify statistically relevant similarities and clusters. This visualization reveals what patterns are most evident to the machine and provides a discovery tool for associations that might not be evident to a human researcher. Ben Schmidt has applied a comparable process to "machine read" and visualize fourteen million texts in the HathiTrust (n.d., 2018).5 Using the relative co-occurrence of words in a book, Schmidt is able to train book embeddings. Schmidt's vectors provide an original way to organize and label texts based purely on the machine's "reading" of a book. These machine-generated labels and clusters can be compared against human-generated metadata. The value of this work is the human investigation of what machine models find significant in a collection of research materials.

2 In the Google Cloud Terms of Service there is specific assurance that your data will not be shared or used for any other purpose than the training of the model. More expert analysis may find concerns, and caution is always warranted. At present, there seems to be no more risk in using cloud services for ML tasks than there are for using cloud services more generally. See https://cloud.google.com/terms/.
3 Teachable Machine, https://teachablemachine.withgoogle.com/; Google AutoML, https://cloud.google.com/automl/; RunwayML, https://runwayml.com/.
4 See also https://artsexperiments.withgoogle.com/tsnemap/.
5 At time of writing, Schmidt's digital monograph Creating Data (n.d.) is a work in progress, with most sections empty until the official publication.
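Condensed to its core, a PixPlot-style pipeline looks roughly like the sketch below (my own illustration, not the project's code), with random tensors standing in for digitized images.

```python
# A condensed, PixPlot-style sketch: embed images with a pretrained
# CNN, then map the embeddings to 2-D with t-SNE so that visually
# similar images land near one another in the plot.
import tensorflow as tf
from sklearn.manifold import TSNE

cnn = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", weights="imagenet")
images = tf.random.uniform((200, 224, 224, 3))  # stand-in for scans
embeddings = cnn.predict(images)                # shape (200, 1280)

coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
print(coords.shape)  # (200, 2): x, y positions for plotting
```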
For example, with topic modeling, a scholar must work out what a particular algorithm has identified as a statistically significant topic by interpreting a cryptic chain of words. The topic "menu, platter, coffee, ashtray" is likely related to a diner. In these efforts, Scattertext offers an effective tool to visualize what terms are most distinctive of a text category. In a given corpus of text, I can identify which words are most exemplary of poetry and which words are most exemplary of prose. Scattertext creates a striking and useful visualization, or it can be used in the terminal to process large collections of text.
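A tiny scikit-learn illustration shows why the interpretive step is needed: the model returns only ranked word lists, and naming the "diner" topic remains the scholar's job. The toy documents below are mine.

```python
# A minimal topic-modeling sketch: LDA over toy documents, printing
# the top words of each topic; the labels are left to the reader.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["menu platter coffee ashtray waitress",
        "coffee menu counter platter diner",
        "sonnet meter rhyme stanza couplet",
        "stanza rhyme sonnet verse meter"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}:", ", ".join(top))
```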
Conclusion

As a conceptual tool, "scholarly primitives" has considerable promise to connect the intellectual goals of academic researchers in the humanities with the technical possibilities of machine learning. Rather than focusing on the capabilities of machine learning methods and the priorities of machine learning researchers, this method offers a means to build from the existing research practices of humanities scholars. It allows us to identify what kinds of tasks would benefit from being augmented. Using "primitives" shifts the focus away from large abstract goals, such as research findings and interpretive methods, to micro-methods and actions of humanities research. By augmenting these activities, we are able to benefit from the scale and precision afforded by computational methods, as well as the valuable interplay between scholars and machines as humanities research practices are made explicit and reproducible.

References

Chollet, François. 2020. "Our Field Isn't Quite 'Artificial Intelligence' — It's 'Cognitive Automation': The Encoding and Operationalization of Human-Generated Abstractions / Behaviors / Skills. The 'Intelligence' Label Is a Category Error." Twitter, January 6, 2020, 10:45 p.m. https://twitter.com/fchollet/status/1214392496375025664.
Duhaime, Douglas. n.d. "PixPlot." Yale DHLab. Accessed July 12, 2020. https://dhlab.yale.edu/projects/pixplot/.
Schmidt, Benjamin. n.d. "A Guided Tour of the Digital Library." In Creating Data: The Invention of Information in the American State, 1850-1950. http://creatingdata.us/datasets/hathi-features/.
———. 2018. "Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries." Journal of Cultural Analytics, October. https://doi.org/10.22148/16.025.
Unsworth, John. 2000. "Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?" Paper presented at the Symposium on Humanities Computing: Formal Methods, Experimental Practice, King's College, London, May 2000. http://www.people.virginia.edu/~jmu2m/Kings.5-00/primitives.html.

Chapter 5
Cultures of Innovation: Machine Learning as a Library Service

Sue Wiegand
Saint Mary's College

Introduction

Libraries and librarians have always been concerned with the preservation of knowledge. To this traditional role, librarians in the 20th century added a new function—discovery—teaching people to find and use the library's collected scholarship. Information Literacy, now considered the signature pedagogy in library instruction, evolved from the previous Bibliographic Instruction. As Digital Literacy, the next stage, develops, students can come to the library to learn how to leverage the greatest strengths of Machine Learning. Machines excel at recognizing patterns; researchers at all levels can experiment with innovative digital tools and strategies, and build 21st century skill sets. Librarian expertise in preservation, metadata, and sustainability through standards can be leveraged as a value-added service. Leading-edge librarians now invite all the curious to benefit from the knowledge contained in the scholarly canon, accessible through libraries as curated living collections in multiple formats at distributed locations, transformed into new knowledge using new ways to visualize and analyze scholarship. Library collections themselves, including digitized, unique local collections, can provide the data for new insights and ways of knowing produced by Machine Learning.

The library could also be viewed as a technology sandbox, a place to create knowledge, connect researchers, and bring together people, ideas, and new technologies. Many libraries are already rising to this challenge, working with other cultural institutions in creating a culture of innovation as a new learning paradigm, exemplified by Machine Learning instruction and technology tool exploration.

Library Practice

The role of the library in preserving, discovering, and creating knowledge continues to evolve. Originally, libraries came into being as collections to be preserved, managed, and disseminated, a central repository of knowledge, possibly for political reasons (Ryholt and Barjamovic 2019, 1–2). Libraries founded by scholars and devoted to learning came later, during the Middle Ages (Casson 2001, 145). In more recent times, librarians began "[c]ollecting, organizing, and making information accessible to scholars and to citizens of a democratic republic" based on values developed during the Enlightenment (Bivens-Tatum 2012, 186).

Bibliographic Instruction in libraries, and later Information Literacy, embodied the idea of learning in the library as the next step beyond collecting, with librarians instructing on information infrastructure with the goal of empowering library users to find, evaluate, and use scholarly information in print and digital formats, with an emphasis on privacy and intellectual freedom as core library values.
Now, librarians are also contributing to and participating in the learning enterprise by partnering with the disciplines to produce new knowledge. This final step of knowledge creation in the library completes the scholarly communications cycle of building on previous scholarship—"standing on the shoulders of giants."

One way to cultivate innovation in libraries is to include Machine Learning in the library's array of tools, resources, and services, both behind-the-scenes and public-facing. Librarians are expert at developing standards, preserving the scholarly record, and refining metadata to enhance interdisciplinary discovery of research, scholarship, and creative works. Librarian expertise could extend far beyond local library collections to a global perspective and a normative practice of participating at scale in innovative emerging technologies such as Machine Learning.

For instance, citation analysis, both of prospective collections for the library to collect and of the institution's research outputs, would provide valuable information for further collection development and for developing researchers' toolkits. Machine Learning, with its predilection for finding patterns, could reveal gaps in the literature and open up new questions to be answered, solving problems and leading to innovation. As one example, Yewno, a multi-disciplinary platform that uses Machine Learning to help combat "Information Overload," advertises that it "helps researchers, students, and educators to deeply explore knowledge across interdisciplinary fields, sparking new ideas along the way…" and "makes [government] information accessible by breaking open silos and comprehending the complicated interconnections across agencies and organizations," among other applications to improve discovery (Yewno n.d.). Also, in 2019, the Library of Congress hosted a Summit as "part of a larger effort to learn about machine learning and the role it could play in helping the Library of Congress reach its strategic goals, such as enhancing discoverability of the Library's collections, building connections between users and the Library's digital holdings, and leveraging technology to serve creative communities and the general public" (Jakeway 2020). Integration of Machine Learning technologies is already starting at high levels in the library world.

New Services

A focus on Machine Learning can inspire new library services to enhance teaching and learning. Connecting people with ideas and with technology enables library virtual spaces to be used as a learning service by networking researchers at all levels in the enterprise of knowledge creation. Finding gaps in the literature would be a helpful first step in new library discovery tools. One way this could be done is through a "Researchers' Workstation," an end-to-end toolkit that might start by using Machine Learning tools to automate alerts of new content in a narrow area of interest and help researchers at all levels find and focus on problem-solving. A Researchers' Workstation could contain a collection of analytic tools and learning modules to guide users through the phases of discovery. Then, managing citations would be an important step in the process—storing, annotating, and sorting out the most relevant. Starting research reports, keeping lab notebooks, finding datasets, and preserving the researcher's own data are all relevant to the final results.
A collaboration tool would enable researchers to find others with similar interests and share data or work collaboratively from anywhere, asynchronously. Having all these tools in one serendipitous virtual place is an extension of the concept of the library as the physical place to start research and scholarship. It is merely the containers of knowledge that are different.

Some of this functionality exists already, both in Open Source software such as Zotero for citation management, and in proprietary tools that combine multiple functions, such as Mendeley from Elsevier.1 Other commercial publishers are developing tools to enable researchers to work within their proprietary platforms, from the point of searching for ideas and finding research gaps through the process of writing and submitting finished papers for publication. The Coalition of Open Access Repositories (COAR) is similarly developing "next generation repositories" software integrating end-to-end tools for the Open Access literature archived in repositories, to "facilitate the development of new services on top of the collective network, including social networking, peer review, notifications, and usage assessment" (Rodrigues et al. 2017, 5).

What else might a researcher want to do that the library could include in a Researchers' Workstation? Finding, writing, and keeping track of grants could be incorporated at some level. Generating a timeline might be helpful, and infographics and data visualizations could improve research communication and even help make the case for the importance of the study with others, especially the public and funders. Project management tools might be welcomed by some researchers, too. Finally, when it's time to submit the idea (whether at the preliminary or preprint stage) to an ArXiv-like repository or an institutional repository, as well as to journals of interest (also identified through Machine Learning tools), the process of submission, peer review, revision, and re-submitting could be done seamlessly. The tools and functions in the Workstation would ideally be modular, interoperable, and easy to learn and use, as well as continuously updated. The Workstation would be a complete ecosystem for the research cycle—saving time in the Scholarly Communications process and providing one place to go to for discovery, literature review, data management, collaboration, preprint posting, peer review, publication, and post-print commenting.2

1 See https://www.zotero.org and https://www.mendeley.com.
2 In 2013, I wrote a blog post that mentions the idea (Wiegand).

Collections as Data, Collections as Resources

Exemplified by the literature search, which now includes a myriad of Open content on a global basis, collections are the area that provides the greatest scope for library Machine Learning innovations to date, both applied and basic/theoretical. Especially if the pathway to using the expanded collections is clear and coherent, and the library provides instruction on why and how to use the various tools to save time and increase the impact of research, researchers at all levels will benefit from partnering with librarians for a more comprehensive view of current knowledge in an area.
The Always Already Computational: Collections as Data final report and project deliverables and Collections as Data: Part to Whole Project were designed to "develop models that support collections as data implementation and holistic reconceptualization of services and roles that support scholarly use…." The Project specifically seeks "to create a framework and set of resources that guide libraries and other cultural heritage organizations in the development, description, and dissemination of collections that are readily amenable to computational analysis" (Padilla et al. 2019).

As a more holistic approach to data-driven scholarship, these resources aim to provide access to large collections to enable computational use on the national level. Some current library databases have already built in this kind of functionality. JSTOR, for example, will provide up to 25,000 documents (or more at special request) in a dataset for analysis.3 Clarivate's Content as a Service provides Web of Science data to accommodate multiple purposes.4 Besides the many freely available bibliodata sources, researchers can sign up for developer accounts in databases such as Scopus to work with datasets for text mining and computational analysis.5 Using library-licensed collections as data could allow researchers to save time in reading a large corpus, stay updated on a topic of interest, analyze the most important topics in a given time period, confirm gaps in the research literature for investigation, and increase the efficiency of sifting through massive amounts of research in, for instance, the race to develop a COVID-19 vaccine (Ong 2020; Vamathevan 2019).

3 See https://www.jstor.org/dfr/about/dataset-services.
4 See https://clarivate.com/search/?search=computational%20datasets.
5 See https://dev.elsevier.com/ and https://guides.lib.berkeley.edu/text-mining.
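As one concrete example of the Scopus route mentioned above, bibliographic records can be pulled for text mining with a few lines of Python once a developer key is obtained. The endpoint and header below follow Elsevier's public documentation, but field names and quotas should be checked against the current docs, and the key is a placeholder.

```python
# A bare-bones sketch of querying the Scopus Search API for records
# to text-mine; "YOUR_KEY" must be replaced with a real developer key
# from dev.elsevier.com, and the response fields should be verified
# against the current API documentation.
import requests

resp = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    params={"query": "TITLE-ABS-KEY(covid-19 AND vaccine)", "count": 25},
    headers={"X-ELS-APIKey": "YOUR_KEY", "Accept": "application/json"},
)
for entry in resp.json()["search-results"]["entry"]:
    print(entry.get("dc:title"))
```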
Learning Spaces

Machine Learning is a concept that calls out for educating library users through all avenues, including library spaces. Taking a cue from other GLAM (Galleries, Libraries, Archives, and Museums) cultural institutions, especially galleries and museums, libraries and archives could mount exhibits and incorporate learning into library spaces as a form of outreach to teach how and why using innovative tools will save time and improve efficiency. Inspirational, continuously updating dashboards and exhibits could show progress and possibilities, while physical and virtual tutorials might provide a game-like interface to spark creativity. Showcasing scholarship and incorporating events and speakers help create a new culture of ideas and exploration. Events bring people together in library spaces to network for collaborative endeavors. As an example, the Cleveland Museum of Art is analyzing visitor experiences using an ArtLens app to promote its collections.6 The Library of Congress, as mentioned, hosted a summit that explored such topics as building Machine Learning literacy, attracting interest in GLAM datasets, operationalizing Machine Learning, crowdsourcing, and copyright implications for the use of content. As another example, in 2017 the United Kingdom's National Archives attempted to demystify Machine Learning and explore ethics and applications such as topic modeling, which was used

to find key phrases in Discovery record descriptions and enable innovative exploration of the catalogue; and it was also deployed to identify the subjects being discussed across Cabinet Papers. Other projects included the development of a system that found the most important sentence in a news article to generate automated tweeting, while another team built a system to recognise computer code written in different programming languages — this is a major challenge for digital preservation. (Bell 2018)

6 See https://www.clevelandart.org/art-museums-and-technology-developing-new-metrics-measure-visitor-engagement and https://www.clevelandart.org/artlens-gallery/artlens-app.

Finally, the HG Contemporary Gallery in Chelsea, in 2019, mounted an exhibit that utilized a "machine-learning algorithm that did most of the work" (Bogost 2019).

Sustainable Innovation

Diversity, equity, and inclusion (DEI) concerns with the scholarly record, and increasingly with recognized biases implicit in algorithms, can be addressed by a very intentional focus on the value of differing perspectives in solving problems. Kat Holmes, an inclusive design expert previously at Microsoft and now a leading user experience designer at Google, urges a framework for inclusivity that counteracts bias with different points of view by recognizing exclusion, learning from human diversity, and bringing in new perspectives (Bedrossian 2018). Making more data available, and more diverse data, will significantly improve the imbalance perpetuated by a traditional-only corpus. In sustainability terms, Machine Learning tools must be designed to continuously seek to incorporate diverse perspectives that go beyond the traditional definitions of the scholarly canon if they are to be useful in combating bias. Collections used as data in Machine Learning might undergo analysis by researchers, including librarian researchers, to determine the balance of content. Library subject headings should be improved to better reflect the diversity of human thought, cultures, and global perspectives.

Streamlining procedures is to everyone's benefit, and saving time is universally desired. Efficiency won't fix the time crunch everyone faces, but with too much to do and too much to read, information overload is a very real threat to advancing the research agenda and confronting a multitude of escalating global problems. Machine Learning techniques, applied at scale to large corpora of textual data, could help researchers pinpoint areas where the human researcher should delve more deeply to eliminate irrelevant sources and hone in on possible solutions to problems. One instance—a new service, Scite.ai, "can automatically tell readers whether papers have been supported or contradicted by later academic work" (Khamsi 2020). WHO (World Health Organization) is providing a Global Research Database that can be searched or downloaded.7 In research on self-driving vehicles, a systematic literature review found more than 10,000 articles, an estimated year's worth of reading for an individual. A tool called Iris.ai allowed groupings of this archive by topic and is one of several "targeted navigation" tools in development (Extance 2020).

7 See https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov.
Working together as efficiently as possible is the only way to move ahead, and Machine Learning concepts, tools, and techniques, along with training, can be applied to increasingly large textual datasets to accelerate discovery. Machine Learning, like any other technology, augments human capacities; it does not replace them.

If 10% of library resources (measured in whatever way works for each particular library), including both the time of expert librarians and staff and financial resources, were utilized for innovation, libraries would develop a virtuous, self-sustaining cycle. Technologies that are not as useful can be assessed and dropped in an agile library, the useful can be incorporated into the 90% of existing services, and the resources (people and money) repurposed. In the same way, that 10% of library resources invested into innovations such as Machine Learning, whether in library practice or in instruction and other services, will keep the program and the library fresh. Creativity is key and will be the hallmark of successful libraries in the future. Stewardship of resources such as people's skills and expertise, and strategic use of the collections budget, are already library strengths. By building out new services and tools, and instructing at all levels, libraries can reinvent themselves continuously by investing in creative and sustainable innovation, from digital and data literacy to assembling modules for a library-based, customized Researchers' Workstation that uses Machine Learning to enhance the efficiency of the scholar's research cycle.

Results and more questions

A library that adopted Machine Learning as an innovation technology would improve its practices; add new services; choose, use, and license collections differently; utilize all spaces for learning; and role-model innovative leadership. What is a library in rapidly changing times? How can librarians reconcile past identity, add value, and leverage hard-won expertise in a new environment? Change management is a topic that all institutions will have to confront as the digital age continues, as we reinvent ourselves and our institutions in a fast-paced technological world.

Value-added, distinctive, unique—these are all words that will be part of the conversation. Not only does the library add value, but librarians will have to demonstrate and quantify that value while preparing to pivot at any time in response to crises and innovative opportunities. Distinctive library resources and services that speak to the institution's academic mission and purpose will be a key feature. What does the library do that no other entity on campus can do? At each particular inflection point, how best to communicate with stakeholders about the value of the distinctive library mission? Can the library work with other cultural heritage institutions to highlight the unique contributions of all?
One possible approach—develop a library science/library studies pedagogy, as well as outreach, that encompasses the Scholarship of Teaching and Learning (SoTL) and pervades everything the library does in providing resources, services, and spaces. Emphasize that library resources help people solve multi-dimensional, complex problems, and then work on new ideas to save the time of researchers, improve discovery systems, and advocate for and facilitate Open Access and Open Source alternatives while enabling, empowering, and, yes, inspiring all users to participate in and contribute to the record of human knowledge. Librarians, as the traditional keepers of the scholarly canon in written form, have standing to do this as part of our legacy and as part of our envisioned future.

From the library users’ point of view, librarians should think like the audience we are trying to reach to answer the question—why come into the library or use the library website instead of more familiar alternatives? In an era of increasing surveillance, library tools could be better known for an emphasis on privacy and confidentiality, for instance. This may require thinking more deeply about how we use our metrics and finding other ways to show how use of the library contributes to student success. It is also important to gather quantitative and qualitative evidence from library users themselves, and to apply the feedback in an agile improvement loop.

In the case of Open Access vs. proprietary information, librarians should make the case for Open Access (OA) by advocating, explaining, and instructing library users from the first time they do literature searches to the time they are graduate students, post-docs, and faculty. Librarians should produce Open Educational Resources (OER) as well as encourage classroom faculty to adopt these tools of affordable education. Libraries also need to facilitate Open Access content from discovery to preservation by developing search tools that privilege OA, using Open Source software whenever possible. Librarians could lead the way to changing the Scholarly Communications system by emphasizing change at the citation level—encourage researchers to insist on being able to obtain author-archived citations in a seamless way, and facilitate that through development of new discovery tools using Machine Learning. Improving discovery of Open Access, as well as embarking on expanded library publishing programs and advancing academic research, might be the most important endeavors that librarians could undertake at this point in time, to prevent a repeat of the “serials crisis” that commoditized scholarly information and to build a more diverse, equitable, and inclusive scholarly record. Well-funded commercial publishers are already engaging scholars and researchers in new proprietary platforms that could lock in academia more thoroughly than “Big Deals” did, even as the paradigm shifts away from large, expensive publishers’ platforms and library subscription cancellations mount due to budget cuts and the desire to optimize value for money.

The concept of the “inside-out library” (Dempsey 2016) provides a way of thinking about opening local collections to discovery and use in order to create new knowledge through digitization and semantic linking, with cross-disciplinary technologies to augment traditional research and scholarship. Because these ideas are so new but fast-moving, librarians need to spread the word on possibilities in library publishing.
Making local collections accessible for computational research helps to diversify findings and focuses attention on larger patterns and new ideas. In 2019, for instance, the Library of Congress sought to “Maximize the Use of its Digital Collection” by launching a program “to understand the technical capabilities and tools that are required to support the discovery and use of digital collections material,” developing ethical and technological standards for automation in support of emerging research techniques and “to preprocess text material in a way that would make that content more discoverable” (Price 2019). Scholarly Communication, dissemination, and discovery of research results will continue to be an important function of the library if trusted research results are to be available to all, not just the privileged. The so-called Digital Divide isolates and marginalizes some groups and regions; libraries can be a unifying force.

An important librarian role might be to identify gaps, in research or in dissemination, and work to overcome barriers to improving highly distributed access to knowledge. Libraries specialize in connecting disparate groups. Here is what libraries can do now: instruct new researchers (from undergraduate researchers on up) in the theories, skills, and techniques needed to find, use, populate, preserve, and cite datasets; provide server space and/or Data Management services; introduce Machine Learning and text analysis tools and techniques; and provide Machine Learning and text analysis tools and/or services to researchers at all levels. Researchers are now expected or even required to provide public scholarship, i.e., to bring their research into the public realm beyond obscure research journals, and to explain and illuminate their work, connecting it to the public good, especially in the case of publicly-funded research. Librarians can and should partner in the public dissemination of research findings through explaining, promoting, and providing innovative new tools across siloed departments to catalyze cross-disciplinary research. Scholarly Communications began with books and journals shared by scholars over time; then libraries were assembled and built to contain the written record. Librarians should ensure that the Scholarly Communications and information landscape continues into the future with widely-shared, available resources in all formats, now including interactive, web-based software, embedded data analysis tools, and technical support of emerging Open Source platforms.

In addition, the flow of research should be smooth and seamless to the researcher, whether in a Researchers’ Workstation or other library tools. The research cycle should be both clearly explained and embedded in systems and tools. The library, as a central place that cuts across narrowly-defined research areas, could provide a systemic place of collaboration. Librarians, seeing the bigger picture, could facilitate research as well as disseminate and preserve the resulting data in journals and datasets. Further investigations on how researchers work, how students learn, best practices in pedagogy, and life-long learning in the library could mark a new era in librarianship, one that involves teaching, learning, and research as a self-reinforcing cycle.
Beyond being a purchaser of journals and books, libraries can expand their role in the learning process itself into a cycle of continuous change and exploration, augmented by Machine Learning.

Library Science, Research, and Pedagogy

In Library and Information Science (LIS), graduate library schools should teach about Machine Learning as a way of innovating and emphasize pervasive innovation as the new normal. Creating a culture of innovation and creativity in LIS classes and in libraries will pay off for society as a whole, if librarians promote the advantages of a culture of innovation in themselves and in library users. Subverting the stereotypes of tradition-bound libraries and librarians will revitalize the profession and our workplaces, replacing fear of change and an existential identity crisis with a spirit of creative, agile reinvention that will rise to challenges rather than seek solace in denial, whether the seemingly impossible problem is preparedness in dealing with a pandemic or creatively addressing climate change.

Academic libraries must transition from a space of transactional (one-time) actions into a transformational learning-centered user space, both physical and virtual, that offers an enhanced experience with teaching, learning, and research—a way to re-center the library as the place to get answers that go beyond the Internet. Libraries add value: do faculty, students, and other patrons know, for instance, that when they find the perfect book on a library shelf through browsing (or on the library website with virtual browsing), it is because a librarian somewhere assigned it a call number to group similar books together? The next step in that process is to use Machine Learning to generate subject headings, and also to show the librarians accomplishing that; a sketch of this idea appears at the end of this section. This process is being investigated for different types of works, from fiction to scientific literature (Golub 2006; Joorabchi 2011; Wang 2009; Short 2019). Cataloging, metadata, and enabling access through shared standards and Knowledge Bases are all things librarians do that add value for library users overwhelmed with Google hits, and they are worthy of further development, including in an Open environment.

Preservation is another traditional library function, and it now includes born-digital items and digitization of special collections/archives, expanding the library’s role. Discovery will be enhanced by Artificial/Augmented Intelligence and Machine Learning techniques. All of this should be taught in library schools, to build a new library culture of innovation and problem-solving beyond just providing collections and information literacy instruction. The new learning paradigm is immersive in all senses, and the future, as reflected in library transformation and partnerships with researchers, galleries, archives, museums, citizen scientists, hobbyists, and life-long learners re-tooling their careers and lives, is bright. LIS programs need to reflect that.

To promote learning in libraries, librarians could design a “You belong in the Library” campaign to highlight our diverse resources and new ways of working with technology, inviting participation in innovative technologies such as Machine Learning in an increasingly rare public, non-commercial space—telling why, showing how. In many ways, libraries could model ways to achieve academic success and life success, updating a traditional role in educating, instructing, preparing for the future, explaining, promoting understanding, and inspiring.
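As promised above, here is a minimal sketch of the ML-generated subject heading idea. It assumes scikit-learn is available; the catalog records, headings, and the uncataloged title are invented placeholders, and the studies cited above use considerably richer models and data than this.

```python
# A minimal sketch of ML-suggested subject headings, assuming
# scikit-learn is installed. The records and headings below are
# invented placeholders, not real catalog data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Titles from already-cataloged works, with the headings a
# librarian assigned to them.
records = [
    "Introduction to machine learning with applications in text mining",
    "A social history of public libraries in the nineteenth century",
    "Deep neural networks for image classification",
    "Cataloging practices and metadata standards in academic libraries",
]
headings = [
    "Machine learning",
    "Libraries--History",
    "Machine learning",
    "Cataloging",
]

# TF-IDF features plus a simple classifier stand in for the more
# sophisticated models used in the studies cited above.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(records, headings)

# Suggest a heading for an uncataloged work; a librarian reviews it.
print(model.predict(["Support vector machines for document categorization"]))
```

Such a classifier only proposes candidate headings; keeping the librarian in the loop to review each suggestion, and showing patrons that work, is what makes the workflow trustworthy.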
Discussion

The larger questions now are: who is heard, and who contributes? How are gaps, identified in needs analysis, reduced? What are sources of funding for libraries to develop this important work and not leave it to commercial services? Library leadership and innovative thinking must converge to devise ways for libraries to bring people together, producing more diverse, ethical, innovative, inclusive, practical, transformative, and novel library services and physical and virtual spaces for the public good.

Libraries could start with analyses of needs—what problems could be solved with more effective literature searches? What research could fill gaps and inform solutions to those needs? What kind of teaching could help build citizens and critical thinkers, rather than simply encouraging consumption of content? Another need is to diversify collections used in Machine Learning, gathering cultural perspectives that reflect true diversity of thought through inclusion. All voices should be heard and empowered. Librarians can help with that. A Researchers’ Workstation could bring together an array of tools and content to allow not only the organization, discovery, and preservation of knowledge, but also to facilitate the creation of new knowledge through the sustainable library, beyond the literature search.

The world is converging toward networking and collaborative research all in one place. I would like the library to be the free platform that brings all the others together. Coming full circle, my vision is that when researchers want to work on their research, they will log on to the library and find all they need…. The library is the one place … to get your scholarly work done. (Wiegand 2013)

The library as a platform should be a shared resource—the truest library value.

Here is a scenario. Suppose, for example, scholars wish to analyze the timeline of the beginning of the Coronavirus crisis. Logging on to the library’s Researchers’ Workstation, they start with the Discovery module to generate a corpus of research papers from, say, December 2019 to June 2020. Using the Machine Learning function, they search for articles and books, looking for gaps and ideas that have not yet been examined in the literature. They access and download full-text, save citations, annotate and take notes, and prepare a draft outline of their research using a word processing function, writing and citing seamlessly. A Methods (protocols) section could help determine the most effective path of the prospective research.

Then, they might search for the authors of the preprints and articles they find interesting, check the authors’ profiles, and contact some of them through the platform to discern interest in collaborating. The profile system would list areas of interest, current projects, availability for new projects, etc. Using the Project Management function, scholars might open a new workspace where preliminary thoughts could be shared, with attribution and acknowledgement as appropriate, and a peer review timeline chosen to invite comments while authors can still claim the idea as their own.

If the preprint is successful, and the investigation shows promise after the results are in, the scholars could search for an appropriate journal for publication, the version of record. The author, with researcher ID (also contained in his/her profile), has the article added to the final published section of the profile, with a DOI.
The journal showcases the article and sends out table-of-contents alerts and press releases, where it can be picked up by news services and authors can be invited to comment publicly. Each institution would celebrate its authors’ accomplishments, use the Researchers’ Workstation to determine impact and metrics, and promote the institution’s research progress. Finally, the article would be preserved through the library repository and also through initiatives such as LOCKSS. Future scholars would find it still available and continue to discover and build on the findings presented. All of this and more would be done through the library.

Conclusion

Machine Learning as a library service can inspire new stages of innovation, energizing and providing a blueprint for the library future—teaching, learning, and scholarship for all. The teaching part of the equation invokes the faculty audience perspective: how can librarians help classroom faculty to integrate both library instruction and library research resources (collections, expertise, spaces) into the educational enterprise (Wiegand and Kominkiewicz 2016)? How can librarians best teach skills, foster engagement, and create knowledge to make a distinctive contribution to the institution? Our answers will determine the library’s future at each academic institution. Machine Learning skills, engagement, and knowledge should fit well with the library’s array of services.

Learning is another traditional aspect of library services, this time from the student point of view. The library provides collections—multimedia or print on paper, digital and digitized, proprietary and open, local, redundant, rare, unique. The use of collections is taught by both librarians and disciplinary faculty in the service of learning, including life-long learning for non-academic, everyday knowledge. Students need to know more about Machine Learning, from data literacy to digital competencies, including concerns about privacy, security, and fake news across the curriculum, while learning skills associated with Machine Learning. In addition, through Open Access, library “collections” now encompass the world beyond the library’s physical and virtual spaces.

Then, as libraries, like all digitally-inflected institutions, develop “change management” strategies, they need to double down on these unique affordances and communicate them to stakeholders. The most critical strategy is embedding the Scholarship of Teaching and Learning (SoTL) in all aspects of the library workflow. Instead of simply advertising new electronic resources or describing Open Access versus proprietary resources, libraries should broadly embed the lessons of copyright, surveillance, and reproducibility into patron interactions, from the first undergraduate literature search to the faculty research consultation. Then, they should reinforce those lessons by emphasizing open access and data mining permissions in their discovery tools. These are aspects of the scholarly research cycle over which libraries have some control. By exerting that control, libraries will promote a culture that positions Machine Learning and other creative digital uses of library data as normal, achievable parts of the scholarly process.

To complete the Scholarly Communications lifecycle, support for research, scholarship, and creative works is increasingly provided by libraries as a springboard to the creation of knowledge, the library’s newest role. This is where Machine Learning as a new paradigm fits in most compellingly as an innovative practice.
Libraries can provide not only associated services such as Data Management of the datasets resulting from analyzing huge textual corpora, but also databases of proprietary and locally-produced content from inter-connected, cooperating libraries on a global scale. Researchers—faculty, students, and citizens (including alumni)—will benefit from crowdsourcing and citizen science while gaining knowledge and contributing to scholarship. But perhaps the largest benefit will be learning by doing, escaping the “black box” of blind consumerism to see how algorithms work and thus developing a more nuanced view of reality in the Machine Age.

References

Bedrossian, Rebecca. 2018. “Recognizing Exclusion is the Key to Inclusive Design: In Conversation with Kat Holmes.” Campaign (blog). July 25, 2018. https://www.campaignlive.com/article/recognizing-exclusion-key-inclusive-design-conversation-kat-holmes/1488872.

Bell, Mark. 2018. “Machine Learning in the Archives.” National Archives (blog). November 8, 2020. https://blog.nationalarchives.gov.uk/machine-learning-archives/.

Bivens-Tatum, Wayne. 2012. Libraries and the Enlightenment. Los Angeles: Library Juice Press. Accessed January 6, 2020. ProQuest Ebook Central.

Bogost, Ian. 2019. “The AI-Art Gold Rush is Here.” The Atlantic. March 6, 2019. https://www.theatlantic.com/technology/archive/2019/03/ai-created-art-invades-chelsea-galler.

Casson, Lionel. 2001. Libraries in the Ancient World. New Haven: Yale University Press. Accessed January 6, 2020. ProQuest Ebook Central.

Dempsey, Lorcan. 2016. “Library Collections in the Life of the User: Two Directions.” LIBER Quarterly 26: 338–359. https://doi.org/10.18352/lq.10170.

Extance, Andy. 2018. “How AI Technology Can Tame the Scientific Literature.” Nature 561: 273–274. https://doi.org/10.1038/d41586-018-06617-5.

Golub, K. 2006. “Automated Subject Classification of Textual Web Documents.” Journal of Documentation 62: 350–371. https://doi.org/10.1108/00220410610666501.

Jakeway, Eileen. 2020. “Machine Learning + Libraries Summit: Event Summary now live!” The Signal (blog), Library of Congress. February 12, 2020. https://blogs.loc.gov/thesignal/2020/02/machine-learning-libraries-summit-event-summary-now-live/.

Joorabchi, Arash and Abdulhussin E. Mahdi. 2011. “An Unsupervised Approach to Automatic Classification of Scientific Literature Utilising Bibliographic Metadata.” Journal of Information Science. https://doi.org/10.1177/016555150000000.

Khamsi, Roxanne. 2020. “Coronavirus in context: Scite.ai Tracks Positive and Negative Citations for COVID-19 Literature.” Nature. https://doi.org/10.1038/d41586-020-01324-6.

Padilla, Thomas, Laurie Allen, Hannah Frost, et al. 2019. “Final Report — Always Already Computational: Collections as Data.” Zenodo. May 22, 2019. https://doi.org/10.5281/zenodo.3152935.

Price, Gary. 2019. “The Library of Congress Posts Solicitation For a Machine Learning/Deep Learning Pilot Program to ‘Maximize the Use of its Digital Collection.’ ” Library Journal. June 13, 2019.
Rodrigues, Eloy et al. 2017. “Next Generation Repositories: Behaviours and Technical Recommendations of the COAR Next Generation Repositories Working Group.” Zenodo. November 28, 2017. https://doi.org/10.5281/zenodo.1215014.

Ryholt, K. S. B, and Gojko Barjamovic, eds. 2019. Libraries Before Alexandria: Ancient Near Eastern Traditions. Oxford: Oxford University Press.

Vamathevan, Jessica, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Lee, Anant Madabhushi, Parantu Shah, Michaela Spitzer, and Shanrong Zhao. 2019. “Applications of Machine Learning in Drug Discovery and Development.” Nat Rev Drug Discov 18: 463–477. https://doi.org/10.1038/s41573-019-0024-5.

Wang, Jun. 2009. “An Extensive Study on Automated Dewey Decimal Classification.” Journal of the American Society for Information Science & Technology 60: 2269–86. https://doi.org/10.1002/asi.21147.

Wiegand, Sue. 2013. “ACS Solutions: The Sturm und Drang.” ACRLog (blog), Association of College and Research Libraries. November 8, 2020. https://acrlog.org/2013/04/06/acs-solutions-the-sturm-und-drang/.

Wiegand, Sue and Frances Kominkiewicz. 2016. Unpublished manuscript. “Integration of Student Learning through Library and Classroom Instruction.”

Yewno. n.d. “Yewno — Transforming Information into Knowledge.” Accessed January 6, 2020. https://www.yewno.com/.

Further Reading

Abbattista, Fabio, Luciana Bordoni, and Giovanni Semeraro. 2003. “Artificial Intelligence for Cultural Heritage and Digital Libraries.” Applied Artificial Intelligence 17, no. 8/9: 681. https://doi.org/10.1080/713827258.

Ard, Constance. 2017. “Advanced Analytics Meets Information Services.” Online Searcher 41, no. 6: 21–24.

“Artificial Intelligence and Machine Learning in Libraries.” 2019. Library Technology Reports 55, no. 1: 1–29.

Badke, William. 2015. “Infolit Land. The Effect of Artificial Intelligence on the Future of Information Literacy.” Online Searcher 39, no. 4: 71–73.

Boman, Craig. 2019. “Chapter 4: An Exploration of Machine Learning in Libraries.” Library Technology Reports 55: 21–25.

Breeding, Marshall. 2018. “Chapter 6: Possible Future Trends.” Library Technology Reports 54, no. 8: 31–32.
Dempsey, Lorcan, Constance Malpas, and Brian Lavoie. 2014. “Collection Directions: The Evolution of Library Collections and Collecting.” portal: Libraries and the Academy 14, no. 3 (July): 393–423. https://doi.org/10.1353/pla.2014.0013.

Enis, Matt. 2019. “Labs in the Library.” Library Journal 144, no. 3: 18–21.

Finley, Thomas. 2019. “The Democratization of Artificial Intelligence: One Library’s Approach.” Information Technology & Libraries 38, no. 1: 8–13. https://doi.org/10.6017/ital.v38i1.10974.

Frank, Eibe and Gordon W. Paynter. 2004. “Predicting Library of Congress Classifications From Library of Congress Subject Headings.” Journal of The American Society for Information Science and Technology 55, no. 3. https://doi.org/10.1002/asi.10360.

Geary, Daniel. 2019. “How to Bring AI into Your Library.” Computers in Libraries 39, no. 7: 32–35.

Griffey, Jason. 2019. “Chapter 5: Conclusion.” Library Technology Reports 55, no. 1: 26–28.

Inayatullah, Sohail. 2014. “Library Futures: From Knowledge Keepers to Creators.” Futurist 48, no. 6: 24–28.

Johnson, Ben. 2018. “Libraries in the Age of Artificial Intelligence.” Computers in Libraries 38, no. 1: 14–16.

Kuhlman, C., L. Jackson, and R. Chunara. 2020. “No Computation without Representation: Avoiding Data and Algorithm Biases through Diversity.” ArXiv:2002.11836v1 [cs.CY], February. http://arxiv.org/abs/2002.11836.

Lane, David C. and Claire Goode. 2019. “OERu’s Delivery Model for Changing Times: An Open Source NGDLE.” Paper presented at the 28th ICDE World Conference on Online Learning, Dublin, Ireland, November 2019. https://oeru.org/assets/Marcoms/OERu-NGDLE-paper-FINAL-PDF-version.pdf.

Liu, Xiaozhong, Chun Guo, and Lin Zhang. 2014. “Scholar Metadata and Knowledge Generation with Human and Artificial Intelligence.” Journal of the Association for Information Science & Technology 65, no. 6: 1187–1201. https://doi.org/10.1002/asi.23013.

Mitchell, Steve. 2006. “Machine Assistance in Collection Building: New Tools, Research, Issues, and Reflections.” Information Technology & Libraries 25, no. 4: 190–216. https://doi.org/10.6017/ital.v25i4.3353.

Ojala, Marydee. 2019. “ProQuest’s New Approach to Streamlining Selection and Acquisitions.” Information Today 36, no. 1: 16–17.

Ong, Edison, Mei U. Wong, Anthony Huffman, and Yongqun He. 2020. “COVID-19 Coronavirus Vaccine Design Using Reverse Vaccinology and Machine Learning.” Frontiers in Immunology 11. https://doi.org/10.3389/fimmu.2020.01581.

Orlowitz, Jake. 2017. “You’re a Researcher Without a Library: What Do You Do?” A Wikipedia Librarian (blog), Medium. November 15, 2017. https://medium.com/a-wikipedia-librarian/youre-a-researcher-without-a-library-what-do-you-do-6811a30373cd.

Padilla, Thomas. 2019. Responsible Operations: Data Science, Machine Learning, and AI in Libraries. Dublin, OH: OCLC Research. https://doi.org/10.25333/xk7z-9g97.
Plosker, George. 2018. “Artificial Intelligence Tools for Information Discovery.” Online Searcher 42, no. 3: 31–35. https://www.infotoday.com/OnlineSearcher/Articles/Features/Artificial-Intelligence-Tools-for-Information-Discovery-124721.shtml.

Rak, Rafal, Andrew Rowley, William Black, and Sophie Ananiadou. 2012. “Argo: an Integrative, Interactive, Text Mining-based Workbench Supporting Curation.” Database: the Journal of Biological Databases and Curation. https://doi.org/10.1093/database/bas010.

Schmidt, Lena, Babatunde Kazeem Olorisade, Julian Higgins, and Luke A. McGuinness. 2020. “Data Extraction Methods for Systematic Review (Semi)automation: A Living Review Protocol.” F1000Research 9: 210. https://doi.org/10.12688/f1000research.22781.2.

Schonfeld, Roger C. 2018. “Big Deal: Should Universities Outsource More Core Research Infrastructure?” Ithaka S+R. https://doi.org/10.18665/sr.306032.

Schockey, Nick. 2013. “How Open Access Empowered a 16-year-old to Make Cancer Breakthrough.” June 12, 2013. http://www.openaccessweek.org/video/video/show?id=5385115%3AVideo%3A90442.

Short, Matthew. 2019. “Text Mining and Subject Analysis for Fiction; or, Using Machine Learning and Information Extraction to Assign Subject Headings to Dime Novels.” Cataloging & Classification Quarterly 57, no. 5: 315–336. https://doi.org/10.1080/01639374.2019.1653413.

Thompson, Paul, Riza Theresa Batista-Navarro, and Georgio Kontonatsios. 2016. “Text Mining the History of Medicine.” PloS One 11, no. 1: e0144717. https://doi.org/10.1371/journal.pone.0144717.

White, Philip. 2019. “Using Data Mining for Citation Analysis.” College & Research Libraries 80, no. 1. https://scholar.colorado.edu/concern/parent/cr56n1673/file_sets/9019s3164.

Witbrock, Michael J. and Alexander G. Hauptmann. 1998. “Speech Recognition for a Digital Video Library.” Journal of the American Society for Information Science 49, no. 7: 619–32. https://doi.org/10.1002/(SICI)1097-4571(19980515)49:7<619::AID-ASI4>3.0.CO;2-
Zuccala, Alesia, Maarten Someren, and Maurits Bellen. 2014. “A Machine-Learning Approach to Coding Book Reviews as Quality Indicators: Toward a Theory of Megacitation.” Journal of the Association for Information Science & Technology 65, no. 11: 2248–60. https://doi.org/10.1002/asi.23104.

Chapter 6

Cross-Disciplinary ML Research is like Happy Marriages: Five Strengths and Two Examples

Meng Jiang
University of Notre Dame

Top Strengths in ML+X Collaboration

Cross-disciplinary research refers to research and creative practices that involve two or more academic disciplines (Jeffrey 2003; Karniouchina, Victorino, and Verma 2006). These activities may range from those that simply place disciplinary insights side by side to much more integrative or transformative approaches (Aagaard-Hansen 2007; Muratovski 2011). Cross-disciplinary research matters, because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (Urquhart et al. 2013; O’Rourke, Crowley, and Gonnerman 2016; Miller and Leffert 2018).

One of the most popular cross-disciplinary research topics/programs is Machine Learning + X (or Data Science + X). Machine learning (ML) is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. ML has been used in a variety of applications (Murthy 1998), such as email filtering and computer vision; however, most applications still fall in the domain of computer science and engineering. Recently, the power of ML+X, where X can be any other discipline (such as physics, chemistry, biology, sociology, and psychology), has been well recognized. ML tools can reveal profound insights hiding in ballooning datasets (Kohavi et al. 1994; Pedregosa et al. 2011; Kotsiantis 2012; Mullainathan and Spiess 2017).

However, cross-disciplinary research, of which ML+X is a part, is challenging. Collaborating with investigators outside one’s own field requires more than just adding a co-author to a paper or proposal. True collaborations will not always be without conflict—lack of information leads to misunderstandings. For example, ML experts may have little domain knowledge in the field of X, and researchers in X might not understand ML either. The knowledge gap limits the progress of collaborative research. So how can we start and manage successful cross-disciplinary research? What can we do to facilitate collaborative behaviors? In this essay, I will compare cross-disciplinary ML research to “happy marriages,” discussing some characteristics they share.
Specifically, I will present the top strengths of conducting cross-disciplinary ML research and give two examples based on my experience of collaborating with historians and psychologists.

Marriage is one of the most common “collaborative” behaviors. Couples expect to have happy marriages, just like collaborators expect to have successful project outcomes (Robinson and Blanton 1993; Pettigrew 2000; Xu et al. 2007). Extensive studies have revealed the top strengths of happy marriages (DeFrain and Asay 2007; Gordon and Baucom 2009; Prepare/Enrich, n.d.), which can be reflected in cross-disciplinary ML research. Here I focus on five of them:

1. Collaborators (“partners” in the language of marriage) are satisfied with communication.
2. Collaborators feel very close to each other.
3. Collaborators discuss their problems well.
4. Collaborators handle their differences creatively.
5. There is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.).

First of all, communication is the exchange of information to achieve a better understanding, and collaboration is the process of working together with another person to achieve an end goal. Effective collaboration is about sharing information, knowledge, and resources to work together through satisfactory communication. Ineffectiveness or lack of communication is one of the biggest challenges in ML+X collaboration.

Second, researchers in different disciplines can collaborate only when they recognize mutual interest and feel that the research topics they have studied in depth are very close to each other. Collaborators must be interested in solving the same, big problem.

Third, researchers in different disciplines meet different challenges through the process of collaboration. Making the challenges clear to understand and finding solutions together is the core of effective collaboration.

Fourth, collaborators must embrace their differences on concepts and methods and take advantage of them. For example, one researcher can introduce a complementary method to the mix of methods that the collaborator has been using for a long time; or one can have a new, impactful dataset and evaluation method to test the techniques proposed by the other.

Fifth, in strong collaboration, there is a balance between separateness and togetherness. Meetings are an excellent use of time for forming integrated perspectives and productive discourse around difficult decisions. However, excessive collaboration happens when researchers are depleted by too many meetings and emails; it can lead to inefficient, unproductive meetings. So it is important to find a balance.

Next, I, as a computer scientist and ML expert, will discuss two ML+X collaborative projects. ML experts bring mathematical modeling and computational methods for mining knowledge from data. The solutions usually have good generalizability; however, they still need to be tailored for specialized domains or disciplines.

Example 1: ML + History

The history professor Liang Cai and I have collaborated on an international research project titled “Digital Empires: Structured Biographical and Social Network Analysis of Early Chinese Empires.” Dr. Cai is well known for her contributions to the fields of early Chinese Empires, Classical Chinese thought (in particular, Confucianism and Daoism), digital humanities, and the material culture and archaeological texts of early China (Cai 2014).
Our collaboration explores how digital humanities expand the horizon of historical research and help visualize the research landscape of Chinese history. Historical research is often constrained by sources and the human cognitive capacity for processing them. ML techniques may enhance historians’ abilities to organize and access sources as they like. ML techniques can even create new kinds of sources at scale for historians to interpret. “The historians pose the research questions and visualize the project,” said Cai. “The computer scientists can help provide new tools to process primary sources and expand the research horizon.”

We conducted a structured biographical analysis to leverage the development of machine learning techniques, such as neural sequence labeling and textual pattern mining, which allowed classical sources of Chinese empires to be represented in an encoded way. The project aims to build a digital biographical database that sorts out different attributes of all recorded historical actors in available sources. Breaking with traditional formats, ML+History creates new opportunities and augments our way of understanding history.

First, it helps scholars, especially historians, change their research paradigm, allowing them to generalize their arguments with sufficient examples. ML techniques can find all examples in the data, where manual investigation may miss some. Also, abnormal cases can indicate a new discovery. As far as early Chinese empires are concerned, ML promises to automate mining and encoding all available biographical data, which allows scholars to change the perspective from one person to a group of persons with shared characteristics, and to shift from analyzing examples to relating a comprehensive history. Therefore, scholars can identify general trends efficiently and present an information-rich picture of historical reality using ML techniques.

Second, the structured data produced by ML techniques revolutionize the questions researchers ask, thereby changing the research landscape. Because of the lack of efficient tools, there are numerous interesting questions scholars would like to ask but cannot. For example, the geographical mobility of historical actors is an intriguing question for early China, the answer to which would show how diversified regions were integrated into a unified empire. Nevertheless, an individual historian cannot efficiently process the massive amount of information preserved in the sources. With ML techniques, we can generate fact tuples to sort out the original geographical places of all available historical actors and provide comprehensive data for historians to analyze; a sketch of this kind of tuple extraction follows below.

Figure 6.1: The graph presents a visual of the social network of officials who served in the government about 2,000 years ago in China. The network describes their relationships and personal attributes.
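To make the fact-tuple idea concrete, here is a minimal sketch of textual pattern mining for relation extraction. The regular-expression templates, sentences, and names are simplified English stand-ins invented for illustration; the project itself applies neural sequence labeling and pattern mining to classical Chinese sources, as Table 6.1 below shows.

```python
# A minimal sketch of pattern-based fact tuple extraction. The
# patterns, sentences, and names are invented English stand-ins
# for the classical Chinese templates shown in Table 6.1.
import re

# Each surface template maps to a relation type.
patterns = [
    (re.compile(r"(?P<x>\w+) was taught by (?P<y>\w+)"), "taught_by"),
    (re.compile(r"(?P<x>\w+) was born in (?P<y>\w+)"), "place_of_birth"),
    (re.compile(r"(?P<x>\w+) served as (?P<y>[\w ]+)"), "job_title"),
]

sentences = [
    "Dong was taught by Hu.",
    "Dong was born in Guangchuan.",
    "Dong served as Grand Administrator.",
]

# Scan every sentence with every pattern and collect
# (subject, relation, object) tuples for the biographical database.
tuples = []
for sentence in sentences:
    for pattern, relation in patterns:
        for match in pattern.finditer(sentence):
            tuples.append((match["x"], relation, match["y"]))

print(tuples)
# [('Dong', 'taught_by', 'Hu'), ('Dong', 'place_of_birth', 'Guangchuan'),
#  ('Dong', 'job_title', 'Grand Administrator')]
```

Hand-written patterns like these are only a starting point; the appeal of pattern mining is that frequent templates can themselves be discovered from the corpus rather than written by hand.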
Table 6.1: Examples of Chinese Text Extraction Patterns. Each textual pattern mined from the classical Chinese sources maps to an extracted relation with example tuples; the relations include “$PER_X was taught by $PER_Y on $KLG (knowledge),” “$PER_X was taught/mentored by $PER_Y,” “$PER_X taught $PER_Y,” “$PER place_of_birth $LOC,” and “$PER job_title $TIT.”

Third, the project revolutionizes our reading habits. Large datasets mined from primary sources will allow scholars to combine distant reading with the original texts. The macro picture generated from data will aid in-depth analysis of an event against its immediate context. Furthermore, graphics of social networks and common attributes of historical figures will change our reading habits, transforming linear storytelling to accommodate multiple narratives (see Figure 6.1).

Researchers from the two sides develop collaboration through the project step by step, just like developing a relationship for marriage. Ours started at a faculty gathering, from some random chat about our research. As the historian is open-minded about ML technologies and the ML expert is willing to create broader impact, we brainstormed ideas that would not have developed without taking care of the five important points:

1. Communication: With our research groups, we started to meet frequently at the beginning. We set up clear goals at the early stage, including expected outcomes, publication venues, and joint proposals for funding agencies, such as the National Endowment for the Humanities (NEH) and Notre Dame seed grant funding. Our research groups met almost twice a week for as long as three weeks.

2. Feel very close to each other: Besides holding meetings, we exchanged our instant messenger accounts so we could communicate faster than by email. We created a Google Drive space to share readings, documents, and presentation slides. We found many tools to create “tight relationships” between the groups at the beginning.

3. Discuss their problems well: Whenever we had misunderstandings, we discussed our problems. Historians learned about what a machine does, what a machine can do, and generally how a machine works toward the task. ML people learned what is interesting to historians and what kind of information is valuable. We hold the principle that if a problem exists, it makes sense; any problem that either side encounters is worth a discussion. We needed to solve problems together from the moment they became our problems.

4. Handle their differences creatively: Historians are among the few who can read and write in classical Chinese. Classical Chinese was used as the written language from over 3,000 years ago to the early 20th century. Since then, mainland China has used either Mandarin (simplified Chinese) or Cantonese, while Taiwan has used traditional Chinese. None is similar to classical Chinese at all. In other words, historians work on a language that no ML experts here, even those who speak modern Chinese, can understand. So we handle our language differences “creatively” by using the translated version as the intermediate medium. Historians have translated history books in classical Chinese into simplified Chinese so we can read the simplified version.
Here, the idea is to let the machine learning algorithms read both versions. We find that information extraction (i.e., finding relations from text) and machine translation (i.e., from classical Chinese to modern Chinese) can mutually enhance each other, which turns out to be one of our novel technical contributions to the field of natural language processing.

5. Good balance of time alone and together: After the first month, since the project goal, datasets, background knowledge, and many other aspects were clear in both sides’ minds, we had regular meetings in a less intensive manner. We met two or three times a month so that computer science students could focus on developing machine learning algorithms, and only when significant progress was made or expert evaluation was needed would we schedule a quick appointment with Prof. Liang Cai.

So far, we have published peer-reviewed papers on the topic of information extraction and entity retrieval in classical Chinese history books using ML (Ma et al. 2019; Zeng et al. 2019). We have also submitted joint proposals to NEH with the above work as preliminary results.

Example 2: ML + Psychology

I am working with Drs. Ross Jacobucci and Brooke Ammerman in psychology to apply ML to understanding mental health problems and suicidal intentions. Suicide is a serious public health problem; however, suicides are preventable with timely, evidence-based interventions. Social media platforms have been serving users who are experiencing real-time suicidal crises with hopes of receiving peer support. To better understand the helpfulness of peer support occurring online, we characterize the content of both a user’s post and the corresponding peer comments occurring on a social media platform and present an empirical example for comparison. We have designed a new topic-model-based approach to finding the topics of user and peer posts from the social media forum data; a sketch of the idea appears below. The key advantages include (i) modeling both the generative process of each type of corpus (i.e., user posts and peer comments) and the associations between them, and (ii) using phrases, which are more informative and less ambiguous than words alone, to represent social media posts and topics. We evaluated the method using data from Reddit’s r/SuicideWatch community.

Figure 6.2: Screenshot of r/SuicideWatch on Reddit.

We examined how the topics of user and peer posts were associated and how this information influenced the perceived helpfulness of peer support. Then, we applied structural topic modeling to data collected from individuals with a history of suicidal crisis as a means to validate findings. Our observations suggest that effective modeling of the association between the two lines of topics can uncover helpful peer responses to online suicidal crises, notably providing the suggestion of pursuing professional help. Our technology can be applied to “paired” corpora in many applications such as tech support forums and question-answering sites.

This project started from a talk I gave at the psychology graduate seminar. The fun thing is that Dr. Jacobucci was not able to attend the talk. Another psychology professor who attended my talk asked constructive questions and mentioned my research to Dr. Jacobucci when they met later. So Dr. Jacobucci dropped me an email, and we had coffee together. Cross-disciplinary research often starts from something that sounds like developing a relationship.
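Here is a rough sketch of the paired topic-modeling idea described above. It fits a plain word-level LDA model to each corpus and then measures how user-post topics co-occur with peer-comment topics; our actual method models phrases and the generative association between the corpora directly, and all example posts below are invented.

```python
# A rough sketch of pairing topics across two aligned corpora
# (user posts and the peer comments that answer them). Word-level
# LDA stands in for the phrase-level pairwise model described
# above; the example posts are invented.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

user_posts = [
    "feeling hopeless and alone tonight",
    "lost my job and cannot cope anymore",
    "nobody would notice if i was gone",
    "cannot sleep and the thoughts will not stop",
]
peer_comments = [  # peer_comments[i] responds to user_posts[i]
    "please call a crisis line and talk to a professional",
    "a therapist helped me through something similar",
    "i notice you and i am glad you posted",
    "try to reach out to a counselor tomorrow",
]

def doc_topics(docs, n_topics=2):
    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)  # rows: per-document topic weights

user_theta = doc_topics(user_posts)
peer_theta = doc_topics(peer_comments)

# Association between the two sets of topics: entry (i, j) is large
# when user-post topic i tends to draw peer-comment topic j.
association = user_theta.T @ peer_theta
print(np.round(association, 2))
```

Relating the two topic spaces this way is what lets one ask which kinds of posts attract which kinds of responses, such as the suggestion to pursue professional help.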
Because, again, the psychologists are open-minded about ML technologies and the ML expert is willing to create broader impact, we successfully brainstormed ideas when we had coffee, but this would not have developed into a long-term collaboration without the following efforts: (1) Communicate intensively between research groups at the early stage. We had multiple meetings a week to make the goals clear. (2) Get students involved in the process. When my graduate student received more and more advice from the psychology professors and students, the connections between the two groups became stronger. (3) Discuss the challenges in our fields very well. We analyzed together whether machine learning would be capable of addressing the challenges in mental health. We also analyzed whether domain experts could be involved in the loop of machine learning algorithms. (4) Handle our differences. We separately presented our research and then found times to work together to put sets of slides together based on one common vision and goal. (5) After the first month, hold meetings only when discussion is needed or there is an approaching deadline for either a paper or a proposal.

We have enjoyed our collaboration and the power of cross-disciplinary research. Our joint work is under review at Palgrave Communications. We have also submitted joint proposals to NIH with this work as preliminary results (Jiang et al. 2020).

Conclusions

In this essay, I used a metaphor comparing cross-disciplinary ML research to “happy marriages.” I discussed five characteristics they share. Specifically, I presented the top strengths of producing successful cross-disciplinary ML research: (1) Partners are satisfied with communication. (2) Partners feel very close to each other. (3) Partners discuss their problems well. (4) Partners handle their differences creatively. (5) There is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.). While every project is different and will produce its own challenges, my experience of collaborating with historians and psychologists according to the happy marriage metaphor suggests that it is a simple and strong paradigm that could help other interdisciplinary projects develop into successful, long-term collaborations.

References

Aagaard-Hansen, Jens. 2007. “The Challenges of Cross-Disciplinary Research.” Social Epistemology 21, no. 4 (October-December): 425–38. https://doi.org/10.1080/02691720701746540.

Cai, Liang. 2014. Witchcraft and the Rise of the First Confucian Empire. Albany: SUNY Press.

DeFrain, John, and Sylvia M. Asay. 2007. “Strong Families Around the World: An Introduction to the Family Strengths Perspective.” Marriage & Family Review 41, no. 1–2 (August): 1–10. https://doi.org/10.1300/J002v41n01_01.

Gordon, Cameron L., and Donald H. Baucom. 2009. “Examining the Individual Within Marriage: Personal Strengths and Relationship Satisfaction.” Personal Relationships 16, no. 3 (September): 421–435. https://doi.org/10.1111/j.1475-6811.2009.01231.x.

Jeffrey, Paul. 2003. “Smoothing the Waters: Observations on the Process of Cross-Disciplinary Research Collaboration.” Social Studies of Science 33, no. 4 (August): 539–62.
Jiang, Meng, Brooke A. Ammerman, Qingkai Zeng, Ross Jacobucci, and Alex Brodersen. 2020. “Phrase-Level Pairwise Topic Modeling to Uncover Helpful Peer Responses to Online Suicidal Crises.” Humanities and Social Sciences Communications 7: 1–13.

Karniouchina, Ekaterina V., Liana Victorino, and Rohit Verma. 2006. “Product and Service Innovation: Ideas for Future Cross-Disciplinary Research.” The Journal of Product Innovation Management 23, no. 3 (May): 274–80.

Kohavi, Ron, George John, Richard Long, David Manley, and Karl Pfleger. 1994. “MLC++: A Machine Learning Library in C++.” In Proceedings of the Sixth International Conference on Tools with Artificial Intelligence, 740–3. N.p.: IEEE. https://doi.org/10.1109/TAI.1994.346412.

Kotsiantis, S.B. 2012. “Use of Machine Learning Techniques for Educational Proposes [sic]: a Decision Support System for Forecasting Students’ Grades.” Artificial Intelligence Review 37, no. 4 (May): 331–44. https://doi.org/10.1007/s10462-011-9234-x.

Ma, Yihong, Qingkai Zeng, Tianwen Jiang, Liang Cai, and Meng Jiang. 2019. “A Study of Person Entity Extraction and Profiling from Classical Chinese Historiography.” In Proceedings of the 2nd International Workshop on EntitY REtrieval, edited by Gong Cheng, Kalpa Gunaratna, and Jun Wang, 8–15. N.p.: International Workshop on EntitY REtrieval. http://ceur-ws.org/Vol-2446/.

Miller, Eliza C. and Lisa Leffert. 2018. “Building Cross-Disciplinary Research Collaborations.” Stroke 49, no. 3 (March): e43–e45. https://doi.org/10.1161/strokeaha.117.020437.

Mullainathan, Sendhil, and Jann Spiess. 2017. “Machine learning: an applied econometric approach.” Journal of Economic Perspectives 31, no. 2 (spring): 87–106. https://doi.org/10.1257/jep.31.2.87.

Muratovski, Gjoko. 2011. “Challenges and Opportunities of Cross-Disciplinary Design Education and Research.” In Proceedings from the Australian Council of University Art and Design Schools (ACUADS) Conference: Creativity: Brain—Mind—Body, edited by Gordon Bull. Canberra, Australia: ACUADS Conference. https://acuads.com.au/conference/article/challenges-and-opportunities-of-cross-disciplinary-design-education-and-research/.

Murthy, Sreerama K. 1998. “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey.” Data Mining and Knowledge Discovery 2, no. 4 (December): 345–89. https://doi.org/10.1023/A:1009744630224.

O’Rourke, Michael, Stephen Crowley, and Chad Gonnerman. 2016. “On the Nature of Cross-Disciplinary Integration: A Philosophical Framework.” Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 56 (April): 62–70. https://doi.org/10.1016/j.shpsc.2015.10.003.

Pedregosa, Fabian et al. 2011. “Scikit-learn: Machine Learning in Python.” The Journal of Machine Learning Research 12: 2825–30. http://www.jmlr.org/papers/v12/pedregosa11a.html.

Pettigrew, Simone F. 2000. “Ethnography and Grounded Theory: a Happy Marriage?” In Association for Consumer Research Conference Proceedings, edited by Stephen J. Hoch and Robert J. Meyer, 256–60. Provo, UT: Association for Consumer Research. https://www.acrwebsite.org/volumes/8400/volumes/v27/.
Prepare/Enrich. N.d. “National Survey of Marital Strengths.” Prepare/Enrich (website). Accessed January 17, 2020. https://www.prepare-enrich.com/pe_main_site_content/pdf/research/national_survey.pdf.

Robinson, Linda C. and Priscilla W. Blanton. 1993. “Marital Strengths in Enduring Marriages.” Family Relations: An Interdisciplinary Journal of Applied Family Studies 42, no. 1 (January): 38–45. https://doi.org/10.2307/584919.

Urquhart, R., E. Grunfeld, L. Jackson, J. Sargeant, and G. A. Porter. 2013. “Cross-Disciplinary Research in Cancer: an Opportunity to Narrow the Knowledge–Practice Gap.” Current Oncology 20, no. 6 (December): e512–e521. https://doi.org/10.3747/co.20.1487.

Xu, Anqi, Xiaolin Xie, Wenli Liu, Yan Xia, and Dalin Liu. 2007. “Chinese Family Strengths and Resiliency.” Marriage & Family Review 41, no. 1–2 (August): 143–64. https://doi.org/10.1300/J002v41n01_08.

Zeng, Qingkai, Mengxia Yu, Wenhao Yu, Jinjun Xiong, Yiyu Shi, and Meng Jiang. 2019. “Faceted Hierarchy: A New Graph Type to Organize Scientific Concepts and a Construction Method.” In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), edited by Dmitry Ustalov, Swapna Somasundaran, Peter Jansen, Goran Glavaš, Martin Riedl, Mihai Surdeanu, and Michalis Vazirgiannis, 140–50. Hong Kong: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-5317.

Chapter 7

AI and Its Moral Concerns

Bohyun Kim
University of Rhode Island

Automating Decisions and Actions

The goal of artificial intelligence (AI) as a discipline is to create an artificial system—whether it be a piece of software or a machine with a physical body—that is as intelligent as a human in its performance, either broadly in all areas of human activities or narrowly in a specific activity, such as playing chess or driving.1 The actual capability of most AI systems remained far below this ambitious goal for a long time. But with recent successes in machine learning and deep learning, the performance of some AI programs has started surpassing that of humans.
In 2016, an AI program developed with the deep learning method, AlphaGo, astonished even its creators by winning four out of five Go matches with the eighteen-time world champion, Sedol Lee.2 In 2020, Google's DeepMind unveiled Agent57, a deep reinforcement learning algorithm that reached superhuman levels of play across all 57 classic Atari games in the Atari57 benchmark.3

1. Note that by 'as intelligent as a human,' I only mean AI at human-level performance in achieving a particular goal, not general(/strong) AI. General AI—also known as 'artificial general intelligence (AGI)' and 'strong AI'—refers to AI with the ability to adapt to achieve any goals. By contrast, an AI system developed to perform only one or some activities in a specific domain is called a 'narrow(/weak) AI' system.
2. AlphaGo can be said to be "as intelligent as humans," but only in playing Go, where it exceeds human capability. So, it does not qualify as general/strong AI in spite of its human-level intelligence in Go-playing. It is to be noted that general(/strong) AI and narrow(/weak) AI signify the difference in the scope of AI capability. General(/strong) AI is also a broader concept than human-like intelligence, either with its carbon-based substrate or with human-like understanding that relies on what we regard as uniquely human cognitive states such as consciousness, qualia, emotions, and so on. For more helpful descriptions of common terms in AI, see (Tegmark 2017, 39). For more on the match between AlphaGo and Sedol Lee, see (Koch 2016).
3. Deep reinforcement learning is a type of deep learning that is goal-oriented and reward-based. See (Heaven 2020).

Early symbolic AI systems determined their outputs based upon given rules and logical inference. AI algorithms in these rule-based systems, also known as good old-fashioned AI (GOFAI), are pre-determined, predictable, and transparent. On the other hand, machine learning, another approach in AI, enables an AI algorithm to evolve to identify a pattern through the so-called 'training' process, which relies on a large amount of data and statistics. Deep learning, one of the widely used techniques in machine learning, further refines this training process using a 'neural network.'4 Machine learning and deep learning have brought significant improvements to the performance of AI systems in areas such as translation, speech recognition, and detecting objects and predicting their movements. Some people assume that machine learning completely replaced GOFAI, but this is a misunderstanding. Symbolic reasoning and machine learning are two distinct but not mutually exclusive approaches in AI, and they can be used together (Knight 2019a).

With their limited intelligence and fully deterministic nature, early rule-based symbolic AI systems raised few ethical concerns.5 AI systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. As humans, we are trained to recognize situations that demand moral decision-making. But how would an AI system be able to do so? Or, should it be? With self-driving cars and autonomous weapons systems under active development and testing, these are no longer idle questions.
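The distinction between a hand-coded rule and a learned one can be made concrete with a toy sketch (hypothetical, purely illustrative Python; the spam-filter task, the messages, and the single length feature are inventions for this example, and real machine learning fits statistical parameters over many features):

    # Rule-based (GOFAI-style): a human writes the decision logic directly,
    # so the behavior is pre-determined, predictable, and transparent.
    def is_spam_rule_based(message):
        return "free money" in message.lower()

    # Machine learning style: the decision rule is derived from labeled examples.
    # A trivial one-feature "model" (message length) stands in for the
    # statistical models that real training produces.
    examples = [(12, False), (95, True), (80, True), (20, False)]  # (length, spam?)
    spam_mean = sum(l for l, s in examples if s) / sum(1 for _, s in examples if s)
    ham_mean = sum(l for l, s in examples if not s) / sum(1 for _, s in examples if not s)
    threshold = (spam_mean + ham_mean) / 2  # derived from the data, not hand-picked

    def is_spam_learned(message):
        return len(message) > threshold

    print(is_spam_rule_based("Claim your FREE MONEY now"))  # True
    print(is_spam_learned("Hi, lunch?"))                    # False

The hand-written rule can be read and audited directly; the learned threshold is trivial here, but in real systems the fitted parameters number in the millions, which is the source of the opacity discussed later in this chapter.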
The Trolley Problem

Recent advances in AI, such as autonomous cars, have brought new interest to the trolley problem, a thought experiment introduced by the British philosopher Philippa Foot in 1967. In the standard version of this problem, a runaway trolley barrels down a track where five unsuspecting people are standing. You happen to be standing next to a lever that switches the trolley onto a different track, where there is only one person. Those who are on either track will be killed if the trolley heads their way. Should you pull the lever, so that the runaway trolley would kill one person instead of five? Unlike a person, a machine does not panic or freeze and simply follows and executes the given instruction. This means that an AI-powered trolley may act morally as long as it is programmed properly.6 The question itself remains, however. Should the AI-powered trolley be programmed to swerve or stay on course?

4. Machine learning and deep learning have gained momentum because the cost of high-performance computing has significantly decreased and large data sets have become more widely available. For example, the data in the ImageNet contains more than 14 million hand-annotated images. The ImageNet data have been used for the well-known annual AI competition for object detection and image classification at large scale from 2010 to 2017. See http://www.image-net.org/challenges/LSVRC/.
5. For an excellent history of AI research, see chapter 1, "What is Artificial Intelligence," of Boden 2016, 1-20.
6. Programming here does not exclusively refer to a deep learning or machine learning approach.

Different moral theories, such as virtue ethics, contractarianism, and moral relativism, take different positions. Here, I will consider utilitarianism and deontology. Since their tenets are relatively straightforward, most AI developers are likely to look towards those two moral theories for guidance and insight. Utilitarianism argues that the utility of an action is what makes an action moral. In this view, what generates the greatest amount of good is the most moral thing to do. If one regards five human lives as a greater good than one, then one acts morally by pulling the lever and diverting the trolley to the other track. By contrast, deontology claims that what determines whether an action is morally right or wrong is not its utility but moral rules. If an action is in accordance with those rules, then the action is morally right. Otherwise, it is morally wrong. If not to kill another human being is one of those moral rules, then killing someone is morally wrong even if it is to save more lives.

Note that these are highly simplified accounts of utilitarianism and deontology. The good in utilitarianism can be interpreted in many different ways, and the issue of conflicting moral rules is a perennial problem that deontological ethics grapples with.7 For our purpose, however, these simplified accounts are sufficient to highlight the aspects in which the utilitarian and the deontological position appeal to and go against our moral intuition at the same time. If a trolley cannot be stopped, saving five lives over one seems to be a right thing to do. Utilitarianism appears to get things right in this respect. However, it is hard to dispute that killing people is wrong. If killing is morally wrong no matter what, deontology seems to make more sense. With moral theories, things seem to get more confusing.
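To see how these two theories translate into the constraints and if-then statements an engineer would actually write, consider a minimal sketch (hypothetical code, not drawn from any real autonomous-vehicle system; reducing "the good" to a count of lives is itself a substantive simplification):

    # Utilitarian rule: minimize expected deaths, whatever the action.
    def utilitarian_choice(people_on_main_track, people_on_side_track):
        if people_on_side_track < people_on_main_track:
            return "pull lever"
        return "stay on course"

    # Deontological rule: never take an action that kills, even to save more lives.
    def deontological_choice(people_on_main_track, people_on_side_track):
        if people_on_side_track > 0:
            return "stay on course"  # pulling the lever would kill someone
        return "pull lever"

    print(utilitarian_choice(5, 1))    # pull lever
    print(deontological_choice(5, 1))  # stay on course

The point is not that either function is right, but that once coded, each theory commits the machine to verdicts that our moral intuition may reject, as the cases below show.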
Furthermore, consider the case in which one freezes and fails to pull the lever. According to utilitarianism, this would be morally wrong because it fails to maximize the greatest good, i.e. human lives. But how far should one go to maximize the good? Suppose there is a very large person on a footbridge over the trolley track, and one pushes that person off the footbridge onto the track, thus stopping the trolley and saving the five people. Would this count as a right thing to do? Utilitarianism may argue that it would. But in real life, many would consider pushing the person morally wrong but pulling the lever morally permissible.8

The problem with utilitarianism is that it treats the good as something inherently quantifiable, comparable, calculable, and additive. But not all considerations that we have to factor into moral decision-making are measurable in numbers. What if the five people on the track are helpless babies or murderers who just escaped from prison? Would or should that affect our decision? Some of us would surely hesitate to save the lives of five murderers by sacrificing one innocent baby. But what if things were different and we were comparing five school children versus one baby or five babies versus one school child? No one can say for sure what the morally right action is in those cases.9

While the utilitarian position appears less persuasive in light of these considerations, deontology doesn't fare too well, either. Deontology emphasizes one's duty to observe moral rules. But what if those moral rules conflict with one another? Between the two moral rules, "do not kill a person" and "save lives," which one should trump the other? The conflict among values is common in life, and deontology faces difficulty in guiding how an intelligent agent is to act in a tricky situation such as the trolley problem.10

7. For an overview, see (Sinnott-Armstrong, 2019) and (Alexander and Moore, 2016).
8. For an empirical study on this, see (Cushman, Young, and Hauser 2006). For the results of a similar survey that involves an autonomous car instead of a trolley, see (Bonnefon, Shariff, and Rahwan 2016).
9. For an attempt to identify moral principles behind our moral intuition in different versions of the trolley problem and other similar cases, see (Thomson 1976).
10. Some moral philosophers doubt the value of our moral intuition in constructing a moral theory. See (Singer 2005), for example. But a moral theory that clashes with common moral intuition is unlikely to be sought out as a guide to making an ethical decision.

Understanding What Ethics Has to Offer

Now, let us consider AI-powered military robots and autonomous weapons systems since they present the moral dilemma in the trolley problem more convincingly due to the high stakes involved. Suppose that some engineers, following utilitarianism and interpreting victory as the ultimate good/utility, wish to program an unmanned aerial vehicle (UAV) to autonomously drop bombs in order to maximize the chances of victory. That may result in sacrificing a greater number of civilians than necessary, and many will consider this to be morally wrong. Now imagine different engineers who, adopting deontology and following the moral principle of not killing people, program a UAV to autonomously act in a manner that minimizes casualties. This may lead to defeat on the battlefield, because minimizing casualties may not always be advantageous to winning a war.
From these examples, we can see that philosophical insights from utilitarianism and deontology may provide little practical guidance on how to program autonomous AI systems to act morally. Ethicists seek abstract principles that can be generalized. For this reason, they are interested in borderline cases that reveal subtle differences in our moral intuition and varying moral theories. Their goal is to define what is moral and investigate how moral reasoning works or should work. By contrast, engineers and programmers pursue practical solutions to real-life problems and look for guidelines that will help with implementing those solutions. Their focus is on creating a set of constraints and if-then statements, which will allow a machine to identify and process morally relevant considerations, so that it can determine and execute an action that is not only rational but also ethical in the given situation.11

On the other hand, the goal of military commanders and soldiers is to end a conflict, bring peace, and facilitate restoring and establishing universally recognized human values such as freedom, equality, justice, and self-determination. In order to achieve this goal, they must make the best strategic decisions and take the most appropriate actions. In deciding on those actions, they are also responsible for abiding by the principles of jus in bello and for not abdicating their moral responsibility, protecting civilians and minimizing harm, violence, and destruction as much as possible.12 The goal of military commanders and soldiers, therefore, differs from those of moral philosophers or of the engineers who build autonomous weapons. They are obligated to make quick decisions in a life-or-death situation while working with AI-powered military systems.

These different goals and interests explain why moral philosophers' discussion on the trolley problem may be disappointing to AI programmers or military commanders and soldiers. Ethics does not provide an easy answer to the question of how one should program moral decision-making into intelligent machines. Nor does it prescribe the right moral decision on a battlefield. But taking this as a shortcoming of ethics is missing the point. The role of moral philosophy is not to make decision-making easier but to highlight and articulate the difficulty and complexity involved in it.

Ethical Challenges from Autonomous AI Systems

The complexity of ethical questions means that dealing with the morality of an action by an autonomous AI system will require more than a clever engineering or programming solution. The fact that ethics does not eliminate the inherent ambiguity in many moral decisions should not lead to the dismissal of ethical challenges from autonomous AI systems. By injecting the capacity for autonomous decision-making into machines, AI can fundamentally transform any given field. For example, AI-powered military robots are not just another kind of weapon. When widely deployed, they can change the nature of war itself. Described below are some of the significant ethical challenges that autonomous AI systems such as military robots present.

11. Note that this moral decision-making process can be modeled with a rule-based symbolic AI approach, a machine learning approach, or a combination of both. See Vincent Conitzer et al. 2017.
12. For the principles of jus in bello, see International Committee of the Red Cross 2015.
Note that in spite of these ethical concerns, autonomous AI systems are likely to continue to be developed and adopted in many areas as a way to increase efficiency and lower cost.

(a) Moral desensitization

AI-powered military robots are more capable than merely remotely-operated weapons. They can identify a target and initiate an attack on their own. Due to their autonomy, military robots can significantly increase the distance between the party that kills and the party that gets killed (Sharkey 2012). This increase, however, may lead people to surrender their own moral responsibility to a machine, thereby resulting in the loss of humanity, which is a serious moral risk (Davis 2007). The more autonomous military robots become, the less responsibility humans will feel regarding their life-or-death decisions.

(b) Unintended outcome

The side that deploys AI-powered military robots is likely to suffer fewer casualties itself while inflicting more casualties on the enemy side. This may make the military more inclined to start a war. Ironically, when everyone thinks and acts this way, the number of wars and the overall amount of violence and destruction in the world will only increase.13

(c) Surrender of moral agency

AI-powered military robots may fail to distinguish innocents from combatants and kill the former. In such a case, can we be justified in letting robots take the lives of other human beings? Some may argue that only humans should decide to kill other humans, not machines (Davis 2007). Is it permissible for people to delegate such a decision to AI?

(d) Opacity in decision-making

Machine learning is used to build many AI systems today. Instead of prescribing a pre-determined algorithm, a machine learning system goes through a so-called 'training' process to produce the final algorithm from a large amount of data. For example, a machine learning system may generate an algorithm that successfully recognizes cats in a photo after going through millions of photos that show cats in many different postures from various angles.14 But the resulting algorithm is a complex mathematical formula and not something that humans can easily decipher. This means that the inner workings of a machine learning AI system and its decision-making process are opaque to human understanding, even to those who built the system itself (Knight 2017). In cases where the actions of an AI system can have grave consequences, as with a military robot, such opacity becomes a serious problem.15

13. (Kahn 2012) also argues that the resulting increase in the number of wars by the use of military robots will be morally bad.
14. Google's research team created an AI algorithm that learned how to recognize a cat in 2012. The neural network behind this algorithm had an array of 16,000 processors and more than one billion connections. Unlabeled random thumbnail images from 10 million YouTube videos allowed this algorithm to learn to identify cats by itself. See Markoff 2012 and Clark 2012.
15. This black-box nature of AI systems powered by machine learning has raised great concern among many AI researchers in recent years. This is problematic in all areas where these AI systems are used for decision-making, not just in military operations. The gravity of decisions made in a military operation makes this problem even more troublesome. Fortunately, some AI researchers, including those in the US Department of Defense, are actively working to make AI systems explainable.
But until such research bears fruit and AI systems become fully explainable, their military use means accepting many unknown variables and unforeseeable consequences. See Turek n.d.

AI Applications for Libraries

Do the ethical concerns outlined above apply to libraries? To answer that, let us first take a look at how AI, particularly machine learning, may apply to library services and operations. AI-powered digital assistants are likely to mediate a library user's information search, discovery, and retrieval activities in the near future.

In recent years, machine learning and deep learning have brought significant improvement to natural language processing (NLP), which deals with analyzing large amounts of natural language data to make the interaction between people and machines in natural languages possible. For instance, Google Assistant's new feature 'Duplex' was shown to successfully make a phone reservation with restaurant staff in 2018 (Welch 2018). Google's real-time translation capability for 44 different languages was introduced to Google Assistant-enabled Android and iOS phones in 2019 (Rincon 2019). As digital assistants become capable of handling more sophisticated language tasks, their use as a flexible voice user interface will only increase. Such digital assistants will be able to directly interact with library systems and applications, automatically interpret a query, and return results that they deem to be most relevant. Those digital assistants can also be equipped to handle the library's traditional reference or readers' advisory service. Integrated into a humanoid robot body, they may even greet library patrons at the entrance and answer directional questions about the library building.

Cataloging, abstracting, and indexing are other areas where AI will be actively utilized. Currently, those tasks are performed by skilled professionals. But as AI applications become more sophisticated, we may see many of those tasks partially or fully automated and handed over to AI systems. Machine learning and deep learning can be used to extract key information from a large number of documents or from information-rich visual materials, such as maps and video recordings, and generate metadata or a summary.

Since machine learning is new to libraries, there are a relatively small number of machine learning applications developed for libraries' use. They are likely to grow in number. Yewno, Quartolio, and Iris.ai are examples of the commercial products developed with machine learning and deep learning techniques.16 Yewno Discover displays the connections between different concepts or works in library materials. Quartolio targets researchers looking to discover untapped research opportunities based upon a large amount of data that includes articles, clinical trials, patents, and notes. Similarly, Iris.ai helps researchers identify and review a large number of research papers and patents and extracts key information from them. Kira identifies, extracts, and analyzes text in contracts and other legal documents.17 None of these applications performs fully automated decision-making nor incorporates the digital assistant feature. But this is an area on which information systems vendors are increasingly focusing their efforts.
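As a concrete illustration of the kind of metadata assistance described above, here is a minimal sketch that suggests subject keywords with TF-IDF term weighting via scikit-learn (a simple statistical baseline, not the technique used by any of the products named above; the sample "documents" are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # A toy collection; in practice, these would be full-text documents.
    docs = [
        "Railroad survey maps of the western territories, drawn in 1885.",
        "Oral history recordings of railroad workers in Kansas.",
        "Botanical illustrations of prairie grasses and wildflowers.",
    ]

    # Weight each term by how distinctive it is within the collection.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)
    terms = vectorizer.get_feature_names_out()

    # Suggest the three highest-weighted terms per document as candidate keywords.
    for i in range(len(docs)):
        weights = tfidf[i].toarray()[0]
        top_three = weights.argsort()[-3:][::-1]
        print([terms[j] for j in top_three])

A cataloger would still review such suggestions; the point is that statistical weighting can surface candidate terms across a collection far larger than any staff could read.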
Libraries themselves are also experimenting with AI to test its potential for library services and operations. Some are focusing on using AI, particularly the voice user interface aspect of the digital assistant, in order to improve existing services. The University of Oklahoma Libraries have been building an Alexa application to provide basic reference service to their students.18 At the University of Pretoria Library in South Africa, a robot named 'Libby' already interacts with patrons by providing guidance, answering questions, conducting surveys, and displaying marketing videos (Mahlangu 2019).

16. See https://www.yewno.com/education, https://quartolio.com/, and https://iris.ai/.
17. See https://kirasystems.com/. Law firms are adopting similar products to automate and expedite their legal work, and law librarians are discussing how the use of AI may change their work. See Marr 2018 and Talley 2016.
18. University of Oklahoma Libraries are building an Alexa application that will provide some basic reference service to their students. Also, their PAIR registry attempts to compile all AI-related projects at libraries. See https://pair.libraries.ou.edu.

Other libraries are applying AI to extract information from digital materials and automate metadata generation to enhance their discovery and use. The Library of Congress has worked on detecting features, such as railroads in maps, using the convolutional neural network model, and issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections in 2019.19 Indiana University Libraries, AVP, University of Texas Austin School of Information, and the New York Public Library are jointly developing the Audiovisual Metadata Platform (AMP), using many AI tools in order to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.20

Some libraries are also testing out AI as a tool for evaluating services and operations. The University of Rochester Libraries applied deep learning to the library's space assessment to determine the optimal staffing level and building hours. The University of Illinois Urbana-Champaign Libraries used machine learning to conduct sentiment analysis on their reference chat log (Blewer, Kim, and Phetteplace 2018).

Ethical Challenges from the Personalized and Automated Information Environment

Do these current and future AI applications for libraries pose ethical challenges similar to those that we discussed earlier? Since information query, discovery, and retrieval rarely involve life-or-death situations, the stakes certainly seem lower. But an AI-driven automated information environment does raise its own distinct ethical challenges.

(i) Intellectual isolation and bigotry hampering civic discourse

Many AI applications that assist with information seeking activities promise a higher level of personalization. But a highly personalized information environment often traps people in their own so-called 'filter bubble,' as we have been increasingly seeing in today's social media channels, news websites, and commercial search engines, where such personalization is provided by machine learning and deep learning.21 Sophisticated AI algorithms are already curating and pushing information feeds based upon the person's past search and click behavior.
The result is that information seekers are provided with information that conforms to and reinforces their existing beliefs and interests. Views that are novel or contrast with their existing beliefs are suppressed and become invisible without them even realizing. Such lack of exposure to opposing views leads information users to intellectual isolation and even bigotry. Highly personalized information environments powered by AI can actively restrict ways in which people develop balanced and informed opinions, thereby intensifying and perpetuating social discord and disrupting civic discourse. Under such conditions, prejudices, discrimination, and other unjust social practices are likely to increase, and this in turn will have more negative impact on those with fewer privileges. Intellectual isolation and bigotry have a distinctly moral impact on society.

19. See Blewer, Kim, and Phetteplace 2018 and Price 2019.
20. The AMP wiki is https://wiki.dlib.indiana.edu/pages/viewpage.action?pageId=531699941. The Audiovisual Metadata Platform Pilot Development (AMPPD) project was presented at Code4Lib 2020 (Averkamp and Hardesty 2020).
21. See Pariser 2011.

(ii) Weakening of cognitive agency and autonomy

We have seen earlier that AI-powered digital assistants are likely to mediate people's information search, discovery, and retrieval activities in the near future. As those digital assistants become more capable, they will go beyond listing available information. They will further choose what they deem to be most relevant to users and proceed to recommend or autonomously execute the best course of action.22 Other AI-driven features, such as extracting key information or generating a summary of a large amount of information, are also likely to be included in future information systems, and they may deliver key information or summaries even before the request is made based upon constant monitoring of the user's activities.

In such a scenario, an information seeker's cognitive agency is likely to be undermined. Crucial to cognitive agency is the mental capacity to critically review a variety of information, judge what is and is not relevant, and interpret how it relates to other existing beliefs and opinions. If AI assumes those tasks, the opportunities for information seekers to exercise their own cognitive agency will surely decrease. Cognitive deskilling and the subsequent weakening of people's agency in the AI-powered automated information environment presents an ethical challenge because such agency is necessary for a person to be a fully functioning moral agent in society.23

(iii) Social impact of scholarship and research from flawed AI algorithms

Previously, we have seen that deep learning applications are opaque to human understanding. This lack of transparency and explainability raises a question of whether it is moral to rely on AI-powered military robots for life-or-death decisions. Does the AI-powered information environment have a similar problem?

Machine learning applications base their recommendations and predictions upon the patterns in past data. Their predictions and recommendations are in this sense inherently conservative. They also become outdated when they fail to reflect new social views and material conditions that no longer fit the past patterns.
Furthermore, each data set is a social construct that reflects particular values and choices such as who decided to collect the data and for what purpose; who labeled data; what criteria or beliefs guided such labeling; what taxonomies were used and why (Davis 2020). No data set can capture all variables and elements of the phenomenon that it describes. Moreover, data sets used for training machine learning and deep learning algorithms may not be representative samples for all relevant subgroups. In such a case, an algorithm trained by such a data set will produce skewed results. Creating a large data set is also costly. Consequently, developers often simply take the data sets available to them. Those data sets are likely to come with inherent limitations such as omissions, inaccuracies, errors, and hidden biases.

22. Needless to say, this is a highly simplified scenario. Those features can also be built in the information system itself rather than being delivered by a digital assistant.
23. Outside of the automated information environment, AI has a strong potential to engender moral deskilling. Vallor (2015) points out that automated weapons will lead to soldiers' moral deskilling in the use of military force; new media practices of multitasking may result in deskilling in moral attention; and social robots can cause moral deskilling in practices of human caregiving.

AI algorithms trained with these flawed data sets can fail unexpectedly, revealing those limitations. For example, it has been reported that the success rate of a facial recognition algorithm plunges from 99% to 35% when the group of subjects changes from white men to dark-skinned women because it was trained mostly with the photographs of white men (Lohr 2018). Adopting such a faulty algorithm for any real-life use at a large scale would be entirely unethical. For the context of libraries, imagine using such a face-recognition algorithm to generate metadata for digitized historical photographs or a similarly flawed audio transcription algorithm to transcribe archival audio recordings.

Just like those faulty algorithms, an AI-powered automated information environment can produce information, recommendations, and predictions affected by similar limitations existing in many data sets. The more seamless such an information environment is, the more invisible those limitations become. Automated information systems from libraries may not be involved in decisions that have a direct and immediate impact on people's lives, such as setting a bail amount or determining the Medicaid payment to be paid.24 But automated information systems that are widely adopted and used for research and scholarship will impact real-life policies and regulations in areas such as healthcare and the economy. Undiscovered flaws will undermine the validity of the scholarly output that utilized those automated information systems and can further inflict serious harm on certain groups of people through those policies and regulations.
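The subgroup failure that Lohr reports is exactly the kind of problem a simple disaggregated audit can surface before deployment. Below is a minimal sketch of a per-group accuracy check (the groups, labels, and predictions are invented for illustration; real audits require far more care in sampling and in defining groups):

    from collections import defaultdict

    # Each record: (demographic group, true label, model prediction). Invented data.
    results = [
        ("lighter-skinned men", 1, 1), ("lighter-skinned men", 0, 0),
        ("lighter-skinned men", 1, 1), ("lighter-skinned men", 0, 0),
        ("darker-skinned women", 1, 0), ("darker-skinned women", 0, 1),
        ("darker-skinned women", 1, 1), ("darker-skinned women", 1, 0),
    ]

    correct, total = defaultdict(int), defaultdict(int)
    for group, truth, prediction in results:
        total[group] += 1
        correct[group] += int(truth == prediction)

    # Accuracy per group: an aggregate score would average away the gap.
    for group in total:
        print(group, round(correct[group] / total[group], 2))

On this toy data the overall accuracy is 0.625, a respectable-looking number that hides a 75-point gap between the two groups; disaggregating by group is what makes the failure visible.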
Moral Intelligence and Rethinking the Role of AI

In this chapter, I discussed four significant ethical challenges that automating decisions and actions with AI presents: (a) moral desensitization; (b) unintended outcomes; (c) surrender of moral agency; (d) opacity in decision-making.25 I also examined somewhat different but equally significant ethical challenges in relation to the AI-powered automated information environment, which is likely to surround us in the future: (i) intellectual isolation and bigotry hampering civic discourse; (ii) weakening of cognitive agency and autonomy; (iii) social impact of scholarship and research based upon flawed AI algorithms.

24. See Tashea 2017 and Stanley 2017.
25. This is by no means an exhaustive list. User privacy and potential surveillance are examples of other important ethical challenges, which I do not discuss here.

In the near future, libraries will be acquiring, building, customizing, and implementing many personalized and automated information systems. Given this, the challenges related to the AI-powered automated information environment are highly relevant to them. At present, libraries are at an early stage in developing AI applications and applying machine learning and deep learning techniques to improve library services, systems, and operations. But the general issues of hidden biases and the lack of explainability in machine learning and deep learning are already gaining awareness in the library community.

As we have seen in the trolley problem, whether a certain action is moral is not a line that can be drawn with absolute clarity. It is entirely possible for fully-functioning moral agents to make different judgements. In addition, there is the matter of morality that our tools and systems display. This is called "machine morality" in relation to AI systems.

Wallach and Allen (2009) argue that there are three distinct levels of machine morality: operational morality, functional morality, and full moral agency (26). Operational morality is found in systems that are low in both autonomy and ethical sensitivity. At this level of machine morality, a machine or a tool is given a mechanism that prevents its immoral use, but the mechanism is within the full control of the user. Such operational morality exists in a gun with a childproof safety mechanism, for example. A gun with a safety mechanism is neither autonomous nor sensitive to ethical concerns related to its use. By contrast, machines with functional morality do possess a certain level of autonomy and ethical sensitivity. This category includes AI systems with significant autonomy and little ethical sensitivity or those with little autonomy and high ethical sensitivity. An autonomous drone would fall under the former type, while MedEthEx, an ethical decision-support AI recommendation system for clinicians, would be of the latter. Lastly, Wallach and Allen regard systems with high autonomy and high ethical sensitivity as having full moral agency, as much as humans do. This means that those systems would have a mental representation of values and the capacity for moral reasoning. Such machines can be held morally responsible for their actions. We do not know whether AI will be able to produce such a machine with full moral agency.
If the current direction to automate more and more human tasks for cost savings and efficiency at scale continues, however, most of the more sophisticated AI applications to come will be of the kind with functional morality, particularly the kind that combines a relatively high level of autonomy and a lower level of ethical sensitivity.

At the beginning of this chapter, I mentioned that the goal of AI is to create an artificial system—whether it be a piece of software or a machine with a physical body—that is as intelligent as a human in its performance, either broadly in all areas of human activities or narrowly in a specific activity. But what does "as intelligent as a human" exactly mean? If morality is an integral component of human-level intelligence, AI research needs to pay more attention to intelligence not only in accomplishing a goal but also in doing so ethically.26 In that light, it is meaningful to ask what level of autonomy and ethical sensitivity a given AI system is equipped with, and what level of machine morality is appropriate for its purpose.

In designing an AI system, it would be helpful to consider what level of autonomy and ethical sensitivity would be best suited for its purpose and whether it is feasible to provide that level of machine morality for the system in question. In general, the narrower the function or the domain of an AI system is, the easier it will be to equip it with an appropriate level of autonomy and ethical sensitivity. In evaluating and designing an AI system, it will be important to test the actual outcome against the anticipated outcome in different types of cases in order to identify potential problems. System-wide audits to detect well-known biases, such as gender discrimination or racism, can serve as an effective strategy.27 Other undetected problems may surface only after the AI system is deployed. Having a mechanism to continually test an AI algorithm to identify those unnoticed problems and feeding the test results back into the algorithm for retraining will be another way to deal with algorithmic biases. Those who build AI systems will also benefit from consulting existing principles and guidelines such as FAT/ML's "Principles for Accountable Algorithms and a Social Impact Statement for Algorithms."28

We may also want to rethink how and where we apply AI.

26. Here, I regard intelligence as the ability to accomplish complex goals, following Tegmark 2017. For more discussion on intelligence and goals, see Chapter 2 and Chapter 7.
27. These audits are far from foolproof, but the detection of hidden biases will be crucial in making AI algorithms more accountable and their decisions more ethical. A debiasing algorithm can also be used during the training stage of an AI algorithm to reduce hidden biases in training data. See Amini et al. 2019, Knight 2019b, and Courtland 2018.
28. See https://www.fatml.org/resources/principles-for-accountable-algorithms. Other principles and guidelines include "Ethics Guidelines for Trustworthy AI" (https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai) and "Algorithmic Impact Assessments: A Practical Framework For Public Agency Accountability" (https://ainowinstitute.org/aiareport2018.pdf).
We and our society do not have to use AI to equip all our systems and machines with human- or superhuman-level performance. This is particularly so if the pursuit of such human- or superhuman-level performance is likely to increase unethical decisions that negatively impact a significant number of people. We do not have to task AI with always automating away human work and decisions as much as possible. What if we reframe AI's role as helping people become more intelligent and more capable where they struggle or experience disadvantages, such as critical thinking, civic participation, healthy living, financial literacy, dyslexia, or hearing loss? What kind of AI-driven information systems and environments would be created if libraries approach AI with such intention from the beginning?

References

Alexander, Larry, and Michael Moore. 2016. "Deontological Ethics." In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Winter 2016. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2016/entries/ethics-deontological/.

Amini, Alexander, Ava P. Soleimany, Wilko Schwarting, Sangeeta N. Bhatia, and Daniela Rus. 2019. "Uncovering and Mitigating Algorithmic Bias through Learned Latent Structure." In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 289–295. AIES '19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3306618.3314243.

Averkamp, Shawn, and Julie Hardesty. 2020. "AI Is Such a Tool: Keeping Your Machine Learning Outputs in Check." Presented at the Code4lib Conference, Pittsburgh, PA, March 11. https://2020.code4lib.org/talks/AI-is-such-a-tool-Keeping-your-machine-learning-outputs-in-check.

Blewer, Ashley, Bohyun Kim, and Eric Phetteplace. 2018. "Reflections on Code4Lib 2018." ACRL TechConnect (blog). March 12, 2018. https://acrl.ala.org/techconnect/post/reflections-on-code4lib-2018/.

Boden, Margaret A. 2016. AI: Its Nature and Future. Oxford: Oxford University Press.

Bonnefon, Jean-François, Azim Shariff, and Iyad Rahwan. 2016. "The Social Dilemma of Autonomous Vehicles." Science 352 (6293): 1573–76. https://doi.org/10.1126/science.aaf2654.

Clark, Liat. 2012. "Google's Artificial Brain Learns to Find Cat Videos." Wired, June 26, 2012. https://www.wired.com/2012/06/google-x-neural-network/.

Conitzer, Vincent, Walter Sinnott-Armstrong, Jana Schaich Borg, Yuan Deng, and Max Kramer. 2017. "Moral Decision Making Frameworks for Artificial Intelligence." In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 4831–4835. AAAI'17. San Francisco, California, USA: AAAI Press.

Courtland, Rachel. 2018. "Bias Detectives: The Researchers Striving to Make Algorithms Fair." Nature 558 (7710): 357–60. https://doi.org/10.1038/d41586-018-05469-3.

Cushman, Fiery, Liane Young, and Marc Hauser. 2006. "The Role of Conscious Reasoning and Intuition in Moral Judgment: Testing Three Principles of Harm." Psychological Science 17 (12): 1082–89.

Davis, Daniel L. 2007. "Who Decides: Man or Machine?" Armed Forces Journal, November. http://armedforcesjournal.com/who-decides-man-or-machine/.

Davis, Hannah. 2020. "A Dataset Is a Worldview." Towards Data Science. March 5, 2020. https://towardsdatascience.com/a-dataset-is-a-worldview-5328216dd44d.
Foot, Philippa. 1967. "The Problem of Abortion and the Doctrine of Double Effect." Oxford Review 5: 5–15.

Heaven, Will Douglas. 2020. "DeepMind's AI Can Now Play All 57 Atari Games—but It's Still Not Versatile Enough." MIT Technology Review, April 1, 2020. https://www.technologyreview.com/2020/04/01/974997.

International Committee of the Red Cross. 2015. "What Are Jus Ad Bellum and Jus in Bello?" January 22, 2015. https://www.icrc.org/en/document/what-are-jus-ad-bellum-and-jus-bello-0.

Kahn, Leonard. 2012. "Military Robots and The Likelihood of Armed Combat." In Robot Ethics: The Ethical and Social Implications of Robotics, edited by Patrick Lin, Keith Abney, and George A. Bekey, 274–92. Intelligent Robotics and Autonomous Agents. Cambridge, Mass.: MIT Press.

Knight, Will. 2017. "The Dark Secret at the Heart of AI." MIT Technology Review, April 11, 2017. https://www.technologyreview.com/2017/04/11/5113.

Knight, Will. 2019a. "Two Rival AI Approaches Combine to Let Machines Learn about the World like a Child." MIT Technology Review, April 8, 2019. https://www.technologyreview.com/2019/04/08/103223.

Knight, Will. 2019b. "AI Is Biased. Here's How Scientists Are Trying to Fix It." Wired, December 19, 2019. https://www.wired.com/story/ai-biased-how-scientists-trying-fix/.

Koch, Christof. 2016. "How the Computer Beat the Go Master." Scientific American. March 19, 2016. https://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/.

Lohr, Steve. 2018. "Facial Recognition Is Accurate, If You're a White Guy." New York Times, February 9, 2018. https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html.

Mahlangu, Isaac. 2019. "Meet Libby - the New Robot Library Assistant at the University of Pretoria's Hatfield Campus." SowetanLIVE. June 4, 2019. https://www.sowetanlive.co.za/news/south-africa/2019-06-04-meet-libby-the-new-robot-library-assistant-at-the-university-of-pretorias-hatfield-campus/.

Markoff, John. 2012. "How Many Computers to Identify a Cat? 16,000." New York Times, June 25, 2012.

Marr, Bernard. 2018. "How AI And Machine Learning Are Transforming Law Firms And The Legal Sector." Forbes, May 23, 2018. https://www.forbes.com/sites/bernardmarr/2018/05/23/how-ai-and-machine-learning-are-transforming-law-firms-and-the-legal-sector/.

Pariser, Eli. 2011. The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. New York: Penguin Press.
Price, Gary. 2019. "The Library of Congress Posts Solicitation For a Machine Learning/Deep Learning Pilot Program to 'Maximize the Use of Its Digital Collection.'" LJ InfoDOCKET. June 13, 2019. https://www.infodocket.com/2019/06/13/library-of-congress-posts-solicitation-for-a-machine-learning-deep-learning-pilot-program-to-maximize-the-use-of-its-digital-collection-library-is-looking-for-r/.

Rincon, Lilian. 2019. "Interpreter Mode Brings Real-Time Translation to Your Phone." Google Blog (blog). December 12, 2019. https://www.blog.google/products/assistant/interpreter-mode-brings-real-time-translation-your-phone/.
Sharkey, Noel. 2012. "Killing Made Easy: From Joysticks to Politics." In Robot Ethics: The Ethical and Social Implications of Robotics, edited by Patrick Lin, Keith Abney, and George A. Bekey, 111–28. Intelligent Robotics and Autonomous Agents. Cambridge, Mass.: MIT Press.

Singer, Peter. 2005. "Ethics and Intuitions." The Journal of Ethics 9 (3/4): 331–52.

Sinnott-Armstrong, Walter. 2019. "Consequentialism." In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Summer 2019. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/sum2019/entries/consequentialism/.

Stanley, Jay. 2017. "Pitfalls of Artificial Intelligence Decisionmaking Highlighted In Idaho ACLU Case." American Civil Liberties Union (blog). June 2, 2017. https://www.aclu.org/blog/privacy-technology/pitfalls-artificial-intelligence-decisionmaking-highlighted-idaho-aclu-case.

Talley, Nancy B. 2016. "Imagining the Use of Intelligent Agents and Artificial Intelligence in Academic Law Libraries." Law Library Journal 108 (3): 383–402.

Tashea, Jason. 2017. "Courts Are Using AI to Sentence Criminals. That Must Stop Now." Wired, April 17, 2017. https://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/.

Tegmark, Max. 2017. Life 3.0: Being Human in the Age of Artificial Intelligence. New York: Alfred Knopf.

Thomson, Judith Jarvis. 1976. "Killing, Letting Die, and the Trolley Problem." The Monist 59 (2): 204–17.

Turek, Matt. n.d. "Explainable Artificial Intelligence." Defense Advanced Research Projects Agency. https://www.darpa.mil/program/explainable-artificial-intelligence.

Vallor, Shannon. 2015. "Moral Deskilling and Upskilling in a New Machine Age: Reflections on the Ambiguous Future of Character." Philosophy & Technology 28 (1): 107–24. https://doi.org/10.1007/s13347-014-0156-9.

Wallach, Wendell, and Colin Allen. 2009. Moral Machines: Teaching Robots Right from Wrong. Oxford: Oxford University Press.

Welch, Chris. 2018. "Google Just Gave a Stunning Demo of Assistant Making an Actual Phone Call." The Verge. May 8, 2018. https://www.theverge.com/2018/5/8/17332070/google-assistant-makes-phone-call-demo-duplex-io-2018.

Chapter 8
Building a Machine Learning Pipeline
Audrey Altman
Digital Public Library of America

As a new machine learning (ML) practitioner, it is important to develop a mindful approach to the craft.
By mindful, I mean possessing the ability to think clearly about each individual piece of the process, and understanding how each piece fits into the larger whole. In my experience, there are many good tutorials available that will help you work with an individual tool, deploy a specific algorithm, or complete a single task. It is more difficult to find guidelines for building a holistic system that supports the entire ML workflow. My aim is to help you build just such a system, so that you are free to focus on inquiry and discovery rather than struggling with infrastructure and process. I write this as a software developer who has, at one time or another, been on the wrong end of all the recommendations presented here, and hopes to save you from similar headaches. Many of the examples and design choices are drawn from my experiences at the Digital Public Library of America, where I have worked alongside a very talented team of developers. This is by no means an exhaustive text, but rather a bit of pragmatic advice and a jumping-off point for further research, designed to give you a clearer idea of which questions to ask throughout your practice.

This article reviews the basic machine learning workflow, discussing design considerations along the way. It offers recommendations for data storage, guidelines on selecting and working with ML algorithms, and questions to guide tool selection. Finally, it describes some challenges with scaling up. My hope is that the insight presented here, combined with your good judgement, will empower you to get started with the actual practice of designing and executing a machine learning project.

Algorithm selection

As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to perform on your dataset. Choose an algorithm that fits your research question and data. If you're not sure which algorithm to choose and not constrained by time, experiment with several different options and see which one yields the best results. Start by determining what general type of learning algorithm you need, and proceed from there to research and select one that specifically addresses your research question.

In supervised learning, you train a model to predict an output condition based on given input conditions; for example, predicting whether or not a patient has some disease based on their symptoms, or the topic of a news article based on keywords in the text. In order for supervised learning to work, you need labeled training data, meaning data in which the outcome is already known. Examples include records of symptoms in patients who were known to have the disease (or not), or news articles that have already been assigned topics.

Classification and regression are both types of supervised learning. In a classification problem, you are predicting a discrete number of possible outcomes. For example, "based on what I know about this book, will it make the New York Times Best Seller list?" is a classification problem because there are two discrete outcomes: yes or no. Classification algorithms include naive Bayes, decision trees, and k-nearest neighbor. Regression problems try to predict an outcome from a continuum of possibilities, i.e., "based on what I know about this book, what will its retail price be?" Regression algorithms include linear regression and regression trees.
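As an illustration of the supervised pattern, here is a minimal sketch using scikit-learn (the choice of library and of a decision tree is arbitrary, and the bundled iris dataset stands in for real labeled data):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Labeled data: flower measurements (inputs) and known species (outputs).
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train on examples where the outcome is already known...
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # ...then predict the discrete class of examples the model has never seen.
    print(model.score(X_test, y_test))  # fraction of correct predictions

The same fit-then-predict pattern applies whether the classifier is a decision tree, naive Bayes, or k-nearest neighbor, which is part of what makes swapping algorithms in and out of a workflow practical.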
In unsupervised learning, the ML algorithm discovers a new pattern. The training data is unlabeled, meaning there is no indication of how the data should be organized at the outset. A common example is clustering, in which the algorithm groups items together based on features it finds mathematically significant. Perhaps you have a collection of news articles (with no existing topic labels), and you want to discover common themes or topics that appear throughout the collection. The algorithm will not tell you what the themes or topics are, but will show which articles group together. It is then up to the researcher to work out the common thread.

In addition to serving your research question, your algorithm should also be a good fit for your data. Specific considerations will vary for each dataset and algorithm, so make sure you know the strengths and weaknesses of your algorithm and how they relate to the unique qualities of your dataset. For example, algorithms differ in their abilities to handle datasets with a very large number of features, handle datasets with high variance, efficiently process very large datasets, and glean meaningful intelligence from very small datasets. Is it important that your algorithm be easy to explain? Some algorithms, such as neural nets, function as black boxes, and it is difficult to decipher how they arrive at their decisions. Other algorithms, such as decision trees, are easy to understand. Can you prepare your data for the algorithm with a reasonable amount of pre-processing? Can you find examples of success (or failure) from people using similar datasets with the same algorithm? Asking these sorts of questions will help you to choose an algorithm that works well for your data, and will also inform how you prepare your data for optimal use.

Finally, consider whether or not you are constrained by time, hardware, or available toolsets. Different algorithms require different amounts of time and memory to train and/or execute. Different ML tools offer implementations of different algorithms.
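For contrast, here is a sketch of the unsupervised case described above: clustering a tiny invented "collection" of article titles with k-means (again using scikit-learn purely for illustration; the number of clusters is a configuration choice you would experiment with):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Unlabeled data: no topics have been assigned to these articles.
    articles = [
        "City council debates new budget for road repair",
        "Mayor proposes budget increase for public transit",
        "Local team wins championship in overtime thriller",
        "Star striker injured ahead of championship final",
    ]

    # Convert text to numeric features, then group articles by similarity.
    features = TfidfVectorizer(stop_words="english").fit_transform(articles)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # The output says which articles cluster together, not what the topics are;
    # naming the common thread is up to the researcher.
    print(labels)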
While the final output of a machine learning workflow is some sort of intelligent model, there are many factors that make repetition and iteration necessary. ML processes often involve subjective decisions, such as which data points to ignore, or which configurations to select for your algorithm. You will want to test different possibilities to see what works best. As you learn more about your dataset throughout the course of the project, you will go back and tweak parts of your process. You may discover biases in your data or algorithms that need to be addressed. If you are working collaboratively, you will be incorporating asynchronous feedback from members of your team. At some point, you may need to introduce new or revised data, or try a new tool or algorithm. It is also prudent to expect and plan for errors. Human errors are inevitable, and hardware errors, such as network timeouts or memory overloads, are common. For all of these reasons, you will be well served by a pipeline composed of modular, repeatable steps, each with discrete and stable output.

A modular pipeline supports a batch processing workflow, in which whole datasets undergo a series of transformations. During each step of the process, a large amount of data (possibly the entire dataset) is transformed all at once and then incrementally stored. This can be contrasted with a real-time workflow, in which individual records are transformed instantaneously (e.g., a librarian updates a single record in the library catalog); or a streaming workflow, in which a continuous flow of data is pushed through an entire pipeline, often without incremental storage along the way (e.g., performing analysis on a continuous stream of new tweets). Batch processing is common in the research and development phase of an ML project, and may also be a good choice for a production system.

When designing any step in the batch processing pipeline, assume that at some point you will need to repeat it, either exactly as is or with modifications. Documenting your process lets you compare the outputs of different variations and communicate the ways in which your choices impact the final results. If you're writing code, version control software can help. If you're doing more manual data manipulations, such as editing data in spreadsheets, you will need an intentional system of documenting exactly which transformations you are applying to your data. It is generally preferable to automate processes wherever possible so that you can repeat them with ease and consistency.

A concrete example from my own experience demonstrates the importance of a pipeline that supports repetition. In my first ever ML project, I worked with a set of XML library data converted to CSV. I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. This whole process took me countless hours, and when an updated dataset became available, there was no way to reproduce my painstaking cleanup process. I was stuck with outdated data, and my final output was doomed to grow more and more irrelevant as time wore on. Since then, I have always written repeatable scripts for all my data cleanup tasks.

Each decision you make will have an impact on the final results, so it is important to keep clear documentation and to verify your assumptions and hypotheses wherever possible. Sometimes there will be explicit tests to perform; at other times, you may just need to look at the data—make a quick visualization, perform a simple calculation, or glance through a sample of records. Be cognizant of the potential to introduce error or bias. For example, you could remove a field that you don't think is important, but that would, in fact, have a meaningful impact on the final result.
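In practice, "looking at the data" can be as lightweight as a few pandas calls. A sketch, assuming pandas and a hypothetical my_data.csv:

```python
# Quick, informal sanity checks between pipeline steps: not formal
# tests, just ways to catch surprises early.
import pandas as pd

df = pd.read_csv("my_data.csv")  # hypothetical file
print(df.shape)         # did a transformation drop rows unexpectedly?
print(df.describe())    # ranges and counts for numeric columns
print(df.isna().sum())  # where values are missing
print(df.sample(5))     # eyeball a handful of records
```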
All of these precautions will strengthen confidence in your final outcomes and make them intelligible to your collaborators and other audiences.

The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results.

Data acquisition

The first step is to acquire the data that you will be using for your machine learning project. You may need to combine data from several different sources. There are many ways to acquire data, including downloading files, querying a database or API, or scraping web pages. Depending on the size of the source data and how it is made available, this can be a quick and simple step or the most challenging bottleneck in your pipeline. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. Having a raw, immutable copy of your initial dataset (or datasets) ensures that you can always go back to the beginning of your ML process and start over with exactly the same input. It will also save you from the possibility that the source data will change from beneath you, thereby compromising your ability to compare the outputs of different operations (for more on this, see the section on data storage). If possible, it's often worthwhile to learn about how the original data was created, especially if you are getting data from multiple sources that differ in subtle ways.

Data preparation

Data preparation involves cleaning data and transforming it into an appropriate format for subsequent machine learning tasks. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models.

The first step of data preparation is to parse your acquired data and transform it into a common, usable schema. Acquired data often comes in file formats that are good for data sharing, such as XML, JSON, or CSV. You can parse these files into whatever schema makes sense to manage the various transformations you want to perform, but it can help to have a sense of where you are headed. Your eventual choice of data format will likely be dictated by your ML algorithms; likely candidates include multidimensional arrays, tensors, matrices, and DataFrames. Look ahead to specific functions in the specific libraries you plan to use, and see what type of input data is required. You don't have to use these same formats during your data preparations, though it can simplify the process.

Data cleanup and transformation is an art. Data is messy, and the messier the data, the harder it is to analyze and uncover underlying patterns. Yet we are only human, and perfect data is far beyond our reach. To strike a workable balance, focus on those cleanup tasks that you know (or strongly suspect) will have a significant impact on the final product. Cleanup and transformation operations include removing punctuation or stopwords from textual data, standardizing date and number formats, replacing missing or dummy values with a meaningful default, and excluding data that is known to be erroneous or atypical.
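As a rough illustration of a few of those operations in pandas (the column names and the chosen default are invented for the example, not drawn from the chapter):

```python
import pandas as pd

df = pd.read_csv("my_data.csv")  # hypothetical file

# Standardize date formats; unparseable values become NaT.
df["pub_date"] = pd.to_datetime(df["pub_date"], errors="coerce")

# Replace missing or dummy values with a meaningful default.
df["subject"] = df["subject"].replace({"N/A": "unknown"}).fillna("unknown")

# Exclude data known to be erroneous or atypical.
df = df[df["page_count"] > 0]

# Remove punctuation from a text column.
df["title"] = df["title"].str.replace(r"[^\w\s]", "", regex=True)
```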
You will select relevant data points, and you may need to represent them in a new way: a birth date becomes an age range; a place name becomes geo-coordinates; a text document becomes a word density vector. There are many possible normalizations to perform, depending on your dataset and which algorithm(s) you plan to use. It's not a bad idea to ensure that there's a genuinely unique identifier for each record (even if you don't see an immediate need for one). This is also a good time to reflect on any biases that might be inherent in your data, and whether or not you can adjust for them; even if you cannot, understanding how they might impact the ML process will help you conduct a more nuanced analysis and frame your final results. At the very least, you can record biases in the documentation so that future researchers will be aware of them and react accordingly.

As you become more familiar with the data, you will likely hone your cleanup process and iterate through the steps multiple times. The more you can learn about the data, the better your preparations will be. During the data preparation phase, practitioners often make use of visualizations and query frameworks to picture their data holistically, identify patterns, and find errors or outliers. Some ML tools support these features out of the box, or are intentionally interoperable with external query and visualization tools. For a lightweight option, consider spreadsheet or notebook software. Depending on your use case, it may be worthwhile to put your data into a temporary database or search index so that you can make use of a more sophisticated query interface.

Model training and testing

During the training and testing phase, you will build multiple models and determine which one gives you the best results. One of the main ways you will tune your model is by trying multiple combinations of hyperparameters. A hyperparameter is a value that you set before you run the learning process, and it shapes how the learning process works. Hyperparameters control things like the number of learning cycles an algorithm will iterate through, the number of layers in a neural net, the characteristics of a cluster, or the number of decision trees in a forest. Often, you will also want to circle back to your data preparation steps to try different configurations, apply new enhancements, or address new problems and particularities that you've uncovered.

The process is deceptively simple: try out different configurations until you get a good result. The challenge comes when you try to define what constitutes a good (or good-enough) result. Measuring the quality of a machine learning model takes finesse. Start by asking: What would you expect to see if the model learned perfectly? Equally important, what would you expect to see if the model didn't learn anything at all? You can often utilize randomness as a stand-in for no learning, e.g., "if a result was selected at random, the probability of the desired outcome would be X." These two questions will help you to set benchmarks at both extremes of the realm of possible outcomes. Perfection is elusive, and the return on investment dwindles after a while, so be prepared to stop training once you've arrived at an acceptably good model.
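One way to make the "no learning" benchmark concrete, assuming scikit-learn, is a DummyClassifier, which guesses without ever looking at the features; any model worth keeping should comfortably beat it. The dataset here is a stock example, not library data.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The baseline predicts classes at random, a stand-in for no learning.
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(X_train, y_train)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(accuracy_score(y_test, baseline.predict(X_test)))  # near chance
print(accuracy_score(y_test, model.predict(X_test)))     # should clear the baseline by a wide margin
```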
In a supervised learning problem, the dataset is split into training and testing datasets. The algorithm uses the training data to "learn" a set of rules that it can subsequently apply to new, unseen data to predict the outcome. The testing dataset (also called a validation dataset) is used to test how well the model performs. Often, a third dataset is held out as well, reserved for final testing after the model has been trained. This third dataset provides an additional bulwark against bias and overfitting. Results are typically evaluated based on some statistical measurement that is directly relevant to your research question. In a classification problem, you might optimize for recall or precision. In a regression problem, you can use formulas such as the root-mean-square deviation to measure how well the regression line matches the actual data points. How you choose to optimize your model will depend on your specific context and priorities.

Testing an unsupervised model is not as straightforward, since there is no preconceived notion of correct and incorrect categorization. You can sometimes rely on a known pattern in the underlying dataset that you would reasonably expect to be reflected in a successful model. There may also be characteristics of the final model that indicate success. For example, if you are working with a clustering algorithm, models with dense, well-defined clusters are probably better than sparse clusters with vague boundaries. In unsupervised learning, you may want to hold back some portion of your data to perform an independent validation of your results, or you may use the entire dataset to build the model—it depends on what type of testing you want to perform.
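For clustering in particular, one quantitative stand-in for "dense, well-defined clusters" is the silhouette score, sketched here with scikit-learn on synthetic data (an illustrative assumption; your own features would replace the blobs):

```python
# Silhouette scores near 1 indicate dense, well-separated clusters;
# scores near 0 indicate vague, overlapping boundaries.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
# The k with the highest score marks a reasonable candidate model.
```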
Application of results

As the final step of your workflow, you will use your intelligent model to perform some task. Perhaps you will use it for scholarly analysis of a dataset, or perhaps you will integrate it into a software product. If it is the former, consider how to export any final data and preserve the artifacts of your project. If it is the latter, consider how the model, its outputs, and its continued maintenance will fit into existing systems and workflows. Planning for interoperability may influence decisions from tool selection to data formats and storage.

Immutable data storage

Immutable data storage can benefit the batch-processing ML pipeline, especially during the initial research and development phase. This type of data storage supports iteration and allows you to compare the results of many different experiments. Treating data as immutable means that after each significant change or set of changes to your data, you save a new snapshot of the dataset that is never edited or changed. It also allows you to be flexible and adaptive with your data model. Immutable data storage has become a popular choice for data-intensive or "big data" applications as a way to easily assemble large quantities of data, often from multiple sources, without having to spend time upfront crafting a strict data model. You may have heard the term "data lake" used to refer to such large, unstructured collections of data. This can be contrasted with a "data warehouse," which usually indicates a highly structured, centralized repository such as a relational database.

To demonstrate how immutable storage supports iteration and experimentation, consider the following scenario: You start with an input file my_data.csv, and then perform some cleanup operation over the data, such as converting all measurements in miles to kilometers, rounded to the nearest whole number. If you were treating your data as mutable, you might overwrite the original contents of my_data.csv with the transformed values. The problem with this approach comes if you want to test some alteration of your cleanup operation. Say, for example, you wanted to round all your conversions to the nearest tenth instead. Since you no longer have your original data, you would have to start the entire ML process from the top. If you instead treated your data as immutable, you would keep my_data.csv in its original state, and save the output of your cleanup operation in a new file, say my_clean_data.csv. That way, you could return to my_data.csv as many times as you wished, try different operations on this data, and easily compare the results of these operations knowing the source data was exactly the same for each one. Think of each immutable dataset as a place in your process that you can safely reset to anytime you want to try something new or correct for some bias or failure.
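A minimal sketch of that scenario with pandas; the column name distance_miles and the second output file name are invented stand-ins:

```python
import pandas as pd

raw = pd.read_csv("my_data.csv")  # the original is never overwritten

clean = raw.copy()
clean["distance_km"] = (clean["distance_miles"] * 1.60934).round(0)
clean.to_csv("my_clean_data.csv", index=False)

# Trying the alternative rounding is just another pass over the same
# pristine input, written to its own snapshot for comparison:
tenths = raw.copy()
tenths["distance_km"] = (tenths["distance_miles"] * 1.60934).round(1)
tenths.to_csv("my_clean_data_tenths.csv", index=False)
```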
To illustrate the benefits of a flexible data model, consider a mutable data store, such as a relational database. Before you put any data into the database, you would first need to design a system of tables with set fields and datatypes, and the relationships between those tables. This can feel like putting the cart before the horse, especially if you are starting with a dataset with which you are not yet intimately familiar, and you want the ability to experiment with different algorithms, all of which might require slightly different transformations on the original dataset. Revisiting the example in the previous paragraph, you might initially have defined your distance datatype as an integer (when you were rounding to the nearest whole number), and would later have to change it to a floating point number (when you were rounding to the nearest tenth). Making this change would mean altering the database schema and migrating all of the existing data to the new type, which is a nontrivial task—especially if you later decide to revert back to the original type. By contrast, if you were working with immutable CSV files, it would be much easier to write out two files, one with each data type, and keep whichever one ultimately proved most effective.

Throughout your ML process, you can create several incremental datasets that are essentially read-only. There's no one correct data storage format, but ideally you would use something simple and space-efficient with the capacity to interoperate with different tools, such as flat files (plain text files without extraneous markup, such as TXT, CSV, or Parquet). Even if your data is ultimately destined for a different kind of datastore, such as a relational database or triplestore, consider using simple, immutable storage as an intermediary to facilitate iteration and experimentation. If you're concerned about overwhelming your local drive, cloud storage is a good option, especially if you can read and write directly from your programs or software services.

One final benefit of immutable storage relates to scale. Batch processing workflows and immutable data storage work well with distributed data processing frameworks, such as MapReduce and Spark. Therefore, if you need to scale your ML project using distributed processing, the integration will be more seamless (for more, see the section on scaling up).

Organizing immutable data

Organizing immutable data stores can be a challenge, especially with multiple users. A little planning can save you from losing track of your experiments and results. A well-ordered directory structure, informative and consistent file names, liberal use of timestamps, and disciplined note-taking are simple but effective strategies. For example, say you were acquiring MARCXML records from an API feed, parsing out subject terms, and building a clustering algorithm around these terms. Let us explore one possible way that you could organize your data outputs through each step of the machine learning pipeline.

To enforce a naming convention, create a helper method that generates the output path for each run of a particular data process. This output path includes the date and timestamp of the run—that way you won't have to think about naming each individual file, and can avoid the phenomenon of a mess of files called my_clean_data.csv, my_cleaner_data.csv, my_final_cleanest_data.csv, etc. Your file path for the acquired data might be in the format:

myProject/acquisitions/marc_YYYYMMDD_HHMMSS.xml

In this case, "YYYYMMDD" represents the date and "HHMMSS" represents the timestamp. Your file path for prepared and cleaned data might be:

myProject/clean_datasets/subjects_YYYYMMDD_HHMMSS.csv

Finally, each clustering model you build could be saved using the file path pattern:

myProject/models/cluster_YYYYMMDD_HHMMSS

Following this general pattern, you can organize all of the outputs for your entire project. Using date and timestamps in the file name also enables easy sorting and retrieval of the most recent output.

For each data output, you will want to maintain a record of the exact input, any special attributes of the process (e.g., "this time I rounded decimals to the nearest hundredth"), and metrics that will help you determine success or failure of the process. If you can generate this information automatically for each process, all the better for ensuring an accurate record. One strategy is to include a second helper method in your program that will generate and write out a companion file to each data output. The companion file contains information that will help evaluate results, detect errors, perform optimizations, and differentiate between any two data outputs. In the example project, you could accompany the acquisition output with a text file detailing the exact API call used to fetch the data, the number of records acquired, and the runtime for the process. Keeping companion files as close as possible to their outputs helps prevent accidental separation, so save it to:

myProject/acquisitions/marc_YYYYMMDD_HHMMSS.txt

In this case, the date and timestamp should exactly match those of the companion XML file. When running processes that test and train models, you can include information in your companion file about hyperparameters and whatever metrics you are using to evaluate the quality of the model. In our example, the companion file to each cluster model may contain the file path for the cleaned input data, the number of clusters, and a measure of cluster variance.
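A sketch of what these two helper methods might look like in Python. The directory layout follows the example above; the API call, record count, and runtime recorded are hypothetical values.

```python
import json
from datetime import datetime
from pathlib import Path

def output_path(stage: str, name: str, ext: str) -> Path:
    """Generate a timestamped path such as
    myProject/acquisitions/marc_20201115_093042.xml"""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path("myProject") / stage / f"{name}_{stamp}.{ext}"
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

def write_companion(data_path: Path, info: dict) -> None:
    """Write a companion file sharing the data output's timestamped
    name, so the two cannot drift apart."""
    data_path.with_suffix(".txt").write_text(json.dumps(info, indent=2))

# Usage: record exactly how one acquisition run was produced.
out = output_path("acquisitions", "marc", "xml")
write_companion(out, {
    "api_call": "https://example.org/api/records?since=2012",  # hypothetical
    "record_count": 48213,
    "runtime_seconds": 1042,
})
```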
Working with machine learning algorithms

New technologies and software advances make machine learning more accessible to "lay" users, by which I mean those of us without advanced degrees in mathematics or data science. Yet the algorithms are complex, and you need at least an intuitive understanding of how they work if you hope to implement them correctly. I use the following three questions as a guide for understanding an algorithm. Keep in mind that any one project will likely make use of several complex algorithms along the way. These questions help ensure that I have the information I truly need, and avoid getting bogged down with details best left to mathematicians.

• What do the inputs and outputs of the algorithm mean? There are two parts to answering this question. First is the data structure, e.g., "this is a vector with 300 integers." Second is knowing what this data describes, e.g., "each vector represents a document, and each integer specifies the number of times a particular word appears in that document" (see the sketch following this list). You also need to be aware of specific implementation details—perhaps the input needs to be normalized in some way, perhaps the output has been smoothed (a technique that compensates for noisy data or outliers). This may seem straightforward, but it can be a lot to keep track of once you've gone through several layers of processing and abstraction.

• What effect do different hyperparameters have on the algorithm? Part of the machine learning process is tuning hyperparameters, or trying out multiple configurations until you get satisfying results. Part of the frustration is that you can't try every possible configuration, so you have to do some intelligent guesswork. Twiddling hyperparameters can feel enigmatic and unintuitive, since it can be difficult to predict their impact on the final outcome. The better you understand hyperparameters and their roles in the ML process, the more likely you are to make reasonable guesses and adjustments—though you should always be prepared for a surprise.

• Can you explain how this algorithm works to a layperson, and why it's beneficial to the project? There are two benefits to articulating a response to this question. First, it ensures that you really understand the algorithm yourself. And second, you will likely be called on to give this explanation to co-collaborators and other stakeholders. A good explanation will build excitement around the project, while a befuddling one could sow doubt or disinterest. It can be difficult to strike a balance between general summary and technical equations, since your stakeholders will likely include people with diverse backgrounds, so do your best and look for opportunities for people with different expertises to help refine your team's understanding of the algorithm.

Learning more about the underlying math can help you make better, more nuanced decisions about how to deploy the algorithm, and is fascinating in its own right—but in most cases I have found that the above three questions provide a solid foundation for machine learning research.
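As promised in the first question above, here is a small sketch of what inputs and outputs can mean, assuming scikit-learn's CountVectorizer (an illustrative choice): it exposes both the data structure and what each position in it describes.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(X.shape)                      # (2 documents, 7 distinct words)
print(vec.get_feature_names_out())  # which word each column stands for
print(X.toarray()[0])               # word counts for the first document
```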
Tool selection

Tool selection is an important part of your process and should be approached thoughtfully. A good approach is to articulate and prioritize the needs of your team, and make selections that meet those needs. I've listed some possible questions for consideration below, many of which you will recognize as general concerns for any tool selection process.

• What sorts of features and interfaces do the tools offer? If you require a specific algorithm, the ability to make data visualizations, or query interfaces, you can find tools to meet these specific needs.

• How well do the tools interoperate with one another, or with other parts of your existing systems? One of the advantages of a well-designed pipeline is that it will enable you to swap out software components if the need arises. For example, if your data is in a format that is interoperable with many systems, it frees you from being tied down to any specific tool.

• How do the tools align with the skill sets and comfort levels of your team? For example, consider what coding languages your collaborators know, and whether or not they have the capacity to learn a new one. If you have someone who is already a wiz with a preferred spreadsheet program, see if you can export data into a compatible file format.

• Are the tools stable, well-documented, and well-supported? Machine learning is a fast-changing field, with new algorithms, services, and software features being developed all the time. Something new and exciting that hasn't yet been road-tested may not be worth the risk if there is a more dependable alternative. Furthermore, there tends to be more scholarship, documented use cases, and tutorials for older, more widely adopted tools.

• Are you concerned about speed and scale? Don't get bogged down with these considerations if you're just trying to get a working pilot off the ground, but it can help to at least be aware of how problems are likely to manifest as your volume of data increases, or as you integrate into time-sensitive workflows.

You and your team can work through these questions and articulate additional requirements relevant to your specific context.

Scaling up

Scaling up in machine learning generally means that you need to work with a larger volume of data, or that you need processes to execute faster. Recent advances in hardware and software make the execution of complex computations magnitudes faster and more efficient than they were even a decade ago, and you can often achieve quite a bit by working on a personal computer. Yet time is valuable, and it can be difficult to iterate and experiment effectively when individual processes take too long to execute.

There are many ML software packages that can help you make efficient use of whatever hardware you have, including your personal computer. Some examples at the time of writing are Apache Spark, TensorFlow, scikit-learn, and Microsoft Cognitive Toolkit, each with their own strengths and applications. In addition to providing libraries for building and testing models, these software packages optimize algorithmic performance, memory resources, data throughputs, and/or parallel computations. They can make a remarkable difference in both processing speed and the amount of data you can comfortably handle. There are also services that allow you to submit executable code and data to the cloud for processing, such as Google AI Platform.

Managing your own hardware upgrades is not without challenge. You may be lucky enough to have access to a high-powered computer capable of accelerated processing. A common example is a computer with GPUs (graphics processing units), which break complex processes into many small tasks and run them in parallel. However, these powerful machines can be prohibitively expensive.
Another scaling technique is distributed or cluster computing, in which complex processes are distributed across multiple computers, often in the cloud. A cloud cluster can bring significant cost savings, but managing one requires specialized knowledge, and the learning curve can be rather steep. It is also important to note that different algorithms require different scaling techniques. Some clustering algorithms, for example, scale well with GPUs but not with distributed computing.

Even with the right hardware and software, scaling up can be a tricky business. ML processes tend to have dramatic spikes in memory or network use, which can tax your systems. Not all ML algorithms scale well, causing memory use or execution time to grow exponentially as more data is added. Sometimes you have to add additional, complexity-reducing steps to your pipeline to handle data at scale. Some of the more common machine learning languages, such as Python and R, execute relatively slowly, putting the onus on developers to optimize operations for efficiency. In anticipation of these and other challenges, it is often a good idea to start with a scaled-down pilot or proof of concept, and not to underestimate the time and resources necessary to scale up from there.

Conclusion

New technologies make it possible for more researchers and developers to leverage the power of machine learning. Building an effective machine learning system means supporting the entire workflow, from data acquisition to final analysis. Practitioners must be mindful of how each implementation decision and subjective choice—from the way you structure and store your data, to the algorithms you use, to the ways you validate your results—will impact the efficiency of operations and the quality of learned intelligence. This article has offered some practical guidelines for building ML systems with modular, repeatable processes and intelligible, verifiable results. There are many resources available for further research, both online and in your libraries, and I encourage you to consult with subject specialists, data scientists, mathematicians, programmers, and data engineers. May your data be clean, your computations efficient, and your results profound.

Further Reading

I include here a few suggestions for further reading on key topics. I have also found that in the fast-changing world of machine learning technologies, blogs, internet communities, and online classes can be a great source of information that is current, introductory, and/or geared toward practitioners.

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining. Boston: Pearson Addison Wesley. See chapter 2 for data preparation strategies. Later chapters introduce common classification and clustering algorithms.

Marz, Nathan, and James Warren. 2015. Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Shelter Island: Manning. "Part 1: Batch Layer" discusses immutable storage in depth.

Kleppmann, Martin. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Boston: O'Reilly. "Chapter 10: Batch Processing" is especially relevant if you are interested in scaling up.
11-prudhomme-taking ----

Chapter 11

Taking a Leap Forward: Machine Learning for New Limits

Patrice-Andre Prud'homme
Oklahoma State University

Introduction

Today, machines can analyze vast amounts of data and increasingly produce accurate results through the repetition of mathematical or computational procedures. With the increasing computing capabilities available to us, artificial intelligence (AI) and machine learning applications have made a leap forward. These rapid technological changes are inevitably influencing our interpretation of what AI can do and how it can affect people's lives. Machine learning models that are developed on the basis of statistical patterns from observed data provide new opportunities to augment our knowledge of text, photographs, and other types of data in support of research and education. However, "the viability of machine learning and artificial intelligence is predicated on the representativeness and quality of the data that they are trained on," as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada Las Vegas, asserts (2019, 14). With that in mind, these technologies and methodologies could help augment the capacity of archives and libraries to leverage their creation-value and minimize their institutional memory loss while enhancing the interdisciplinary approach to research and scholarship.

In this essay, I begin by placing artificial intelligence and machine learning in context, then proceed by discussing why AI matters for archives and libraries, and by describing the techniques used in a pilot automation project from the perspective of digital curation at Oklahoma State University Archives. Lastly, I end by challenging other areas in the library and adjacent fields to join in the dialogue, to develop a machine learning solution more broadly, and to explore opportunities that we can reap by reaching out to others who share a similar interest in connecting people to build knowledge.

Artificial Intelligence and Machine Learning: Why Do They Matter?

Artificial intelligence has seen a resurging interest in the recent past—in the news, in the literature, in academic libraries and archives, and in other fields, such as medical imaging, inspection of steel corrosion, and more. John McCarthy, American computer scientist, defined artificial intelligence as "the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable" (2007, 2). This definition has since been extended to reflect a deeper understanding of AI today and what systems run by computers are now able to do. Dr. Carmel Kent notes that "AI feels like a moving target," as we still need to learn how it affects our lives (2019). Within the last decades, the amazing jump in computing capabilities has been quite transformative, in that machines are increasingly able to ingest and analyze large amounts of data, and more complex data, to automatically produce models that can deliver faster and more accurate results (see SAS n.d. and Brennan 2019). Their "power lies in the fact that machines can recognize patterns efficiently and routinely, at a scale and speed that humans cannot approach," writes Catherine Nicole Coleman, digital research architect for Stanford University (2017).
A Paradigm Shift for Archives and Libraries

Within the context of university archives, this paradigm shift has been transforming the way we interpret archival data. Artificial intelligence, and specifically machine learning as a subfield of AI, has direct applications through pattern recognition techniques that predict the labeling values for unlabeled data. As the software analytics company SAS argues, it is "the iterative aspect of machine learning [that] is important because as models are exposed to new data, they are able to independently adapt. They learn from previous computations to produce reliable, repeatable decisions and results" (n.d.).

Case in point: how can we use machine learning to train machines and apply facial and text recognition techniques to interpret the sheer number of photographs and texts, in either analog or born-digital formats, held in archives and libraries? Combining automatic processes to assist in supporting inventory management with a focus on descriptive metadata, a machine learning solution could help alleviate time-consuming and relatively expensive metadata tagging tasks, and thus scale the process more effectively using relatively small amounts of data. However, the traditional approach of machine learning would still require a significant time commitment by archivists and curators to identify essential features to make patterns usable for data training. By contrast, deep learning algorithms are able "to learn high-level features from data in an incremental manner. This eliminates the need of domain expertise and hard core feature extraction" (Mahapatra 2018).

Deep learning has regained popularity since the mid-2000s due to the "fast development of high-performance parallel computing systems, such as GPU clusters" (Zhao 2019, 3213). Deep learning neural networks are more effective in feature detection, as they are able to solve complex problems such as image classification with greater accuracy when trained with large datasets. The challenge is whether archives and libraries can afford to take advantage of greater computing capabilities to develop sophisticated techniques and make complex patterns from thousands of digital works. The sheer size of library and archive datasets, such as university photograph collections, presents challenges to properly using these new, sophisticated techniques. As Jason Griffey writes, "AI is only as good as its training data and the weighting that is given to the system as it learns to make decisions. If that data is biased, contains bad examples of decision-making, or is simply collected in such a way that it isn't representative of the entirety of the problem set[…], that system is going to produce broken, biased, and bad outputs" (2019, 8). How can cultural heritage institutions ensure that their machine learning algorithms avoid such bad outputs?

Implications of Machine Learning

Machine learning has the potential to enrich the value of digital collections by building upon experts' knowledge. It can also help identify resources that archivists and curators may never have the time for, and at the same time correct assumptions about heritage materials. It can generate the necessary added value to support the mission of archives and libraries in providing a public good. Annie Schweikert states that "artificial intelligence and machine learning tools are considered by many to be the next step in streamlining workflows and easing workloads" (2019, 6).
For images, how can archives build a data-labeling pipeline into their digital curation workflow that enables machine learning of collections? With the objective being to augment knowledge and create value, how can archives and libraries "bring the skills and knowledge of library staff, scholars, and students together to design an intelligent information system" (Coleman 2017)? Despite the opportunities to augment knowledge from facial recognition, models generated by machine learning algorithms should be scrutinized so long as it is unclear how choices are made in feature selection. Machine learning "has the potential to reveal things …that we did not know and did not want to know," as Charlie Harper asserts (2018). It can also have direct ethical implications, leading to biased interpretations for nefarious motives.

Machine Learning and Deep Learning on the Grounds of Generating Value

In the fall of 2018, Oklahoma State University Archives began to look more closely at a machine learning solution to facilitate metadata creation in support of curation, preservation, and discovery. Conceptually, we envisioned boosting the curation of digital assets, setting up policies to prioritize digital preservation and access for education and research, and enhancing the long-term value of those data. In this section, I describe the parameters of automation and machine learning used to support inventory work and to experiment with face recognition models that add contextualization to digital objects. From a digital curation perspective, the objective is to explore ways to add value to digital objects for which little information is known, if any, in order to increase the visibility of archival collections.

What Started This Pilot Project?

Before proceeding, we needed to gain a deeper understanding of the large quantity of files held in the archives—both the types of data and the metadata. The challenge was that with so many files and so many formats, files become duplicated and renamed, doctored, and scattered throughout directories to accommodate different types of projects over time, making them hard to sift through due to sparse metadata tags that may have differed from one system to another. In short, how could we justify the value of these digital assets for curatorial purposes? How much could we rely on the established institutional memory within the archives? Lastly, could machine learning or deep learning applications help us build a greater capacity to augment knowledge? In order to optimize resources and systematically make sense of data, we needed to determine that machine learning could generate value, which in turn could help us more tightly integrate our digital initiatives with machine learning applications. Such applications would only be as effective as the data are good for training and the value we could derive from them.

Methodology and Plan of Action

First, we recruited two student interns to create a series of processes that would automatically populate a comprehensive inventory of all digital collections, including finding duplicate files by hashing. We generated the inventory by developing a process that could be universally adapted to all library digital collections, setting up a universal list of works and their associated metadata, with a focus on descriptive metadata, which in turn could support digital curation and discovery of archival materials—digitized analog materials and born-digital materials.
We developed a universal policy for digital archival collections, which would allow us to incorporate all forms of metadata into a single format to remedy inconsistencies in existing metadata. This first phase was critical in the sense that it would condition the cleansing and organizing of data. We could then proceed with the design of a face recognition database, with the intent to trace individuals featured in the inventoried works of the archives to the extent that our data were accurate. We utilized the Oklahoma State University Yearbook collections and other digital collections as authoritative references for other works, for the purpose of contextualization to augment our data capacity.

Second, we implemented our plan: we worked closely with the Library Systems team within a Windows-based environment; decided on Graphics Processing Unit (GPU) performance and cost, taking into consideration that training neural networks necessitates computing power; determined storage needs; and fulfilled other logistical requirements to begin the step-by-step process of establishing a pattern recognition database. We designed the database on known objects before introducing and comparing new data to contextualize each entry. With this framework, we would be able to add general metadata tags to a uniform storage system using deep learning technology.

Third, we applied Tesseract OCR to a series of archival image-text combinations from the archives to extract printed text from those images and photographs. "Tesseract 4 adds a new neural net (LSTM) [Long Short-Term Memory] based OCR engine which is focused on line recognition," while also recognizing character patterns ("Tesseract" n.d.). We were able to obtain successful output for the most part, with the exception of a few characters that were hard to detect due to pixelation and font types.
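The chapter does not say how Tesseract was invoked; one common route from Python is the pytesseract wrapper, sketched here with a hypothetical file name.

```python
# Extract printed text from an archival image using the Tesseract
# engine via pytesseract (assumes Tesseract is installed locally).
from PIL import Image
import pytesseract

image = Image.open("yearbook_page_1897.tif")  # hypothetical scan
print(pytesseract.image_to_string(image))

# Characters lost to pixelation or unusual fonts can sometimes be
# recovered by upscaling or binarizing the image before OCR.
```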
Fourth, we looked into object identifiers, keeping in mind that "When there are scarce or insufficient labeled data, pre-training is usually conducted" (Zhao 2019, 3215). Working through the inventory process, we knew that we would also need to label more data to grow our capacity. We chose to use ResNet 50, a smaller backbone version of Keras-RetinaNet, frequently used as a starting point for transfer learning. ResNet 152 was another implementation layer we used, as shown in Figure 11.1, which demonstrates the output of a training session, or epoch, for testing purposes. Keras is a deep learning network API (Application Programming Interface) that supports multiple back-end neural network computation engines (Heller 2019), and RetinaNet is a single, unified network consisting of a backbone network and two task-specific subnetworks used for object detection (Karaka 2019).

[Figure 11.1: ResNet 152 application using PASCAL VOC 2012]

We proceeded by first dumping a lot of pre-tagged information from pre-existing datasets into this neural network. We experimented with three open source datasets: PASCAL VOC 2012, a set including 20 object categories; Open Images Database (OID), a very large dataset annotated with image-level labels and object bounding boxes; and Microsoft COCO, a large-scale object detection, segmentation, and captioning dataset. With a few faces from the OID dataset, we could compare and see if a face was previously recognized. Expanding our process to data known from the archives collection, we determined facial areas and, more specifically, assigned bounding box regressions to feed into the facial recognition API, based on Keras code written in Python. The face recognition API is available via GitHub (https://github.com/ageitgey/face_recognition). It uses a method called Histogram of Oriented Gradients (HOG) encoding, which makes the actual face recognition process much easier to implement for individuals, because the encodings are fairly unique for every person, as opposed to encoding images and trying to blindly figure out which parts are faces based on our label boxes. Figure 11.2 illustrates our test, confirming from two very different photographs the presence of Jessie Thatcher Bost, the first female graduate from Oklahoma A&M College in 1897.

[Figure 11.2: Face recognition API test]
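A minimal sketch of that comparison step using the face_recognition API cited above; the file names are hypothetical stand-ins for archival photographs.

```python
import face_recognition

# A portrait with a known identity, and a photograph to search.
known = face_recognition.load_image_file("bost_portrait.jpg")
candidate = face_recognition.load_image_file("group_photo.jpg")

known_encoding = face_recognition.face_encodings(known)[0]  # assumes a face was found
candidate_encodings = face_recognition.face_encodings(candidate)

# Encodings are fairly unique per person; compare_faces flags which
# candidate faces plausibly match the known encoding.
matches = face_recognition.compare_faces(candidate_encodings, known_encoding)
print(matches)  # e.g. [False, True, False]
```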
Ren et al. stated that it is important to construct a deep and convolutional per-region object classifier to obtain good accuracy using ResNets (2015). Going forward, we could use the tool "as is" despite the low tolerance for accuracy, or instead try to establish large datasets of faces by training on our own collections in hopes of improving accuracy. We proceeded with utilizing the Oklahoma State University Yearbook collections, comparing image sets with other photographs that may include these faces. We look forward to automating more of these processes.

A Conclusive First Experiment

We can say that our first experiment developing a machine learning solution on a known set of archival data resulted in positive output, while recognizing that it is still a work in progress. For example, the model we ran for the pilot is not natively supported on Windows, which hindered team collaboration. In light of these challenges, we think that our experiment was a step in the right direction of adding value to collections by bringing in a new layer of discovery for hidden or unidentified content.

Above all, this type of work relies greatly on transparency. As Schweikert notes, "Transparency is not a perk, but a key to the responsible adoption of machine learning solutions" (2019, 72). More broadly, issues in transparency and ethics in machine learning are important concerns in the collecting and handling of data. In order to boost adoption and get more buy-in with this new type of discovery layer, our team intentionally shared information about the process to help add credibility to the work and foster a more collaborative environment within the library. The team also developed a Graphical User Interface (GUI) to search the inventory within the archives and ultimately grow the solution beyond the department.

Challenges and Opportunities of Machine Learning

Challenges

In a National Library of Medicine blog post, Patti Brennan points out "that AI applications are only as good as the data upon which they are trained and built" (2019), and having these data ready for analysis is a must in order to yield accurate results. Scaling of input and output variables also plays an important role in performance improvement when using neural network models. Jerome Pesenti, Head of AI at Facebook, states that "When you scale deep learning, it tends to behave better and to be able to solve a broader task in a better way" (2019). Clifford Lynch affirms that "machine learning applications could substantially help archives make their collections more discoverable to the public, to the extent that memory organizations can develop the skills and workflows to apply them" (2019). This raises the question of whether archives can also afford to create the large amount of data from print heritage materials, or refine their born-digital collections, in order to build the capacity to sustain the use of deep-learning applications. Granted, while the increasing volume of born-digital materials could help leverage this data capacity, it does not change the fact that all data will need to be ready prior to using deep learning. Since machine learning is only good so long as value is added, archives and libraries will need to think in terms of optimization as well, deciding when value-generated output justifies the cost of computing infrastructure and skilled labor. Besides value, operations such as storing and ensuring access to these data are just as important considerations in making machine learning a feasible endeavor.

Opportunities

Investment in resources is also needed for interpreting results, in that "results of an AI-powered analysis should only factor into the final decision; they should not be the final arbiter of that decision" (Brennan 2019). While this could be a challenge in itself, it can also be an opportunity when machine learning helps minimize institutional memory loss in archives and libraries (e.g., when long-time archivists and librarians leave the institution). Machine learning could supplement practices that are already in place—it may not necessarily replace people—and at the same time generate metadata for the access and discovery of collections that people may never have the time to get to otherwise. But we will still need to determine accuracy in results. As deep learning applications will only be as effective as the data, archives and libraries should expand their capacity by working with academic departments and partnering with university supercomputing centers or other high-performance computing environments across consortium aggregating networks. Such networks provide a computing environment with greater data capacity and more GPUs. Along similar lines, there are opportunities to build upon Carpentries workshops and the communities of practice that surround this type of interest.

These growing opportunities could help boost the use of machine learning and deep learning applications to minimize our knowledge gaps about local history and the surrounding community, bringing together different types of data scattered across organizations. This increased capacity for knowledge could grow through collaborative partnerships, connecting people—scholars, computer scientists, archivists, and librarians—to share their expertise through different types of projects. Such projects could emphasize the multi- and interdisciplinary academic approach to research, including digital humanities and other forms or models of digital scholarship.

Conclusion

Along with greater computing capabilities, artificial intelligence could be an opportunity for libraries and archives to boost the discovery of their digital collections by pushing text and image recognition machine learning techniques to new limits.
Machine learning applications could help increase our knowledge of texts, photographs, and more, and determine their relevance within the context of research and education. They could minimize institutional memory loss, especially as long-time professionals leave the profession. However, these applications will only be as effective as the data are good for training and for the added value they generate.

At Oklahoma State University, we took a leap forward in developing a machine learning solution to facilitate metadata creation in support of curation, preservation, and discovery. Our experiment with text extraction and face recognition models generated conclusive results within one academic year with two student interns. The team was satisfied with the final output, and so was the library as we reported on our work. Again, it is still a work in progress, and we look forward to taking another leap forward.

In sum, it will be organizations' responsibility to build their data capacity to sustain deep learning applications and justify their commitment of resources. Nonetheless, as Oklahoma State University's face recognition initiative suggests, these applications can augment archives' and libraries' support for multi- and interdisciplinary research and scholarship.

References

Brennan, Patti. 2019. "AI is Coming. Are the Data Ready?" NLM Musings from the Mezzanine (blog). March 26, 2019. https://nlmdirector.nlm.nih.gov/2019/03/26/ai-is-coming-are-the-data-ready/.

Coleman, Catherine Nicole. 2017. "Artificial Intelligence and the Library of the Future, Revisited." Stanford Libraries (blog). November 3, 2017. https://library.stanford.edu/blogs/digital-library-blog/2017/11/artificial-intelligence-and-library-future-revisited.

"Face Recognition." n.d. Accessed November 30, 2019. https://github.com/ageitgey/face_recognition.

Griffey, Jason, ed. 2019. "Artificial Intelligence and Machine Learning in Libraries." Special issue, Library Technology Reports 55, no. 1 (January). https://journals.ala.org/index.php/ltr/issue/viewIssue/709/471.

Harper, Charlie. 2018. "Machine Learning and the Library or: How I Learned to Stop Worrying and Love My Robot Overlords." Code4Lib, no. 41 (August). https://journal.code4lib.org/articles/13671.

Heller, Martin. 2019. "What is Keras? The Deep Neural Network API Explained." InfoWorld (website). January 28, 2019. https://www.infoworld.com/article/3336192/what-is-keras-the-deep-neural-network-api-explained.html.

Karaka, Anil. 2019. "Object Detection with RetinaNet." Weights & Biases (website). July 18, 2019. https://www.wandb.com/articles/object-detection-with-retinanet.

Kent, Carmel. 2019. "Evidence Summary: Artificial Intelligence in Education." European EdTech Network. https://eetn.eu/knowledge/detail/Evidence-Summary-Artificial-Intelligence-in-education.

Lynch, Clifford. 2019. "Machine Learning, Archives and Special Collections: A High Level View." International Council on Archives Blog. October 1, 2019. https://blog-ica.org/2019/10/02/machine-learning-archives-and-special-collections-a-high-level-view/.

Mahapatra, Sambit. 2018. "Why Deep Learning over Traditional Machine Learning?" Towards Data Science (website). March 21, 2018. https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063.

McCarthy, John. 2007. "What is Artificial Intelligence?" Professor John McCarthy (website). Revised November 12, 2007. http://jmc.stanford.edu/articles/whatisai/whatisai.pdf.
Padilla, Thomas. 2019. Responsible Operations: Data Science, Machine Learning, and AI in Libraries. Dublin, OH: OCLC Research. https://doi.org/10.25333/xk7z-9g97.

Pesenti, Jerome. 2019. "Facebook's Head of AI Says the Field Will Soon 'Hit the Wall.'" Interview by Will Knight. Wired (website). December 4, 2019. https://www.wired.com/story/facebooks-ai-says-field-hit-wall/.

Ren, Shaoqing, Kaiming He, Ross Girshick, Xiangyu Zhang, and Jian Sun. 2015. "Object Detection Networks on Convolutional Feature Maps." IEEE Transactions on Pattern Analysis and Machine Intelligence 39, no. 7 (April).

SAS. n.d. "Machine Learning: What It Is and Why It Matters." Accessed December 17, 2019. https://www.sas.com/en_us/insights/analytics/machine-learning.html.

Schweikert, Annie. 2019. "Audiovisual Algorithms, New Techniques for Digital Processing." Master's thesis, New York University. https://www.nyu.edu/tisch/preservation/program/student_work/2019spring/19s_thesis_Schweikert.pdf.

"Tesseract OCR." n.d. Accessed December 11, 2019. https://github.com/tesseract-ocr/tesseract.

Zhao, Zhong-Qiu, Peng Zheng, Shou-tao Xu, and Xindong Wu. 2019. "Object Detection with Deep Learning: A Review." IEEE Transactions on Neural Networks and Learning Systems 30, no. 11: 3212–3232.
09-lesk-fragility ----

Chapter 9

Fragility and Intelligibility of Deep Learning for Libraries

Michael Lesk
Rutgers University

Introduction

On February 7, 2018, Mounir Mahjoubi, then the “digital minister” of France (le secrétariat d’État chargé du Numérique), told the civil service to use only computer methods that could be understood (Mahjoubi 2018). To be precise, what he actually said to l’Assemblée Nationale was:

Aucun algorithme non explicable ne pourra être utilisé.

I gave this to Google Translate and asked for it in English. What I got (on October 13, 2019) was:

No algorithm that can not be explained can not be used.

That’s a long way from fluent English. As I count the “not” words, it’s actually reversed in meaning. But, what if I leave off the final period when I enter it in Google Translate? Then I get:

No non-explainable algorithm can be used

Quite different, and although only barely fluent, now the meaning is right. The difference was only the final punctuation on the sentence.1 This is an example of the fragility of an AI algorithm. The point is not that both translations are of doubtful quality. The point is that a seemingly insignificant change in the input produced such a difference in the output. In this case, the fragility was detected by accident.

1 In the months between my original queries in October 2019 and the final preparations for publication in November 2020, the algorithm has changed to produce the same translation with or without a period: “No non-explicable algorithm can be used.”

Machine learning systems have a set of data for training. For example, if you are interested in translation, and you have a large collection of text in both French and English, you might notice that the word truck in English appears where the word camion appears in French. And the system might “learn” this translation. It would then apply this in other examples; this is called generalization. Of course if you wish to translate French into British English, a preferred translation of camion is lorry. And if the context of your English truck is a US discussion of the wheels and axles underneath railway vehicles, the better French word is le bogie.

Deep learning enthusiasts believe that with enough examples, machine learning systems will be able to generalize correctly. There can be various kinds of failures: we can discuss both (a) problems in the scope of the training data and (b) problems in the kind of modeling done. If the system has sufficiently general input data so that it learns well enough to produce reliably correct results on examples it has not seen, we call it robust; robustness is the opposite of fragility. Fragility errors here can arise from many sources—for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews).
Or, the data may not have the scope of the real problem: if you train for “boat” based on ocean liners, don’t be surprised if the program fails on canoes.

In addition, there are also modeling issues. Suppose you use a very simple model, such as a linear model, for data that is actually perhaps quadratic or exponential. This is called “underfitting” and may often arise when there is not enough training data. The reverse is also possible: there may be a lot of training data, including many noisy points, and the program may decide on a very complex model to cover all the noise in the training data. This is called “overfitting” and gives you an answer too dependent on noise and outliers in your data. For example, 1998 was an unusually warm year, but the decline in world temperature for the next few years suggests it was noise in the data, not a change in the development of climate.

Fragility is also a problem in image recognition (“AI Recognition” 2017). Currently the most common technique for image recognition research projects is the use of convolutional neural nets. Recently, several papers have looked at how trivial modifications to images may impact image classification. Here (figure 9.1) is an image taken from (Su, Vargas, and Sakurai 2019). The original image class is in black and the classifier choice (and confidence) after adding a single unusual pixel are shown in blue, with the extraneous pixel in white. The images were deliberately processed at low resolution—hence the pixellation—to match the input requirement of a popular image classification program.

Figure 9.1: Examples of misclassification.

The authors experimented with algorithms to find the quickest single-pixel change that would deceive an image classifier. They were routinely able to fool the recognition software. In this example, the deception was deliberate; the researchers searched for the best place to change the image.

Bias and mistakes

We have seen a major change in the way we do machine learning, and there are real dangers involved. The current enthusiasm for neural nets risks the use of processes which cannot be understood, as Mahjoubi warned, and which can thus conceal methods we would not approve of, such as discrimination in lending or hiring. Cathy O’Neil has described this in her book Weapons of Math Destruction (2016).

There is much research today that seeks methods to explain what neural nets are doing. See Guidotti et al. (2018) for a survey. There is also a 2018 DARPA program on “Explainable AI.” Techniques used can include looking at the results over a range of input data and seeing if the neural net can be modeled by a decision tree, or modifying the input data to see which input elements have the greatest effect on the results, and then showing that to the user. For example, Mariusz Bojarski et al. describe a self-driving system that highlights what it thinks is important in what it is seeing (2017). However, this is generally research in progress, and it raises the question of whether we can trust the explanation generator.

Many popular magazines have discussed this problem; Forbes, for example, had an explanation of how the choice of datasets can produce a biased result without any deliberate attempt to do so (Taulli 2019). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019).
The MIT Media Lab hosts the Algorithmic Justice League, trying to stop organizations from building socially slanted systems. Similar thoughts come from groups like the Data and Society Research Institute or the AI Now Institute.

Again, the problems may be accidental or deliberate. The phrase “data poisoning” has been used to suggest malicious creation of training data or examples of data designed to deceive machine learning systems. There is now a DARPA research program, “Guaranteeing AI Robustness against Deception (GARD),” supporting research to learn how to stop trickery such as a demonstration of converting a traffic stop sign to a 45 mph speed limit with a few stickers (Eykholt et al. 2018). More generally, systems deciding whether to grant loans may be biased in ways that are discriminatory but nevertheless profitable.

Even if you want to detect AI mistakes, recognizing such problems is difficult. Often things will be wrong and we won’t know why. And even hypothetical (but perhaps erroneous) explanations can be very convincing; people easily believe plausible stories. I routinely give my students a paper that concludes that prior ownership of a cat prevents fatal myocardial infarctions; its result implies that cats are more protective than statin drugs (Qureshi et al. 2009). The students are very quick to come up with possibilities like “petting a cat is relaxing, relaxation reduces your blood pressure, and lower blood pressure decreases the risk of heart attacks.” Then I have to explain that the paper evaluates 32 possibilities (prior/current ownership × cats/dogs × 4 medical conditions × fatal/nonfatal) and you shouldn’t be surprised if you evaluate 32 chances and one is significant at the 0.05 level, which is only 1 in 20; with 32 independent tests, the chance of at least one spuriously “significant” result is 1 − 0.95^32, or roughly 80%. In this example, there is also the question of reverse causality: perhaps someone who is in ill health will decide he is too sick to take care of a pet, so that the poor health is not caused by the lack of a cat, but rather the poor health causes the absence of a cat.

Sometimes explanations can help, as in a machine learning program that was deliberately trained to distinguish images of wolves and dogs but was trained using pictures of wolves that always contained snow and pictures of dogs that never did (Ribeiro, Singh, and Guestrin 2016). Without explaining that, 10 of 27 subjects thought the classifier was trustworthy; after pointing out the snow, only 3 of 27 subjects believed the system. Usually you don’t get such a clear presentation of a mis-trained system.

Recognition of problems

Can we tell when something is wrong? Here’s the result of a Google Photo merge of three other photos: two landscapes and a picture of somebody’s friend. The software was told to make a panorama and stitched the images together (Peng 2018). It looks like a joke, and even made it into a list of top jokes on reddit. The author’s point was that the panorama system didn’t understand basic composition: people are not the same scale as mountains.

Figure 9.2: Panoramic landscape.

Often, machine learning results are overstated. Google Flu Trends was acclaimed for several years and then turned out to be undependable (Lazer et al. 2014). A study that attempted to compare the performance of machine learning systems for medical diagnosis with actual doctors found that of over 20,000 papers analyzed, only a few dozen had data suitable for an evaluation (Liu et al. 2019).
The results claimed comparable accuracy, but virtually none of the papers presented adequate data to support that conclusion. Unusually promising results are sometimes the result of overfitting (Brownlee 2018); this is what was wrong with Google Flu Trends. A machine learning program can learn a large number of special cases and then find that the results do not generalize. In other cases problems can result from using “clean” data for training, and then encountering messier data in applications. Ideally, training and testing data should be from the same dataset and divided at random, but it can be tempting to start off with examples that are the result of initial and higher quality data collection.

Sometimes in the past we had a choice between modeling and data for predictions. Consider, for example, the problem of guessing what the weather will be tomorrow. We now do this based on a model of the atmosphere that uses the Navier-Stokes equations; we use supercomputers and derive tomorrow’s atmosphere from today’s (Christensen 2015). What did we do before we had supercomputers? Solving those equations by hand is impractical. One of the methods was “prediction by analogy”: find some day in the past whose weather was most similar to today. Suppose that day is Oct. 20, 1970. Then use October 21, 1970 as tomorrow’s prediction. Prediction by analogy doesn’t require you to have a model or use advanced mathematics. In this case, however, it doesn’t work as well—partly because we don’t have enough past days to choose from, and we only get new days at the rate of one per day.

In fact, Huug van den Dool estimated the amount of past data needed to make accurate predictions by analogy as 10^30 years’ worth, which is far more than the age of the universe (Wilks 2008). The underlying problem is that the weather is very random. If your state lottery is properly run, it should be completely pointless to look at past winning numbers and try to guess the next one. The weather is not that random, but it has too much variation to be solved easily by analogy. If your problem is very simple (tic-tac-toe) you could indeed write down each position and what the best next move is; there are only about 255,000 games.

To deal with more realistic problems, much of machine learning research is now focused on obtaining larger training sets. Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, “more data beats better algorithms” (Halevy, Norvig, and Pereira 2009). In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, “The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets. The basic learning and decoding algorithms have not changed substantially in 40 years” (2014). Nevertheless, speech recognition has gone from frustration to useful products such as dictation software or home appliances.

Lacking a model, however, means that we won’t know the limits of the calculations being done. For example, if you have some data that looks quadratic, but you fit a linear model, any attempt at extrapolation is fraught with error. If you are using a “black box” system, you don’t know when this is happening. And, regrettably, many of the AI software systems are sold as black boxes where the purchasers and users do not have access to the process, even if they are imagined to be able to understand it.
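To make that danger concrete, here is a minimal sketch, not from the chapter, using invented data and scikit-learn: a linear model is fit to data that is actually quadratic, and while it interpolates tolerably, its extrapolations are badly wrong.

# sketch: fit a linear model to quadratic data, then extrapolate
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic training data: y = x^2 plus a little noise, for x in [0, 10]
rng = np.random.default_rng(42)
x_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = x_train.ravel() ** 2 + rng.normal(0, 5, 50)

# the (underfitting) linear model
model = LinearRegression().fit(x_train, y_train)

# interpolation: inside the training range, the error is modest
print(model.predict([[5]]))   # roughly 33, versus a true value of 25

# extrapolation: outside the training range, the error explodes
print(model.predict([[20]]))  # roughly 183, versus a true value of 400

Nothing in the fitted model itself warns you that the second prediction is an extrapolation; that knowledge has to come from outside the black box.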
What’s changing

Many AI researchers are sensitive to the risks, especially given the publicity over self-driving cars. As the hype over “deep learning” built up, writers discussed examples such as a Pittsburgh medical system that proposed to send patients with both pneumonia and asthma home, because the computer had not understood that patients with both problems were actually being sent to the ICU (Bornstein 2016; Caruana et al. 2015).

Many people work on ways of explaining or presenting neural net software (Harley 2015). Most important, perhaps, are new EU regulations that prohibit automated decision making that affects EU citizens, and provide a “right of explanation” (Metz 2016). We recognize that systems which don’t rely on a mathematical model may be cheaper to build than ones where the coders understand what is going on. More serious is that they may be more accurate. This image is from the same article on understandability (Bornstein 2016).

Figure 9.3: Explainability.

If there really is a tradeoff between what will solve the problem and what can be explained, we know that many system builders will choose to solve the problem. And yet even having explanations may not be an answer; a key paper on interpretability discusses the complexities of meaning related to explanation, causality, and modeling (Lipton 2018).

Arend Hintze has noted that we do not always impose a demand for explanation on people. I can write that the New York Public Library main building is well proportioned and attractive without anyone expecting that I will recite its dimensions or the source of the marble used to construct it. And for some problems that’s fine: I don’t care how my camera decides on the focus distance to the subject. Where it matters, however, we often want explanations; the hard ethical problem, as noted before, is if better performance can be achieved in an inexplicable way.

Recommendations

2017 saw the publication of the “Asilomar AI principles” (2017). Two of these principles are:

• Safety: AI systems should be safe and secure throughout their operational lifetime, and verifiably so where applicable and feasible.
• Failure Transparency: If an AI system causes harm, it should be possible to ascertain why.

The problem is that the technology used to build many systems does not enable verifiability and explanation. Similarly the World Economic Forum calls for protection against discrimination but notes many ways in which technology can have unanticipated and undesirable effects as a result of machine learning (“How to Prevent” 2018).

Historically there has been and continues to be too much hype. An important image recognition task is distinguishing malignant and benign spots on mammograms. There have been promises for decades that computers would do this better than radiologists. Here are examples from 1995 (“computer-aided diagnosis can improve radiologists’ observational performance”) (Schmidt and Nishikawa) and 2009 (“The Bayesian network significantly exceeded the performance of interpreting radiologists”) (Burnside et al.). A typical recent AI paper to do this with convolutional neural nets reports 90% accuracy (Singh et al. 2020). To put this in perspective, the problem is complex, but some examples are more straightforward, and even pigeons can reach 85% (Levenson et al. 2015).
A serious recent review is “Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection” (Lehman et al. 2015). Very recently there was another claim that computers have surpassed radiologists (Walsh 2020); we will have to await evaluation. As with many claims of medical progress, replicability and evaluation are needed before doctors will be willing to believe them.

What should we do? Software testing generally is a decades-old discipline, and many basic principles of regression testing apply here also:

• Test data should cover the full range of expected input.
• Test data should also cover unexpected and even illegal input.
• Test data should include known past failures believed cleared up.
• Test data should exercise all parts of the program, and all important paths (coverage).
• Test data should include a set of data which is representative of the distribution of actual data, to be used for timing purposes.

It is difficult to apply these ideas in parts of the AI world. If the allowed input is speech, there is no exhaustive list of utterances which can be sampled. If a black-box commercial machine learning package is being used, there is no way to ask about coverage of any number of test cases. If a program is constantly learning from new data, there is no list of previously fixed failures to be collected that reflects the constantly changing program.

And obviously the circumstances of use matter. We may well, as a society, decide that forcing banks evaluating loan applications to use decision trees instead of deep learning is appropriate, so that we know whether illegal discrimination is going on, even if this raises the costs to the banks. We might also believe that the safest possible railway operation is important, even if the automated train doesn’t routinely explain how it balanced its choices of acceleration to achieve high punctuality and low risk.

What would I suggest? Organizationally:

• Have teams including both the computer scientists and the users.
• Collaborate with a statistician: they’ve seen a lot of these problems before.
• Work on easier problems.

As examples, I watched a group of zoologists with a group of computer scientists discussing how to improve accuracy at identifying animals in photographs. The discussion indicated that you needed hundreds of training examples at a minimum, if not thousands, since the animals do not typically walk up to the camera and pose for a full-frame shot. It was important to have both the people who understood the learning systems and the people who knew what the pictures were realistically like. The most amusing contribution by a statistician happened when a computer scientist offered a program that tried to recognize individual giraffes, and a zoologist complained that it only worked if you had a view of the right-hand side of the giraffe. Somebody who knew statistics perked up and said “it’s a 50% chance of recognizing the animal? I can do the math for that.” And it is simpler to do “is there any animal in the picture?” before asking “which animal is it?” and create two easier problems.

Technically:

• Try to interpolate rather than extrapolate: use the algorithm on points “inside” the training set (thinking in multiple dimensions).
• Lean towards feature detection and modeling rather than completely unsupervised learning (see the sketch after this list).
• Emphasize continuous rather than discrete variables.
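As a minimal illustration of the second of those recommendations (the data, the feature names, and the weights below are all invented), an interpretable model can at least report which inputs it relies on:

# sketch: ask a fitted model which input features carry the weight
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["credit_score", "income", "age"]

# synthetic applicants: the outcome depends mostly on the first feature
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1.0; credit_score should dominate here,
# and a feature the model never needed (age) should score near zero
for name, importance in zip(features, model.feature_importances_):
    print(name, round(importance, 2))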
I suggest using methods that involve feature detection, since that tells you what the algorithm is relying on. For example, consider the Google Flu Trends failure; the public was not told what terms were used. As David Lazer noted, some of them were just “winter” terms (like ‘basketball’). If you know that, you might be skeptical. More significant are decisions like jail sentences or college admissions; that racial or religious discrimination played no part can be verified by knowing that the program did not use those variables. Knowing what features were used can sometimes help the user: if you know that your loan application was downrated because of your credit score, it may be possible for you to pay off some bill to raise the score.

Sometimes you have to use categorical variables (what county do you live in?) but if you have a choice of how you phrase a variable, asking something like “how many minutes a day do you spend reading?” is likely to produce a better fit than asking people to choose “how much do you read: never, sometimes, a lot?” A machine learning algorithm may tell you how much of the variance each input variable explains; you can use that information to focus on the variables that are most important to your problem, and decide whether you think you are measuring them well enough.

Why not extrapolate? Sadly, as I write this in early April 2020, we are seeing all sorts of extrapolations of the COVID-19 epidemic, with expected US deaths ranging from 30,000 to 2 million, as people try to fit various functions (Gaussians, logistic regression, or whatever) with inadequately precise data and uncertain models. A simpler example is Mark Twain’s: “In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the ‘Old Oolitic Silurian Period,’ just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen” (1883).

Finally, note the advice of Edgar Allan Poe: “Believe nothing you hear, and only one half that you see.”

References

“AI Recognition Fooled by Single Pixel Change.” BBC News, November 3, 2017. https://www.bbc.com/news/technology-41845878.
“Asilomar AI Principles.” 2017. https://futureoflife.org/ai-principles/.
Bojarski, Mariusz, Larry Jackel, Ben Firner, and Urs Muller. 2017. “Explaining How End-to-End Deep Learning Steers a Self-Driving Car.” NVIDIA Developer Blog. https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.
Bornstein, Aaron. 2016. “Is Artificial Intelligence Permanently Inscrutable?” Nautilus 40 (1). http://nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable.
Brownlee, Jason. 2018. “The Model Performance Mismatch Problem (and What to Do about It).” Machine Learning Mastery. https://machinelearningmastery.com/the-model-performance-mismatch-problem/.
Burnside, Elizabeth S., Jessie Davis, Jagpreet Chhatwal, Oguzhan Alagoz, Mary J. Lindstrom, Berta M.
Geller, Benjamin Littenberg, Katherine A. Shaffer, Charles E. Kahn, and C. David Page. 2009. “Probabilistic Computer Model Developed from Clinical Data in National Mammography Database Format to Classify Mammographic Findings.” Radiology 251 (3): 663–72.
Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission.” In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 1721–30. New York: ACM Press. https://doi.org/10.1145/2783258.2788613.
Christensen, Hannah. 2015. “Banking on better forecasts: the new maths of weather prediction.” The Guardian, 8 Jan 2015. https://www.theguardian.com/science/alexs-adventures-in-numberland/2015/jan/08/banking-forecasts-maths-weather-prediction-stochastic-processes.
Eykholt, Kevin, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramèr, Atul Prakash, Tadayoshi Kohno, and Dawn Song. 2018. “Physical Adversarial Examples for Object Detectors.” 12th USENIX Workshop on Offensive Technologies (WOOT 18).
Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. “A Survey of Methods for Explaining Black Box Models.” ACM Computing Surveys 51 (5): 1–42.
Halevy, Alon, Peter Norvig, and Fernando Pereira. 2009. “The Unreasonable Effectiveness of Data.” IEEE Intelligent Systems 24 (2).
Harley, Adam W. 2015. “An Interactive Node-Link Visualization of Convolutional Neural Networks.” In Advances in Visual Computing, edited by George Bebis et al., 867–77. Lecture Notes in Computer Science. Cham: Springer International Publishing.
“How to Prevent Discriminatory Outcomes in Machine Learning.” 2018. White Paper from the Global Future Council on Human Rights 2016–2018, World Economic Forum. https://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning.
Huang, Xuedong, James Baker, and Raj Reddy. 2014.
“A Historical Perspective of Speech Recognition.” Communications of the ACM 57 (1): 94–103.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–1205.
Lehman, Constance, Robert Wellman, Diana Buist, Karl Kerlikowske, Anna Tosteson, and Diana Miglioretti. 2015. “Diagnostic Accuracy of Digital Screening Mammography with and without Computer-Aided Detection.” JAMA Intern Med 175 (11): 1828–1837.
Levenson, Richard M., Elizabeth A. Krupinski, Victor M. Navarro, and Edward A. Wasserman. 2015. “Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images.” PLoS One, November 18, 2015. https://doi.org/10.1371/journal.pone.0141357.
Lipton, Zachary. 2018. “The Mythos of Model Interpretability.” Communications of the ACM 61 (10): 36–43.
Liu, Xiaoxuan et al. 2019. “A Comparison of Deep Learning Performance against Health-Care Professionals in Detecting Diseases from Medical Imaging: a Systematic Review and Meta-Analysis.” Lancet Digital Health 1 (6): e271–97. https://www.sciencedirect.com/science/article/pii/S2589750019301232.
Mahjoubi, Mounir. 2018. “Assemblée nationale, XVe législature. Session ordinaire de 2017–2018.” Compte rendu intégral, Deuxième séance du mercredi 07 février 2018. http://www.assemblee-nationale.fr/15/cri/2017-2018/20180137.asp.
Metz, Cade. 2016. “Artificial Intelligence Is Setting Up the Internet for a Huge Clash with Europe.” Wired, July 11, 2016. https://www.wired.com/2016/07/artificial-intelligence-setting-internet-huge-clash-europe/.
O’Neil, Cathy. 2016. Weapons of Math Destruction. New York: Crown.
Peng, Tony. 2018. “2018 in review: 10 AI failures.” Medium, December 10, 2018. https://medium.com/syncedreview/2018-in-review-10-ai-failures-c18faadf5983.
Qureshi, A. I., M. Z. Memon, G. Vazquez, and M. F. Suri. 2009. “Cat ownership and the Risk of Fatal Cardiovascular Diseases. Results from the Second National Health and Nutrition Examination Study Mortality Follow-up Study.” Journal of Vascular and Interventional Neurology 2 (1): 132–5. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3317329.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “ ‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), 1135–1144. New York: ACM Press.
Schmidt, R. A. and R. M. Nishikawa. 1995. “Clinical Use of Digital Mammography: the Present and the Prospects.” Journal of Digital Imaging 8 (1 Suppl 1): 74–9.
Singh, Vivek Kumar et al. 2020. “Breast Tumor Segmentation and Shape Classification in Mammograms Using Generative Adversarial and Convolutional Neural Network.” Expert Systems with Applications 139.
Su, Jiawei, Danilo Vasconcellos Vargas, and Kouichi Sakurai. 2019. “One Pixel Attack for Fooling Deep Neural Networks.” IEEE Transactions on Evolutionary Computation 23 (5): 828–841.
Taulli, Tom. 2019. “How Bias Distorts AI (Artificial Intelligence).” Forbes, August 4, 2019. https://www.forbes.com/sites/tomtaulli/2019/08/04/bias-the-silent-killer-of-ai-artificial-intelligence/#1cc6f35d7d87.
Twain, Mark. 1883. Life on the Mississippi. Boston: J. R. Osgood & Co.
Tugend, Alina. 2019. “The Bias Embedded in Tech.” The New York Times, June 17, 2019, section F, 10.
Walsh, Fergus. 2020. “AI ‘outperforms’ doctors diagnosing breast cancer.” BBC News, January 2, 2020. https://www.bbc.com/news/health-50857759.
Wilks, Daniel S. 2008. Review of Empirical Methods in Short-Term Climate Prediction, by Huug van den Dool. Bulletin of the American Meteorological Society 89 (6): 887–88.

10-morgan-bringing ----

Chapter 10

Bringing Algorithms and Machine Learning Into Library Collections and Services

Eric Lease Morgan
University of Notre Dame

Seemingly revolutionary changes

At the time of their implementation, some changes in the practice of librarianship were deemed revolutionary, but nowadays some of these same changes are deemed matter of fact. Take, for example, the catalog. During much of the Middle Ages, a catalog was more akin to a simple acquisitions list. By 1548 the first author, title, subject catalog was created (LOC 2017, 18). These catalogs morphed into books, books which could be mass produced and distributed. But the books were difficult to keep up to date, and they were expensive to print. As a consequence, in the early 1860s, the card catalog was invented by Ezra Abbot, and the catalog eventually became a massive set of drawers (82). Unfortunately, because of the way catalog cards are produced, it is not feasible to assign more than three or four subject headings to any given book. If one does, then the number of catalog cards quickly gets out of hand. In the 1870s, the idea of sharing catalog cards between libraries became common, and the Library of Congress facilitated much of the distribution (LOC 2017, 87). In 1965 and with the advent of computers, the idea of sharing cataloging data as MARC (machine readable cataloging) became prevalent (Crawford 1989, 204). The data structure of a MARC record is indicative of the time. Intended to be distributed on reel-to-reel tape, the MARC record is a sequential data structure designed to be read from beginning to end, complete with checks and balances ensuring the record’s integrity. Despite the apparent flexibility of a digital data structure, the tradition of three or four subject headings per book still holds true.
Nowadays, the data from MARC records is used to fill databases, the databases’ content is indexed, and items from the library collection are located by searching the index. The evolution of the venerable library catalog has spanned centuries, each evolutionary change solving some problems but creating new ones.

With the advent of the Internet, a host of other changes are (still) happening in libraries. Some of them are seen as revolutionary, and only time will tell whether or not these changes will persevere. Examples include but are not limited to:

• the advocacy of alt-metrics and open access publications
• the continuing dichotomy of the virtual library and library as place
• the creation and maintenance of institutional repositories
• the existence of digital scholarship centers
• the increasing tendency to license instead of own content

Many of the traditional roles of libraries are not as important as they used to be. That does not mean the roles are unimportant, just not as important. Like many other professions, librarianship is exploring new ways to remain relevant when many of its core functions are needed by fewer people.

Working smarter, not harder

Beyond automation, librarianship has not exploited computer technology. Despite the fact that libraries have the world of knowledge at their fingertips, libraries do not operate very intelligently, where “intelligently” is an allusion to artificial intelligence.

Let’s enumerate the core functionalities of computers. First of all, computers…compute. They are given some sort of input, assign the input to a variable, apply any number of functions to the variable, and output the result. This process — computing — is akin to solving simple algebraic equations such as the area of a circle or a distance traveled. There are two factors of particular interest here. First, the input can be as simple as a number or a string (read: “a word”) or the input can be arbitrarily large combinations of both. Examples include:

• 42
• 1776
• xyzzy
• George Washington
• a MARC record
• the circulation history and academic characteristics of an individual
• the full text and bibliographic descriptions of all early American authors

What is really important is the possible scale of a computer’s input. Libraries have not taken advantage of that scale. Imagine how librarianship would change if the profession actively used the full text of its collections to enhance bibliographic description and resulting public service. Imagine how collection policies and patron needs could be better articulated if: 1) students, researchers, or scholars first opted in to have their records analyzed, and 2) the totality of circulation histories and journal usage histories were thoroughly investigated in combination with patron characteristics and data from other libraries.

A second core functionality of computers is their ability to save, organize, and retrieve vast amounts of data. More specifically, computers save “data” — mere numbers and strings. But when the data is given context, such as a number denoted as a date or a string denoted as a name, then the data is transformed into information. An example might include the birth year 1972 and the name of my pet, Blake. Given additional information, which may be compared and contrasted with other information, knowledge can be created — information put to use and understood.
For example, Mary, my sister, was born in 1951 and is therefore 21 years older than Blake. Computers excel at saving, organizing, and retrieving data which leads to information and knowledge. The possibilities of computers dispensing wisdom — knowledge of a timeless nature — are left for another essay.

Like the scale of computer input, the library profession has not really exploited computers’ ability to save, organize, and retrieve data; on the whole, the library profession does not understand the concept of a “data structure.” For example, tab-delimited files, CSV (comma-separated value) files, relational database schema, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures. Each has its own set of inherent strengths and weaknesses; there is no such thing as “one size fits all.” Through the use of data structures, computers store and retrieve information. Librarianship is about these same kinds of things, yet few librarians would be able to outline the differences between different data structures.

Again, data becomes information when it is given context. In the world of MARC, when a string (one or more “words”) is inserted into the 245 field of a MARC bibliographic record, then the string is denoted as a title. In this case, MARC is a “data structure” because different fields denote different contexts. There are fields for authors, subjects, notes, added entries, etc. This is all very well and good, especially considering that MARC was designed more than fifty years ago. But since then, many more scalable, flexible, and efficient data structures have been designed. Relational databases are a good example.

Relational databases build on a classic data structure known as the “table” — a matrix of rows and columns where each row is a record and each column is a field. Think “spreadsheet.” For example, each row may represent a book, with columns for authors, titles, dates, publishers, etc. The problem comes when a column needs to be repeatable. For example, a book may have multiple authors or, more commonly, multiple subjects. In this case the idea of a table breaks down because it doesn’t make sense to have a column named subject-01, subject-02, and subject-03. As soon as you do that, you will want subject-04. Relational databases solve this problem. The solution is to first add a “key” — a unique value — to each row. Next, for fields with multiple values, create a new table where one of the columns is the key from the first table and the other column is a value, in this case, a subject heading. There are now two tables, and they can be “joined” through the use of the key. Given such a data structure it is possible to add as many subjects as desired to any bibliographic item.
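Here is a minimal sketch of that key-and-join idea, using Python's built-in sqlite3 module; the table layout and the sample book are invented for illustration:

# sketch: one row per book, plus a second table holding any number
# of subject headings per book, joined through the book's key
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE books    ( id INTEGER PRIMARY KEY, title TEXT );
    CREATE TABLE subjects ( book_id INTEGER, subject TEXT );
""")

# one book, and as many subject headings as desired
db.execute("INSERT INTO books VALUES ( 1, 'Walden' )")
db.executemany(
    "INSERT INTO subjects VALUES ( 1, ? )",
    [("Nature",), ("Simplicity",), ("Solitude",), ("Economy",)],
)

# the join re-assembles the complete bibliographic record
for row in db.execute("""
    SELECT books.title, subjects.subject
    FROM books JOIN subjects ON books.id = subjects.book_id
"""):
    print(row)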
But you say, “MARC can handle multiple subjects.” True, MARC can handle multiple subjects, but underneath, MARC is a data structure designed for when information was disseminated on tape. As such, it is a sequential data structure intended to be read from beginning to end. It is not a random access structure. What’s more, the MARC data structure is really divided into three substructures: 1) the leader, which is always twenty-four characters long, 2) the directory, which denotes where each bibliographic field exists, and 3) the bibliographic section where the bibliographic information is actually stored. It gets more complicated. The first five characters of the leader are expected to be a left-hand, zero-padded integer denoting the length of the record measured in bytes. A typical value may be 01999. Thus, the record is 1999 bytes long. Now, ask yourself, “What is the maximum size of a MARC record?” (Since the length field holds at most five digits, the answer is 99,999 bytes, roughly 100 kilobytes.) Despite the fact that librarianship embraces the idea of MARC, very few librarians really understand the structure of MARC data.

MARC is a format for transmitting data from one place to another, not for organization. Moreover, libraries offer more than bibliographic information. There is information about people and organizations. Information about resource usage. Information about licensing. Information about resources that are not bibliographic, such as images or data sets. Etc. When these types of information present themselves, libraries fall back to the use of simple tables, which are usually not amenable to turning data into information.

There are many different data structures. XML became popular about twenty years ago. Since then JSON has become prevalent. More than twenty years ago the idea of Linked Data was presented. All of these data structures have various strengths and weaknesses. None of them is perfect, and each addresses different needs, but they are all better than MARC when it comes to organizing data. Libraries understand the concept of manifesting data as information, but as a whole, libraries do not manifest the concept using computer technology.

Finally, another core functionality of computers is networking and communication. The advent of the Internet is a relatively recent phenomenon, and the ubiquitous nature of computers combined with other “smart” devices has facilitated literally billions of connections between computers (and people). Consequently the data computed upon and stored in one place can be transmitted almost instantly to another place, and the transmission is an exact copy. Again, like the process of computing and the process of storage, efficient computer communication builds upon itself with unforeseen consequences. For example, who predicted the demise of many centralized information authorities? With the advent of the Internet there is less of a need/desire for travel agents, movie reviewers, or dare I say it, libraries. Yet again, libraries use the Internet, but do they actually exploit it? How many librarians are able to create a file, put it on the Web, and share the resulting URL? Granted, centralized computing departments and networking administrators put up road blocks to doing such things, but the sharing of data and information is at the core of librarianship. Putting a file on the ’Net, even temporarily, is something every librarian ought to be able to know how (and be authorized) to do.

Despite the functionality of computers and their place in libraries over the past fifty to sixty years, computers have mostly been used to automate library tasks. MARC automated the process of printing catalog cards and eventually the creation of “discovery systems.” Libraries have used computers to automate the process of lending materials between themselves as well as to local learners, teachers, and scholars. Libraries use computers to store, organize, preserve, and disseminate the gray literature of our time, and we call these systems “institutional repositories.” In all of these cases, the automation has been a good thing because efficiencies were gained, but the use of computers has not gone far enough nor really evolved.
Lending and usage statistics are not routinely harvested nor organized for the purposes of monitoring and predicting library patron needs/desires. The content of institutional repositories is usually born digital, but libraries have not exploited its full-text nature nor created services going beyond rudimentary catalogs. Computers can do so much more for libraries than mere automation.

While I will never say computers are “smart,” their fundamental characteristics do appear intelligent, especially when used at scale. The scale of computing has significantly changed in the past ten years, and with this change the concept of “machine learning” has become more feasible. The following sections outline how libraries can go beyond automation, embrace machine learning, and truly evolve their ideas of collections and services.

Machine learning: what it is, possibilities, and use cases

Machine learning is a computing process used to make decisions and predictions. In the past, computer-aided decision-making and predictions were accomplished by articulating large sets of if-then statements and navigating down decision trees. The applications were extremely domain specific, and they weren’t very scalable. Machine learning turns this process on its head. Instead of navigating down a tree, machine learning takes sets of previously made observations (think “decisions”), identifies patterns and anomalies in the observations, and saves the result as a mathematical model, which is really an n-dimensional array of vectors. Outside observations are then compared to the model and, depending on the resulting similarities or differences, decisions or predictions are drawn.

Using such a process, there are really only four different types of machine learning: classification, clustering, regression, and dimension reduction. Classification is a supervised machine learning process used to subdivide a set of observations into smaller sets which have been previously articulated. For example, suppose you had a few categories of restaurants such as American, French, Italian, or Chinese. Given a set of previously classified menus, one could create a model defining each category and then classify new, unseen menus. The classic classification example is the filtering of email. “Is this message ‘spam’ or ‘ham’?” This chapter’s appendix walks a person through the creation of a simplified classification system. It classifies texts based on authorship.

Clustering is almost always an unsupervised machine learning process which also creates smaller sets from a larger one, but clustering is not given a set of previously articulated categories. That is what makes it “unsupervised.” Instead, the categories are created as an end result. Topic modeling is a popular example of clustering.

Regression predicts a numeric value based on sets of independent variables. For example, given independent variables like annual income, education level, size of family, age, gender, religion, and employment status, one might predict how much money a person may spend on charity, the dependent variable.

Sometimes the number of characteristics of each observation is very large. Many times some of these characteristics do not play a significant role in decision-making or prediction. Dimension reduction is another machine learning process, and it is used to eliminate these less-than-useful characteristics from the observations. This process simplifies classification, clustering, or regression.
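Before turning to use cases, here is a minimal sketch of the clustering and dimension-reduction processes just described, using scikit-learn on four invented, menu-like "documents" (classification gets the fuller treatment in this chapter's appendix):

# sketch: vectorize a few tiny "documents", reduce the dimensions,
# and cluster the result into unlabeled groups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# four tiny stand-in "menus", two cuisines
documents = [
    "pasta olive oil basil tomato",
    "spaghetti tomato basil parmesan",
    "noodles soy sauce ginger sesame",
    "rice soy sauce ginger scallion",
]

# count / tabulate the words
vectors = CountVectorizer().fit_transform(documents)

# dimension reduction: boil the word counts down to two synthetic features
reduced = TruncatedSVD(n_components=2).fit_transform(vectors)

# unsupervised clustering into two groups; no labels are given,
# yet the two cuisines ought to separate
print(KMeans(n_clusters=2, n_init=10).fit_predict(reduced))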
Some possible use cases

There are many possible ways to enhance library collections and services through the use of machine learning. I’m not necessarily advocating the implementation of any of the following ideas, but they are possibilities. Each is grouped into the broadest of library functional departments:

• reference and public services
  – given a set of grant proposals, suggest library resources to be used in support of the grants
  – given a set of licensed library resources and their usage, suggest other resources for use
  – given a set of previously checked out materials, suggest other materials to be checked out
  – given a set of reference interviews, create a chatbot to supplement reference services
  – given the full text of a set of desirable journal articles, create a search strategy to be applied against any number of bibliographic indexes; answer the proverbial question, “Can you help me find more like this one?”
  – given the full text of articles as well as their bibliographic descriptions, predict and describe the sorts of things a specific journal title accepts or whether a given draft is good enough for publication
  – given the full text of reading materials assigned in a class, suggest library resources to support them
• technical services
  – given a set of multimedia, enumerate characteristics of the media (number of faces, direction of angles, number and types of colors, etc.), and use the results to supplement bibliographic description
  – given a set of previously cataloged items, determine whether or not the cataloging can be improved
  – given full-text content harvested from just about anywhere, analyze the content in terms of natural language processing, and supplement bibliographic description
• collections
  – given circulation histories, articulate more refined circulation patterns, and use the results to refine collection development policies
  – given the full text of sets of theses and dissertations, predict where scholarship at your institution is growing, and use the results to more intelligently build your just-in-case collection; do the same thing with faculty publications

Implementing any of these possible use cases would necessarily be a collaborative effort. Implementation requires an array of expertise. Enumerated in no priority order, this expertise includes: subject/domain expertise (such as cataloging trends, circulation services, collection strategies, etc.), computer programming and data management skills (such as Python, R, relational databases, JSON, etc.), and statistical modeling (an understanding of the strengths and weaknesses of different machine learning algorithms). The team would then need to:

1. articulate and share a common goal for the work
2. amass the data to model
3. employ a feature extraction process (lower-case words, extract a value from a database, etc.)
4. vectorize the features
5. create and evaluate the resulting model
6. go to Step #2 until satisfied
7. put the model into practice
8. go to Step #1; this work is never done

For example, to bibliographically connect grant proposals to library resources, try this:

1. use classification to sub-divide each of your bibliographic index descriptions
2. apply the resulting model to the full text of the grants
3. return a percentage score denoting the strength of each resulting classification
4. recommend the use of zero or more bibliographic indexes

To predict scholarship, try this:
1. amass the full text and bibliographic descriptions of all theses and dissertations
2. topic model the full text
3. evaluate the resulting topics
4. go to Step #2 until satisfied
5. augment the model’s matrix of vectors with bibliographic description
6. pivot the matrix on any of the given bibliographic fields
7. plot the results to see possible trends over time, trends within disciplines, etc.
8. use the results to make decisions

The content of the GitHub repository reproduced in this chapter’s appendix describes how to do something very similar in method to the previous example.1

1 See https://github.com/ericleasemorgan/bringing-algorithms.

Some real-world use cases

Here at the University of Notre Dame’s Navari Center for Digital Scholarship, we use machine learning in a number of ways. We cut our teeth on a system called Convocate.2 In this case we obtained a set of literature on the theme of human rights. Half of the set was written by researchers in non-governmental organizations. The other half was written by theologians. While both sets were on the same theme, the language of each was different. An excellent example is the use of the word “child.” In the former set, children were included in documents about fathers and mothers. In the latter set, children often referred to the “Children of God.” Consequently, queries referring to children were often misleading. To rectify this problem, a set of broad themes was articulated, such as Actors, Harms and Violations, Rights and Freedoms, and Principles and Values. We then used topic modeling to subdivide all of the paragraphs of all of the documents into smaller and smaller sets of paragraphs. We compared the resulting topics to the broad themes, and when we found correlations between the two, we classified the paragraphs accordingly. Because the process required a great deal of human intervention, and thus impeded subsequent updates, this process was not ideal, but we were learning and the resulting index is useful.

On a regular basis we find ourselves using a program called Topic Modeling Tool, which is a GUI/desktop application heavily based on the venerable MALLET suite of software.3 Given a set of plain text files and an integer, Topic Modeling Tool will create a weighted list of latent themes found in a corpus. Each theme is really a list of words which tend to cluster around each other, and these clusters are generated through the use of an algorithm called LDA (Latent Dirichlet Allocation). When it comes to topic modeling, there is no such thing as the correct number of topics. Just as in the traditional process of denoting what a corpus is about, there can be many distinct topics or there can be a few. Moreover, some of the topics may be large and others may be small. When using a topic modeler, it is important to iteratively configure and re-configure the input until the results seem to make sense.

Just like every other machine learning application, Topic Modeling Tool bases its “reasoning” on a matrix of vectors. Each row represents a document, and each column is a topic. At the intersection of a document row and a topic column is a score denoting how much the given document is “about” the calculated topic. It is then possible to sum each topic column and output a pie chart illustrating not only what the topics are, but how much of the corpus is about each topic. Such can be very insightful.
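The Topic Modeling Tool itself is a GUI application, but the underlying workflow, including the metadata "pivot" described next, can be sketched with scikit-learn and pandas. The documents and author names below are invented, and this is an analogous workflow, not the tool's actual code:

# sketch: topic model a handful of documents, then pivot the
# resulting document-topic matrix on an added authors column
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# four tiny stand-in "documents" and their (invented) authors
documents = [
    "whales ships harpoons and the sea",
    "the sea the ships and the voyage",
    "marriage manners estates and daughters",
    "daughters suitors and polite society",
]
authors = ["Melville", "Melville", "Austen", "Austen"]

# count / tabulate the words, then model two latent topics
vectors = CountVectorizer(stop_words="english").fit_transform(documents)
model = LatentDirichletAllocation(n_components=2, random_state=0)
matrix = model.fit_transform(vectors)  # one row per document, one column per topic

# add the metadata column, then "pivot" on it
scores = pd.DataFrame(matrix, columns=["topic I", "topic II"])
scores["author"] = authors
print(scores.groupby("author").mean())  # how much each author is "about" each topic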
By adding metadata to the matrix of vectors, even more insights can be garnered. Suppose you have a set of plain text files. Suppose also you know the names of the authors of each file. You can then do topic modeling against your corpus, and when the modeling is complete you can add a new column to the matrix and call it authors. Next, you update the values in the authors column with author names. Finally, you “pivot” the matrix on the authors column to calculate the degree each author’s works are “about” the calculated topics. This too can be quite insightful. Suppose you have works by authors A, B, C, and D. Suppose you have calculated topics I, II, III, and IV. By updating the matrix and pivoting the results, you might discover that author A discusses topic I almost exclusively, whereas author B discusses topics I, II, III, and IV in equal parts. This process works for just about any type of metadata: gender, genre, extent, dates, language, etc. What’s more, Topic Modeling Tool makes this process almost trivial. To learn how, see the GitHub repository accompanying this chapter.4

2 See https://convocate.nd.edu.
3 See https://github.com/senderle/topic-modeling-tool for the Topic Modeling Tool. See http://mallet.cs.umass.edu for MALLET.
4 https://github.com/ericleasemorgan/bringing-algorithms.

We have used classification techniques in at least a couple of ways. One project required the classification of press releases. Some press releases are deemed mandatory — declared necessary to publish. Other press releases are considered discretionary — published at the will of a company. The domain expert needed a set of 100,000 press releases classified into either mandatory or discretionary piles. We used a process very similar to the process outlined in this chapter’s appendix. In the end, the domain expert believes the classification process was 86% correct, and this was good enough for them. In another project, we tried to identify articles about a particular yeast (Cryptococcus neoformans), despite the fact that the articles never mentioned the given yeast. This project failed because we were unable to generate an accuracy score greater than 70%. This was deemed not good enough.

We are developing a high performance computing system called the Distant Reader, which uses machine learning to do natural language processing against an arbitrarily large volume of text. Given one or more documents of just about any number or type, the Distant Reader will:

1. amass the documents
2. convert the documents into plain text
3. do rudimentary counts and tabulations against the plain text
4. calculate statistically significant keywords against the plain text
5. extract narrative summaries against the plain text
6. use Spacy (a natural language processing library) to classify each and every feature of each and every sentence into parts-of-speech and/or named entities5
7. save the results of Steps #1 through #6 as plain text and tab-delimited files
8. distill the tab-delimited files into an SQLite database
9. create both narrative as well as tabular reports against the database
10. create an archive (.zip file) of everything
11. return the archive to the student, researcher, or scholar

5 See https://spacy.io.
The student, researcher, or scholar can then analyze the contents of the .zip file to get a better understanding of the corpus. This analysis ("reading") ranges from perusing the narrative reports, to using desktop tools to visualize the data, to exploiting command-line tools to investigate the data, to writing software which uses the data as input. The Distant Reader scales from a single scholarly report, to hundreds of book-length documents, to thousands of journal articles. Its purpose is to supplement the traditional reading process, and it uses machine learning techniques at its core.

Summary and Conclusion

Computers and libraries are a natural fit. They both excel at the collection, organization, and dissemination of data, information, and knowledge. Compared to most professions, the practice of librarianship has used computers for a very long time. But, for the most part, the functionality of computers in libraries has not been fully exploited. Advances in machine learning coupled with the data/information found in libraries present an opportunity for both librarianship and the people whom libraries serve. Machine learning can be used to enhance library collections and services, and with a modest investment of time as well as resources, the profession can make it a reality.

Appendix: Train and Classify

This appendix lists two Python programs. The first (train.py) creates a model for the classification of plain text files. The second (classify.py) uses the output of the first to classify other plain text files. For your convenience, the scripts and some sample data ought to be available in a GitHub repository.6 The purpose of including these two scripts is to help demystify the process of machine learning.

6. See https://github.com/ericleasemorgan/bringing-algorithms.

Train

The following Python script is a simple classification training application. Given a file name and a list of directories containing .txt files, this script first reads all of the files' contents and the names of their directories into sets of data and labels (think "categories"). It then divides the data and labels into training and testing sets. Such is a best practice for these types of programs so the models can be evaluated for accuracy. Next, the script counts and tabulates ("vectorizes") the training data and creates a model using a variation of the Naive Bayes algorithm. The script then vectorizes the test data, uses the model to classify the test data, and compares the resulting classifications to the originally supplied labels. The result is an accuracy score, and generally speaking, a score greater than 75% is on the road to success. A score of 50% is no better than flipping a coin. Finally, the model is saved to a file for later use.

# train.py - given a file name and a list of directories
# containing .txt files, create a model for classifying
# similar items

# require the libraries / modules that will do the work
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) < 4 :
    sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <model> <directory> <another directory>\n" )
    quit()

# get the name of the file where the model will be saved
model = sys.argv[ 1 ]

# get the rest of the input, the names of directories to process
directories = []
for i in range( 2, len( sys.argv ) ) :
    directories.append( sys.argv[ i ] )

# initialize the data to analyze and its associated labels
data = []
labels = []

# loop through each given directory
for directory in directories :

    # find all the text files and get the directory's name
    files = glob.glob( directory + "/*.txt" )
    label = os.path.basename( directory )

    # process each file
    for file in files :

        # open the file
        with open( file, 'r' ) as handle :

            # add the contents of the file to the data
            data.append( handle.read() )

        # update the list of labels
        labels.append( label )

# divide the data / labels into training sets and testing sets; a best practice
data_train, data_test, labels_train, labels_test = train_test_split( data, labels )

# initialize a vectorizer, and then count / tabulate the training data
vectorizer = CountVectorizer( stop_words='english' )
data_train = vectorizer.fit_transform( data_train )

# initialize a classification model, and then use Naive Bayes to create a model
classifier = MultinomialNB()
classifier.fit( data_train, labels_train )

# count / tabulate the test data, and use the model to classify it
data_test = vectorizer.transform( data_test )
classifications = classifier.predict( data_test )

# begin to test for accuracy
count = 0

# loop through each test classification
for i in range( len( classifications ) ) :

    # increment, conditionally
    if classifications[ i ] == labels_test[ i ] :
        count += 1

# calculate and output the accuracy score; above 75% begins to achieve success
print( "Accuracy: %s%%\n" % ( int( ( count * 1.0 ) / len( classifications ) * 100 ) ) )

# save the vectorizer and the classifier (the model) for future use, and done
with open( model, 'wb' ) as handle :
    pickle.dump( ( vectorizer, classifier ), handle )
exit()
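Assuming two directories of labeled training documents (the file and directory names here are hypothetical stand-ins), the script might be invoked as:

python train.py model.pkl ./mandatory ./discretionary

which prints an accuracy score and saves the pickled vectorizer and classifier to model.pkl.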
Classify

The following Python script is a simple classification program. Given the model created by the previous script (train.py) and a directory containing a set of .txt files, this script will output a suggested label ("classification") and a file name for each file in the given directory.

# classify.py - given a previously saved classification model and
# a directory of .txt files, classify a set of documents

# require the libraries / modules that will do the work
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) != 3 :
    sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <model> <directory>\n" )
    quit()

# get input; get the model to read and the directory containing the .txt files
model = sys.argv[ 1 ]
directory = sys.argv[ 2 ]

# read the model
with open( model, 'rb' ) as handle :
    ( vectorizer, classifier ) = pickle.load( handle )

# process each .txt file
for file in glob.glob( directory + "/*.txt" ) :

    # open, read, and classify the file
    with open( file, 'r' ) as handle :
        classification = classifier.predict( vectorizer.transform( [ handle.read() ] ) )

    # output the classification and the file's name
    print( "\t".join( ( classification[ 0 ], os.path.basename( file ) ) ) )

# done
exit()
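Again with hypothetical names, the second script might then be run as:

python classify.py model.pkl ./unclassified

which prints a tab-delimited classification and file name for each .txt file in the given directory.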
12-cohen-machine ----

Chapter 12

Machine Learning + Data Creation in a Community Partnership for Archival Research

Jason Cohen
Berea College

Mario Nakazawa
Berea College

Introduction: Cultural Heritage and Archival Preservation in Eastern Kentucky

In this chapter, two researchers, Jason Cohen and Mario Nakazawa, describe the contexts for an archivally focused project that emerged from a partnership between the Pine Mountain Settlement School (PMSS)1 in Harlan County, Kentucky, and scholars and students at Berea College. In this process, we have entered into a critical dialogue with our sources and knowledge production that Roopika Risam calls for in "self-reflexive" investigations in the digital humanities (2015, para. 16). Risam's intervention, nevertheless, does not explicitly distinguish questions of class and the concomitant geographic constraints that often accompany the economic and social disadvantages of poverty (Ahmed et al. 2018). Our work demonstrates how class and geography are tied, even in digital archives, to the need for reflexive and diverse approaches to humanist materials. For instance, a recent invited contribution to Proceedings of the IEEE articulates a need for diversity in computing and technology without mentioning class or region as factors shaping these related issues of diversity (Stephan et al. 2012, 1752–5). Given these constraints, perhaps it is also pertinent to acknowledge that the machine learning application we describe in this chapter is itself not particularly novel in scope or method—we describe our data acquisition and preparation, and two parallel implementations of commercially available tools for facial recognition. What stands out as unique are the ethical and practical concerns tied to bringing unique archival materials out of their local contexts into a larger conversation about computer vision as a tool that helps liberate, and at the same time possibly endanger, a subaltern cultural heritage.

1. See http://pinemountainsettlementschool.com.

In that light, we enter our archival investigation into what Bruno Latour has productively named "actor-network theory" (2007, 11–13) because, as we suggest below, our actions were highly conditioned not only by the physical and social spaces our research occupies and where its events occur, but also because the historical artifacts themselves act powerfully to shape our work in these contexts. Moreover, the partnership model of curation and archiving that we pursued in this project complicates the very concept of agency because the actions forming the project emerged from a continuing dialogue rather than any one decision or hierarchy. As we suggest later, a distributed model for decisions (Sabharwal 2015, 52–5) also revealed the limitations of using a participatory and identity-based model for archival development and management. Indeed, those historical artifacts will exert influence on this network of relations long after any one of us involved in the current project has ceased to pursue them.
When we came to this project, we asked a version of a classic question that has arisen in a variety of forms beginning with very early efforts by Bell Laboratories, among others, to translate data structures to suit the often flexible needs of humanist data: "what aspects of life are formalizable?" (Weizenbaum 1976, 12). We discovered that while an ontology may represent a formalized relationship of an archive to a database or finding aid, it also raises questions about the ethical implications of what information and embedded relationships can be adequately formalized by an abstract schema.

The Promises and Realities of Technology After Coal in Eastern Kentucky

Despite the longstanding threats of having to adapt to a post-coal economy, Harlan County, Kentucky continues to rely on coal and the mountains from which that coal is extracted as two of the cornerstones that shape the identity of the territory as well as the people who call it home. The mountains of Eastern Kentucky, like much of Appalachia, are by turns beautiful and devastated, and both authors of this essay have found conversations with Eastern Kentucky's citizens about the role the mountains play and the traditions that emerge from them both insightful and, at times, heartbreaking. This dramatic landscape, with its drastic challenges, may not sound like a place likely to find uses for machine learning. You would not be alone in that assumption.

Standing far from urban centers of technology and mobility, Eastern Kentucky combines deeply structural problems of generational poverty with a hard-won understanding that, since the moment of the region's colonization, outsiders have taken resources and made uninformed decisions about what the region needs, or where it should turn in order to gain a better purchase on the narrative of American progress, self-improvement, and the unavoidable allures of development-driven capitalism. Suspicion of outsiders is endemic here. And unfortunately, economic and social conditions have become increasingly acute concerns today: the high workplace injury rates associated with mining and extraction-related industries, the effects of the pharmaceutical industry's abuse of prescription opioids to treat a wide array of medical pain symptoms without treating the underlying causal conditions, and the systematic dismantling of federal- and state-level social support programs. But this trajectory is not new: when President Lyndon B. Johnson announced the beginning of the War on Poverty in 1964, he landed an hour away in Martin County, and subsequently drove through Harlan on a regional tour to inaugurate the initiative. Successive generations have sought to leave a mark, and all the while, the residents have been collecting their own local histories of their place. Our project, centered on recovering a latent social network of historical families represented by the images held in one local archive, mobilizes this tension between insiders' persistence and outsiders' interventions to think about how, as Bruno Latour puts it, we can "reassemble the social" while still respecting the local (2007, 191–2). PMSS occupies a unique position in this social and physical landscape: both local in its emplacement and attention, and a site of philanthropic work that attracted outside money as well as human and cultural capital, PMSS is at once of Harlan County and beyond it.
As we suggest in the later sections of this essay, PMSS's position, at once local and straddling regional boundaries, complicates the network we identified. More than that, however, its split position complicates the relationships of power and filiation embedded in its historical social network.

While an economy centered on coal continues to define the Eastern Kentucky regional identity, a second history can be told about this place and its people, one centered on resilience, independence, simplicity, and beauty, both of the land and its people. This second history has made outsiders' recent appeals for the region to court technology as a potential solution for what comes "after coal" particularly attractive to a region that prides itself on its capacity to sustain, outlast, and overcome obstacles. While that techno-utopian vision offers another version of the self-aggrandizing Silicon Valley bootstraps success story J.D. Vance narrates in Hillbilly Elegy (2016), like Vance's story itself, those narratives most often get told by outsiders to outsiders using regional stereotypes as the grounds for a sales pitch. In reality, however, those efforts have largely proven difficult to sustain, and at times, they have become the sources of potentially explosive accusations of fraud and malfeasance. Recently, for instance, organizations including Mined Minds2 have been accused by residents aiming to prepare for a post-coal economy of misleading students at the least, and of fraud at the worst. As with the timber, coal, and gas extraction industries that preceded these software development firms' aspirations, the promises of technology have not been kind to Eastern Kentucky, and in particular, as with those extraction industries that preceded them, the technological-industrial complex making its pitch in Kentucky's mountains has not returned resources to the region's residents whom the work was intended, at least nominally, to support (Hochschild 2018; Robertson 2019; Bailey 2017).

2. See http://www.minedminds.org/.

In this context of technology, culture, and the often controversial position machine learning occupies in generating obscure metrics for its classifiers that may embed bias, our project aims to activate PMSS's archival holdings and bring critical awareness to the question of how to actively engage with a paper archive of a local place as we venture further into our pervasively digital moment. The School operates today as a regional cultural heritage institution; it opened in 1913 as a residential school and operated as an educational institution until 1974, at which point it transformed itself into an environmental and cultural outreach institution focused on developing its local community and maintaining the richness of the region's cultural resources and heritage. Every year since 1974, PMSS has brought hundreds of students and citizens onto its campus to learn about nature and the landscape, traditional crafts and artistic practices, and musical and dance forms, among many other programs. Similarly, it has created a space for locals to come together for social events, community celebrations, and festival days, and at the same time, has become a destination for national-level events that create community from shared interests including foodways, wildflowers, traditional dance forms, and other wide-ranging attractions.
Project Background: Preserving Cultural Heritage in Harlan County

The archives of the Pine Mountain Settlement School emerge from its shifting history. The majority of its papers relate to its time as a traditional institution of education, including student records (which continue to be restricted for several reasons, including FERPA constraints, and personal and community interests in privacy), minutes of its board meetings (again, partially restricted), and financial and narrative accounts of its many activities across a year. The school's records are unique because they provide a snapshot, year by year and month by month, of the region's interests and challenges during key years of the 20th century, spanning the First World War to Vietnam. In addition, they detail the relations the School maintained with a philanthropic base of donors who helped to support and shape it, and, beyond its local relations, placed it into contact with a larger set of cultural interactions than a boarding school that relied on tuition or other profit-driven means to sustain its operations would have had. While the archival holdings continued to be informally developed by the school's directors and staff, who kept the official papers organized roughly by year, the archive itself sat largely neglected after 1974.

Beginning around the turn of the millennium, a volunteer archivist named Helen Wykle began digitizing items one by one, and soon hosted a curated selection of those digital surrogates, along with interpretive and descriptive narration, on a WordPress installation, The Pine Mountain Settlement School Collections.3 The PMSS Collections WordPress site has been continuously running and frequently updated by Wykle and the volunteer community members she has organized since 1999.4 Together with her collaborators and volunteers, Wykle has grown the WordPress site to over 2200 pages, including over 30,000 embedded images that include photographs and newspapers; scanned memos, meeting minutes and other textual material (in JPG and PDF formats); HTML transcriptions and bibliographies hard-coded into the pages; scanned images of 3-D collections objects like textile looms or wood carving tools; partially scanned runs of serial publications; and other composite visual material. None of those objects was hosted within a regular and complete metadata hierarchy or ontology: no regular scheme of fields or file-naming convention was followed, no controlled vocabulary was maintained, no object-types were defined, no specific fields were required prior to posting, and perhaps unsurprisingly as a result, the search and retrieval functions of the site had deteriorated noticeably.

3. See https://pinemountainsettlement.net/.

4. Jason Cohen and Mario Nakazawa wish to extend a note of appreciation to Helen Hays Wykle, Geoff Marietta, the former director of PMSS, and Preston Jones, its current director, for welcoming us and enabling us to access the physical archives at PMSS from 2016–20.

In 2016, Jason Cohen approached PMSS with the idea of using its archives as the basis for curricular development at Berea College.5 Working in collaboration beginning in 2017, Mario Nakazawa and Cohen developed two courses in digital and computational humanities, led a team-directed study in augmented reality in coordination with Pine Mountain, contributed
materials and methods for a new course in Appalachian Studies, and promoted the use of PMSS archival materials in several other extant courses in history and art history, among others. These new college courses each make use of PMSS historical documents as a shared core of visual and textual material in a digital and computational humanities concentration that clusters around critical archival and textual studies.6

5. Jason Cohen would like to recognize the support this project received from the National Endowment for the Humanities' "Humanities Connections" grant. See grant number AK-255299-17, description online at https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=AK-255299-17.

6. In the original version of the collaboration, we had planned also to teach basic computer programming to high school students during a summer program that also would have used that same set of materials, but with the paired departures of the original co-PI as well as the former director, that plan has thus far remained unfulfilled.

The success of that initial collaboration and course development seeded the potential in 2019–2021 for a Whiting Public Engagement fellowship7 focused on developing middle and high school curricula for use in Kentucky public schools with PMSS archival materials. That Whiting-funded project has generated over 80 lessons keyed to Kentucky state standards; these lessons are currently in use at nine schools across eight school districts, and each school is using PMSS materials to highlight its own regional and local interests. The work we have done with these archives has thus far reached the classrooms of at least eleven different middle and high school teachers, and as a result, touched over 450 students in eastern and central Kentucky public schools.

7. See https://www.whiting.org/content/jason-cohen.

We mention these numbers in order to demonstrate that our collaboration has been neither shallow nor fleeting. We have come to know these archives quite well, and because they are not adequately cataloged, the only way to get to know them is to spend time reading through the materials one page at a time. An ancillary consequence of this durable collaboration and partnership across the public-academic divide was the shared recognition, early in 2019, that the PMSS archival database and its underlying data structure (a flat SQL database generated by the WordPress interface) would provide inadequate stability for records management and quality control in future development. In addition, we discovered that the interpretive materials and metadata associated with the WordPress installation were also insufficient for linked metadata across the objects in this expanding digital archive, for reasons discussed below.

As partners, we decided together to migrate to a ContentDM instance hosted by the Kentucky Virtual Library,8 a consortium to which Berea College belongs, and which is open to future membership from PMSS. That decision led a team of Berea College undergraduate and faculty researchers to scrape the data from the PMSS archive site and supplement the images and transcriptions it contains with available textual metadata drawn from the site.9 Alongside the WordPress instance as our reference, we were also granted access to a Dropbox account that hosted higher resolution versions of the images featured on the blog. The scraper pulled over 19,228 unique images (and located over 11,000 duplicate images in the process), 732 document transcriptions for scanned texts on the site, and 380 subject and person bibliographies, including Library of Congress Subject Headings that had been hard-coded into the site's HTML. We also extracted the unique object identifiers and labels associated with each image, which in WordPress are not associated with the image objects themselves.

8. See https://kdl.kyvl.org/.

9. Jason Cohen wishes to thank Mario Nakazawa, Bethanie Williams, and Tradd Schmidt for undertaking this project with him. The GitHub repo for the PMSS scraper is hosted here: https://github.com/Tradd-Schmidt/PMSS_Scraper.
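For the curious, the shape of such a scraping pass can be sketched in a few lines of Python with the requests and BeautifulSoup libraries; this is a simplified, hypothetical illustration (the real scraper linked in the note above walks the entire site and handles many more cases), not a reproduction of it.

import requests
from bs4 import BeautifulSoup

# fetch a single page; the site root stands in for the thousands of
# pages the real scraper walked
page = requests.get( 'https://pinemountainsettlement.net/' )
soup = BeautifulSoup( page.text, 'html.parser' )

# harvest the addresses of embedded images
images = [ img[ 'src' ] for img in soup.find_all( 'img' ) if img.has_attr( 'src' ) ]

# harvest transcriptions and bibliographies hard-coded into the page
paragraphs = [ p.get_text( strip=True ) for p in soup.find_all( 'p' ) ]

print( len( images ), 'images;', len( paragraphs ), 'blocks of text' )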
We used that data to populate the ContentDM instance and returned a sparse but stable skeleton for future archival development. In the process, we also learned a great deal about how a future implementation of a controlled vocabulary, an image acquisition and processing pipeline, and object documentation standards should work in the next stages of our collaborative PMSS archival development.

As we developed and refined this new point of entry to the digital archives using the ContentDM hosting and framework, some of the ethical issues surrounding this local archive came more clearly into focus. A parallel set of questions arose in response, in the first instance, to J.D. Vance's work, and in the second, to outsiders' claims for technological solutions to the deterioration of local and cultural heritage. Because we were creating virtual archival surrogates for materials housed at Pine Mountain, for instance, questions arose from the PMSS board members related to privacy and the use of historical materials. Further, the board was concerned that even historical materials could bear on families present in the community today. We found that while profession-wide responses to archival constraints are shaped predominantly by discussions of copyright and fair use, issues of personal privacy are often left tacit. This gap between legal use and public interests in privacy reveals how tasks executed using techniques in machine learning may impinge upon more ethical constraints of public trust and civic obligation.10

10. The professional conversation in archive and collections management has not been as rich as the one emerging in AI contexts more broadly. For a recent discussion of the conflict in the roles of public trust and civic service that emerge from the context of the powers artificial intelligence holds for image recognition in policing applications, see Elizabeth Joh, "Artificial Intelligence and Policing: First Questions," Seattle University Law Review 41: 1139–44.

Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the "historical validation" of primary source materials (2017, 424–5). When an AI system recognizes an object, Calo remarks, that object is validated. But how should one handle the lack of a specific vocabulary within a given training set? One answer, of course, would be to train a new set—but that response is becoming increasingly prohibitive for smaller cultural heritage projects like ours: the time and computational power required to execute the training is non-negligible. In addition, training resources (such as data sets, algorithms, and platforms) are increasingly becoming monetized, and we do not have the margins to buy access to new data for training.
As a consequence, questions stemming from how one labels material in a controlled vocabulary were also at issue. We encountered a failure in historical validation when, for instance, our AI system labeled a "spinning wheel" as a wheel, but did not detect its historical relationship to weaving and textiles. That validation was further obscured when the system also failed to categorize a second form of "spinning wheel," which refers locally to a home-made merry-go-round.11 In other words, not only did the system flatten a spinning wheel into a generic wheel, it also missed the regional homology between textile production and play, a cultural crux that reveals how this place envisions an intersection between work and recreation. By breaking the associations between two forms of "spinning wheel," our system erased a small but significant site of cultural inheritance. How, we asked, should one handle such instances of effacement? At one level, one would expect an archival system to be able to identify the primitive machine for spinning wool, flax, or other raw materials into usable thread for textiles, but what about the merry-go-round? And what should one do when a system neglects both of these meanings and reduces the object to the same status as a wheel on a tractor, car, or carriage?

11. See "Spinning Wheel" in Cassidy 1985–2012.

Similarly, when competing naming conventions arose for landmarks, we were careful to consider which name should be granted priority as the default designation, and we asked how one should designate a local or historical name, whether for a road, waterway, knob, or other feature, in relationship to a more widely accepted nomenclature such as state route designations or standardized toponyms. As we attempted to address the challenge of multiple naming conventions, we encountered some of the same challenges that archivists find in dealing with indigenous peoples and their textual, material, and physical artifacts.12 Following an example derived from the Passamaquoddy people, we implemented a small set of "traditional knowledge labels"13 to describe several forms of information, including (a) restrictions on images that should not be shown to strangers (to protect family privacy), (b) places that should remain undisclosed (for instance, wild ginseng, ramp, orchid, or morel mushroom patches), and (c) educational materials focused on "how it was done" as related to local skills and crafts that have more modern implementations, but for which the traditional practices have remained meaningful. This included cases such as Maypole dancing and festivals, which remain endowed with ritual significance. In the final analysis, neither the framework supplied by copyright and fair use nor the one supplied by data validation proved singularly adequate to our purposes, but they did provide guidelines from which our facial recognition project could proceed, as we discuss below.

12. One well-documented digital approach to handling indigenous archival materials is the Mukurtu platform for indigenous cultural heritage: https://mukurtu.org/.

13. For the original traditional knowledge labels, see https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels.
Machine Learning in a Local Archive

These preliminary discussions of ethics and convention may seem unrelated to the focus this collection adopts toward machine learning and artificial intelligence in the archive. However, as we have begun to suggest, the data migration to ContentDM opened the door to machine learning for this project, and those initial steps framed the pitfalls that we continue to navigate as we continue forward. As we suggested at the outset, the technical machine-learning task that we set for ourselves is not cutting-edge research as much as an application of existing technologies to a new aspect of archival investigation. We proposed (and succeeded with) an application of commercial facial recognition software to identify the persons in historic photographs in the PMSS archives. We subsequently proposed and are currently working to identify the photographs sharing common but unnamed faces, and in coordination with photographs of known people, to re-create the social network of this historic institution across slices of its history. We describe the next steps briefly below, but let us tarry for a moment with the question of how the ethical concerns we navigated up to this point also influenced our approach to facial recognition.

The first of those concerns has to do with commercial and public access to archival materials that, as we suggested above, include materials that are designated as restricted use in some way. We demonstrated to the local members at Pine Mountain how our use case and its constraints for digital archives fit with the current standards for the fair use of copyrighted materials based on the "substantive transformation" of reproduced objects (Levendowski 2018, 622–9). Since we are not making available large bodies of materials still protected by copyright, and since our use of select materials shifts the context within which they are presented, we were able to negotiate with PMSS to allow us to design a system for facial recognition using the ContentDM instance as our image source. What that negotiation did not consider, however, is what happens when fair use does not provide a sufficiently high standard of control for the institution involved in the application of algorithms to institutional memory or its technological dependencies.

First, to test the facial recognition processes, we reached back to the most primitive and local version of facial recognition software that we could find: Google's retired platform, the Picasa Web Albums API, which was retired in May 2016 and fully deprecated as of March 2018 (Sabharwal 2016). We chose Picasa because it is a self-contained software application that operates using a locally hosted script and locally hosted images. Given its deprecated status and its location on a local machine, we were confident that no cloud services would be ingesting the images we fed into the system for our trial.
This meant that we could test small data examples without fear of having to upload an entire corpus of material that could subsequently be incorporated into commercial facial recognition engines or pop up unexpectedly in search results. We thus began by upholding a high threshold for privacy and insisting on finding ways for PMSS to maintain control over these images within the grasp of its local directories.

The Picasa system created surprisingly good results within the scope we allowed it. It was highly successful at matching the small group of known faces we supplied as test materials. While it would be difficult to supply a numerical match rate, first because of this limited test set, and second because we have not expanded the test to a broad sample using another platform, we were anecdotally surprised at how robust Picasa's matching was in practice. For instance, Picasa matched the images of a single person's face, Celia Cathcart, from pictures of her as a teenager to images of her as a grandmother. It recognized Cathcart in a group of basketball players, and it also identified her face from side-view and off-center angles, as in a photograph of her looking down at her newborn child. The most immediate limitation of Picasa lies in its tagging, which required manual entry of every name and did not allow any automation.

Following the success of that hand-tagging and cross-image identification process, we discussed with our partners whether the next step, using Amazon Web Services' computer vision and facial recognition platform, Rekognition, would be acceptable. They agreed, and we ran the images through the AWS application, testing our results against samples pulled from our Picasa run to verify the results. Perhaps unsurprisingly, AWS Rekognition fared even better with those test cases. Using one photograph, the AWS application identified all of the Picasa matches as well as three new images that had not previously been tagged with Cathcart's name. The same pattern held for other images in our sample group: Katherine Pettit was positively identified across more likenesses than had been previously tagged, and Alice Cobb was also positively tracked across images.
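For readers who want a sense of the mechanics, a face-matching pipeline of this general kind can be sketched with AWS's boto3 library; the collection name, file names, and region below are invented, valid AWS credentials are assumed, and the sketch is not a reproduction of our actual scripts.

import boto3

# create a client and a collection to hold indexed faces
client = boto3.client( 'rekognition', region_name='us-east-1' )
client.create_collection( CollectionId='pmss-faces' )

# index a face from a photograph of a known, named person
with open( 'cathcart-portrait.jpg', 'rb' ) as handle :
    client.index_faces( CollectionId='pmss-faces',
                        Image={ 'Bytes': handle.read() },
                        ExternalImageId='celia-cathcart' )

# search another photograph for matches against the indexed faces;
# note that this call matches only the largest face it detects, so
# group photographs require iterating over detected faces
with open( 'basketball-team.jpg', 'rb' ) as handle :
    response = client.search_faces_by_image( CollectionId='pmss-faces',
                                             Image={ 'Bytes': handle.read() } )

# report who was found, and with what degree of similarity
for match in response[ 'FaceMatches' ] :
    print( match[ 'Face' ][ 'ExternalImageId' ], match[ 'Similarity' ] )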
This positive attribution also reveals a limitation of the metadata: while these three women we have named are important historical figures at PMSS, and while they are widely acknowledged in the archive and well-represented in the photographic record, not all of the photographs have been well-tagged or fully documented in the archive. The newly tagged images that we found would enrich the metadata available to the archive not because these images include surprising faces, but rather because the tagging has been inconsistent, and over time, previously known faces have become less easy to discern.

Like other recent discussions of private materials disclosed within systems trained for matching and similarity, we found that the ethics of using private materials for this non-private purpose provoked strong reactions. While some of the reaction was positive, with community members happy to have more images of the School's founding director, Katherine Pettit, identified, those same community members were not comfortable with our role as researchers identifying people in the photographs in their community's archive, unsupervised. They wanted instead to verify each positive identification, a point that we agreed with, but which also hindered the process of moving through 19,000 images. They wanted to maintain authority, and while we saw our efforts as contributions to their goals of better describing their archival holdings, it turns out that the larger scope of automation we brought to the project was intimidating. While its legal status and direct ethics seemed settled before the beginning of the project, ultimately, this project contributed to
As researchers interested in open access and stable platform management, we have disagreements with the scholarly and archival implications of this decision, but we ultimately respect the resolve and underlying values that accompany the difficult choices PMSS makes about its public audiences and the corresponding goals it maintains for its collections. Interestingly, Wykle has come to view our work with PMSS collections as another form of the material and cultural extraction that has dominated the region 14See, for another example of the ethical quandaries that may be associated with legal applications of machine learning techniques, Ema et al. 2019. 146 Machine Learning, Libraries, and Cross-Disciplinary ResearchǔChapter 12 for generations. While we see our work in light of preservation and access as well as our lasting commitment to PMSS and the region, we have also come to recognize the powerful explanatory force that the idea of “extraction” has become for the communities in a region that has suffered many forms of extraction industries’ negative effects. In acknowledging the limitations of our own efforts, we would posit that our case study offers a counter-example to works that suggest how AI systems can be designed automatically to meet the needs of their constituents (Winfield et al. 2019). We tried to use a design approach to address our research goals and our partner’s needs, and it turned out that the dynamically constructed and evolving nature of those needs outstripped the capacity we could build into our available system of machine learning. The divergence of our goals has led the collaboration to an impasse. Given that we had al- ready outlined further steps in our initial documents that could not be satisfied after the partners identified their divergent intentions, the collaborative scope the partners initially described was not completely fulfilled. The divergence of goals became stark: as researchers interested in the relevance and sustainability of these archives, we were moving the collections toward a more ac- cessible and comprehensive platform with open documentation and protocols for future devel- opment. By contrast, the PMSS staff were moving toward more stringent and local controls over access to the archives in order to limit dissemination. At this juncture, we had some negotiating to do. First, we made the ContentDM instance a password protected and not publicly accessible (private) sandbox rather than a public instance of a virtual digital collection. As PMSS owns the material, they decided shortly thereafter to issue a take-down order of the ContentDM instance, and we complied. As the ContentDM materials were ultimately accessible in the public domain on their live site, this decision revealed how personal the challenges had become. Nothing in- cluded in the take-down order was unique or new material—rather, the ContentDM site simply provided a more accessible format for existing primary material on the WordPress site, stripped of its interpretive and secondary contexts. If there is a silver lining, it lies in this context for use: the “academic divorce” we underwent by discontinuing our collaboration has made it possible for us to continue conducting research on the publicly available archival materials without being obligated to host a live and dynamic reposi- tory for further materials. As a result, we can test best-approaches without having to worry about pushing them to a live production site. 
Within this constraint, we aim to continue re-creating the historical social network without compromising our partners' needs for privacy and control of their production site. The mutual decision to terminate further partnership activities based in archival development arose because of these differing paths forward. That decision meant that any further enrichment of the archival materials would not become publicly available, which we saw as a penalty against using the archive at a moment when archives need as much advocacy and visible support as possible.

Under these constraints of private accessibility, we have continued to work on the AWS Rekognition pipeline and have successfully identified all of the faces of named people featured in the archive, with face and name labels now associated with over 1900 unique images. Our next step, delayed to Spring 2021 as a result of the COVID-19 pandemic, includes the creation of an associative network that first identifies unnamed faces in each image using unique identifiers. The second element of that process will be to generate an historical social network using the co-occurrence of those faces, as well as the faces of named people, in the available images. Given that our metadata enrichment has already included date associations for most of the images, we are confident that we will be able to reconstruct historically specific networks for a given year or range of years, and moreover, that the association between dates and named people will help us to identify further members of the community who are not currently named in the photographs, because of the small groups involved in activities and clubs as well as the generally limited student and teacher populations during any given year. We are now far more sensitive to how the local concerns of this community shape our research methods and outcomes.

The longer-term hope, one it is not at all clear we will be allowed to pursue, would be to use natural language processing tools on the archive's textual materials, particularly named entity recognition and word vectors, to search and match images where known names occur proximate to the names of unmatched faces. The present goal, however, remains to create a more replete and densely connected network of faces and the places they occupied when they were living in the gentle shadows of Pine Mountain. In order to abide by PMSS community wishes for privacy, we will be using anonymized aggregate results without identifying individuals in the photographs. While this method has the drawback of not being able to reveal the complexity of the historical relations at the granular level of individuals, it will allow us to report on the persistence or variation in network metrics, such as network density, centrality, path length, and betweenness measures, among others. In this way, we aim to be able to measure and report on the network and its changes over time without reporting on individuals.
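The kind of anonymized, aggregate reporting we have in mind can be sketched with the networkx library; the photographs and face identifiers below are invented, and serve only to show how co-occurrence becomes a graph whose summary metrics can be reported without naming anyone.

import itertools

import networkx as nx

# each photograph is reduced to the anonymous face identifiers it contains
photographs = [ [ 'face-01', 'face-02', 'face-03' ],
                [ 'face-02', 'face-03' ],
                [ 'face-03', 'face-04' ] ]

# faces become nodes; co-occurrence in a photograph becomes an edge
graph = nx.Graph()
for faces in photographs :
    graph.add_edges_from( itertools.combinations( faces, 2 ) )

# report aggregate metrics without identifying any individual
print( 'density:', nx.density( graph ) )
print( 'betweenness:', nx.betweenness_centrality( graph ) )
print( 'average path length:', nx.average_shortest_path_length( graph ) )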
We arrived at an anonymizing method as a solution to the dissolved partnership by asking about the constraints of FERPA as well as by looking back at federal and commercial facial recognition practices. In each case, the dark side of these technological tools remains one associated with surveillance, and in the language of Eastern Kentucky, extraction. We mention this not only to be transparent about our recognition of these limitations, but also in the hopes of opening a new dialogue with our partners that might stem from generating interesting discoveries without compromising their sense of the local ownership of their archival materials. Nonetheless, in order to report on the most interesting aspects, the actual people and their local histories of place, the work to be done would remain more at a human level than at a technical one.

Conclusion

In conclusion, our project describes a success that remains imbricated with a shortcoming in machine learning. The machine learning tasks and algorithms our project implemented serve a mimetic function in the distilled picture of the community they reflect. By matching historical faces to names, the project embraces a form of digital surrogacy: we have aimed to produce a meta-historical account of the present institution's social and cultural function as a site of social networking and local knowledge transmission. As Robyn Caplan and danah boyd have recently suggested, the "bureaucratic functions" these algorithms promote can be understood by the ways in which they structure users' behaviors (2018, 3). We would like to supplement Caplan and boyd's insight regarding the potential coercions involved in how data structures implicitly shape their contents as well as their users' behaviors. Not only do algorithms promote a kind of bureaucracy, to ends that may be positive and negative, and sometimes both at once, but further, those same structures may reflect or shape public behaviors and interactions beyond a single platform.

As we move between digital and public spheres, our work similarly shifts its scope. The research that we intended to have positive community effects was instead read by that very same set of people as an attempt to displace a community from the center of its own history. In other words, the bureaucratic functions embedded in PMSS as an institution saw our new approach to their storytelling as an unwanted and external intervention. As their response suggests, the internal and extant structures for governing their community, its stories, and the people who tell them, saw our contribution as an effort to co-opt their control. Where we thought we were offering new tools for capturing, discovering, and telling stories, they saw what Safiya Noble has recently characterized in a specifically racialized context as "algorithms of oppression" (2018). Here the oppression would be geographic, socio-economic, and cultural, rather than racial; nevertheless, the perception that one is being oppressed by systems set into place by agents working beyond one's own community remains a shared foundation in Noble's argument and in the unexpected reception of our project. As we move forward with our own project into unknown territories, in which our work-products may never see the light of day because of the value conflicts bound up in making archival objects public and accessible, we have found a real and lasting respect for the institutional dependencies and emplacements within which we all do our work. We hope to channel some of those functions of emplacement to create new forms of accountability and restraint that will allow us to move forward, but at least for now, we have found with our project one limitation of machine learning, and it is not the machine.

References

Ahmed, Manan, Maira E. Álvarez, Sylvia A.
Fernández, Alex Gil, Rachel Hendery, Moacir P. de Sá Pereira, and Roopika Risam. 2018. "Torn Apart / Separados." Group for Experimental Methods in Humanistic Research. https://xpmethod.plaintext.in/torn-apart/volume/2/.

Bailey, Ronald. 2017. "The Noble, Misguided Plan to Turn Coal Miners Into Coders." Reason, November 25, 2017. https://reason.com/2017/11/25/the-noble-misguided-plan-to-tu/.

Calo, Ryan. 2017. "Artificial Intelligence Policy: A Primer and Roadmap." University of California, Davis Law Review 51: 399-435.

Caplan, Robyn and danah boyd. 2018. "Isomorphism through algorithm: Institutional dependencies in the case of Facebook." Big Data & Society (January-June): 1-12. https://doi.org/10.1177/2053951718757253.

Cassidy, Frederic G. et al., eds. 1985-2012. Dictionary of American Regional English. Cambridge, MA: Belknap Press. https://www.daredictionary.com.

Ema, Arisa et al. 2019. "Clarifying Privacy, Property, and Power: Case Study on Value Conflict Between Communities." Proceedings of the IEEE 107, no. 3 (March): 575-80. https://doi.org/10.1109/JPROC.2018.2837045.

Harkins, Anthony and Meredith McCarroll, eds. 2019. Appalachian Reckoning: A Region Responds to Hillbilly Elegy. Morgantown, WV: West Virginia University Press.

Hochschild, Arlie. 2018. "The Coders of Kentucky." The New York Times, September 21, 2018. https://www.nytimes.com/2018/09/21/opinion/sunday/silicon-valley-tech.html.

Joh, Elizabeth. 2018. "Artificial Intelligence and Policing: First Questions." Seattle University Law Review 41 (4): 1139-44.

Latour, Bruno. 2007. Reassembling the Social: An Introduction to Actor-Network-Theory. New York: Oxford University Press.

Levendowski, Amanda. 2018. "How Copyright Law Can Fix Artificial Intelligence's Implicit Bias Problem." Washington Law Review 93 (2): 579-630.

Mukurtu CMS. https://mukurtu.org/. Accessed December 12, 2019.

Noble, Safiya. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press.

Passamaquoddy People. "Passamaquoddy Traditional Knowledge Labels." https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels. Accessed December 12, 2019.

Risam, Roopika. 2015. "Beyond the Margins: Intersectionality and the Digital Humanities." DHQ: Digital Humanities Quarterly 9 (2). http://digitalhumanities.org/dhq/vol/9/2/000208/000208.html.

Robertson, Campbell. 2019. "They Were Promised Coding Jobs in Appalachia. Now They Say It Was a Fraud." The New York Times, May 12, 2019. https://www.nytimes.com/2019/05/12/us/mined-minds-west-virginia-coding.html.

Sabharwal, Anil. 2016. "Moving on from Picasa." Google Photos Blog. Last modified March 26, 2018. https://googlephotos.blogspot.com/2016/02/moving-on-from-picasa.html.

Sabharwal, Arjun. 2015. Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Boston: Chandos.

Stephan, Karl D., Katina Michael, M.G.
Michael, Laura Jacob, and Emily P. Anesta. 2012. "Social Implications of Technology: The Past, the Present, and the Future." Proceedings of the IEEE 100, Special Centennial Issue (May): 1752-1781. https://doi.org/10.1109/JPROC.2012.2189919.

United States Department of Justice. 2008. "Guidelines for a Memorandum of Understanding." https://www.justice.gov/sites/default/files/ovw/legacy/2008/10/21/sample-mou.pdf.

United States Department of Justice. 2017. "Sample Memorandum of Understanding." http://www.doj.state.or.us/wp-content/uploads/2017/08/mou_sample_guidelines.pdf.

Vance, J.D. 2016. Hillbilly Elegy: A Memoir of a Family and Culture in Crisis. New York: Harper.

Weizenbaum, Joseph. 1976. Computer Power and Human Reason: From Judgment to Calculation. New York: W.H. Freeman and Co.

Winfield, Alan F., Katina Michael, Jeremy Pitt, and Vanessa Evers. 2019. "Machine Ethics: the design and governance of ethical AI and autonomous systems." Proceedings of the IEEE 107, no. 3 (March): 509-17. https://doi.org/10.1109/JPROC.2019.2900622.

13-lucic-towards ----

Chapter 13

Towards a Chicago place name dataset: From back-of-the-book index to a labeled dataset

Ana Lucic
University of Illinois

John Shanahan
DePaul University

Introduction

Reading Chicago Reading1 is a grant-supported digital humanities project that takes as its object the "One Book One Chicago" (OBOC) program2 of the Chicago Public Library. Since fall 2001, One Book One Chicago has fostered community through reading and discussion. On its "Big Read" website, the Library of Congress includes information about One Book programs around the United States,3 and the American Library Association (ALA) also provides materials with which a library can build its own One Book program and, in this way, bring members of their communities together in a conversation.4

1. The Reading Chicago Reading project (https://dh.depaul.press/reading-chicago/) gratefully acknowledges the support of the National Endowment for the Humanities Office of Digital Humanities, HathiTrust, and Lyrasis.

2. See https://www.chipublib.org/one-book-one-chicago/.

3. See http://read.gov/resources/.

4. See http://www.ala.org/tools/programming/onebook.

While community reading programs are not a
While community reading programs are not a new phenomenon and exist in various formats and sizes, the One Book One Chicago program is notable because of its size (the Chicago Public Library has 81 local branches) as well as its history (the program has been in continual existence for nearly 20 years). Although book clubs and community-based reading programs are relatively common, they are not regularly assessed in the way other library programming components are, nor are they often the subject of long-term quantitative study.

The following research questions have guided the Reading Chicago Reading project so far: can we predict the future circulation of a book using a predictive model based on prior circulation, community demographics, and text characteristics? How did different neighborhoods in a diverse but also segregated city respond to particular book choices? Have certain books been more popular than others around the city as measured by branch-level circulation, and can these changes in checkout totals be correlated with CPL outreach work? A related question is the focus of this paper: by associating place names with sentiment scores in Chicago-themed OBOC books, what trends emerge from spatial analysis? Results are still in progress and will be forthcoming in future papers. In the meantime, exploration of these questions, and our attempt to find solutions for some of them, enables us to reflect on some innovative services that libraries can offer. We will discuss this possibility in the last section of this paper.

Chicago as a place name

Thus far, the Reading Chicago Reading project has focused the bulk of its analysis on seven recent OBOC book selections and their respective "seasons" of public outreach programming:

• Fall of 2011: Saul Bellow's The Adventures of Augie March
• Spring of 2012: Yiyun Li's Gold Boy, Emerald Girl
• Fall of 2012: Markus Zusak's The Book Thief
• 2013–2014: Isabel Wilkerson's The Warmth of Other Suns
• 2014–2015: Michael Chabon's The Amazing Adventures of Kavalier and Clay
• 2015–2016: Thomas Dyja's The Third Coast
• 2016–2017: Barbara Kingsolver's Animal, Vegetable, Miracle: A Year of Food Life

All of the works listed above, spanning categories of fiction and non-fiction, are still in copyright. Of the seven works, three were categorized as Chicago-themed because they take place in the Chicago area in whole or in substantial part: Saul Bellow's The Adventures of Augie March, Isabel Wilkerson's The Warmth of Other Suns, and Thomas Dyja's The Third Coast.

As part of the ongoing work of the Reading Chicago Reading project, we used the secure data portal of the HathiTrust Research Consortium to access and pre-process the in-copyright novels in our set. The HathiTrust research portal permits the extraction of non-consumptive features of the works included in the digital library, even those that are still under copyright. Non-consumptive features do not violate copyright restrictions because they do not allow the regular reading ("consumption") or digital reconstruction of the full work in question. An example of a non-consumptive feature is part-of-speech information extracted in aggregate, with or without connection to its source words. Location words (i.e., place names) in the text are another example of a non-consumptive feature, as long as we do not aim to extract locations with their surrounding context: that is, while the extraction of a location word alone from a work under copyright will not violate copyright law, the extraction of the location word with its surrounding context (a fixed-size "window" of words that surrounds the location word) might do so. Similarly, the sentiment of a sentence also falls under the category of a "non-consumptive" feature as long as we do not extract both the entire sentence and its sentiment score. Using these methods, it was possible to utilize the HathiTrust research portal to access and extract the location words, as well as the sentiment of individual sentences, from copyrighted works. As later paragraphs will reveal, however, we also needed to verify the accuracy of these extractions, which was done manually by checking the extracted references against the actual text of the work.
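To make the idea of a non-consumptive feature concrete, here is a minimal sketch (our illustration, not the HathiTrust portal's actual interface) that reduces a text to aggregate part-of-speech counts, from which the original prose cannot be reconstructed. It assumes NLTK and its tokenizer and tagger models are installed.

```python
from collections import Counter

import nltk

# One-time model downloads; these are quiet no-ops if already present.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_profile(text):
    """Return aggregate part-of-speech counts for a text.

    The counts describe the text in bulk but do not allow the original
    sentences to be rebuilt, which is what makes them non-consumptive.
    """
    tags = nltk.pos_tag(nltk.word_tokenize(text))
    return Counter(tag for _word, tag in tags)

print(pos_profile("The Mecca building stood at the corner of State Street."))
# Counter({'NNP': ..., 'DT': ..., ...}) -- tag totals only, no word order
```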
This paper arises from the finding that the three OBOC books that are set largely in or are about Chicago circulated differently than the OBOC books that are not (i.e., Markus Zusak's The Book Thief, Yiyun Li's Gold Boy, Emerald Girl, Barbara Kingsolver's Animal, Vegetable, Miracle, and Michael Chabon's The Amazing Adventures of Kavalier and Clay). Since one of the findings was that some CPL branches had higher circulation for "Chicago" OBOC books than others in the program, we wanted to (1) determine which place names were featured in the three books and (2) quantify and examine the sentiment associated with these places.

Although recognizing a well-defined place name in a text by automated means is no longer a difficult task, thanks to the development of named entity recognizers such as the Stanford Named Entity Recognizer,5 OpenNLP,6 spaCy,7 and NLTK,8 recognizing whether a place name is a reference to a Chicago location is a harder task. If Chicago is the setting or one of the main topics of the book, then we can assume that a number of the locations mentioned will also be Chicago place names. However, if information about the topicality or locality of the book is not known in advance, or if the plot of the book moves from location to location, then the task of verifying through automated methods whether a place name is a Chicago location is much harder.

With the help of LinkedGeoData9 we were able to obtain all of the Chicago place names identified by volunteers through the OpenStreetMap project10 and then download a listing that included Chicago buildings, theaters, restaurants, streets, and other prominent places. While this is very useful, we also realized that we were missing historical Chicago place names with this approach. At the same time, the way a place name is represented in a text will not always correspond to the way it is formally represented in a dictionary, database, or knowledge graph. For example, a sentence might simply use an anaphoric reference such as "that building" or "her home" instead of directly naming the entity known from other sentences. Moreover, there were many examples of generic place names: how many cities in the United States have a State Street, a Madison Street, a 1st Avenue, and the like?
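Extracting candidate place names in the first place is the easy half of the problem. A minimal sketch using spaCy (assuming its small English model, en_core_web_sm, is installed; any of the recognizers named above would serve) shows what an off-the-shelf tool provides, and why the Chicago question remains open:

```python
import spacy

# Assumes a prior: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sentence = ("Augie walked down Madison Street toward the lake, "
            "thinking about Hyde Park.")

doc = nlp(sentence)

# GPE = countries/cities/states, FAC = buildings/highways, LOC = other places.
for ent in doc.ents:
    if ent.label_ in {"GPE", "FAC", "LOC"}:
        print(ent.text, ent.label_)

# The recognizer can flag "Madison Street" and "Hyde Park" as locations,
# but nothing here says whether they are Chicago locations -- that
# disambiguation is the hard part discussed above.
```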
A further hindrance was determining the type of place names we wanted to identify and collect from the text's total set of location word tokens: it soon became obvious that, for the purposes of visualizing place names on a map, general references to "Chicago" fell outside the scope of the maps we wanted to create. We became more interested in tracking references to specific Chicago place names, including buildings (historical and present), named areas of the city, monuments, streets, theatres, restaurants, and the like. Given that our total dataset for this task comprised just three books, we were able to manually sift through the automatically identified place names and verify whether each was indeed a Chicago place name or not.

We also established the sentiment of each location-bearing sentence in the three books using the Stanford Sentiment Analyzer.11 Our guiding principle was that the specific place(s) mentioned in a sentence "inherit" the sentiment score of the entire sentence. This principle may not always hold, but our manual inspection of the sentiment assigned to sentences, and therefore to the locations mentioned in them, established that it was a fairly accurate estimate: the sentiment score of the entire sentence is at the very least connected to, or "resonates" with, the individual components of the sentence, including place names. While we did examine some samples, we did not conduct a qualitative analysis of the accuracy of the sentiment scores assigned to the corpus. Figure 13.1 documents an example of the results of our effort to integrate place names with the sentiment of the sentence.

Figure 13.1: Mapping place names associated with positive (top row) and very negative (bottom row) sentiment extracted from three OBOC books.

Particularly notable in Figure 13.1 is The Third Coast (right column), which shows a concentration of positively-associated Chicago place names in the northern parts of the city along the shore of Lake Michigan. Negative sentiment, by contrast, appears to be more concentrated in the central part of Chicago and also in the southern parts of the city.

The place names extracted from our three Chicago-setting OBOC books allowed us to focus on particular areas of the city such as Hyde Park on the South Side, which is mentioned in each of them. Larger circles correspond to a greater number of sentences that mention Hyde Park and are associated with a negative sentiment in both The Adventures of Augie March and The Warmth of Other Suns. As the maps in Figure 13.2 indicate, on the other hand, The Third Coast features sentences in which Hyde Park is mentioned in both positive and negative contexts.

Figure 13.2: Mapping of sentences that feature "Hyde Park," and their sentiment, from three OBOC program books.

5 See https://nlp.stanford.edu/software/CRF-NER.html.
6 See https://opennlp.apache.org/.
7 See https://spacy.io/.
8 See https://www.nltk.org/book/ch07.html.
9 See http://linkedgeodata.org/About.
10 See https://www.openstreetmap.org/.
11 See https://nlp.stanford.edu/sentiment/.
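The "inheritance" principle is straightforward to express in code. In the sketch below, NLTK's VADER scorer stands in for the Stanford Sentiment Analyzer (an assumption made only to keep the example self-contained): each sentence is scored once, and every place name found in it is assigned that score.

```python
import nltk
import spacy
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

nlp = spacy.load("en_core_web_sm")   # assumes the model is installed
scorer = SentimentIntensityAnalyzer()

text = ("Hyde Park was alive with music that summer. "
        "The tenements near State Street were grim and crowded.")

# Each place name "inherits" the compound sentiment of its sentence.
place_sentiments = []
for sent in nlp(text).sents:
    score = scorer.polarity_scores(sent.text)["compound"]
    for ent in sent.ents:
        if ent.label_ in {"GPE", "FAC", "LOC"}:
            place_sentiments.append((ent.text, score))

print(place_sentiments)
# e.g. [('Hyde Park', 0.6...), ('State Street', -0.7...)]
```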
These mapping results prompt us to continue with this line of research and to procure a larger "control" set of texts with Chicago place names and sentiment scores. This would allow us to focus on specific places such as "Wrigley Field" or the once-famous but no longer existing "Mecca" apartment building (which stood at the intersection of 34th and State Street on the South Side and was immortalized in a 1968 poetry collection by Gwendolyn Brooks). With a robust place name dataset, we could analyze the context in which these place names were mentioned in other literature, in contemporary or historical newspapers (Chicago Tribune, Chicago Sun-Times, Chicago Defender), or in library and archival materials. Promising contextual elements would include the sentiment associated with the place name.

Our interest in creating a dataset of Chicago place names extracted from literature led us to The Chicago of Fiction, a vast annotated bibliography by James A. Kaser. Published in 2011, this work contains entries on more than 1,200 works published between 1852 and 1980 that feature Chicago. Kaser's book contains several indexes that can serve as sources of labeled data, that is, of instances in which Chicago locations are mentioned. Although we are still determining how many of the titles included in the annotated bibliography already exist in digital format or are accessible through the HathiTrust digital library, it is likely that a subset of the total can be accessed electronically. Even if the books do not presently exist in electronic format, it is still possible to use the index as a source of already-labeled data for Chicago place names. We anticipate that such a dataset would be of interest to researchers in Urban Studies, Literature, History, and Geography. A sufficiently large number of sentences featuring Chicago place names would enable us to proceed in the direction of a Chicago place name recognizer that can "learn" Chicago context, or to examine how much context is sufficient to establish whether, for instance, a "Madison Street" place name in a text is located in Chicago or elsewhere.
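To illustrate how a back-of-the-book index can serve as labeling machinery, here is a small sketch; the index entries and file names are hypothetical, since Kaser's index is a print artifact. Each index entry acts as a rule: any sentence in the indexed work that mentions the entry's place name becomes a labeled training sentence.

```python
import re

# Hypothetical back-of-the-book index entries: place name -> indexed works.
index_entries = {
    "Hyde Park": ["augie_march.txt"],
    "Mecca Building": ["in_the_mecca.txt"],
}

def label_sentences(text, place):
    """Yield (sentence, place) pairs for sentences mentioning the place."""
    # Naive sentence split; a real pipeline would use a proper tokenizer.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if place.lower() in sentence.lower():
            yield sentence.strip(), place

labeled = []
for place, works in index_entries.items():
    for path in works:
        with open(path, encoding="utf-8") as handle:
            labeled.extend(label_sentences(handle.read(), place))

# `labeled` is now a list of (sentence, place name) training examples.
print(len(labeled))
```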
How do libraries innovate? From print index to labeled data

Over the last decade, libraries have pioneered services related to the development and preservation of digital scholarship projects. Librarians frequently assist faculty and students with the development of digital humanities and digital scholarship projects. They point patrons to resources and portals where they can find data, and they help with licensing. Librarians also procure datasets, and some perform data cleaning and pre-processing tasks. And yet it is still not that common for librarians to participate in the creation of a dataset. A relatively recent initiative, Collections as Data,12 however, directly tackles the issue of treating research, library, and cultural heritage collections as data and providing access to them. This ongoing initiative aims to create 12 projects that can serve as models for other libraries in making collections accessible as data.

The data that undergird the mechanisms of library workings—circulation records for physical and digital objects, metadata records, and the like—are not commonly available as datasets open to machine learning tasks. If they were, not only could libraries refer others to already created and annotated physical and digital objects, but they could also participate in creating objects that are local to their settings. Creation and curation of such datasets could in turn help establish new relationships between area libraries and local communities. One can imagine a "data challenge," for instance, in which libraries assemble a community by building a dataset relevant to that community. Such an effort would need to be preceded by an assessment of the data needs and interests of that particular community. In the case of a Chicago place name dataset challenge, efforts could revolve around local communities adding sentences to the dataset from literary sources. A second step might involve organizing a crowdsourced data challenge to build a place name recognizer model (e.g., a Chicago place name recognizer) based on the sentences gathered. One can also imagine turning metadata records into curated datasets that are shared with local communities and with teachers and university lecturers for use in the classroom. Once a dataset is built, scenarios can be invented for using it. This kind of work invites conversations with faculty members about their needs and about potential datasets that would be of particular interest. Creating datasets based on the unique materials at their disposal will enrich the palette of services already offered by libraries.

One of the main goals of the Reading Chicago Reading project was the creation of a model that can predict the circulation of a One Book One Chicago program book selection given parameters such as prior circulation for the book, its text characteristics, and the geographical locality of the work. We are not aware of other predictive models that integrate circulation records with text features extracted from books in this way. Given that circulation records are not commonly integrated with other data sources when they are analyzed, linking different data sources with circulation records is another challenging opportunity that this paper envisions.

Ultimately, libraries can play a dynamic role in both managing and creating data and datasets that can be shared with the members of local communities. Using back-of-the-book indexes as a source of labeled place name data is an approach that we have begun to prototype but that still requires further exploration and troubleshooting. While organizing a data challenge takes a lot of effort, it can be an effective way of reaching out to one's local community and identifying their data needs. To this end, we aim to make freely available our curated list of sentences and associated sentiment scores for Chicago place names in the three OBOC selections centered on Chicago. We will invite scholars and the general public to add more Chicago location sentences extracted from other literature. Our end goal is a labeled training dataset for the creation of a Chicago place name recognizer, which, we hope, will enable new avenues of research.

12 See https://collectionsasdata.github.io/part2whole/.

References

American Library Association. n.d. "One Book One Community." Programming & Exhibitions (website). Accessed May 31, 2020. http://www.ala.org/tools/programming/onebook.
Bird, Steven, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. Sebastopol, CA: O'Reilly Media Inc.
Chicago Public Library. n.d. "One Book One Chicago." Accessed May 31, 2020. https://www.chipublib.org/one-book-one-chicago/.
"Collections as Data: Part to Whole." n.d. Accessed May 31, 2020. https://collectionsasdata.github.io/part2whole/.
Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005. "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 363-370. https://www.aclweb.org/anthology/P05-1045/.
HathiTrust Digital Library. n.d. Accessed May 31, 2020. https://www.hathitrust.org/.
Kaser, James A. 2011. The Chicago of Fiction: A Resource Guide. Lanham: Scarecrow Press.
Library of Congress. n.d. "Local/Community Resources." Read.gov. Accessed May 31, 2020. http://read.gov/resources/.
LinkedGeoData. n.d. "About / News." Accessed May 31, 2020. http://linkedgeodata.org/About.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. "The Stanford CoreNLP Natural Language Processing Toolkit." In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60. https://www.aclweb.org/anthology/P14-5010/.
OpenStreetMap. n.d. Accessed May 31, 2020. https://www.openstreetmap.org/.
Reading Chicago Reading. n.d. "About Reading Chicago Reading." Accessed May 31, 2020. https://dh.depaul.press/reading-chicago/about/.

Chapter 14
Can a Hammer Categorize Highly Technical Articles?

Samuel Hansen
University of Michigan

When everything looks like a nail...

I was sure I had the most brilliant research project idea for my course in Digital Scholarship techniques. I would use the Mathematical Subject Classification (MSC) values assigned to the publications in MathSciNet1 to create a temporal citation network which would allow me to visualize how new mathematical subfields were created, and perhaps even predict them while they were still in their infancy. I thought it would be an easy enough project. I already knew how to analyze network data, and the data I needed already existed; I just had to get my hands on it. I even sold a couple of my fellow coursemates on the idea, and they agreed to work with me.

Of course nothing is as easy as that, and numerous requests for data went without response. Even after I reached out to personal contacts at MathSciNet, we came to understand we would not be getting the MSC data the entire project relied upon. Not that we were going to let a little setback like not having the necessary data stop us. After all, this was early 2018, and there had already been years of stories about how artificial intelligence, machine learning in particular, was going to revolutionize every aspect of our world (Kelly 2014; Clark 2015; Parloff 2016; Sangwani 2017; Tank 2017). All the coverage made it seem like AI was not only a tool with as many applications as a hammer, but that it also magically turned all problems into nails. While none of us were AI experts, we knew that machine learning was supposed to be good at classification and categorization.

1 See https://mathscinet.ams.org/.
The promise seemed to be that if you had stacks of data, a machine learning algorithm could dive in, find the needles, and arrange them into neatly divided piles of similar sharpness and length. Not only that, but there were pre-built tools that made it so almost anyone could do it. For a group of people whose project was on life support because we could not get the categorization data we needed, machine learning began to look like our only potential savior.

So, machine learning is what we used. I will not go too deep into the actual process, but I will give a brief outline of the techniques we employed. Machine-learning-based categorization needs data to classify, which in our case were mathematics publications. While this can be done with titles and abstracts, we wanted to provide the machine with as much data as we could, so we decided to work with full-text articles. Since we were at the University of Wisconsin at the time, we were able to connect with the team behind GeoDeepDive,2 who have agreements with many publishers to provide the full text of articles for text and data mining research ("GeoDeepDive: Project Overview" n.d.). GeoDeepDive provided us with the full text of 22,397 mathematics articles, which we used as our corpus. In order to classify these articles, which were already pre-processed by GeoDeepDive with CoreNLP,3 we first used the Python package Gensim4 to process the articles into a Python-friendly format and to remove stopwords. Then we randomly sampled one third of the corpus to create a topic model using the MALLET5 topic modeling tool. Finally, we applied the model to the remaining articles in our corpus. We then coded the words within the generated topics to subfields within mathematics and used those codes to assign each article a subfield category. To make sure our results were not just a one-off, we repeated this process multiple times and checked for variance in the results.

2 See https://geodeepdive.org/.
3 See https://stanfordnlp.github.io/CoreNLP/.
4 See https://radimrehurek.com/gensim/.
5 See http://mallet.cs.umass.edu/topics.php.
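In skeleton form, that pipeline looks something like the sketch below. It uses Gensim's own LDA implementation rather than the MALLET tool we actually ran (an assumption made to keep the example self-contained), and a toy corpus stands in for the GeoDeepDive articles.

```python
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS

# Toy stand-ins for the full-text articles.
articles = [
    "every planar graph admits a proper vertex coloring with four colors",
    "the eigenvalues of the laplacian matrix bound the graph connectivity",
    "a compact operator on a hilbert space has a discrete spectrum",
]

# Tokenize and remove stopwords, roughly as we did with Gensim.
texts = [[w for w in doc.split() if w not in STOPWORDS] for doc in articles]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Train a small topic model; we then hand-coded topic words to subfields.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      random_state=1)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Assigning a subfield means looking at the dominant topic per article.
print(lda.get_document_topics(bow_corpus[0]))
```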
There was none: the results were uniformly poor.

That might not be entirely fair. There were interesting aspects to the results of the topic modeling, but when it came to categorization they were useless. Of the subfield codes assigned to articles, only two were ever the dominant result for any given article: Graph Theory and Undefined. Even that does not tell the whole story, as Undefined was the runaway winner in the article classification race, with more than 70% of articles classified as Undefined in each run, including one for which it hit 95%. The topics generated by MALLET were often plagued by gibberish caused by equations in the mathematics articles, and there was at least one topic in each run that was filled with the names of months and locations. Add to this that the technical language of mathematics is filled with words that have non-technical definitions (for example, map or space) or words which have their own subfield-specific meanings (such as homomorphism or degree), both of which frustrate attempts to code a subfield, and it becomes clear why so many articles ended up as "Undefined." Even for Graph Theory, the one subfield whose vocabulary was distinctive enough for our topic model to partially identify, the results were marginally positive at best. We were able to obtain Mathematical Subject Classification (MSC) values for around 10% of our corpus. When we compared the articles we categorized as Graph Theory to the articles which had been assigned the MSC value for Graph Theory (05Cxx), we found we had a textbook recall-versus-precision problem. We could either correctly categorize nearly all of the Graph Theory articles with a very high rate of false positives (high recall and low precision), or we could almost never incorrectly categorize an article as Graph Theory but miss over 30% that we should have categorized as Graph Theory (high precision and low recall).
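That trade-off is easy to see with standard metrics. The toy labels below are invented for illustration (our project's actual counts differed), using scikit-learn as a convenient stand-in for however one tallies them:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = "Graph Theory", 0 = anything else; toy labels for illustration.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# A permissive classifier: catches every true article, many false positives.
permissive = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# A conservative classifier: no false positives, but misses true articles.
conservative = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

for name, pred in [("permissive", permissive), ("conservative", conservative)]:
    print(name,
          "precision=%.2f" % precision_score(truth, pred),
          "recall=%.2f" % recall_score(truth, pred),
          "F1=%.2f" % f1_score(truth, pred))

# permissive:   precision=0.57 recall=1.00 F1=0.73  (high recall, low precision)
# conservative: precision=1.00 recall=0.50 F1=0.67  (high precision, low recall)
```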
Needless to say, we were not able to create the temporal subfield network I had imagined. While we could reasonably claim that we learned very interesting things about the language of mathematics and its subfields, we could not claim we even came close to automatically categorizing mathematics articles. When we had to report back on our work at the end of the course, our main result was that basic, off-the-shelf topic modelling does not work well when it comes to highly technical articles from subjects like mathematics. It was also a welcome lesson in not believing the hype around machine learning, even when a problem looks exactly like the kind machine learning is supposed to excel at solving. While we had a hammer and our problem looked like a nail, it seemed that the former was a ball peen and the latter a railroad tie. In the end, even in the land of hammers and nails, the tool has to match the task.

Though we failed to accomplish automated categorization of mathematics, we were dilettantes in the world of machine learning. I believe our project is a good example of how machine learning is still a long way from being the magic tool that some, though not all (Rahimi and Recht 2017), have portrayed it as. Let us look at what happens when smarter and more capable minds tackle the problem of classifying mathematics and other highly technical subjects using advanced machine learning techniques.

Finding the Right Hammer

To illustrate the quest to find the right hammer, I am going to focus on three different projects that tackled the automated categorization of highly technical content: two that also attempted to categorize mathematical content, and one that looked to categorize scholarly works in general. These three projects provide examples of many of the approaches and practices employed by experts in automated classification, and they demonstrate the two main paths that these types of projects follow to accomplish their goals. Since we have been discussing mathematics, let us start with those two projects.

Both projects began because the participants were struggling to categorize mathematics publications so they would be properly indexed and searchable in digital mathematics databases: the Czech Digital Mathematics Library (DML-CZ)6 and NUMDAM7 in the case of Radim Řehůřek and Petr Sojka (Řehůřek and Sojka 2008), and Zentralblatt MATH (zbMath)8 in the case of Simon Barthel, Sascha Tönnies, and Wolf-Tilo Balke (Barthel, Tönnies, and Balke 2013). All of these databases rely on the aforementioned MSC9 to aid in indexing and retrieval, and so their goal was to automate the assignment of MSC values to lower the time and labor cost of requiring humans to do this task.

The main differences between their tasks related to the number of documents they were working with (thousands for Řehůřek and Sojka; millions for Barthel, Tönnies, and Balke), the amount of each work available (full text for Řehůřek and Sojka; titles, authors, and abstracts for Barthel, Tönnies, and Balke), and the quality of the data (mostly OCR scans for Řehůřek and Sojka; mostly TeX for Barthel, Tönnies, and Balke). Even with these differences, both projects took a similar approach, and it is the first of the two main pathways toward classification I spoke of earlier: using a predetermined taxonomy and a set of pre-categorized data to build a machine learning categorizer.

In the end, while both projects determined that the use of Support Vector Machines (Gandhi 2018)10 provided the best categorization results, their implementations were different. The Řehůřek and Sojka SVMs were trained with terms weighted using augmented term frequency11 and dynamic decision threshold12 selection using s-cut13 (Řehůřek and Sojka 2008, 549), and Barthel, Tönnies, and Balke's with term weighting using term frequency–inverse document frequency14 and Euclidean normalization15 (Barthel, Tönnies, and Balke 2013, 88), but the main difference was how they handled formulae. In particular, the Barthel, Tönnies, and Balke group split their corpus into words and formulae and mapped them to separate vectors, which were then merged into a combined vector used for categorization. Řehůřek and Sojka did not differentiate between words and formulae in their corpus, and they did note that their OCR scans' poor handling of formulae could have hindered their results (Řehůřek and Sojka 2008, 555). In the end, not having the ability to handle formulae separately did not seem to matter, as Řehůřek and Sojka claimed microaveraged F1 scores of 89.03% (Řehůřek and Sojka 2008, 549) when classifying the top-level MSC category with their best-performing SVM. When this is compared to the microaveraged F1 of 67.3% obtained by Barthel, Tönnies, and Balke (Barthel, Tönnies, and Balke 2013, 88), it would seem that either Řehůřek and Sojka's implementation of SVMs or their access to full text led to a clear advantage. This advantage becomes less clear when one takes into account that Řehůřek and Sojka were only working with top-level MSCs for which they had at least 30 (60 in the case of their best result) articles, and their limited corpus meant that many top-level MSC categories would not have been included.

6 See https://dml.cz/.
7 See http://www.numdam.org/.
8 See https://zbmath.org/.
9 Mathematical Subject Classification (MSC) values in MathSciNet and zbMath are a particularly interesting categorization set to work with, as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or a closely related, subfield as the article's content before they are published. This multi-step process of review yields a built-in accuracy check for the categorization.
10 Support Vector Machines (SVMs) are machine learning models which are trained using a pre-classified corpus to split a vector space into a set of differentiated areas (or categories) and then attempt to classify new items by where in the vector space the trained model places them. For a more in-depth, technical explanation, see https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.
11 Augmented term frequency refers to the number of times a term occurs in the document divided by the number of times the most frequently occurring term appears in the document.
12 The decision threshold is the cut-off for how close to a category the SVM must determine an item to be in order for it to be assigned that category. Řehůřek and Sojka's work varied this threshold dynamically.
13 Score-based local optimization, or s-cut, allows a machine-learning model to set different thresholds for each category, with an emphasis on local (per-category) rather than global performance.
14 Term frequency–inverse document frequency provides a weight for terms depending on how frequently they occur across the corpus. A term which occurs rarely across the corpus but with a high frequency within a single document will have a higher weight when classifying the document in question.
15 A Euclidean norm provides the distance from the origin to a point in an n-dimensional space. It is calculated by taking the square root of the sum of the squares of all coordinate values.
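Neither paper ships code, but the shared recipe of tf-idf weighting, Euclidean (L2) normalization, and a linear SVM per category can be sketched with scikit-learn (an assumed stand-in for their actual implementations, with toy abstracts and labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy abstracts labeled with top-level MSC codes.
docs = [
    "chromatic number of planar graphs and vertex colorings",  # 05 combinatorics
    "spanning trees and connectivity in random graphs",        # 05
    "bounded linear operators on separable hilbert spaces",    # 47 operator theory
    "spectral theory of compact self-adjoint operators",       # 47
]
labels = ["05", "05", "47", "47"]

# TfidfVectorizer applies tf-idf weighting and, by default,
# Euclidean (L2) normalization of each document vector.
model = make_pipeline(TfidfVectorizer(norm="l2"), LinearSVC())
model.fit(docs, labels)

print(model.predict(["colorings of random planar graphs"]))  # expect ['05']
```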
Looking at the work done by Barthel, Tönnies, and Balke makes it clear that these less common MSC categories, such as K-Theory or Potential Theory, for which they achieved microaveraged F1 measures of 18.2% and 24% respectively, have a large impact on the overall effectiveness of the automated categorization. Remember, this is only for the top level of MSC codes, and the work of Barthel, Tönnies, and Balke suggests it would get worse when trying to apply the second and third levels for full MSC categorization to these less-common categories. This leads me to believe that, in the case of categorizing highly technical mathematical works to an existing taxonomy, people have come close to identifying the overall size of the machine learning hammer but are still a long way away from finding the right match for the categorization nail.

Now let us shift from mathematics-specific categorization to subject categorization in general and look at the work Microsoft has done assigning Fields of Study (FoS) in the Microsoft Academic Graph (MAG), which is used to create their Microsoft Academic article search product.16 While the MAG FoS project is also attempting to categorize articles for proper indexing and search, it represents the second path taken by automated categorization projects: using machine learning techniques both to create the taxonomy and to classify.

Microsoft took a unique approach to the development of their taxonomy. Instead of relying on the corpus of articles in the MAG to develop it, they relied primarily on Wikipedia for its creation. They generated an initial seed by referencing the Science-Metrix classification scheme17 and a couple thousand FoS Wikipedia articles they identified internally.

16 See https://academic.microsoft.com/.
They then used an iterative process to identify more FoS in Wikipedia, based on whether candidate articles were linked to Wikipedia articles already identified as FoS and whether the new articles represented valid entity types—e.g., an entity type of protein would be added and an entity type of person would be excluded (Shen, Ma, and Wang 2018, 3). This work allowed Microsoft to develop a list of more than 200,000 Fields of Study for use as categories in the MAG.

Microsoft then used machine learning techniques to apply these FoS to their corpus of over 140 million academic articles. The specific techniques are not as clear as they were with the previous examples, likely because Microsoft is protecting its methods from competitors, but the article their researchers published to the arXiv (Shen, Ma, and Wang 2018) and the write-up on the MAG website make it clear that they used vector-based convolutional neural networks which relied on Skip-gram (Mikolov et al. 2013) embeddings and bag-of-words/entities features to create their vectors ("Microsoft Academic Increases Power of Semantic Search by Adding More Fields of Study" 2018). One really interesting part of the machine learning method used by Microsoft is that it did not rely only on information from the article being categorized. It also utilized the citations to, and references from, the article in the MAG, and used the FoS assigned to those citations and references to influence the FoS of the original article.

The identification of potential FoS and their assignment to articles was only a part of Microsoft's purpose. In order to fully index the MAG and make it searchable, they also wished to determine the relationships between the FoS; in other words, they wanted to build a hierarchical taxonomy. To achieve this they used the article categorizations and defined a Field of Study A as the parent of B if the articles categorized as B were close to a subset of the articles categorized as A (a more formal definition can be found in Shen, Ma, and Wang 2018, 4). This work, which created a six-level hierarchy, was mostly automated, but Microsoft did inspect and manually adjust the relationships between FoS on the highest two levels.
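The parent test can be phrased as a set-overlap heuristic. The sketch below is a paraphrase of the published definition, not Microsoft's code, and the article-ID sets are hypothetical: B is a candidate child of A when most articles tagged B are also tagged A, but not the reverse.

```python
# Hypothetical article-ID sets per Field of Study.
fos_articles = {
    "machine learning": {1, 2, 3, 4, 5, 6, 7, 8},
    "support vector machine": {2, 3, 5},
    "biology": {9, 10, 11},
}

def is_parent(a, b, threshold=0.8):
    """Treat A as a parent of B if B's articles are (nearly) a subset of A's."""
    a_set, b_set = fos_articles[a], fos_articles[b]
    overlap = len(a_set & b_set) / len(b_set)
    # Require the containment to be one-directional so A is the broader field.
    return overlap >= threshold and len(b_set) < len(a_set)

print(is_parent("machine learning", "support vector machine"))  # True
print(is_parent("machine learning", "biology"))                 # False
```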
To evaluate the quality of their FoS taxonomy and categorization work, Microsoft randomly sampled data at each of the three steps of the project and used human judges to assess their accuracy. The accuracy assessments of the three steps were not as complete as they would be with the mathematics categorization, as that approach would evaluate terms across the whole of the data sets, but the projects are of very different scales, so different methods are appropriate. In the end, Microsoft estimates the accuracy of the FoS at 94.75%, the article categorization at 81.2%, and the hierarchy at 78% (Shen, Ma, and Wang 2018, 5). Since MSC was created by humans, there is no meaningful way to compare the FoS accuracy measurements, but the categorization accuracy falls somewhere between that of the two mathematics projects. This is a very impressive result, especially when the aforementioned scale is taken into account. Instead of trying to replace the work of humans categorizing mathematics articles indexed in a database, which for 2018 was 120,324 items in MathSciNet18 and 97,819 in zbMath,19 the FoS project is trying to replace the human categorization of all items indexed in MAG, which was 10,616,601 in 2018.20

Both zbMath and MathSciNet were capable of providing the human labor to do the work of assigning MSC values to the mathematics articles they indexed in 2018.21 Therefore using an automated categorization, which at best could only get the top level right with about 90% accuracy, was not the right approach. On the other hand, it seems clear that no one could feasibly provide the human labor to categorize all articles indexed by MAG in 2018, so an 80% accurate categorization is a significant accomplishment. To go back to the nail and hammer analogy, Microsoft may have used a sledgehammer, but they were hammering a rather giant nail.

Are You Sure it's a Nail?

I started this chapter talking about how we have all been told that AI and machine learning were going to revolutionize everything in the world, that they were the hammers and all the world's problems were nails. I found that this was not the case when we tried to employ machine learning, in an admittedly rather naive fashion, to automatically categorize mathematical articles. From the other examples I included, it is also clear that computational experts find the automatic categorization of highly technical content a hard problem to tackle, one where success is very much dependent on what it is being measured against. In the case of classifying mathematics, machine learning can do a decent job, but not well enough to compete with humans. In the case of classifying everything, scale gives machines an edge, as long as you have the computational power and knowledge wielded by a company like Microsoft.

This collection is about the intersection of AI, machine learning, deep learning, and libraries. While there are definitely problems in libraries where these techniques will be the answer, I think it is important to pause and consider whether artificial intelligence techniques are the best approach before trying to use them. Libraries, even those like the one I work in, which are lucky enough to boast incredibly talented IT departments, do not tend to have access to a large amount of unused computational power or numerous experts in bleeding-edge AI. They are also rather notoriously limited budget-wise and would likely have to decide between existing budget items and developing an in-house machine learning program. Those realities, combined with the legitimate questions which can be raised about the efficacy of machine learning and AI with respect to the types of problems a library may encounter, such as categorizing the contents of highly technical articles, make me worry.

17 See http://science-metrix.com/?q=en/classification.
18 See https://mathscinet.ams.org/mathscinet/search/publications.html?dr=pubyear&yrop=eq&arg3=2018.
19 See https://zbmath.org/?q=py%3A2018.
20 See https://academic.microsoft.com/publications/33923547.
While there will be many cases where using AI makes sense, I want to be sure libraries are asking themselves a lot of questions before starting to use it. Questions like: is this problem large enough in scale to substitute machines for human labor, given that machines will likely be less accurate? Or: will using machines to solve this problem cost us more in equipment and highly technical staff than our current solution, and has that estimate factored in the people and services a library may need to cut to afford them? Or: does the data we have to train a machine contain bias, and will it therefore produce a biased model which will only serve to perpetuate existing inequities and systemic oppression? Not to mention: is this really a problem, or are we just looking for a way to employ machine learning so we can say that we did? In the cases where the answers to these questions are yes, it will make sense for libraries to employ machine learning. I just want libraries to look really carefully at how they approach problems and solutions, to make sure that their problem is, in fact, a nail, and then to look even closer and make sure it is the type of nail a machine-learning hammer can hit.

21 When an article is indexed by MathSciNet, it receives initial MSC values from a subject area editor, who then passes the article along to an external expert reviewer who suggests new MSC values, completes partial values, and provides potential corrections to the MSC values assigned by the editors ("Mathematical Reviews Guide For Reviewers" 2015); the subject area editors then make the final determination in order to make sure internal styles are followed. zbMath follows a similar procedure.

References

Barthel, Simon, Sascha Tönnies, and Wolf-Tilo Balke. 2013. "Large-Scale Experiments for Mathematical Document Classification." In Digital Libraries: Social Media and Community Networks, edited by Shalini R. Urs, Jin-Cheon Na, and George Buchanan, 83–92. Cham: Springer International Publishing.
Clark, Jack. 2015. "Why 2015 Was a Breakthrough Year in Artificial Intelligence." Bloomberg, December 8, 2015. https://www.bloomberg.com/news/articles/2015-12-08/why-2015-was-a-breakthrough-year-in-artificial-intelligence.
Gandhi, Rohith. 2018. "Support Vector Machine—Introduction to Machine Learning Algorithms." Medium, July 5, 2018. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.
"GeoDeepDive: Project Overview." n.d. Accessed May 7, 2018. https://geodeepdive.org/about.html.
Kelly, Kevin. 2014. "The Three Breakthroughs That Have Finally Unleashed AI on the World." Wired, October 27, 2014. https://www.wired.com/2014/10/future-of-artificial-intelligence/.
"Mathematical Reviews Guide For Reviewers." 2015. American Mathematical Society. February 2015. https://mathscinet.ams.org/mresubs/guide-reviewers.html.
"Microsoft Academic Increases Power of Semantic Search by Adding More Fields of Study." 2018. Microsoft Academic (blog). February 15, 2018. https://www.microsoft.com/en-us/research/project/academic/articles/microsoft-academic-increases-power-semantic-search-adding-fields-study/.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. "Distributed Representations of Words and Phrases and Their Compositionality." In Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–3119. Curran Associates, Inc. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
Parloff, Roger. 2016. "From 2016: Why Deep Learning Is Suddenly Changing Your Life." Fortune, September 28, 2016. https://fortune.com/longform/ai-artificial-intelligence-deep-machine-learning/.
Rahimi, Ali, and Benjamin Recht. 2017. "Back When We Were Kids." Presentation at the NIPS 2017 Conference. https://www.youtube.com/watch?v=Qi1Yry33TQE.
Řehůřek, Radim, and Petr Sojka. 2008. "Automated Classification and Categorization of Mathematical Knowledge." In Intelligent Computer Mathematics, edited by Serge Autexier, John Campbell, Julio Rubio, Volker Sorge, Masakazu Suzuki, and Freek Wiedijk, 543–57. Berlin: Springer Verlag.
Sangwani, Gaurav. 2017. "2017 Is the Year of Machine Learning. Here's Why." Business Insider, January 13, 2017. https://www.businessinsider.in/2017-is-the-year-of-machine-learning-heres-why/articleshow/56514535.cms.
Shen, Zhihong, Hao Ma, and Kuansan Wang. 2018. "A Web-Scale System for Scientific Knowledge Exploration." Paper presented at the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, July 2018. http://arxiv.org/abs/1805.12216.
Tank, Aytekin. 2017. "This Is the Year of the Machine Learning Revolution." Entrepreneur, January 12, 2017. https://www.entrepreneur.com/article/287324.