Machine Learning, Libraries, and Cross-Disciplinary Research
Chapter 12
Machine Learning + Data Creation in a Community Partnership for Archival Research

Jason Cohen, Berea College
Mario Nakazawa, Berea College

Introduction: Cultural Heritage and Archival Preservation in Eastern Kentucky

In this chapter, two researchers, Jason Cohen and Mario Nakazawa, describe the contexts for an archivally focused project that emerged from a partnership between the Pine Mountain Settlement School (PMSS)[1] in Harlan County, Kentucky, and scholars and students at Berea College. In this process, we have entered into the critical dialogue with our sources and knowledge production that Roopika Risam calls for in “self-reflexive” investigations in the digital humanities (2015, para. 16). Risam’s intervention, nevertheless, does not explicitly distinguish questions of class and the concomitant geographic constraints that often accompany the economic and social disadvantages of poverty (Ahmed et al. 2018). Our work demonstrates how class and geography are tied, even in digital archives, to the need for reflexive and diverse approaches to humanist materials. For instance, a recent invited contribution to Proceedings of the IEEE articulates a need for diversity in computing and technology without mentioning class or region as factors shaping these related issues of diversity (Stephan et al. 2012, 1752–5).

[1] See http://pinemountainsettlementschool.com.

Given these constraints, perhaps it is also pertinent to acknowledge that the machine learning application we describe in this chapter is itself not particularly novel in scope or method: we describe our data acquisition and preparation, and two parallel implementations of commercially available tools for facial recognition.
What stands out are the ethical and practical concerns tied to bringing unique archival materials out of their local contexts into a larger conversation about computer vision as a tool that helps liberate, and at the same time possibly endanger, a subaltern cultural heritage. In that light, we enter our archival investigation into what Bruno Latour has productively named “actor-network theory” (2007, 11–13) because, as we suggest below, our actions were highly conditioned not only by the physical and social spaces our research occupies and where its events occur, but also by the nature of the historical artifacts themselves, which act powerfully to shape our work in these contexts. Moreover, the partnership model of curation and archiving that we pursued in this project complicates the very concept of agency because the actions forming the project emerged from a continuing dialogue rather than any one decision or hierarchy. As we suggest later, a distributed model for decisions (Sabharwal 2015, 52–5) also revealed the limitations of using a participatory and identity-based model for archival development and management. Indeed, those historical artifacts will exert influence on this network of relations long after any one of us involved in the current project has ceased to pursue them.

When we came to this project, we asked a version of a classic question that has arisen in a variety of forms beginning with very early efforts by Bell Laboratories, among others, to translate data structures to suit the often flexible needs of humanist data: “what aspects of life are formalizable?” (Weizenbaum 1976, 12). We discovered that while an ontology may represent a formalized relationship of an archive to a database or finding aid, it also asks questions about the ethical implications of what information and embedded relationships can be adequately formalized by an abstract schema.

The Promises and Realities of Technology After Coal in Eastern Kentucky

Despite the longstanding threats of having to adapt to a post-coal economy, Harlan County, Kentucky continues to rely on coal and the mountains from which that coal is extracted as two of the cornerstones that shape the identity of the territory as well as the people who call it home. The mountains of Eastern Kentucky, like much of Appalachia, are by turns beautiful and devastated, and both authors of this essay have found conversations with Eastern Kentucky’s citizens about the role the mountains play and the traditions that emerge from them both insightful and, at times, heartbreaking. This dramatic landscape, with its drastic challenges, may not sound like a place likely to find uses for machine learning. You would not be alone in your assumption. Standing far from urban centers of technology and mobility, Eastern Kentucky combines deeply structural problems of generational poverty with a hard-won understanding that, since the moment of the region’s colonization, outsiders have taken resources and made uninformed decisions about what the region needs, or where it should turn in order to gain a better purchase on the narrative of American progress, self-improvement, and the unavoidable allures of development-driven capitalism. Suspicion of outsiders is endemic here. And unfortunately, economic and social conditions, such as the high workplace injury rates associated with mining and extraction-related industries, the effects of the pharmaceutical industry’s abuse of prescription opioids to treat a wide array of medical pain symptoms without treating the underlying causal conditions, and the systematic dismantling of federal- and state-level social support programs, have become increasingly acute concerns today. But this trajectory is not new: when President Lyndon B.
Johnson announced the beginning of the War on Poverty in 1964, he landed an hour away in Martin County, and subsequently, drove through Harlan on a regional tour to inaugurate the initiative. Successive generations have sought to leave a mark, and all the while, the residents have been collecting their own local histories of their place. Our project, centered on recovering a latent social network of historical families represented by the images held in one local archive, mobilizes this tension between insiders’ persistence and outsiders’ interventions to think about how, as Bruno Latour puts it, we can “reassemble the social” while still respecting the local (2007, 191–2).

PMSS occupies a unique position in this social and physical landscape: both local in its emplacement and attention, and a site of philanthropic work that attracted outside money as well as human and cultural capital, PMSS is at once of Harlan County and beyond it. As we suggest in the later sections of this essay, PMSS’s position, both within local and straddling regional boundaries, complicates the network we identified. More than that, however, its split position complicates the relationships of power and filiation embedded in its historical social network.

While an economy centered on coal continues to define the Eastern Kentucky regional identity, a second history can be told about this place and its people, one centered on resilience, independence, simplicity, and beauty, both of the land and its people. This second history has made outsiders’ recent appeals for the region to court technology as a potential solution for what comes “after coal” particularly attractive to a region that prides itself on its capacity to sustain, outlast, and overcome obstacles. While that techno-utopian vision offers another version of the self-aggrandizing Silicon Valley bootstraps success story J.D.
Vance narrates in Hillbilly Elegy (2016), like Vance’s story itself, those narratives most often get told by outsiders to outsiders using regional stereotypes as the grounds for a sales pitch. In reality, however, those efforts have largely proven difficult to sustain, and at times, become the sources of potentially explosive accusations of fraud and malfeasance. Recently, for instance, organizations including Mined Minds[2] have been accused by residents aiming to prepare for a post-coal economy of misleading students, at least, and of fraud at worst. As with the timber, coal, and gas extraction industries that preceded these software development firms’ aspirations, the promises of technology have not been kind to Eastern Kentucky, and in particular, as with those extraction industries that preceded them, the technological-industrial complex making its pitch in Kentucky’s mountains has not returned resources to the region’s residents whom the work was intended at least nominally to support (Hochschild 2018; Campbell 2019; Bailey 2017).

In this context of technology, culture, and the often controversial position machine learning occupies in generating obscure metrics for its classifiers that may embed bias, our project aims to activate the School’s archival holdings and bring critical awareness to the question of how to actively engage with a paper archive of a local place as we venture further into our pervasively digital moment. The School operates today as a regional cultural heritage institution; it opened in 1913 as a residential school and operated as an educational institution until 1974, at which point it transformed itself into an environmental and cultural outreach institution focused on developing its local community and maintaining the richness of the region’s cultural resources and heritage.
Every year since 1974, PMSS has brought hundreds of students and citizens onto its campus to learn about nature and the landscape, traditional crafts and artistic practices, and musical and dance forms, among many other programs. Similarly, it has created a space for locals to come together for social events, community celebrations, and festival days, and at the same time, has become a destination for national-level events that create community from shared interests including foodways, wildflowers, traditional dance forms, and other wide-ranging attractions.

[2] See http://www.minedminds.org/.

Project Background: Preserving Cultural Heritage in Harlan County

The archives of the Pine Mountain Settlement School emerge from its shifting history. The majority of its papers relate to its time as a traditional institution of education, including student records (which continue to be restricted for several reasons, including FERPA constraints, and personal and community interests in privacy), minutes of its board meetings (again, partially restricted), and financial and narrative accounts of its many activities across a year. The school’s records are unique because they provide a snapshot, year by year and month by month, of the region’s interests and challenges during key years of the 20th Century, spanning the First World War to Vietnam. In addition, they detail the relations the School maintained with a philanthropic base of donors who helped to support it and shape it, and beyond its local relations, place it into contact with a larger set of cultural interactions than a boarding school that relied on tuition or other profit-driven means to sustain its operations would.
While the archival holdings continued to be informally developed by its directors and staff, who kept the official papers organized roughly by year, the archive itself sat largely neglected after 1974. Beginning around the turn of the millennium, a volunteer archivist named Helen Wykle began digitizing items one by one, and soon, hosted a curated selection of those digital surrogates along with interpretive and descriptive narration on a WordPress installation, The Pine Mountain Settlement School Collections.[3] The PMSS Collections WordPress site has been continuously running and frequently updated by Wykle and the volunteer community members she has organized since 1999.[4] Together with her collaborators and volunteers, Wykle has grown the WordPress site to over 2200 pages, including over 30,000 embedded images that include photographs and newspapers; scanned memos, meeting minutes and other textual material (in JPG and PDF formats); HTML transcriptions and bibliographies hard-coded into the pages; scanned images of 3-D collections objects like textile looms or wood carving tools; partially scanned runs of serial publications; and other composite visual material. None of those objects was hosted within a regular and complete metadata hierarchy or ontology: no regular scheme of fields or file-naming convention was followed, no controlled vocabulary was maintained, no object-types were defined, no specific fields were required prior to posting, and perhaps unsurprisingly as a result, the search and retrieval functions of the site had deteriorated noticeably.

In 2016, Jason Cohen approached PMSS with the idea of using its archives as the basis for curricular development at Berea College.[5] Working in collaboration beginning in 2017, Mario Nakazawa and Cohen developed two courses in digital and computational humanities, led a team-directed study in augmented reality in coordination with Pine Mountain, contributed materials and methods for a new course in Appalachian Studies, and promoted the use of PMSS archival materials in several other extant courses in history and art history, among others. These new college courses each make use of PMSS historical documents as a shared core of visual and textual material in a digital and computational humanities concentration that clusters around critical archival and textual studies.[6]

[3] See https://pinemountainsettlement.net/.
[4] Jason Cohen and Mario Nakazawa wish to extend a note of appreciation to Helen Hays Wykle, Geoff Marietta, the former director of PMSS, and Preston Jones, its current director, for welcoming us and enabling us to access the physical archives at PMSS from 2016–20.
[5] Jason Cohen would like to recognize the support this project received from the National Endowment for the Humanities’ “Humanities Connections” grant. See grant number AK-255299-17, description online at https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=AK-255299-17.

The success of that initial collaboration and course development seeded the potential in 2019–2021 for a Whiting Public Engagement[7] fellowship focused on developing middle and high school curricula for use in Kentucky public schools with PMSS archival materials. That Whiting-funded project has generated over 80 lessons keyed to Kentucky state standards; these lessons are currently in use at nine schools across eight school districts, and each school is using PMSS materials to highlight its own regional and local interests. The work we have done with these archives has thus far reached the classrooms of at least eleven different middle and high school teachers, and as a result, touched over 450 students in eastern and central Kentucky public schools.
We mention these numbers in order to demonstrate that our collaboration has been neither shallow nor fleeting. We have come to know these archives quite well, and because they are not adequately cataloged, the only way to get to know them is to spend time reading through the materials one page at a time. An ancillary consequence of this durable collaboration and partnership across the public-academic divide is the shared recognition early in 2019 that the PMSS archival database and its underlying data structure (a flat SQL database generated by the WordPress interface) would provide inadequate stability for records management and quality control in future development. In addition, we discovered that the interpretive materials and metadata associated with the WordPress installation were also insufficient for linked metadata across the objects in this expanding digital archive, for reasons discussed below.

As partners, we decided together to migrate to a ContentDM instance hosted by the Kentucky Virtual Library,[8] a consortium to which Berea College belongs, and which is open to future membership from PMSS. That decision led a team of Berea College undergraduate and faculty researchers to scrape the data from the PMSS archive site and supplement the images and transcriptions it contains with available textual metadata drawn from the site.[9] Alongside the WordPress instance as our reference, we were also granted access to a Dropbox account that hosted higher resolution versions of the images featured on the blog. The scraper pulled over 19,228 unique images (and located over 11,000 duplicate images in the process), 732 document transcriptions for scanned texts on the site, and 380 subject and person bibliographies, including Library of Congress Subject Headings that had been hard-coded into the site’s HTML.
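
The chapter does not say how the scraper told unique images from duplicates. As a minimal sketch of one common approach, assuming that "duplicate" means a byte-identical file, images can be grouped by a hash of their contents. The function names and the sample file names below are hypothetical and are not taken from the PMSS scraper.

```python
import hashlib
from collections import defaultdict


def group_by_content(images):
    """Group image names by the SHA-256 digest of their bytes.

    `images` is an iterable of (name, raw_bytes) pairs. Byte-identical
    files share a digest and therefore land in the same group.
    """
    groups = defaultdict(list)
    for name, data in images:
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return groups


def split_unique_and_duplicates(images):
    """Keep the first file seen for each digest; report the rest as duplicates."""
    groups = group_by_content(images)
    unique = [names[0] for names in groups.values()]
    duplicates = [name for names in groups.values() for name in names[1:]]
    return unique, duplicates


# Hypothetical sample data standing in for scraped files.
sample = [
    ("loom_front.jpg", b"\xff\xd8loom"),
    ("loom_copy.jpg", b"\xff\xd8loom"),
    ("maypole.jpg", b"\xff\xd8maypole"),
]
unique, duplicates = split_unique_and_duplicates(sample)
print(len(unique), "unique;", len(duplicates), "duplicate")
```

Exact hashing of this kind would miss the same photograph saved at a different resolution, which matters here because the Dropbox account held higher resolution versions of the blog images; catching those would need a perceptual comparison instead.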
We also extracted the unique object identifiers and labels associated with each image, which in WordPress are not associated with the image objects themselves. We used that data to populate the ContentDM instance and returned a sparse but stable skeleton for future archival development. In the process, we also learned a great deal about how a future implementation of a controlled vocabulary, an image acquisition and processing pipeline, and object documentation standards should work in the next stages of our collaborative PMSS archival development.

[6] In the original version of the collaboration, we had planned also to teach basic computer programming to high school students during a summer program that also would have used that same set of materials, but with the paired departures of the original co-PI as well as the former director, that plan has thus far remained unfulfilled.
[7] See https://www.whiting.org/content/jason-cohen.
[8] See https://kdl.kyvl.org/.
[9] Jason Cohen wishes to thank Mario Nakazawa, Bethanie Williams, and Tradd Schmidt for undertaking this project with him. The GitHub repository for the PMSS scraper is hosted here: https://github.com/Tradd-Schmidt/PMSS_Scraper.

As we developed and refined this new point of entry to the digital archives using the ContentDM hosting and framework, some of the ethical issues surrounding this local archive came more clearly into focus. A parallel set of questions arose in response in the first instance to J.D. Vance’s work, and in the second, to outsiders’ claims for technological solutions to the deterioration of local and cultural heritage.
Because we were creating virtual archival surrogates for materials housed at Pine Mountain, for instance, questions arose from the PMSS board members related to privacy and use of historical materials. Further, the board was concerned that even historical materials could bear on families present in the community today. We found that while profession-wide responses to archival constraints are shaped predominantly by discussions of copyright and fair use, issues of personal privacy are often left tacit. This gap between legal use and public interests in privacy reveals how tasks executed using techniques in machine learning may impinge upon more ethical constraints of public trust and civic obligation.[10]

Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the “historical validation” of primary source materials (2017, 424–5). When an AI system recognizes an object, Calo remarks, that object is validated. But how should one handle the lack of a specific vocabulary within a given training set? One answer, of course, would be to train a new set, but that response is becoming increasingly prohibitive for smaller cultural heritage projects like ours: the time and computational power required to execute the training is non-negligible. In addition, training resources (such as data sets, algorithms, and platforms) are increasingly becoming monetized, and we do not have the margins to buy access to new data for training. As a consequence, questions stemming from how one labels material in a controlled vocabulary were also at issue.
We encountered a failure in historical validation when, for instance, our AI system labeled a “spinning wheel” as a wheel, but did not detect its historical relationship to weaving and textiles. That validation was further obscured when the system also failed to categorize a second form of “spinning wheel,” which refers locally to a home-made merry-go-round.[11] In other words, not only did the system flatten a spinning wheel into a generic wheel, it also missed the regional homology between textile production and play, a cultural crux that reveals how this place envisions an intersection between work and recreation. By breaking the associations between two forms of “spinning wheel,” our system erased a small but significant site of cultural inheritance. How, we asked, should one handle such instances of effacement? At one level, one would expect an archival system to be able to identify the primitive machine for spinning wool, flax, or other raw materials into usable thread for textiles, but what about the merry-go-round? And what should one do when a system neglects both of these meanings and reduces the object to the same status as a wheel on a tractor, car, or carriage?

[10] The professional conversation in archive and collections management has not been as rich as the one emerging in AI contexts more broadly. For a recent discussion of the conflict in the roles of public trust and civic service that emerge from the context of the powers artificial intelligence holds for image recognition in policing applications, see Elizabeth Joh, “Artificial Intelligence and Policing: First Questions,” Seattle University Law Review 41: 1139–44.
[11] See “Spinning Wheel” in Cassidy 1985–2012.

Similarly, when competing naming conventions arose for landmarks, we were careful to consider which name should be granted priority as the default designation, and we asked how one should designate a local or historical name, whether for a road, waterway, knob, or other feature, in relationship to a more widely accepted nomenclature such as state route designations or standardized toponyms. As we attempted to address the challenge of multiple naming conventions, we encountered some of the same challenges that archivists find in dealing with indigenous peoples and their textual, material, and physical artifacts.[12] Following an example derived from the Passamaquoddy people, we implemented a small set of “traditional knowledge labels”[13] to describe several forms of information, including (a) restrictions on images that should not be shown to strangers (to protect family privacy), (b) places that should remain undisclosed (for instance, wild ginseng, ramp, orchid, or morel mushroom patches), and (c) educational materials focused on “how it was done” as related to local skills and crafts that have more modern implementations, but for which the traditional practices have remained meaningful. This included cases such as Maypole dancing and festivals, which remain endowed with ritual significance. In the final analysis, neither the framework supplied by copyright and fair use nor the one supplied by data validation proved singularly adequate to our purposes, but they did provide guidelines from which our facial recognition project could proceed, as we discuss below.

Machine Learning in a Local Archive

These preliminary discussions of ethics and convention may seem unrelated to the focus this collection adopts toward machine learning and artificial intelligence in the archive.
However, as we have begun to suggest, the data migration to ContentDM opened the door to machine learning for this project, and those initial steps framed the pitfalls that we continue to navigate as we move forward. As we suggested at the outset, the technical machine-learning task that we set for ourselves is not cutting-edge research as much as an application of existing technologies to a new aspect of archival investigation. We proposed (and succeeded with) an application of commercial facial recognition software to identify the persons in historic photographs in the PMSS archives. We subsequently proposed and are currently working to identify the photographs sharing common but unnamed faces, and in coordination with photographs of known people, to re-create the social network of this historic institution across slices of its history. We describe the next steps briefly below, but let us tarry for a moment with the question of how the ethical concerns we navigated up to this point also influenced our approach to facial recognition.

The first of those concerns has to do with commercial and public access to archival materials that, as we suggested above, include materials that are designated as restricted use in some way. We demonstrated to the local members at Pine Mountain how our use case and its constraints for digital archives fit with the current standards for the fair use of copyrighted materials based on the “substantive transformation” of reproduced objects (Levendowski 2018, 622–9). Since we are not making available large bodies of materials still protected by copyright, and since our use of select materials shifts the context within which they are presented, we were able to negotiate with PMSS to allow us to design a system for facial recognition using the ContentDM instance as our image source.
What that negotiation did not consider, however, is the case in which fair use does not provide a sufficiently high standard of control for the institution involved in the application of algorithms to institutional memory or its technological dependencies.

First, to test the facial recognition processes, we reached back to the most primitive and local version of facial recognition software that we could find, Google’s retired platform, the Picasa Web Albums API, which was retired in May 2016 and fully deprecated as of March 2018 (Sabharwal 2016). We chose Picasa because it is a self-contained software application that operates using a locally hosted script and locally hosted images. Given its deprecated status and its location on a local machine, we were confident that no cloud services would be ingesting the images we fed into the system for our trial. This meant that we could test small data examples without fear of having to upload an entire corpus of material that could subsequently be incorporated into commercial facial recognition engines or pop up unexpectedly in search results. We thus began by upholding a high threshold for privacy and insisting on finding ways for PMSS to maintain control over these images within the grasp of its local directories.

[12] One well-documented digital approach to handling indigenous archival materials includes the Mukurtu platform for indigenous cultural heritage: https://mukurtu.org/.
[13] For the original traditional knowledge labels, see: https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels.

The Picasa system created surprisingly good results within the scope we allowed it. It was highly successful at matching the small group of known faces we supplied as test materials.
While it would be difficult to supply a numerical match rate, first because of this limited test set, and second because we have not expanded the test to a broad sample using another platform, we were anecdotally surprised at how robust Picasa’s matching was in practice. For instance, Picasa matched the images of a single person’s face, Celia Cathcart, from pictures of her as a teenager to images of her as a grandmother. It recognized Cathcart in a group of basketball players, and it also identified her face from side-view and off-center angles, as in a photograph of her looking down at her newborn child. The most immediate limitation of Picasa lies in its tagging, which required manual entry of every name and did not allow any automation.

Following the success of that hand-tagging and cross-image identification process, we discussed with our partners whether the next step, using Amazon Web Services’ computer vision and facial recognition platform, Rekognition, would be acceptable. They agreed, and we ran the images through the AWS application, testing our results against samples pulled from our Picasa run to verify the results. Perhaps unsurprisingly, AWS Rekognition fared even better with those test cases. Using one photograph, the AWS application identified all of the Picasa matches as well as three new images that had not previously been tagged with Cathcart’s name. The same pattern held for other images in our sample group: Katherine Pettit was positively identified across more likenesses than had been previously tagged, and Alice Cobb was also positively tracked across images. This positive attribution also reveals a limitation of the metadata: while these three women we have named are important historical figures at PMSS, and while they are widely acknowledged in the archive and well-represented in the photographic record, not all of the photographs have been well-tagged or fully documented in the archive.
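
The chapter does not describe which Rekognition operations were called. The following is a hedged sketch of how one reference photograph might be compared against a set of candidate images, assuming the `compare_faces` operation of the boto3 Rekognition client; the helper names, the 90 percent threshold, and the file names are hypothetical. A stub client stands in for AWS so that the sketch runs offline and sends no images to a cloud service.

```python
def find_matches(client, reference_bytes, candidates, threshold=90.0):
    """Return (name, best_similarity) for each candidate that matches.

    `client` is expected to behave like boto3.client("rekognition"):
    compare_faces takes a source image and a target image and returns a
    dict whose "FaceMatches" list carries a "Similarity" score per match.
    Error handling (for example, images without faces) is omitted.
    """
    matches = []
    for name, image_bytes in candidates:
        response = client.compare_faces(
            SourceImage={"Bytes": reference_bytes},
            TargetImage={"Bytes": image_bytes},
            SimilarityThreshold=threshold,
        )
        face_matches = response.get("FaceMatches", [])
        if face_matches:
            best = max(match["Similarity"] for match in face_matches)
            matches.append((name, best))
    return matches


def newly_identified(matches, already_tagged):
    """Matched images that do not yet carry the person's name."""
    return [name for name, _ in matches if name not in already_tagged]


class StubClient:
    """Offline stand-in for the AWS client (an assumption for this demo).

    It "matches" when the two byte strings share a four-byte prefix,
    which has nothing to do with real face comparison.
    """

    def compare_faces(self, SourceImage, TargetImage, SimilarityThreshold):
        same = SourceImage["Bytes"][:4] == TargetImage["Bytes"][:4]
        return {"FaceMatches": [{"Similarity": 99.0}] if same else []}


# Hypothetical data: one reference portrait and three candidate images.
reference = b"CATH-portrait"
candidates = [
    ("basketball_team.jpg", b"CATH-group"),
    ("with_newborn.jpg", b"CATH-profile"),
    ("maypole_dance.jpg", b"OTHR-crowd"),
]
matches = find_matches(StubClient(), reference, candidates)
new_tags = newly_identified(matches, already_tagged={"basketball_team.jpg"})
print(new_tags)
```

Keeping the client as a parameter means the same functions could be pointed at a real service or at a local stand-in, and the list returned by `newly_identified` is only a set of candidate tags to be checked by a person, not a confirmed identification.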
The newly tagged images that we found would enrich the metadata available to the archive not because these images include surprising faces, but rather, because the tagging has been inconsistent, and over time, previously known faces have become less easy to discern.

As in other recent discussions of private materials disclosed within systems trained for matching and similarity, we found that the ethics of private materials for this non-private purpose provoked strong reactions. While some of the reaction was positive, with community members happy to have more images of the School’s founding director, Katherine Pettit, identified, those same community members were not comfortable with our role as researchers identifying people in the photographs in their community’s archive, unsupervised. They wanted instead to verify each positive identification, a point that we agreed with, but which also hindered the process of moving through 19,000 images. They wanted to maintain authority, and while we saw our efforts as contributions to their goals of better describing their archival holdings, it turns out that the larger scope of automation we brought to the project was intimidating. While its legal status and direct ethics seemed settled before the beginning of the project, ultimately, this project contributed to a sense among some individuals at PMSS that they were losing control of their own archive.[14] That fear of a loss of control led to another reckoning with the project, as we discuss in the next section.

What Machine Learning Cannot Learn: An Ethics of the Archive

It became clear, at the same moment we validated our test case, that our research goals and those of our partners had quickly diverged. We had discussed the scope and use of PMSS materials with our partners at PMSS and laid out in a formally drafted “Memorandum of Understanding” (MOU) adapted from the US Department of Justice (2008; 2017) our shared goals in the project.
As we described in the MOU, both partners considered it mutually beneficial for the archive and its metadata to be able to identify faces of named as well as unnamed people. We aimed to capture single-person images as well as groups in order to enrich the archive with cross-links to other photographs or archival materials with a shared subject heading, and we hoped to increase the number of names included in object attributes. Despite those conversations and multiple revisions of the MOU draft, what we discovered was ultimately different from the path our planning had indicated. Instead of creating an historical social network using the five decades of photographs we had prepared, we found that the history of the social network and the family and kinship relationships detailed through those images was deeply personal for the community living in the region today. We found out the hard way that those kinships reflected economic changes in status and power, realignments among families and their communities, and new patterns in the social fabric formed by the warp of personal relationships and the weft of local institutions (schools, hospitals, and local governance). Revealing those changes was not always something that our partners wanted us to do, and these were not patterns we had sought to discover: they are simply there, embedded in the images and the relations among images. These social changes in local alignments (tied in complex ways to marriages and separations, legal conflicts and resolutions, changes in ownership of residential and commercial interests, and other material reflections of that social fabric) remain highly charged and, for those continuing to live in the area, they revealed potentially unexpected parts of the lived realities and values of the place.
As a result, even though we had an MOU that worked for the technical details of the project, we could not find common ground for how to handle the competing social and ethical values of the project.

14 See, for another example of the ethical quandaries that may be associated with legal applications of machine learning techniques, Ema et al. 2019.

As we problem-solved, we tried to describe new forms of restriction and to generate appropriately sensitive guidelines to handle future use and access, but it turned out that all of these approaches were threatening to the values of a tightly knit community. They, rightly, want to tell their story, and so many people have told it so poorly for so long that they wish to have sole access to the materials from which the narratives are assembled. As researchers interested in open access and stable platform management, we have disagreements with the scholarly and archival implications of this decision, but we ultimately respect the resolve and underlying values that accompany the difficult choices PMSS makes about its public audiences and the corresponding goals it maintains for its collections. Interestingly, Wykle has come to view our work with PMSS collections as another form of the material and cultural extraction that has dominated the region for generations. While we see our work in light of preservation and access as well as our lasting commitment to PMSS and the region, we have also come to recognize the powerful explanatory force that the idea of “extraction” has acquired for the communities in a region that has suffered the negative effects of many extraction industries. In acknowledging the limitations of our own efforts, we would posit that our case study offers a counter-example to works that suggest how AI systems can be designed automatically to meet the needs of their constituents (Winfield et al. 2019).
We tried to use a design approach to address our research goals and our partners’ needs, and it turned out that the dynamically constructed and evolving nature of those needs outstripped the capacity we could build into our available system of machine learning.

The divergence of our goals has led the collaboration to an impasse. Given that we had already outlined further steps in our initial documents that could not be satisfied after the partners identified their divergent intentions, the collaborative scope the partners initially described was not completely fulfilled. The divergence of goals became stark: as researchers interested in the relevance and sustainability of these archives, we were moving the collections toward a more accessible and comprehensive platform with open documentation and protocols for future development. By contrast, the PMSS staff were moving toward more stringent and local controls over access to the archives in order to limit dissemination. At this juncture, we had some negotiating to do. First, we made the ContentDM instance a password-protected, private sandbox rather than a public instance of a virtual digital collection. As PMSS owns the material, they decided shortly thereafter to issue a take-down order of the ContentDM instance, and we complied. As the ContentDM materials were ultimately accessible in the public domain on their live site, this decision revealed how personal the challenges had become. Nothing included in the take-down order was unique or new material; rather, the ContentDM site simply provided a more accessible format for existing primary material on the WordPress site, stripped of its interpretive and secondary contexts.
If there is a silver lining, it lies in this context for use: the “academic divorce” we underwent by discontinuing our collaboration has made it possible for us to continue conducting research on the publicly available archival materials without being obligated to host a live and dynamic repository for further materials. As a result, we can test best approaches without having to worry about pushing them to a live production site. Within this constraint, we aim to continue re-creating the historical social network without compromising our partners’ needs for privacy and control of their production site.

The mutual decision to terminate further partnership activities based in archival development arose because of these differing paths forward. That decision meant that any further enrichment of the archival materials would not become publicly available, which we saw as a penalty against using the archive at a moment when archives need as much advocacy and visible support as possible.

Under these constraints of private accessibility, we have continued to work on the AWS Rekognition pipeline and have successfully identified all of the faces of named people featured in the archive, with face and name labels now associated with over 1,900 unique images. Our next step, delayed to Spring 2021 as a result of the COVID-19 pandemic, includes the creation of an associative network that first identifies unnamed faces in each image using unique identifiers. The second element of that process will be to generate an historical social network using the co-occurrence among those faces as well as the faces of named people in the available images.
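The co-occurrence step described above can be sketched as counting how often two face identifiers appear in the same image. This is only an illustration under stated assumptions: the function name `cooccurrence_edges`, the input layout, and the identifiers below are hypothetical, and the source does not specify how the network will actually be built.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(faces_by_image):
    """Count how often each pair of face identifiers shares an image.

    faces_by_image maps an image identifier to the face identifiers
    (for named or unnamed people) detected in that image. The result
    maps each unordered pair to the number of images the pair shares.
    """
    edges = Counter()
    for faces in faces_by_image.values():
        # De-duplicate and sort so each unordered pair has one key.
        for pair in combinations(sorted(set(faces)), 2):
            edges[pair] += 1
    return edges


# Hypothetical data: three images and four face identifiers.
images = {
    "img_001": ["face_a", "face_b"],
    "img_002": ["face_a", "face_b", "face_c"],
    "img_003": ["face_d"],
}
edges = cooccurrence_edges(images)
```

The counts act as edge weights: a pair photographed together often receives a heavier edge, while a face that only ever appears alone contributes no edge at all.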
Given that our metadata enrichment has already included date associations for most of the images, we are confident that we will be able to reconstruct historically specific networks for a given year or range of years. Moreover, we expect that the association between dates and named people will help us to identify further members of the community who are not currently named in the photographs, because the groups involved in activities and clubs were small and the student and teacher populations were generally limited during any given year.

We are now far more sensitive to how the local concerns of this community shape our research methods and outcomes. The longer-term hope, one it is not at all clear that we will be allowed to pursue, would be to use natural language processing tools on the archive’s textual materials, particularly named entity recognition and word vectors, to search and match images where known names occur proximate to the names of unmatched faces. The present goal, however, remains to create a more replete and densely connected network of faces and the places they occupied when they were living in the gentle shadows of Pine Mountain. In order to abide by PMSS community wishes for privacy, we will be using anonymized aggregate results without identifying individuals in the photographs. While this method has the drawback of not being able to reveal the complexity of the historical relations at the granular level of individuals, it will allow us to report on the persistence or variation in network metrics, such as network density, centrality, path length, and betweenness measures, among others. In this way, we aim to be able to measure and report on the network and its changes over time without reporting on individuals. We arrived at an anonymizing method as a solution to the dissolved partnership by asking about the constraints of FERPA as well as by looking back at federal and commercial facial recognition practices.
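Reporting aggregate measures for a range of years without naming individuals can be sketched as follows. This is an assumption-laden illustration, not the method the project adopted: the salted-hash pseudonyms, the record layout, and all function names and sample values are hypothetical, and a truncated hash alone does not guarantee anonymity in a community this small.

```python
import hashlib
from collections import deque


def anonymize(face_id, salt):
    """Replace an identifier with a salted hash so no label is reported."""
    digest = hashlib.sha256((salt + face_id).encode("utf-8")).hexdigest()
    return digest[:12]


def build_graph(records, start_year, end_year, salt):
    """Build an undirected adjacency map from images dated in a range.

    Each record is a (year, [face identifiers]) pair; hypothetical layout.
    """
    graph = {}
    for year, faces in records:
        if not start_year <= year <= end_year:
            continue
        nodes = sorted({anonymize(f, salt) for f in faces})
        for node in nodes:
            graph.setdefault(node, set())
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def density(graph):
    """Share of possible undirected edges that are actually present."""
    n = len(graph)
    if n < 2:
        return 0.0
    edge_count = sum(len(nbrs) for nbrs in graph.values()) / 2
    return edge_count / (n * (n - 1) / 2)


def average_path_length(graph):
    """Mean shortest-path length over connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical records: only the first two fall inside 1920-1930.
records = [
    (1925, ["face_a", "face_b"]),
    (1926, ["face_b", "face_c"]),
    (1940, ["face_a", "face_d"]),
]
graph = build_graph(records, 1920, 1930, salt="example-salt")
```

Calling `density(graph)` and `average_path_length(graph)` for successive year ranges would yield a series of aggregate values that can be compared over time while the node labels stay opaque.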
In each case, the dark side of these technological tools remains one associated with surveillance and, in the language of Eastern Kentucky, extraction. We mention this not only to be transparent about our recognition of these limitations, but also in the hopes of opening a new dialogue with our partners that might stem from generating interesting discoveries without compromising their sense of the local ownership of their archival materials. Nonetheless, in order to report on the most interesting aspects, the actual people and their local histories of place, the work to be done would remain more at a human level than at a technical one.

Conclusion

In conclusion, our project describes a success that remains imbricated with a shortcoming in machine learning. The machine learning tasks and algorithms our project implemented serve a mimetic function in the distilled picture of the community they reflect. By matching historical faces to names, the project embraces a form of digital surrogacy: we have aimed to produce a meta-historical account of the present institution’s social and cultural function as a site of social networking and local knowledge transmission. As Robyn Caplan and danah boyd have recently suggested, the “bureaucratic functions” these algorithms promote can be understood by the ways in which they structure users’ behaviors (2018, 3). We would like to supplement Caplan and boyd’s insight regarding the potential coercions involved in how data structures implicitly shape their contents as well as their users’ behaviors. Not only do algorithms promote a kind of bureaucracy, to ends that may be positive and negative, and sometimes both at once, but further, those same structures may reflect or shape public behaviors and interactions beyond a single platform.

As we move between digital and public spheres, our work similarly shifts its scope. The research that we intended to have positive community effects was instead read by that very same set of people as an attempt to displace a community from the center of its own history. In other words, the bureaucratic functions embedded in PMSS as an institution saw our new approach to their storytelling as an unwanted and external intervention. As their response suggests, the internal and extant structures for governing their community, its stories, and the people who tell them saw our contribution as an effort to co-opt their control. Where we thought we were offering new tools for capturing, discovering, and telling stories, they saw what Safiya Noble has recently characterized in a specifically racialized context as “algorithms of oppression” (2018). Here the oppression would be geographic, socio-economic, and cultural, rather than racial; nevertheless, the perception that one is being oppressed by systems set into place by agents working beyond one’s own community remains a shared foundation in Noble’s argument and in the unexpected reception of our project.

As we move forward with our own project into unknown territories, in which our work-products may never see the light of day because of the value conflicts bound up in making archival objects public and accessible, we have found a real and lasting respect for the institutional dependencies and emplacements within which we all do our work. We hope to channel some of those functions of emplacement to create new forms of accountability and restraint that will allow us to move forward, but at least for now, we have found with our project one limitation of machine learning, and it is not the machine.

References

Ahmed, Manan, Maira E. Álvarez, Sylvia A. Fernández, Alex Gil, Rachel Hendery, Moacir P. de Sá Pereira, and Roopika Risam. 2018. “Torn Apart / Separados.” Group for Experimental Methods in Humanistic Research.
https://xpmethod.plaintext.in/torn-apart/volume/2/.
Bailey, Ronald. 2017. “The Noble, Misguided Plan to Turn Coal Miners Into Coders.” Reason, November 25, 2017. https://reason.com/2017/11/25/the-noble-misguided-plan-to-tu/.
Calo, Ryan. 2017. “Artificial Intelligence Policy: A Primer and Roadmap.” University of California, Davis Law Review 51:399-435.
Caplan, Robyn and danah boyd. 2018. “Isomorphism through algorithm: Institutional dependencies in the case of Facebook.” Big Data & Society (January-June): 1-12. https://doi.org/10.1177/2053951718757253.
Cassidy, Frederic G. et al., eds. 1985-2012. Dictionary of American Regional English. Cambridge, MA: Belknap Press. https://www.daredictionary.com.
Ema, Arisa et al. 2019. “Clarifying Privacy, Property, and Power: Case Study on Value Conflict Between Communities.” Proceedings of the IEEE 107, no. 3 (March): 575-80. https://doi.org/10.1109/JPROC.2018.2837045.
Harkins, Anthony and Meredith McCarroll, eds. 2019. Appalachian Reckoning: A Region Responds to Hillbilly Elegy. Morgantown, WV: West Virginia University Press.
Hochschild, Arlie. 2018. “The Coders of Kentucky.” The New York Times, September 21, 2018. https://www.nytimes.com/2018/09/21/opinion/sunday/silicon-valley-tech.html.
Joh, Elizabeth. 2018. “Artificial Intelligence and Policing: First Questions.” Seattle University Law Review 41 (4): 1139-44.
Latour, Bruno. 2007. Reassembling the Social: An Introduction to Actor-Network-Theory. New York: Oxford University Press.
Levendowski, Amanda. 2018. “How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem.” Washington Law Review 93 (2): 579-630.
Mukurtu CMS. https://mukurtu.org/. Accessed December 12, 2019.
Noble, Safiya. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press.
Passamaquoddy People. “Passamaquoddy Traditional Knowledge Labels.” https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels. Accessed December 12, 2019.
Risam, Roopika. 2015. “Beyond the Margins: Intersectionality and the Digital Humanities.” DHQ: Digital Humanities Quarterly 9 (2). http://digitalhumanities.org/dhq/vol/9/2/000208/000208.html.
Robertson, Campbell. 2019. “They Were Promised Coding Jobs in Appalachia. Now They Say It Was a Fraud.” The New York Times, May 12, 2019. https://www.nytimes.com/2019/05/12/us/mined-minds-west-virginia-coding.html.
Sabharwal, Anil. 2016. “Moving on from Picasa.” Google Photos Blog. Last modified March 26, 2018. https://googlephotos.blogspot.com/2016/02/moving-on-from-picasa.html.
Sabharwal, Arjun. 2015. Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Boston: Chandos.
Stephan, Karl D., Katina Michael, M.G. Michael, Laura Jacob, and Emily P. Anesta. 2012. “Social Implications of Technology: The Past, the Present, and the Future.” Proceedings of the IEEE 100, Special Centennial Issue (May): 1752-1781. https://doi.org/10.1109/JPROC.2012.2189919.
United States Department of Justice. 2008. “Guidelines for a Memorandum of Understanding.” https://www.justice.gov/sites/default/files/ovw/legacy/2008/10/21/sample-mou.pdf.
United States Department of Justice. 2017. “Sample Memorandum of Understanding.” http://www.doj.state.or.us/wp-content/uploads/2017/08/mou_sample_guidelines.pdf.
Vance, J.D. 2016. Hillbilly Elegy: A Memoir of a Family and Culture in Crisis. New York: Harper.
Weizenbaum, Joseph. 1976. Computer Power and Human Reason: From Judgment to Calculation. New York: W.H. Freeman and Co.
Winfield, Alan F., Katina Michael, Jeremy Pitt, and Vanessa Evers. 2019. “Machine Ethics: the design and governance of ethical AI and autonomous systems.” Proceedings of the IEEE 107, no. 3 (March): 509-17. https://doi.org/10.1109/JPROC.2019.2900622.