5/22/2019 aac_finalreport - Google Docs https://docs.google.com/document/d/12qb6J5dSMcQ0Rt2JE0zAnF_-BpRY04PEMLL8U_Qrjf8/edit# 1/180

Always Already Computational: Collections as Data
Final Report

Thomas Padilla (PI)
Laurie Allen (Co-PI)
Hannah Frost (Co-I)
Sarah Potvin (Co-I)
Elizabeth Russey Roke (Co-I)
Stewart Varner (Co-I)

---------

This project was made possible by the Institute of Museum and Library Services (LG-73-16-0096-16). The views, findings, conclusions, or recommendations expressed in this publication do not necessarily represent those of the Institute of Museum and Library Services or author host institutions.

Acknowledgements

The project team would like to acknowledge the Institute of Museum and Library Services, whose support made this project possible. Program Officers Trevor Owens and Emily Reynolds provided essential guidance throughout. Patricia Hswe, formerly at Penn State University Libraries, now at the Andrew W. Mellon Foundation, helped spark the idea that became reality. We thank Stanford University, Texas A&M University, Emory University, and the University of Pennsylvania for their contributions to the project. Project home institutions (the University of California, Santa Barbara, and later the University of Nevada, Las Vegas) provided crucial support to the project. Special thanks to Amy Gros Louis, Kee Choi, Lonnie Marshall, Maggie Farrell, and Michelle Light.
Individuals listed below authored and edited project resources, participated in national forums, and presented or served on program committees for project-initiated events. Many others beyond the list below contributed; we are grateful to them all.

Ajao, John; Almas, Bridget; Anderson, Clifford; Arroyo Ramirez, Elvia; Averkamp, Shawn; Bailey, Helen; Bailey, Jefferson; Baumgardt, Frederik; Becker, Devin; Butterhof, Robin; Capell, Laura; Chassanoff, Alex; Claeyssens, Steven; Clement, Tanya; Coates, Heather; Coble, Zach; Collard, Scott; Craig, Kalani; Cram, Greg; Del Rio Riande, Gimena; Di Cresce, Rachel; Dickson, Eleanor; Dombrowski, Quinn; Elings, Mary; Enderle, Scott; Escobar Varela, Miguel; Ferriter, Meghan; Foreman, Gabrielle P.; Fowler, Daniel; Galarza, Alex; Gniady, Tassie; Gradeck, Bob; Green, Harriett; Guiliano, Jennifer; Hardesty, Julie; Harlow, Christina; Higgins, Devin; Horowitz, Sarah M.; Hswe, Patricia; Ikeshoji-Orlati, Veronica; Jansen, Greg; Johnston, Lisa; Jordan, Mark; Jules, Bergis; Kashyap, Nabil; Kaufman, Micki; Kerchner, Dan; Kizhner, Inna; Kouper, Inna; Leem, Deborah; Lill, Jonathan; Lillehaugen, Brook; Lincoln, Matthew; Littman, Justin; Liu, Alan; Locke, Brandon; Lynch, Katherine; Mannheimer, Sara; Marciano, Richard; Marcus, Cecily; Martinez, Alberto; Mason, Ingrid; Matienzo, Mark; McLaughlin, Steve; Meredith-Lobay, Megan; Miller, Matthew; Milligan, Ian; Mookerjee, Labanya; Morgan, Paige; Neatrour, Anna; Newbury, David; Nunes, Charlotte; Orlowitz, Jake; Patterson, Sarah; Phillips, Cheryl; Pollock, Caitlin; Porter, Dot; Posner, Miriam; Powell, Chaitra; Rabun, Sheila; Ridge, Mia; Rodgers, Richard; Romary, Laurent; Ross, Denice; Roued-Cunliffe, Henriette; Sakr, Laila; Scates Kettler, Hannah; Schmidt, Ben; Schwartz, Daniel L.;
Scott Weber, Chela; Senseney, Megan; Seubert, David; Severson, Sarah; Sherratt, Tim; Simpson, John; Souther, Mary; Sutton Koeser, Rebecca; St. Onge, Timothy; Terras, Melissa; Thomas, Deborah; Thompson, Santi; Tomasek, Kathryn; Tracy, Daniel G.; Van Tine, Lindsay; Vejvoda, Berenica; Vogeler, Georg; Weigel, Tobias; Weingart, Scott; Whitmire, Amanda; Williams, Elliot; Wolf, Nick; Wrubel, Laura; Yarasavage, Nathan; Zarafonetis, Michael; Zastrow, Thomas; Ziegler, Scott; Zwaard, Kate

Scope Note
Introduction
Activities
About our approach
Collections as Data Framework (v1)
The Santa Barbara Statement on Collections as Data
Collections as Data Facets
Collections as Data Personas
50 Things
Methods Profiles
Collections as Data Position Statements
Additional Resources
Impacts
Findings
Collections as data development requires critical engagement with the ethical implications of cultural heritage organization work
Collections as data development is possible at a wide range of organizations
Collections as data development benefits collection users and stewards
Challenges to collections as data development are more organizational than technical
Collections as data development benefits from engaging specific community needs
Collections as data development benefits from collaboration across multiple communities of practice
Areas for Further Investigation
Moving from ethical consideration to action
Conducting more community-specific user studies to inform workflow development
Developing functional requirements in service to user and collection steward needs
Publicly charting and sharing the terms of relationships with commercial entities
Enabling widespread collections as data discovery
Addressing collections as data preservation needs and obstacles
Exploring post-custodial approaches to collections as data
Appendices
Appendix 1: The Santa Barbara Statement on Collections as Data
Appendix 2: Collections as Data Facets
Appendix 3: Collections as Data Personas
Appendix 4: 50 Things
Appendix 5: Collections as Data Methods Profiles
Appendix 6: National Forum Position Statements
Appendix 7: Forum Summaries
Appendix 8: Conference engagements, 2017-2018
Appendix 9: Digital Humanities 2017 preconference: Shaping Humanities Data

Scope Note

From 2016 to 2018, Always Already Computational: Collections as Data documented, iterated on, and shared current and potential approaches to developing cultural heritage collections that support computationally-driven research and teaching. With funding from the Institute of Museum and Library Services, Always Already Computational held two national forums, organized multiple workshops, shared project outcomes at disciplinary and professional conferences, and generated nearly a dozen deliverables meant to guide institutions as they consider development of collections as data.

This report documents the activities and impacts of the Always Already Computational project, delineates findings, and identifies areas for further inquiry.
Introduction

Always Already Computational: Collections as Data arose from practical need and a desire to build upon decades of digital collection practice. While cultural heritage practitioners have broad experience replicating the analog experience of watching, viewing, and reading in a digital environment, they less commonly share the experience of supporting users who want to work with collections as data - a conceptual orientation to collections that renders them as ordered information, stored digitally, that are inherently amenable to computation. These users come from many disciplines and professions, they act within and outside of the university, and they share a desire to leverage computational methods like machine learning, computer vision, text mining, visualization, and network analysis. Meeting their needs is contingent on the availability of collections, infrastructure, and services that are tuned for computational work.

At the time Always Already Computational formed, existing experience in this space was difficult to discern beyond relatively well-resourced efforts like the HathiTrust Research Center and the British Library. Without diversification of examples and corresponding paths to doing the work, collections as data efforts ran the risk of being perceived as an elite activity - smaller actors need not apply. It became clear that a broader field of participation was needed.
Ideally, this field would exhibit variation in institutional resources, collection types, and community responsibilities. All of the above would critically contend with the ethical implications of producing and making use of collections as data. From 2016 to 2018, Always Already Computational sought to cultivate this field by openly documenting, iterating on, and sharing current and potential approaches to developing cultural heritage collections that support computationally-driven research and teaching.

At inception, anticipated project outcomes were as follows: gather key stakeholders to craft a strategic direction that leads to (1) creation of a collections as data framework that supports pragmatic collection transformation and documentation, (2) development of computationally amenable collection use cases and user stories, (3) identification of methods for making computationally amenable library collections more discoverable through aggregation and other means, and (4) guidance in the form of functional requirements that support development decisions about technical feature integrations with repository infrastructures.

As synchronous and asynchronous engagements began in earnest, project scope and the shape of deliverables morphed accordingly. The tension between creation of particular solutions and universal solutions was persistent. Given its nature as a broadly conceived community project, Always Already Computational was not positioned to make overly specific technical recommendations.
Preference was ultimately given to the creation of malleable deliverables that could be shaped to guide engagement with particular community needs. We determined that collections as data discoverability and the development of specific functional requirements were projects that required independent investigation. Ideally these investigations will be tied to specific contexts, a framing distinct from a project like Always Already Computational, which sought cultural heritage community-wide engagement.

Always Already Computational deliverables constitute version 1 of the collections as data framework. This framework includes a range of resources, expressed in different forms, providing multiple points of engagement throughout the process of considering collections as data efforts. For example, The Santa Barbara Statement on Collections as Data is a set of principles, developed with community feedback, designed to help guide practitioners through the practical, theoretical, and ethical dimensions of collections as data work. This deliverable does not advance solutions; rather, it raises core questions to be resolved in local contexts. The Collections as Data Facets describe a range of institutional approaches to implementing collections as data. This resource aims to help practitioners see multiple paths into doing the work. The Collections as Data Personas represent high-level role types associated with collections as data development and use.
Together, the personas, derived from Always Already Computational community engagements and project team experience, aim to surface needs, motivations, and goals in context. Compiled at the end of two years of project engagements, the 50 Things provide examples of things a practitioner can do to initiate collections as data at their institution.

Throughout the course of the project, Always Already Computational was inspired and humbled by the active interest and ingenuity shown by librarians, archivists, museum professionals, researchers, educators, and more as they engaged with collections as data challenges and opportunities. By emphasizing diverse community engagement and documentation over prescriptive recommendations, we hope that we have cultivated, encouraged, and questioned in ways that a wide range of communities find useful.

Thomas Padilla
Laurie Allen
Hannah Frost
Sarah Potvin
Elizabeth Russey Roke
Stewart Varner

Activities

About our approach

From the beginning, Always Already Computational held an expansive view of collections as data work. The project sought to document implications of collections as data work across cultural heritage organization functions, practices, and roles. A National Forum with participants representing a broad spectrum of perspectives kicked off project activity. Two years of synchronous and asynchronous community engagements spanning a range of professional and disciplinary contexts followed.
Project activity was designed to serve three near-term goals: (1) identify cross-cutting issues and bring common themes into focus, (2) scaffold project activity with those issues and themes, and (3) identify special concerns or less clear areas that required deeper investigation. Discussions at the first National Forum informed overall project goals and direction. Project deliverables were iterated on over the course of the project activity. Iteration was by design, given the need to engage, respond to, and incorporate diverse community input. Deliverables were shared across a range of venues including but not limited to the Digital Library Federation, American Historical Association, Society of American Archivists, the Coalition for Networked Information, Association of College and Research Libraries, NICAR, and Open Repositories.

Always Already Computational community engagements drew inspiration from human-centered design methods. The LUMA Institute Handbook of Human-Centered Design Methods and the Liberating Structures toolkit provided a series of generative activities: [1]

● Round Robin - generate fresh ideas by providing a format for group authorship. [2]
● Concept Poster - promote an idea and rally support for its development. [3]
● Affinity Clustering - teams sort items based on perceived similarity, defining commonalities that are inherent but not necessarily obvious. [4]
● Importance/Difficulty Matrix - establish priorities by plotting relative importance and difficulty. [5]
● 1-2-4-All - generate ideas that open with self-reflection in response to a prompt and expand into larger group discussion. [6]

[1] LUMA Institute, Innovating for People: Handbook of Human-Centered Design Methods (Pittsburgh, PA: LUMA Institute, 2012); http://www.liberatingstructures.com/
[2] Innovating for People, 64.
[3] Ibid., 76.
[4] Ibid., 40.
[5] Ibid., 44.
[6] http://www.liberatingstructures.com/1-1-2-4-all/

Individual and group perspectives gathered through these activities directly informed the framework described below.

Collections as Data Framework (v1)

The Santa Barbara Statement on Collections as Data

The Santa Barbara Statement on Collections as Data is a set of principles, developed with community feedback, designed to help guide practitioners through the practical, theoretical, and ethical dimensions of collections as data work. This deliverable does not advance solutions; rather, it raises core questions to be resolved in local contexts. The first version of the Santa Barbara Statement was inspired by the first collections as data national forum (UC Santa Barbara, March 1-3, 2017). After its release, the team asynchronously gathered comments on the web via open annotation and sought synchronous feedback across a series of conversations and workshops. The second version of the statement was revised and released based on community feedback.

Permanent link: https://doi.org/10.5281/zenodo.3066209

Collections as Data Facets

Collections as Data Facets, authored by community contributors, document collections as data implementations.
An implementation consists of the people, services, practices, technologies, and infrastructure that aim to encourage computational use of cultural heritage collections. The fifteen facets represent collections as data efforts at museums, academic libraries, societies, and institutions like the Library of Congress.

Permanent link: https://doi.org/10.5281/zenodo.3066240

Collections as Data Personas

Collections as Data Personas represent high-level role types associated with the development and use of collections as data. The personas aim to surface needs, motivations, and goals in context.

Permanent link: https://doi.org/10.5281/zenodo.3066515

50 Things

50 Things is designed for practitioners who are seeking to get started with collections as data. 50 Things provides an impetus for exploring, learning from colleagues, deepening knowledge and understanding, and taking that first step. Participants at our second National Forum (University of Nevada, Las Vegas, May 7-8, 2018) provided the bulk of recommendations.

Permanent link: https://doi.org/10.5281/zenodo.3066237

Methods Profiles

Methods Profiles characterize common research methods in relation to the process of collections as data development. They are designed to help collection stewards bridge the gap between research methods and design of workflows that support creation of machine actionable collections.
Permanent link: https://doi.org/10.5281/zenodo.3146756

Collections as Data Position Statements (Forum 1)

Prepared by invited participants in advance of the first collections as data national forum (UC Santa Barbara, March 1-3, 2017), the twenty-six position statements describe challenges, opportunities, connections, and gaps in the work of collections as data. These perspectives subsequently informed project activity.

Permanent link: https://doi.org/10.5281/zenodo.3066161

Additional Resources

● National Forum 2 Livestream Recording (https://www.youtube.com/watch?v=ENaPV2XmO9I)
● Collections as Data Google Group - As of May 2019, the Google Group includes 57 topics and 413 members [7]
● Collections as Data Group Library - As of May 2019, this Zotero group includes 266 items and 73 members [8]
● Serendipitous Collections as Data (https://collectionsasdata.github.io/serendipity/)

[7] https://groups.google.com/forum/#!forum/collectionsasdata
[8] https://www.zotero.org/groups/2171423/collections_as_data_-_projects_initiatives_readings_tools_datasets

Impacts

Always Already Computational’s primary role, as expressed in the framework, was to highlight existing work, foster conversations, identify gaps, collect feedback, and spark further conversation and adoption in the context of specific community needs.
The impact of Always Already Computational is likely best measured by its potential to motivate further development.

Over two years of project activity, Always Already Computational saw collections as data:

● … taken up as a strategic priority within the University of California’s Shared Content Leadership Group’s Plans & Priorities for 2017/2018 Based on the University of California Library Collection: Content for the 21st Century and Beyond
● … incorporated as a feature of the OCLC Research and Learning Agenda for Archives, Special, and Distinctive Collections in Research Libraries
● … inform the creation of permanent positions like the Digital Collections as Data Manager position at Johns Hopkins University Libraries
● … inform the creation of postdoctoral positions like the British National Archives’ FTNA Postdoctoral Research Fellowship, focused on unlocking “archival collections as data”
● … identified as a core driver for an international curriculum effort on the future of archival science
● … presented as a component of the Digital Library Federation eResearch Network
● … inform Software Preservation Network outreach
● … delivered as a week-long collections as data course at the Humanities Intensive Learning and Teaching Institute
● … inspire reading groups, international hackathons, workshops, and conference sessions that span disciplinary, library, archives, and museum communities. [9]
● … directly inform Collections as Data: Part to Whole, awarded $750,000 by the Andrew W. Mellon Foundation to foster the development of broadly viable models that support implementation and use of collections as data.

[9] “2017/2018 SCLG Plans & Priorities for 2017/2018 Based on the University of California Library Collection: Content for the 21st Century and Beyond,” University of California, last modified September 29, 2017, http://libraries.universityofcalifornia.edu/groups/files/sclg/docs/SCLG_2017_2018%20Plan.pdf; Chela Scott Weber, “Research and Learning Agenda for Archives, Special, and Distinctive Collections in Research Libraries,” OCLC Research, 2017, https://doi.org/10.25333/C3C34F; “Manager of Digital Collections as Data,” https://jobs.jhu.edu/job/Baltimore-Manager-of-Digital-Collections-as-Data-MD-21218/546941200/; “Developing a Computational Framework for Library and Archival Education,” https://dcicblog.umd.edu/ComputationalFrameworkForArchivalEducation/; “FTNA Postdoctoral Research Fellowship (Datafication) at The National Archives,” February 9, 2018, https://web.archive.org/web/20180209203649/http://www.jobs.ac.uk/job/BHO511/ftna-postdoctoral-research-fellowship-datafication/; “EResearch Network - DLF Wiki,” accessed January 21, 2019, https://wiki.diglib.org/EResearch_Network#webinars; “Events | The Software Preservation Network,” accessed May 14, 2018, http://www.softwarepreservationnetwork.org/events/; Thomas Padilla and Mia Ridge, “Collections as Data,” HILT (blog), accessed January 21, 2019, https://dhtraining.org/hilt/course/collections-as-data-2018/; Natalia Ermolaev, “CDH Reading Group: Collections as Data,” Center for Digital Humanities @ Princeton University, September 12, 2018, https://cdh.princeton.edu/updates/2018/09/12/cdh-reading-group-collections-data/; Moore Institute, “Collections as Data - Hackathon / Collaborative Workshop - Moore Institute,” NUI Galway (blog), accessed January 21, 2019, http://mooreinstitute.ie/event/collections-data-hackathon-collaborative-workshop/; Michelle Dalmau, “Collections as Data at Indiana University and Beyond,” November 16, 2018, https://libraries.indiana.edu/collections-data-indiana-university-and-beyond; Rebecca Menendez, Cheryl Miller, Andrzej Rutkowski, and Stacy R. Williams, “ARLIS/NA 47th Annual Conference: Getting Started with Collections as Data,” accessed January 21, 2019, https://arlisna2019.sched.com/event/ITtA/getting-started-with-collections-as-data?iframe=no&w=100%&sidebar=yes&bg=no; Liz Neely, Anne Luther, and Chad Weinard, “Cultural Collections as Data: Aiming for Digital Data Literacy and Tool Development – MW19 | Boston,” accessed January 21, 2019, https://mw19.mwconf.org/proposal/cultural-collections-as-data-aiming-for-digital-data-literacy-and-tool-development/; Thomas Padilla, Hannah Scates Kettler, Laurie Allen, and Stewart Varner, “Collections as Data: Part to Whole,” Collections as Data - Part to Whole, accessed January 21, 2019, https://collectionsasdata.github.io/part2whole/.

In addition to tracking the examples of impact above, the Always Already Computational team asked through an open survey, “Have you used this project?” We include below a sampling of responses:

More than the resources, which I've referenced and read and looked at off and on during the project's run, we (the digital library folk at Idaho) have used the idea(s) promoted through the project to stimulate our own thinking, development, and conversations. I've had other librarians I don't work with that closely with bring up the project to me, and that's led to some really interesting conversations.

Devin Becker, University of Idaho

(1) I'm leading data curation work package in a national research and data infrastructure project for humanities, arts and social sciences. I drew on the Collections As Data facets to augment the advice given to a colleague new to data curation, to help them think about how to make data available e.g. via API or as static snapshots in a sustainable manner, and to think about their collection "as data". (2) The facets have also informed the development of a data curation framework for data sharing and interoperability across multiple platforms (discovery, access, research and archiving).

Ingrid Mason, Australia’s Academic and Research Network (AARNet)

So far, we have developed one research project exploring the use of oral histories as collections as data. Collections as Data has also strongly influenced our thinking of how best to digitize and make available a collection of mining records from the early 1900s, which would be best expressed more as a database set up for computational use by researchers in addition to a traditional digital collection.

Anna Neatrour, University of Utah

Use of the [collections as data] facets were instrumental in explaining the widespread practice of working with collections as data. Before this list of examples, it was a constant struggle to explain the idea and justify the work. I frequently cite the Santa Barbara Statement when writing about the use of data in special collection libraries. I've used the Personas somewhat less regularly, but I have referenced them to offer examples for what types of researchers might be interested in different types of data.

Scott Ziegler, Louisiana State University

I think it helps people draw the connection between digital archives and progressive values. I also think it's a helpful, positive avenue into discussing what resources are necessary in terms of storage, repository infrastructure, etc. in order to archive collections digitally, and why institutions should earmark funds and other resources to support collections as data.

Charlotte Nunes, Lafayette College

Findings

Below, we share some of the clearest findings that arose from Always Already Computational. As an overarching finding, we cannot emphasize enough the value of collaboration between staff working within and across galleries, libraries, archives, and museums. Collections as data development provides concrete, generative opportunities to learn things from one another. In a university context, much is often said about interdisciplinary research and its role in addressing challenges. With collections as data, we have an opportunity to embrace the value of interprofessionalism.
The incorporation of concepts, language, and standards from multiple areas of practice allows for a more nuanced understanding of systems and the ways they can serve us and our users. As each of the findings below is considered, readers are encouraged to think broadly about the kinds of collaborations that would allow forward progress.

Collections as data development requires critical engagement with the ethical implications of cultural heritage organization work

Collections as data development must critically engage with bias in collection and description, archival silences, and assumptions about collection use. The viability of collections as data efforts demands critical engagement - especially as collection practices leveraging computational means like machine learning, computer vision, and more hold as much potential to harm as to help. Archival approaches to provenance, with their focus on documenting the custodial and contextual history of objects, provide one path forward. Ethical fault lines are often easier to see when trying to develop new policies and workflows. Examination of policies and workflows should support changes in practice. Prior harms should be acknowledged and remediated to the extent possible.

Collections as data development is possible at a wide range of organizations

Collections as data development does not depend on availability of abundant resources - the work is possible at a wide range of organizations. Incremental progress is a primary feature of this work.
Small scale projects, experiments, and discussions can help establish a more inclusive path forward. While discussions of API development, data download mechanisms, and what technical infrastructure to adopt are important, it is more important that space be created for meaningful collaborations to form within and outside of the cultural heritage organization.

Collections as data development benefits collection users and stewards

Collections as data development offers clear benefits to collection users and stewards. Users gain access to machine-actionable collections that are more readily amenable to research questions, expanded forms of pedagogy, and creative work.

The value of being able to more readily apply computational methods to collections is decidedly not isolated to disciplinary researchers. Cultural heritage staff increasingly use similar methods to address core challenges that include, but are not limited to, collection metadata and object remediation, expanding discovery, and critically engaging with collections. For example, Always Already Computational has shown that cultural heritage staff are among the most prolific users of collections as data.
As Dot Porter, Curator of Digital Research Services in the Schoenberg Institute for Manuscript Studies at the University of Pennsylvania, observed at the second collections as data national forum:

I must have known that I would use it, I just didn’t realize how much I would use it, or how having it available to me would change the way I thought about my work, and the way I worked with the collections. ... Having OPenn as a source for data gives me so much in my curatorial role. I have the flexibility to build the interfaces I want using tools I can understand, and flexibility, easy access, familiar formats. [10]

Challenges to collections as data development are more organizational than technical

Collections as data development provides a context for productive destabilization of organizational silos often predicated on the management and use of analog resources. The cultural heritage community has repeatedly lauded the capacity for collections as data work to encourage collaboration between operationally disconnected parts of a cultural heritage organization.

A successful turn towards collections as data development requires inclusive organizational experimentation - spanning archivists, technologists, subject experts, catalogers, and more. Collections as data development blurs traditional divisions between cultural heritage organization work. Implementation requires a combination of community engagement, domain knowledge, and the capacity for infrastructure development. Holistic combination of programs, tools, and services presents the primary challenge. For example, digital scholarship groups have a role to play in catalyzing use of collections as data, but the sustainability of that effort remains challenged by integration with core digital repository efforts. As efforts in this space grow, cultural heritage organizations will need to review divisions of labor and experiment with policies and workflows that foster generative, inclusive collaborations.

[10] Dot Porter. “Data for Curators: OPenn and Bibliotheca Philadelphiensis as Use Cases.” Remarks from the Collections as Data National Forum 2 event (https://collectionsasdata.github.io/forum2/) held at the University of Nevada, Las Vegas, May 7 2018. Dot Porter Digital (blog). http://www.dotporterdigital.org/data-for-curators-openn-and-bibliotheca-philadelphiensis-as-use-cases/ .

Collections as data development benefits from engaging specific community needs

Collections as data development reaches its true potential when it engages specific community needs. Collections as data designed for everyone serve no one. [11] Engagement with community needs is never complete - it requires active, ongoing, and sustained effort. What we learn from engagement directly informs programs, services, and partnerships.
Beyond the question of collections as data usability, community partnerships help ensure that collections as data efforts do not result in replication or amplification of bias that harms underrepresented communities. While collections as data development will be a new experience for some, it can be an exciting opportunity to develop close collaborative relationships that go beyond the traditional roles of service provider and service consumer.

Collections as data development benefits from collaboration across multiple communities of practice

Given that community needs are constantly changing, collections as data are varied in implementation. Efforts to meet these needs benefit from collaboration across multiple communities of practice. Always Already Computational surfaced many communities of practice that contribute or hold the potential to contribute to collections as data development. Collection stewards have deep knowledge of metadata, web archive managers and digital library managers have expertise in packaging subsets of data, historians have experience using and/or developing their own collections as data, and educators are anchored by the experience of teaching with data in a classroom setting. The efforts of a diverse group like this are brought into generative contact by shared statements of principles and tools for communicating, exchanging experience, and collaborating across communities of practice.

11 Santa Barbara Statement on Collections as Data. Always Already Computational: Collections as Data. https://collectionsasdata.github.io/statement/

Areas for Further Investigation

As Always Already Computational comes to a close, we introduce a series of topics that merit further investigation. In some cases, these topics were determined to be out of scope for the project given the scale of team engagement, composition, and capacity. In other cases, topics were introduced and then reinforced by multiple community engagements, without clear resolution. We resist referring to these topics as “new” and suggest, instead, that these topics align with perennial challenges facing cultural heritage organizations.

1. Moving from ethical consideration to action

The cultural heritage community works to become more knowledgeable about the negative potential of producing and using collections as data. Taking that knowledge and converting it into actionable strategies, processes, and workflows that can be implemented across various stages of collection acquisition, description, and access is a prime area for further investigation.

2. Conducting more community-specific user studies to inform workflow development

The viability of collections as data workflows depends on further investment in community-specific user studies. Always Already Computational encountered repeated calls for practical resources that support collections as data development decisions relative to descriptive practices, alignment with standards, data types, and optimal delivery mechanisms.
Creation of more tailored resources in this space depends on deeper understanding of user need.

3. Developing functional requirements in service to user and collection steward needs

Always Already Computational focused on documenting and/or creating tools for eliciting, describing, and communicating user and collection steward needs. The next stage of development would benefit from creation of functional requirements that reflect needs in context with specific use cases. In the aggregate, functional requirements should take into account variation in institutional resources required to implement them.

4. Publicly charting and sharing the terms of relationships with commercial entities

Local and global efforts to develop open data and infrastructure greatly benefit collections as data development and use. With that said, collections as data efforts will often call for interaction with commercial entities. This is likely the case from a collection standpoint given the spread of proprietary data held by contemporary companies like Twitter and licensed data held by vendors. Optimal practice in this space is often difficult to access, locked within non-public agreements. Efforts to improve this situation should be documented and released publicly. The work should be aligned with core values that support openness and equity. Securing relationships on these terms must be viable with or without the power of a prestigious institution, consortial heft, or inordinate access to capital.

5. Enabling widespread collections as data discovery

Approaches to making collections as data easy to find are inconsistent at best. Setting aside well-known sites (https://www.data.gov/) where large volumes of open or licensed data are systematically collected or aggregated for discovery, or institutionally-based static sites promoting their collections as data such as OPenn (http://openn.library.upenn.edu/) or LC Labs (https://labs.loc.gov/lc-for-robots/), it often feels like one needs to know the right people to find collections amenable to computational use. How can ad hoc instances of collections as data be described and indexed for better discovery in a consistent, systematic fashion? Are there approaches to encoding description - leveraging schema.org vocabularies for example - that can be developed and standardized for community adoption? Are there particular platforms or systems that enable or hinder discovery, and if so, how?

6. Addressing collections as data preservation needs and obstacles

As we work to develop collections as data, the matter of long-term stewardship of the products of these efforts -- including source data sets and derived data sets -- comes to the fore. Do current digital preservation policies and resources in institutions adequately cover the requirements for ensuring preservation of collections as data? Are there identifiable gaps or misalignments in resources, workflows and practices which hinder preservation, and how can they be overcome?

7. Exploring post-custodial approaches to collections as data

Cultural heritage organizations are not the primary repositories for collections as data. It is neither desirable nor feasible for organizations to collect, store, and preserve all data locally, even if libraries took a collaborative approach. How can post-custodial approaches to preservation and access in archival repositories inform collections as data work?

Appendices

Appendix 1: The Santa Barbara Statement on Collections as Data
May 2018

The Santa Barbara Statement on Collections as Data was written by the Institute of Museum and Library Services supported Always Already Computational: Collections as Data project team. The first version was based on the collaborative work of participants at the first Collections as Data National Forum (UC Santa Barbara, March 1-3 2017). After its release, the team gathered comments from the Hypothesis web annotation tool and sought additional feedback across a series of conversations and workshops (April 2017 - April 2018). The current version of the statement was revised based on that community feedback, especially the close, directed feedback provided by workshop participants at the Digital Library Federation Forum 2017.

What are “collections as data”? Who are they for? Why are they needed? What values guide their development?
The Santa Barbara Statement on Collections as Data poses these questions and suggests a set of principles for thinking through them, as part of a community effort to empower cultural heritage institutions to think of collections as data and consequently to explore what might be possible if cultural heritage seen in this light was more readily open to computation.

The concept of collections as data emerges at – and is grounded by – a particular moment in the recent history of cultural heritage institutions. For decades, cultural heritage institutions have been building digital collections. Simultaneously, researchers have drawn upon computational means to ask questions and look for patterns. This work goes under a wide variety of names including but not limited to text mining, data visualization, mapping, image analysis, audio analysis, and network analysis. With notable exceptions like the HathiTrust Research Center, the National Library of the Netherlands Data Services & APIs, the Library of Congress’ Chronicling America, and the British Library, cultural heritage institutions have rarely built digital collections or designed access with the aim to support computational use. Thinking about collections as data signals an intention to change that, and efforts like the Library of Congress’ Collections as Data: Stewardship and Use Models to Enhance Access and the multinational Digging into Data suggest that a broader community shift intentionally scoped to institutions large and small comes at an opportune time.
While the specifics of how to develop and provide access to collections as data will vary, any digital material can potentially be made available as data that are amenable to computational use. Use and reuse is encouraged by openly licensed data in non-proprietary formats made accessible via a range of access mechanisms that are designed to meet specific community needs.

Ethical concerns are integral to collections as data. Collections as data should make a commitment to openness. At the same time, care must be taken to comply with legal requirements, cultural norms, and the values of vulnerable groups. The scale of some collections may also obfuscate what is hidden or missing in the histories they are perceived to represent. Cultural heritage institutions must be mindful of these absences and plan to work against their repetition. Documentation should be informed by archival principles and emergent reproducibility practice to ensure that users have the information they need to work with collections responsibly.

Principles

1. Collections as data development aims to encourage computational use of digitized and born digital collections. By conceiving of, packaging, and making collections available as data, cultural heritage institutions work to expand the set of possible opportunities for engaging with collections.

2. Collections as data stewards are guided by ongoing ethical commitments. These commitments work against historic and contemporary inequities represented in collection scope, description, access, and use. Commitments should be formally documented and made publicly available. Commitment details will vary across communities served by collections but will share common cause in seeking to address the needs of the vulnerable. Collection stewards aim to respect the rights and needs of the communities who create content that constitute collections, those who are represented in collections, as well as the communities that use them.

3. Collections as data stewards aim to lower barriers to use. A range of accessible instructional materials and documentation should be developed to support collections as data use. These materials should be scoped to varying levels of technical expertise. Materials should also be scoped to a range of disciplinary, professional, creative, artistic, and educational contexts. Furthermore, the community should be motivated and encouraged to build and share tools and infrastructure to facilitate use of collections as data.

4. Collections as data designed for everyone serve no one. Specific needs inform collections as data development. These needs may be commonly held by particular user communities. Rather than assuming these needs or imagining these communities, stewards should be intentional about who their collections are designed for, work to lower the barriers to use for the people in those communities, and continue to assess these needs over time. Where resources permit, multiple approaches to data development and access are encouraged.

5. Shared documentation helps others find a path to doing the work. For example, collections as data work can entail decisions about selection, description, conversion, cleaning, formatting, and delivery mechanisms or platforms that enable discovery and provide access. In order for a range of individuals and institutions to engage collections as data work, it must be possible to locate documentation that demonstrates how and why the work is done. Documentation must also attest to the history of how the collection has been treated over time. While no documentation can be fully comprehensive, incomplete or in-progress documentation is better than no documentation. Examples of documentation include human and machine readable metadata schemas, data sheets, workflows, application profiles, deeds of gift, and codebooks. Documentation should be publicly accessible by default.

6. Collections as Data should be made openly accessible by default, except in cases where ethical or legal obligations preclude it. Terms of use for collections as data must be made explicit and should align with community-based practices such as RightsStatements.org and standard licenses such as Creative Commons, Open Data Commons, and Traditional Knowledge licenses.

7. Collections as data development values interoperability. Interoperability entails alignment with emerging and/or established community standards and infrastructure and eases integration with centralized as well as distributed infrastructure. This approach facilitates collections as data discovery, access, use and preservation.

8. Collections as data stewards work transparently in order to develop trustworthy, long-lived collections. Trustworthiness depends upon efforts to ensure and publicly document the technical integrity of the data as well as its provenance. It also requires that data stewards acknowledge absences and areas of uncertainty within the collection as data. Trustworthy collections as data should include open, robust metadata, and should be under the care of stewards and institutions committed to their preservation.

9. Data as well as the data that describe those data are considered in scope. For example, images and the metadata, finding aids, and/or catalogs that describe them are equally in scope. Data resulting from the analysis of those data are also in scope.

10. The development of collections as data is an ongoing process and does not necessarily conclude with a final version. Work in progress status can be seen as a virtue when iteration is geared toward developing productive collaborations and integrations between new and existing technologies, workflows, and service models. The ongoing development of collections as data can impact staffing models, workflows, and technical infrastructure.
Appendix 2: Collections as Data Facets
August 2017 - August 2018

Collections as Data Facets (https://collectionsasdata.github.io/facets/) document collections as data implementations. An implementation consists of the people, services, practices, technologies, and infrastructure that aim to encourage computational use of cultural heritage collections.

Facet 1: MIT Libraries Text and Data Mining
Richard Rodgers, Massachusetts Institute of Technology

1. Why do it

MIT Libraries collect, curate, and provide access to numerous digital collections that comprise important research outputs and contributions to the scholarly record. Access is typically provided via traditional web applications designed for individual users in browsers. In assessing the patterns of use of these collections, it became apparent that a significant amount of traffic was due to various automated processes that ‘scraped’ the sites, but did not identify themselves as indexing services. At the same time, we began to receive more and more direct requests from individual scholars on campus (and beyond) for bulk delivery of textual corpora in our collections, in order to perform text-mining on them. It was clear that these ‘alternative’ uses of collections were not well served by existing access methods and systems.

2. Making the Case

We saw that we needed to explore how better to provide access for these kinds of use, and this need dovetailed with a broader agenda that the Libraries were pursuing of reconceiving library services as a ‘platform’: a notion articulated in recommendation 6 of the Future of the Libraries Task Force Report, which specifically mentions text and data mining as important ‘non-consumptive’ uses of library-stewarded material. The platform model emphasizes empowering users to create their own discovery/access/consumption tools by providing open, standards-based, and performant APIs or other services that such tooling can leverage. So the case was made by arguing that an experiment to expose collection data via a new API designed for bulk access would teach us how to build a library platform that would increase the value of all collections.

3. How you did it

Based on the analytics, we selected MIT’s Electronic Theses and Dissertations as the initial collection to work with: it was highly sought after, fairly extensive (close to 50K theses, with plans to digitize the entire historical run), and already under management in our institutional repository (DSpace@MIT). We wrote a formal proposal for a project to design and build a prototype of a new discovery and access service for this collection to enable text and data mining (or other non-consumptive uses).
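As an illustration only, the difference between browser-oriented access and an API "designed for bulk access" can be sketched as a client that walks a paged result set rather than scraping individual pages. The sketch below is not the MIT prototype's interface: the page shape (`records`, `next`), the `harvest` function, and the `fake_fetch` stand-in are all hypothetical names introduced here, and the in-memory fetcher replaces what would be an HTTP request in practice.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical page shape (an assumption, not a documented API):
# {"records": [ {...}, ... ], "next": "<cursor>" or None}
Page = Dict[str, Any]


def harvest(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every record from a cursor-paged bulk endpoint.

    `fetch_page` receives the current cursor (None for the first page)
    and returns one page of results.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        for record in page["records"]:
            yield record
        cursor = page.get("next")
        if cursor is None:  # no further pages to request
            break


def fake_fetch(cursor: Optional[str]) -> Page:
    """Stand-in for a network call; serves two pages of made-up records."""
    pages = {
        None: {"records": [{"id": "thesis-1"}, {"id": "thesis-2"}], "next": "p2"},
        "p2": {"records": [{"id": "thesis-3"}], "next": None},
    }
    return pages[cursor]


ids = [record["id"] for record in harvest(fake_fetch)]
print(ids)
```

Passing the fetch function in as a parameter keeps the paging logic independent of any particular transport, so the same loop could be pointed at a real service once its actual endpoint and response format are known.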
The project team consisted of:

● a project manager, who oversaw the scrum-agile process used to manage the development
● three software developers, who took responsibility for content accession, repository management, and API design and development, respectively
● an analyst, who surveyed the field of existing text and data mining services, and who worked with potential users of the system to understand their needs
● a UI/UX expert, who helped in designing intuitive and effective user interfaces (which complemented and documented the API).

The development project ran for 10-11 months, and a functional prototype was built that exposed an API for discovery and bulk access of etheses. The user could request any (or all) of 3 content representations: the metadata (including an abstract), the thesis as a PDF (which is the approved submission format), and the full (unstructured) text extracted from the PDF.

The service consisted of several cooperating software components: a Fedora 4 repository, which held the metadata and textual artifacts; an Elasticsearch index, used to query the full text as well as the metadata; an API server, which formed the front end, exposing the ways users could interact with the index and repository; and various queues and caches to connect these components. Each component was deployed in a container to a Kubernetes-orchestrated environment in a cloud service (Google Container Engine).

The project encountered several challenges, to name a few:

● The quality of the PDFs in the collection varied considerably, with numerous encoding and other errors that affected or impaired use.
Some etheses were created in digitization workflows from analogue originals, whereas others were ‘born digital’, and both content streams were created over a long span of time using different software, workflow practices, etc. We vacillated between attempting to ‘repair’ the theses and enhancing the metadata with quality indications so that machine use could adjust for it: the final prototype included aspects of both approaches.
● The cloud environment required considerable knowledge of deployment and orchestration tools and platforms that the team lacked. While we were able largely to surmount these deficiencies, we did so at some cost to the overall project deliverables. Our initial resource model for the project included a ‘devops’ role (unfilled) that would have greatly assisted.
● It was difficult to identify and attract a broad variety of potential users to help define the product design. We gained valuable insight from those we engaged with, but suspected there were many more research objectives, techniques, requirements, etc., that would have beneficially shaped the design of the API and the whole service. This stemmed in part from the fact that we were asking for input without a working system to react to.

4. Share the docs

Project documents are forthcoming, but the code that was used to run the prototype is available on GitHub.

5. Understanding use

The team solicited potential users of the ethesis service, and conducted a small number of interviews to elicit both their intended use and the affordances such a service should provide to researchers.

We learned that the metadata we exposed (academic department, completion year, degree type, etc.) were considered useful ways to plumb and select within that particular corpus (etheses), in addition to keyword search over the full text.

The service itself was designed to gather data about how it was used, but working against this was the desire to make the data openly available to all, without ‘user tracking’. In the end, the service emerged with a lightly tiered structure: all content was freely available, but certain advanced functions required obtaining an API key (which allowed much better analytics).

6. Who supports use

While the cloud-hosted service compute infrastructure was supported by the libraries technology team, the project required considerable support throughout the libraries and archives. At MIT, the responsibility for collecting and curating theses and dissertations falls to the Institute Archives, who were a key stakeholder in the project. They did extensive research (including soliciting advice from the Institute’s legal counsel) on the IP and rights issues surrounding such a new service, since this kind of use was not originally contemplated in the policies governing theses. They also assumed general responsibility for the rare but complex decisions around takedown requests, etc.
Since this service obtains content from existing digitization workflows, the digitization team was also closely involved in providing access to scripts, software tools, etc., used to create ethesis artifacts.

If the service were launched in production, repository managers would need both to administer the service and to field questions and provide support for end users (API key management, etc.). In addition, the IT operations group would need to follow the standard set of practices for system backup, performance monitoring, etc. We learned that data-intensive services such as this (where gigabytes of package downloads were routine) had to be managed carefully from a resource perspective.

7. Things people should know

One key insight we gained was the need to perform a thorough appraisal of the collection from a data completeness, uniformity, and consistency perspective: when discovery and access are confined to siloed legacy applications, these quality dimensions may be difficult to observe.

8. What’s next

ETDs were a great candidate collection for understanding the requirements of a text and data mining service, but we have numerous text-based collections of high value, including our extensive open access articles collection, conference proceedings, technical reports, working papers, etc.
An analysis of these corpora (what are useful metadata discriminators, etc.), in light of the insights gained in the etheses prototype, could lead to a general, flexible service for offering the wealth of content the Libraries has to new forms of scholarly inquiry.

Facet 2: Carnegie Museum of Art Collection Data
David Newbury, Carnegie Museum of Art; Daniel Fowler, Open Knowledge International

1. Why do it

As stated on the Carnegie Museum of Art (CMOA) website, the Collection Data project is meant to be used for “discovery, inspiration, and innovation, allowing people to creatively re-imagine and re-engineer our collection in the digital space.” CMOA Collection Data is stored in EMu, a collections management system from Axiell. This Collections as Data Facet documents the release of this data: the data was exported to both CSV and JSON as a “data dump” and released on GitHub for public consumption to help enable this creative reuse.

CMOA acknowledges that this project is continuously evolving and that the data will be periodically revised to reflect changes in how its curators understand the objects stored in the database. This acknowledgment is reflected in the choice of a platform (GitHub) which natively supports storing version-controlled data.
CMOA made the choice to publish using CSV, JSON, and GitHub because of their relative ease of use for researchers and developers. These platforms enable easy access to large amounts of data without the need for tools beyond what the researchers already possess, and without requiring potential users to learn an API or write SQL against proprietary databases.

In addition to publishing the data itself, it was also important to provide a human- and machine-readable description of the data, its structure, and guidance on how to actually use it. CSV, while easy to work with for many users, is a notoriously underspecified format: developers often have differing opinions on what constitutes a “valid” CSV file. The Data Package specification developed by Open Knowledge International is a “containerization” format for data which is meant to provide a consistent interface (or “wrapper”) to a diverse range of datasets, especially those containing tabular data (e.g. data stored in CSV files). A single file, datapackage.json, stored with the dataset, documents where each data file is saved (either on disk or on a remote server) as well as its “schema” (number of columns and expected values per column). Releasing this dataset as a Data Package was a good start for providing a minimum machine-readable description of a dataset for processing.
A growing set of software libraries and tools can read the Data Package specification so that artists, data analysts, and other users interested in CMOA’s collection can benefit from this consistent interface regardless of the software they use.

A human-readable version of some of this same information is provided through a supplied “README” file.

Collection Data on GitHub: https://github.com/cmoa/collection
Data Package specification: http://specs.frictionlessdata.io/
EMu: https://emu.axiell.com/

2. Making the Case

The case to provide the public increased access to museum data was not a difficult one at the Carnegie Museum of Art: the museum considers engagement and education to be a core part of its mission, and firmly believes in Open Access as essential to museum practice. Also, we were helped immensely by the fact that several large institutions, in particular MoMA, had already done so (https://github.com/MuseumofModernArt/collection). Rather than having to explain exactly what we were doing in detail, we could tell our administration and board that “we were doing it the way MoMA did it”. Being able to model our work on the previous work and decisions of others helped reassure non-technical stakeholders that we weren’t doing anything risky or controversial.
The most significant barrier was determining how to coordinate the various expectations across departments: publishing this data required coordination across registrarial, publishing, digital, and curatorial teams. Additionally, it was clear that it would be important to provide all stakeholders with the ability to maintain control over their data. We provided at least six months of notice to allow the various departments time to correct any information that they felt was essential, and we also allowed anyone to hold back data that they didn’t feel was ready. All we asked for was a single-sentence written description of why the information should not be published. This allowed stakeholders to maintain agency, while avoiding the temptation to withhold large amounts of the information by default.

Finally, we had many internal discussions about how regular updates would be possible, and we worked with all the departments to craft language in the GitHub documentation that presents the release as living data. This helped set the expectation, both inside and outside the museum, that this is not a publication that had been vetted by a curator for accuracy and completeness.

3. How you did it

The Carnegie Museum of Art collections data publication was an offshoot of the Art Tracks project at CMOA, a data visualization for provenance.
Because of the sensitive nature of provenance, one of the most important goals of the project was to ensure that the professionals with the best understanding of the nuances of the data had control over which works were available for publication. To do so, we worked with Travis Snyder, the Collections Database Administrator, to craft a series of reports, using filter criteria he devised and fields he approved, that created a collection of XML reports, one per table, from the collections management system. These reports run nightly as needed, and the resulting XML is uploaded to an internal FTP site.

A second set of custom scripts, written by David Newbury, the Lead Developer of the Art Tracks project, downloads and transforms the XML, replacing internal field names with friendlier labels and joining data across tables. These scripts also add information that is not explicitly held in our collections management system, such as the URLs for the object website and images of the work. These scripts, written in Ruby, are run whenever the institution wants to update the publication data.
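To make the transformation step concrete, the following is a minimal sketch in Python rather than the Ruby of the actual scripts, which are not reproduced here. Every name in it is an assumption made for illustration: the XML layout, the internal field names (obj_ttl, cre_ref, nam_full), the label mapping, and the base URL are hypothetical and do not describe CMOA's real reports.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-table XML reports; the real report layout is not documented here.
OBJECTS_XML = """
<table name="objects">
  <row><irn>1001</irn><obj_ttl>Untitled</obj_ttl><cre_ref>7</cre_ref></row>
  <row><irn>1002</irn><obj_ttl>Harbor Scene</obj_ttl><cre_ref>8</cre_ref></row>
</table>
"""
CREATORS_XML = """
<table name="creators">
  <row><irn>7</irn><nam_full>Example Artist</nam_full></row>
  <row><irn>8</irn><nam_full>Another Artist</nam_full></row>
</table>
"""

# Hypothetical mapping from internal field names to friendlier labels.
FIELD_LABELS = {
    "irn": "id",
    "obj_ttl": "title",
    "cre_ref": "creator_id",
    "nam_full": "full_name",
}

# Placeholder base URL used to derive a web address for each object.
BASE_URL = "https://example.org/objects/"


def parse_table(xml_text):
    """Parse one per-table report into a list of dicts keyed by friendly labels."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for row in root.findall("row"):
        rows.append(
            {FIELD_LABELS.get(child.tag, child.tag): (child.text or "").strip()
             for child in row}
        )
    return rows


def build_records(objects_xml, creators_xml):
    """Join objects to creators and add a derived URL not held in the source data."""
    creators = {c["id"]: c["full_name"] for c in parse_table(creators_xml)}
    records = []
    for obj in parse_table(objects_xml):
        records.append({
            "id": obj["id"],
            "title": obj["title"],
            "creator": creators.get(obj["creator_id"], ""),
            "web_url": BASE_URL + obj["id"],
        })
    return records


records = build_records(OBJECTS_XML, CREATORS_XML)
```

The join is done through an in-memory lookup table, which is a simplification that suits a small export; a larger or more relational export might call for a different approach.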
Our intention was to automate this process, but at this point, the benefit of regular, automatic updates is not yet worth the overhead of maintaining a complex automation system, for example, the time and effort required to provision servers and handle error reporting robustly. Instead, the scripts have been wrapped into a single command-line command using Rake, a Ruby library designed to automate repetitive tasks for programmers. The single command will download the XML, reprocess the files, generate both the JSON and CSV representations, and then upload the generated files to GitHub. Currently, a human runs the export and will notice (and hopefully correct) any problem before erroneous data is published. One interesting fact is that this script also has to update the documentation on GitHub. For example, we provide in the documentation the number of items in the collection.

We’ve included several data formats within the export. First, we include a CSV export. In discussions with members of the Pittsburgh digital humanities community, CSVs were seen as the most readily accessible format for researchers interested in quantitative analysis of our collections information. It doesn’t require any programming ability to read it, just a copy of Excel, which also means that it’s the version we show internal, non-technical people. It is, however, somewhat limited: for instance, artworks can have one or more creators, and tabular formats like CSV are not designed to handle hierarchical relationships.
We encode this data using an internal microformat (pipe-separated values), but we’ve learned from watching users that this is confusing and suboptimal. We’re still working to determine if there’s a better way to handle this sort of data.

The Data Package descriptor file, datapackage.json, which provides metadata for the CSV files in the dataset, is written separately as an encapsulation of the expected output of this CSV export pipeline. Information about contributors to the dataset, its licensing, and the expected values per column per file is stored here.

We also provide a single large JSON export of the data. This is designed primarily for developers, who can load it into memory and process it directly. It’s a large file (41 MB), but not so large that it can’t be held in memory on a modern computer. When we’ve held hackathons or worked with web technologists, this is the form of the data that they’ve been most comfortable with.

We also provide a directory containing a single JSON file for each object in the collection. This was created to approximate an API: there’s a single URL that will return information about each object, as well as an index file containing a list of ids, titles, and a URL to an image. However, our experience has been that this format is too complicated for both developers (who prefer the single JSON file) and non-developers (who prefer the CSV), and is not used.

An additional complication for our data is that we have broken out the 80,000+ photographs of the Teenie Harris collection into their own file.
This collection is part of the CMOA collection, but is significantly larger than the rest of the collection combined. We found in exploring other collection data releases, such as that of the Tate in London and its collection of J.M.W. Turner’s sketchbooks, that large-scale special collections tended to drown out the rest of the collection in data analysis, and might be best considered separately. We discussed our options with the museum stakeholders, and the decision was made that publishing them as separate files, using the same format and structures, with both documented the same way on GitHub, was an acceptable pattern.

4. Share the docs

One of the most important decisions that we made was to treat the documentation for the release as equal in importance to the data. Tracey Berg-Fulton, the collections database associate and Art Tracks team member, spent a long time crafting the documentation to be thorough and friendly. Friendly was important, because we knew that many of the people who would be looking at this data would be students or members of the public, and we wanted them to feel welcome to use the data. Big legal disclaimers and restrictions, or dense technological jargon, might have prevented them from feeling like they were welcome.
We also included within our documentation a table that indicates not just what the field is, but what it means, what type of data you can expect, and a real-world example of the sort of data that field contains. We wanted to make sure that people were able to find out if our data would meet their needs without having to download it and review it.

Once we had completed our documentation, we sent it through several rounds of internal review: not just editorial review, to confirm that we’d spelled everything correctly, and legal review, to make sure that we’d used the correct licenses and disclaimers, but also content review, to make sure that our examples were factual and that our descriptions captured the nuances of the content experts. This helped, but even more, it fostered the sense that the release belonged to the museum, not just to the Art Tracks project or the technology department.

Beyond internal review, we’ve tried to consult with developers and researchers to verify that the information that we’ve provided is what is actually needed to understand our release. We also explicitly reached out to others in our field with a history of being critical of museum documentation and data, such as Matthew Lincoln, to critique our documentation and provide feedback on utility, comprehensibility, and completeness. We’ve also monitored other data releases across the museum field, and have worked to integrate good ideas around documentation from our peers.
Finally, we model good collaboration by explicitly linking and thanking the institutions that helped us through example and direct advice on this project.

In addition, we’ve been working with Open Knowledge International to explore the use of Data Packages to provide an additional level of documentation for the collections data release. This provides a machine-readable description of the contents of the CSV file, which allows software tools and agents to both understand and validate the structural content of the data. We use it as a validation tool to ensure that all of the data published is structurally correct: for instance, that every URL is a valid URL, or that our ID numbers are in the correct format, or that every work has an accession date. Our hope is that in the future additional software tools will leverage this format, but the most direct benefit to the institution has been as an exhaustive check against our data to verify that the rules that we believe are enforced actually are, and we have been regularly surprised by the exceptions that we’ve found.

Collection Data on GitHub: https://github.com/cmoa/collection

5. Understanding use

Compared to an API, providing access to Carnegie Museum of Art Collection Data through a data dump is a lower-cost option to support, in terms of both time and money. There is no server we need to run: CMOA is, for the moment, hosting the public data on GitHub’s infrastructure.
Providing a data dump also benefits users, both academic researchers and software developers, who might not be interested in writing code to hit an API endpoint 75,000 times to get 75,000 objects. A single file containing all the required data seems to be much easier for certain use cases.

6. Who supports use

Mid-size museums are not well equipped to offer support for digital resources. Unlike, for instance, a library or archive, the information management and technology staff are internally focused, not public-facing. Curators, educators, and docents, who are often the public face of the museum, are often unaware that our digital resources exist.

Because of this, we have worked closely with local universities, in particular the University of Pittsburgh’s Information Science program, the Carnegie Mellon Digital Humanities program, and the Frank-Ratchye STUDIO for Creative Inquiry. We’ve worked with faculty and staff there, providing access to curatorial and digital team members one-on-one to help them enable use of these collections in their programs for teaching, research, and artistic reuse amongst their students.

Finally, our hope is that through the standardization work that we’ve been undertaking with Open Knowledge International, enabling and supporting reuse can be shared across the industry, so that we facilitate working with museum data in general, not just Carnegie Museum of Art data.

7. Things people should know

One of the most important decisions we made was to release our data under a Creative Commons Zero (CC0) license. We were strongly influenced in this decision by Cooper Hewitt and the Museum of Modern Art, as well as by conversations with the digital humanities community. Attribution is extremely important to us, and we’re extremely proud of our data. But the case was made convincingly that requiring attribution would be a burden to the most innovative and essential use we wanted to enable: projects that synthesize our data with others to generate new knowledge. If we put any restriction on the reuse of the data, many potential users would feel obligated to involve legal counsel to review their use, and that burden would be sufficient to prevent their use of our data. Instead of requiring attribution via a CC-BY license, we made it easy for people to give us credit: we told them how we’d like to be credited, and asked them kindly to do it. In our experience, almost every project that has used our data has credited us in some way or another.

A surprising takeaway for us has been that one of the primary users of our public data has been the museum itself.
Easy access to our own data has enabled internal projects to be built on top of the published data, both because it’s in an easy-to-use form and because of the permissive license. All of the data available is already approved for public use, so the approval process for remixing it and reusing it is significantly easier: “It’s already public” is a wonderful way to eliminate debate as to the appropriateness of using that information in public presentations.

Another important point that we missed in our initial communications is that we didn’t adequately explain how we were using GitHub. GitHub is an essential tool in the Open Source community, and that community has a set of norms around how to provide feedback and suggestions on work that is released via the tool. Typically, if you found a mistake or wanted to improve a project that was available on GitHub, you would do so through a provided mechanism called a “pull request”, where you would create a copy of the work, make the change, and ask the owner to approve merging your new version with the official version. Because collections data is not a standard use of GitHub, people were unclear whether or not we would accept corrections to our collections information through this mechanism.
Matthew Lincoln, who originally brought this to our attention, suggested that it wasn’t important what the answer was, as long as it was clear, and so we explicitly indicated that we would not take suggestions this way, and offered an email address that would accept such changes. This has been entirely satisfactory to all of our users, as well as to our internal staff, who were happy to accept suggestions, but were very pleased to learn that they didn’t have to learn how to use GitHub to do so.

Open Knowledge International is keen to work on pilots with others considering releasing high quality tabular datasets in the open: http://frictionlessdata.io

8. What’s next

Carnegie Museum of Art is hoping to release further iterations of its collections data over time. There are also now more tools that consume and generate Data Packages. It would be an interesting exercise to more deeply integrate features enabled by the Data Package descriptor. For example, CMOA can now add steps in the workflow that validate the dataset using tools like Good Tables (http://goodtables.io/) to ensure that the data and the expectations declared in the datapackage.json match before publishing. Additionally, given the additional information stored in a Data Package, semi-automated export to other backend formats or databases can be offered relatively easily depending on interest.
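As a rough illustration of such a pre-publication check, the sketch below compares a CSV file against the schema declared in a descriptor shaped like a datapackage.json. It is a hand-rolled stand-in written with the Python standard library, not Good Tables itself and not CMOA's workflow; the descriptor contents, column names, and sample rows are hypothetical, and only a small subset of schema features (column names, three field types, and the required constraint) is covered.

```python
import csv
import io
import json
import re

# Hypothetical, heavily trimmed descriptor in the general shape of a datapackage.json.
DESCRIPTOR = json.loads("""
{
  "name": "example-collection",
  "resources": [{
    "name": "objects",
    "path": "objects.csv",
    "schema": {"fields": [
      {"name": "id", "type": "integer", "constraints": {"required": true}},
      {"name": "title", "type": "string"},
      {"name": "accession_date", "type": "date", "constraints": {"required": true}}
    ]}
  }]
}
""")

# Hypothetical data; the second data row is deliberately malformed.
CSV_TEXT = "id,title,accession_date\n1,Untitled,1950-03-01\nx2,Harbor Scene,\n"

# Shape checks only: the date pattern does not verify that the date exists.
CHECKS = {
    "integer": lambda v: re.fullmatch(r"-?\d+", v) is not None,
    "date": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
    "string": lambda v: True,
}


def validate(csv_text, schema):
    """Return a list of (line number, column, message) tuples; empty means no errors."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [field["name"] for field in schema["fields"]]
    if reader.fieldnames != expected:
        errors.append((1, "header", f"expected columns {expected}"))
        return errors
    for line_no, row in enumerate(reader, start=2):
        for field in schema["fields"]:
            value = row[field["name"]] or ""
            required = field.get("constraints", {}).get("required", False)
            if value == "":
                if required:
                    errors.append((line_no, field["name"], "missing required value"))
                continue
            check = CHECKS.get(field["type"], lambda v: True)
            if not check(value):
                errors.append((line_no, field["name"], f"not a valid {field['type']}"))
    return errors


errors = validate(CSV_TEXT, DESCRIPTOR["resources"][0]["schema"])
for error in errors:
    print(error)
```

A publishing step could refuse to upload when the returned list is non-empty, which is the kind of gate the paragraph above describes.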
CMOA and Open Knowledge International also hope to do work that supports the automatic generation of dataset documentation to ensure that documentation provided on GitHub through the README file matches that contained within the datapackage.json.

Facet 3: CalCOFI Hydrobiological Survey of Monterey Bay
Amanda Whitmire, Stanford University Libraries

1. Why do it

Researchers are beginning to understand the magnitude and complexity of the effects of climate change on our Earth system, and all research in this area is grounded in what we know about the past. Data collection at sea is labor-intensive and relatively rare, and technology has lowered that barrier only within the last couple of decades. Through this lens, we understand why in the marine sciences, the most valuable data collections are observational time-series studies, and the older the better.

When I realized the scope of the analog oceanographic data collections being housed at the Miller Library (a marine biology branch library in the Stanford Libraries system), there was no question that these materials needed to be digitized and shared openly. There are very few oceanographic time-series studies from the 1950s - 1970s, and these particular data only exist at our location. These data are an important contribution to studies in the marine sciences, climate change and coastal ecology.
Our library is located in a tsunami zone, and since we have the only copy of these data, they are at significant risk of being lost.

2. Making the Case

Stanford Libraries has a Digital Production Group (DPG) whose primary focus is digitization of library collections for the purposes of preservation and access. Given the scientific relevance of the oceanographic data and its risk of being lost, it was not difficult to convince my boss (the Associate University Librarian for Science & Engineering) to support digitization of the material.

Our process for internally funding digitization projects is kept intentionally simple. Any librarian in our Science and Engineering Research Group is welcome to write a “Collection Project Proposal” (CPP; limited to a single side of one page) that describes the materials to be digitized, why they are important, what the goals for digitization would be, and an estimate of the costs. Our AUL reviews these on an annual basis and grants as many requests as are justified and he has the budget for. If a project idea comes up mid-year, we can also submit a CPP as needed. I proposed a pilot project to digitize a subset of the collection, and it was funded at $5,000.

3. How you did it

My goals for this collection include moving a step beyond digitization of materials to create actionable datasets, but I am not prepared to address that because I am still investigating how best to accomplish such a task (automated text recognition processes, crowdsourcing, transcription services, etc.).
This section will be a LOT more interesting once I get there, and the project will make more sense as a CAD Facet at that time.

For now, I’ll focus on the process of material curation and how the digitization workflow works. Some of the process is being captured in an Open Science Framework project page. In concise terms, this was the curation plan that I made before I started (adapted from a great poster and using common sense), and it has largely been accurate:

1. INVENTORY - What do we have? How much do we have? What kinds do we have?
2. ORGANIZE - By cruise, station, variable, year? Standardize dates, stations, variables, cruise names…
3. APPRAISE - Are there duplicates? Is anything missing? Prioritize: what is most valuable or in the worst shape?
4. METADATA - Create descriptive & administrative metadata to guide digitization process: titles for collections in the digital repository, file names, etc.
5. DIGITIZATION - Stanford Libraries Digital Production Group has a well-equipped lab and staff for systematic digitization & deposit into the Stanford Digital Repository (SDR)
6. METADATA - Data need readme files and item- & data-level metadata to facilitate understanding & reuse; metadata from the DPG needs quality assurance and remediation.
7. MAKE ACTIONABLE - Conversion from PDF to actionable tabular data is critical for enabling reuse of the data. How do we make it happen at scale?

Steps 1-6 have been completed for the first batch of materials (data from every third year over the 23-year time-series).
Steps 1-3 are time-intensive and the effort logically scales with the size of the collection. The DPG requires relatively little metadata to get the digitization process going, so Step 4 was brief. I am fortunate that we are so well supported by the experts in the DPG. They require submission of a digitization proposal via a standardized form that they provide, which ended up being about 4 pages long. Based on the proposal, they provided an estimate of the digitization timeline and costs, and then moved forward.

4. Share the docs

As mentioned in the previous section, some content can be found at: Whitmire, Amanda L. 2016. “Hopkins Marine Station CalCOFI Hydrobiological Survey of Monterey Bay, CA: 1951 - 1974.” Open Science Framework. November 30. https://osf.io/c3egt/

The digitized items are not yet in the library catalog (also the discovery layer for the repository), but you can see a few examples of digitized material via direct links:

● A quarterly report: https://purl.stanford.edu/qt035cq4651
● An annual report: https://purl.stanford.edu/dz088js0926
● Field data: https://purl.stanford.edu/xj314cj5427
● Phytoplankton data: https://purl.stanford.edu/qw382yy6150
● Zooplankton data: https://purl.stanford.edu/hy617cx4382

5. Understanding use

The primary audience for these data is researchers, but I believe that they will not use the data for research purposes unless it is in a format that they can use, meaning text files with tabulated data. That is the main driver behind my desire to move a step beyond digitization (while recognizing that digitization is a critical action for these at-risk materials). I believe this because I used to be an oceanographer and I understand both their need for data like this and also the constraints on their time and workflows. PDFs of legacy data are nearly worthless to a marine scientist who seeks to answer research questions.

6. Who supports use

After the data have been fully documented and converted to spreadsheets, the goal is that they can be used largely unsupported (setting aside the tremendous amount of work that goes into maintaining the digital repository). As a subject specialist and the curator of the collection, I am available to support data users. Interacting with 4-dimensional oceanographic data is generally handled in Matlab (the software of choice for most oceanographers) or R (an emerging choice in this domain). I expect most users of these data to be outside of Stanford.

7. Things people should know

This project feels important. Analog research data is everywhere - EVERYWHERE - and we need librarians and archivists to engage with faculty who are retiring to guide them in sorting through the maelstrom.
I am focused on facilitating reuse in the digital space because my audience for these data is my former colleagues and I know that’s where they operate. That said, identifying, curating, and archiving analog datasets to facilitate discovery and enable future reuse is critical. In my opinion, collections as data must necessarily extend to the analog world in order to keep up with the upcoming influx of materials from retiring faculty who worked in the pre-digital era. This project is an example of how we bring those data into the digital realm, but I encourage anyone interested in this type of work to reach out to faculty regarding their data. Do it today.

8. What’s next

The most challenging part of this process is next: go from image or PDF to spreadsheets. This is the part of the project that has the potential for real-world impact. Nothing that I’ve accomplished so far is unique (important though it is). We’ve seen crowdsourcing, and we’ve seen transcription. What researchers really need is a way to liberate all of the older, analog data from paper into the digital medium that they use. If I can make progress on addressing how we might be able to do that at scale, I’ll consider this effort a success.

Facet 4: American Philosophical Society Open Data Projects
Scott Ziegler, American Philosophical Society Library

1. Why do it

The American Philosophical Society Library (APS) has been digitizing historic primary sources for just about a decade. We’ve spent a lot of time smoothing out our workflow, and we feel like the process is pretty well developed. However, we’ve known for some time that the audience for these scans is limited. The vast majority of our scanned material is hand-written (correspondence, diaries, ledgers, account books, for example). Reading this handwriting can be slow, and at times is a specialization in its own right.

We wanted to make this material available in a more approachable manner. We also wanted to give researchers an opportunity to easily interact with the material in different ways, including mapping and text analysis. Lastly, we see this as an outreach opportunity. We hope to build tutorials for students at the high school and undergraduate level to learn about visualization creation and digital history.

2. Making the Case

The administrative case for creating datasets from our collection was based entirely on our mission to increase access to our collections. This was a relatively easy case to make. However, there were additional hurdles to overcome.

Primarily, there are administrative concerns that the data we put out will have mistakes. This has proven to be the case. We try to include warnings that our datasets are created with attention to detail, but that errors happen. We’re also cautious about how we label these datasets.
We tend not to say that they are transcriptions (though, due to a dearth of synonyms, we do use the verb ‘transcribe’). As an organization, we benefit greatly from large and professional transcription projects, including the Papers of Benjamin Franklin and the Papers of Thomas Jefferson. These projects are definitive representations of primary material. Our datasets are not. Our datasets are our attempt to make our material more usable, and usable for different types of projects.

In making the case for doing these datasets, we agreed to be clear about what we’re putting out, to help draw a distinction between our datasets and professional transcriptions, and to supply feedback options for people who find mistakes.

3. How you did it

We identified the requirements for dataset creation to be:

1. ability to view a scan of the page being transcribed
2. ability to simultaneously view the software that the text is being typed into
3. versioning and/or revision history
4. ability to share among multiple people

We experimented with a number of crowdsourcing tools, including Omeka/Scripto (https://github.com/omeka/plugin-Scripto), Omeka/Scribe (https://github.com/ui-libraries/Scribe), and Scribe Project (http://scribeproject.github.io/). However, we quickly realized that the team we were assembling was small enough to rely on more modest tools.

We ended up using Google Sheets as the primary tool. We used dual monitors to ensure that the person creating the dataset can easily see the scanned page as well as the spreadsheet.
For the historic prison data (https://diglib.amphilsoc.org/data), our first major step toward thinking of our collections as data, we were lucky to have two talented and devoted volunteers: Kristina Frey and Michelle Ziogas. Kristina assisted in the early stages of the project, and Michelle did the majority of the dataset work.

4. Share the docs

We don’t currently have any documentation, though we expect to create some during future projects.

5. Understanding use

We understand the use of our data primarily anecdotally. We think of our datasets as a means of identifying new institutional partners and collaborators. We monitor the use of our data via these partners. For example, we created the historic prison dataset from material in our library related to Eastern State Penitentiary. As we did this, we contacted the staff of the Eastern State Historic Site, and this has flourished into a fruitful partnership. Researchers come to our data through them, through our digital repository, and through the various third-party services we use to host our data. Several of these researchers have contacted us to offer their own data, to discuss additional projects, to show what they’re building, and to offer corrections. This has been our principal measure of success.

We do maintain some metrics. The Magazine for Early American Datasets (https://repository.upenn.edu/mead/) records the number of times datasets are downloaded. We also have a count of how many people download from our digital repository. These are helpful and appreciated.
However, the motivation continues to be the new connections we make with individuals.

6. Who supports use

[blank]

7. Things people should know

When discussing this with people at libraries similar to my own, I tend to focus on the following:

● Datasets are easy to create. All you need to get started is a spreadsheet and something to transcribe.
● Material is easy to identify. We look for material that will work well as spreadsheets. Ledgers, printed forms, tallies, and account books are all examples due to their recognizable and repeatable format.
● Datasets are useful. You can save researchers’ time by removing the challenge of reading handwritten notes; you can put material in a format that makes it easy to map; the material can be sorted, searched and filtered; you can promote the mission of your library.

However:

● Datasets need to be managed: Mistakes will slide in, and researchers will point them out; editorial decisions will need to be made, even in the most straightforward-looking material.

8. What’s next

Our flagship project to date – historic prison data – has gotten some positive attention, and we’re eager to keep moving. We’ll be hosting a digital humanities fellow to focus specifically on using the historic prison data. He’ll be exploring various types of visualizations and analysis.
We also hope to build a number of tutorials to encourage others to use the data for their own projects.

Additionally, we’re working on two other open data projects (https://diglib.amphilsoc.org/data). One involves a post office book kept by Benjamin Franklin during his tenure as Postmaster of Philadelphia. The other will involve a record of indentured individuals arriving in Philadelphia during the years of 1771-1773. Both of these projects will have academic advisory committees to help us strategize use cases and promote the data.

Facet 5: OPenn
Dot Porter, University of Pennsylvania Libraries

1. Why do it

We believe that users of manuscript data should have access to first-quality images and metadata free of technical or licensing constraints and this is what OPenn (http://openn.library.upenn.edu/) provides. First quality means the resolution at which the images were captured, and authoritative metadata in archival formats presented for easy reuse by humans and machines. Everything in OPenn is licensed as a Free Cultural Work (https://creativecommons.org/share-your-work/public-domain/freeworks/).

2. Making the Case

The administrative case for creating datasets from our collection was based entirely on our mission to increase access to our collections. This was a relatively easy case to make. However, there were additional hurdles to overcome.
Penn Libraries has a commitment to Open Data, and the study of manuscripts in a digital age is the central mandate of the Schoenberg Institute for Manuscript Studies (SIMS), which is an integral part of the library and was founded in 2013. Much of the work of SIMS involves the reuse of our own digital manuscript materials, and we knew in 2013 that we could not do our job without a resource like OPenn. So we had to make one. The director of SIMS made the case for OPenn to the Director of Libraries, who made the decision to invest in the creation of OPenn.

3. How you did it

In 2013 Penn Libraries hired Doug Emery, who had created systems similar to OPenn for other projects, and he conceived the framework. The Penn Libraries did not at that time have a repository, so it was not in a position to host OPenn in an existing system. The Director of SIMS asked the Director of Libraries if we could set it up through Penn Central Computing. We started to populate OPenn with existing medieval manuscript image data; this was a challenge because although most of our manuscripts had already been photographed and cataloged, the master TIFF files were located in scattered hard drives and servers stored in various corners of Penn Libraries. This work was very intensive, and was carried out primarily by Jessie Dummer. We chose the manuscripts because they were central to the mission of SIMS and because the data was good.
Doug Emery and Dot Porter designed a package and metadata structure for converting descriptive MARC and structural metadata into a TEI format designed for use and consumption integrating metadata with images.

Once OPenn was populated with Penn Libraries manuscript data we moved on to a second project. This project took advantage of the OPenn platform to gather into one location holdings from many different institutions, based around a common theme - 19th century travel diaries. This project has its own website, but the data served from there is all extracted from OPenn (http://diaries.pacscl.org/). OPenn now is the host for the Bibliotheca Philadelphiensis project (http://bibliophilly.pacscl.org/), a project to digitize most of the Western medieval manuscripts in Philadelphia which received a $500K grant from CLIR. SIMS’s Curator for Digital Research Services, Dot Porter, is a co-PI on this project.

OPenn was designed to use the simplest and least expensive technologies available for sharing image and metadata. As such, technologically it is nothing more than a webserver with a very large hard drive that runs Apache and exposes the directory listings of its content. The content itself is static, comprising only images, TEI/XML metadata, text manifests, and HTML files.
This data is exposed for ease of access and ease of movement via simple, well-established internet protocols: HTTP, anonymous FTP, and Rsync. One challenge that we had during implementation was convincing our service providers that what we wanted was something as simple as OPenn, without a query interface or an Application Programming Interface. Technologically, OPenn is more like an old-style software sharing website from the 1990s than it is a modern web application.

However, this approach does have sustainability issues. Penn Libraries is currently designing and building a Samvera (https://wiki.duraspace.org/display/samvera/Samvera) repository, and in the future we would like the data in OPenn to be stored in this repository, but served in ways similar to how it is done now. Storing the data in the repository will help with sustainability, and will also provide additional ways to serve the same data (e.g., using IIIF protocols). However, we do plan to keep serving the data as friction-free as possible.

4. Share the docs

We have both a ReadMe and a Technical ReadMe file on the OPenn site:

http://openn.library.upenn.edu/ReadMe.html
http://openn.library.upenn.edu/TechnicalReadMe.html

5. Understanding use

Through OPenn, we provide well-structured standard packages that allow for machine and human reuse without putting any preconditions on how it may be used. We provide the data; users can do whatever they like. We are undoubtedly OPenn’s primary user.
We have built online bookreaders (generated with scripts from the TEI/XML files) that stream image files from the OPenn server, and we have also built downloadable epub electronic books (also generated with scripts from the TEI/XML files) that have copies of the manuscript images as part of the book.

6. Who supports use

ISC (Penn Central Computing) maintains the computer and storage, Jessie Dummer and Diane Biunno carry out the day to day work of managing and adding materials to the OPenn website. Dot Porter provides curatorial advice and oversight (and is also a superuser), and Doug Emery wrote and maintains the software and manages the project.

7. Things people should know

We serve digital assets on OPenn that represent physical materials that Penn Libraries doesn’t own. OPenn is seen by us as an outlet for materials

OPenn treats digital assets as originals and seeks to build up a distinctive library of assets whether those originals are housed by Penn Libraries or not. The Open licensing in OPenn allows for easy collaboration with institutions local and international, many of which could not deliver this data in this quantity by themselves. It is a mistake to think that either the licensing or the ease of access to the materials is less important than the other - they are equally important.

8. What’s next

We are going to move OPenn to the Library’s Samvera repository to ensure preservation standards and long term sustainability and scalability. We will maintain an OPenn interface to this data, but the same data will also be able to be served through other methods including IIIF. We will also be expanding the content of OPenn from mainly medieval manuscripts to printed books and archival material.

Facet 6: Chronicling America
Robin Butterhof, Library of Congress; Deborah Thomas, Library of Congress; Nathan Yarasavage, Library of Congress

1. Why do it

American newspapers are a valuable primary source for research and study across a wide variety of disciplines – from political history to economics to epidemiology and more. The primary goal of the National Digital Newspaper Program (https://www.loc.gov/ndnp/) is to enhance and expand access to American newspapers by providing free and open access to the data selected and gathered from institutional collections around the country to create one unified national collection of historically significant newspapers. By utilizing open data formats and schemas, communication protocols, and providing bulk data downloads, we can expose the collection to a very different type of use than through an individual user-based Web interface and extend the research value of the collection.

2. Making the Case

The administrative case for creating datasets from our collection was based entirely on our mission to increase access to our collections. This was a relatively easy case to make. However, there were additional hurdles to overcome.

The case for providing extended access to data had two aspects. Extending uses of the collection beyond the individual user was an opportunity to allow for new and enhanced uses of the content. In addition, the software developed for managing and displaying the data created under the program uses internal APIs and standard Web protocols for accessing data and communication within the software. To expose these internal mechanisms to external users was a low barrier to extending the use of this important federally-funded resource.

3. How you did it

An important component of envisioning the collection as a dataset was accomplished through emphasizing consistent and verified technical standardization of the file formats and metadata created under the program. To ensure this outcome, primarily for the purposes of creating a sustainable collection, the program developed highly-detailed technical requirements for data producers and provided a JHOVE1-based Java validation tool for ensuring conformance to key requirements. While minor changes have been made over the course of 12 production years so far, the dataset is largely internally consistent. (Most changes have been loosening of precise requirements rather than outright changes to technical specifications.)
With a long-term vision for the program and specifically scoped goals (eventually involve all 50 states and territories, produce x amount of data per producer per 2-year grant, etc.), we strove to ensure that the data we received at the end of the program (some 20 years later) would be compatible with the data received early in the program. To that end, maintaining strict data standards using open, well-documented technical formats and a robust inventory management system has allowed us to achieve that goal to date.

With a reliable and consistent dataset, an access system could be built that both supported broad access to the collection and provided a robust and flexible technical environment. The current system is based on the Django web framework, written in Python; it implements various open data access points and supports others. More information on these access points (https://chroniclingamerica.loc.gov/about/api/) is available, as is the code base itself (https://github.com/LibraryOfCongress/chronam).

Collaboration is a notable characteristic of the program, not only with regard to the institutions providing data, but also with regard to the staff within the Library of Congress. Developers, digital library staff, program managers, and collection specialists alike had a stake in the development of the web site.
Various views were created not only to assist programmatic access to the open data for digital humanists and researchers but also for digital library staff, program partners, and collections managers at LC.

4. Share the docs

Technical requirements for creation of the dataset are part of the Technical Guidelines for the National Digital Newspaper Program (http://www.loc.gov/ndnp/guidelines/). The National Endowment for the Humanities funds state representatives to select and digitize historic newspapers from their collections in conformance with technical specifications established by the Library of Congress. All data created under the program is delivered to the Library for aggregation and public presentation, creating a large, consistent dataset for historic newspapers (currently 12 million pages/45 million files).

Harvest and use of the data is documented on the main web interface (https://chroniclingamerica.loc.gov/about/api/). A built-in reporting feature of the Django framework provides information and RSS feeds supporting use of the data at http://chroniclingamerica.loc.gov/reports/. The Django-based Python code itself is available on GitHub (https://github.com/LibraryofCongress/chronam). In addition, a listserv (https://listserv.loc.gov/cgi-bin/wa?A0=CHRONAM-USERS), hosted by the Library of Congress, supports data users through community input.

5. Understanding use

Learning about uses of the data is often indirect. As no API key is required to use the data, there is no register of people interested in using the data. On one hand, this is a primary driver for the adoption of the content in, for example, classroom settings. No API key means that it is very quick to get going with the content.
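As a minimal sketch of what getting going without an API key might look like, the Python snippet below builds a search URL that asks for a JSON view of page-level results. It only constructs the URL and does not contact the site. The `search/pages/results/` path and the `andtext` and `format` parameters follow the API documentation as published around the time of this report and may have changed since; the function name `page_search_url` is hypothetical and purely illustrative.

```python
from urllib.parse import urlencode

# Base address of the public site; no key or registration is involved.
BASE_URL = "https://chroniclingamerica.loc.gov"


def page_search_url(terms: str) -> str:
    """Build a page-search URL that requests a JSON view of the results.

    ASSUMPTION: the path and the `andtext`/`format` parameters reflect the
    API documentation available when the report was written.
    """
    query = urlencode({"andtext": terms, "format": "json"})
    return f"{BASE_URL}/search/pages/results/?{query}"


# The resulting string can be opened in a browser or passed to any HTTP client.
example_url = page_search_url("hoosier")
```

Because the request carries no credentials, the same URL works unchanged for a student in a classroom and for a researcher's script, which is also why use has to be inferred rather than counted.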
On the other hand, it means we must infer use through various alerts and searches, for example, when we see a published article. In addition, as the content is public domain, there are no restrictions on the use of the content. This has led to a wide variety of uses, from commercial harvesting of the site to serving as a test dataset in a digital humanities class.

Some methods of finding out about data use include Google alerts for the project name and monitoring social media posts that use common hashtags like #ChronAm or retweets. A former web developer created a Twitter bot, @paperbot (https://twitter.com/paperbot), that retweets when someone posts a tweet with a link to one of the NDNP pages. Other methods include tracking metrics for the site; a huge traffic spike on a particular day to a particular page turned out to be a popular Reddit post. Similarly, if a content harvester or researcher is running into problems getting content from the site, they will reach out to us to figure out a better method.
Researchers will also reach out for information about how to credit the site or ask questions about the parameters of the data, either through direct contact or through the chronam-users listserv.

NEH also ran a data challenge (https://www.neh.gov/news/press-release/2016-07-25) in 2016 to encourage direct use of the content. This led to some outstanding projects. One tracked how biblical quotations were used within the newspaper context; another combined the data with another dataset (Project Hal, a national lynching database) to provide more information about specific lynchings. Other researchers tracked the etymology of the word “Hoosier,” extracted the agricultural news, and created an interactive visualization for following a phrase over time/location. In the K-12 arena, an AP History class used digital humanities tools to look at different historical topics in the newspapers.

6. Who supports use

There are a number of different layers that support the use of the data. Inside the Library of Congress, the NDNP program specialists are often the first line of contact. The Library of Congress site provides an email contact option (Ask-a-Librarian), and reference specialists typically refer these questions to the NDNP program specialist. (Most users review all available documentation first and tend to use email contact as the last possible option.)
The NDNP program specialists tend to answer some technical questions (pointing users to csv files), data questions (questions about OCR, limitations of the dataset), or query tweaking (instead of looking for fish pricing, search for specific fish prices in specific markets, such as market price of salmon in Portland versus local nearby markets).

For complicated questions, there are a number of other options. Sometimes the method the researcher/user is using can impact the performance of the website. In that case, the LC technology staff figure out how the researcher/user can get at the data without impacting performance (like downloading the bulk OCR bags instead of scraping the site). In other cases, the question is best answered by other users of the data. In this case, we recommend that users contact the ChronAm-users listserv (chronam-users@listserv.loc.gov). For example, another user might have already figured out a way to visualize given issues in a specific state by year. As more and more users work with the data, we encourage researchers to look at prior research, and point researchers to known current research efforts underway.

Publicizing and encouraging the use of the data is also mixed in with encouraging the use of the collection in general. The NEH supports the use of the data, such as the data challenge described above. Similarly, our education outreach team as well as National History Day serve as boosters for the use of the collection in general and the use of the data.
As the project is a distributed model, our state project partners (universities, state libraries, and state historical societies) encourage the use of content in the classroom, provide greater awareness of the content and what can be done with it via talks at conferences, etc.

7. Things people should know

Beyond the features that support individual Web browsing, Chronicling America also supports access to all data through common Web protocols and formats, providing machine-level views of all data for harvesting and large-scale bulk download. As examples, researchers can harvest batched digitized page images as JPEG2000, PDF and/or METS-ALTO OCR, or bulk OCR-only batches. Each newspaper page includes embedded Linked Data using a number of ontologies and supports JSON and RDF views. US Newspaper Directory bibliographic records are also available as MARCXML. The open API includes industry-standard endpoints like OpenSearch and supports stable, intelligible URLs.

To accommodate data harvesting activities, the Chronicling America Web site infrastructure and workflow include several features specifically designed to support such work:

1. During data ingest, additional text-only data sets are created and stored separately, ready for bulk download.
2. To create transparency and ease of access to the bulk downloadable data, feeds for the downloadable files, in both ATOM and JSON format, were added. Researchers can subscribe to the feed to ensure they get any new data that is added.
3. For the interactive API (JSON & RDF), caching was added to provide fast responses for pages that need to be created “on the fly” by the server (as opposed to the bulk processed data that exists in flat files).

We intentionally provide access and support to users with a wide variety of needs and skills. For example, a student can download a csv file of all of the digitized newspapers available on the site; the csv file includes information about the title, first issue digitized, final issue digitized, state, etc. A researcher might be interested in large-scale text analysis; for that user, all of the OCR files have been bagged and are available for bulk download.

8. What’s next

Planned infrastructure and interface design upgrades, as well as endeavors to integrate and streamline digital content presentations at the Library, present challenges and opportunities related to API access to data collections. Planning is underway to integrate the Chronicling America dataset into the general digital collections of the Library. Providing API and bulk data download access to Chronicling America data has proven to be a valuable service, and as such, maintaining equivalent or improved access after integration is a priority for the Library. Many of the available digital collections at the Library of Congress lack API documentation or bulk data access.
Leveraging the work done with Chronicling America in these areas, more data collections at the Library are expected to take advantage of the same approaches in the near future.

Facet 7: La Gaceta de La Habana (https://github.com/UMiamiLibraries/collections-as-data/tree/master/LaGaceta)

Paige Morgan, University of Miami Libraries; Elliot Williams, University of Miami Libraries; Laura Capell, University of Miami Libraries

1. Why do it

The University of Miami Libraries Cuban Heritage Collection (CHC) received funding from LAMP (Latin American Materials Project) and LARRP (Latin American Research Resources Project) to digitize its holdings of La Gaceta de la Habana in 2015. La Gaceta is a significant historical resource, in that it was the paper of record during the Spanish colonial occupation of Cuba; and the CHC holds one of the largest collections of the newspaper outside of Cuba, with nearly 50 years of issues (1849-1899).

As part of our regular digitization workflow, we also create a plain-text file generated through Optical Character Recognition (OCR), in order to make digitized material discoverable through our digital collections user interface (https://merrick.library.miami.edu/). Our standard practice within this workflow has been to use uncorrected OCR. However, our digital collections interface (currently CONTENTdm) only allows discovery, rather than any sort of analysis.
Associate Dean for Digital Strategies Sarah Shreeves was aware of the increasing interest in text analysis as a result of digital humanities activity, and she suggested that creating a dataset that was easily accessible for use in text analysis tools would be a useful experimental project for a few members of the Library’s Digital Strategies team. Everyone involved was aware of the imperfections of the OCR’d files; but we were also aware of the relative scarcity of Spanish-language datasets, and aware that if we made high-accuracy OCR a condition for release, we might never reach the point where the files were ready. At this point in time, we are more interested in learning what is possible with imperfect OCR, and learning how we can make significant small improvements, than we are in striving for perfection on first release.

We think that it is worth emphasizing the creation of this dataset as a learning project on multiple levels. One of those levels was institutional: our goal was to understand how much work was involved in preparing a large dataset (approximately 50,000 files), and what specific steps would be part of the workflow, both for La Gaceta and potentially for other datasets we might want to release in the future. On another level, it was a learning project for the three of us who were chiefly responsible because of our different backgrounds.
As a Digital Humanities Librarian without an MLS/IS, Paige Morgan brought hands-on experience with text mining, and with creating and preparing corpora, but lacked experience with corpus creation in the context of library systems for large-scale file management. Conversely, Elliot Williams (Metadata Librarian) and Laura Capell (Head of Digitization) had experience with library file management, but were unfamiliar with the specific needs of researchers who might want to work with the La Gaceta materials. This project was an opportunity for all three of us to begin fitting our expertise together and teaching each other enough to be able to produce materials efficiently. We see this as valuable preparation for future similar projects where we bring in people who may have vital expertise with a particular set of materials, but who may be less familiar with the processes involved in creating machine-readable data.

2. Making the Case

There was considerable enthusiasm for this project from library administrators, CHC curators, and library faculty who were excited about providing deeper access to materials than the Digital Collections interface allowed.
La Gaceta is a significant set of texts for Cuban and colonial studies, and we are excited about being able to introduce interested CHC researchers and UM students to text-mining techniques with materials that are directly relevant to their studies.

Acting on that enthusiasm was not difficult precisely because we deliberately kept this project as low-key and low-resource-intensive as possible: three people were primarily involved, with brief consultations or assistance from three others. Generating the OCR’d plain-text files is part of our existing digitization workflow, so the new activity within this project was focused on finding the best way to share the files and document how to use them. Our estimate is that the total time spent on this new activity was around 4-6 hours. Keeping the project fairly low-stakes and experimental made it a more comfortable site for learning and collaboration for everyone involved. It was also helpful that our goal for this project was not just the end product of the La Gaceta dataset, but also a clearer understanding of the work involved, and the resources we might need in the future (i.e., an internal data repository, rather than an external GitHub site).

La Gaceta is an interesting test case for text mining release because it’s an imperfect dataset. The paper is thin enough that opposite page images tend to bleed through, and creases and sometimes blurred text complicate the OCR process.
The dataset is too large for every page to have its OCR checked individually – however, that makes it a more interesting test case. And even with imperfect OCR, distant reading still yields interesting results. We’re looking for repetitive errors that might be fixable using a bulk find-and-replace – and hoping that doing so will be another aspect of useful learning for our team.

3. How you did it

For the initial digitization process, roughly half of the La Gaceta volumes were digitized in-house by UM Libraries personnel; and the other half were outsourced with funding from LAMP and LARRP. The combined output of this digitization process was approximately 4.2 terabytes of TIFF files (one file for each page of the newspaper), which were OCR’d in-house. Both the TIFF and plain-text files are stored in our dedicated digital collections server for preservation purposes, but for this initial release, we decided to focus on providing just the plain-text files as a bulk download, available through a GitHub repository.

The majority of our work was about deciding how to structure the files, and how they should be named – and for all of us, that meant learning about the differences between file management practices within a library context and the context of a DH researcher working with the files in a text analysis tool such as Laurence Anthony’s AntConc or Geoffrey Rockwell and Stefan Sinclair’s Voyant.
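The bulk find-and-replace idea mentioned in section 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow used for the dataset: the misreadings in `CORRECTIONS` are invented examples rather than errors observed in the La Gaceta files, and the function name `apply_corrections` is hypothetical.

```python
import re

# HYPOTHETICAL examples of recurring OCR misreadings and their corrections.
# A real table would be compiled by inspecting the actual plain-text files.
CORRECTIONS = {
    "Habaua": "Habana",
    "Gaccta": "Gaceta",
}


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace every whole-word occurrence of a known misreading.

    Returns the corrected text and the number of replacements made, so that
    the effect of a correction table can be reviewed before it is trusted.
    """
    if not corrections:
        return text, 0
    # Longest keys first, so a longer misreading wins over a shorter prefix.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.subn(lambda match: corrections[match.group(1)], text)


fixed_text, replacement_count = apply_corrections("Gaccta de la Habaua")
```

Returning the replacement count alongside the text is a deliberate choice: with roughly 50,000 files, a per-file tally makes it easier to spot a correction that fires far more or less often than expected.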
To explain: when our La Gaceta holdings were prepared for digitization, they were separated into one-month chunks. Within each month, there would be separate text files for each page of the newspaper, so each month would contain about 100 files, since each issue is 3-5 pages long. We broke up the newspaper this way because although La Gaceta was a daily paper, breaking it down by day would have required substantially more time – enough to be unsustainable within our standard digitization workflow. We experimented with regular expressions to see whether it would be possible to break the months into days using the first few lines – but the results weren’t quite reliable enough to be worthwhile. One-month chunks of the newspaper worked fine for displaying La Gaceta within our Digital Collections interface. But what would it be like for researchers to navigate those materials in bulk within a text analysis tool?

The question that emerged from this thinking was about the ID for each individual .txt file, i.e. each page of the newspaper. Our standard digitization workflow also generated a 20-character filename for each .txt and .tiff file (e.g. chc99980000010001001.txt). This filename is the product of our house schema for internal file management, which has worked very well in that context: library faculty and staff who use it are familiar with how the filename breaks down into segments that identify the repository, collection, object, sequence, and format.
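To make the segmented structure concrete, here is a rough sketch of how such a filename might be split in Python. The five segment names come from the description above, but the segment widths (3, 4, 6, 4, 3) are an assumption inferred from the single example filename and are not a documented part of the house schema; `PageFileId` and `parse_page_filename` are hypothetical names.

```python
from typing import NamedTuple


class PageFileId(NamedTuple):
    repository: str
    collection: str
    obj: str
    sequence: str
    fmt: str


# ASSUMPTION: widths guessed from the example "chc99980000010001001";
# the real schema may divide the 20 characters differently.
SEGMENT_WIDTHS = (3, 4, 6, 4, 3)


def parse_page_filename(filename: str) -> PageFileId:
    """Split a 20-character page filename into its assumed segments."""
    stem = filename.rsplit(".", 1)[0]  # drop a trailing extension such as .txt
    if len(stem) != sum(SEGMENT_WIDTHS):
        raise ValueError(f"expected {sum(SEGMENT_WIDTHS)} characters, got {len(stem)}")
    parts = []
    position = 0
    for width in SEGMENT_WIDTHS:
        parts.append(stem[position:position + width])
        position += width
    return PageFileId(*parts)


example_id = parse_page_filename("chc99980000010001001.txt")
```

A helper like this is trivial for someone who already knows the schema; the difficulty for an outside researcher is that nothing in the filename itself says where one segment ends and the next begins.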
However, this filename structure is not easy to parse for external researchers, especially not in tools like AntConc and Voyant. Would we need to change the filename to something more human-readable in order to make the dataset useful? What would the stakes of that change be? As a researcher, Paige wanted more legible filenames, while Laura and Elliot were resistant to the idea of multiple filenames for the same object, and what it would mean for the Library to potentially have to develop an alternative filename schema designed for functionality within text analysis tools.

Making a decision about the filenames was probably the most controversial/high-stakes aspect of this project, since it felt like it had major implications both for users and for the library personnel involved. In the end, for our initial release of La Gaceta files, rather than create simplified and human-readable filenames for each document, we created a roster that will allow users to match any filename to its month and year. Keeping the 20-character filename is advantageous since researchers can use the same ID number to access the page image through our catalog if they want to check the original image. As we make more releases, the question of a more human-readable filename will almost certainly come up again, and perhaps we’ll work towards that alternate schema that’s designed more for external researchers, rather than for internal library file management.

4. Share the docs

This project is new enough that we’re still in the process of adding more formal documentation – as we have it, we’ll make it available through the UM Libraries Collections As Data website (https://github.com/UMiamiLibraries/collections-as-data/). Our current introduction to the dataset (including an explanation of the filenames) is in our main repository.

For now, however, we recommend exploring this dataset with Laurence Anthony’s AntConc (http://www.laurenceanthony.net/software/antconc/). We recommend AntConc for three main reasons:

1. It’s lightweight and easy to download and run on Windows, Mac, and Linux machines.
2. The main interface is adjustable in a way that will work well with the La Gaceta filenames.
3. AntConc is widely used enough that there are plenty of excellent tutorials, and even a corpus linguistics MOOC (https://www.futurelearn.com/courses/corpus-linguistics) based at Lancaster University that features it – in short, lots of support for users who might want to use this dataset as they learn more about text mining.

While this dataset could also work with Voyant (https://voyant-tools.org/), particularly Voyant Server, which doesn’t require an internet connection, the experience might be a bit rougher, just because of the sheer number of files involved, since even a single month includes around 100 pages.

5. Understanding use

Because of the early stage of this project, this is an area that we’re still figuring out: we want to learn from what our users do and what they need, and continue refining this dataset or use the info to produce better datasets with future materials.
One important aspect of this project is that the local campus community is relatively new to DH, and so getting to the point where we can better understand the use will involve at least some work on our part to model what use looks like. Since we released this at the end of the school year, we anticipate more opportunities to figure that out this fall. We understand that our success in this area will depend on how much work we put into making sure that various communities are aware of this dataset and how to use it, and plan to produce more materials that help them learn what they can do.

We’re very interested in responding to the needs that our users raise, and we welcome feedback and requests.

6. Who supports use

The fully digitized version of La Gaceta is supported by University of Miami Libraries faculty in the Cuban Heritage Collection and faculty who work with our distinctive collections. Use of the current release of the plain-text dataset is supported chiefly by Paige Morgan (Digital Humanities Librarian), in collaboration with Laura Capell and Elliot Williams, as we continue to refine the dataset according to user feedback. In addition to making the dataset available for individual researchers, we are also developing lightweight plans that instructors could adapt if they wanted to use the dataset as a smaller or larger unit within a particular course.

7. Things people should know

Our approach might be described as “ambitiously unambitious” in its scope – and that gave us room to think calmly and clearly about the new dataset that we were producing, and how it fit (or didn’t fit) with our existing digital collections and schema, and our local institutional practices, etc. Creating this dataset has helped to make some inchoate questions more explicit, and we think that seeing those questions more clearly is just as valuable as answering them – which we hope to do in future projects. We recommend this approach, especially for any institutions that are hoping to use the Collections as Data initiative as a means for helping their faculty/staff develop new skills and expertise.

8. What’s next

In the immediate future, we want to make sure that we put sufficient energy into outreach, promotion, and support for the La Gaceta dataset, which should be valuable both as a training object for our local community, and for gathering feedback for future data releases.

We will also be looking for other materials in our collections that could be good candidates to be processed and released in formats that will be useful for digital humanities researchers.
One obvious future project will be various parts of the Pan American World Airlines Collection (http://scholar.library.miami.edu/digital/exhibits/show/panamerican), which is in the process of being digitized – but we’re certain that the Pan-Am Collection is just one of many potential projects.

Facet 8: Text as Data Initiative

Zach Coble, New York University Libraries; Scott Collard, New York University Libraries; Nicholas Wolf, New York University Libraries

1. Why do it

As part of a broader text-as-data initiative, New York University (NYU) Libraries is in the process of expanding access to the ProQuest Historical Newspapers collection. This project involves negotiating with the vendor for access to the corpus as a set of text files, acquiring and storing the data, and creating infrastructure to promote discovery, access, and creative uses of the new collection. At a high level, this is the type of work that librarians do every day, but the technical components of the project have presented a fresh set of challenges.

We are seeing an increasing number of requests for machine-actionable data at NYU Libraries, whether in the form of full-text collections, bibliographic metadata, or both, from data researchers seeking corpora to perform topic modeling, network modeling, machine learning, and other natural language processing tests.
Thus far, interest in these methods at our university has come predominantly from political science and the Center for Data Science (https://cds.nyu.edu/). We are simultaneously tracking the changes among publishers with regard to API access to collections, provisions for researcher worksets of publisher data, and other affordances for machine-actionable research using previously licensed content. In anticipation of an emerging trend, several departments at the library, including Digital Scholarship Services (https://library.nyu.edu/departments/digital-scholarship-services/), Data Services (https://library.nyu.edu/departments/data-services/), and Digital Library Technology Services (https://library.nyu.edu/departments/digital-library-technology-services/), are eager to get ahead of this changing landscape, to shape how our relationships with content providers can enable this type of research, and to reconsider what library-provided content will look like in this environment.

2. Making the Case

As with all of our new initiatives, this one begins as a pilot. We are interested in exploring several significant questions: What is the best way to provide access to the data? How will researchers use it? A pilot provides a low-stakes mechanism to work through a set of faculty requests in order to answer these questions and then evaluate if and how we want to continue. In our experience, when we are upfront with patrons about the pilot status of a project, and make clear that we are not promising new services and that the whole thing might disappear in, say, six months, they respond favorably and appreciate the candor.

We have also found that pilots are most successful when they have wide-scale buy-in.
A project like this has a variety of stakeholders: internally, liaison, reference, collections management, data services, and metadata librarians; externally, faculty and central IT. Clear and consistent communication with everyone during the pilot process not only helps prevent surprises but also establishes buy-in through a collaborative work process.

3. How you did it

The project began with a faculty member asking a liaison librarian for access to government documents corpora. This prompted us to revisit our licensing terms for similar types of content, such as historical newspapers, and to look for cases where our licensing terms allow us to provide full-text content to our research community. Once we realized there was potential to meet an emerging need among scholars and to leverage existing resource agreements, we convened a working group to investigate the issues.

The project has been a joint endeavor bringing together several departments, including Digital Scholarship Services, Data Services, Digital Library Technology Services, Subject Liaisons, and Collection Development. Each brings strengths to this team project.
Digital Scholarship members speak to the needs of researchers working with content not traditionally seen as “data,” in this case full-text historical content. Digital Scholarship can also draw on past experiences in digital humanities projects that have developed key techniques in text mining that we can bring to bear on how we shape the form of the data we distribute. Data Services team members bring an awareness of how researchers are wrangling, transforming, and analyzing data in data-driven projects, assisting patrons and librarians alike in how they conceive of the data embedded in the full-text content. Subject Liaisons will have interacted with faculty members and understand the scope of their needs. Collection Development can speak to the terms of licenses, will often know the institutional history of data collections acquired from vendors (often previous shipments of CD-ROMs, hard drives, and other storage media), and can help negotiate new terms as vendors begin to take notice of data-driven access requests.

The pilot is also a helpful use case for new mass storage services coming out of Research Cloud Services, a joint initiative from NYU Libraries and central IT. Specifically, we are considering providing access to the collection through NYU’s mountable storage (another pilot! See https://wp.nyu.edu/library-drsr/2017/05/25/mountable-storage-pilot-first-impressions/), which provides remotely accessible fast-as-desktop storage that is protected and backed up. Here, we will use this new storage service as a distribution point to researchers to enable restricted access that is both convenient and controlled.

4. Share the docs

We do not have any documentation that we have permission to share at this point, although we will share it via our various channels as it becomes available.

5. Understanding use

We have researchers interested in using the historical newspaper corpus for machine learning, topic modeling, network modeling, and other natural-language processing. To better facilitate a variety of research uses, we are currently investigating ways to reduce the data cleaning and preparation steps that individual researchers are required to perform. One example of this is OCR correction, as preliminary samples indicate there is a fair amount of incorrectly transcribed text. Additionally, the library would like to create mechanisms to query the corpus and create subcollections (e.g. by a specific newspaper, timespan, or keyword) to facilitate use by researchers who are interested in working with the content but not in massaging the data. At a broader level, the library sees this pilot as a new and creative approach to library forms of ingest, collection development, and information distribution. We want this use case to help inform our vision for next-generation library services and library collections.

6. Who supports use

Use of the historical newspapers corpus is supported primarily by Data Services and Digital Scholarship Services. Liaison librarians also have a significant role in outreach and patron support.

7. Things people should know

We are still early in the process and are eager to learn from our experiences. Thus far we have found that positioning the initiative as a pilot was helpful in making the administrative pitch because it allows us to try new things and, equally important, gives us room to make mistakes. Additionally, bringing in several departments has been helpful in scoping the project as well as getting buy-in from our diverse group of stakeholders.

8. What’s next

Our next steps include plans to improve access, discovery, and outreach for the collection. After our data cleaning and processing work is complete, we want to ensure the collection is discoverable in the library catalog and other primary discovery avenues. Finally, we plan to begin outreach for the collection, which could include workshops as well as class-based instructional sessions, as we’ve found that sessions built around pre-packaged data sets work better.

Facet 9: #HackFSM

Mary Elings, University of California Berkeley, Bancroft Library; Quinn Dombrowski, University of California Berkeley, Research IT

1. Why do it

In April 2014, to celebrate the 50th anniversary of the Free Speech Movement at UC Berkeley, The Bancroft Library, the Research IT group in the Office of the CIO, and the School of Information at UC Berkeley held #HackFSM (http://digitalhumanities.berkeley.edu/fsm-archive-hackathon), a hackathon around the Free Speech Movement Digital Archive (http://bancroft.berkeley.edu/FSM/), as part of the Digital Humanities @ Berkeley initiative. The event brought together thirteen teams of UC Berkeley students to design a new interface for a subset of Bancroft’s digital holdings on the Free Speech Movement.

The Free Speech Movement was an appealing, immediately recognizable subject for the hackathon. The Free Speech Movement is felt to be quintessentially “Berkeley”, and while most students are aware of the movement, it is not necessarily well understood by them. The hackathon offered an opportunity to raise awareness of the subject, and there was an available dataset to work with in the Bancroft Library’s Free Speech Movement (FSM) digital archive.

2. Making the Case

The hackathon served as a valuable opportunity for groups in very different areas of the university, with different priorities and organizational cultures, to work together towards a shared vision. There were areas of administrative overlap, particularly between the Library and Research IT groups, and clearly defining roles and responsibilities was essential.
#HackFSM was a highly collaborative and interdisciplinary effort, made possible by the participation of the Library Systems Office, Library Administration, BIDS, the School of Information, Arts & Humanities Division, Social Sciences, and the students from various disciplines, in addition to the Bancroft Library and Research IT. The relationships formed through participating in this hackathon have continued to benefit campus through the development of new collaborative initiatives.

3. How you did it

See the white paper (below).

4. Share the docs

#HackFSM: Bootstrapping a Library Hackathon in Eight Short Weeks (http://research-it.berkeley.edu/sites/default/files/publications/HackFSM_bootstrapping_library_hackathon_0.pdf)

Abstract: This white paper describes the process of organizing #HackFSM, a digital humanities hackathon around the Free Speech Movement digital archive, jointly organized by Research IT at UC Berkeley and The Bancroft Library. The paper includes numerous appendices and templates of use for organizations that wish to hold a similar event.

5. Understanding use

There was never an explicit discussion of “use”; it was left up to the individual student teams to define the audience for their project, and what “use” looked like. Responses varied, and included a tool for conducting research, multiple browsing / exploration interfaces, and a few that were more like an exhibit.

6. Who supports use

The HackFSM team included The Bancroft Library, the Research IT group in the Office of the CIO, and the School of Information at UC Berkeley. The data preparation for the API involved the Library Systems Office and the Bancroft Library. In order to govern access to the Library’s FSM API, Research IT staff used a common-good campus service (no cost to users) called API Central, provided by UC Berkeley’s Information Services and Technology department. The API Central service provides a proxy to the Solr API, and can be configured to require credentials in order to process an HTTP Request (credentials are values of app_id and app_key headers that are set in the HTTP Request Header). University IT staff, I-School faculty, Berkeley alumni, and individuals from local tech companies served as code mentors during the hackathon. Eventbrite was used for registration of participants. Social media accounts (Twitter and Facebook) were used to promote the event. During the hacking period, students, mentors, and event organizers communicated via Piazza, a free platform that offers a course-based message board, commonly used in STEM courses at UC Berkeley.

The Library administration offered space, as did the new Berkeley Institute for Data Science and the UC Berkeley School of Information, for the opening and closing events. During the hackathon students were encouraged to make use of physical collaboration space provided by our new social sciences D-Lab and library.

7. Things people should know

Projects like this are highly collaborative and require technologists as well as content providers. The most successful outcome of the project was student engagement. Students from across disciplines came together to build something.

Maintaining the winning sites was not successful and we need better methods and practices to achieve a record of this work.

While the main work product was a website, the greater product was that developers and humanists learned to communicate and work together. It was humanists and technologists working and talking together, learning from and collaborating with each other in the process of building new scholarly output. Hopefully events like HackFSM can prepare them for future collaborations in a research environment where such interdisciplinary projects will be more common.

8. What’s next

Our hope is to prepare more digitized collections as data so they are ready to be used computationally. Current OCR could be improved and brought to a point of being “research ready” for computational use. We plan to write a grant to prepare a large recently digitized archival collection, working with local data scientists on the requisite steps we would need to take to get the data to a point of usefulness.
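For readers unfamiliar with header-based credentials, the access pattern described in section 6 of this facet (app_id and app_key values set in the HTTP request header) can be sketched roughly as follows. This is a minimal illustration, not the hackathon's actual client code: the endpoint URL, the Solr-style query string, and the credential values are all placeholders assumed for the example, since the report does not give them.

```python
import urllib.request

# Hypothetical endpoint and query: the real API Central proxy URL is not
# given in this report, so a placeholder host is used here.
API_URL = "https://api.example.edu/fsm/select?q=speech&wt=json"


def build_request(url, app_id, app_key):
    """Return a Request that carries the app_id/app_key credential headers."""
    return urllib.request.Request(
        url,
        headers={"app_id": app_id, "app_key": app_key},
    )


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    req = build_request(API_URL, "my-app-id", "my-app-key")
    # urllib stores header names capitalized ("App_id"); HTTP header names
    # are case-insensitive, so this does not change the request's meaning.
    print(req.header_items())
    # Sending the request would require a live endpoint:
    # urllib.request.urlopen(req)
```

Keeping the credentials in headers rather than in the query string means a proxy can accept or reject a request before it reaches the underlying search service.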
Facet 10: HathiTrust Research Center Extracted Features Dataset

Eleanor Dickson, University of Illinois at Urbana-Champaign

1. Why do it

HathiTrust Digital Library is a massive digital collection, comprising more than 15.8 million volumes, and growing. HathiTrust aims to leverage the scope and scale of the digital library to the benefit of research and scholarship. The collection includes considerable material under copyright or subject to licensing agreements, which prohibits HathiTrust from releasing much of it, either in the form of plain text files or scanned pages, as freely available data. The HathiTrust Research Center therefore develops tools and services that open the collection to data-driven research while remaining within the bounds of copyright and licensing restrictions, allowing only non-consumptive research (https://www.hathitrust.org/htrc_ncup).

One way the Research Center approaches this goal is through tools and technical infrastructure that mediate access to the data, including web algorithms researchers can run on HathiTrust data, the HathiTrust+Bookworm visualization tool, and the HTRC Data Capsule secure computing environment. Results from a user-needs assessment for text analysis conducted by the Research Center, as well as anecdotal evidence from researchers affiliated with HTRC, evinced the value of flexible, open data for text analysis research.
To this end, the Research Center released the HTRC Extracted Features Dataset (https://analytics.hathitrust.org/datasets) in 2015, which includes metadata and data derived from the HathiTrust corpus. The derived “features” in the dataset include page count, line count, empty line count, counts of characters that begin and end lines, and part-of-speech tagged word counts. The first release (v.0.2) included 4.8 million public domain volumes from the collection, and the second release (v.1.0) opened 13.7 million volumes from the collection, representing a snapshot of the entire HathiTrust Digital Library circa 2016.

2. Making the Case

The HTRC Extracted Features dataset was in part born from other projects at the Research Center, including the Andrew W. Mellon-funded HathiTrust+Bookworm project (https://bookworm.htrc.illinois.edu/develop/), that required the HTRC to process full volume text into alternate formats. The team working on these projects realized that the data they were deriving would likely be useful to researchers and satisfy the HTRC’s policy for non-consumptive research.

Much text analysis research begins with the process of generating so-called features from the original text, which are then counted and calculated to draw conclusions about the data. HTRC Extracted Features aids the researcher by providing the data already in feature format. Furthermore, this shift in format from full text to features distills the contents of the volumes into facts and metadata, discarding the original expression of the full text.
The Extracted Features dataset therefore strikes a balance, meeting the needs of researchers in a non-consumptive manner.

The research opportunities created by the release of HTRC Extracted Features were understood throughout HathiTrust and HTRC, and after review, the dataset was released.

3. How you did it

Deriving the HTRC Extracted Features was largely the work of Peter Organisciak (University of Denver), Boris Capitanu (University of Illinois), and Ted Underwood (University of Illinois). Together they collaborated to create a data model and write code to derive the extracted features.

The resulting dataset includes:

● For every volume: metadata, including bibliographic metadata, word counts, and page counts.
● For every page in a volume: part-of-speech tagged tokens (words) and their counts, plus metadata, including information about the page (number of lines, number of empty lines, counts of characters beginning and ending lines) and the language, which has been computationally determined.

HTRC Extracted Features are available in JSON format, where each file represents a volume. Within the JSON files, data is organized by page in the volume. JSON is a hierarchical file format popular for exchanging data, and it lends itself well to representing book data.
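To make the volume-and-page organization concrete, the following minimal Python sketch walks a small, made-up volume record and collapses its part-of-speech tagged counts into plain token counts. The field names used here (features, pages, tokenPosCount, and so on) and the sample values are assumptions chosen for illustration; the authoritative schema is given in the HTRC Extracted Features documentation.

```python
import json
from collections import Counter

# A tiny, invented volume record. Field names are illustrative assumptions,
# not a guaranteed match for the published schema.
SAMPLE_VOLUME = json.loads("""
{
  "id": "example.volume001",
  "metadata": {"title": "An Example Volume"},
  "features": {
    "pageCount": 2,
    "pages": [
      {"seq": "00000001", "lineCount": 3, "emptyLineCount": 1,
       "tokenPosCount": {"the": {"DT": 2}, "whale": {"NN": 1}}},
      {"seq": "00000002", "lineCount": 2, "emptyLineCount": 0,
       "tokenPosCount": {"the": {"DT": 1}, "sea": {"NN": 2},
                         "sails": {"VBZ": 1, "NNS": 1}}}
    ]
  }
}
""")


def page_token_counts(page):
    """Collapse one page's POS-tagged counts into plain token counts."""
    counts = Counter()
    for token, pos_counts in page["tokenPosCount"].items():
        counts[token] += sum(pos_counts.values())
    return counts


def volume_token_counts(volume):
    """Sum token counts over every page in a volume record."""
    total = Counter()
    for page in volume["features"]["pages"]:
        total.update(page_token_counts(page))
    return total


if __name__ == "__main__":
    for token, n in volume_token_counts(SAMPLE_VOLUME).most_common():
        print(token, n)
```

Because each file is one volume and each volume nests its pages, a researcher can work at either level: page by page, or rolled up to the volume as shown here.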
HTRC Extracted Features are available using rsync (https://linux.die.net/man/1/rsync), which HathiTrust tends to use to share data and is considered an efficient file transfer protocol. Volumes download in pairtree format (https://confluence.ucop.edu/display/Curation/PairTree), a highly nested directory structure.

The data can be retrieved with a structured URL that includes the standard HathiTrust volume identification number. The rsync URL format is: data.analytics.hathitrust.org::features/. More information about generating the rsync URL can be found here: https://wiki.htrc.illinois.edu/x/oYDJAQ.

4. Share the docs

The following sources contain more information about HTRC Extracted Features.

Code to extract features:
● https://github.com/htrc/HTRC-FeatureExtractor

Data paper:
● Organisciak, P., Capitanu, B., Underwood, T. & Downie, S.J. (2017). “Access to billions of pages for large-scale text analysis.” iConference 2017. Wuhan, China. http://hdl.handle.net/2142/96256

HTRC Extracted Features documentation:
● https://wiki.htrc.illinois.edu/x/WQCGAQ

HTRC Feature Reader toolkit:
● Python toolkit for interacting with HTRC Extracted Features: https://github.com/htrc/htrc-feature-reader/

5. Understanding use

The HTRC Extracted Features dataset is useful for both research and teaching.
As discussed in section 2 above, the feature format provides the data in a derived manner that aids the research process without over-mediating access to the data. As structured and pre-processed data, it does not meet the needs of all users, for example those whose work requires access to bigrams or longer n-grams, though it is useful for research that follows the bag-of-words model or that starts from token counts. Demonstrated uses have shown the data’s value in large-scale computational text analysis, such as text classification using machine learning techniques, and in the classroom for teaching data science and digital humanities. Exemplary uses are outlined below.

Text classification with HTRC Extracted Features

Ted Underwood at the University of Illinois has drawn on HTRC Extracted Features in his research on literary genres. His work in machine learning uses the features data, including words and word counts, characters, and computationally inferred, page-level metadata, to make inferences about genre in HathiTrust. Dr. Underwood classified volumes in the broad categories of fiction, poetry, drama, nonfiction prose, and paratext. His work classified over 800,000 volumes at the page level, and resulted in a derived dataset containing word counts by genre and by year for volumes from 1700-1922.

More information about this research is available on FigShare: http://dx.doi.org/10.6084/m9.figshare.1281251.
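The roll-up from page-level genre labels to word counts by genre and by year can be pictured with a small sketch. This is not Dr. Underwood's code or classifier: the records, genre labels, and counts below are invented, and the example assumes the page-level predictions already exist, showing only the final aggregation step.

```python
from collections import Counter, defaultdict

# Invented records: (publication year, predicted page genre, page token counts).
# In practice the genre would come from a trained classifier.
PAGES = [
    (1850, "fiction", {"whale": 2, "sea": 1}),
    (1850, "fiction", {"sea": 3}),
    (1850, "poetry", {"sea": 1, "rose": 2}),
    (1901, "fiction", {"city": 4}),
]


def counts_by_genre_and_year(pages):
    """Aggregate page-level token counts into one Counter per (genre, year)."""
    table = defaultdict(Counter)
    for year, genre, tokens in pages:
        table[(genre, year)].update(tokens)
    return table


if __name__ == "__main__":
    table = counts_by_genre_and_year(PAGES)
    for (genre, year), counts in sorted(table.items()):
        print(genre, year, dict(counts))
```

Only token counts are needed for this step, which is why a bag-of-words dataset is sufficient for it even though word order has been discarded.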
Pedagogical application of HTRC Extracted Features

Chris Hench and Cody Hennesy at the University of California, Berkeley have developed a module for the Berkeley Data Science Education Program that makes use of HTRC Extracted Features. In the first iteration of the module, students documented the use of Extracted Features in data visualization, mapping, and classification in Jupyter Notebooks. Their Notebooks will be re-used in the classroom over the next year. Chris will introduce the curriculum to students in his course, “Rediscovering Texts as Data.” In that multidisciplinary, digital humanities class, students will build on the existing Jupyter Notebooks as they develop coding skills. Chris also imagines using the Notebooks in workshops with non-programmers, where they will provide a legible introduction to text analysis by revealing how Python code is used to interact with the data without requiring attendees to program.

The Jupyter Notebooks are shared on GitHub: https://github.com/ds-modules/Library-HTRC.

6. Who supports use

Use of HTRC Extracted Features is supported by two main groups within the HTRC: the HTRC Tech Team and the HTRC Scholars’ Commons.
The HTRC Tech Team consists of research programmers, software engineers, and researchers (faculty, postdocs, and graduate students) affiliated with the University of Illinois School of Information (https://ischool.illinois.edu/) and Indiana University Data To Insight Center (https://pti.iu.edu/centers/d2i/people.html). The HTRC Scholars’ Commons group is made up of librarians from the University of Illinois and Indiana University who are affiliated with digital scholarly initiatives at their local campuses.

The Tech Team provides technical support for the data, including writing the code to generate the features, processing data on supercomputers at the University of Illinois and Indiana University to derive the dataset, and providing reliable access to the data. The HTRC Scholars’ Commons supports research and teaching with the suite of HTRC Tools and Services. The Scholars’ Commons leads workshops, conducts outreach, and offers support to researchers who have questions about using the dataset. The HTRC Tech Team and Scholars’ Commons have collaborated on questions of data curation and preservation of the dataset, discussed in more detail in section 7 below.

7. Things people should know

At the scale of HathiTrust, challenges to access and storage become particularly acute. Crunching feature data for millions of files is computationally expensive, and requires access to high performance computers. HathiTrust is also a non-static collection: volumes are added daily, and (with less frequency) volumes are removed. For these reasons, HTRC has versioned the dataset following a “snapshot” model.
Due to the time it takes to generate the features, the dataset will never be exactly current with the HathiTrust Digital Library, but instead captures the collection at a moment in time. The Research Center continues to provide access to both extant versions of the dataset, v.0.2 (https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=37322766) and v.1.0 (https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset), but in the future, may have to look to alternate models for access to versions. Each version of the dataset is terabytes in size and storage may prove an issue if every new version includes features for the entire corpus.

Others interested in creating derived datasets as a model for opening access to restricted collections should consider what features would be useful to their researcher community. In addition to the token (word) counts, HTRC Extracted Features includes additional metadata, some of it processed from MARC records and some calculated during feature extraction, that we hope provides valuable context for researchers who want to make use of the dataset. Other collections with other perceived user communities may want to include additional features.

8. What’s next

As HathiTrust continues to grow, the HTRC Extracted Features dataset will be periodically updated with new versions. Between the first and second releases of the dataset, significant changes were made to simplify the data model that required all of the data to be re-crunched. In future releases, only new or differing files may need to undergo feature extraction.
Still, there are some issues in the existing data, primarily related to the tokenization of Chinese-, Japanese-, and Korean-language text, that HTRC plans to improve on in future releases.

Facet 11: Beyond Penn’s Treaty

Michael Zarafonetis, Haverford College; Sarah M. Horowitz, Haverford College

1. Why do it

At Haverford, we believe that libraries should move beyond the creation of digital images of original sources. Digital materials should allow scholars to do interesting and amazing things with our unique collections beyond what is possible with their physical incarnation, rather than trying to replicate the experience of the original. We believe that “digitization” encompasses all of this work, rather than just the creation of images. As part of our efforts to make our collections available to a wider set of users and to be used in new and interesting ways, we have developed a number of projects that use this expansive definition of digitization with public-facing websites that facilitate exploration of the collections.

Beyond Penn’s Treaty (https://pennstreaty.haverford.edu/) fits into this effort for a number of reasons.
While it includes digital images of materials (primarily journals and letters written by Quaker travelers in the late eighteenth and early nineteenth centuries), it also has added value in the form of TEI encoded and linked text (https://github.com/HCDigitalScholarship/penns_treaty_data), as well as further information on the people, places, and organizations encoded. The materials from Quaker & Special Collections included in the project are frequently requested, making them good candidates for digitization and wider distribution.

2. Making the Case

The types of materials included in this project are some of the most requested by researchers and scholars using Quaker & Special Collections. Many of the included documents had only recently been cataloged as part of a grant-funded project. Because much of the work for the project was in-scope for the Digital Scholarship team (creating databases, writing code, etc.), we needed only informal approval from the library director. She approved it based on the project’s ability to showcase these newly-cataloged materials and add to our growing collection of digital collaborations between Quaker & Special Collections and Digital Scholarship.

3. How you did it

We collaborated with colleagues at the Friends Historical Library (FHL) at Swarthmore College to add their materials to the digital collection of travel journals and letters. Items from Haverford and FHL were scanned in their respective departments.
The Digital Scholarship team at Haverford, at the time composed of two DS librarians and several student assistants, then migrated the digital objects from a CONTENTdm instance to a locally hosted Omeka instance with the Scripto/Scribe plugin and theme to facilitate transcription. Student workers in the library (in both DS and Quaker & Special Collections) transcribed materials during their shifts. Summer interns at Swarthmore (2016) and Haverford (2017) encoded the materials in TEI XML and shared those transcriptions in a Google Drive folder while also producing a master database (Google Sheet) of biographical, location, and organization records. An additional intern also worked on cleaning geographical data and building maps tracing travel routes recorded in the documents. Student interns were overseen by staff from Quaker & Special Collections and Digital Scholarship with expertise in the subject, technologies used, and metadata. Pat O’Donnell at FHL provided subject expertise in Quaker biography and history, as well as experience with authority control for Quaker records, to help build out the database and provide quality control for the records created. The transcribed and encoded documents are made accessible to the public in a custom-built Django site, Beyond Penn’s Treaty, that provides multiple entry points to the collection.
Users can explore several maps that trace the routes of Quaker travelers and search across the entire collection for person, place, and group names. The encoding of the documents creates future opportunities for visualizing the collection based on researcher interests.

4. Share the docs

The TEI XML documents are publicly available in a GitHub repository (https://github.com/HCDigitalScholarship/penns_treaty_data), as is the code for the Django site (https://github.com/HCDigitalScholarship/QI/tree/master/QI). We have a Google Doc (https://docs.google.com/document/d/1AMwzcHuydaaGk6-TaD5fYQCFOgXAirpuGGoDNFf9h0A/edit) with instructions for scanning, transcribing, and encoding materials.

5. Understanding use

Like most of our digital scholarship projects, Beyond Penn’s Treaty is outfitted with Google Analytics to allow us to track basic metrics of use on the page. However, beyond that, our data about use is mostly anecdotal. Since we provide all the materials for people to download and use, we only hear about these uses if they get in touch. Because the project is relatively new, we are not aware of any major uses of this data.

6. Who supports use

Use of the data is supported by Digital Scholarship and Quaker & Special Collections. The Coordinator for Digital Scholarship and Services and the Digital Scholarship Librarian have led the development of the Django site, with regular input from the Head of Quaker & Special Collections. In the past year, encoding and transcription work and some of the Django development have also been managed by our Metadata Librarian, who has dedicated time for DS projects built into their job responsibilities and is a member of the DS team.
Special Collections and DS staff continue to work together to identify funding opportunities and to create student internships to continue the digitization, transcription, and encoding of new materials.

7. Things people should know

Much of the work involved with this project was done by student interns. This is a familiar model for us, and one that works well in an undergraduate liberal arts setting. Working with students is not necessarily less work than doing such a project in other ways, however, as they need lots of oversight and supervision. Such deep opportunities can be transformative experiences for students and rewarding for all those involved in such projects.

While this was a new project for us, it is built on other work we had done. We have used Django as the framework for a number of other projects, such as Quakers & Mental Health (http://qmh.haverford.edu/), and the transcription and transformation process we employed was similar to that of the Ticha project (https://ticha.haverford.edu/en/). The project also built on the strong collaboration between Digital Scholarship and Quaker & Special Collections.

8.
What’s next

Since all of the documents in the project are encoded in XML, we can create visualizations of many different kinds to explore the collection as a whole and the connections between people, places, and groups within it. We also hope to integrate the people, places, and organizations that have been encoded into a Quaker linked data project that we are building. This application will allow researchers to explore connections across our entire suite of Quaker projects.

Facet 12: Ticha: A Digital Text Explorer for Colonial Zapotec

Brook Lillehaugen, Haverford College; Michael Zarafonetis, Haverford College

1. Why do it

The digitization, transcription, and encoding of these documents (https://github.com/HCDigitalScholarship/ticha-xml-tei) is part of Dr. Brook Lillehaugen’s linguistics research on the Zapotec family of languages in the Oaxaca region of southern Mexico. The documents include printed texts and manuscripts written by Spanish monks, bills of sale, religious testaments, land deeds, and other manuscripts that include the Spanish, Latin, and Zapotec languages. The work has been done over the past several years and continues as the project team explores more archival material in Mexico. The transcription and encoding are crucial to creating a digital annotated version of colonial-period texts that include the Zapotec language, a version that incorporates morphological analysis within the texts.
Additionally, the public interface (https://ticha.haverford.edu/en/) features a transcription tool that allows the public to transcribe documents, providing avenues for students, other scholars, and indigenous community members to engage with the materials.

2. Making the Case

No administrative case needed to be made, as digital scholarship staff in the Haverford library supports faculty and student research. This project is essential to Dr. Lillehaugen’s research. The main institutional or administrative barrier is obtaining permission from various Mexican archives to make the images publicly available.

3. How you did it

The project is composed of several workflows. The first is digitization of archival manuscripts (bills of sale, religious testaments, etc.), which is done primarily by project team members: faculty, student research assistants, and librarians. The Ticha project employs a postcustodial approach (https://www2.archivists.org/glossary/terms/p/postcustodial-theory-of-archives) to the creation of the digital archive. The digital images are organized and stored in a Dropbox folder, and uploaded to an Omeka instance with the Scribe/Scripto theme and plugin combination. There they are described by student assistants, and made available for transcription. Once the transcriptions are complete, they are visible alongside the image of the manuscript.

For printed texts and bound volumes, transcription and encoding is done by students in Dr. Lillehaugen’s Colonial Valley Zapotec class. Using Git and GitHub for version control, students transcribe texts digitized at the Internet Archive and push their work to a remote repository.
Making several passes at their assigned sections, they encode for language, outline structure, and formatting in TEI XML markup. We chose TEI to adhere to an encoding standard for texts, and to draw comparisons across texts in the growing collection. This XML markup is merged with an export of morphological analysis from the FieldWorks Language Explorer (FLEx) (https://software.sil.org/fieldworks/), a popular software package in the field of linguistics, and the merged result is then rendered into HTML for the public site.

The public website is built in Django, a Python framework for the web, because many of our student assistants are Computer Science majors who learn Python in their introductory courses. Using the Omeka API, we can update the data and metadata in the archival materials section of the site by running a Python script. We also provide a download link to the plain text transcriptions of each page on the website. A bulk download option of all texts is coming soon.

4. Share the docs

Most of our documentation is in the GitHub repository (https://github.com/HCDigitalScholarship/ticha-xml-tei) for the encoded texts.

5. Understanding use

The materials on the site can be used freely under a Creative Commons Attribution and Share-Alike license. The encoded transcriptions are of research value to Dr. Lillehaugen and linguists who study the Zapotec family of languages.
Access to the documents (both the digitized originals and the transcriptions) is important for community members to explore their language and history. It is by soliciting direct input from these community members and from workshops in Oaxaca that the public interface facilitates this exploration. We continue to consult our Zapotec-speaking collaborators on design and interface questions.

By providing access to the encoded texts in TEI XML, we hope that scholars can find interesting ways of visualizing the collection.

We use Google Analytics to track usage of the project, and to help us make design decisions.

6. Who supports use

The Digital Scholarship team in the Haverford library provides technical support for the project, with server space for the public interface provided by Instructional and Information Technology Services. Mike Zarafonetis (Coordinator for Digital Scholarship and Services and a project team member) and Andy Janco (Digital Scholarship Librarian) provide project management and technical support for the project. Technical work (TEI quality control, Django project feature development, etc.) is done by student research assistants and DS student assistants. DS also provides instructional support for Dr. Lillehaugen’s class, in which students collaboratively transcribe and encode the larger printed texts.

7. Things people should know

This project is very inclusive of undergraduate students in the work of transcribing, encoding, and developing the web platform for the public site.
This is a model that is familiar to us in the Haverford libraries, and one that is aligned with our goals as a liberal arts institution. These students require a good deal of instruction and supervision, but such deep opportunities can be transformative experiences for them and rewarding for all those involved in such projects.

Additionally, members of the project team are very intentional about incorporating feedback from Zapotec-speaking community members. The transcription feature, for example, grew out of a request from speakers of the language who wished to contribute to the project. Thinking expansively about our user base, particularly beyond a strictly scholarly audience, is important.

8. What’s next

We continue to add more archival manuscripts and bound texts to the public interface. Students are currently encoding and transcribing Fray Leonardo Levanto’s Arte de la Lengua Zapoteca, and we hope to have the encoded version completed by the end of 2017. The next printed text for transcription, encoding, and analysis will be Juan de Cordova’s Vocabulario en Lengua Zapoteca.

We also plan to add interlinear analysis of the Zapotec language to the archival manuscripts in the near future, which breaks down glosses by component parts. Interlinear analysis is already in place for some of the printed texts (see this example page from Juan de Cordova’s Arte: https://ticha.haverford.edu/en/texts/cordova-arte/13/original/).
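As a rough illustration of how TEI XML transcriptions like those described in this facet might be explored programmatically, the Python sketch below counts elements by their xml:lang attribute. The sample document, its element choices, its placeholder text, and the language codes are assumptions made for illustration only. They are not taken from the Ticha repository, whose actual encoding conventions may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace and the qualified name ElementTree uses for xml:lang.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample for illustration; not an excerpt from the Ticha texts.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="chapter" n="1">
        <p xml:lang="es">[Spanish text]
          <foreign xml:lang="zap">[Zapotec word]</foreign>
        </p>
        <p xml:lang="la">[Latin text]</p>
      </div>
    </body>
  </text>
</TEI>"""


def count_language_tags(tei_source: str) -> Counter:
    """Count how many elements carry each xml:lang value."""
    root = ET.fromstring(tei_source)
    counts = Counter()
    for element in root.iter():
        lang = element.get(XML_LANG)
        if lang is not None:
            counts[lang] += 1
    return counts


def count_paragraphs(tei_source: str) -> int:
    """Count TEI <p> elements, using the TEI namespace explicitly."""
    root = ET.fromstring(tei_source)
    return len(root.findall(f".//{{{TEI_NS}}}p"))


language_counts = count_language_tags(SAMPLE_TEI)
paragraph_count = count_paragraphs(SAMPLE_TEI)
```

A tally of this kind could serve as a starting point for the visualizations the project team hopes scholars will build, for example by charting the share of each language in a given text.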
Facet 13: Vanderbilt Library Legacy Data Projects

Veronica Ikeshoji-Orlati, Vanderbilt University

1. Why do it

The Jean and Alexander Heard Library has become the repository for dozens of digital projects executed across the university. As stewards of these digital collections - encompassing databases, archives, e-editions, and exhibitions - it is incumbent upon us to ensure not only the availability, but also the accessibility of these resources to current and future generations. Every digital project is the product of hundreds, if not thousands, of hours of intellectual labor. Facilitating (re)use of the contributions of digital scholarship pioneers and practitioners requires that their work be thoughtfully curated, documented, and made publicly available.

2. Making the Case

The administrative case for instituting a “data-first” policy of distilling the content and structures of digital projects into machine-actionable datasets is driven not only by ideological considerations but also practical ones. Fundamentally, the infrastructure to support continued development of sunsetted digital projects without personally invested stakeholders is lacking. Devoting the time and expertise required to satisfactorily migrate and maintain all sites built in Drupal 6, for example, is not fiscally viable if the library is to care for an ever-burgeoning collection of digital projects.
In addition, the CLIR Postdoctoral Fellowship Program in Data Curation has allowed the library to experiment with integrating digital data curation practices into Digital Scholarship workflows.

3. How you did it

The first dataset curated by current CLIR postdoctoral fellow Veronica Ikeshoji-Orlati is the e-edition (http://diglib.library.vanderbilt.edu/baud-search.pl) of Raymond Poggenburg’s Charles Baudelaire: Une Micro-histoire. Poggenburg initially published the Micro-histoire in 1987 as an entry-based chronology of the life of Charles Baudelaire (1821-1867). In the early 2000s, an expanded e-edition of the Micro-histoire was published by the Vanderbilt University Press and Jean and Alexander Heard Library. In 2016, due to the deterioration of the Perl framework on which the e-edition was built and the library’s desire to increase the accessibility of the Micro-histoire’s contents, the data and metadata from the relational database underlying the e-edition were extracted into CSV format. Data cleaning was accomplished with OpenRefine, and the Library of Congress Metadata Object Description Schema (MODS) (http://www.loc.gov/standards/mods/) version 3.6 was selected for structuring the data and metadata in XML format. The dataset is currently in a GitHub repository awaiting legal counsel’s approval for public release. The process of curating the Micro-histoire dataset was presented at the IDCC 2017 conference (http://www.dcc.ac.uk/sites/default/files/documents/IDCC17~/presentations/VAI-CBA_IDCC17_presentation.pdf).

4.
Share the docs

Legacy data curation protocols and institution-wide data management policies are currently being drafted. Each project, in its public release through the library GitHub account (http://heardlibrary.github.io/), is accompanied by documentation specific to that project.

5. Understanding use

Our goal in making Vanderbilt’s digital project datasets publicly available under CC0, CC-BY, or CC-BY-NC licenses (as appropriate) is to facilitate (re)use of the data in research and teaching contexts. It is anticipated that the communities currently utilizing the digital projects will engage with the curated datasets for their research purposes. In addition, new users interested in scholarly meta-analyses or large-scale quantitative research may incorporate the library’s datasets into their work. In the case of the Poggenburg Micro-histoire dataset, for instance, Baudelaire scholars are the most likely audience, but those interested in broader questions in French history and literature may find the data of use, too. While the users for each dataset may differ, it is hoped that the curated datasets will also be of service to teachers working with students to learn how to interrogate humanities and social science data in meaningful and methodologically sound ways.

6.
Who supports use

Members of the Digital Scholarship and Scholarly Communications team (https://www.library.vanderbilt.edu/scholarly/) in the Jean and Alexander Heard Library are the primary facilitators for data acquisition, curation, publication, and use projects on campus. A new position, the Curator of Born-Digital Collections, has been created in order to continue curation efforts on library-housed digital datasets. In order to encourage campus use of the datasets, the Digital Scholarship team conducts regular workshops and hosts working groups in Linked Data and the Semantic Web, Tiny Data (data curation for the humanities), GIS, and XQuery to develop a cohort of data-literate faculty, staff, and students around campus.

7. Things people should know

As many data curators may already know, an overwhelming majority of one’s time is given over to data cleaning and standardization (https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html). To successfully run a data curation program within a library, it is critical to translate the lessons learned in curating legacy data sets to training programs in data management for researchers across campus. The data-driven research projects of today are the data curation challenges of the future, so establishing sound data management practices in current digital projects streamlines the process of ingesting them into the library’s collection when they are completed. In addition, a data curation program must be grown in tandem with digital scholarship education infrastructure in order to arm teachers and researchers with the programming skills required to grapple with the curated datasets.

8.
What’s next

Currently, Veronica Ikeshoji-Orlati is curating the TV News dataset, a collection of nearly 1.1 million abstracts of news broadcasts from ABC, CBS, NBC, CNN, and Fox News dating back to August 5, 1968. The Vanderbilt Television News Archive (https://tvnews.vanderbilt.edu/) is one of the richest resources for US news reporting in the 20th and 21st centuries, but access to the metadata is limited due to the current web interface. In order to facilitate not only improved discoverability of news segments, but also quantitative analysis of the dataset as a whole, Ikeshoji-Orlati is collaborating with Suellen Stringer-Hye (Linked Data and Semantic Web Coordinator), Steve Baskauf (Senior Lecturer of Biological Sciences), Zora Breeding (Cataloguing and Metadata Team Leader), and Jacob Schaub (Music Cataloguer) to map the dataset to the IPTC Newscodes Vocabulary (https://iptc.org/standards/newscodes/). In addition, she is working with Lindsey Fox (GIS Librarian) to enrich the dataset with geospatial data.

Facet 14: The Museum of Modern Art Exhibition Index

Jonathan Lill, MoMA Archives

1.
Why do it

Since 1929, The Museum of Modern Art (MoMA) has been, and remains, the preeminent art institution in the history of 20th and 21st century visual culture. Through groundbreaking exhibitions about Cubism, abstract art, Surrealism, and other art movements, MoMA led the way in promoting artists who are now household names. MoMA established a holistic approach to the understanding of Modernism by exhibiting and establishing curatorial departments devoted to film, architecture and design, and photography. MoMA demonstrated that those fields of activity were worthy of critical analysis and appreciation.

The Museum Archives works continually to tell that history of the Museum, and to organize and provide access to the documents and records that evince those decades of activity. We strongly believe that exhibition history is an important scaffold that can be used to build an understanding of MoMA’s accomplishments. Indexing exhibition artists and curators provides researchers new pathways of exploration while linking archival resources and artworks in the collection. This work helps increase exposure and use of MoMA Archives’ historical collections and the dissemination of MoMA’s history.

2. Making the Case

In 2014 the MoMA Archives received funding to organize and describe MoMA’s exhibition files, which comprised paper records from all curatorial departments and the museum registrar for exhibitions staged since 1929. We decided that an exhibition index could be built as part of that project workflow.
Due to our experience fielding public and staff inquiries and guiding user research, the Archives had developed an appreciation of the utility of an exhibition index. How this data might be made available to researchers was unknown at the inception of the project.

While the Archives was working on this project, MoMA hired a new director of web and video who was given the mandate of radically expanding the Museum’s web content. She understood that our data could power the deployment of thousands of new web pages devoted to historical exhibitions, which could then be linked to numerous digital resources such as scanned press releases, exhibition catalogues, and installation photographs. Only with the web team pushing this project forward was the Archives able to move to completion. The new exhibition pages launched in September 2016. The data set was published to GitHub (https://github.com/MuseumofModernArt/exhibitions) at the same time.

3. How you did it

The MoMA Archives had long maintained a simple list of historical exhibitions. I built an Access database, parsed that list, and imported a table of over 50,000 artist names from the Museum’s collection management system (The Museum System, TMS, vended by Gallery Systems).
I created a simple interface that allowed interns to connect names to each exhibition using drop-down menus and, when necessary, to create new name records. Additional data was gathered from exhibition checklists scanned as part of the larger exhibition files project. The database structure allowed for easy review of the data, error checking, editing, and other maintenance. Once the indexing was largely completed, names in the index were reconciled to VIAF identifiers using OpenRefine. The VIAF IDs were then used to add Wikidata QIDs and Getty ULAN record numbers. Once this data was used to generate web pages, URLs for exhibitions and artists were added back into the dataset. Gallery Systems assisted with importing the data back into TMS from the Access-generated CSV files. The web team extracted data from TMS to ingest into the web system as they do with collection objects and other data. A simple flat version of the data was posted to GitHub.

This project required close collaboration among several departments: the MoMA Archives, the data asset management system administrators who managed all the digital objects to be connected to our new exhibition web pages, the TMS administrators, and the digital media team. Importantly, this was the first time the Archives took responsibility for historical exhibition data in our collection management system and on the web site, involving us more closely in some key museum systems.

4.
Share the docs  All documentation for the exhibition index and MoMA’s collection are located on Github, along                            with the actual datasets:  https://github.com/MuseumofModernArt/exhibitions     5. Understanding use  The immediate and most practical use of this data is for answering research inquiries: who was                                in an exhibition, how many exhibitions has an artist been in, how often two artists have been                                  exhibited together, etc. This amounts to significant daily usage by library and archival researchers                            as well as the general public. With basic database or spreadsheet skills, more advanced inquiries                              can be answered by this data such as who was the youngest artist to be given a solo exhibition at                                        MoMA? Or which artists have been exhibited most frequently without having works in the                            collection?    Separate from immediate needs of art historians and scholars, we expect this resource should be                              of tremendous use in classroom teaching about specific artists, modern art, and museology in                            America. Further, we believe this data can be used to connect digital and archival resources                              across the web. The exhibition index is less important for the information it contains than for the                                  people, things, and data it allows a user to connect together. Its real potential is only realized                                  when connected to Wikipedia entries, library union catalogs, and other datasets such as  Social                            Networks and Archival Context (SNAC) or the American Art Collaborative. 
Ideally, this index can                            serve as a model for a multi‑institution pooling of exhibition and artist data and online archival                                resources.  72  https://github.com/MuseumofModernArt/exhibitions https://snaccooperative.org/?redirected=1 https://snaccooperative.org/?redirected=1 5/22/2019 aac_finalreport - Google Docs https://docs.google.com/document/d/12qb6J5dSMcQ0Rt2JE0zAnF_-BpRY04PEMLL8U_Qrjf8/edit# 73/180   6. Who supports use  [blank]    7. Things people should know  To build an exhibition index with any speed, the materials that provide the data must be located                                  and near at hand, preferably digitized, which is why conducting this work alongside a digitization                              or processing project is ideal. OCR of archival documents does not yield readily usable data.                              Facility with database applications and data manipulation software or programming languages is                        key. But most important is having labor to perform the data entry. Our workflow proved that                                with a narrowly constructed date‑entry interface, precise detailed instructions, and proper                      supervision and review, that this work can be swiftly and effectively performed by                          non‑professional staff and interns. Beginning with imported name records and other data                        increased efficiency and reduced mistakes. Error checking of the data showed that the error rate                              was within acceptable bounds and that most errors were omissions in data.    8. What’s next  Our initial funding allowed us to build an exhibition index from 1929 through 1989 (while                              primarily processing and opening to the public tens of thousands of folders of paper records). 
A new round of funding is now allowing us to extend that work through 2000, merge it with more recent data created in TMS, and further enrich the data by adding exhibition information such as department of origin, physical location, and subject tags. We are also working to combine this data with the exhibition index of MoMA PS1 (constructed as a smaller local project five years ago) and can begin to explore merging this data with that of other institutions such as the New Museum, White Columns, and other arts institutions.

Facet 15: Social Feed Manager

Laura Wrubel, Software Development Librarian, George Washington University; Justin Littman, Software Development Librarian, George Washington University; Dan Kerchner, Senior Software Developer, George Washington University

1. Why do it

Social media platforms produce and disseminate a record of our cultural heritage and are a source of data for answering research questions from numerous disciplines. After learning about a George Washington University faculty member’s research which involved collecting tweets using a manual process, we developed prototype software in 2012 to connect to Twitter’s APIs and help her collect data. Conversations with our university archivist highlighted use cases for collecting social media in the archives for future researchers.
We saw a role for the library to build better tools for our community to conduct social media research. This led us to develop Social Feed Manager (https://gwu-libraries.github.io/sfm-ui/), which empowers researchers to build collections and enables libraries to proactively create datasets for use within their community. Along with providing data, we offer a consultation service for students, faculty, and researchers, as well as archivists and librarians, to access and use social media data.

2. Making the Case

Development of Social Feed Manager started through an IMLS Sparks grant and proceeded with support from the National Historical Publications and Records Commission (https://www.archives.gov/nhprc) and the Council on East Asian Libraries (http://www.eastasianlib.org/MellonGrants.htm). Library leadership participated in and supported these grants, which defined work proceeding from our existing relationships with faculty and archivists. Grant funding and project deliverables, as well as researcher and archivist needs, drove the allocation of staff time from developers, archivists, and librarians to support the work. Developing software and building a service supporting social media research might appear to be peripheral to typical library operations. Yet the growing integration of the library’s staff into research projects (https://gwu-libraries.github.io/sfm-ui/data-research/#research-using-social-feed-manager), including funded research, SFM’s popularity with students at all levels, and the prominence of projects supported by data collected using SFM have become compelling evidence of its value and of how this work supports library strategic goals concerning research and cross-disciplinary collaboration.

3. How you did it

Our initial project team in 2013-14, funded by a Sparks! grant from IMLS, was small and focused: the library’s director of scholarly technology (who served as project manager and principal investigator), a software developer, our e-resources content manager, and a graduate student developer. In this first phase, we developed a suite of utilities and an administrative interface to manage collecting activities against the Twitter public APIs. A basic user interface provided access to data from Twitter user timelines, one at a time. We collected data of interest to the GW research community and in support of specific faculty and student research projects. This included tweets by members of Congress, news outlets, and public sports and entertainment figures. The project team mediated much of the data collection, as well as any export of data beyond simple downloads of an individual timeline’s tweets.

In our second round of grant funding from the National Historical Publications and Records Commission and the Council on East Asian Libraries, we further developed the software and widened staff involvement in the project.
Our grant funded the exploration of social media archiving and thus several of our archivists and our digital services manager participated as team members. The project included a significant software development component, as we added social media platforms, built a user interface to empower researchers to manage their own collections, and added more functionality overall to manage collecting from the Twitter, Tumblr, Flickr, and Sina Weibo APIs. To improve SFM’s usability, our grant from NHPRC supported bringing on a UX consultant to conduct an expert review of its interface. We also brought on an experienced digital archivist to review the technical architecture and archival use cases. We wrote documentation and a quick start guide for both end users and other institutions using Social Feed Manager.

As a library, we actively collected tweets related to topics of interest on the GW campus. The largest and most heavily used collection has been our 2016 elections collection (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PDI7IN), containing over 280 million tweets. To facilitate making this data accessible to the GW community and beyond, a team member created TweetSets (https://tweetsets.library.gwu.edu/), which provides a self-service interface for the GW community to download data and for the broader community to download tweet identifiers.

The changing terms of use for social media platforms and accompanying changes to APIs are a challenge both for maintaining working software and supporting research.

A current challenge is tracking and keeping up with the many research projects that use SFM. We want to be able to tell the story about the students and faculty in a wide range of disciplines and schools who are using SFM, and the contributions our librarians make to this work.

4. Share the docs

Documentation for the Social Feed Manager software.

The following documents are available through the Social Feed Manager project site (https://gwu-libraries.github.io/sfm-ui/):

● Social media research ethical and privacy guidelines (https://gwu-libraries.github.io/sfm-ui/resources/social_media_research_ethical_and_privacy_guidelines.pdf): general guidelines for GW researchers focusing on the collecting, sharing, and publishing of social media data
● Social Feed Manager: Guide for Building Social Media Archives (https://gwu-libraries.github.io/sfm-ui/resources/SFMReportProm2017.pdf), Christopher J. Prom (2017)
● Building Social Media Archives: Collection Development Guidelines (https://gwu-libraries.github.io/sfm-ui/resources/guidelines)

The details of our software development work are available on GitHub (https://github.com/gwu-libraries/sfm-ui). This includes issue-tracking and prioritization, past and ongoing milestone activity, and release notes. We also publish blog posts (https://gwu-libraries.github.io/sfm-ui/blog) with each release, highlighting new features useful to the community and sharing tips for collecting and working with the data.

5. Understanding use

Our consultation model means that we typically have contact with users of Social Feed Manager and/or social media data and have an ongoing conversation about the analysis methods, findings, and outcomes of their research. This model also supports including discussion about ethical use of social media data.

In addition to being publicly available from TweetSets, several proactively collected datasets are available publicly on Dataverse, as sets of tweet identifiers. Twitter’s terms of use do not allow full tweet data to be shared, but tweet identifiers may be shared for research purposes. A researcher can pull the full tweet, or “hydrate” it, from Twitter’s API. Download metrics are available through Dataverse and its collections are highly discoverable via Google. We receive occasional follow-up requests or questions and track citations of datasets we’ve published.

Within the university, we track the schools and departments we’ve interacted with and monitor for published research, presentations, and posters that use SFM.

6. Who supports use

We have a team of software developer librarians who develop Social Feed Manager, provide consultations with faculty and students, teach workshops, and manage related services. Our subject specialist librarians are a frequent source of referrals. Our data services librarian sometimes participates in consultations, especially where they involve the larger research data lifecycle.

7. Things people should know

Ethical and privacy considerations need to stay at the forefront of this work and are a thread throughout the software development, research consultation, and instructional aspects of this work.

It is not enough to provide a tool for building social media collections: users will need support in understanding and optimizing their collecting parameters, understanding the data, and finding ways to manipulate or reformat it for analysis. We work with freshmen in writing seminars, undergraduates and graduate students from a wide range of disciplines, and faculty, all with varying familiarity with CSV and JSON data, social media platforms, and research methods suited to social media data.

Social media platforms are constantly changing. Terms of use and API affordances are designed for commercial users rather than academic or research use. It’s necessary to spend time understanding social media platforms and researcher needs, and staying up to date, since what is available is always changing. Advocacy for researcher needs can sometimes lead to change in platform terms, even if only over the long term.

8. What’s next

We are continuing to maintain Social Feed Manager and trying to keep up with changing API affordances. We’re further developing our workshops and outreach on campus.
The interest in our 2016 elections collection has led to our working with external audiences for this data such as journalists and non-profits, and we participate in conferences related to that work. We’re being proactive about the 2018 midterm elections and collecting with future research uses in mind.

Appendix 3: Collections as Data Personas

October 2017 - April 2018

Collections as Data (CAD) Personas represent an initial set of high level role types associated with collections as data activity. While distinctions are fuzzy in the context of disciplinary and professional praxis, roles represented by personas can generally be understood in alignment with data stewardship or use. On the whole, personas aim to surface needs, motivations, and goals in context. These representations are derived from Collections as Data project engagements and project team experience.

In Agile software development, a persona is used to help develop a broadly shared orientation to user experience. Gary Geisler has written, “Personas offer a way to summarize findings from user research and help determine user requirements and priorities. These documents help project teams develop a common understanding of a project’s intended audience and priorities.
They also serve as a useful reference for design decisions throughout the development process.”

Appendix 4: 50 Things

Want to support collections as data at your institution, but not sure how to begin? Drawing on what we learned from engaging with practitioners and researchers throughout the Always Already Computational project (https://collectionsasdata.github.io/), the project team compiled a list of 50 Things you can do to get started. 50 Things is intended to open eyes, stimulate conversation, encourage stepping back, generate ideas, and surface new possibilities. If any of that gets traction, then perhaps you can make the case for investing in collections as data at your institution in a meaningful, if not systematic, way.

Our best advice: start simple and engage others in the process. You may find some activities listed here are already underway!

About this publication: 50 Things was published in October 2018 under a CC BY-NC-SA 4.0 license.

1. Know how optical character recognition (OCR) output is produced in your digitization workflows. What software is used? What formats are created? What levels of accuracy are produced? Where is it stored? Is it available for user download?

2. Create an inventory of full-text collections managed by your institution. Document rights status, license status, discoverability, and downloadability. Ask the question: are we offering optimal access for computational use of the full-text? How can we make it better?

3. Migrating a legacy digital collection to a new system or platform? Take the opportunity to make the content accessible to researchers who have computational projects in mind.

4. Interview the archivist, librarian, or curator responsible for a digital collection to document data provenance and decisions made in the course of collection processing and digitization. Work to make this information publicly available.

5. Inventory your data holdings. Just make a simple list. And then commit to keeping it up to date, and watch it grow.

6. Add new fields to the collection management database to indicate and describe data components.

7. Survey your digital collections to identify characteristics (good metadata, open access, good OCR, high usage, relevance to a high-profile academic program or research area at the institution) that lend themselves to high impact as data.

8. Recognize and identify the things you need to do differently from what has been done for physical collection objects.

9. Find out if your digital collection database or access platform has an API available for querying by the public. If it does not, see if it is possible to develop one. If it does, determine if it is actively used. If it is actively used, see if you can reach out to users and ask about their usage!

10. Talk to a colleague responsible for systems that provide networked access to digital collections about possible approaches to facilitate download of collection data in bulk.

11. Add terms of use to your archival finding aids.

12. Read the language of your organization’s collection deed of gift or purchase agreement to evaluate whether it allows for providing access to collection content in the form of data.

13. Review your digital collections metadata and evaluate the rights statements and license statements in terms of consistency and clarity. Are you able to adopt rightsstatements.org (http://rightsstatements.org/)?

14. Socialize Collections as Data as something that can be supported by units and staff across the library. Identify some champions across the organization and people who have the skills or position to do the work.

15. Talk to people responsible for research data management to encourage planning for data preservation and other considerations that make it possible for others to reuse the data in the future.

16. Review your institution’s mission statement or strategic plan documentation, and consider if and how Collections as Data activities are aligned with and support it.

17. Share sample projects with community partners to give them an idea of how their collections can be used and be relevant to new ways of conducting scholarship.

18. Network with people who work with data and have the skills or knowledge you need to get your work done.

19. Identify barriers and limitations to the services you can offer or support, and talk with colleagues about creative, feasible solutions to overcome them.

20. Publish or present on "Wikidata for librarians," including case studies of libraries working with Wikidata to expand discovery of collections.

21. Read up on IIIF (for example, check out this useful tutorial: https://iiif.github.io/training/iiif-5-day-workshop/) and determine what hurdles to implementation exist at your institution. Then talk to relevant folks about what it would take to overcome them.

22. Read the resources in the Always Already Computational project's Zotero library (https://www.zotero.org/groups/2171423/collections_as_data_-_projects_initiatives_readings_tools_datasets/items?).

23. Develop a workshop focused on the use of data in and about collections; shop it to department faculty and incorporate it into research orientations for faculty and students.

24. Mentor a liaison interested in learning a data science skill who is well positioned to identify datasets and data support needs amongst their researchers.

25. Conduct user testing of your library’s main discovery environment, with the goal of understanding how easy or hard it is for a researcher to find the available data collections.

26. Develop a portal page with a site map specifically for discovering collections at your institution available for computational use and related support services.

27. Begin tracking demand for and use of data in and about your collections.

28. For a collection that cannot be made available openly on the web, investigate if your organization is able to support mediated access to the data, such as through an offline or encrypted workstation.

29. Prepare and provide datasets that are intentionally useful, in terms of size and complexity, for teaching in semester- or quarter-long classes.

30. For classes that draw directly on library collections and generate data, ask the students to submit their data products back to the library, through the institutional repository. Normalize the process of giving back and augmenting the collections with data. This may work particularly well for collections that are institutionally or regionally focused.

31. Identify a faculty member who does computational analysis for their own research and find a way to transfer or replicate the tools and approaches they use to apply them to a library collections-as-data use case.

32. If you offer an API to your repository, evaluate the public-facing documentation to see if it is clear, current, accurate, and discoverable by researchers.

33. Publish documentation about how to find, use, and interpret collections as data in multiple places including blogs, README files, and LibGuides.

34. A dataset should always be accompanied by a README plain text file that documents basic, important information about the data. Make READMEs part of your data documentation practice.
Develop one or more templates that can be used by librarians and researchers.

35. Make an effort to make existing OCR output generated from past scanned text collections projects more available for computational analysis, such as through bulk download.

36. When planning your next digitization project, incorporate additional steps for preparing content files, OCR or transcription text, and metadata for bulk access. Document the key issues and decision points you encounter as you evolve and expand your digitization workflows.

37. Talk to colleagues involved in taking in deposits to your institutional repository or research data repository about a process for encouraging and accepting contributions back from users of data in your collections.

38. Gain the support of administration by following and supporting the work of third-party research groups like OCLC that help bolster and highlight the trends in the development of collections as data.

39. Provide a resource that shows a data user how to cite a dataset, and that shows a data creator how to format a preferred citation for an original dataset and a derivative dataset.

40. Ask a subject specialist at your institution if faculty or students are requesting data about or derived from library collections.

41. Take a public services librarian, curator, or archivist out for coffee to talk about collections as data. Ask what they are hearing from faculty, students, and other users of collections about computational use and which collections have potential for taking action to lower barriers to computational use.

42. Investigate how your library is collecting, managing, and making email archives accessible. Consider whether a collections as data approach will serve your institution's goals.

43. Start small. Start with a research question, and choose projects that have promise to be generalizable for use by future scholars such that the investment is worth the level of commitment. No one-offs!

44. Start with a prototype or proof of concept. It's fine if your Collections as Data project does not integrate with an institutional repository or formalized infrastructure.

45. Collaborate with subject specialists or instruction librarians to ask scholars about interest in computational data in and about collections. Compile their ideas to make a case, and build a team for the next opportunity to pursue one of them.

46. Be thoughtful and strategic about allocating scarce resources to collection digitization projects. Consider prioritizing projects that produce outcomes that are reusable (derivative datasets) and repeatable (processes, tools, workflows) that can benefit your department and your users again and again.

47. Explore what it would take for your organization to contribute subject data to Wikidata, drawing on a local collection and then incorporating the Wikidata links into your local discovery environment.

48. Test how data gathered in a crowdsourcing project can be associated with the existing source object data and can also serve as a stand-alone dataset.

49. Use your favorite search engine to find information about APIs provided by museums and read about the various ways that data about museum collections can be analyzed to discover new insights.

50. Keep tabs on the projects emerging in the Collections as Data: Part to Whole project (https://collectionsasdata.github.io/part2whole/), funded by the Mellon Foundation. They are bound to point a way forward for us all!
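Items 2 and 5 above (a simple, maintained inventory of full-text collections) can be sketched in a few lines of Python. This is only an illustration, not a recommended schema: the columns follow the attributes item 2 names (rights status, license status, discoverability, downloadability), while the class name, function names, and the two sample entries are hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CollectionRecord:
    """One row in a simple inventory of full-text collections (hypothetical schema)."""
    title: str
    rights_status: str
    license_status: str
    discoverable: bool
    bulk_downloadable: bool


def inventory_to_csv(records):
    """Serialize the inventory as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(CollectionRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


def needs_attention(records):
    """Titles of collections that cannot yet be both found and downloaded in bulk."""
    return [r.title for r in records if not (r.discoverable and r.bulk_downloadable)]


# Hypothetical entries, for illustration only.
inventory = [
    CollectionRecord("Campus newspaper, 1900-1950", "public domain", "CC0", True, False),
    CollectionRecord("Oral history transcripts", "in copyright", "none stated", True, True),
]
csv_text = inventory_to_csv(inventory)
```

Keeping the list as a plain CSV file means it can be opened in a spreadsheet, versioned alongside other documentation, and extended with new columns as local needs become clearer.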
Appendix 5: Collections as Data Methods Profiles

CAD Methods Profiles are designed to help people who work in libraries, archives, and museums gain a better understanding of common research methods that make use of cultural heritage collections for computational analysis. Of course, these descriptions are simplified versions of the methods, presented mostly in the context of their implications for the creation, description, packaging, or distribution of collections as data. Profiles should be used in the context of the principles articulated in the Santa Barbara Statement on Collections as Data.

Text Mining

Laurie Allen and Scott Enderle, University of Pennsylvania

1. What is it?

Looking for patterns in text. Generally, text mining is done on a corpus of texts rather than a single text. Finding and assembling a corpus that is appropriate to the research needs of a project can be one of the trickiest and most time-consuming things that a researcher does when approaching a project. There is not currently an agreed-upon standard for describing or sharing text corpora, though there are a variety of guides to finding them, and vendors who sell access to text that researchers can assemble to create a corpus.

See a few definitions and links:

● Drucker, Johanna. Data Mining and Text Analysis - Introduction to Digital Humanities. Accessed August 27, 2018.
● Underwood, Ted. Seven Ways Humanists Are Using Computers to Understand Text. The Stone and the Shell (blog), June 4, 2015.

2. Who uses it?

Text mining is used across humanities disciplines (notably language and literature departments, and history) and in the social sciences, especially political science, communications, and business. There are also text corpora used in machine learning applications as well as linguistics. Disciplinary uses of text mining vary both in method of analysis and, importantly, in the kinds of texts included in the corpus of study. For example, a corpus of the front-page articles of current major newspapers might be valuable to a political scientist, while a scholar of 19th-century English novels might want a corpus of literary reviews.

3. What form of data is most useful for this method?

Generally, researchers doing text analysis will want to use plain text (i.e., machine readable, but without markup) in large quantities. They will also need accompanying metadata at a variety of scales. That is, sometimes they’ll want metadata at the book/article level, or at the collection level, and for some uses, it is helpful to have chapter- or section-level metadata. In linguistic uses, analyses of texts sometimes include annotations down to the specific phoneme level, which makes linguistic corpora less widely produced by libraries/archives/museums.

4. What might researchers explore when they’re text mining?
They might look for word frequency counts (how often is a particular word used) at the page, article/chapter, or volume level, or use those counts for further analysis. For that reason, a dataset of frequency counts, even in the absence of full text, is often useful, especially in cases where the full content of a corpus cannot be made available because of copyright restrictions.

Researchers often look for patterns in the data as they relate to features in the metadata (for example, how does the frequency of a word in texts change over time). Reliance on both the metadata about each text and the texts themselves makes it important for researchers to know about large inconsistencies in the data or metadata quality. For example, if the OCR quality is inconsistent across a collection, it is very useful to include standard metadata about OCR quality for each text, if it is known. Or, if cataloging or metadata creation practices changed over time, those changes should be noted so that researchers can account for them in their analyses.

In some cases, people are interested in the locations of words on pages. If an OCR program has included information about bounding boxes, it would be nice to have multiple versions: one with bounding boxes, and the other without.

5. Common tools used for text mining

Most people who do text mining are using scripting languages like Python or R.

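As a rough illustration of the word frequency counts described in section 4, the following Python sketch tallies words per document and then tracks one word's relative frequency by year. The three-record corpus, its field names (id, year, text), and the helper functions are hypothetical, invented for this example. A real project would likely also need to account for OCR noise, stop words, and inconsistent metadata.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-corpus: each record pairs plain text with item-level metadata.
corpus = [
    {"id": "doc1", "year": 1850, "text": "The railway reached the town. The town grew."},
    {"id": "doc2", "year": 1850, "text": "A railway strike began."},
    {"id": "doc3", "year": 1900, "text": "The town council met. No railway news."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(records):
    """Return a Counter of word frequencies for each document id."""
    return {rec["id"]: Counter(tokenize(rec["text"])) for rec in records}


def relative_frequency_by_year(records, word):
    """Return the share of all tokens in each year that match `word`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        tokens = tokenize(rec["text"])
        hits[rec["year"]] += tokens.count(word)
        totals[rec["year"]] += len(tokens)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


counts = word_counts(corpus)
by_year = relative_frequency_by_year(corpus, "railway")
```

The per-document counts in `counts` are the kind of derived dataset that could be shared even when the full text cannot, while `by_year` depends on the metadata being reliable enough to group texts by date.
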
Beyond that, there are a few other tools useful for analysis and teaching, like:
● Voyant  https://voyant-tools.org/
● AntConc  http://www.laurenceanthony.net/software/antconc/  (See also Heather Froehlich's AntConc lesson on Programming Historian: https://programminghistorian.org/en/lessons/corpus-analysis-with-antconc)
● Topic Modeling Tool  https://github.com/senderle/topic-modeling-tool
● Mallet  http://mallet.cs.umass.edu/

6. Things to look out for when preparing collections for text mining

Copyright: This is a big one, for obvious reasons. Where full text cannot be provided, some libraries provide word counts or other analytics about the texts.

Documentation of text and metadata: Multiple versions of texts can be a big source of frustration or confusion in text analysis. For example, a series of reports might have the same first page, which is duplicated across all reports. Flagging those kinds of duplications can be valuable in helping researchers cut the preparation time needed to make a corpus usable.

7. Examples of this method in use

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.019.

Barron, Alexander T. J., Jenny Huang, Rebecca L. Spang, and Simon DeDeo. “Individuals, Institutions, and Innovation in the Debates of the French Revolution.” Proceedings of the National Academy of Sciences, April 17, 2018, 201717729. https://doi.org/10.1073/pnas.1717729115.

8. Examples of collections optimized for this use

“Documenting the American South: DocSouth Data.” Accessed August 27, 2018. https://docsouth.unc.edu/docsouthdata/.

Chronicling America: https://chroniclingamerica.loc.gov/

La Gaceta De La Habana: https://merrick.library.miami.edu/cubanHeritage/cubanlaw/lagaceta.php

Network Analysis

1. What is it?

Network analysis supports quantitative and qualitative study of relationships between entities. Entities can be people, places, or things. Network analysis is especially helpful for studying multiple levels of complex systems.

A few resources and links:

“Network Analysis: Lesson Directory.” Programming Historian. https://programminghistorian.org/en/lessons/?topic=network-analysis

Easley, David, and Jon Kleinberg. Networks, Crowds, and Markets: A Book by David Easley and Jon Kleinberg. Accessed May 14, 2019. https://www.cs.cornell.edu/home/kleinber/networks-book/.

Locke, Brandon. Humanities Data Curation Record. Network Graphs and Network Analysis. 2017. Reprint, Data Praxis, 2018. https://github.com/datapraxis/hdcr.

2. Who uses it?

Network analysis is used across a wide range of communities with some variation in terminology based on discipline. While social network analysis is popular, network analysis is also used to study physical infrastructure, e.g. transmission of energy through an electrical grid, or the flow of traffic. It can also be used for fictional characters in plots. In business, network analysis is used to study how organizations form and how money transfers from one place to another. It is also used, famously, in recommendation engines.

3. What form of data is most useful for it?

Researchers need relational information for network analysis, which can be found in many datasets.
However, not all networks are useful for analysis, so there can be a fair amount of exploration in finding network datasets. The most basic forms of data for network analyses simply require that each record includes two entities and a relationship. For example, consider a simple spreadsheet with many rows and three columns, where in each row one person (entity) sent a letter (relationship) to another person (entity), or one publication (entity) was authored (relationship) by a person (entity). Other data can become part of network analysis as well, but the simplest notion of the network simply requires entities and relationships.

4. What data features might researchers explore?

After establishing whether network analysis is the right method, researchers might explore the size of a particular network, either by counting the number of nodes (entities) or the number of edges (relationships). They might ask what percentage of the network is isolated from the rest. They may also look at network-level measurements: who is most central, and who are the most important conduits? Which people, places, or things have the easiest access to the outer bounds of the network? They may look at the clustering coefficient: do relationships in the network tend to clump together, or are they fairly diffuse?

5. Common Tools

Palladio  http://hdlab.stanford.edu/palladio/  (for very lightweight exploration of networks, designed for historical data)
Cytoscape  http://www.cytoscape.org/
Gephi  https://gephi.org/
NodeXL  https://www.smrfoundation.org/nodexl/
Pajek  http://mrvar.fdv.uni-lj.si/pajek/

6. Examples of this method in use

Warren, Christopher N., Daniel Shore, Jessica Otis, Lawrence Wang, Mike Finegold, and Cosma Shalizi. “Six Degrees of Francis Bacon: A Statistical Method for Reconstructing Large Historical Social Networks.” Digital Humanities Quarterly 010, no. 3 (July 12, 2016).

Moravec, Michelle. “Network Analysis and Feminist Artists.” Artl@s Bulletin 6, no. 3 (November 30, 2017). https://docs.lib.purdue.edu/artlas/vol6/iss3/5.

White, Howard D., and Katherine W. McCain. “Visualizing a Discipline: An Author Co-Citation Analysis of Information Science, 1972–1995.” Journal of the American Society for Information Science 49, no. 4 (1998): 327–55. https://doi.org/10.1002/(SICI)1097-4571(19980401)49:4%3c327::AID-ASI4%3e3.0.CO;2-4.

Bibliography of Historical Network Research  http://historicalnetworkresearch.org/bibliography/

7. Examples of collections optimized for this use

The following sources provide directories of network data:

“CASOS Tools: Network Analysis Data | CASOS.” http://casos.cs.cmu.edu/tools/data2.php.

“Index of Complex Networks.” Index of Complex Networks. http://icon.colorado.edu/.

“Stanford Large Network Dataset Collection.” http://snap.stanford.edu/data/index.html.
Appendix 6: National Forum Position Statements
March 2017

Forum participants were asked to respond to the following prompt:

Leading up to the forum, [we] ask that you write a brief position statement derived from direct or related experience salient to the scope of work described in Always Already Computational. We welcome bridging, divergence, and provocation. Is there something concrete or conceptual we are missing? Are there projects and initiatives this work should be connected to? Are there questions and communities we aren't currently considering? This is an opportunity to highlight aspects of your experience that relate to the project and will to some extent help stage interaction at the face-to-face meeting - and beyond - as the project team works to iteratively refine forum outputs in a range of professional and disciplinary communities.

Perspectives represented in the position statements highlight the many directions collections as data work could go. The statements certainly informed the work of the forum, and consequently the iterative, community-based development of project outcomes.

Pseudodoxia Data: our ends are as obscure as our beginnings

Jefferson Bailey, Internet Archive

In his meditation on oblivion and regeneration, W.G. Sebald writes, “on every new thing there lies already the shadow of annihilation.” Contemplating collections as data evokes a similar correlation -- one where transformation (“this as that”) is less a process of alteration and more one of extraction of key, but possibly opaque, preexistent characteristics (“these from those”). When we consider the computational availability of collections, we begin from a perspective in which collections are an amalgamation of fragmentary elements -- and their decomposition is neither affordance nor flaw, but instead a natural state of flux that allows them to be contextualized anew through a continual state of reconstitution and derivation.
This prevailing logic of decomposition distinguishes collections not as data but instead as pieces and processes, with attendant opportunities and entanglements -- collections and data become inseparable, commingled not in operation but instead via a type of consanguinity. Likewise, our services supporting computational access to data should match this latent consanguinity.

As a large-scale, online digital library that is also a mission-driven, nonprofit technology developer, the Internet Archive has long approached collections as data. Because the Archive is fully online, with no physical reference collections other than those intended for digitization, its collections and data are so intertwined as to be indivisible in concept, technology, or use. The Internet Archive's collections include more than 30 petabytes of unique data, and the Archive has supported computational use of these collections since its beginning, in projects as wide-ranging as semantic analysis of television closed-caption transcripts and network graph study of linking behavior across hundreds of terabytes of web data. In addition, and as a self-sustaining non-profit, the Internet Archive has facilitated this type of research through a service-oriented and sustainable program development approach. Developing data-driven approaches to access and binding them to scalable, sustainable programs has elucidated many of the obstacles and potential solutions that emerge from this work.
Questions that have emerged:

● How can computational research services create better pathways to interpretation through tools and methods for the smooth traversal between “reduction and abstraction” inherent in derivation and aggregation?
● How can new access models help researchers have greater comfort with technical mediation at multiple levels and with an increasing distance between the granularity and totality of the object(s) of study?
● How can programs address the challenges still inherent, even with derived datasets, of limited technical proficiency and local infrastructure?

In testing multiple models internally, and surveying and collaborating with similar efforts in the community, we developed a loose typology of program models for research services, oriented towards, but not exclusive to, very large born-digital collections such as web archives.

● Bulk Data Model: The totality of a domain, global-scale crawl, or large born-digital collection is transferred to researchers via data shipped on drives. Analysis takes place locally, usually in a researcher's own high-performance computing environment.
● Cyberinfrastructure Model: A custodial/archival institution provides free/subsidized access to its own computing environment that is pre-loaded with data, VMs, and other tooling. Researchers can do analysis in this remote environment and export results.
● Roll Your Own Model : Researchers receive support, generally in the form of funded or                            sponsored services, to create their own tools and leverage existing data platforms for candidate                            collection building and analysis.  ● Programming Support Model: Researchers, generally non‑technical, are given time with                    specialized technical support staff (engineers) to collaboratively build or aggregate datasets and                        perform analysis.  ● Middleware Model : The creation of specific tools and platforms that operate between data                          hosted with a custodian and advanced analytics tools maintained externally.  ● Derivative Model : Provide pre‑defined datasets that contain key extracted, derived, or                      pre‑analyzed data culled from specific resources. The derived datasets support specific research                        questions, are fungible, and align data and delivery with researcher need.    While the Internet Archive has pursued many of these models, the most flexible and scalable has proven                                  to be the derivative model, in which key elements are extracted from primary resources and packaged in                                  simple but easy‑to‑use datasets. This preference was the result of many lessons learned in working to                                support computational use of extremely large digital collections.     ● Services for computational access are more successful when built on top of, or expanded from,                              pre‑existing internal systems, processes, and infrastructure. Modular, generalized, and                  interoperable are preferred and boutique services don’t scale.  
● Research services should be flexible and, most importantly, content delivered should be                        disposable to the providing institution and be able to be recreated by existing, ongoing pipelines                              or frameworks.  ● Focus on derivation (extract desired data from origin), portability (processes should work on                          multiple content types or in many areas of the workflow) , and access (ease of transfer of data to                                      recipient and ease of use by the recipient).  ● Focus on scalable partnerships & decentralization in research service support.  ● Researcher expectations often are not aligned with available custodial resources or services and                          research methodologies (conceptual, practical, technical) often are not aligned with target data                        characteristics, acquisition methods, or management tools.  ● Service models must be self‑sustaining and scale. No “grant then gone.”  ● Continually orient towards mutually reinforcing work, be it with collaborators or researchers,                        and always allow for generality, in partners, technologies, and models.    Discovering how these lessons and approaches match, contest, or augment the findings of other efforts                              will be a particularly informative result of the “Collections as Data” forum.              
Experiencing Library Collections as Data
Alexandra Chassanoff, Massachusetts Institute of Technology

Recent empirical research has confirmed that digital tools and technologies are fundamentally changing how scholars work.[1] Yet the inverse of this relationship has received little attention – how is infrastructure changing to support emergent scholarly practice?[2] As you note in your grant narrative, “Predominant digital collection development focuses on replicating traditional ways of interacting with objects in a digital space.” Indeed, much of the research examining how scholars find, access, and use materials in digital collections has paid little attention to qualitative factors about the interaction between collection users and environmental aspects.[3]

My doctoral research focused on this problem – exploring how scholars were searching for, accessing, and using digitized archival photographs as forms of historical evidence. An underlying objective of my research was to explore the interpretive and evaluative practices that scholars bring to bear on non-textual objects of humanistic inquiry. The intent was to think about how digitized photographs can function as data, and to provide a perspective on what makes interactions meaningful for scholars working with digital materials.

In my role as the project manager on the BitCurator (https://www.bitcurator.net/) and BitCurator Access projects, I worked with scholars and archivists to develop approaches and methodologies for accessing and using born-digital materials. At the close of each project, I recall thinking that technology was hardly the difficult part of our work. Rather, the challenges we faced seemed to be conceptual in nature. How might we envision ways to access born-digital materials? Relatedly, how might we use born-digital materials in our research? What kinds of questions could be asked and answered from examination of contents of the so-called black box?

It seems that we face a similar challenge in considering library collections as data. I am grateful that this forum is explicitly seeking to address this gap, particularly through the enlistment of a diversity of players in the cultural heritage community. Technologists, librarians, museum professionals, archivists, and scholars will contribute important and unique perspectives to this conversation. Strategic approaches that facilitate access to, and preservation of, library collections as data will need to consider the constant and shifting interplay between infrastructure and emergent scholarly practices. For example, recent research has shown that scholars are using Google Image Search to locate archival photographs. Traditional archival design approaches may not accommodate the serendipitous possibilities of digital space.

In thinking about ways to facilitate use and reuse, I hope to draw on my current research as a CLIR/DLF Software Curation Postdoctoral Fellow. Since October, I have been working at the MIT Libraries to investigate and make recommendations for how institutions can manage software as complex digital objects across generations of technology. Software is another type of “data”, albeit one with implicit constraints for access, use and reuse. Researchers rely on software for a variety of research activities – as a subject of research itself, a way to operationalize methods, or to reproduce and validate previous results. Institutions are increasingly tasked with activities related to the active management of software: from creation through use, dissemination, preservation and reuse. Institutional approaches to software collection development must consider software in a variety of contexts: at an intellectual level (e.g. selection and appraisal); in planning for and designing repositories, platforms, services; and in developing staff competencies.

How can we accommodate the fluid and rapidly changing practices which characterize the current scholarly landscape? The results of my dissertation research suggest that one part of the puzzle might be to develop an understanding of the factors and qualities that make experiences meaningful in different kinds of interactions.
For example, what is it about the experience of (digitized) oral histories that makes them accessible and usable? Rather than focusing on delivery mechanisms or crafting explicit methodological approaches, we might do well to consider the myriad ways in which specific types of materials in digital library collections can be experienced.

Works Cited

[1] Alexandra Chassanoff, “Historians and the Use of Primary Source Materials in the Digital Age,” The American Archivist 76, no. 2 (2013): 458-480; Jennifer Rumer and Roger C. Schonfeld, Supporting the Changing Research Practices of Historians, Final Report from ITHAKA S+R (2012), 11.

[2] The important relationship between infrastructure, technology, and scholarship is explored in Christine Borgman's Scholarship in the Digital Age: Information, Infrastructure and the Internet (Cambridge: MIT Press, 2007).

[3] Two notable exceptions in the field of Library and Information Science (LIS) are: Marcia Bates, “The Cascade of Interactions in the Digital Library Interface,” Information Processing and Management 38, no. 3, 2003; Christopher A. Lee, “Digital Curation as Communication Mediation,” in Handbook of Technical Communication, ed. Alexander Mehler, Laurent Romary, and Dafydd Gibbon (Berlin: Mouton De Gruyter, 2012), 507-530. http://www.ils.unc.edu/callee/p507-lee.pdf

Unsolved Problems in the Humanities Data Generation Workflow: Digitization Complexities, Undiscoverable Audiovisual Materials, and Limited Training for Information Professionals
Tanya Clement, University of Texas Austin

Digital Humanities has changed rapidly from a field in which we primarily build and create access to resources in the humanities to a field in which we deploy analytics on those resources in accordance with a general move to data analytics. The Always Already Computational initiative is taking an essential step towards bridging the first activity (digitization) to the second (analytics) by focusing on how we structure, bundle, and disseminate digitized or born-digital collections and metadata on such collections. This is important and much needed work, but there are three main areas of concern or “unsolved problems” that I would like to introduce into the conversation for the consideration of the group: (1) digitization workflows; (2) AV metadata; and (3) pedagogy in terms of training information professionals about data science, data analytics, and data visualization.

Digitization workflows are where much library collection “data”, such as descriptive or technical metadata, is born, but these workflows are complicated processes that include selecting collections; establishing performance goals based on standardized measurement protocols; developing efficient test plans; and taking corrective action to maintain quality.
Even as cultural heritage institutions continue to rapidly digitize                              and refine these workflows, our knowledge about new approaches to digitization standards, to schemas for                              the semantic web, and to increasing our regard for issues of diversity and inclusivity in the digitization of                                    cultural heritage artifacts continues to evolve. Newly issued guidelines from FADGI[1] – an initiative                            incorporating many entities at the Library of Congress – challenge librarians and archivists to improve image                                quality precisely when pressures to digitize everything including collections that embody inclusivity are                          building. Consequently, much of the metadata that we may use in a data framework has been generated                                  during an evolving and complex digitization process, which is often a time of increased one‑time funding for                                  the specific digitization job. To what extent will the guidelines that we generate during Always Already                                Computational take digitization workflows into account? Can we advise libraries and archives on how an                              understanding of an eventual data framework can be integrated into these workflows such that when                              requests for funding are made our colleagues can anticipate generating the kinds of data that we will need                                    for a data access environment?     
Second, and a case in point for the first “unsolved” problem, audiovisual materials are notoriously underrepresented in digital humanities precisely because they often lack the detailed data (or metadata) that supports their effective discovery, identification, and use by researchers, students, instructors, or collections staff. In recent years, increased concern over the longevity of physical AV formats due to media degradation and obsolescence, combined with the decreasing cost of digital storage, has led libraries and archives to digitize recordings for purposes of long-term preservation and improved access. However, unlike textual materials, for which some degree of discovery may be provided through full-text indexing, AV materials that lack detailed metadata cannot be found, understood, or consumed. Most open source and commercial efforts that attempt to generate computationally-assisted metadata and to facilitate improved discovery are narrow in focus, non-scalable, and developed as standalone tools, and they do not address the rights and permissions that collections staff must consider for creating access.
Because of the complicated morass of technical and social issues that limit AV discovery, descriptive access to audiovisual objects at scale would require a variety of mechanisms for analysis that would need to be linked together with tasks involving human labor in a recursive and reflexive workflow platform that could eventually facilitate compiling, refining, synthesizing, and delivering metadata. Colleagues from Indiana University and AVPreserve and a team of researchers at UT, including myself, are in the process of developing such a workflow platform, which would allow libraries and archives to bring together and use task-appropriate tools in a production setting. This work is in direct conversation with the kind of framework that Always Already Computational is proposing, but we believe that AV needs, which include generating data about AV materials as the sole means of providing access to materials that may never (because of privacy and copyright concerns) be publicly accessible, are distinct from, though complementary with, those needs that correspond to generating data for text collections.
Third, while information literacy is today a routine goal of library instruction, data work that includes                                enabling data discovery and retrieval, maintaining data quality, adding value, and providing for re‑use lags as                              a topic.[2] If the library is the laboratory of the humanities, this lag impacts how the digital collections that                                      librarians curate are used in the humanities. Rigorous data work requires data “carpentry” knowledge that                              considers validity, reliability, and usability as well as critical literacies more generally such as data quality,                                authenticity, and lineage, but humanists and librarians are not traditionally trained on evaluating these                            aspects of data. The corresponding difficulty of training students and professional academic librarians lies in                              the ever‑evolving nature of data work, which must respond to changing standards and needs in the context                                  of increasing data in the humanities and of changing infrastructures in libraries. There is work being done in                                    this space including the Data Science Curriculum Project, which is meeting just after the Always Already                                Computational meeting in Washington DC with representatives from the American Statistical Association                        (ASA), the ASA Business‑Higher Education Forum (BHEF), the Association for Computers and the Humanities                            (ACH), the Association for Computing Machinery (ACM), the Association for Information Systems (AIS), the                            IEEE Computer Society (IEEE‑CS), INFORMS, the iCaucus, EDISON, and the American Association for the                            Advancement of Science (AAAS). 
In addition, many programs in Data Science have emerged in recent years at many universities and in many iSchools, but there are few programs of study that focus specifically on teaching students with concerns shaped by the humanities in the context of humanities collections. Conversations on data science pedagogy are needed to ensure the integration of up-to-date resources, theories, and practices in data work in a curriculum that will be geared towards inclusivity and teaching the next generation of our digital workforce about data preparation and analysis in the humanities. Again, this work is directly relevant to the Always Already Computational conversation since the data framework proposed requires practitioners who also have some training in data work.

Works Cited

[1] Federal Agencies Digitization Guidelines Initiative. Technical Guidelines for the Still Image Digitization of Cultural Heritage Materials. September 2016. http://www.digitizationguidelines.gov/.

[2] Association of College and Research Libraries. Working Group on Intersections of Scholarly Communication and Information Literacy. Intersections of Scholarly Communication and Information Literacy: Creating Strategic Collaborations for a Changing Academic Environment. Chicago, IL: Association of College and Research Libraries, 2013.

Computing in the Dark: Spreadsheets, Data Collection and DH’s Racist Inheritance
P. Gabrielle Foreman and Labanya Mookerjee, University of Delaware

Living in a nation of people who decided that their world view would combine agendas for individual freedom and mechanisms for devastating racial oppression presents a singular landscape.
- Toni Morrison, Playing in the Dark

Early on in the “Always Already Computational” abstract this assertion appears, underscoring a central assumption of the project: “predominant digital collection development focuses on replicating traditional ways of interacting with objects in a digital space. This approach does not meet the needs of the researcher, the student, the journalist, and others who would like to leverage computational methods and tools to treat digital library collections as data.” Not only do the protocols and development of digital collections, of interacting with objects, not meet the needs of various users (let’s call them people or communities) who interact with “objects in digital spaces,” the lexicon itself reproduces particularly freighted ideas for Black communities of researchers and students, many of whose ancestors entered the West as chattel property, as people who were both called objects and “leveraged,” that is, bartered, mortgaged, sold, and listed as such.
In the US, this is true for the almost 250 years of municipal, census, and other records which make up collections and archives during slavery, for records that document the debt peonage that characterizes Jim Crow, and, one might argue, for ways in which Black people are accounted for in a prison industrial complex that again treats members of communities as things to be categorized, as surveilled and recorded objects.

The lexicon of digital collections extends the freighted, fretted relation of categorization and data collection to Black subjects and Black subjectivity. The term “item,” like “object,” again recalls the ways in which Black people appear/ed in public records: as items on manifests, as “losses” on insurance claims, and again as items for sale in newspapers or to be distributed in probate. “Fortune” was an 18th-century Connecticut enslaved man whose very name announces his relation to the capital production, the wealth and fortune, he was meant to produce for his enslaver, Dr. Preserved Porter (this is not a typo). When the doctor died not long after Fortune did, Fortune appeared in probate records as a skeleton the doctor had made from his body, claiming him in death as in life, and literally transforming him into both material object and intellectual prop and property. Fortune’s own wife, Dinah, still enslaved by the family, was worth less as a living, sentient being in those records than her husband’s skeleton, a skeleton she may have had to dust or clean, the bones of a husband she could not bury.
Likewise, the spreadsheet opens up complex analogies to the ledger, as Labanya Mookerjee, a former exhibits committee co-chair for the Colored Conventions Project, writes in her “Disrupting Data Viz. & the Colored Conventions Project: Interrogating Data Management Methods through Disability Studies” (http://disruptingdataviz.tumblr.com/post/144639429214/introduction-to-disrupting-data-viz-the-colored), a piece she wrote and published on tumblr for a graduate seminar led by P. Gabrielle Foreman. Storing data in spreadsheets powered by programs such as Microsoft Excel introduces an additional layer of complications; spreadsheets, as bookkeepers of capitalism, can be traced directly to the history of slave trader ledgers (http://t.umblr.com/redirect?z=http%3A%2F%2Fwww.slate.com%2Fblogs%2Fthe_vault%2F2014%2F04%2F07%2Fslave_trader_ledger_william_james_smith_accounting_book.html&t=OWNiZjVjZjlhYTU2NWMxNTkwMGFiM2U4Zjk3NTY2MzA4ODFiMjQ5OCw2UzhhSFppNQ%3D%3D). The violence of this history runs the risk of being replicated if we continue to use conventional methods of storing data. As many DH critics have now pointed out, the institutional power invested in the process of data collection (the prelude to data visualization) can be discussed alongside
conversations on the power in the production of the archive. Computational activity “is contingent on the availability of collections that are tuned for computational work (Hughes 2014),” as the Always Already Computational abstract asserts. “Suitability is predicated on form, integrity, and method of access (Padilla 2016).” This points us to the hegemonic logic guiding the selective operations in knowledge production that has been interrogated through studies on the archives (Trouillot) and in data visualization (Drucker). Both Trouillot and Drucker make a DH community (attuned to archive production as well as archive availability) aware of the need to name the difference between “capta” and “data” and to challenge and counter the institutional powers that authorize “credibility” or “suitability” (Padilla).

Datasets, when constructed using conventional methods of data collection and organization, run a similar risk of activating institutional power and defining “credibility,” especially when the data is procured from traditional archival sources that too often excise, anonymize, and erase certain subjects, transmogrifying them in turn into (almost invisible, ghosting) “objects” and “items.” Two examples from the Colored Conventions movement obtain. First is the challenge of including Black women whose names and participation are excised when we use traditional methods of collecting and naming data (from the lists of thousands of delegates over seven decades).
Curating a dataset that is reflective of the actual history of women’s involvement has prompted CCP to revisit the logic used to develop the parameters of what qualifies as “participation,” extending the definition from appearing in the minutes, to attendance at the gatherings, and to hosting and curating conversations (following Psyche Williams-Forson) at boarding houses, eateries, etc., where women’s presences or imprints appear. A second example is the work that Jim Casey, co-founder of CCP, has done on social network analyses and data visualization between Colored Conventions and the Underground Railroad (UGR), showing a surprising lack of overlap and co-attendance. “All of this data is vexed,” asserts Casey, “shaped by centuries of decisions based on racial hierarchies about what to record, store, and reproduce.” Casey uses Siebert’s “Directory of the [3000] Names of Underground Railroad Operators,” included in his Underground Railroad (1898), and Boston Public Library’s Anti-Slavery Collection Data. These sources hew to a historical imaginary that places whites at the center of the UGR and that excises Black leadership and involvement; a corrective has just begun to appear in recent scholarship and has not produced a directory as of yet. Based on racially hegemonic raw data, the co-attendance visualizations don’t capture Black UGR involvement by default.

This leads us to this set of questions.
How do we account for (new, collective) data collection that accounts for haunting imprints and outright absences in the archives upon which we depend? What are the implications of a lexicon and set of practices/tools that rely upon and reproduce a colonial language of power and entitlement in the digital humanities as we think collectively about best practices to “leverage computational methods and tools to treat digital library collections as data”?

Frictionless Collections Data
Dan Fowler, Open Knowledge Foundation

Data Package is a containerization format for all kinds of data. It provides a framework for “frictionless” data transport by specifying useful metadata that allows for greater automation in data processing workflows. The aim is to provide the minimum amount of information necessary to transfer data from one researcher to another, and, likewise, one data analysis platform to another. After several years developing these specs for general use, it is worth directly examining the extent to which library and museum collections data are amenable to this approach.

New approaches to publishing library and museum collections data are necessary. Such data, released on the Internet under open licenses, can provide an opportunity for researchers to create a new lens onto our cultural and artistic history by sparking imaginative re-use and analysis.
For organizations like museums and libraries that serve the public interest, it is important that data are provided in ways that enable the maximum number of users to easily process them. Unfortunately, there are not always clear standards for publishing such data, and the diversity of publishing options can cause unnecessary overhead when researchers are not trained in data access/cleaning techniques.

One approach for publishing collections data is via an API (Application Programming Interface) on a record-by-record basis. This approach has its advantages: the data is likely structured and well described. However, these services may not map directly to the types of queries or analyses researchers need to run. Further, for both the researcher and publisher, it can be tedious and costly to provide large amounts of collections data delivered record-by-record. For certain use cases, it is preferable to publish data in bulk format in open standards like CSV or JSON. The Metropolitan Museum of Art and Tate Gallery, for instance, have released their collections data as sets of text-based files on GitHub. In this approach, associated documentation is provided via files named by convention, for example, “README” or “LICENSE”. This method of publishing allows users to load data into their own tools without the overhead of programming against an API.

Documentation for data published in bulk is often ad hoc. There is often no clear or rigorous documentation of the fields (what types of data are in each column).
Reading such data into data analysis programs using the built-in CSV ingest mechanisms yields data divorced from context: common date and boolean (“TRUE/FALSE”) columns must be explicitly assigned as such, numeric identifiers may be incorrectly loaded as integers (losing, for example, any leading zeros), etc. These datasets are often exported from in-house collections database software, and small errors in the translation of these often large datasets may go unnoticed.

Data Packages for Collections

Frictionless Data, developed in the open by Open Knowledge International and members of the open data community, is an ideal framework for publishing this type of bulk data. The Data Package format, requiring only the addition of a descriptor file called datapackage.json, provides a minimally invasive but standardized way to provide clear and machine-readable metadata. Datasets created as Data Packages can later be easily exposed as APIs given the wealth of metadata provided.

https://gdstechnology.blog.gov.uk/2017/02/03/providing-access-to-datasets-through-apis/
https://github.com/metmuseum/openaccess
https://github.com/tategallery/collection
http://frictionlessdata.io/

As an example, the Carnegie Museum of Art in Pittsburgh, Pennsylvania has provided its collections data as a downloadable Data Package. Providing the data in this format yields several benefits:

1. Users are provided with useful metadata to allow for easy import into their preferred analysis tool. These explicitly defined column types and metadata can eliminate some of the tedious work involved in “wrangling” a dataset.
2. Publishers can use tooling like Good Tables to automatically validate data.
3. Basic documentation for how to use the dataset (e.g. what columns mean) can be automatically created from structured metadata.
4. Collections data can be licensed in a machine-readable manner.
5. In the absence of Data-Package-aware tooling, the original data can be read/written as usual.

Over the course of this year, with the continued support of a grant from the Sloan Foundation, we are looking to work with researchers and institutions across a variety of fields to pilot the use of the specifications. This may involve building tools and writing guides to analyse, validate, and/or visualize collections data. Through this process we hope to improve the specifications more generally while also providing useful tooling for researchers in digital humanities.
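The role of the descriptor can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Frictionless Data tooling itself: it uses only the standard library, the sample CSV and its column names (object_id, accession_date, on_view) are invented, and the CASTERS table and load_rows helper are hypothetical names that cover just four field types in a simplified way.

```python
import csv
import io
import json
from datetime import date

# Hypothetical sample: a tiny collections export (invented values and columns).
CSV_TEXT = """object_id,title,accession_date,on_view
00042,Untitled,1921-05-03,TRUE
00107,Study of a Hand,1898-11-20,FALSE
"""

# A minimal descriptor in the spirit of datapackage.json: one resource whose
# schema names each column and declares its type, plus a license entry.
DESCRIPTOR = {
    "name": "example-collection",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "objects",
            "path": "objects.csv",
            "schema": {
                "fields": [
                    {"name": "object_id", "type": "string"},
                    {"name": "title", "type": "string"},
                    {"name": "accession_date", "type": "date"},
                    {"name": "on_view", "type": "boolean"},
                ]
            },
        }
    ],
}

# Simplified casters for illustration only; real tooling handles many more
# types, formats, and missing-value conventions.
CASTERS = {
    "string": str,
    "integer": int,
    "date": date.fromisoformat,
    "boolean": lambda value: value.strip().lower() == "true",
}


def load_rows(csv_text, descriptor, resource_index=0):
    """Read CSV text and cast each column using the descriptor's declared types."""
    fields = descriptor["resources"][resource_index]["schema"]["fields"]
    types = {field["name"]: field["type"] for field in fields}
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({name: CASTERS[types[name]](value) for name, value in raw.items()})
    return rows


rows = load_rows(CSV_TEXT, DESCRIPTOR)
descriptor_json = json.dumps(DESCRIPTOR, indent=2)  # what would be saved as datapackage.json
```

Because object_id is declared as a string, its leading zeros survive the load, while the date and boolean columns arrive as typed values rather than raw text. The CSV file itself is unchanged, so tools that know nothing about the descriptor can still read it as usual.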
https://github.com/cmoa/collection
http://goodtables.okfnlabs.org/
mailto:daniel.fowler@okfn.org

Book carts of Data: Usability and Access of Digital Content from Library Collections
Harriett Green, University of Illinois at Urbana-Champaign

Not all of the data we create or purchase for Library collections comes in neat multi-gigabyte packages of ordered files: we recently discovered that datasets we had purchased as part of a database licensing negotiation were more shelf ready than machine ready. They currently exist as stacks of hard drives, discs, and other bewildering formats sitting on a book cart. How do we provide access to these data collections?

In my extensive work with research teams, graduate students, and faculty members to obtain, generate, and transform data derived from collections in the University of Illinois Library and far beyond, the question of access and usability consistently rises to the fore. Thus, I would ask, how can we conceptualize the full spectrum of data usability? It is not enough for us to digitize the collection materials and for the data to exist on someone’s server: usability encompasses everything from data formats and tool interoperability to the negotiated permissions and rights for researchers to share and manipulate data as they engage in analytic workflows.

Data usability means developing data models that take into account the actions that will be performed on our data.
In determining the different types of data models that we can build and implement into our collections, we must consider how humanists and social scientists effectively work with data in their research and teaching.

My work with the HathiTrust Digital Library and HathiTrust Research Center (HTRC) has seen this in practice: the HTRC has attempted to meet the various expertise levels and needs of users in enabling access to the data. On the newcomer end of the spectrum, we provide fully guided access to gathering and using data through our Workset Builder and the Portal with its pre-set algorithms. But researchers frequently express the need for larger-scale data that is more pliable and manipulable, so the HTRC developed the Extracted Features datasets that allow researchers to generate highly customized and curated datasets. But the barriers to accessing this data can be high in terms of the skillsets needed to both access and use the data.

My research explorations on scholarly research practices also have shown me that data usability is critical:

Our research for the HTRC’s Workset Creation for Scholarly Analysis project examined researcher requirements for textual corpora to be usable for research (Fenlon et al. 2014, Green et al. 2014).
Our interviews with scholars revealed that the core areas of concern for researchers included the conceptualization of collections as reusable datasets and resources for scholarly communications; the ability to break apart collections into various levels of granularity to generate diverse objects of analysis; and the need for enriched metadata. We proposed building out the data model of the “workset,” the HTRC-specific term for textual corpora that researchers build.

Our subsequent user study for HTRC User Requirements (Green and Dickson, 2016) gave further insights on how researchers used textual corpora and on the scholarly practices that shape their needs for being able to work effectively with text collections in the HathiTrust Digital Library, as well as overall. We learned that the scholarly practices and notable challenges involved in working with our textual collections included the ability to acquire and structure the data; the need for a space to work with various tools and generate results; the ability to share data for research collaborations; and the role of data in teaching and training.

And my recently concluded research study for Emblematica Online explored how scholars engaged with the digitized emblem books drawn from leading rare book collections at Illinois, HAB Wolfenbuettel, University of Glasgow, Duke, and the Getty Institute.
In my examination of how scholars engaged with these multi-institutional collections, their metadata, and the interlinked digital content through interviews and usability testing sessions, we found that the expectations of users when exploring digital collections are complex: they range from the basic need for high-quality reproductions, for which Emblematica was praised by all participants, to advanced scholarly concerns such as the ability to distinguish between the types of archival content they are perusing (emblem books versus emblems themselves) and the historical particularities of this specialized genre of emblem studies. Respondents frequently expressed the need for context, annotated content, and other functionalities that would allow them to fully engage with the emblem books as an archival source and scholarly area. We considered that this may reveal the needs of interdisciplinary scholarship as researchers take advantage of easy access to vast digital collections of content: the scholarly knowledge base with which users approach digital collections varies widely, and an effective digital collection must welcome all levels and inculcate them into the scholarly domain of the collection.

These are some of the findings from my work to examine what researchers’ needs are as they engage with our Library collections in digital formats and make use of these materials as data.
This Forum’s discussion can provide critical new avenues for exploring how collections can be accessible, browseable, and extensible for addressing a diversity of emergent uses in research and teaching.

Works Cited

Fenlon K., Senseney M., Green H., Bhattacharyya S., Willis C. and Downie, J. S. (2014). Scholar-built collections: A study of user requirements for research in large-scale digital libraries. Proceedings of the American Society for Information Science & Technology 51(1), 1–10. doi: 10.1002/meet.2014.14505101047

Green, H. E., Fenlon, K., Senseney, M., Bhattacharyya, S., Willis, C., Organisciak, P., Downie, J.S., Cole, T., and Plale, B. (2014). Using Collections and Worksets in Large-Scale Corpora: Preliminary Findings from the Workset Creation for Scholarly Analysis Prototyping Project. Poster presented at iConference 2014, Berlin, Germany.

Green, Harriett, Eleanor Dickson, and Sayan Bhattacharyya. “Scholarly Requirements for Large Scale Text Analysis: A User Needs Assessment for the HathiTrust Research Center.” Digital Humanities 2016 Proceedings, Krakow, Poland, July 11 – 15, 2016.

Green, Harriett, Mara Wade, Timothy Cole, and Myung-Ja Han. 2015. “User Engagement with Digital Archives: A Case Study of Emblematica Online.” In Creating Sustainable Community: The Proceedings of the ACRL 2015 Conference, edited by Dawn Mueller, 177–187. Chicago, IL: Association of College and Research Libraries.
Historical Complications of/for Open Access Computational Data
Jennifer Guiliano, Indiana University–Purdue University Indianapolis

Always Already Computational seeks to support the “development of a strategic approach to developing, describing, providing access to, and encouraging reuse of library collections that support computationally-driven research and teaching.” Historically, data in the digital collections sphere has most often been expressed as homogenous datasets falling into one of three primary types: textual, visual, or audio. “Scholars” or “researchers” use large-scale textual information derived from digitized volumes or the extraction of text only from hypertextual and multimedia environments, or they mine hundreds or even thousands of hours of video or audio materials to extract and analyze subsets. Due to the dominance of datasets like those derived from the Google Books corpus or through webscraping tools that cull text, image, or audio, large or dense cultural datasets are the norm in digital humanities, and are not only homogenous in type but rarely imagine interactions as led by or with intervention from individuals not holding the role of scholar or researcher.

More simply, I am suggesting that the question of creating computationally‑accessible datasets is not just                              the deployment of an ecosystem for development, description, access, and reuse but a recognition that                              there are potentially multiple ecosystems of research and teaching that  must exist simultaneously  and be                              treated as relational computational data. To illustrate this principle, I’ll provide a brief synopsis of the                                work of Edward Curtis and how the open access images that are currently available as                              computationally‑accessible data through the Library of Congress present a complicated consideration of                        computational data. Beginning in 1868, Edward S. Curtis embarked on a thirty‑year career documenting                            over eighty native communities. Participating as part of scientific expeditions and anthropological                        excursions, he produced roughly 20 volumes of information on Native and Indigenous life that were                              accompanied by photographic images as part of his  The North American Indian series. Created primarily                              as silver‑gelatin photographic prints, this series has long held a place of prominence in historical analysis                                as the images are not only noted for their rarity but for the limited dissemination and reuse throughout                                    the twentieth century as full sets of materials. Only 300 sets of the 20 volume series were sold; however,                                      these images as individual objects have seen significant dissemination and reuse since their acquisition                            by the Library of Congress. 
More than 2,400 silver-gelatin photographic prints (of a projected total of 40,000) were acquired by the Library of Congress through copyright deposit from about 1900 through 1930. About two-thirds (1,608) of these images were not published in Curtis's multi-volume work, The North American Indian. The collection includes individual and group portraits, as well as photographs of indigenous housing, occupations, arts and crafts, religious and ceremonial rites, and social rituals (meals, dancing, games, etc.). More than 1,000 of the photographs have been digitized and individually described and are available through the Library of Congress API as well as via manual download in both JPEG and TIFF file formats.

Using strategies common to anthropologists working in indigenous communities at the turn of the 20th century, Curtis modified the images he produced to remove signs of modernity and contemporary life. This included providing specific forms of dress that were perceived as being “more traditional” as well as stronger interventionist strategies like removing objects that would signal integration with 20th century Euro-American society. For an image of a Piegan lodge on the LOC website, what is provided to the API is the unretouched negative, an image of two Piegan men situated in their lodge with a clock centered between them.
A computational dataset would expose the existence of this image (http://www.loc.gov/pictures/resource/cph.3b14188/?co=ecur), which could allow scholars to run object-based visual analysis algorithms to identify the clock in the image and potentially find other images of modernity using shape-segmentation, leading to some conclusions about the interventionism of technology in indigenous life: how widely has technology embedded itself into indigenous life? But in current thinking about computationally-accessible data, what would not be revealed is that this original negative shows an alarm clock between two seated men in a Piegan lodge, not the published, retouched image that American audiences would have viewed in The North American Indian. Curtis physically cut the clock out of the negative. He then retouched the image for publication in The North American Indian. It is important for accuracy purposes for the dataset to reflect not just the original photographic negatives but also relational data derived from what was actually published by Curtis. Otherwise, researchers might conclude that Americans were familiar with signs of modernity in indigenous life when, in fact, that conclusion is relatively recent historiographically.
Other examples of this type of relational computational data are available with Curtis: he depicted a Crow war party on horses, even though there had been no Crow war parties for years, and he used techniques of focus and duration to induce hue saturation that romanticized images.

More problematically, for our computational dataset, Curtis was also known to photograph religious rituals as part of his excursions. The [Oraibi snake dance] image (http://www.loc.gov/pictures/collection/ecur/item/90716879/) depicts Hopi natives that were part of the Snake and Antelope societies participating in a communal ceremony. Performed in August to ensure abundant rainfall to help corn growth, the ritual was the most widely photographed ceremony in the Southwest Pueblos by non-native observers. In current computationally-accessible form, there are a number of issues to confront: 1) there is no notation that this image is of a religious ritual that is now prohibited from viewing by the non-Hopi public (and thus should be pulled from view for reasons of cultural sensitivity); 2) when subjected to computer vision techniques, the derivative images rely on segmentation of physical bodies, a form of disembodied violence that reflects colonial practices where Natives are treated as less than human through segmented image representation (e.g. scalps, severed limbs, etc.).
More holistically, this case illustrates one of the long-term challenges of computationally-enabled access: computers cannot identify culturally-sensitive data, nor is there an efficient means to retrieve culturally-sensitive data once it has been distributed in computational form. While data might be displayed in an integrated manner, when it comes to the processing or analysis of our data, computational analysis has largely existed at a segmented level rather than as an integrated structural process for research and teaching purposes. Complex humanities data systems are often artificially layered representations that rely on augmentation of 'found' datasets such as traditional and web archives.

Often, human intervention is needed to verify the results of these computational processes, which have a habit of very quickly highlighting contradictions at the level of both object and corpora. An integrated data ecosystem posits that computational analysis is important not only for the core activities of development, description, access, and reuse, but also for the return of data to its originating collection through data correction and relational derivatives.
More simply, what is needed is an integrated humanities data ecosystem that recognizes approaches to computationally-accessible data and relies on important characteristics of humanities research data and humanities research practices: 1) humanists tend to create data, not just gather data; 2) some of this data is inherently structured, but most is not; 3) the resulting data is often highly interpretative, which has implications for sharing and re-use; 4) data creation is often iterative and layered with implications for copyright, versioning and active working spaces; and 5) the process is as important as the product. And, significantly, to envision the broadest potential intervention of computationally-accessible datasets, we cannot assume that the terms “scholar” and “researcher” belong only to the academic or archival communities. We must understand that the communities of origin should be the initiating point for considering development, deployment, access, etc.

Works Cited

[1] Portions of this response appeared in an earlier form in the Introduction to “The Future of Digital Methods for Complex Datasets”, an International Journal of Humanities and Arts Computing (IJHAC) special edition, and as a contribution to a Digital Library Federation panel on Humanities Data issues. Jennifer Guiliano and Mia Ridge, International Journal of Humanities and Arts Computing, Volume 10 Issue 1, Page 1-7. DOI: http://dx.doi.org/10.3366/ijhac.2016.0155.

Identifying Use Cases for Usable and Inclusive Library Collections as Data
Juliet L. Hardesty, Indiana University

A grounded, practical approach to digital projects often centers around three concerns: how will the project be useful, how can the project realistically be completed, and what information is necessary to make this project (or the items in a digital project) discoverable and accessible? Based on this approach, there are two sides to making library collections useful as computational data: the collection-holding library has to be able to release the data in a way that allows for computation, and researchers have to be able to find out about this data and do something with it. Putting data out there does not mean it will be used, and offering a computational interface does not mean it will fit all research needs.

The grant references the HathiTrust Research Center (HTRC) as an example of a computational interface for researchers. It also references Hydra-in-a-Box as an example of an application that could benefit from computational functionality. This generated the thought of an HTRC-in-a-Box that could work for libraries to set up their own computational interface for their collections.
Open government                              data efforts like  Code for America or data.gov and ckan.org show how various groups and individuals                                can come together around a common goal of providing access to computational data and provide                              ways to access, analyze, and offer data. It would be useful to examine those models when discussing                                  approaches to treating library collections as data.    This project is concerned with all types of digital objects. Text, images, audio, video, born‑digital,                              3‑dimensional, all have unique aspects to them that are sometimes computationally available but                          often are not. Sometimes the only way to know about segments on a video or the contents of an                                      image is to have textual description available. That requires metadata generation or metadata                          enhancement. This work can be manually intensive but can also be aided by software. 
Efforts such as AVPreserve’s plan to enhance metadata in stages for Indiana University’s Media Digitization and Preservation Initiative move gradually toward more advanced technologies to identify aspects such as people’s faces, beats per minute, and speaker identification in video and audio for the purpose of producing metadata that can then be discovered by researchers.[1] Another project to watch will be Wikimedia Commons’ Structured Data project to “develop storage information for media files in a structured way on Wikimedia Commons, so they are easier to view, translate, search, edit, curate and use.”[2] This process will not always be just about putting the data out there or making it possible for researchers to access the data; it will also involve producing data about different types of objects than has traditionally been the case in digital libraries. Recommendations, tools, and workflows for metadata enhancement will be necessary to create usable computational data.

Michelle Dalmau, Head of Digital Collections Services at Indiana University, correctly points out that different use cases are needed for library collections as data.[2] At Indiana University, several digital collections are available as datasets,[3] largely based on researcher requests. Tracking use in the wild is challenging, but datasets are used in the classroom (Charles W. Cushman Photograph Collection) and for research (Wright American Fiction).
Looking at how data is used for research compared to how it is used for instruction might lead to insights on qualities of data that make collections better suited for teaching versus research. Being able to reliably trace the ways in which these datasets are used will demonstrate impact to stakeholders. Using metadata about digital collections versus using the collection items themselves for content analysis is something else to consider. The British Library offers image collections for analysis separate from bibliographic datasets about their archival holdings.[4] Indiana University’s Cushman dataset offers only the metadata about the images, not the images themselves.

A final point to bring up concerns diversity and inclusion. Not only should this project make sure the collections considered for use cases are diverse in format, content, and source, but the project itself needs to have a broad and deep representation of voices and perspectives on computational data. These are not data that are only useful in the academic realm. Access to computational data or workflows and tools to allow others to provide access to computational data will be ever more important in the world, particularly if national governments continue to trend toward populism, nationalism, and privatization.

Works Cited

[1] Rudersdorf, Amy and Juliet L. Hardesty. (2016).
“AV Description with AVPreserve and IU: Strategies and tools to describe audiovisual materials at scale for Indiana University’s Media Digitization and Preservation Initiative.” Digital Library Federation Forum, Milwaukee, Wisconsin. https://osf.io/gfazc/

[2] Juliet L. Hardesty interviewed Michelle Dalmau regarding library collections as data in February 2017.

[3] https://commons.wikimedia.org/wiki/Commons:Structured_data

[4] British Library. Collection guides: Datasets for image analysis. http://www.bl.uk/collection-guides/datasets-for-image-analysis

Emerging Memory Institution Data Infrastructure in the Service of Computational Research
Christina Harlow, Cornell University

In my opinion, the Always Already Computational Forum work area rests at the intersection of the understood functionalities of memory institutions’ collection platforms and the needs of researchers working with large-scale or computational data analysis techniques. In thinking about this Forum’s scope and my own work, I am struck by possible collaborations not leveraged or mentioned. I would like to explore if my work approach to a facet of a larger data problem could expand and, in turn, be expanded by the Forum’s discussion and deliverables on computational research needs and memory institution data practices.

My position for this upcoming Forum will mostly fall along these points:

● If library collections, including but not limited to those of digital repository platforms, are considered (the proposal primarily targets digital repositories), there is a wealth of data and metadata (*data) that already exists. Better yet, memory institutions already work with this *data at scale using traditional and emerging technologies that underpin and are hidden by delivery and discovery interfaces. How can this underlying ecosystem be better leveraged for computational data analysis by researchers? That is, do we just need to make access to a Solr index publicly available? Can we plug into our library data ETL systems a public Hadoop integration point? Do we need to better document and expose to new communities our existing data APIs or data exchange protocols?

● I would like to surface the functional needs of the research areas alluded to in the proposal, then see where they overlap with existing *data operations work areas in memory institutions. A strategic partnership here means we can strengthen the cases for, collaboration on, and support of the technological, procedural, and organizational frameworks emerging. These are already being built and used to support efforts of memory institutions and their data partners.

● Computational or large-scale *data work requires transparency and agreement on a number of points to make it statistically relevant and publicly reliable.
These agreement points include but are not limited to:

o Machines should be able to understand the models or entities represented by the data;
o This requires having shared specifications around *data representation and contextual meaning of models, datum, types, etc.;
o We need to build and maintain consistent data exposure services, points or methods so that computational work can be reproducible, iterated, or distributed as needed (for scalability);
o Recognize that technological frameworks for computational analysis (for example, Hadoop) often require significant hardware, software, and maintenance to support. Stability of how data is exposed and data provenance can mitigate the technological burden by offering consistency on which multiple partners can build and coordinate efforts on the frameworks;
o And what is the responsibility of the originating memory institution to support capture of that computational data output for the sake of archiving, reproducibility, discoverability, and expanded *data services?

My positions come from my own work on metadata operations within a large and well-funded academic library system.
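
One way to read the call for consistent data exposure services in the agreement points above is as a stable, documented query against an index that already exists. The following Python sketch is only an illustration under stated assumptions: it builds request URLs for a Solr select handler using cursor-based paging, and it makes no network calls. The host, the core name (`catalog`), the field names, the `format:photograph` query, and the assumption that `id` is the index's unique key are all hypothetical and are not taken from any institution's actual configuration.

```python
from urllib.parse import urlencode

# Hypothetical public endpoint; not a real service.
SOLR_SELECT = "https://data.example.edu/solr/catalog/select"


def build_export_url(query: str, cursor: str = "*", rows: int = 500) -> str:
    """Build the URL for one page of a repeatable, cursor-paged Solr export."""
    params = {
        "q": query,
        "fl": "id,title,date",  # assumed field names
        "sort": "id asc",       # cursor paging needs a sort that includes the unique key
        "rows": rows,
        "cursorMark": cursor,   # "*" starts a new pass; later pages pass the returned nextCursorMark
        "wt": "json",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


# Hypothetical query used purely for illustration.
first_page = build_export_url("format:photograph")
print(first_page)
```

Publishing the query string alongside a result set is one possible way for a researcher to cite exactly what was requested, which speaks to the reproducibility concern raised above; whether a public index is the right exposure point remains the open question.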
My work focuses on building an efficient and coordinated *data ecosystem among sources including but not limited to:

● A traditional MARC21 Catalog with about 9 million bibliographic records, managed in an ILS (Integrated Library System), a few Oracle databases, a Perl-based metadata reporting and management interface, and other batch job management and metadata exposure services (APIs and data exchange protocols like Z39.50 or SRU);

● A locally-developed metadata integration layer that takes multiple data representations of authority, bibliographic and other metadata retrieved via APIs, merges them, and indexes into a number of Solr indexes;

● Multiple (~8 depending on the definition) digital repository applications and services for delivery of data and metadata to user interfaces. These repositories span technology and resource types from lone Fedora 4 instances for object persistence of primarily text-focused digital surrogates to more traditional DSpace installations for user-generated scholarly output type resources;

● A locally-managed authorities and entities interface that deals with both local vocabularies and enhanced representations of currently 3 large (>1 million resources) external metadata sets;

● And *data from archives, preservation, digitization, and many other workflows and systems.

In building a coherent ecosystem for this *data, I work with enterprise data tooling and approaches that perhaps also can support the computational data analysis needs to be surfaced in the Always Already Computational Forum. In particular, I am leveraging ETL and distributed data management systems that then interact with (and coordinate) existing memory institution *data standards, applications, specifications, and exchange protocols.
Due to the computational support of the selected distributed data systems, I run a number of processes that parallel some computational data approaches, but for different ends. I would like to outline how we could reuse or expand these existing approaches and services to support the researchers (and their respective areas) who take part in this Forum.

On the Computational Turn in Archives & Libraries and the Notion of Levels of Computational Services
Greg Jansen and Richard Marciano, University of Maryland

1. The Computational Turn in Archives & Libraries
The University of Maryland iSchool’s Digital Curation Innovation Center (DCIC) is pursuing a strategic initiative to understand and contribute to the computational turn in archives and libraries. The foundational paper (with partners from UBC, KCL, TACC, and NARA) calls for re-envisioning training for MLIS students in the “Age of Big Data”. See: “Archival Records and Training in the Age of Big Data” (http://dcicblog.umd.edu/cas/about/). We argue for a new Computational Archival Science (CAS) inter-discipline, with motivating case studies on: (1) evolutionary prototyping and computational linguistics, (2) graph analytics, digital humanities and archival representation, (3) computational finding aids, (4) digital curation, (5) public engagement / interaction with archival content, (6) authenticity, and (7) confluences between archival theory and computational practices: cyberinfrastructure and the records continuum.

Deeper experimentation with these new cultural computational approaches is urgently needed and the DCIC is developing a CAS curriculum that brings together faculty from Computer Science, Archival & Library Science, and Data Science. We conduct experiential projects with teams of students to help them: gain digital skills, conduct interdisciplinary research, and explore professional development opportunities at the intersection of archives, big data, and analytics. These projects leverage unique types of archival collections: refugee narratives, community displacement, racial zoning, movement of people, citizen internment, and cyberinfrastructure for digital curation. See “Practical Digital Curation Skills for Archivists in the 21st Century” (Lee, Kendig, Marciano, Jansen), MARAC 2016 (http://bit.ly/2lL5Et9). Two workshops on the interplay of computational and archival thinking were held in April 2016 (http://dcicblog.umd.edu/cas/) and December 2016 (http://dcicblog.umd.edu/cas/ieee_big_data_2016_cas-workshop/), and a pop-up session at SAA 2016 (https://archives2016.sched.com/event/7f9D) discussed archival records in the age of big data.

Finally, the DCIC is developing new cyberinfrastructure, called DRAS-TIC (see Nov. 2016 CNI talk: https://www.cni.org/topics/digital-curation/drastic-measures-digital-repository-at-scale-that-invites-computation-to-improve-collections), that facilitates computational treatment of cultural data. DRAS-TIC stands for Digital Repository at Scale that Invites Computation (To Improve Collections), and blends hierarchical archival organization principles with the power and scalability of distributed databases.

Our position statement builds on these CAS investigations by suggesting a framework for “Levels of Computational Service” to better describe the emerging ecosystem and identify gaps and opportunities.

2. Levels of Computational Service
Journalists, researchers, planners, and other patrons support their investigations with new methods of computational analysis. Libraries, archives, museums, and scientific data repositories hold data that will inform their disciplines. It is far easier today to analyze Twitter behavior than it is to investigate public life using public data from public institutions, such as government records, cultural heritage, and science data. We strive to make our public data and cultural memory as open to research as Twitter.

Computational analysis happens in various technical environments: on a single server; in distributed clusters; on cloud services. The tools we use have unique requirements, configurations, and hardware. It is said that a data stewardship organization cannot anticipate the uses for their data, but it is equally true that they cannot anticipate the tools used for analysis. Organizations need a service strategy that serves a range of users, from the most technically innovative to the most time- and resource-constrained. We describe a range of services for collections as data without losing sight of core services.
This is a “maturity model” for stewardship organizations, with levels of computational services that show a clear progression toward full service.

2.1. Core Service Level
Shipping datasets into the researcher compute environment remains the critical use case, maximizing flexibility and allowing researchers to link many datasets into one corpus. Researchers need to discover, scope, ship and make reference to datasets. Though we may also move computational work across them, boundaries are an important place to define stable conditions, such as custody, provenance, security, and concise technical contracts. Even the most advanced repository must establish these boundary conditions.

● Define license terms: how can we use the data?
● Define provenance:
  ○ Who produced the data and why?
  ○ How did it arrive here?
  ○ Do versions exist elsewhere?
● Define dataset scope:
  ○ What makes the corpus complete?
  ○ Is it complete?
  ○ Is it growing? What is the update history?
● Transfer methods with integrity verification and resume from failure
● Persistently citable datasets

2.2. Protocols Service Level
● File-by-file transfer through HTTP API (instead of batch downloads, like ZIPs)
● Define citable subsets through custom queries or functions
● Check for updates to any dataset or subset (via HTTP API)
● HTTP API for navigation of structured collections:
  ○ Static site (Apache or Nginx auto-index of files)
  ○ Cloud Data Management Interface (CDMI)
  ○ Linked Data Platform (https://www.w3.org/TR/2015/NOTE-ldp-primer-20150423/) (and Fedora API)
● Delivery to cloud and cloud-hosted, public datasets

2.3. Enhanced Service Level
● Derived data available as subsets:
  ○ plain text for documents and images
  ○ normalized file formats
  ○ tabular data for table-like sources
  ○ linked data for graph-like sources
● Machine-readable provenance records
● Crowd-sourcing of metadata
● Named entity indexing and subsetting (people, places, organizations, dates, events)
● Geospatial indexing and subsetting
● Consistent and citable random sample subsets (add random seeds to each observation)

2.4. Computer Room Service Level
Container technologies, such as Docker, ship a custom compute environment to the dataset location. A hosted database can be opened up for queries or distributed compute jobs. While not as flexible as the researcher environment, computer room services provide rapid and cost-effective analysis. Journalists on deadline benefit most from computer room services.

There are also growing calls, beyond the physical sciences, for analysis of big collections data in journalism and humanities scholarship. The sheer scale of big data makes transfer prohibitive, as is provisioning enough storage to host an entire corpus. At the Digital Curation Innovation Center at the University of Maryland’s iSchool, we are actively developing the DRAS-TIC repository (Digital Repository at Scale that Invites Computation). Through DRAS-TIC we aim to deliver computer room-style services over heterogeneous digital collections and remove the limits of scale.

● Run an Apache Spark job on a defined dataset
● Host a compute container with a dataset mounted locally
● SPARQL query service
● Use techniques above to produce a new subset for transfer

3. Provisioning the Researcher Environment

From code notebooks to deployment scripts that provision clusters, it is becoming easier to create and share compute environments. Research that aims toward publication will also need to track the steps of the research workflow. Through machine-readable scripts and provenance, we can aim to reproduce an analysis at a different time and place, starting from the cited datasets and well-described methods. The curation activities performed by a stewardship organization and the steps taken by the researcher can form an unbroken chain of events leading to a reproducible product.

Summary

For verifiable results in scholarship, or public trust in an independent press, we need to provide relevant datasets and services that make it straightforward to trace findings back to their source in the public record. We must confront a rightly skeptical reader, who faces increasingly high-flying visualizations and claims made from them. They are correct to demand links to the underlying evidence and methods. By providing these we enrich public understanding and trust. At the Digital Curation Innovation Center (DCIC) we have committed to this agenda and pursue it through our research projects, scholarly activities, the active development of the DRAS-TIC software project, and the building of a computational archival community (https://saaers.wordpress.com/2016/07/27/building-a-computational-archival-science-community/).
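One item in the Enhanced Service Level, "consistent and citable random sample subsets," may be easier to picture with a small example. The sketch below is only one possible reading of "add random seeds to each observation": it derives a repeatable pseudo-random key for each record from a published seed and the record's identifier, and keeps the records whose key falls below the sampling rate. The function names, the seed string, and the identifier format are all hypothetical, and hash-based sampling is our illustrative choice here, not a method the statement prescribes.

```python
import hashlib


def sample_key(seed: str, record_id: str) -> float:
    """Map (seed, record_id) to a repeatable number in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so the division below
    # is exact in a double and can never round up to 1.0.
    top_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_bits / 2**53


def sample_subset(record_ids, seed: str, rate: float):
    """Return the ids whose key falls below `rate`, in input order."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return [rid for rid in record_ids if sample_key(seed, rid) < rate]


if __name__ == "__main__":
    # Hypothetical identifiers and seed, for illustration only.
    ids = [f"item-{n:05d}" for n in range(10_000)]
    subset = sample_subset(ids, seed="example-seed", rate=0.05)
    print(len(subset), "of", len(ids), "records selected")
```

Because membership depends only on the seed and the record identifier, a citation that states the dataset, the seed, and the rate is enough for another researcher to regenerate the same subset, and records added to a growing collection do not change which of the existing records are selected.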
Partnership Recommended – The case of curating research data collections[1]

Lisa Johnston, University of Minnesota Libraries

Digitization alone is not enough to support large-scale computational analysis of library collections. Rather, the more difficult steps of digital curation will be necessary to prepare our collections for appropriate reuse. Partnership may be the key.

Take, for example, the problem of analog data. The extraction of historical climate data from tables, charts, and other artifacts (e.g., Zooniverse's Old Weather project) is an ambitious and important undertaking, as these data are undeniably valuable and temporally unique. Yet the digitization of data points from the written page is just the first step toward a greater integration of their meaning in modern and future research. For computation on these collections to be successful, the digital surrogate must be curated in a number of ways. The data may be transformed, cleaned, normalized, described, and contextualized, and quality assurance measures may be put in place to ensure trust and track the provenance of the work, to name a few. Data curation activities prepare and maintain research data in ways that make it findable, accessible, interoperable, and reusable (FAIR).
In our work, the Data Curation Network project (https://sites.google.com/site/datacurationnetwork) has taken steps to better understand the data curation activities mentioned above and to identify ways to harness the domain and file format expertise needed to curate research data across a network of partner institutions.[2] We represent academic library data repository programs that are staffed with curation experts for a range of data domains and data file formats. Our goals are to develop practical and transparent workflows and infrastructure for data curation; to promote data curation practices across the profession in order to build an innovative community that enriches capacities for data curation writ large; and, most importantly, to develop a shared staffing model that enables institutions to better support research by collectively curating research data in ways that scale beyond what any single institution might accomplish individually.

We are not alone in this desire to partner on data curation skills, staff, and infrastructure. National examples include the Portage Network (https://portagenetwork.ca), developed by the Canadian Association of Research Libraries (CARL), which aims to support library-based data management consultation and curation services across a broader network, and the JISC-funded Research Data Management Shared Service Project (https://www.jisc.ac.uk/rd/projects/research-data-shared-service), which aims to develop a lightweight service framework that can scale to all UK institutions and result in efficiencies by “relieving burden from institutional IT and procurement staff.” In the US, partnerships on technological infrastructure are booming. Project Hydra's Sufia platform (https://projecthydra.org), which builds on the DuraSpace Fedora framework, has been co-developed by numerous institutions that seek to build a better digital repository infrastructure for data. And the Hydra-in-a-Box project (http://hydrainabox.projecthydra.org/), led in part by the Digital Public Library of America (itself another partnership success story for disseminating archival materials), aims to provide a networked platform for repository services that will scale for institutions big and small. Another inspiring example is the Research Data Alliance (https://rd-alliance.org/), which provides an incubator for collaboration around a range of data-related topics. RDA projects to track include the Publishing Data Workflows working group and the newly formed Research Data Repository Interoperability working group. And partnerships do not necessarily need to start at the national level. Several smaller-scale partnerships are underway for sharing curation staff expertise across institutions. They include the Digital Liberal Arts Exchange (https://dlaexchange.wordpress.com/), which facilitates data-related problem solving and communication amongst peers and provides hosting services that allow digital humanities projects to be run on shared infrastructure.
Another is the DataQ Project (http://researchdataq.org/), which provides a virtual online forum for expert data staff to discuss and provide solutions for data issues in a collaborative way.

By partnering on data curation efforts like these we may move beyond individualized digital curation strategies toward what I hope will become a robust “network” of digital collections that are computational, but also trusted. And as partners in this effort we may continue a shared dialogue and collectively develop new and improved processes for curating research data and other digital objects. Finally, our networked research collections will demonstrate the continuing and important role that libraries and archives have to play in the broader scholarly process.

Works Cited

[1] Portions of this statement were also published in “Concluding Remarks” by Lisa R. Johnston in Curating Research Data Volume 2: A Handbook of Current Practice (ACRL, 2017), available as an open access ebook at http://www.ala.org/acrl/publications/booksanddigitalresources/booksmonographs/catalog/publications.

[2] Currently in our planning phase, the Data Curation Network aims to expand into a sustainable entity that grows beyond our initial six partner institutions: the University of Minnesota (the lead institution), the University of Illinois, Cornell University, the University of Michigan, Penn State University, and Washington University in St. Louis.
Ways of Forgetting: The Librarian, The Historian, and the Machine

Matthew Lincoln, Getty Research Institute

Jorge Luis Borges tells us of Funes, the Memorious: a man distinguished by his extraordinary recall. So precise and complete were Funes' memories, though, that it was impossible for him to abstract from the near-infinity of recalled specifics he possessed to general principles for understanding the world:

Locke, in the seventeenth century, postulated (and rejected) an impossible idiom in which each individual object, each stone, each bird and branch had an individual name. Funes had once projected an analogous idiom, but he had renounced it as being too general, too ambiguous. In effect, Funes not only remembered every leaf on every tree of every wood, but even every one of the times he had perceived or imagined it... He was, let us not forget, almost incapable of general, platonic ideas... he was not very capable of thought. To think is to forget a difference, to generalize, to abstract. In the overly replete world of Funes there were nothing but details, almost contiguous details.
(Borges 1962, 27)

Attending to Drucker's admonition that all "data" are properly understood as "capta", the story of Funes is a potent reminder that it is not only inevitable that we will be selective when capturing datasets from our collections, but that it is actually necessary to be selective (Drucker 2014). A dataset that aims for perfect specificity does so at the expense of allowing any generalizations to be made through grouping, aggregating, or linking to other datasets. For our data to be useful in drawing broad conclusions, it is an imperative to forget.

However, in considering library and museum collections as data, we must grapple with several different frameworks of remembering, forgetting, and abstracting: those of the librarian, the historian, and the machine. These frameworks will often be at cross-purposes:

● The librarian favors data that is standard: forgetting enough specifics about the collection in order to produce data that references the same vocabularies and thesauri as other collection datasets. The librarian's generalization aims to support access by many different communities of practice.

● The historian favors data that is rich: replete with enough specifics that they may operationalize that data in pursuit of their research goals, while forgetting anything irrelevant to those goals. The historian's generalization aims to identify guiding principles or exceptional cases within a historical context. (No two historians, of course, will agree on what that context should be.)
● The machine favors data that is structured: amenable to computation because it is produced in a regularized format (whether as a documented corpus of text, a series of relational tables, a semantic graph, or a store of image files with metadata). In a statistical learning context, the machine seeks generalizations that reduce error in a given classification task, forgetting enough to be able to perform well on new data without over-fitting to the training set.

At the Getty Research Institute, our project to remodel the Getty Provenance Index® as Linked Open Data (http://www.getty.edu/research/tools/provenance/provenance_remodel/index.html) is compelling us to balance each of these perspectives against the labor required to support them. Our legacy data are filled with a mix of transcriptions of sales catalogs, archival inventories, and dealer stock books, paired with editorial annotations that index some of those fields against authorities or other controlled vocabularies. Originally designed to support the generation of printed volumes, and then later a web-based interface for lookup of individual records, these legacy data speak mostly about documents of provenance events, and do so for an audience of human readers.
To make these data linkable to museums that are producing their own Linked Open Data (following the general CIDOC-CRM principles of defining objects, people, places, and concepts through their event-based relationships), we are transforming these data into statements about those provenance events themselves. In so doing, we are standardizing the terms referenced, enriching fields by turning them from transcribed strings into URIs of things, and explicitly structuring the relationships between these data as an RDF graph.

All this work requires dedicated labor. This leads to hard questions about priorities.

To what extent do we preserve the literal content of these documents, versus standardizing the way that we express the ideas those documents communicate (insofar as we, as modern-day interpreters, can correctly identify those ideas)? To maintain (to remember) plain text notes about, say, an object's materials as recorded by an art dealer is to grant the possibility of perfect specificity about what our documents say. But not aligning descriptions with authoritative terms for different types of materials and processes forecloses the possibility of generalizing about the history of those materials and processes across hundreds of thousands of objects. Remember too much, in other words, and we become Funes: incapable of synthetic thought.

Capacious collections data must remember enough and forget enough to be useful. For which terms will we expend the effort to do this reconciliation?
Which edge cases will we try to capture in an ever-more-complex data model? Opinions on how to draw that line will frequently set the librarian, the historian, and the machine at cross-purposes. Outlining the competencies that a collections data production team needs, and the key questions it must ask, in order to navigate these perspectives must therefore be a crucial output of this forum.

Works Cited

Borges, Jorge Luis. 1962. “Funes, the Memorious.” In Ficciones, edited by Anthony Kerrigan, 107–15. New York: Grove Press.

Drucker, Johanna. 2014. Graphesis: Performative Approaches to Graphical Forms of Knowledge Production in the Humanities. Cambridge: Harvard University Press.

Assessing Data Workflows for Common Data 'Moves' Across Disciplines

Alan Liu, University of California Santa Barbara

In considering how library collections can serve as data for a variety of data ingest, transformation, analysis, replication, presentation, and circulation purposes, it may be useful to compare examples of data workflows across disciplines to identify common data "moves" as well as points in the data trajectory that are especially in need of library support because they are, for a variety of reasons, brittle.

We might take a page from current research on scientific workflows in conjunction with research on data provenance in such workflows.
Scientific workflow management is now a whole ecosystem that includes integrated systems and tools for creating, visualizing, manipulating, and sharing workflows (e.g., Wings, Apache Taverna, Kepler, etc.). At the front end, such systems typically model workflows as directed acyclic graphs whose nodes represent entities (including data sets and results), activities, processes, algorithms, etc. at many levels of granularity, and whose edges represent causal or logical dependencies (e.g., source, output, derivation, generation, transformation, etc.) (see fig. 1). Data provenance (or "data lineage," as it has also been called in relation to workflows) complements that ecosystem through standards, frameworks, and tools, including the Open Provenance Model (OPM), the W3C's PROV model, ProvONE, etc. Linked-data provenance models have also been proposed for understanding data-creation and -access histories of relations between "actors, executions, and artifacts."[1] In the digital humanities, the in-progress "Manifest" workflow management system combines workflow management and provenance systems.[2]

The most advanced research on scientific workflow and provenance now goes beyond the mission of practical implementation to meta-level analyses of workflow and provenance. The most interesting instance I am aware of is a study by Daniel Garijo et al.
that analyzes 177 workflows recorded in the Wings and Taverna systems to identify high-level, abstract patterns in the workflows.[3] The study catalogs these patterns as data-oriented motifs (common steps or designs of data retrieval, preparation, movement, cleaning/curation, analysis, visualization, etc.) and workflow-oriented motifs (common steps or designs of "stateful/asynchronous" and "stateless/synchronous" processes, "internal macros," "human interactions versus computational steps," "composite workflows," etc.). Then, the study quantitatively compares the proportions of these motifs in the workflows of different scientific disciplines. For instance, data sorting is much more prevalent in drug discovery research than in other fields, whereas data-input augmentation is overwhelmingly important in astronomy.

Since this usage of the word motifs is unfamiliar, we might use the more common, etymologically related word moves to speak of "data moves" or "workflow moves." A move connotes a combination of step and design. That is, it is a step implemented not just in any way but in some common way or form. In this regard, the Russian word mov for "motif," used by the Russian Formalists and Vladimir Propp, nicely backs up the choice of the word move to mean a commonplace data step/design. Indeed, Propp's diagrammatic analyses of folk narratives (see fig. 2) look a lot like scientific workflows.
We might even generalize the idea of "workflows" in an interdisciplinary way and                              say, in the spirit of Propp, that they are actually  narratives . Scientists, social scientists, and humanists do                                  not just process data; they are telling data stories, some of which                        influence the shape of their final narrative (argument, interpretation, conclusion).   The takeaway from all the above is that a comparative study of data workflow and provenance across                                  disciplines (including sciences, social sciences, humanities, arts) conducted using workflow modeling                      tools could help identify high‑priority "data moves" (nodes in the workflow graphs) for a library‑based                              "always already computational" framework.  One kind of high priority is likely to be very common data moves. For example, imagine that a                                    comparative study showed that in a sample of  in silico or data analysis projects across several disciplines                                  over 40% of the data moves involved R‑based or Python‑based processing using common packages in                              similar sequences (perhaps concatenated in Jupyter notebooks); and, moreover, that among this number                          60% were common across disciplinary sectors (e.g., science, social science, digital humanities). Then                          these are clearly data moves to prioritize in planning "always already computational" frameworks and                            standards.  Another kind of high priority may be data moves that involve a lot of friction in projects or in the                                        movement of data between projects. 
One simple example pertains to researchers at different                          universities ingesting data from the "same" proprietary database who are prevented from standardizing                          live references to the original data because links generated through their different institutions' access to                              the databases are different. Friction points of this kind identified through a comparative workflow study                              are also high value targets for "always already computational" frameworks and standards.  Finally, one other kind of high priority data move deserves attention for a combination of practical and                                  sensitive issues. Many scenarios of data research involve the generation of transient data products (i.e.,                              data that has been transformed at one or more steps of remove from the original data set). A                                    comparative workflow study would identify common kinds of transient data forms that require holding                            for reasons of replication or as supporting evidence for research publications. In addition, because some                              data sets cannot safely be held because of intellectual property or IRB issues, transformed datasets (e.g.,                                converted into "bags of words," extracted features, anonymized, aggregated, etc.) take on special                          importance as holdings. A comparative workflow study could help identify high‑value kinds of such                            holdings that could be supported by "always already computational" frameworks and standards.  Works Cited  [1] Hartig, Olaf. "Provenance Information in the Web of Data." In  Proceedings of the Linked Data on the  Web Workshop at WWW , edited by Christian Bizer, Tom Heath, Tim Berners‑Lee, and Kingsley Idehen,  April 20, 2009.  
http://ceur-ws.org/Vol-538/ldow2009_paper18.pdf.

[2] Kleinman, Scott. Draft Manifest schema. WhatEvery1Says (WE1S) Project, 4Humanities.org.

[3] Garijo, Daniel, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, and Carole Goble. "Common Motifs in Scientific Workflows: An Empirical Analysis." 2012 IEEE 8th International Conference on E-Science (e-Science), 2012: 1–8. doi: 10.1109/eScience.2012.6404427.

At the intersection of institution and data

Matthew Miller, New York Public Library

Libraries are awash in data, from the large reservoirs of bibliographic metadata that power discovery and access systems, to boutique datasets created from the documents themselves, and even the ephemeral data exhaust produced by staff and patrons conducting research. The observations and questions below, on description, distribution, and access, emerge from practical day-to-day work with this type of data and could benefit from closer examination.

The most potentially kinetic, computationally amenable data comes from the conversion and processing of documents themselves. Transforming documents into data at the New York Public Library took the form of small projects that converted special collection materials into datasets through the power of algorithms, staff, and the crowd. The result of each project was a domain-specific dataset, often with a necessarily unique data model.
Taking stock of the growing number of these datasets, we theorized about their possible integration with our traditional metadata systems. Would it be possible to go beyond simply linking to the dataset as a digital asset? If we were to build an RDF metadata system from the ground up, could we begin thinking of it as an open-world assumption system where the contents of these datasets could exist alongside traditional bibliographic metadata? As more cultural heritage organizations continue to produce similar datasets, we need to consider how they shape the next generation of our metadata and discovery platforms.

Stepping back from this larger question, when thinking about these resources as discrete datasets, what work could be done to improve their use and interoperability? W3C standards such as the VoID Vocabulary provide the means to describe metadata about datasets. By leveraging such standards and establishing best practices and preferred authorities, could we increase access across humanities datasets? How much work and what sort of resources are required to accomplish this at the dataset level, and perhaps at the data level as well? For example, common non-bibliographic authorities such as Wikidata URIs could be used in the data to facilitate interoperability across datasets and even institutions.

When publishing data for others, there is a balance to strike between providing access to the data in a format that offers the least friction for adoption and use, and reflecting how knowledge organization systems work within a cultural heritage institution.
This often requires preprocessing library metadata, turning it into a more accessible form that does not require extensive domain knowledge. For example, when releasing the metadata for NYPL's public domain images we did not publish the MODS XML metadata, the format in which it is natively stored in our systems. Instead we opted to publish it as JSON and also as simple CSV files, along with extensive documentation. Reducing the complexity of the format reduced the complexity of the tools and skills needed to work with it.

Another example, taking this approach a step further, is the Linked Jazz project, in which we provided access to the data in the form of a SPARQL endpoint. The data, which are stored as RDF statements, represent a social network of jazz musicians. This dataset lends itself to network analysis using popular tools such as Gephi. To make the application of such a tool as simple as possible, we added a Gephi file export API allowing anyone to quickly download a GEXF file of part or all of the network to import into the software. This sort of scholarly API is geared toward delivering the resources needed to begin using the data immediately, as opposed to just providing access to the underlying data store.

The topic of preprocessing introduces the question of best practices and standards that could be followed to ensure the broadest access to our datasets.
What are some additional use cases that could drive shared best practices or tools for releasing cultural heritage data? Is there more advanced preprocessing that could be done to some of the common archetypical data formats found in libraries, archives and museums? And what sort of resources are required in an organization to process datasets for public consumption?

As institutions increasingly produce and release datasets, establishing some best practices around description, distribution and access can facilitate collaboration between organizations and ensure productive use of these resources by patrons.

Metadata and Digital Repository Accessibility Issues for Library Collections as Data
Anna Neatrour, University of Utah

In thinking of ways to use library collections as data, I was struck by the theme of accessibility. Are researchers genuinely invited to engage with library collections as data? I’m going to focus on this narrowly, looking mainly at aspects of metadata and technical infrastructure in digital repositories.

Metadata as invitation to computation

Encouraging usage of library collections as data could be embedded in digital collections metadata by including a statement that metadata is free to reuse, providing a CC0 license, or stating that metadata is open as a policy. One example of this is seen in the Harvard policy on open metadata.
Many institutions have agreed that their metadata is in the public domain, which is a condition for harvest by DPLA, but there is often no metadata reuse statement available at the item or collection level in the source digital repositories for these shared collections. Making it clear that we expect metadata to be reused and repurposed improves the accessibility of digital library collections as data. Providing an easy way for researchers to download metadata in addition to a digital image might also encourage more research engagement with digital collections metadata. An example of this can be found in the University of Hull’s repository, where records are easily downloaded in MODS or Dublin Core. In addition, highlighting investigations undertaken by repurposing library metadata within the digital repository itself could spark additional ideas for research from people who might be encountering this possibility for the first time.

Make digital repositories more welcoming

While offering access to digital collections via an API may be an effective way of showing that computation is possible with digital collections, it doesn’t provide a welcoming environment for students or researchers who are at the initial stages of their research and who might not yet have the technical expertise to utilize an API.
Providing a portal to a suite of sample apps created with an API, as DPLA does, along with the search interface for a digital repository, creates a signal that application development and computation utilizing a digital library are both possible and desired.

With libraries everywhere continually being asked to do more with less, curating all digital collections for computational purposes may be impossible. However, developing easy methods of bulk download for both images and metadata outside of an API may open up windows for researchers. Providing clear methods to download digital objects across different collections, or to interact with images across repositories through a framework like IIIF, could be yet another method for enabling researchers to interact with library collections as data.

Digital collection managers may be able to curate new local or regional corpora by thinking creatively about digital items they already own. For example, in my own library at the University of Utah, I’ve wondered about the possibility of making our typewritten oral history transcripts available to researchers. These oral histories were scanned as PDFs, and I expect the OCR would be decent enough to support text-based topic modeling.
Figuring out how to make these resources accessible to researchers by packaging them in a way that would encourage computational use is a goal of mine.

(Links cited above: Harvard open metadata policy, http://library.harvard.edu/open-metadata; University of Hull repository, https://hydra.hull.ac.uk/; DPLA apps, https://dp.la/apps)

What does a digital collections as data repository look like?

Providing additional layers and portals that leverage computational exploration to existing collections might serve as an intermediate step. Imagine if text-based digital collections also had a Voyant-like layer built into the digital repository itself that researchers could use, along with pre-populated queries and visualizations so people at the beginning stages of inquiry could see examples of text analysis. This could support an introductory approach to exploring collections as data in the classroom. Many digital library repositories leverage visual possibilities for geospatial visualization and browsing, as in the Open Parks Network Map that shows thumbnail images of digital items along with map locations. Could an interface be built into a digital repository that would enable researchers to easily mash up digital items into a personalized portal that would support geospatial visualization without the need to download metadata, enhance information with coordinate data, and then create a more static map in an external system from that exported data?
Could our digital repositories provide a mechanism for researchers to curate their own research collections, providing a space where digital library objects could be combined with researcher-supplied data? Any approach will have to blend what is pragmatically possible with support for experimentation within the existing infrastructure for our digital repositories. Keeping in mind the idea of accessibility for researchers and library users at all stages of inquiry will hopefully result in an effective blend of solutions for interacting with library collections as data.

I’d like to thank Jeremy Myntti and Jim McGrath for providing feedback on a draft of this position statement.

(Link cited above: Open Parks Network Map, http://openparksnetwork.org/map/?b=yes&c=yes&z=8&k=pubcode%3ACAHA)

Actually Useful Collection Data: Some Infrastructure Suggestions
Miriam Posner

Libraries and archives are increasingly making their materials available online, but, as a general rule, these materials aren’t of much use for computational purposes. For the most part, institutions have sought to replicate as closely as possible the experience of being in a reading room with an individual object. We see this in artifacts like skeuomorphic “swishes” on digital page-turns, mammoth lists of browsable topics, and, what concerns me most here, the inability to download large quantities of object metadata.
Many of us have learned the basics of webscraping precisely to get around this problem, laboriously writing scripts to harvest metadata that we know must already exist somewhere, as data, in a repository.

There are many good reasons cultural institutions impose these limitations on their metadata. For one thing, it’s not at all clear how many people actually want to treat collections as data. Most patrons aren’t accustomed to encountering data in a cultural institution. So perhaps archives are just being good stewards of limited resources by focusing their attention on simply making digital facsimiles available. But the lack of collection data also limits other people’s imaginations about what they might do with collections’ materials.

I’ve also been told by various institutions that they don’t have the right metadata for researchers to work with -- that their descriptive information is often schematic, high-level, and meant for search and discovery, not for visualization and analysis. I agree that this is a concern that we need to take seriously, but I contend that even the most basic metadata is often more useful for understanding a collection than many librarians imagine. Simply having author or creator information, or language information, can be very helpful. My impression is that many institutions are holding onto their data tightly, with the hope of cleaning and improving it in the future. But researchers can work with imperfect data, if its limitations are discussed frankly.
We can also contribute improved data back to the institution.

Going forward, I imagine multiple pieces of infrastructure that could help make the data of cultural institutions as widely usable -- and widely used -- as possible:

A workable humanities data repository or registry. A good many open data repositories already exist. Most of them are designed to hold scientific data, although this need not disqualify them for humanities data. Humanists are actively contributing data (albeit on a relatively small scale) to general-use data repositories such as FigShare and Zenodo. The more troublesome problem is that a) consensus hasn’t built around one particular repository; and b) absent a central repository, no substitute, such as a data registry, gathers lists of cultural data in one place. What cultural data exists is stored, for the most part, on GitHub -- fine for downloading, versioning, and contributing data, but a terrible way to discover new datasets. We need a better way to find cultural data.

Consideration of APIs versus “data dumps.” Many cultural institutions, reasonably enough, offer APIs as a means of accessing their data. This makes sense for a lot of different reasons, including access to the most recent data and the ability to retrieve institutions’ data in many different ways. The problem here is that many humanists can work with structured data, but not with APIs.
Many common visualization tools require no programming, and so it’s possible for humanists to work with data, even in sophisticated, thoughtful ways, without necessarily knowing how to program. Developers at cultural institutions may feel that learning an API is trivial, but for many people, the availability of simple flat files can be the difference between using and not using a dataset. I therefore hope that cultural institutions will consider the possibility of providing unglamorous flat files, in addition to API access to their data.

Really lowbrow thought about data formats. Very simply, my students can work with CSVs, but not XML or JSON. Visualizing and analyzing the latter two formats takes programming knowledge, while even non-coders can import CSVs into Excel and create graphs and charts. Obviously, one can convert XML and JSON to CSVs, but doing this requires some knowledge of these formats, and sometimes some programming (or at least command-line) ability.

Case studies. It may seem unlikely, given the recent proliferation of digital humanities journals, but it’s relatively difficult to find vetted, A-to-Z, soup-to-nuts examples of how to build visualizations and analysis from datasets. The aggregation of a number of fairly simple examples would, I believe, go far in demonstrating how people might use datasets in their own work, and would certainly be of great utility in the classroom.
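As an illustration of how small such an example can be, here is a minimal sketch of the JSON-to-CSV conversion mentioned above, using only Python's standard library. The sample records, the dotted column naming, and the choice to join list values with "; " are assumptions made for this sketch, not a convention drawn from any particular institution's data.

```python
import csv
import io
import json

# Hypothetical records of the kind an institution might publish as JSON.
SAMPLE_JSON = """[
  {"id": 1, "title": "Letter", "creator": {"name": "A. Writer"}, "languages": ["en", "fr"]},
  {"id": 2, "title": "Map", "creator": {"name": "B. Drafter"}, "languages": []}
]"""


def flatten(record, parent=""):
    """Turn nested objects into dotted column names and join lists into one cell."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            # Items are converted with str(); lists of objects would need more care.
            flat[name] = "; ".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def json_to_csv(json_text):
    """Convert a JSON array of objects into CSV text with one row per object."""
    rows = [flatten(record) for record in json.loads(json_text)]
    # Collect column names in first-seen order so every row shares one header.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(json_to_csv(SAMPLE_JSON))
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet. Even this short script assumes familiarity with nesting and lists, which is the barrier described above.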
The key here would be to keep the examples quite simple, so that people can replicate and build on them with relative ease.

Interoperability and Community Building
Sheila Rabun, International Image Interoperability Framework (IIIF) Consortium

I am coming from a non-traditional background, with a Master’s in interdisciplinary folklore studies, having gained the majority of my experience in libraries as the digital project manager and subsequently the interim director of the University of Oregon (UO) Libraries’ Digital Scholarship Center. Among many digital projects, I was responsible for the Oregon Digital Newspaper Program, where we made large sets of newspaper OCR data and images available to the public online, following the Library of Congress’ Chronicling America site and open API. While digital newspaper data has been used to create visualizations and other computational projects (for example, the Mapping Texts collaboration between the University of North Texas and Stanford University), the learning curve for scholars to find, harvest, and use the data provided remains a challenge. Students and faculty from all subject areas are increasingly looking to library and information professionals for guidance on where to find accessible data resources, how to use them, and recommendations on platforms for sharing their work.
In addition to determining best practices for making collections available as data, comprehensive training materials and documentation for end users will be key to lowering the barrier to entry, making it easier for researchers to get started working with data on their own and encouraging wider re-use and experimentation.

Over the past 7 months I have shifted my focus slightly, as the Community and Communications Officer for the International Image Interoperability Framework (IIIF) Consortium, to improve digital image repository maintenance and sustainability as well as access and functionality for end users. As a community-driven initiative including national and state libraries, museums, research institutions, software firms, and other organizations across the globe, IIIF provides specifications for publishing digital image collection data to allow for interoperability across repositories. IIIF specifically addresses the “data silo” problem that has been plaguing the digital repository community, particularly by using existing standards and models such as JSON-LD and Web Annotation that make sharing and re-use easy. A growing number of digital image repositories are adopting IIIF, and the IIIF Consortium has grown to include 40 institutional members since it was formed in 2015.

The IIIF community and specifications are especially relevant to the goals of the Always Already Computational (AAC) work, particularly regarding digital images.
IIIF has laid the groundwork for library collections as data by serving as an internationally agreed-upon best practice for making digital image data shareable and more usable for study. IIIF utilizes JSON-LD manifests (representations of a physical object such as a book, as described in the IIIF Presentation API), to encourage sharing, parsing, and re-use of data regardless of differing metadata schemas across collections and repositories. The IIIF community has built the specifications specifically around use cases to solve real problems, so far primarily focusing on the needs of those both using and making available digitized manuscripts, newspapers, and museum collections.

We are currently working on extending the IIIF specifications to include interoperability for Audio and/or Visual materials (with 3D materials further along the roadmap), as well as improved discovery of IIIF-compatible resources on the web.
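To show what parsing a manifest can look like in practice, here is a minimal sketch that walks a manifest shaped like the IIIF Presentation API 2.1 (sequences, canvases, painting annotations) and lists each page label with its image URL. The manifest below is a hypothetical, heavily trimmed example with invented example.org URLs, and `list_page_images` is an illustrative helper rather than part of any IIIF library. Real manifests carry more properties, and labels may be language-tagged values rather than plain strings.

```python
# Hypothetical, trimmed manifest in the shape of IIIF Presentation API 2.1.
SAMPLE_MANIFEST = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://example.org/iiif/book1/manifest",
    "@type": "sc:Manifest",
    "label": "Book 1",
    "sequences": [{
        "@type": "sc:Sequence",
        "canvases": [
            {
                "@id": "https://example.org/iiif/book1/canvas/p1",
                "@type": "sc:Canvas",
                "label": "p. 1",
                "height": 1000,
                "width": 750,
                "images": [{
                    "@type": "oa:Annotation",
                    "motivation": "sc:painting",
                    "on": "https://example.org/iiif/book1/canvas/p1",
                    "resource": {
                        "@id": "https://example.org/images/book1-p1.jpg",
                        "@type": "dctypes:Image",
                    },
                }],
            },
            {
                "@id": "https://example.org/iiif/book1/canvas/p2",
                "@type": "sc:Canvas",
                "label": "p. 2",
                "height": 1000,
                "width": 750,
                "images": [{
                    "@type": "oa:Annotation",
                    "motivation": "sc:painting",
                    "on": "https://example.org/iiif/book1/canvas/p2",
                    "resource": {
                        "@id": "https://example.org/images/book1-p2.jpg",
                        "@type": "dctypes:Image",
                    },
                }],
            },
        ],
    }],
}


def list_page_images(manifest):
    """Return (canvas label, image URL) pairs in reading order."""
    pages = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for annotation in canvas.get("images", []):
                pages.append((canvas.get("label", ""), annotation["resource"]["@id"]))
    return pages


for label, url in list_page_images(SAMPLE_MANIFEST):
    print(label, url)
```

Because the structure is shared, the same loop would work on a manifest from any repository that follows this version of the specification, whatever descriptive metadata schema sits behind it.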
Collaboration with the existing community that has formed around IIIF will be essential for the work of AAC and we welcome new interested parties to get involved, inform and provide feedback on approaches for discovery and stay informed about new innovations. Libraries and museums have been the primary adopters so far, but we have plans to do more outreach to scholars and researchers in all disciplines, STEM imaging providers, publishers, and the commercial sector. Vendors like CONTENTdm and LUNA have incorporated IIIF into their products, and IIIF is gaining speed in open source efforts like the Hydra-in-a-box repository product, which is IIIF-compatible. The goals of IIIF and AAC are in alignment, and there is an exciting potential to work more closely together, leveraging the existing IIIF community network and technical framework to create and build upon best practices.

(Links cited above: http://chroniclingamerica.loc.gov/; http://chroniclingamerica.loc.gov/about/api/; http://mappingtexts.org/index.html; http://iiif.io/; http://iiif.io/community; http://iiif.io/technical-details; http://iiif.io/community/consortium; http://iiif.io/api/presentation/2.1/; https://github.com/IIIF/iiif-stories; http://iiif.io/community/groups/av/; https://gist.github.com/azaroth42/01cd1a377d519e29572b8b072ac5980a)

From libraries as patchwork to datasets as assemblages?
Mia Ridge, British Library

The British Library's collections are vast, and vastly varied, with 180-200 million items in most known languages. Within that, there are important, growing collections of manuscript and sound archives, printed materials and websites, each with its own collecting history and cataloguing practices. Perhaps 1-2% of these collections have been digitised, a process spanning many years and many distinct digitisation projects, with an ensuing patchwork of imaging and cataloguing standards and licences. This paper represents my own perspective on the challenges of providing access to these collections and others I've worked with over the years.

Many of the challenges relate to the volume and variety of the collections. The BL is working to rationalise the patchwork of legacy metadata systems into a smaller number of strategic systems.[1] Other projects are ingesting masses of previously digitised items into a central system, from which they can be displayed in IIIF-compatible players.[2]

The BL has had an 'open metadata' strategy since 2010, and published a significant collection of metadata, the British National Bibliography, as linked open data in 2011.[3] Some digitised items have been posted to Wikimedia Commons,[4] and individual items can be downloaded from the new IIIF player (where rights statements allow). The BL launched a data portal, https://data.bl.uk/, in 2016.
It's a work in progress - many more collections are still to be loaded, and the descriptions and site navigation could be improved - but it represents a significant milestone many years in the making. The BL has particularly benefitted from the work of the BL Labs team in finding digitised collections and undertaking the paperwork required to make them freely available. The BL Labs Awards have helped gather examples of creative, scholarly and entrepreneurial re-use of digitised collections, and BL Labs Competitions have led to individual case studies in digital scholarship while helping the BL understand the needs of potential users.[5] Most recently, the BL has been working with the BBC's Research and Education Space project,[6] adding linked open data descriptions about articles to its website so they can be indexed and shared by the RES project.

In various guises, the BL has spent centuries optimising the process of delivering collection items on request to the reading room. Digitisation projects are challenging for systems designed around the 'deliverable item': the digital user may wish to access or annotate a specific region of a page of a particular item, but the manuscript itself may be catalogued (and therefore addressable) only at the archive box or bound volume level. The visibility of research activities with items in the reading rooms is not easily achieved for offsite research with digitised collections.
Staff often respond better to discussions of the transformational effect of digital scholarship in terms of scale (e.g. it's faster and easier to access resources) than to discussions of newer methods like distant reading and data science.

The challenges the BL faces are not unique. The cultural heritage technology community has been discussing the issues around publishing open cultural data for years,[7] in part because making collections usable as 'data' requires cooperation, resources and knowledge from many departments within an institution. Some tensions are unavoidable in enhancing records for use externally - for example curators may be reluctant or short of the time required to pin down their 'probable' provenance or date range, let alone guess at the intentions of an earlier cataloguer or learn how to apply modern ontologies in order to assign an external identifier to a person or date field.

While publishing data 'as is' in CSV files exported from a collections management system might have very little overhead, the results may not be easily comprehensible, or may require so much cleaning to remove missing, undocumented or fuzzy values that the resulting dataset barely resembles the original.
Publishing data benefits from workflows that allow suitably cleaned or enhanced records to be re-ingested, and export processes that can regularly update published datasets (allowing errors to be corrected and enhancements shared), but these are all too rare. Dataset documentation may mention the technical protocols required but fail to describe how the collection came to be formed, what was excluded from digitisation or from the publishing process, or the backlog of items without digital catalogue records, let alone digitised images. Finally, users who expect beautifully described datasets with high quality images may be disappointed when their download contains digitised microfiche images and sparse metadata.

Rendering collections as datasets benefits from an understanding of the intangible and uncertain benefits of releasing collections as data and of the barriers to uptake, ideally grounded in conversations with or prototypes for potential users. Libraries not used to thinking of developers as 'users' or lacking the technical understanding to translate their work into benefits for more traditional audiences may find this challenging. My hope is that events like this will help us deal with these shared challenges.

Works Cited

[1] The British Library, ‘Unlocking The Value: The British Library’s Collection Metadata Strategy 2015 - 2018’.

[2] The International Image Interoperability Framework (IIIF) standard supports interoperability between image repositories. Ridge, ‘There’s a New Viewer for Digitised Items in the British Library’s Collections’.
[3] Deloit et al., ‘The British National Bibliography: Who Uses Our Linked Data?’

[4] https://commons.wikimedia.org/wiki/Commons:British_Library

[5] http://www.bl.uk/projects/british-library-labs, http://labs.bl.uk/Ideas+for+Labs

[6] https://bbcarchdev.github.io/res/

[7] For example, the 'Museum API' wiki page listing machine-readable sources of open cultural data was begun in 2009 (http://museum-api.pbworks.com/w/page/21933420/Museum%C2%A0APIs) following discussion at museum technology events and on mailing lists.

Maintaining the ‘why’ in Data: Consider user interaction and consumption of library collections
Hannah Skates Kettler, University of Iowa

Always Already Computational represents the next hurdle for libraries, archives and museums. Now that the profession is comfortable with the notion of digitization, and has reaped the rewards of greater and broader impact (Proffitt and Schaffner, 2008), it has turned its focus towards born digital materials. Born digital materials are not a new notion in 2017; the profession has long been aware of the concept but has been hesitant to tackle it. As a Digital Humanities professional, I deal with the use and creation of born digital materials every day and adapt to the multiplicitous ways library collections are created and made available, especially in the Humanities.

I therefore approach the questions in Always Already Computational with these concepts in mind:

Relational Datasets:

No library collection is an island.
Library collections are not simply a list of ones and zeros that wait to be consumed and reused, then spat out again as something different. At least, not when we want to be able to cite them. Data (which henceforth will be a stand-in for 'library collections') must be persistent in order to be effectively accessible and reused for research. In order to amalgamate various datasets, an immense amount of time is spent standardizing the data into something that can be cross-referenced and used computationally. Although our data are unique, it does not necessarily follow that access should be equally unique and idiosyncratic. What Linked Data has provided is a framework to link disparate ideas to each other relationally. I am particularly interested in the possibilities of Linked Data as it applies to datasets that would allow one to describe contextual relationships between the data, relationships which typically are entirely use and user based. Data can be generalized in a way that is useful in multiple contexts by creating a framework that is flexible enough to accommodate data's multiplicity.

Association of Paradata:

Pulling from experience with 3D collections, functioning without standards of how to make born digital materials more usable makes interfacing with other datasets much more difficult than with other, more traditional data. For example, visual materials are much more reliant on supplemental contextual data than text.
That is not to say there is no context within textual data, but textual data can carry its context within it. Visual data usually lacks this packaged approach. Visuals are associated with text in order to provide that context. Beyond catalogues, visual data's supplemental material is separated from, and unintentionally disassociated from, the visual (think of a search result in an image database). Few image datasets are accompanied by an explanation of why the image was created. True, one can make inferences based on the basic metadata included with the object, but without intent, it is much more difficult to make a judgement about why the dataset (as generated by an API, for instance) is included and why others were not. It also makes it easier to fake or misrepresent library data/collections.

Cultural Constructs of Data:
Compounding the narrowed context of textual and numerical datasets, problematic visual datasets, and even mixed datasets, you have the social constructs that support data. This aligns very well with the work that I and a group of librarians and museum professionals are doing in association with the Digital Library Federation.
As was mentioned in the October 2004 Information Bulletin from the Library of Congress, "Because there is no analog (physical) version of materials created solely in digital formats, these so-called 'born-digital' materials are at much greater risk of either being lost and no longer available as historical resources, or of being altered, preventing future researchers from studying them in their original form." The particular focus of this remark was the preservation of born-digital data. Now that the profession, to some extent, has the ability and focus for preservation of born-digital materials, it is time to turn our eye to interoperability (as Always Already Computational does) and to the cultural context of the data itself. Consider the book The Intersectional Internet: Race, Sex, Class, and Culture Online by Safiya Noble and Brendesha Tynes (2016), which underscores "how representation to hardware, software, computer code, and infrastructures might be implicated in global economic, political, and social systems of control." Data without context is meaningless. Data with context but without social awareness is deceptively meaningless. With that deception comes, in the worst case, the use and articulation of arguments founded on a lack of understanding, and on a lack of awareness that they perpetuate ideas intrinsically linked to the creation and curation of said data. A question for this group would be: how do we attempt to preserve that context without overwhelming the user?
The Always Already Computational group can hopefully come together to attempt to solve this and other concerns regarding digital aggregate data.

References
"'Born Digital': Eight institutions and their partners received awards totaling almost $15 million from the Library to collect and preserve digital materials as part of the National Digital Information Infrastructure and Preservation Program." 2004. Library of Congress Information Bulletin. 63 (10): 202-203.
Noble, Safiya Umoja, and Brendesha M. Tynes. 2016. The intersectional Internet: race, sex, class and culture online. ISBN: 978-1-4331-3000-7.
Proffitt and Schaffner. 2008. The Impact of Digitizing Special Collections on Teaching and Scholarship: Reflections on a Symposium about Digitization and the Humanities. Report produced by OCLC Programs and Research. Published online at: www.oclc.org/programs/reports/200804.pdf

People and machines both need new ways to access digitized artifacts nonconsumptively
Ben Schmidt, Northeastern University

How can we integrate generations of high-quality, professionally-created metadata with electronic versions of the object itself? Particularly when copyright comes into play, we can't simply hope for openness; and there's a steep trade-off between the thoroughness of a well-thought-out standard and a simplicity of conception that makes a digital resource useful for (for instance) a graduate student just beginning to get interested in working with large collections.

When we digital humanities researchers say that we're working with the "full text" of a scanned book, it's usually more posturing than truth.
In fact, what datasets like the HathiTrust Research Center's Extracted Features really do is just radically transform the amount of metadata we have; instead of knowing 10 or 20 things from a MARC record (e.g., the language, four or five subject headings, the author, the publisher), we just add on an additional several thousand ("How many times does it use the word 'aardvark'? 'aardvarks'? 'abacus'?"...). All the rest of the information (even simple stuff like syntax, word order, negation) is thrown out. It's great that organizations like JSTOR and Hathi are starting to release this computationally-derived metadata. But there's no clear way to incorporate this computational metadata into a traditional library catalog. The technical demands of even downloading something like the HTRC EF set exceed both the technical competencies and computing infrastructure of most humanists. I've literally spent several weeks recently restarting downloads and identifying missing files as I try to fill up a RAID array with several terabytes of data. Processing these files into the raw material of research is even harder.

So how do we make collections accessible for work? There are two ways that libraries can take more of the burden onto themselves and distribute (non-copyright-violating) distillations of texts that provide an onramp for digital analysis within the reach of mere mortals.
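To make the point about extracted features concrete, here is a minimal sketch of what such a distillation amounts to: per-word counts with word order discarded. It is an illustration only and does not reproduce the actual HTRC Extracted Features format; the function name `extract_features` and the two sample sentences are invented for this example.

```python
import re
from collections import Counter


def extract_features(page_text: str) -> Counter:
    """Reduce a page to word counts (hypothetical helper, not the HTRC format).

    Everything except the counts (syntax, word order, negation scope)
    is thrown away, so the original text cannot be reconstructed.
    """
    tokens = re.findall(r"[a-z]+", page_text.lower())
    return Counter(tokens)


# Two invented sentences with opposite meanings but identical vocabulary.
page_a = "The aardvark did not eat the abacus."
page_b = "The abacus did not eat the aardvark."

features_a = extract_features(page_a)
features_b = extract_features(page_b)

print(features_a["aardvark"])    # how many times does it use "aardvark"?
print(features_a == features_b)  # the two pages are indistinguishable
```

Because both sentences reduce to the same counts, a researcher holding only these features cannot tell which one the page contained. That loss is what makes the distillation safe to distribute for in-copyright works, and it is also what limits the analyses that can be run on it.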
Visual Exploration
One useful and important way to work with this metadata and full text is by exposing it through visualization; this is what projects like the Google Ngrams viewer and the Hathi+Bookworm project (http://bookworm.htrc.illinois.edu/), which I've helped work on under an NEH grant, do. Patrons are able to use this combination of full text and catalog metadata to explore the shapes and contours of vast digital libraries. Since they know (sort of!) what any given word means, they can use it to understand how vocabulary changes; find anomalous, interesting, or misclassified items; or understand the limits and constraints of an entire collection, a sorely-needed form of information literacy. We've built the Bookworm platform so the advances we're making with Hathi can be used on any smaller (or larger) library, and we hope others will be interested in using it to explore their texts in the context of their metadata.

[Figure: Hathi Trust Bookworm browser]

Low-dimensional embeddings
I'd also like to put on the radar a more far-out idea that extrapolates from the current trends in the world of machine learning: the idea of a shared embedding for digital items that would allow machines to compare items across various collections, times, and artifacts. The basic idea of an embedding is to associate a long list of numbers (maybe a few hundred) with a digital object so that items that are similar have similar lists of numbers.
These are sort of the inverse of the checksums that libraries frequently associate with digital artifacts now, which are designed so that even the slightest change makes a file get a completely different number. A good embedding will do the opposite: it allows users and software to find similar items. In a single collection like Hathi, I've found in practice that even a simple embedding makes it possible to, for instance, look in the neighborhood of a book like "Huckleberry Finn" and find, in the immediate neighborhood, dozens of titles like "Collected Works of Mark Twain, vol. 8" that lack proper titles that would identify them; and in the extended neighborhood, other novels about American boys on riverboats.

Inside a collection, this makes it possible to find works with improbable metadata. (It's sadly common for the wrong scan to be associated with metadata, and this can be extremely hard to catch.) Across collections, this makes it possible to engage in the work of comparison and duplicate detection.

Perhaps the most interesting thing about embeddings of digital files is that they're not restricted to textual features. Image embeddings are just as possible as textual embeddings, as in this landscape visualization of artworks that Google recently produced.
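The contrast between checksums and embeddings can be sketched in a few lines. This is a toy under stated assumptions: real embeddings are learned by a model, whereas `toy_embedding` below is a stand-in that simply hashes word counts into a fixed-length vector. The names `toy_embedding`, `cosine`, and `DIM`, along with the sample strings, are invented for illustration.

```python
import hashlib
import math
import zlib

DIM = 256  # "a few hundred" numbers per item; the exact size is arbitrary here


def checksum(text: str) -> str:
    """SHA-256 digest: any change to the input yields an unrelated value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_embedding(text: str, dim: int = DIM) -> list[float]:
    """Hash each word into one of `dim` buckets, then scale to unit length.

    A stand-in for a learned embedding: texts that share vocabulary
    end up with similar vectors.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


original = "the boy and the runaway drifted down the river on a raft"
near_copy = "the boy and the runaway drifted down the river on a raft together"
unrelated = "quarterly report of municipal water rates with sewer bonds"

# Checksums: the near copy looks exactly as different as anything else.
print(checksum(original) == checksum(near_copy))

# Embeddings: the near copy is a close neighbor, the unrelated text is not.
print(cosine(toy_embedding(original), toy_embedding(near_copy)))
print(cosine(toy_embedding(original), toy_embedding(unrelated)))
```

The checksum answers only "is this the identical file?", while the embedding answers "what is this file near?", which is the property that neighborhood searches and duplicate detection rely on.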
Links: http://sappingattention.blogspot.com/2016/05/literary-dopplegangers-and.html; https://babel.hathitrust.org/cgi/pt?id=uc1.$b167397;view=1up;seq=7; https://artsexperiments.withgoogle.com/tsnemap/

When Google recently released half a million hours of video, they did it not as image stills but as vectorized features read by a neural network.

These features (essentially, a computer's rough summary of an artifact into a few hundred numbers) could make it possible for researchers and students to immediately engage in computational analysis without having to wade through the preparatory steps. If done according to shared standards, they could make collections interoperable in striking ways even when texts or images can't be distributed. It's probably a few years too early to set a specific embedding for different types of documents, but it is time now to contemplate what it would mean to distribute not documents themselves, but a useful digital shadow of them.
Link: https://research.google.com/youtube8m/

Repurposing Discographic Metadata and Digitized Sound Recordings as Data for Analysis
David Seubert, University of California Santa Barbara

Use of sound recordings for research has been slow to develop due to bias against sound recordings as historical documents by textual scholars, lack of descriptive data (discography), and lack of access because of restrictive copyright laws that make it difficult to digitize and provide access to collections. The use of digitized sound recordings, or of the discographic metadata about sound recordings, as data to study is underdeveloped. The UCSB Library wants to encourage scholarship of this kind using the data from the American Discography Project.

The American Discography Project, which is presently based at the UCSB Library with funding from the Packard Humanities Institute, was originally conceived as the Encyclopedic Discography of Victor Recordings by two record collectors in the early 1960s. They began a project to document every classical recording by the Victor Talking Machine Company, but eventually broadened their goal to include every Victor recording session for 78rpm discs. In 1966 they were granted liberal access to the recording files held by RCA Victor Records (now Sony Music Entertainment) and devoted many thousands of hours to compiling lists of the tens of thousands of Victor master recording sessions from around the world.
The American Discography Project and its principal product, the Discography of American Historical Recordings (DAHR), is now a research, publication, and digitization program based at the UCSB Library with the goals of documenting disc recordings made during the standard groove era (1900-1950s) by American record companies and digitizing as many as possible for online access. Much of the data about a recording (who, what, where, when) is not documented on the recordings themselves, and can only be determined by consulting a published discography or primary source documents like company recording ledgers.

Now in its fifth decade, the project has expanded beyond Victor to incorporate other published discographies and includes data on recordings made by five early 20th-century record companies (Berliner, Victor, Zonophone, Columbia and Okeh), with three more large labels (Brunswick, Decca, Edison) and several smaller ones in the pipeline.

The sheer amount of data documented in the online database is significant.
DAHR (http://adp.library.ucsb.edu/index.php) currently contains over 6.5 million data points documenting systematically and comprehensively the first 45 years of American recording history, including:

● 146,524 recording sessions
● 417,428 recording events (takes)
● 107,784 physical manifestations (discs)
● 36,767 names of performers, authors, composers
● 90 languages
● 393 recording locations

The initial project design was to document these recordings in a systematic fashion for the purposes of identification and cataloging by libraries and archives, collectors, and others: in effect, a bibliography of sound recordings. One of the further goals of the project is to encourage use of sound recordings as primary source documents by scholars in fields beyond the study of music, and as the project has grown, we have had growing success in this area. Systematically adding audio to the database has allowed scholars to study the recordings in context with authoritative data about their creation.

Sound recordings and the metadata associated with them have not been mined and analyzed the way textual archives have. As the Discography of American Historical Recordings grows in size, it is a prime candidate for manipulation and analysis as data, as it contains standardized elements including language, dates, geographic information (recording locations), genres, names, and titles.
Since the project was designed from the outset to be structured data, including authority control and standardized vocabularies for many elements, a potential and as yet unrealized reuse of the metadata as data is now possible. As participants in the National Forum, we hope to be able to further conceptualize how this can be best realized.

The Library as Virtual Reality: A Worldbuilding Approach
Laila Shereen Sakr, University of California Santa Barbara

The process of considering digital library collections as data points relies on logics similar to those foundational to the development of virtual reality (VR). Imagine the library as a VR film or as a computer, temporally and spatially. If the goal of the “Always Already Computational: Library Collections as Data” project is to find a common framework among librarians, curators, and researchers that makes digitally-born scholarship possible, I would like to suggest considering speculative design methodologies, or what Alex McDowell has described as worldbuilding.

Alex McDowell, a deeply influential designer, has shifted how we think about design by fundamentally changing the role design plays in the creative process, potentially altering audiences’ expectations of creative work that ranges from architecture to computer games.
Drawing on the literary metaphor “worldbuilding” to explain his approach to design, McDowell’s methods represent a cultural shift in his industry’s production process. Speculating about what the world “might” look like in the future is easy. More challenging, though, is realizing that speculative vision through the design process. McDowell’s work realizing a future-world inspired by Philip K. Dick’s novella in the 2002 film Minority Report is emblematic of a transformation in design process that is made possible through the use of computational media. On Minority Report, McDowell led his production design team, which began as a largely analog art department, through a transition in which they became the first fully-digital art department in the film industry, an example that many other design departments would soon follow and that foreshadowed a broader cultural shift in creative process.

Most of the film’s audience will probably remember the gestural interface of the 3D screens used by the agents in the department, speculative designs that, in turn, have influenced actual technologies ranging from Apple’s iPad to Microsoft’s Kinect. However, Minority Report’s influence in design reached an even wider array of design cultures, including biometrics (particularly retinal scanning), through other imagined technologies woven throughout the film’s environment and plot.
In other words, McDowell’s worldbuilding integrates interdisciplinary humanistic, scientific, and design inquiry with emerging forms of computational media to fundamentally alter the film production process, blurring boundaries between physical and virtual environments and the distinctions between film and other media forms. In the digitally designed world of Minority Report, props could be modeled first as two-dimensional images and later as three-dimensional physical objects. Then, through computer-controlled milling, those models could be used to create final props by sculpting and mold-making. Bringing direction, cinematography, and design together in the virtual space of the pre-visualization stage, props, actors, and the created world interacted throughout the production process. As a result, Minority Report and McDowell’s worldbuilding process signaled a transformation in design culture that has not yet fully played out.

One approach to worldbuilding builds upon a procedure of information design that moves from archiving, to visualizing, to rationalizing, and then to governing. This process must take into account matters of scale. Taking from both information design and game design, worldbuilding relies on several distinct visual perspectives. One involves drawing a complete world map and filling in as much information as possible, then running the game and letting the players explore that world. This visual perspective operates on a large scale.
Another perspective begins within a specific town, city, place, or room, and as players explore, more and more of the world is revealed. These are some basic guidelines to consider as one conceptualizes building a virtual world of data.

Applying this theoretical framework to a process of speculative design for future library collections could yield interesting results. The practice and ideas of worldbuilding, in McDowell’s definition, are a clear example of interdisciplinary work connecting the arts, design, media-focused computer science, and elements of the humanities and social sciences. Worldbuilding is both the creation of media and a design research practice, and in neither case is its interdisciplinarity a luxury, because the work simply must engage multiple disciplines in order to achieve a coherent vision and to push many fields forward.

The struggle for access
Tim Sherratt, University of Canberra

For me, exposing cultural heritage collections to computational methods raises difficult, important, and interesting questions about the nature of ‘access’ itself. So while we can and should develop best-practice guidelines, I think we should also admit that we will never be, should never be, satisfied with what cultural institutions deliver. We will always want something more. And that’s a good thing.

I’ve spent far too much of my life hacking the web interfaces of libraries and archives in the pursuit of useful data.
But while I would gladly take the time back, I recognise the value of the struggle. Processes such as screen-scraping and normalisation are often frustrating, but they do at least make you think about the processes by which the data was created, managed, and shared.

So for me, one of the key questions is how we expose data to facilitate the use of computational methods while preserving some of the difficulties and irregularities (the chisel marks in the smooth worked surface) that remind us of its history and humanity.

I’m not sure whether this is a metadata question, or a matter of how we frame the relationship between researcher and institution. If we think of machine-actionable data as a product or service delivered by institutions, then researchers are cast as clients or consumers. But if each dataset is not a product, but a problem, then we open up new spaces for collaboration and critique.

I’ve started to realise that I have very little interest in statistics, or even data visualisation as I understand it. I use computational methods to manipulate the contexts of cultural heritage collections. Sometimes this results in useful tools or interfaces, sometimes it’s more akin to art. I’m motivated by the simple desire to see things differently, to poke at the boundaries and limits of systems in the hope that something interesting happens.

What seems to happen fairly regularly is that I find where the systems are broken.
For example, while harvesting debates from the Australian parliament’s online database, I discovered about 100 sitting days were missing. This sort of thing happens with complex systems, and the staff at the Parliamentary Library have now fixed the problems. For me, it’s an example of the fact that we can never simply accept what we’re given: search interfaces lie, and datasets have holes. But it also shows that once you open up channels for the transmission of data, information flows both ways.

We can’t talk about the need for institutions to provide computation-ready data without considering what they might get in return. The struggle for access might not always be comfortable, but it can be productive. If data is a problem to be engaged with, rather than a service to be consumed, then we can see how researchers might help institutions to see their own structures differently. On a practical level, how might we make it easier for institutions to re-ingest the features and derivative structures identified through use?

I’m also a bit suspicious of scale. Big solutions aren’t always best. Large data dumps are great for researchers with adequate computing power and resources, but APIs support rapid experimentation and light-weight interventions. Similarly, while articulating best practice for computation-ready data we shouldn’t lose sight of other ways data can be exposed.
I want hackable websites as well as downloadable CSVs. All that basic stuff like persistent URLs, semantic HTML, and maybe a sprinkle of RDFa or JSON-LD enables data to be discovered everywhere, not just in a designated repository.

As I said, we will always want more. Access will never be open and the job will never be done. We need systems, protocols, guidelines, and collaborations that remind us there is always more to do, and offer the support to continue.

Implications for the Map in a 'Collections as Data' Framework
Tim St. Onge, Library of Congress

I am approaching the challenge of developing computationally amenable digital library collections from the perspective of a digital cartographer and geospatial analyst. My work for the Library of Congress as a cartographer primarily involves digital map-making and the analysis of born-digital and made-digital geographic information and maps to serve Congressional research requests. My academic and professional backgrounds are based in geographic information science (GIS) rather than in library science. However, I am often thinking about how the Library of Congress can best serve our collections to meet the research and access needs of geographers in a digital age.
All of this is to say that my initial thoughts on developing a “library collections as data” framework are largely shaped by the implications for one type of collection material in particular: the map.

There is enormous potential for the computational analysis of historic maps en masse, with methods that are both text-based (e.g. extracting written text to create gazetteers of place names from certain time periods, cultures, languages, etc.) and image-based (e.g. extracting map features based on groupings of image pixel values of similar color) (Chiang, Leyk & Knoblock 2014). For the full integration of historic maps into Geographic Information Systems, processes like georeferencing and feature digitization, which have achieved varying levels of automation potential, must be completed. It is my view that georeferenced versions of scanned maps in library collections are highly appreciated among researchers and should be more standard “collections as data” offerings from libraries. The georeferenced map viewer created by the National Library of Scotland (2017) demonstrates the tremendous value of this type of data offering.

Given the unique challenges of offering historic maps as computationally amenable collections, I admire the objective of the Always Already Computational project to conceive of a “collections as data” framework that is multimedia in scope and not only concerned with text analysis of written works (as critically important and valuable as this is).
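As a rough illustration of the image-based approach mentioned above (grouping pixels of similar color), the sketch below marks every pixel of a tiny invented "scan" that lies close to a target color. It is a toy, not the method of Chiang, Leyk & Knoblock: the color values, the tolerance, and the names `color_distance` and `extract_by_color` are all assumptions made for the example, and a real scanned map would need to handle fading, paper tone, and overlapping symbols.

```python
import math


def color_distance(a: tuple, b: tuple) -> float:
    """Euclidean distance between two RGB triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def extract_by_color(pixels, target, tolerance):
    """Return a True/False mask marking pixels within `tolerance` of `target`."""
    return [
        [color_distance(px, target) <= tolerance for px in row]
        for row in pixels
    ]


# Hypothetical colors: a blue used for water and the tone of the paper.
WATER_BLUE = (70, 130, 200)
PAPER = (235, 225, 200)

# A 3x3 stand-in for a scanned map; each cell is an (R, G, B) pixel.
scan = [
    [PAPER, PAPER, (72, 128, 205)],
    [PAPER, (68, 133, 198), (75, 126, 196)],
    [(40, 40, 40), PAPER, PAPER],
]

mask = extract_by_color(scan, WATER_BLUE, tolerance=15)
water_pixel_count = sum(cell for row in mask for cell in row)

for row in mask:
    print(row)
print(water_pixel_count)  # 3 pixels fall within the tolerance
```

The resulting mask is only a candidate feature layer. Turning it into something a GIS can use would still require the georeferencing and feature digitization steps described above.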
In my reading of the “Statement of Need” from the Always Already Computational scope of work document, I interpret the four major current problems of computationally amenable collections to be (1) the lack of a common collections-transformation framework across institutions, (2) a lack of solutions for non-text media, (3) technical inadequacies in providing collections at large scale, and (4) the absence of a data reuse paradigm for collections.

In addressing the first and second problems, I look forward to hearing more on the needs of computational researchers who are working with image-based collections, including, but not exclusively, scanned and digitized maps. In this needs assessment more broadly, in an abstract way, I imagine a hierarchy of use cases and analysis tools. Towards the top are elements that are most readily shared among all kinds of library collections (e.g., all collection items have metadata files in a standard format; all text-based or text-extracted items could undergo analyses like frequency visualization or topic modeling). Towards the bottom are more medium-specific elements (e.g., only scanned maps are concerned with georeferencing and geographic projections). In laying out the strongest commonalities among researcher needs in working with library collections, perhaps a framework can be developed that addresses the greatest, unifying needs of collection patrons across diverse uses in the digital humanities and other disciplines.
Furthermore, I hope that this framework highlights the unique and worthy challenges of devising solutions for researchers of non-text media.

The third problem, providing collections at a large scale, is certainly a critical concern for computational research. If access to collection items is limited to one-by-one downloads or deliveries of physical DVDs of data, the “data acquisition” phase alone can be burdensome enough to slow or stop computational analyses before they even begin. The challenges of large-scale collection access appear to be technological and, as is often the case for libraries and the digital humanities, budgetary. The methods of access detailed in the Always Already Computational scope of work document demonstrate the wide variability among different institutions. I am interested to hear from project participants on the merits of these methods from their experience and what technical and budgetary considerations should be made in the process of developing best practices on this issue.

On the fourth problem, the data reuse paradigm, I believe this issue involves not only technological hurdles, but policy ones as well. Simply put, when researchers or patrons more broadly want to give back to libraries, libraries should trust them.
For example, this can take the form of an online crowdsourced georeferencing tool that allows users to georeference scanned maps from a library collection and share them back to the library, which then shares that resource universally as a GIS-ready raster image (Fleet, Kowal, & Přidal 2012). Another example would be for libraries to host hackathons and other events that invite researchers to interrogate their collections as data and present their findings, thereby allowing libraries to learn what kinds of computational research can (or cannot) work with their collections. I believe the Archives Unleashed series, which focuses on web archive research, is a great model for this kind of project (Weber 2016). Any frameworks arising from the Always Already Computational project should encourage these kinds of “data sandbox” projects, which allow for experimentation that reveals new insights into the computational analysis of collections as data and which provide derived content and research directly back to libraries.

I look forward to learning from the diverse array of participants and contributing my insights to the Always Already Computational initiative.

Works Cited

Chiang, Y., Leyk, S., & Knoblock, C. A. (2014) A survey of digital map processing techniques. ACM Computing Surveys, 47 (1), Article 1 (April 2014), 44 pages. Retrieved from http://usc-isi-i2.github.io/papers/chiang14-acm.pdf

Fleet, C., Kowal, K. C., & Přidal, P. (2012) Georeferencer: Crowdsourced Georeferencing for Map Library Collections. D-Lib Magazine, 18 (11/12). Retrieved from http://www.dlib.org/dlib/november12/fleet/11fleet.html

National Library of Scotland (2017) View maps overlaid on a modern map / satellite image. Retrieved from http://maps.nls.uk/geo/explore/

Weber, M. S. (2016) Archives Unleashed! Collections as Data | September 27, 2016 | Library of Congress. Retrieved from http://digitalpreservation.gov/meetings/documents/dcs16/3_Weber_Archives%20Unleashed.pdf

Considering the user
Santi Thompson, University of Houston

As the forum unfolds, I would encourage participants to question and expand our assumptions about those who (re-)use computational library collection data. In my mind, the identities of users and their motivations for coming to the digital library are just as important to understand as the technical requirements needed to re-use data in interoperable and collaborative ways.
Knowing your users helps cultural heritage professionals, among other things, to better select content for the future, market the resources and collections available to them, and understand how to describe and make content available to others.[1]

I was pleased to see that the proposal for Always Already Computational acknowledges the user to some degree, noting that current digital library infrastructure and digital collection paradigms do "not meet the needs of the researcher, the student, the journalist, and others who would like to leverage computational methods and tools to treat digital library collections as data." As such, part of our forum objectives will be to draft potential user stories and “to apply [data definitions and concepts] to a range of potential user communities.” I find this to be incredibly important because libraries (and most likely other types of cultural heritage organizations) have not spent a vast amount of time asking and publishing on “who is a digital library user.”

My own research has focused in some narrow ways on better understanding digital library users.
My collaboration with other members of the DLF Assessment Interest Group’s User Studies Working Group has found that the assessment of digital library reuse is complicated for a whole host of reasons, including the profession’s inability to systematically identify and understand digital library users.[2] Additional research I have done with a co-author suggests that digital library users (note: NOT users of computational data) are more frequently (1) from outside of academia and (2) reusing digital library content for a wide array of non-scholarly pursuits.[3]

I find Always Already Computational to be an exciting opportunity to address major gaps in our current understanding of what a digital library collection is and how it is being used by targeted audiences. While I recognize that demystifying the digital library user is not the primary pursuit of this national forum, I look forward to discussing this as well as other important aspects of the grant with a deeply knowledgeable and inspiring group of participants. I appreciate the opportunity to contribute to such a discussion.

Works Cited

[1] For more on how understanding users and reuses can inform digital library management, see my work with Michele Reilly: “Understanding Ultimate Use Data and Its Implication for Digital Library Management: A Case Study,” The Journal of Web Librarianship 8 (2) (2014): 196-213. DOI: http://dx.doi.org/10.1080/19322909.2014.901211

[2] In 2015 the User Studies Working Group drafted a white paper, “Surveying the Landscape: Use and Usability Assessment of Digital Libraries,” that explored the state of research around three assessment topics: user/usability studies, return on investment, and content reuse. A copy can be found here: https://osf.io/uc8b3/

[3] See Reilly and Thompson, “Understanding Ultimate Use,” and Michele Reilly and Santi Thompson, “Reverse Image Lookup: Assessing Digital Library Users and Reuses,” The Journal of Web Librarianship (2016): 1-13. DOI: http://dx.doi.org/10.1080/19322909.2016.1223573

Building Institutional and National Capacity for Collections as Data
Kate Zwaard, Library of Congress

About a year ago, the Library of Congress created a new division, National Digital Initiatives (NDI), which I am proud to lead. Our mission is to maximize the benefit of the digital collection, to incubate innovation, and to encourage national capacity for digital cultural memory.

In a recent New Yorker article, the Librarian of Congress said she wants the Library of Congress “to get to the point where there’ll still be a specialness, but I don’t want it to be an exclusiveness. It should feel very special because it is very special.
But it should be very familiar.” [1] We in NDI take that message to heart. We believe that an important step in getting users to engage with the Library’s digital material and staff is to provoke, explore, tell stories, and invite.

Our vision is for NDI to help libraries and patrons explore the edges of possibility. To try things ourselves and share with the profession. To help highlight the treasures we have, here at the Library of Congress and in our nation’s cultural heritage institutions, and spark people’s imagination around the potential uses of digitized or born-digital collection objects. To encourage the curious and help them get answers. To help people understand what a library is.

Upon our founding, the director of National and International Outreach said “It’s not enough anymore to just open the doors of this building and invite people in. We have to open the knowledge itself for people [to] explore and use.” [2]

A few things we’ve been working on:
● We organized “Collections as Data,” [3] a conference devoted to exploring what’s possible using computation with digital collections.
● We hosted an Archives Unleashed hackathon, bringing together programmers, librarians, and scholars looking at computational analysis of web archives collections. [4]
● We performed a digital lab proof of concept along with a report exploring how to deliver Library of Congress digital collections as data to on-site researchers. [5]
● We hosted a Software Carpentry Workshop [6] to help teach Library of Congress librarians and others in the neighborhood how to use code to manage and analyze digital collections.
● We’ve started a series of sample code notebooks to help people work with Library of Congress data. [7]

My background is in software development. Before this job, I ran the Repository Development group [8] at the Library of Congress and before that I worked on creating digital preservation software solutions for the Government Publishing Office. My perspective is a very practical one. Institutions have spent a lot of time, effort, and money on digitizing collections and establishing policies and infrastructures around a model of access that mimics analog models. Transforming the technology, staff, and practice to accommodate data analysis is a second paradigm shift that will be just as difficult. For many knowledge institutions, funding is decreasing and becoming less secure while the volume and complexity of digital information is multiplying and the commitment to analog collections remains. In my view, the only way forward is together:
● Leverage connections with physical sciences, social sciences, and journalism. Work together on tooling and training.
● Highlight digital scholarship projects with easy-to-understand outcomes to make the case beyond academia.
● Support distributed fellowship models (NDSR) for building digital stewardship curation skills and building skills for doing digital research.
● Create train-the-trainer programs to help scholars understand what’s possible using computation.
● Get content, methodologies, and tools to K-12 educational audiences.
● Explore legal, cultural, and privacy review models to guide researchers using novel digital content, like a light-weight IRB.
● Provide space and time for experimentation.

The Library of Congress “preserves and provides access to a rich, diverse and enduring source of knowledge to inform, inspire and engage you in your intellectual and creative endeavors.” [9] We are thrilled to be a part of this exciting conversation, and look forward to working together.

Works Cited

[1] “The Librarian of Congress and the Greatness of Humility” by Sarah Larson. The New Yorker. February 19, 2017 http://www.newyorker.com/culture/sarah-larson/the-librarian-of-congress-and-the-greatness-of-humility
[2] “Data and Humanism Shape Library of Congress Conference” by Mike Ashenfelder. The Signal. October 21, 2016 http://blogs.loc.gov/thesignal/2016/10/data-and-humanism-shape-library-of-congress-conference/
[3] “Collections as Data Report Summary” by Jaime Mears. The Signal. February 15, 2017 http://blogs.loc.gov/thesignal/2017/02/read-collections-as-data-report-summary/
[4] “Co-Hosting a Datathon at the Library of Congress” by Jaime Mears. The Signal. July 21, 2016 http://blogs.loc.gov/thesignal/2016/07/co-hosting-a-datathon-at-the-library-of-congress/?loclr=blogsig
[5] “Library of Congress Lab: Library of Congress Digital Scholars Lab Pilot Project Report” by Michelle Gallinger and Daniel Chudnov. December 21, 2016 http://digitalpreservation.gov/meetings/dcs16/DChudnov-MGallinger_LCLabReport.pdf
[6] Software Carpentry at the Library of Congress https://oulib-swc.github.io/2017-02-15-loc/
[7] data-exploration GitHub page https://github.com/LibraryOfCongress/data-exploration
[8] “Yes, the Library of Congress Develops Lots of Software Tools” by Leslie Johnston. The Signal. August 16, 2011 https://blogs.loc.gov/thesignal/2011/08/yes-the-library-of-congress-develops-lots-of-software-tools/
[9] “About the Library” https://www.loc.gov/about/

Appendix 7: Forum Summaries

Forum 1: March 1-3, 2017 | Santa Barbara, California

The first forum was a gathering of key stakeholders, practitioners, thought leaders, and scholars currently working with collections as data. Each participant was asked to prepare a position statement in advance of the forum to help frame the discussion. Forum sessions were a mixture of group discussions, presentations, and small group work using human-centered design techniques. Activities were designed to document current practice, surface problems, and generate new ideas and approaches for collections as data work. Although crafting a joint framework and strategic direction for collections as data was an initial goal of the forum, this ultimately proved not to be achievable because of the multiplicity of techniques, approaches, and user needs for collections as data. Instead, forum participants crafted the Santa Barbara Statement, which represented a consolidation of the major themes of the forum. These included the complexity of the collections as data landscape, particularly the wide range of consumers and use cases; questions of scalability; open access solutions; ethical concerns; and partnerships.

Agenda

March 1
8:30 Breakfast
8:45 Welcome & Introductions
9:15 Project scope overview
     Thomas Padilla
9:45 Project Outcomes -- focused group discussion
10:15 Break
10:30 Collections as Data Panel -- Existing implementations
     Miriam Posner (UCLA), Harriett Green (UIUC), Tim Sherratt (University of Canberra), Mia Ridge (British Library), Jefferson Bailey (Internet Archive), Gabrielle Foreman (University of Delaware)
11:45 Idea Generation
     Discussion: You each came with a set of collections as data related ideas, expressed in part through your position statements. You are in a group of people with a range of experiences. During this time we would like you to work to align your experiences to generate ideas that hold the potential to push collections as data work forward. We ask that you focus your discussion on enumerating as many ideas as possible. We do not expect you to create detailed roadmaps whereby these ideas might be pursued. This conversation is purely geared toward getting as many of our ideas to the surface as possible.
12:00 Working Lunch
1:30 Sharing
2:00 Break
2:15 Play it Out
     Discussion: How might some of the ideas you generated be implemented?
3:45 Break
4:00 Sharing
4:30 Reflection
     Discussion: This afternoon you spent time reflecting on your collective position statements and discussing ideas that push collections as data work forward. We now ask you to spend a few moments in your focused group to critique all of the ideas that were generated. What do you think are particularly good or useful ideas? What might be easy to implement? What problems and pitfalls exist?
5:00 Set Stage for Day 2

March 2
8:30 Breakfast
9:00 Gather Data
     Activity: In this exercise, you will rely on each other as a sort of “focus group” to gather a set of data about how you engage with collections as data. Please answer the following questions as a group, recording your answers in this document. You may choose which set of questions is most relevant to your perspective in creating/manipulating/consuming collections as data. Some groups may choose to answer from multiple perspectives. We will build upon this data for the remainder of the day.
10:00 Break
10:15 Story Generation
     Activity: Using the data gathered earlier this morning from all of the groups and your own personal experiences, write 2-3 user stories.
11:45 Lunch
1:00 Story Review and Critique
     Activity: Examine and refine the use cases generated by another group.
2:00 Prototyping
     Activity: Using the best or most interesting ideas from the story generation activity, design a product, system, service, curriculum, etc., that meets the needs of one or more people you choose from the stories. You will be presenting your prototype idea using a Concept Poster. Be sure to consider the effect of your solution on other stakeholders to demonstrate viability and impact.
3:00 Break
3:15 Share prototypes
4:00 Discussion: Implications for Libraries
5:00 Review of Day

March 3
8:30 Breakfast
9:00 Discussion: Absences
9:30 Statement Creation
10:45 Break
11:00 Engagement
11:45 Closing remarks

Attendees

John Ajao: Director, Systems and Repository Operations, University of California Santa Barbara
Jefferson Bailey: Head of Web Archiving Programs, Internet Archive
Alex Chassanoff: Software Curation Postdoctoral Fellow, Massachusetts Institute of Technology
Tanya Clement: Assistant Professor of Information, University of Texas Austin
P. Gabrielle Foreman: Professor of English and Black American Studies, University of Delaware
Daniel Fowler: Developer Advocate, Open Knowledge Foundation
Harriett Green: English and Digital Humanities Librarian, University of Illinois at Urbana Champaign
Jennifer Guiliano: Assistant Professor of History, Indiana University-Purdue University Indianapolis
Julie Hardesty: Metadata Analyst, Indiana University
Christina Harlow: Metadata Librarian, Cornell University
Greg Jansen: Research Software Architect, University of Maryland
Lisa Johnston: Research Data Management/Curation Lead & Co-Director University Digital Conservancy, University of Minnesota
Matthew Lincoln: Data Research Specialist, Getty Research Institute
Alan Liu: Distinguished Professor of English, University of California Santa Barbara
Matthew Miller: Head of Semantic Applications & Data Research, New York Public Library
Anna Neatrour: Metadata Librarian, University of Utah
Miriam Posner: Digital Humanities Coordinator, University of California Los Angeles
Sheila Rabun: Community and Communications Officer, Stanford University
Mia Ridge: Digital Curator, British Library
Laila Sakr: Assistant Professor of Film and Media Studies, University of California Santa Barbara
Ben Schmidt: Assistant Professor of History, Northeastern University
David Seubert: Curator of Performing Arts Collections, University of California Santa Barbara
Tim Sherratt: Associate Professor of Digital Heritage, University of Canberra
Hannah Scates Kettler: Digital Humanities Librarian, University of Iowa
Timothy St. Onge: Cartographer, Library of Congress
Santi Thompson: Head of Digital Research Services, University of Houston
Kate Zwaard: Head of National Digital Initiatives, Library of Congress

Forum 2: May 7-8, 2018 | Las Vegas, Nevada

After spending a year at conferences, workshops, and seminars talking about what collections as data is, we held a second national forum focused on the nuts and bolts of collections as data work, particularly how communities interested in getting started with collections as data work could move forward. The first day of the forum focused on current implementations and how a variety of consumers, from librarians to scholars to the general public, interacted with collections as data resources. This section of the forum was livestreamed and received over 400 live and subsequent views. As in the first forum, the variety of these collections as data implementations once again demonstrated that the collections as data landscape is complex and no one set of solutions will be feasible or even appropriate for everyone. Forum participants then focused on reality checks of Always Already Computational deliverables based on their own experiences with collections as data.

Agenda

Monday, May 7
8:30 Breakfast
8:45 Dean Welcome & Introductions
9:00 Project Update
     Thomas Padilla
9:30 Panel 1: Who is Collections as Data for?
     Who is Collections as Data for? Building on Principle 5, the forthcoming version of the CaD Santa Barbara Statement will assert that "Collections as data designed for everyone serve no one." How has your work with CaD been forged around specific people, whether those represented in the collections, built into the design of the dataset, or reflected in your own teaching and/or learning? What work have you done to match CaD with populations?
     Dot Porter (UPenn), Shawn Averkamp (NYPL), Bergis Jules (UC Riverside)
10:30-10:45 Break
10:45 Panel 2: What is the coolest thing about your Collections as Data work?
     What is the coolest thing about your collections as data work? Tell us why you became involved with this work and what motivates your continued dedication or interest. We'd like to show our attendees the spirit and possibilities of collections as data work.
     Micki Kaufman (CUNY), Inna Kouper (Indiana), Greg Cram (NYPL), Laurie Allen (UPenn)
11:45-12:00 Break
12:00 Panel 3: How have you implemented Collections as Data?
     Viewers of our livestream are likely interested in how they might participate in or grow collections as data. How have you started, shifted, or institutionalized collections as data? How do you see this work aligning with your institutional/organizational mission? What surprised you about the process, and what do you plan or hope to do next?
Meghan Ferriter (LOC), Mary Elings (UC Berkeley), Helen Bailey (MIT),  Veronica Ikeshoji‑Orlati (Vanderbilt)  1:00 Lunch  2:00 Introducing the Guide   2:30 Reality Check on Project Deliverables ‑‑ group‑based discussion and activities  All   5:00 Break for dinner ‑  On your own    Tuesday, May 8  8:30 Breakfast  9:00 Future Directions: Moving Stuff Forward ‑‑ group‑based discussion  All  11:45 Wrap up  Thomas Padilla  12:00 End ‑  Box lunch provided  163  https://collectionsasdata.github.io/statement/ https://collectionsasdata.github.io/statement/ 5/22/2019 aac_finalreport - Google Docs https://docs.google.com/document/d/12qb6J5dSMcQ0Rt2JE0zAnF_-BpRY04PEMLL8U_Qrjf8/edit# 164/180 Attendees  Elvia Arroyo Ramirez
  Assistant University Archivist  University of California Irvine  Micki Kaufman
  City University of New York  Shawn Averkamp  Manager of Metadata Services  New York Public Library  Inna Kouper
  Assistant Scientist, School of Informatics,          Computer, and Engineering  
Assistant Director, Data to Insight Center Indiana              University  Helen Bailey  
Engagement Data Engineer  Massachusetts Institute of Technology  Mark Matienzo  
Collaboration & Interoperability Architect  Stanford University  Alex Chassanoff
  CLIR/DLF Postdoctoral Fellow in Software          Curation  Massachusetts Institute of Technology  Jake Orlowitz
  Head of The Wikipedia Library  Wikimedia Foundation    Kalani Craig
  Clinical Assistant Professor, Department of          History  
Co‑Director, Institute for Digital Arts &            Humanities   Indiana University  Sarah Patterson
  Lecturer, Department of English  University of Massachusetts Amherst
  Co‑Founder, Colored Conventions  Greg Cram  
Associate Director of Copyright and Information            Policy  New York Public Library  Dot Porter
  Curator of Digital Research Services University of              Pennsylvania  Mary Elings  Head of Technical Services  The Bancroft Library, University of California            Berkeley  Chaitra Powell
  African American Collections and Outreach          Archivist   University of North Carolina Chapel Hill  Meghan Ferriter
  Senior Innovation Specialist   Library of Congress  Chela Scott Weber
  Director of Library and Collections  California Historical Society  164  5/22/2019 aac_finalreport - Google Docs https://docs.google.com/document/d/12qb6J5dSMcQ0Rt2JE0zAnF_-BpRY04PEMLL8U_Qrjf8/edit# 165/180 Devin Higgins
  Digital Library Programmer  Michigan State University  Hannah Scates Kettler  
Digital Humanities Research and Instruction          Librarian  University of Iowa  Veronica Ikeshoji‑Orlati
  CLIR Postdoctoral Fellow  Vanderbilt University  Laura Wrubel  Software Development Librarian  George Washington University  Bergis Jules
  University and Political Papers Archivist
  University of California Riverside

Appendix 8: Conference engagements, 2017‑2018

Conference engagements served as a way to expand the conversation beyond the two national forums. With limited funds, the project team chose to host mini‑forums with user groups not represented at the national forums. These sessions gathered further use cases and critiques of our assumptions, and re‑emphasized the diversity of experience, capacity, and needs.

2017

LDCX (March 27‑29, Stanford, California)
● Thomas Padilla, Hannah Frost, “Supporting End User Computation / Use of Collections” (1 hour unconference session)

Csvconf (May 2‑3, Portland, Oregon)
● Laurie Allen (keynote)

Texas Conference on Digital Libraries (May 23‑25, Austin, Texas)
● Sarah Potvin, “Almost Already Computational: An Update from the Library Collections as Data Effort” (poster)

Association of College and Research Libraries Digital Humanities Interest Group webinar (June 26, online)
● Thomas Padilla, “What Does it Mean: Library Collections as Data” (3 speakers, 60 minute panel)

American Library Association (June 26, Chicago, Illinois)
● Laurie Allen, “New Kinds of Collections: New Kinds of Collaborations,” on panel for “Creating the Future of Digital Scholarship Together: Collaboration from Within Your Library” (3 projects, 90 minute panel)

Society of American Archivists (July 23‑29, Portland, Oregon)
● Alexandra Chassanoff, Thomas Padilla, and Elizabeth Russey Roke, “Open Forum ‑ Always Already Computational: Collections as Data” (75 minutes)

Digital Humanities (August 7, Montreal, Quebec)
● Sarah Potvin, Thomas Padilla, Laurie Allen, Stewart Varner, “Shaping Humanities Data” (full‑day preconference symposium)

DLF eResearch Network (August 9)
● Thomas Padilla, “Collections as Data” (60 minutes)

Digital Library Federation (October 23‑25, Pittsburgh, Pennsylvania)
● Thomas Padilla, “Collections as Data: An Update” (7 minutes)
● Thomas Padilla, Laurie Allen, Stewart Varner, Elizabeth Russey Roke, Hannah Frost, Sarah Potvin, “Collections as Data Workshop” (2 hours)

Samvera Connect (November 9, Salt Lake City, Utah)
● Hannah Frost, “Collections as Data and Samvera” (50 minutes)

Coalition for Networked Information (December 11, Washington, DC)
● Thomas Padilla, Laurie Allen, Hannah Frost, “Always Already Computational: Collections as Data” (1 hour)

2018

American Historical Association (January 4, Washington, DC)
● Laurie Allen, Stewart Varner, “Collections as Data,” in workshop on “Getting Started in Digital History 2018” (4 hour workshop)

National Institute for Computer‑Assisted Reporting (March 10, Chicago, Illinois)
● Thomas Padilla and Laurie Allen, “Cultural heritage data? Computational use, needs, and opportunities” (60 minutes)

Digital Public Library of America Annual Members Meeting (March 13‑14, Atlanta, Georgia)
● Elizabeth Russey Roke, “DPLA as Data: Collections as Data in Practice” (90 minute workshop)

LDCX (March 26‑28, Stanford, California)
● Hannah Frost and Kate Lynch, “Collections as Data” (60 minutes)

Los Angeles Arts Datathon (April 27, Los Angeles, California)
● Thomas Padilla, “Collections as Data x Arts as Data” (keynote)

DH + Libraries, Sidney Harman Center for Polymathic Studies, University of Southern California (April 30, Los Angeles, California)
● Thomas Padilla, “On a Collections as Data Imperative” (60 minutes)

Society of American Archivists (August 12‑18, Washington, DC)
● Elizabeth Russey Roke, "Collections as Data," Electronic Records Section Meeting (45 minute discussion)

Open Repositories (June 4‑7, Bozeman, Montana)
● Hannah Frost and Sarah Potvin (Moderators), Mark Jordan, Katherine Lynch, Helen Bailey, “Enabling Computational Access at Scale: Are Repositories Serving Collections‑as‑Data?” (45 minutes)

HILT (June 4‑8, Philadelphia, Pennsylvania)
● Thomas Padilla and Mia Ridge, “Collections as Data” (week‑long workshop)

Dariah Beyond Europe Workshop at Library of Congress (October 3‑4, Washington, DC)
● Laurie Allen, Stewart Varner, “Collections As Data: Digital Collections for Emerging Research Methods” (keynote + 2 hour workshop)

Digital Library Federation (October 15‑17, Las Vegas, Nevada)
● Thomas Padilla, Stewart Varner, Hannah Frost, Elizabeth Russey Roke, Sarah Potvin, “Always Already Computational, Never Quite Automatic: Towards a Collections as Data Framework” (55 minutes)
● Sarah Potvin, Thomas Padilla, Santi Thompson, Liz Woolcott, Amanda Rust, Giordana Mecagni, “What would the ‘community’ think? Three grant‑funded teams reflect on defining community and models of engagement” (55 minutes)

Appendix 9: Digital Humanities 2017 preconference: Shaping Humanities Data

Description

How can cultural heritage institutions develop and provide access to collections that are more readily amenable to computational use? How does a movement toward thinking about collections as data prompt an opportunity to reframe, enrich, and/or contextualize collections in a manner that expands use while avoiding replication of bias inherent in collection practice? The Collections as Data project presents Shaping Humanities Data as a venue to explore these questions at Digital Humanities 2017. Shaping Humanities Data features eleven talks and five demonstrations. Talks and demonstrations were solicited through a CFP and reviewed by an international program committee. The event also includes opportunities for discussion and workshopping Collections as Data frameworks. The workshop will inform the development of recommendations that aim to support cultural heritage collections as data efforts.
Schedule

August 7, 2017

9:30‑10:00
● Introductions, Schedule, Project Update

10:00‑10:50
● Reusable Computational Processing of Large‑Scale Digital Humanities Collections (Marciano and Jansen)
● MARCing the Boundary: Reusing Special Collections Records through the Early Novels Database (Kashyap and Van Tine)
● Leveraging Core Data for the Cultural Heritage of the Medieval Middle East (Schwartz)

11:00‑12:00
● Lessons learned through the Smelly London project (Leem)
● Historical Public Health Data Curation: Indiana State Board of Health Monthly Bulletin Project (Pollock and Coates)
● Javanese Theatre as Data (Varela)
● High Performance Computing for Photogrammetry Made Easy (Dombrowski, Gniady, Simpson, Meredith‑Lobay)

1:00‑1:45
● Using IIIF to answer the Data Needs of Digital Humanists (Di Cresce)
● Demonstrating A Multidisciplinary Collections API (Almas and Baumgardt)

1:45‑2:45
● Collections as Data Workshopping

3:00‑3:50
● Umbra Search as Data: A digital sandbox to cross the digital divide (Marcus)
● Audio Analysis for Spoken Text Collections (Clement and McLaughlin)

4:00‑4:50
● Facilitating Global Historical Research on the Semantic Web: MEDEA (Tomasek and Vogeler)
● Mending the Vendor: Correction and Exploratory Augmentation of Collections as Data (Locke)
● Learning through Use: A case study on setting up a research fellowship to learn more about how one of our collections works as computationally amenable data (Severson and Vejvoda)
● Addressing Copyright and IP Concerns when using Text Collections
as Data (Senseney,                        Dickson, and Tracy)  ● Libraries as Publishers of a New Bibliographical Unit (Claeyssens)  4:50‑5:30  ● Wrap‑Up  Program Committee members    Harriett Green, University of Illinois at Urbana Champaign  Inna Kizhner, Siberian Federal University  Alberto Martinez, Colegio de México  Ian Milligan, University of Waterloo  Gimena Del Rio Riande, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)‑ University                          of Buenos Aires  Laurent Romary, Inria and DARIAH  Henriette Roued‑Cunliffe, University of Copenhagen  Melissa Terras, University College London  Presentation abstracts  Demonstrating A Multidisciplinary Collections API  Bridget Almas and Frederik Baumgardt, Tufts University; Tobias Weigel, DKRZ; Thomas Zastrow, MPCDF  The Collections Working Group of the Research Data Alliance (RDA) is a multidisciplinary effort to                              develop a cross‑community approach to building, maintaining and sharing machine‑actionable                    collections of data objects. We have developed an abstract data model for collections and an API that                                  can be implemented by existing collection solutions. Our goal is to facilitate cross‑collection                          interoperability and the development of common tools and services for sharing and expanding data                            collections within and across disciplines, and within and across repository boundaries. The RDA                          Collections API supports Create/Read/Update/Delete/List (CRUD/L) operations. It also supports                  set‑based operations for Collections, such as finding matches on like items, finding the intersection and                              union of two collections, and flattening recursive collections. 
Individual API implementations can declare, via a standard set of capabilities, the operations available for their collections. The Perseids Project at Tufts University is implementing this API for its collection of annotations on ancient texts. We will review the model and the functionality of the API and demonstrate how we have applied it to manage Perseids humanities data. We will also provide examples of how it is being applied for collections of data in other disciplines, including Climate Computing and Geoscience. Finally, we will solicit feedback from the participants in the workshop on the API and model and its applicability for other collections of cultural heritage data.

Libraries as Publishers of a New Bibliographical Unit
Steven Claeyssens, Koninklijke Bibliotheek

Large‑scale digitisation of historical paper publications is turning libraries into publishers of data collections for machines and algorithms to read. Therefore the library should critically (re)consider 1) its new function as a publisher of 2) a new type of bibliographical content in 3) an exclusively digital environment. What does it mean to be both library and publisher? What is the effect of remediating our textual and audiovisual heritage, not as traditional bibliographic publications, but as data and datasets? How can we best serve our patrons, new and old, machines and humans?
In my talk I want to address these questions drawing on my background as a book historian specialized in Publishing Studies, and on my experience as the Curator of Digital Collections at the national library of the Netherlands (KB) responsible for providing researchers with access (Data Services) to the large collections of data the KB is creating.

At the KB we found there is no single solution that caters to the needs of Digital Humanists. I will reflect upon their requirements by analysing the requests for data that the KB received from Digital Humanists during the year 2016. What kind of data were they looking for? Why did they need the data?

I will identify both valuable and incompatible user requirements, indicating the conflicting expectations and interests of different disciplines and researchers. Therefore I argue that 1) a close collaboration between scholars and librarians is essential if we really want to advance the use of large digital libraries in the field of Digital Humanities, and 2) we need to carefully reconsider our role(s) as a library.

Audio Analysis for Spoken Text Collections
Tanya Clement and Steve McLaughlin, The University of Texas at Austin

At this time, even though we have digitized hundreds of thousands of hours of culturally significant audio artifacts and have developed increasingly sophisticated systems for computational analysis of sound, there is very little provision for audio analysis.
There is little provision for scholars interested in spoken texts such as speeches, stories, and poetry to use or to even begin to understand how to use high performance technologies for analyzing sound. Toward these ends, we have developed a beginner’s audio analysis workshop as part of the HiPSTAS (High Performance Sound Technologies for Access and Scholarship) project. We introduce participants to essential issues that DH scholars, who are often more familiar with working with text, must face in understanding the nature of audio texts such as poetry readings, oral histories, speeches, and radio programs. First, we discuss the kinds of research questions that humanities scholars may want to explore using features extracted from audio collections: laughter, silence, applause, emotions, technical artifacts, or examples of individual speakers, languages, and dialects as well as patterns of tempo and rhythm, pitch, timbre, and dynamic range. We will also introduce participants to techniques in advanced computational analysis such as annotation, classification, and visualization, using tools such as Sonic Visualiser, ARLO, and pyAudioAnalysis. We will then walk through a sample workflow for audio machine learning. This workflow includes developing a tractable machine‑learning problem, creating and labeling audio segments, running machine learning queries, and validating results.
As a result of the workshop, participants will be able to develop potential use cases for which they might use advanced technologies to augment their research on sound, and, in the process, they will also be introduced to the possibilities of sharing workflows for enabling such scholarship with archival sound recordings at their home institutions.

Using IIIF to answer the Data Needs of Digital Humanists
Rachel Di Cresce, University of Toronto

How can we provide researchers and instructors with seamless access to dispersed collections, controlled by their formats, frameworks and software, across cultural heritage organizations? How can we allow free movement of this data so it can be analyzed, measured and presented through different lenses? And how can we support this research without placing too high a technical burden on those institutions, especially those with limited resources? These questions have been at the centre of the University of Toronto’s Mellon‑funded project, Digital Tools for Manuscript Study, which aims at integrating the International Image Interoperability Framework (IIIF), based on Linked Data principles, with existing tools to improve the researcher’s experience. Essentially, the project shifts focus away from the tool that makes use of the data onto the data itself as a research and teaching tool.

At the core of the project is working with humanists to understand how they conduct their research and what they need in order to do digital scholarship effectively.
We identified, for example, strong needs for data portability, repository interoperability, and tool modularity in scholarly work. We make use of the IIIF data standard to support data portability, the Mirador image viewer for its suite of tools for image presentation and analysis, and Omeka for its wide adoption among digital humanities scholars and cultural heritage organizations. In addition, we have developed a standalone tool set called IIIF To Go. This is a user‑friendly IIIF start‑up kit, designed to support both research and pedagogical uses. This talk will discuss our attempt to democratize an international standard by (1) embedding it in tools with wide traction and low entry barriers in the digital humanities and manuscript studies community, (2) limiting the technical load required to make use of the standard and tools for instruction and research, and (3) looking toward Linked Data at GLAM institutions.

High Performance Computing for Photogrammetry Made Easy
Quinn Dombrowski, University of California Berkeley; Tassie Gniady, Indiana University; John Simpson, Compute Canada; Megan Meredith‑Lobay, University of British Columbia

Photogrammetry (generating 3D models from a series of partially‑overlapping 2D images) is quickly gaining favor as an efficient way to develop models of everything from small artifacts that fit in a light box to large archaeological sites, using drone photography.
Stitching photographs together, generating point clouds, and generating the dense mesh that underlies a final model are all computationally‑intensive processes that, on a high‑powered desktop, can take from tens of hours for a small object to weeks for a landscape. Using a high‑performance compute cluster can reduce the computation time to about ten hours for human‑sized statues and twenty‑four hours for small landscapes.

One disadvantage of doing photogrammetry on an HPC cluster is that it requires use of the command line and Photoscan’s Python API. Since it is not reasonable to expect that all, or even most, scholars who would benefit from photogrammetry are proficient with Python, UC Berkeley has developed a Jupyter notebook that walks through the steps of the photogrammetry process, with opportunities for users to configure the settings along the way. Jupyter notebooks embed documentation along with code, and can serve both as a resource tool for researchers who are learning Python, and as a stand‑alone utility for those who want to simply run the code, rather than write it. This offloads the processing to the HPC cluster, allowing users to continue to work on a computer that might normally be tied up by the processing demands of photogrammetry.
MARCing the Boundary: Reusing Special Collections Records through the Early Novels Database  Nabil Kashyap, Swarthmore College, and Lindsay Van Tine, University of Pennsylvania    In this presentation, Early Novels Database project (END) collaborators Nabil Kashyap and Lindsay Van                            Tine will offer perspectives on the possibilities and perils of reframing the special collections catalog as a                                  collaborative datastore for humanities research. Among other activities, the END project includes                        curating records from regional special collections, developing standards for enhancing catalog records                        with copy‑specific descriptive bibliography, and publishing open access datasets plus documentation.                      Work on END therefore excavates basic questions around what thinking through library holdings as data                              might actually entail. What ultimately constitutes “the data”? What do they do? For whom? Starting                              from Leigh Star’s notion of the boundary object, this presentation explores the theory and praxis of                                MARC as a structure of knowledge that can allow “coordination without consensus.”    The MARC records at the core of the END dataset, the result of meticulous work on the part of                                      institutional catalogers, serve as “boundary objects”–that is, they serve as a flexible technology that                            both adapts to and coordinates a range of contexts. These contexts, in turn, can have very different                                  needs and values, from veteran catalogers to undergraduate interns, special collections to open source                            repositories, and from projected to actual uptake and reuse of the data in classrooms and research.    These shifting contexts call into question just what the “data” is. 
It will look different to a cataloger, an outside funding organization, a sophomore, a programmer, or an 18th c. scholar. What might appear straightforward (creating derivatives, for example) instead reveals a host of issues. Transforming nested data into tabular data brings to light frictions between disparate assumptions as to the unit of study, whether a work or volume or a particular copy. Privileging certain fields either effaces the specificity of transcription or sacrifices discoverability. There is no transparent “data dump”; instead, every act of transformation reinscribes a set of disciplinary and institutional values. Viewing collections as data is as much about opening up data as about actively demonstrating and to an extent prescribing research possibilities.

Lessons learned through the Smelly London project
Deborah Leem, Wellcome Trust and University College London

I propose to present the intended aims of the Smelly London project; what we achieved; challenges we experienced working with digitised collections; and possible directions for further development. In order to increase the impact and value that cultural heritage digital collections can offer, we believe that their online collections and platforms should be more amenable to emerging technologies and facilitate a new kind of research.

Wellcome Library, part of Wellcome, is one of the world’s major resources for the study of health and histories.
Over the past few years Wellcome have been developing a world‑class digital library by digitising a substantial proportion of their holdings. As part of this effort, approximately 5,500 Medical Officer of Health (MOH) reports for London spanning 1848‑1972 were digitised in 2012. Since September 2016 Wellcome have been digitising 70,000 more reports covering the rest of the United Kingdom (UK) as part of the UK Medical Heritage Library (UKMHL) project in partnership with Jisc and the Internet Archive. However, no digital techniques have yet been applied successfully to add value to this very rich resource.

As part of the Smelly London project, the OCR‑ed text of the MOH London reports has been text‑mined. Through text mining we produced a geo‑referenced dataset containing smell types for visualisation to explore the data. At the end of the Smelly London project the MOH smell data will also be available via other platforms and this will allow the public and other researchers to compare smells in London from the 19th century to present day. This has the further potential benefit of engaging with the public. However, cultural heritage organisations do not offer platforms that can help researchers share or communicate the data derived from digital collection use.

Mending the Vendor: Correction and Exploratory Augmentation of Collections as Data
Brandon Locke, Michigan State University

Like many university libraries, Michigan State received external hard drives filled with collections they held perpetual licenses to.
As at many university libraries, those collections have remained mostly unused since they were acquired. The data required processing to make them usable, but without demand for specific data from scholars, there was little benefit or reason to make all of the data available. In an effort to pilot a project to make this data more available and to promote use of the datasets, Brandon Locke (Director of LEADR), Devin Higgins (Library Programmer), and Megan Kudzia (Digital Scholarship Technology Librarian) embarked on a project to make the papers of Fannie Lou Hamer available for download. Hamer’s papers were chosen based on her historical stature and interest to faculty and graduate students in the Department of History, and upon the relatively small size of the collection.

The original scope of the project was for Higgins and Kudzia to make the plain text files available for download by any MSU student, faculty member, or staff member. LEADR staff would then experiment with different text and data mining tools to add metadata and create subsets and auxiliary datasets to accompany the collection.

After Higgins and Kudzia made the plain text files available to the campus community, the LEADR staff immediately encountered troubles with Named Entity Recognition.
Upon inspection, the OCR on the files was far too flawed for any accurate text mining, and the entire collection had to be redone using the provided page images with close training and manual correction.

This talk will detail some of the shortcomings in the supplied data, discuss opportunities for experimental text and data mining to enhance and augment existing collections datasets, and engage in opportunities for collaborations between institutions in improving data quality.

Reusable Computational Processing of Large‑scale Digital Humanities Collections
Richard Marciano and Greg Jansen, University of Maryland

The Digital Curation Innovation Center (DCIC) at the U. Maryland iSchool officially launched the “DRAS‑TIC” archiving platform at iPRES 2016, in Oct. 2016. This stands for Digital Repository At Scale That Invites Computation [To improve Collections], and is rolled out under a community‑based open source license. The goal is to build out an open source platform into a horizontally scalable archives framework serving the national library, archives, and scientific management communities. It is also a potential scalable and computational platform for Big Data management in large organizations in the cultural heritage, business, and scientific research communities.

This digital repository framework can scale to over a billion records and has tools for advanced metadata extraction (including from images), file format conversion, and search within the records and across collections.
The underlying software is based on the distributed NoSQL database Apache Cassandra, created to meet the scaling needs of companies like Facebook. DRAS‑TIC supports integration by providing a standard RESTful Cloud Data Management Interface (CDMI), a command‑line interface, a web interface, and messaging as contents are changed (MQTT). We are now exploring connecting DRAS‑TIC with a graph database engine to support social network analysis and computing of archival and library collections.

We wish to demonstrate this environment with reusable clustering workflows for grouping digitized forms by their layout, a recurring use‑case in many digital humanities projects. This is a preprocessing step that has the potential to lead to more accurate OCR of regions in images within digitized forms.

Umbra Search as Data: A digital sandbox to cross the digital divide
Cecily Marcus, University of Minnesota Libraries

Publicly launched in 2017, the University of Minnesota Libraries’ Umbra Search African American History has been working with partners across the country, from the Digital Public Library of America to Yale University to Howard University, to facilitate digital access to African American cultural history. As more than a search tool, Umbra Search doesn’t just bring together over 500,000 digital materials from 1,000 US libraries, archives and museums.
It also promotes the use of these materials through programming with students, educators, scholars, and artists, and leads a massive digitization effort of African American materials to build out a national digital corpus of African American history. Now, Umbra Search is exploring what it means to share the Umbra Search digital corpus as a data set that helps to bridge the digital divide and promote digital literacy among underrepresented youth and kids of color. By packaging curated sets of Umbra Search data around thematic topics (as well as providing access to the whole of Umbra Search data) with accessible digital storytelling tools that allow students to make data their own, Umbra Search provides an introduction to digital storytelling and other digital humanities skills through the lens of African American history and culture. Umbra Search’s national digital corpus provides a unique opportunity to engage students with STEAM activities and skill building with culturally relevant content that affirms African American history and culture. This talk discusses the rationale for developing a digital sandbox that provides libraries with a new model for activating primary source materials and digital collections (often considered to be among the more rarefied and inaccessible collections in libraries) and digital humanities tools in communities that may not regularly engage with archives, primary source digital collections, or digital humanities.
Historical Public Health Data Curation: Indiana State Board of Health Monthly Bulletin Project
Caitlin Pollock and Heather Coates, Indiana University‑Purdue University Indianapolis

As digital scholarship librarians, we see enhancing open digital content to facilitate reuse as a key mission of our work. This talk will introduce the work of IUPUI librarians in curating the Indiana State Board of Health Monthly Bulletin (1899‑1964). While in circulation, this resource was sent to all health officers and deputies in the state, plus individual subscribers. Physicians shared information about health and wellness, communicable diseases, patent medicines, food safety, and many other topics. As such, the Bulletin provides a unique historic portrayal of Indiana public health practice, fascinating images, and regular vital statistics from the early and mid‑20th century. This project brings together the Ruth Lilly Medical Library and the IUPUI University Library to leverage librarian expertise in digital humanities, medical humanities, public health, the history of medicine, and data curation. Our initial focus is curating a 10‑year span (1905‑1914) of these bulletins in order to develop and refine processes that can be adapted for other digital collections. Our curation efforts focus on providing greater accessibility to students and scholars of Indiana history, medical history, and public health, and to Hoosiers across the state. We are creating three types of products: TEI documents; geocoded records of citizens and professionals, community organizations and businesses, and buildings; and vital statistics data.
Data dictionaries are being developed to support analysis of the vital statistics and to capture additional context about historic knowledge of disease and death. Project documentation will be developed to support exploration by the public and use by scholars and to provide transparency with regard to the decisions made during curation. All products generated from the project, including protocols for curation, will be shared openly under a CC‑BY license on platforms including GitHub and the TEI Archiving, Publishing and Access Service (TAPAS) Project.

Leveraging Core Data for the Cultural Heritage of the Medieval Middle East
Daniel L. Schwartz, Texas A&M University

I direct Syriaca.org, a core data project for Syriac history, literature, and cultures. Syriac is a dialect of Aramaic once spoken by populations across the Middle East and Asia. Syriac sources document key moments in the interaction of Judaism, Christianity, and Islam and offer unique perspectives on the history of the Middle East from the Roman period through Ottoman rule and into the tumultuous present in Iraq, Syria, and the Levant. Syriaca.org has built a core data infrastructure useful to any digital project in the field that is interested in incorporating our URIs for persons, places, works, manuscripts, etc. I would like to propose a 30‑minute demonstration of three projects that highlight this utility.
1) SPEAR (Syriac Persons, Events, and Relations) is a digital prosopography that employs our core data model (URIs) to extract and encode data about persons, events, and relationships from primary source texts. The scale enabled by the digital allows extensive treatment of many subaltern groups usually left out of traditional print prosopography. TEI encoding and serialization into RDF allow for multiple ways to query and visualize this data.
2) The New Handbook of Syriac Literature is an open‑access digital publication that will serve as both an authority file for Syriac works and a guide to accessing their manuscript representations, editions, and translations in digital and analog formats. Though still in development, this Handbook will more than double the number of works contained in the last publication to attempt something similar, Anton Baumstark’s Geschichte, which is over 90 years old. The Handbook is part of Syriaca.org’s efforts to produce reference resources that help overcome the colonial biases that informed Orientalist organization of the cultural heritage of the Medieval Middle East.
3) We are developing a URI resolver that any project in the field using our URIs can incorporate into their website to show users how many and what types of resources Syriaca.org has on the entities included in their data and to provide direct links to those resources.

Addressing Copyright and IP Concerns when using Text Collections as Data
Megan Senseney, Eleanor Dickson, and Daniel G.
Tracy, University of Illinois

Open source text data mining tools such as Voyant and publicly available services such as the HathiTrust Research Center (HTRC) have brought the potential of new research discoveries through computational analytics within reach of scholars. While the tools for mining and analyzing the contents of digital libraries as data are increasingly accessible, the texts themselves are frequently protected by copyright or other IP rights, or are subject to license agreements that limit access and use.

The HTRC recently convened a task force charged with drafting an actionable, definitional policy for so‑called non‑consumptive use, which is research use that permits computational analysis while precluding human reading. This year, the HTRC released the task force’s Non‑Consumptive Research Policy, which is shaping revised terms of service and tool development within the HTRC. Building on the development of the HTRC’s policy, our team is seeking to catalyze a broader discussion around data mining research using in‑copyright and limited‑access text datasets through an IMLS‑funded national forum that will bring together experts around issues associated with methods, practice, policy, security, and replicability in research that incorporates text datasets that are subject to intellectual property (IP) rights.
The national forum aims to produce an action framework for libraries with recommendations that will include models for working with content providers to facilitate researcher access to text datasets and models for hosting and preserving the outputs of scholars’ text data mining research in institutional repositories and databanks.

This short talk will describe the task force’s work to establish a Non‑Consumptive Research Policy for the HTRC and outline next steps toward building a more comprehensive research agenda for library‑led access to the wealth of textual content existing just out of reach in digital collections and databases through the upcoming national forum.

Learning through use: A case study on setting up a research fellowship to learn more about how one of our collections works as computationally amenable dataset
Sarah Severson and Berenica Vejvoda, McGill University Library and Archives

McGill University Library and Archives recently completed a major project to retrospectively digitize all of the dissertations and theses in our collections. Once these were added to the institutional repository, the metadata and full text of over 40,000 electronic theses and dissertations (ETDs), from 1881 to the present, became searchable using the traditional database structure of keywords and full text.
With such a large and comprehensive corpus of student scholarship, we wanted to use this collection as our first foray into thinking about ‘collections as data’ and what kinds of research could be done if we opened up the entire raw text corpus.

In order to encourage use and dialogue with the collection, the Library created a Computational Research Fellowship through an innovation fund. The fellowship call was left deliberately open in order to learn what people wanted to do with the collection; the only condition was that fellows share what they learned openly through presentations about their work and host any code in an open environment such as GitHub.

The selected fellow’s project will use Python’s Natural Language Toolkit and word2vec (a word embedding algorithm developed by Google) to build an application with a web‑based front‑end interface that will allow researchers to examine how literary terms have changed over time in terms of usage and context. The project will also include a data visualization component using Plotly (a Python library) to promote interactive and visually meaningful data displays. More concretely, researchers will be able to enter a concept and a time period of interest and visualize how the context of the concept has evolved over time. By way of example, the concept of “woman” shifts contextually between the period prior to first‑wave feminism and first‑wave feminism itself, as well as through subsequent waves of feminism.
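To make the idea of tracking a concept's context across time periods concrete, the following is a minimal sketch only. It swaps in plain co-occurrence counts from the Python standard library in place of word2vec embeddings, and it is not the fellow's implementation. The function names, the period labels, and the two toy "theses" are all invented for illustration and are not drawn from the McGill corpus.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_by_period(docs, term, window=2):
    """Count words occurring within `window` tokens of `term`, per period.

    `docs` is an iterable of (period_label, text) pairs. This is a simple
    co-occurrence count, a stand-in for embedding-based comparison.
    """
    contexts = defaultdict(Counter)
    for period, text in docs:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok != term:
                continue
            lo, hi = max(0, i - window), i + window + 1
            # Neighbours on both sides, excluding the term itself.
            contexts[period].update(tokens[lo:i] + tokens[i + 1:hi])
    return dict(contexts)


# Hypothetical toy documents, invented purely for this example.
TOY_DOCS = [
    ("1900-1949", "the woman teacher and the woman nurse"),
    ("1950-1999", "the woman engineer and the woman scientist"),
]

if __name__ == "__main__":
    for period, counts in context_by_period(TOY_DOCS, "woman").items():
        print(period, counts.most_common(3))
```

Under these assumptions, comparing the per-period counters gives a rough picture of how the neighbourhood of a term differs between periods; an embedding model would replace the counters with learned vectors trained separately on each period's texts.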
This presentation will look at how we are thinking of our ETD collection as a computationally amenable dataset; the computational fellowship as a means of engagement; and what we hope to learn about the collection and future library text mining services and support.

Facilitating Global Historical Research on the Semantic Web: MEDEA (Modeling semantically Enhanced Digital Edition of Accounts)
Kathryn Tomasek, Wheaton College (Massachusetts); Georg Vogeler, Centre for Information Modeling ‑ Austrian Centre for Digital Humanities, University of Graz

Social and economic historians have spent at least the past fifty years creating data sets well suited for analysis using post‑WWII computational tools (SPSS/SAS). Contemporary efforts by such historians as Patrick Manning to aggregate data sets for human systems analysis demonstrate a desire to take advantage of the more recent tools represented by the semantic web. Both Tomasek and Vogeler have explored ontologies that can be integrated into the CIDOC‑CRM family of event‑based models and used for markup of digital scholarly editions of accounts, a genre of archival documents that support humanities research as well as social science research. This short paper offers a brief introduction to recommendations for producing digital scholarly editions of accounts that include references to a book‑keeping ontology using the TEI attribute @ana.
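As a rough illustration of how @ana references in a TEI edition might be turned into RDF-style triples, the sketch below uses only the Python standard library. The `bk:` prefix, the `http://example.org/...` namespaces, the sample markup, and the function name are all hypothetical placeholders, not the MEDEA project's actual ontology terms or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace standing in for a book-keeping ontology.
BK_NS = "http://example.org/bookkeeping#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample markup; element choices and values are illustrative only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p xml:id="e1" ana="bk:entry">Paid for
      <measure xml:id="e1a" ana="bk:amount" quantity="12" unit="shilling">12 s.</measure>
    </p>
  </body></text>
</TEI>"""


def ana_triples(tei_xml, base="http://example.org/edition#"):
    """Return (subject, predicate, object) tuples for elements carrying @ana.

    Only elements that also have an xml:id are used, so that each triple
    has a stable subject URI. `base` is an assumed edition namespace.
    """
    root = ET.fromstring(tei_xml)
    triples = []
    for el in root.iter():
        ana, ident = el.get("ana"), el.get(XML_ID)
        if not ana or not ident:
            continue
        # @ana may hold several whitespace-separated pointers.
        for ref in ana.split():
            if ref.startswith("bk:"):
                triples.append((base + ident, RDF_TYPE, BK_NS + ref[3:]))
    return triples


if __name__ == "__main__":
    for triple in ana_triples(SAMPLE_TEI):
        print(triple)
```

A real workflow would more likely use an XSLT transformation or an RDF library to serialize the triples; the point here is only that each @ana pointer can be read as a typed statement about the marked-up passage.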
Vogeler has tested comparability of data across a small sample of such editions for which the references have been transformed into RDF triples. New editions are being added to those stored in the GAMS repository (Geisteswissenschaftliches Asset Management System) at the University of Graz between now and August 2017. We see these editions in sharp contrast to the example of “page‑turning” simulations referenced in the CFP for the workshop: creating full digital scholarly editions of accounts using TEI, the book‑keeping ontology, and RDF triples is an example of shaping humanities data for use and reuse by taking advantage of the affordances of the semantic web.

Javanese Theatre as Data
Miguel Escobar Varela, National University of Singapore

The Contemporary Wayang Archive is an archive of Indonesian theatre materials. The online portal’s primary goal is to enable users to watch videos alongside transcripts, translations and scholarly notes. However, a new version currently under development will enable users to query the archival materials via APIs. The first API will be directed at linguistic queries from the transcript and translation corpus. The goal is to enable data‑driven investigations of the ways Javanese and Indonesian are used in the performances.
Although these languages are widely spoken (Indonesia is the fourth most populous country in the world and Javanese is its most widely spoken regional language), there are almost no machine‑readable resources in these languages that can be used in digital humanities and computational linguistics research projects. A second API is aimed at video processing applications. The API will serve video‑frame‑level data that can be used to interrogate and visualize the collection in new ways. We believe that most theatre projects in DH remain heavily focused on textual data or on numerical data such as revenue numbers, cast sizes and collaboration networks. However, we believe that video processing offers a rich and as yet untapped avenue for inquiry [1]. We aim to encourage further research into this area via our video processing API. This talk will briefly outline the objectives and history of CWA, our goals for the future and the technical and intellectual property rights challenges that we face.

References:
[1] Escobar Varela, M. and G.O.F. Parikesit, ‘A Quantitative Close Analysis of a Theatre Video Recording’ in Digital Scholarship in the Humanities (forthcoming), doi:10.1093/llc/fqv069