Chapter 11

Taking a Leap Forward: Machine
Learning for New Limits

Patrice-Andre Prud’homme
Oklahoma State University

Introduction

Today, machines can analyze vast amounts of data and produce increasingly accurate results
through the repetition of mathematical or computational procedures. With the growing computing
capabilities now available, artificial intelligence (AI) and machine learning applications have
made a leap forward. These rapid technological changes are inevitably influencing our
interpretation of what AI can do and how it can affect people’s lives. Machine learning models
that are developed on the basis of statistical patterns from observed data provide new
opportunities to augment our knowledge of text, photographs, and other types of data in support
of research and education. However, “the viability of machine learning and artificial
intelligence is predicated on the representativeness and quality of the data that they are
trained on,” as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada
Las Vegas, asserts (2019, 14). With that in mind, these technologies and methodologies could
help augment the capacity of archives and libraries to leverage their creation-value and
minimize their institutional memory loss while enhancing the interdisciplinary approach to
research and scholarship.

In this essay, I begin by placing artificial intelligence and machine learning in context. I
then discuss why AI matters for archives and libraries and describe the techniques used in a
pilot automation project from the perspective of digital curation at Oklahoma State University
Archives. Lastly, I challenge other areas in the library and adjacent fields to join the
dialogue, to develop a machine learning solution more broadly, and to explore the opportunities
that we can reap by reaching out to others who share a similar interest in connecting people to
build knowledge.



128 Machine Learning, Libraries, and Cross-Disciplinary Research - Chapter 11

Artificial Intelligence and Machine Learning. Why Do They Matter?

Artificial intelligence has seen a resurgence of interest in the recent past: in the news, in
the literature, in academic libraries and archives, and in other fields, such as medical
imaging and the inspection of steel corrosion. John McCarthy, the American computer scientist,
defined artificial intelligence as “the science and engineering of making intelligent machines,
especially intelligent computer programs. It is related to the similar task of using computers
to understand human intelligence, but AI does not have to confine itself to methods that are
biologically observable” (2007, 2). This definition has since been extended to reflect a deeper
understanding of AI today and what systems run by computers are now able to do. Dr. Carmel Kent
notes that “AI feels like a moving target” as we still need to learn how it affects our lives
(2019). Over the last few decades, the jump in computing capabilities has been transformative
in that machines are increasingly able to
ingest and analyze large amounts of data and more complex data to automatically produce models
that can deliver faster and more accurate results. 1 Their “power lies in the fact that machines can
recognize patterns efficiently and routinely, at a scale and speed that humans cannot approach,”
writes Catherine Nicole Coleman, digital research architect for Stanford University (2017).

A Paradigm Shift for Archives and Libraries

Within the context of university archives, this paradigm shift has been transforming the way we
interpret archival data. Artificial intelligence, and specifically machine learning as a subfield of
AI, has direct applications through pattern recognition techniques that predict the labeling values
for unlabeled data. As the software analytics company SAS argues, it is “the iterative aspect of
machine learning [that] is important because as models are exposed to new data, they are able
to independently adapt. They learn from previous computations to produce reliable, repeatable
decisions and results” (n.d.).

A case in point: how can we use machine learning to train machines and apply facial and text
recognition techniques to interpret the sheer number of photographs and texts, in either analog
or born-digital formats, held in archives and libraries? By combining automated processes that
support inventory management with a focus on descriptive metadata, a machine learning solution
could help alleviate time-consuming and relatively expensive metadata tagging tasks, and thus
scale the process more effectively using relatively small amounts of data. However, the
traditional approach to machine learning would still require a significant time commitment by
archivists and curators to identify the essential features that make patterns usable for
training. By contrast, deep learning algorithms are able “to learn high-level features from
data in an incremental manner. This eliminates the need of domain expertise and hard core
feature extraction” (Mahapatra 2018).

Deep learning has regained popularity since the mid-2000s due to “fast development of
high-performance parallel computing systems, such as GPU clusters” (Zhao 2019, 3213). Deep
learning neural networks are more effective in feature detection, as they are able to solve
complex problems such as image classification with greater accuracy when trained with large
datasets. The challenge is whether archives and libraries can afford to take advantage of
greater computing capabilities to develop sophisticated techniques and detect complex patterns
across thousands of

1 See SAS n.d. and Brennan 2019.


Prud’homme 129

digital works. The sheer size of library and archive datasets, such as university photograph collec-
tions, presents challenges to properly using these new, sophisticated techniques. As Jason Griffey
writes, “AI is only as good as its training data and the weighting that is given to the system as it
learns to make decisions. If that data is biased, contains bad examples of decision-making, or is
simply collected in such a way that it isn’t representative of the entirety of the problem set[…],
that system is going to produce broken, biased, and bad outputs” (2019, 8). How can cultural
heritage institutions ensure that their machine learning algorithms avoid such bad outputs?

Implications to Machine Learning

Machine learning has the potential to enrich the value of digital collections by building upon ex-
perts’ knowledge. It can also help identify resources that archivists and curators may never have
the time for, and at the same time correct assumptions about heritage materials. It can generate
the necessary added value to support the mission of archives and libraries in providing a public
good. Annie Schweikert states that “artificial intelligence and machine learning tools are consid-
ered by many to be the next step in streamlining workflows and easing workloads” (2019, 6).

For images, how can archives build a data-labeling pipeline into their digital curation
workflow that enables machine learning of collections? With the objective being to augment
knowledge and create value, how can archives and libraries “bring the skills and knowledge of
library staff, scholars, and students together to design an intelligent information system”
(Coleman 2017)? Despite the opportunities to augment knowledge from facial recognition, models
generated by machine learning algorithms should be scrutinized so long as it is unclear how
choices are made in feature selection. Machine learning “has the potential to reveal things …
that we did not know and did not want to know,” as Charlie Harper asserts (2018). It can also
have direct ethical implications, leading to biased interpretations for nefarious motives.

Machine Learning and Deep Learning on the Grounds of
Generating Value

In the fall of 2018, Oklahoma State University Archives began to look more closely at a machine
learning solution to facilitate metadata creation in support of curation, preservation, and
discovery. Conceptually, we envisioned boosting the curation of digital assets, setting up
policies to prioritize digital preservation and access for education and research, and
enhancing the long-term value of those data. In this section, I describe the parameters of
automation and machine learning used to support inventory work and to experiment with face
recognition models that add contextualization to digital objects. From a digital curation
perspective, the objective is to explore ways to add value to digital objects about which
little, if any, information is known, in order to increase the visibility of archival
collections.

What Started This Pilot Project?

Before proceeding, we needed to gain a deeper understanding of the large quantity of files held
in the archives, in terms of both data and metadata. The challenge was that, with so many files
in so many formats, files had been duplicated, renamed, doctored, and scattered throughout
directories to accommodate different types of projects over time, making them hard to sift
through given the sparse


metadata tags that may have differed from one system to another. In short, how could we justify
the value of these digital assets for curatorial purposes? How much could we rely on the
established institutional memory within the archives? Lastly, could machine learning or deep
learning applications help us build a greater capacity to augment knowledge? In order to
optimize resources and systematically make sense of data, we needed to determine whether
machine learning could generate value, which in turn could help us more tightly integrate our
digital initiatives with machine learning applications. Such applications would only be as
effective as the quality of the training data and the value we could derive from them.

Methodology and Plan of Action

First, we recruited two student interns to create a series of processes that would automatically
populate a comprehensive inventory of all digital collections, including finding duplicate files by
hashing. We generated the inventory by developing a process that could be universally adapted
to all library digital collections, setting up a universal list of works and their associated metadata,
with a focus on descriptive metadata, which in turn could support digital curation and discov-
ery of archival materials—digitized analog materials and born-digital materials. We developed a
universal policy for digital archival collections, which would allow us to incorporate all forms of
metadata into a single format to remedy inconsistencies in existing metadata. This first phase was
critical in the sense that it would condition the cleansing and organizing of data. We could then
proceed with the design of a face recognition database, with the intent to trace individuals fea-
tured in the inventory works of the archives to the extent that our data were accurate. We utilized
the Oklahoma State University Yearbook collections and other digital collections as authoritative
references for other works, for the purpose of contextualization to augment our data capacity.
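
The duplicate-detection step of such an inventory can be sketched as follows. This is only a
minimal illustration: the chapter does not say which hash function or tooling was used, so
SHA-256, the plain directory walk, and the function names `file_digest` and `find_duplicates`
are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()  # assumed hash; the source only says "hashing"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by content digest; keep groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Because the digest depends only on file content, copies that were renamed or moved to another
directory still fall into the same group, while files that were edited ("doctored") do not.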

Second, we implemented our plan; worked closely with the Library Systems’ team within
a Windows-based environment; decided on Graphics Processing Unit (GPU) performance and
cost, taking into consideration that training neural networks necessitates computing power; de-
termined storage needs; and fulfilled other logistical requirements to begin the step-by-step pro-
cess of establishing a pattern recognition database. We designed the database on known objects
before introducing and comparing new data to contextualize each entry. With this framework,
we would be able to add general metadata tags to a uniform storage system using deep learning
technology.

Third, we applied Tesseract OCR on a series of archival image-text combinations from the
archives to extract printed text from those images and photographs. “Tesseract 4 adds a new
neural net (LSTM) [Long Short-Term Memory] based OCR engine which is focused on line
recognition,” while also recognizing character patterns (“Tesseract” n.d.). We were able to obtain
successful output for the most part, with the exception of a few characters that were hard to detect
due to pixelation and font types.
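
A text-extraction call of this kind might look like the sketch below. It assumes the
`tesseract` command-line program (version 4 or later) is installed and on the PATH; the
wrapper names `tesseract_command` and `extract_text`, the English language model, and the
default page segmentation mode are illustrative choices, not details taken from the pilot.

```python
import subprocess


def tesseract_command(image_path, lang="eng", oem=1, psm=3):
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    return [
        "tesseract", str(image_path), "stdout",
        "-l", lang,         # language model (assumed: English)
        "--oem", str(oem),  # 1 selects the LSTM neural net engine
        "--psm", str(psm),  # 3 is fully automatic page segmentation
    ]


def extract_text(image_path, **options):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Keeping the command construction separate from the call makes it easy to log exactly which
engine and segmentation settings produced a given transcription.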

Fourth, we looked into object identifiers, keeping in mind that “When there are scarce or
insufficient labeled data, pre-training is usually conducted” (Zhao 2019, 3215). Working through
the inventory process, we knew that we would also need to label more data to grow our capacity.
We chose to use ResNet 50, a smaller backbone for Keras-RetinaNet that is frequently used as a
starting point for transfer learning. ResNet 152 was another implementation we used, as shown in
Figure 11.1, which displays the output of a training session, or epoch, run for testing purposes.

Figure 11.1: ResNet 152 application using PASCAL VOC 2012

Figure 11.2: Face recognition API test

Keras is a deep learning network API (Application Programming Interface) that supports
multiple back-end neural network computation engines (Heller 2019) and RetinaNet is a single,
unified network consisting of a backbone network and two task-specific subnetworks used
for object detection (Karaka 2019). We proceeded by first loading a large amount of pre-tagged
information from pre-existing datasets into this neural network. We experimented with three
open source datasets: PASCAL VOC 2012, a set including 20 object categories; Open Images
Database (OID), a very large dataset annotated with image-level labels and object bounding
boxes; and Microsoft COCO, a large-scale object detection, segmentation, and captioning
dataset. With a few faces from the OID dataset, we could compare and see if a face was
previously recognized. Expanding our process to data known from the archives collection, we
determined facial areas and, more specifically, assigned bounding box regressions to feed into
the facial recognition API, based
on Keras code written in Python. The face recognition API is available via GitHub. 2 It uses
a method called Histogram of Oriented Gradient (HOG) encoding that makes the actual face
recognition process much easier to implement for individuals because the encodings are fairly
unique for every person, as opposed to encoding images and trying to blindly figure out which
parts are faces based on our label boxes. Figure 11.2 illustrates our test, confirming from two
very different photographs the presence of Jessie Thatcher Bost, the first female graduate from
Oklahoma A&M College in 1897.
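
The matching step on face encodings can be sketched as a simple distance check. The sketch
assumes, as the open-source face_recognition library does to the best of my understanding,
that each face is reduced to a numeric encoding and that two faces count as a match when the
Euclidean distance between their encodings falls within a tolerance (0.6 by default in that
library). The functions below are standalone stand-ins named after the library's
`face_distance` and `compare_faces`, and the three-value vectors are invented toy data; real
encodings have 128 values.

```python
import math


def face_distance(known_encodings, candidate):
    """Euclidean distance from the candidate to each known encoding."""
    return [math.dist(known, candidate) for known in known_encodings]


def compare_faces(known_encodings, candidate, tolerance=0.6):
    """True for each known encoding that lies within the tolerance."""
    return [d <= tolerance for d in face_distance(known_encodings, candidate)]


# Toy encodings (hypothetical values, for illustration only).
known = [
    [0.10, 0.20, 0.30],  # reference portrait, e.g. from a yearbook
    [0.90, 0.80, 0.70],  # a different person
]
unlabeled = [0.12, 0.18, 0.33]  # face found in an undescribed photograph

matches = compare_faces(known, unlabeled)
```

A lower tolerance makes matching stricter and reduces false identifications at the cost of
missing some true matches, which is the trade-off behind the accuracy concerns raised next.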

Ren et al. stated that it is important to construct a deep and convolutional per-region object
2 See https://github.com/ageitgey/face_recognition.


classifier to obtain good accuracy using ResNets (2015). Going forward, we could use the tool
“as is” despite the low tolerance for accuracy, or instead try to establish large datasets of faces by
training on our own collections in hopes of improving accuracy. We proceeded with utilizing the
Oklahoma State University Yearbook collections, comparing image sets with other photographs
that may include these faces. We look forward to automating more of these processes.

A Conclusive First Experiment

We can say that our first experiment developing a machine learning solution on a known set of
archival data resulted in positive output, while recognizing that it is still a work in progress. For
example, the model we ran for the pilot is not natively supported on Windows, which hindered
team collaboration. In light of these challenges, we think that our experiment was a step in the
right direction of adding value to collections by bringing in a new layer of discovery for hidden
or unidentified content.

Above all, this type of work relies greatly on transparency. As Schweikert notes,
“Transparency is not a perk, but a key to the responsible adoption of machine learning
solutions” (2019, 72). More broadly, issues in transparency and ethics in machine learning are
important concerns in the collecting and handling of data. In order to boost adoption and get
more buy-in with this new type of discovery layer, our team intentionally shared information
about the process to help add credibility to the work and foster a more collaborative
environment within the library. Also, the team developed a Graphical User Interface (GUI) to
search the inventory within the archives and ultimately grow the solution beyond the department.

Challenges and Opportunities of Machine Learning

Challenges

In a National Library of Medicine blog post, Patti Brennan points out “that AI applications are
only as good as the data upon which they are trained and built” (2019), and having these data
ready for analysis is a must in order to yield accurate results. Scaling of input and output
variables also plays an important role in improving performance when using neural network
models. Jerome Pesenti, Head of AI at Facebook, states that “When you scale deep learning, it
tends to behave better and to be able to solve a broader task in a better way” (2019). Clifford
Lynch affirms, “machine learning applications could substantially help archives make their
collections more discoverable to the public, to the extent that memory organizations can
develop the skills and workflows to apply them” (2019). This raises the question of whether
archives can also afford to create the large amount of data from print heritage materials or
refine their born-digital collections in order to build the capacity to sustain the use of
deep learning applications. Granted, the increasing volume of born-digital materials could help
build this data capacity; however, that does not change the fact that all data will need to be
ready prior to using deep learning. Since machine learning is only worthwhile so long as value
is added, archives and libraries will need to think in terms of optimization as well, deciding
when value-generated output is justified compared to the cost of computing infrastructure and
skilled labor needs. Besides value, operations such as storing and ensuring access to these
data are just as important considerations in making machine learning a feasible endeavor.



Opportunities

Investment in resources is also needed for interpreting results, in that “results of an AI-powered
analysis should only factor into the final decision; they should not be the final arbiter of that de-
cision” (Brennan 2019). While this could be a challenge in itself, it can also be an opportunity
when machine learning helps minimize institutional memory loss in archives and libraries (e.g.,
when long-time archivists and librarians leave the institution). Machine learning could supple-
ment practices that are already in place—it may not necessarily replace people—and at the same
time generate metadata for the access and discovery of collections that people may never have the
time to get to otherwise. But we will still need to determine accuracy in results. As deep learn-
ing applications will only be as effective as the data, archives and libraries should expand their
capacity by working with academic departments and partnering with university supercomput-
ing centers or other highly performant computing environments across consortium aggregating
networks. Such networks provide a computing environment with greater data capacity and more
GPUs. Along similar lines, there are opportunities to build upon Carpentries workshops and the
communities of practice that surround this type of interest.

These growing opportunities could help boost the use of machine learning and deep learn-
ing applications to minimize our knowledge gaps about local history and the surrounding com-
munity, bringing together different types of data scattered across organizations. This increased
capacity for knowledge could grow through collaborative partnerships, connecting people, schol-
ars, computer scientists, archivists and librarians, to share their expertise through different types
of projects. Such projects could emphasize the multi- and interdisciplinary academic approach
to research, including digital humanities and other forms or models of digital scholarship.

Conclusion

Along with greater computing capabilities, artificial intelligence could be an opportunity for
libraries and archives to boost the discovery of their digital collections by pushing text and
image recognition machine learning techniques to new limits. Machine learning applications
could help increase our knowledge of texts, photographs, and more, and determine their
relevance within the context of research and education. They could minimize institutional
memory loss, especially as long-time professionals are leaving the profession. However, these
applications will only be as effective as the quality of their training data and the added
value they generate.

At Oklahoma State University, we took a leap forward developing a machine learning so-
lution to facilitate metadata creation in support of curation, preservation, and discovery. Our
experiment with text extraction and face recognition models generated conclusive results within
one academic year with two student interns. The team was satisfied with the final output and
so was the library as we reported on our work. Again, it is still a work in progress and we look
forward to taking another leap forward.

In sum, it will be organizations’ responsibility to build their data capacity to sustain deep
learning applications and justify their commitment of resources. Nonetheless, as Oklahoma State
University’s face recognition initiative suggests, these applications can augment archives’ and li-
braries’ support for multi- and interdisciplinary research and scholarship.


References

Brennan, Patti. 2019. “AI is Coming. Are Data Ready?” NLM Musings from the Mezzanine
(blog). March 26, 2019.
https://nlmdirector.nlm.nih.gov/2019/03/26/ai-is-coming-are-the-data-ready/.

Kent, Carmel. 2019. “Evidence Summary: Artificial Intelligence in Education.” European
EdTech Network. ?iiTb,ff22iMX2mfFMQrH2/;2f/2i�BHf1pB/2M+2@amKK�`
v@$f@�`iB7B+B�H@AMi2HHB;2M+2@BM@2/m+�iBQM.

Coleman, Catherine Nicole. 2017. “Artificial Intelligence and the Library of the Future,
Revisited.” Stanford Libraries (blog). November 3, 2017.
https://library.stanford.edu/blogs/digital-library-blog/2017/11/artificial-intelligence-and-library-future-revisited.

“Face Recognition.” n.d. Accessed November 30, 2019.
https://github.com/ageitgey/face_recognition.

Griffey, Jason, ed. 2019. “Artificial Intelligence and Machine Learning in Libraries.” Special
issue, Library Technology Reports 55, no. 1 (January).
https://journals.ala.org/index.php/ltr/issue/viewIssue/709/471.

Harper, Charlie. 2018. “Machine Learning and the Library or: How I Learned to Stop Worrying
and Love My Robot Overlords.” Code4Lib, no. 41 (August).
https://journal.code4lib.org/articles/13671.

Heller, Martin. 2019. “What is Keras? The Deep Neural Network API Explained.” InfoWorld
(website). January 28, 2019.
https://www.infoworld.com/article/3336192/what-is-keras-the-deep-neural-network-api-explained.html.

Karaka, Anil. 2019. “Object Detection with RetinaNet.” Weights & Biases (website). July 18,
2019. https://www.wandb.com/articles/object-detection-with-retinanet.

Lynch, Clifford. 2019. “Machine Learning, Archives and Special Collections: A High Level
View.” International Council on Archives Blog. October 1, 2019.
https://blog-ica.org/2019/10/02/machine-learning-archives-and-special-collections-a-high-level-view/.

Mahapatra, Sambit. 2018. “Why Deep Learning over Traditional Machine Learning?” Towards Data
Science (website). March 21, 2018.
https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063.

McCarthy, John. 2007. “What is Artificial Intelligence?” Professor John McCarthy (website).
Revised November 12, 2007. http://jmc.stanford.edu/articles/whatisai/whatisai.pdf.

Padilla, Thomas. 2019. Responsible Operations: Data Science, Machine Learning, and AI in
Libraries. Dublin, OH: OCLC Research. https://doi.org/10.25333/xk7z-9g97.

Pesenti, Jerome. 2019. “Facebook’s Head of AI Says the Field Will Soon ‘Hit the Wall.’”
Interview by Will Knight. Wired (website). December 4, 2019.
https://www.wired.com/story/facebooks-ai-says-field-hit-wall/.

Ren, Shaoqing, Kaiming He, Ross Girshick, Xiangyu Zhang, and Jian Sun. 2015. “Object De-
tection Networks on Convolutional Feature Maps.” IEEE Transactions on Pattern Analysis
and Machine Intelligence 39, no. 7 (April).

SAS. n.d. “Machine Learning: What It Is and Why It Matters.” Accessed December 17, 2019.
https://www.sas.com/en_us/insights/analytics/machine-learning.html.

Schweikert, Annie. 2019. “Audiovisual Algorithms, New Techniques for Digital Processing.”
Master’s Thesis, New York University.
https://www.nyu.edu/tisch/preservation/program/student_work/2019spring/19s_thesis_Schweikert.pdf.

“Tesseract OCR.” n.d. Accessed December 11, 2019.
https://github.com/tesseract-ocr/tesseract.

Zhao, Zhong-Qiu, Peng Zheng, Shou-tao Xu, and Xindong Wu. 2019. “Object Detection with
Deep Learning: A Review.” IEEE Transactions on Neural Networks and Learning Systems 30,
no. 11: 3212-3232.
