Chapter 10

Bringing Algorithms and Machine Learning Into Library Collections and Services

Eric Lease Morgan
University of Notre Dame

Seemingly revolutionary changes

At the time of their implementation, some changes in the practice of librarianship were deemed revolutionary, but nowadays some of these same changes are considered matters of fact. Take, for example, the catalog. During much of the Middle Ages, a catalog was more akin to a simple acquisitions list. By 1548 the first author, title, and subject catalog had been created (LOC 2017, 18). These catalogs morphed into books, which could be mass produced and distributed, but the books were difficult to keep up to date and expensive to print. As a consequence, in the early 1860s the card catalog was invented by Ezra Abbot, and the catalog eventually became a massive set of drawers (82). Unfortunately, because of the way catalog cards are produced, it is not feasible to assign more than three or four subject headings to any given book; if one does, the number of catalog cards quickly gets out of hand. In the 1870s the idea of sharing catalog cards between libraries became common, and the Library of Congress facilitated much of the distribution (LOC 2017, 87). In 1965, with the advent of computers, the idea of sharing cataloging data as MARC (machine-readable cataloging) became prevalent (Crawford 1989, 204). The data structure of a MARC record is indicative of the time. Intended to be distributed on reel-to-reel tape, the MARC record is a sequential data structure designed to be read from beginning to end, complete with checks and balances ensuring the record's integrity. Despite the apparent flexibility of a digital data structure, the tradition of three or four subject headings per book still holds true. Nowadays the data from MARC records is used to fill databases, the databases' content is indexed, and items from the library collection are located by searching the index. The evolution of the venerable library catalog has spanned centuries, with each evolutionary change solving some problems but creating new ones.

With the advent of the Internet, a host of other changes are (still) happening in libraries. Some of them are seen as revolutionary, and only time will tell whether or not these changes will persevere. Examples include but are not limited to:

• the advocacy of alt-metrics and open access publications
• the continuing dichotomy of the virtual library and library as place
• the creation and maintenance of institutional repositories
• the existence of digital scholarship centers
• the increasing tendency to license instead of own content

Many of the traditional roles of libraries are not as important as they used to be. That does not mean the roles are unimportant, just not as important. Like many other professions, librarianship is exploring new ways to remain relevant when many of its core functions are needed by fewer people.

Working smarter, not harder

Beyond automation, librarianship has not exploited computer technology. Despite the fact that libraries have the world of knowledge at their fingertips, libraries do not operate very intelligently, where "intelligently" is an allusion to artificial intelligence. Let's enumerate the core functionalities of computers.

First of all, computers... compute. They are given some sort of input, assign the input to a variable, apply any number of functions to the variable, and output the result. This process of computing is akin to solving simple algebraic equations such as the area of a circle or a distance traveled.
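To make the point concrete, here is a minimal sketch in Python; the function name and the input value are merely illustrative:

# given an input, apply a function, and output the result
import math

def area_of_circle( radius ) :
	return math.pi * radius ** 2

print( area_of_circle( 1 ) )  # 3.141592653589793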
There are two factors of particular interest here. First, the input can be as simple as a number or a string (read: "a word"), or the input can be an arbitrarily large combination of both. Examples include:

• 42
• 1776
• xyzzy
• George Washington
• a MARC record
• the circulation history and academic characteristics of an individual
• the full text and bibliographic descriptions of all early American authors

What is really important is the possible scale of a computer's input, and libraries have not taken advantage of that scale. Imagine how librarianship would change if the profession actively used the full text of its collections to enhance bibliographic description and the resulting public services. Imagine how collection policies and patron needs could be better articulated if: 1) students, researchers, or scholars first opted in to have their records analyzed, and 2) the totality of circulation histories and journal usage histories were thoroughly investigated in combination with patron characteristics and data from other libraries.

A second core functionality of computers is their ability to save, organize, and retrieve vast amounts of data. More specifically, computers save "data": mere numbers and strings. But when the data is given context, such as a number denoted as a date or a string denoted as a name, then the data is transformed into information. An example might include the birth year 1972 and the name of my pet, Blake. Given additional information, which may be compared and contrasted with other information, knowledge can be created; knowledge is information put to use and understood. For example, Mary, my sister, was born in 1951 and is therefore 21 years older than Blake. Computers excel at saving, organizing, and retrieving data, which leads to information and knowledge. The possibility of computers dispensing wisdom, that is, knowledge of a timeless nature, is left for another essay.

Like the scale of computer input, the library profession has not really exploited computers' ability to save, organize, and retrieve data; on the whole, the library profession does not understand the concept of a "data structure." For example, tab-delimited files, CSV (comma-separated values) files, relational database schemas, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures. Each has its own set of inherent strengths and weaknesses; there is no such thing as "one size fits all." Through the use of data structures, computers store and retrieve information. Librarianship is about these same kinds of things, yet few librarians would be able to outline the differences between data structures.

Again, data becomes information when it is given context. In the world of MARC, when a string (one or more "words") is inserted into the 245 field of a MARC bibliographic record, then the string is denoted as a title. In this sense MARC is a "data structure" because different fields denote different contexts. There are fields for authors, subjects, notes, added entries, etc. This is all very well and good, especially considering that MARC was designed more than fifty years ago. But since then, many more scalable, flexible, and efficient data structures have been designed.
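For instance, the same kind of bibliographic context can be expressed in JSON, one of the data structures mentioned above. This is only an illustration; the field names follow no particular standard:

# a bibliographic record as JSON; the named fields, not positions on a
# tape, provide the context that turns data into information
import json

record = {
	"title"    : "The Card Catalog: Books, Cards, and Literary Treasures",
	"creator"  : "Library of Congress",
	"date"     : 2017,
	"subjects" : [ "card catalogs", "library history" ]
}
print( json.dumps( record, indent=2 ) )

Notice how the list of subjects can be as long or as short as desired, a point taken up next.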
Relational databases are a good example. They build on a classic data structure known as the "table": a matrix of rows and columns where each row is a record and each column is a field. Think "spreadsheet." For example, each row may represent a book, with columns for authors, titles, dates, publishers, etc. The problem comes when a column needs to be repeatable. For example, a book may have multiple authors or, more commonly, multiple subjects. In this case the idea of a table breaks down because it doesn't make sense to have columns named subject-01, subject-02, and subject-03; as soon as you do that, you will want subject-04. Relational databases solve this problem. The solution is to first add a "key" (a unique value) to each row. Next, for fields with multiple values, create a new table where one of the columns is the key from the first table and the other column is a value, in this case a subject heading. There are now two tables, and they can be "joined" through the use of the key. Given such a data structure, it is possible to add as many subjects as desired to any bibliographic item. (A minimal sketch of this two-table design appears at the end of this section.)

But you say, "MARC can handle multiple subjects." True, MARC can handle multiple subjects, but underneath, MARC is a data structure designed for a time when information was disseminated on tape. As such, it is a sequential data structure intended to be read from beginning to end; it is not a random access structure. What's more, the MARC data structure is really divided into three substructures: 1) the leader, which is always twenty-four characters long, 2) the directory, which denotes where each bibliographic field exists, and 3) the bibliographic section, where the bibliographic information is actually stored. It gets more complicated. The first five characters of the leader are expected to be a left-hand, zero-padded integer denoting the length of the record measured in bytes. A typical value may be 01999; thus, the record is 1,999 bytes long. Now, ask yourself, "What is the maximum size of a MARC record?" (The answer is 99,999 bytes, because the length can be expressed in no more than five digits.) Despite the fact that librarianship embraces the idea of MARC, very few librarians really understand the structure of MARC data. MARC is a format for transmitting data from one place to another, not for organizing it.

Moreover, libraries offer more than bibliographic information. There is information about people and organizations, information about resource usage, information about licensing, and information about resources that are not bibliographic, such as images or data sets. When these types of information present themselves, libraries fall back to the use of simple tables, which are usually not amenable to turning data into information.

There are many different data structures. XML became popular about twenty years ago. Since then JSON has become prevalent. More than twenty years ago the idea of Linked Data was presented. All of these data structures have various strengths and weaknesses. None of them is perfect, and each addresses different needs, but they are all better than MARC when it comes to organizing data. Libraries understand the concept of manifesting data as information, but as a whole, libraries do not manifest the concept using computer technology.
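Here is the promised sketch of the two-table design, using Python's built-in SQLite support. The table and column names are merely illustrative:

# two joined tables: one for books, one for repeatable subject headings
import sqlite3

db = sqlite3.connect( ":memory:" )
db.execute( "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)" )
db.execute( "CREATE TABLE subjects (book_id INTEGER, subject TEXT)" )

# one book, and as many subject headings as desired
db.execute( "INSERT INTO books VALUES (1, 'MARC for Library Use')" )
db.executemany( "INSERT INTO subjects VALUES (?, ?)",
	[ ( 1, "MARC formats" ), ( 1, "cataloging" ), ( 1, "library automation" ) ] )

# join the two tables on the key to reassemble the whole record
query = """SELECT books.title, subjects.subject
           FROM books JOIN subjects ON books.id = subjects.book_id"""
for row in db.execute( query ) : print( row )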
Finally, another core functionality of computers is networking and communication. The advent of the Internet is a relatively recent phenomenon, and the ubiquitous nature of computers combined with other "smart" devices has facilitated literally billions of connections between computers (and people). Consequently, the data computed upon and stored in one place can be transmitted almost instantly to another place, and the transmission is an exact copy. Again, like the process of computing and the process of storage, efficient computer communication builds upon itself with unforeseen consequences. For example, who predicted the demise of many centralized information authorities? With the advent of the Internet there is less of a need or desire for travel agents, movie reviewers, or, dare I say it, libraries. Yet again, libraries use the Internet, but do they actually exploit it? How many librarians are able to create a file, put it on the Web, and share the resulting URL? Granted, centralized computing departments and network administrators put up roadblocks to doing such things, but the sharing of data and information is at the core of librarianship. Putting a file on the 'Net, even temporarily, is something every librarian ought to know how (and be authorized) to do.

Despite the functionality of computers and their place in libraries over the past fifty to sixty years, computers have mostly been used to automate library tasks. MARC automated the process of printing catalog cards and eventually enabled the creation of "discovery systems." Libraries have used computers to automate the process of lending materials between themselves as well as to local learners, teachers, and scholars. Libraries use computers to store, organize, preserve, and disseminate the gray literature of our time, and we call these systems "institutional repositories." In all of these cases, the automation has been a good thing because efficiencies were gained, but the use of computers has not gone far enough, nor has it really evolved. Lending and usage statistics are not routinely harvested nor organized for the purposes of monitoring and predicting library patron needs and desires. The content of institutional repositories is usually born digital, but libraries have neither exploited its full-text nature nor created services going beyond rudimentary catalogs. Computers can do so much more for libraries than mere automation. While I will never say computers are "smart," their fundamental characteristics do appear intelligent, especially when used at scale. The scale of computing has significantly changed in the past ten years, and with this change the concept of "machine learning" has become more feasible. The following sections outline how libraries can go beyond automation, embrace machine learning, and truly evolve their ideas of collections and services.

Machine learning: what it is, possibilities, and use cases

Machine learning is a computing process used to make decisions and predictions. In the past, computer-aided decision-making and prediction were accomplished by articulating large sets of if-then statements and navigating down decision trees. The applications were extremely domain specific, and they weren't very scalable. Machine learning turns this process on its head. Instead of navigating down a tree, machine learning takes sets of previously made observations (think "decisions"), identifies patterns and anomalies in the observations, and saves the result as a mathematical model, which is really an n-dimensional array of vectors. Outside observations are then compared to the model, and depending on the resulting similarities or differences, decisions or predictions are drawn.
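As a toy illustration of that comparison (only an illustration; the numbers are made up), a model can be imagined as an array of vectors, and a new observation can be scored against each vector:

# compare an outside observation to a "model", here two vectors
import numpy as np

model       = np.array( [ [ 1.0, 0.0, 0.5 ],    # summarizes category A
                          [ 0.0, 1.0, 0.5 ] ] ) # summarizes category B
observation = np.array( [ 0.9, 0.1, 0.4 ] )

# cosine similarity between the observation and each model vector
scores = model @ observation / (
	np.linalg.norm( model, axis=1 ) * np.linalg.norm( observation ) )
print( scores )  # the higher score suggests the more similar category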
Using such a process, there are really only four different types of machine learning: classification, clustering, regression, and dimension reduction.

Classification is a supervised machine learning process used to subdivide a set of observations into smaller sets which have been previously articulated. For example, suppose you had a few categories of restaurants, such as American, French, Italian, or Chinese. Given a set of previously classified menus, one could create a model defining each category and then classify new, unseen menus. The classic classification example is the filtering of email: "Is this message 'spam' or 'ham'?" This chapter's appendix walks a person through the creation of a simplified classification system; it classifies texts based on authorship.

Clustering is almost always an unsupervised machine learning process which also creates smaller sets from a larger one, but clustering is not given a set of previously articulated categories. That is what makes it "unsupervised." Instead, the categories are created as an end result. Topic modeling is a popular example of clustering.

Regression predicts a numeric value based on sets of independent variables. For example, given independent variables like annual income, education level, size of family, age, gender, religion, and employment status, one might predict how much money a person may spend on a dependent variable such as charitable giving.

Sometimes the number of characteristics of each observation is very large, and many times some of these characteristics do not play a significant role in decision-making or prediction. Dimension reduction is another machine learning process, and it is used to eliminate these less-than-useful characteristics from the observations. This process simplifies classification, clustering, or regression. A tiny sketch combining clustering and dimension reduction appears below.
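Here is that sketch. The documents and the number of clusters are made up, and scikit-learn stands in for any number of toolkits:

# vectorize a few tiny "documents", reduce their dimensions, and
# cluster the results; the categories emerge as an end result
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

documents = [ "pasta olive oil basil",  "noodles soy sauce ginger",
              "linguine tomato basil",  "rice soy sauce scallions" ]

vectors  = TfidfVectorizer().fit_transform( documents )
reduced  = TruncatedSVD( n_components=2 ).fit_transform( vectors )
clusters = KMeans( n_clusters=2, n_init=10, random_state=0 ).fit_predict( reduced )
print( clusters )  # e.g. [0 1 0 1]; no categories were given in advance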
Some possible use cases

There are many possible ways to enhance library collections and services through the use of machine learning. I'm not necessarily advocating the implementation of any of the following ideas, but they are possibilities. Each is grouped into the broadest of library functional departments:

• reference and public services
  – given a set of grant proposals, suggest library resources to be used in support of the grants
  – given a set of licensed library resources and their usage, suggest other resources for use
  – given a set of previously checked-out materials, suggest other materials to be checked out
  – given a set of reference interviews, create a chatbot to supplement reference services
  – given the full text of a set of desirable journal articles, create a search strategy to be applied against any number of bibliographic indexes; answer the proverbial question, "Can you help me find more like this one?"
  – given the full text of articles as well as their bibliographic descriptions, predict and describe the sorts of things a specific journal title accepts, or whether a given draft is good enough for publication
  – given the full text of reading materials assigned in a class, suggest library resources to support them

• technical services
  – given a set of multimedia, enumerate characteristics of the media (number of faces, direction of angles, number and types of colors, etc.), and use the results to supplement bibliographic description
  – given a set of previously cataloged items, determine whether or not the cataloging can be improved
  – given full-text content harvested from just about anywhere, analyze the content in terms of natural language processing, and supplement bibliographic description

• collections
  – given circulation histories, articulate more refined circulation patterns, and use the results to refine collection development policies
  – given the full text of sets of theses and dissertations, predict where scholarship at your institution is growing, and use the results to more intelligently build your just-in-case collection; do the same thing with faculty publications

Implementing any of these possible use cases would necessarily be a collaborative effort, because implementation requires an array of expertise. Enumerated in no particular order, this expertise includes: subject/domain expertise (such as cataloging trends, circulation services, collection strategies, etc.), computer programming and data management skills (such as Python, R, relational databases, JSON, etc.), and statistical modeling (an understanding of the strengths and weaknesses of different machine learning algorithms). The team would then need to:

1. articulate and share a common goal for the work
2. amass the data to model
3. employ a feature extraction process (lower-casing words, extracting values from a database, etc.)
4. vectorize the features
5. create and evaluate the resulting model
6. go to Step #2 until satisfied
7. put the model into practice
8. go to Step #1; this work is never done

For example, to bibliographically connect grant proposals to library resources, try this:

1. use classification to subdivide each of your bibliographic index descriptions
2. apply the resulting model to the full text of the grants
3. return a percentage score denoting the strength of each resulting classification (a sketch of this step appears below)
4. recommend the use of zero or more bibliographic indexes
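Here is the promised sketch of the scoring step. It is hedged: it assumes a vectorizer and classifier like the ones created by this chapter's appendix (train.py), where the labels are names of bibliographic indexes, and the file names are hypothetical:

# score a grant proposal against each bibliographic index; assumes a
# model previously saved by something like the appendix's train.py
import pickle

with open( 'model.bin', 'rb' ) as handle :      # hypothetical file name
	( vectorizer, classifier ) = pickle.load( handle )

with open( 'proposal.txt', 'r' ) as handle :    # hypothetical file name
	proposal = handle.read()

# predict_proba returns a strength-of-classification score per label
scores = classifier.predict_proba( vectorizer.transform( [ proposal ] ) )[ 0 ]
for index, score in zip( classifier.classes_, scores ) :
	print( f'{index}\t{score:.0%}' )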
To predict scholarship, try this:

1. amass the full text and bibliographic descriptions of all theses and dissertations
2. topic model the full text
3. evaluate the resulting topics
4. go to Step #2 until satisfied
5. augment the model's matrix of vectors with bibliographic descriptions
6. pivot the matrix on any of the given bibliographic characteristics
7. plot the results to see possible trends over time, trends within disciplines, etc.
8. use the results to make decisions

The content of the GitHub repository reproduced in this chapter's appendix describes how to do something very similar in method to the previous example.1

Some real-world use cases

Here at the University of Notre Dame's Navari Center for Digital Scholarship, we use machine learning in a number of ways. We cut our teeth on a system called Convocate.2 In this case we obtained a set of literature on the theme of human rights. Half of the set was written by researchers in non-governmental organizations; the other half was written by theologians. While both sets were on the same theme, the language of each was different. An excellent example is the use of the word "child." In the former set, children were included in documents about fathers and mothers; in the latter set, children often referred to the "Children of God." Consequently, queries referring to children were often misleading. To rectify this problem, a set of broad themes was articulated, such as Actors, Harms and Violations, Rights and Freedoms, and Principles and Values. We then used topic modeling to subdivide all of the paragraphs of all of the documents into smaller and smaller sets of paragraphs. We compared the resulting topics to the broad themes, and when we found correlations between the two, we classified the paragraphs accordingly. Because the process required a great deal of human intervention, and thus impeded subsequent updates, it was not ideal, but we were learning, and the resulting index is useful.

On a regular basis we find ourselves using a program called Topic Modeling Tool, a GUI/desktop application heavily based on the venerable MALLET suite of software.3 Given a set of plain text files and an integer, Topic Modeling Tool will create a weighted list of latent themes found in a corpus. Each theme is really a list of words which tend to cluster around each other, and these clusters are generated through the use of an algorithm called LDA (Latent Dirichlet Allocation). When it comes to topic modeling, there is no such thing as the correct number of topics. Just as in the traditional process of denoting what a corpus is about, there can be many distinct topics or there can be a few. Moreover, some of the topics may be large and others may be small. When using a topic modeler, it is important to iteratively configure and re-configure the input until the results seem to make sense.

Just like every other machine learning application, Topic Modeling Tool bases its "reasoning" on a matrix of vectors. Each row represents a document, and each column is a topic. At the intersection of a document row and a topic column is a score denoting how much the given document is "about" the calculated topic. It is then possible to sum each topic column and output a pie chart illustrating not only what the topics are but how much of the corpus is about each topic. Such can be very insightful.

By adding metadata to the matrix of vectors, even more insights can be garnered. Suppose you have a set of plain text files, and suppose also that you know the names of the authors of each file. You can then do topic modeling against your corpus, and when the modeling is complete you can add a new column to the matrix and call it "authors." Next, you update the values in the authors column with author names. Finally, you "pivot" the matrix on the authors column to calculate the degree to which each author's works are "about" the calculated topics. This too can be quite insightful. Suppose you have works by authors A, B, C, and D, and suppose you have calculated topics I, II, III, and IV. By updating the matrix and pivoting the results, you might discover that author A discusses topic I almost exclusively, whereas author B discusses topics I, II, III, and IV in equal parts. This process works for just about any type of metadata: gender, genre, extent, dates, language, etc. What's more, Topic Modeling Tool makes this process almost trivial. To learn how, see the GitHub repository accompanying this chapter.4

1. See https://github.com/ericleasemorgan/bringing-algorithms.
2. See https://convocate.nd.edu.
3. See https://github.com/senderle/topic-modeling-tool for the Topic Modeling Tool. See http://mallet.cs.umass.edu for MALLET.
4. See https://github.com/ericleasemorgan/bringing-algorithms.
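A minimal sketch of this pivoting process, with scikit-learn's LDA and pandas standing in for Topic Modeling Tool; the documents, authors, and number of topics are all made up:

# topic model a tiny corpus, then pivot the document-topic matrix on authors
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [ "whales and the sea",   "ships and the sea",
              "gardens and flowers",  "flowers and bees" ]
authors   = [ "A", "A", "B", "B" ]

# each row is a document; each column scores how much the
# document is "about" a calculated topic
counts = CountVectorizer().fit_transform( documents )
model  = LatentDirichletAllocation( n_components=2, random_state=0 )
matrix = pd.DataFrame( model.fit_transform( counts ),
                       columns=[ "topic I", "topic II" ] )

# add the metadata, and then pivot to see what each author writes about
matrix[ "author" ] = authors
print( matrix.groupby( "author" ).mean() )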
We have used classification techniques in at least a couple of ways. One project required the classification of press releases. Some press releases are deemed mandatory (declared necessary to publish); other press releases are considered discretionary (published at the will of a company). The domain expert needed a set of 100,000 press releases classified into either mandatory or discretionary piles. We used a process very similar to the process outlined in this chapter's appendix. In the end, the domain expert believes the classification process was 86% correct, and this was good enough for them. In another project, we tried to identify articles about a particular yeast (Cryptococcus neoformans), despite the fact that the articles never mentioned the given yeast. This project failed because we were unable to generate an accuracy score greater than 70%, which was deemed not good enough.

We are developing a high performance computing system called the Distant Reader, which uses machine learning to do natural language processing against an arbitrarily large volume of text. Given one or more documents of just about any number or type, the Distant Reader will:

1. amass the documents
2. convert the documents into plain text
3. do rudimentary counts and tabulations against the plain text
4. calculate statistically significant keywords against the plain text
5. extract narrative summaries against the plain text
6. use spaCy (a natural language processing library) to classify each and every feature of each and every sentence into parts-of-speech and/or named entities5
7. save the results of Steps #1 through #6 as plain text and tab-delimited files
8. distill the tab-delimited files into an SQLite database
9. create both narrative as well as tabular reports against the database
10. create an archive (.zip file) of everything
11. return the archive to the student, researcher, or scholar

The student, researcher, or scholar can then analyze the contents of the .zip file to get a better understanding of its contents. This analysis ("reading") ranges from perusing the narrative reports, to using desktop tools to visualize the data, to exploiting command-line tools to investigate the data, to writing software which uses the data as input. The Distant Reader scales to everything from a single scholarly report, to hundreds of book-length documents, to thousands of journal articles. Its purpose is to supplement the traditional reading process, and it uses machine learning techniques at its core.

5. See https://spacy.io.
Summary and Conclusion

Computers and libraries are a natural fit. Both excel at the collection, organization, and dissemination of data, information, and knowledge. Compared to most professions, librarianship has used computers for a very long time, but for the most part, the functionality of computers in libraries has not been fully exploited. Advances in machine learning, coupled with the data and information found in libraries, present an opportunity for both librarianship and the people whom libraries serve. Machine learning can be used to enhance library collections and services, and with a modest investment of time as well as resources, the profession can make it a reality.

Appendix: Train and Classify

This appendix lists two Python programs. The first (train.py) creates a model for the classification of plain text files. The second (classify.py) uses the output of the first to classify other plain text files. For your convenience, the scripts and some sample data ought to be available in a GitHub repository.6 The purpose of including these two scripts is to help demystify the process of machine learning.

6. See https://github.com/ericleasemorgan/bringing-algorithms.

Train

The following Python script is a simple classification training application. Given a file name and a list of directories containing .txt files, this script first reads all of the files' contents and the names of their directories into sets of data and labels (think "categories"). It then divides the data and labels into training and testing sets; such is a best practice for these types of programs so the models can be evaluated for accuracy. Next, the script counts and tabulates ("vectorizes") the training data and creates a model using a variation of the Naive Bayes algorithm. The script then vectorizes the test data, uses the model to classify the test data, and compares the resulting classifications to the originally supplied labels. The result is an accuracy score; generally speaking, a score greater than 75% is on the road to success, while a score of 50% is no better than flipping a coin. Finally, the model is saved to a file for later use.

# train.py - given a file name and a list of directories
# containing .txt files, create a model for classifying
# similar items

# require the libraries/modules that will do the work
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) < 4 :
	sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <model> <directory> <another directory>\n" )
	quit()

# get the name of the file where the model will be saved
model = sys.argv[ 1 ]

# get the rest of the input, the names of directories to process
directories = []
for i in range( 2, len( sys.argv ) ) :
	directories.append( sys.argv[ i ] )

# initialize the data to analyze and its associated labels
data   = []
labels = []

# loop through each given directory
for directory in directories :

	# find all the text files and get the directory's name
	files = glob.glob( directory + "/*.txt" )
	label = os.path.basename( directory )

	# process each file
	for file in files :

		# open the file
		with open( file, 'r' ) as handle :

			# add the contents of the file to the data
			data.append( handle.read() )

		# update the list of labels
		labels.append( label )

# divide the data/labels into training sets and testing sets; a best practice
data_train, data_test, labels_train, labels_test = train_test_split( data, labels )

# initialize a vectorizer, and then count/tabulate the training data
vectorizer = CountVectorizer( stop_words='english' )
data_train = vectorizer.fit_transform( data_train )

# initialize a classification model, and then use Naive Bayes to create a model
classifier = MultinomialNB()
classifier.fit( data_train, labels_train )

# count/tabulate the test data, and use the model to classify it
data_test       = vectorizer.transform( data_test )
classifications = classifier.predict( data_test )

# begin to test for accuracy
count = 0

# loop through each test classification
for i in range( len( classifications ) ) :

	# increment, conditionally
	if classifications[ i ] == labels_test[ i ] : count += 1

# calculate and output the accuracy score; above 75% begins to achieve success
print( "Accuracy: %s%%\n" % ( int( ( count * 1.0 ) / len( classifications ) * 100 ) ) )

# save the vectorizer and the classifier (the model) for future use, and done
with open( model, 'wb' ) as handle : pickle.dump( ( vectorizer, classifier ), handle )
exit()
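Assuming directories of .txt files named after their categories (authors, in the simplest case), the script might be invoked like this; the file and directory names are merely illustrative:

python train.py model.bin ./austen ./thoreau

The script outputs an accuracy score and saves the vectorizer and classifier to model.bin.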
Classify

The following Python script is a simple classification program. Given the model created by the previous script (train.py) and a directory containing a set of .txt files, this script will output a suggested label ("classification") and a file name for each file in the given directory. In other words, this script automatically classifies a set of plain text files.

# classify.py - given a previously saved classification model and
# a directory of .txt files, classify a set of documents

# require the libraries/modules that will do the work
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) != 3 :
	sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <model> <directory>\n" )
	quit()

# get input; get the model to read and the directory containing the .txt files
model     = sys.argv[ 1 ]
directory = sys.argv[ 2 ]

# read the model
with open( model, 'rb' ) as handle :
	( vectorizer, classifier ) = pickle.load( handle )

# process each .txt file
for file in glob.glob( directory + "/*.txt" ) :

	# open, read, and classify the file
	with open( file, 'r' ) as handle :
		classification = classifier.predict( vectorizer.transform( [ handle.read() ] ) )

	# output the classification and the file's name
	print( "\t".join( ( classification[ 0 ], os.path.basename( file ) ) ) )

# done
exit()
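Again with illustrative names, the second script might be invoked like this:

python classify.py model.bin ./unknown

For each .txt file in the given directory, the script outputs a suggested label and the file's name, separated by a tab.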
References

Crawford, Walt. 1989. MARC for Library Use: Understanding Integrated USMARC. 2nd ed. Boston: G.K. Hall.

LOC (Library of Congress). 2017. The Card Catalog: Books, Cards, and Literary Treasures. San Francisco: Chronicle Books.