Chapter 13

Towards a Chicago place name dataset: From back-of-the-book index to a labeled dataset

Ana Lucic
University of Illinois

John Shanahan
DePaul University

Introduction

Reading Chicago Reading[1] is a grant-supported digital humanities project that takes as its
object the “One Book One Chicago” (OBOC) program[2] of the Chicago Public Library (CPL). Since
fall 2001, One Book One Chicago has fostered community through reading and discussion. On its
“Big Read” website, the Library of Congress includes information about One Book programs
around the United States,[3] and the American Library Association (ALA) also provides materials
with which a library can build its own One Book program and, in this way, bring members of
its community together in a conversation.[4] While community reading programs are not a
new phenomenon and exist in various formats and sizes, the One Book One Chicago program is
notable because of its size (the Chicago Public Library has 81 local branches) as well as its history
(the program has run continuously for nearly 20 years). Although relatively common,
book clubs and community-based reading programs are not assessed as regularly as other library
programming components, nor are they often the subject of long-term quantitative study.

[1] The Reading Chicago Reading project (https://dh.depaul.press/reading-chicago/) gratefully
acknowledges the support of the National Endowment for the Humanities Office of Digital
Humanities, HathiTrust, and Lyrasis.
[2] See https://www.chipublib.org/one-book-one-chicago/.
[3] See http://read.gov/resources/.
[4] See http://www.ala.org/tools/programming/onebook.

Machine Learning, Libraries, and Cross-Disciplinary Research, Chapter 13

The following research questions have guided the Reading Chicago Reading project so far:
Can we predict the future circulation of a book using a predictive model based on prior
circulation, community demographics, and text characteristics? How did different neighborhoods
in a diverse but also segregated city respond to particular book choices? Have certain books been
more popular than others around the city as measured by branch-level circulation, and can these
differences in checkout totals be correlated with CPL outreach work? A related question is the
focus of this paper: by associating place names with sentiment scores in Chicago-themed OBOC
books, what trends emerge from spatial analysis? The analysis is still in progress, and results
will appear in future papers. In the meantime, exploring these questions, and attempting to
find solutions for some of them, enables us to reflect on some innovative services that libraries
can offer. We discuss this possibility in the last section of this paper.

Chicago as a place name

Thus far, the Reading Chicago Reading project has focused the bulk of its analysis on seven recent
OBOC book selections and their respective “seasons” of public outreach programming:

• Fall of 2011: Saul Bellow’s The Adventures of Augie March

• Spring of 2012: Yiyun Li’s Gold Boy, Emerald Girl

• Fall of 2012: Markus Zusak’s The Book Thief

• 2013–2014: Isabel Wilkerson’s The Warmth of Other Suns

• 2014–2015: Michael Chabon’s The Amazing Adventures of Kavalier and Clay

• 2015–2016: Thomas Dyja’s The Third Coast

• 2016–2017: Barbara Kingsolver’s Animal, Vegetable, Miracle: A Year of Food Life

All of the works listed above, spanning fiction and non-fiction, are still in copyright.
Of the seven works, three were categorized as Chicago-themed because they take place in
the Chicago area in whole or in substantial part: Saul Bellow’s The Adventures of Augie March,
Isabel Wilkerson’s The Warmth of Other Suns, and Thomas Dyja’s The Third Coast.

As part of the ongoing work of the Reading Chicago Reading project, we used the secure data
portal of the HathiTrust Research Center to access and pre-process the in-copyright works
in our set. The HathiTrust research portal permits the extraction of non-consumptive
features of the works included in the digital library, even those that are still under copyright.
Non-consumptive features do not violate copyright restrictions because they do not allow the
regular reading (“consumption”) or digital reconstruction of the full work in question. An
example of a non-consumptive feature is part-of-speech information extracted in aggregate, with
or without connection to its source words. Location words (i.e., place names) in the text are
another example of a non-consumptive feature, as long as we do not extract locations together
with their surrounding context: while extracting a location word alone from a work under copyright
will not violate copyright law, extracting the location word with its surrounding context
(a fixed-size “window” of words around the location word) might do so. Similarly, the
sentiment of a sentence falls under the category of non-consumptive features as long as
we do not extract both the entire sentence and its sentiment score. Using these methods, we were
able to use the HathiTrust research portal to access copyrighted works and to extract location
words and the sentiment of individual sentences from them. As later paragraphs will show,
however, we also needed to verify the accuracy of these extractions, which we did manually by
checking the extracted references against the actual text of each work.

This paper arises from the finding that the three OBOC books that are set largely in or are
about Chicago circulated differently than the OBOC books that are not (i.e., Markus Zusak’s
The Book Thief, Yiyun Li’s Gold Boy, Emerald Girl, Barbara Kingsolver’s Animal, Vegetable,
Miracle, and Michael Chabon’s The Amazing Adventures of Kavalier and Clay). Since one of the
findings was that some CPL branches had higher circulation for “Chicago” OBOC books than for
others in the program, we wanted to (1) determine which place names were featured in the three
books and (2) quantify and examine the sentiment associated with these places. Although
recognizing a well-defined place name in a text by automated means is no longer a difficult task
thanks to the development of named entity recognizers such as the Stanford Named Entity
Recognizer,[5] OpenNLP,[6] spaCy,[7] and NLTK,[8] recognizing whether a place name is a
reference to a Chicago location is a harder task. If Chicago is the setting or one of the main
topics of the book, then we can assume that a number of the locations mentioned will also be
Chicago place names. However, if information about the topicality or locality of the book is not
known in advance, or if the plot moves from location to location, then the task of verifying
through automated methods whether a place name is a Chicago location is much harder.
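One simple baseline for the second step, deciding whether an already-recognized place name refers to a Chicago location, is a lookup against a list of known local names. The following is a minimal sketch of that idea and not the project's actual pipeline: the gazetteer entries, the set of generic names, and the function names are all hypothetical, and a real list would be far longer.

```python
import re

# Hypothetical gazetteer entries (assumption); a real list would be much larger.
CHICAGO_GAZETTEER = {"hyde park", "wrigley field", "state street", "madison street"}

# Hypothetical set of names that also occur in many other cities (assumption).
GENERIC_NAMES = {"state street", "madison street"}


def normalize(name):
    """Lowercase, trim, collapse whitespace, and drop a leading 'the'."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"^the ", "", name)


def classify_place(name):
    """Return 'chicago', 'ambiguous', or 'unknown' for a recognized place name."""
    key = normalize(name)
    if key not in CHICAGO_GAZETTEER:
        return "unknown"
    return "ambiguous" if key in GENERIC_NAMES else "chicago"


# Candidate strings as a named entity recognizer might return them.
candidates = ["Hyde Park", "the  Wrigley Field", "Madison Street", "Memphis"]
labels = {name: classify_place(name) for name in candidates}
```

A lookup of this kind cannot resolve names flagged as "ambiguous" and cannot see names missing from the list, so its output would still need manual verification or additional context.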

With the help of LinkedGeoData[9] we were able to obtain all of the Chicago place names
identified by volunteers through the OpenStreetMap project[10] and then download a listing that
included Chicago buildings, theaters, restaurants, streets, and other prominent places. While this
listing was very useful, we also realized that it was missing historical Chicago place names.
At the same time, the way that place names are represented in a text will likely not always
correspond to the way a place name is formally represented in a dictionary, database, or knowledge
graph. For example, a sentence might simply use an anaphoric reference such as “that building” or
“her home” instead of directly naming the entity known from other sentences. Moreover, there
were many examples of generic place names: how many cities in the United States have a State
Street, a Madison Street, or a 1st Avenue? A further difficulty was determining the
type of place names we wanted to identify and collect from the text’s total set of location word
tokens: it soon became obvious that, for the purposes of visualizing a place name on a map,
general references to Chicago went beyond the scope of the maps we wanted to create. We became
more interested in tracking references to specific Chicago place names that included buildings
(historical and present), named areas of the city, monuments, streets, theaters, restaurants, and
the like. Given that our total dataset for this task comprised just three books, we were able to
manually sift through the automatically identified place names and verify whether each was indeed
a Chicago place name. We also established the sentiment of each location-bearing sentence
in the three books using the Stanford Sentiment Analyzer.[11] Our guiding principle was that the
specific place(s) mentioned in a sentence “inherit” the sentiment score of the entire sentence. This
principle may not always hold, but our manual inspection of the sentiment assigned to
sentences, and therefore to the locations mentioned in them, established that this was a fairly
accurate estimate: the sentiment score of the entire sentence is at the very least connected to or
“resonates” with the individual components of the sentence, including place names. While we did
examine some samples, we did not conduct a qualitative analysis of the accuracy of the sentiment
scores assigned to the corpus.

[5] See https://nlp.stanford.edu/software/CRF-NER.html.
[6] See https://opennlp.apache.org/.
[7] See https://spacy.io/.
[8] See https://www.nltk.org/book/ch07.html.
[9] See http://linkedgeodata.org/About.
[10] See https://www.openstreetmap.org/.

Figure 13.1: Mapping place names associated with positive (top row) and very negative (bottom
row) sentiment extracted from three OBOC books.

Figure 13.1 documents an example of the results of our effort to integrate place names with
the sentiment of the sentence.
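The “inheritance” principle can be illustrated with a short sketch in which every verified place name in a sentence receives that sentence's sentiment class and the classes are then counted per place. The records below are invented for illustration, the five-class scale is assumed to match the labels reported by the Stanford sentiment model, and the function name is hypothetical. Only sentence identifiers, place names, and classes are stored, in keeping with the non-consumptive constraint.

```python
from collections import Counter, defaultdict

# Assumed five-class sentence-level scale, indexed 0 to 4.
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

# Hypothetical records: (sentence id, verified place names, class index).
# The sentence text itself is deliberately not stored.
sentence_records = [
    (1, ["Hyde Park"], 1),
    (2, ["Hyde Park", "State Street"], 0),
    (3, ["Wrigley Field"], 3),
    (4, [], 2),
]


def inherit_sentiment(records):
    """Give each place the class of its sentence and count classes per place."""
    counts = defaultdict(Counter)
    for _sentence_id, places, class_index in records:
        # set() ensures a place repeated within one sentence is counted once.
        for place in set(places):
            counts[place][LABELS[class_index]] += 1
    return counts


place_sentiment = inherit_sentiment(sentence_records)
```

Per-place counts of this kind are the sort of quantity that could drive marker size on a map, with one layer per sentiment class.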

Particularly notable in Figure 13.1 is The Third Coast (right column), which shows a
concentration of positively associated Chicago place names in the northern parts of the city along
the shore of Lake Michigan. Negative sentiment, by contrast, appears to be more concentrated in the
central part of Chicago and in the southern parts of the city.

The place names extracted from our three Chicago-setting OBOC books allowed us to focus
on particular areas of the city such as Hyde Park on the South Side, which is mentioned in each
of them. In the maps in Figure 13.2, larger circles correspond to a greater number of sentences
that mention Hyde Park. In both The Adventures of Augie March and The Warmth of Other Suns,
these sentences are associated with negative sentiment. The Third Coast, on the other hand,
features sentences in which Hyde Park is mentioned in both positive and negative contexts.

[11] See https://nlp.stanford.edu/sentiment/.

Figure 13.2: Mapping of sentences that feature “Hyde Park,” and their sentiment, from three
OBOC program books.

These results prompt us to continue with this line of research and to procure a larger
“control” set of texts with Chicago place names and sentiment scores. This would allow us to focus
on specific places such as “Wrigley Field” or the once-famous but no longer existing “Mecca”
apartment building (which stood at the intersection of 34th and State Street on the South Side and
was immortalized in a 1968 poetry collection by Gwendolyn Brooks). With a robust place name
dataset, we could analyze the context in which these place names were mentioned in other
literature, in contemporary or historical newspapers (Chicago Tribune, Chicago Sun-Times, Chicago
Defender), or in library and archival materials. Promising contextual elements would include the
sentiment associated with the place name.

Our interest in creating a dataset of Chicago place names extracted from literature led us to
The Chicago of Fiction, a vast annotated bibliography by James A. Kaser. Published in 2011, this
work contains entries on more than 1,200 works published between 1852 and 1980 that feature
Chicago. Kaser’s book contains several indexes that can serve as sources of labeled data, that is,
of instances in which Chicago locations are mentioned. Although we are still determining how many
of the titles included in the annotated bibliography already exist in digital format or are accessible
through the HathiTrust digital library, it is likely that a subset of the total can be accessed
electronically. Even if the books do not presently exist in electronic format, it is still possible to
use the index as a source of already-labeled data for Chicago place names. We anticipate that such
a dataset would be of interest to researchers in Urban Studies, Literature, History, and
Geography. A sufficiently large number of sentences featuring Chicago place names would enable us
to move toward a Chicago place name recognizer that can “learn” Chicago context, or to
examine how much context is sufficient to establish whether, for instance, a “Madison Street”
place name in a text is located in Chicago or elsewhere.
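To make the step from a printed index to labeled records concrete, the sketch below parses index lines into place name entries. We do not reproduce the layout of Kaser's indexes here: the line format (a heading followed by comma-separated numbers or ranges), the sample lines, the label string, and the function name are all assumptions made for illustration.

```python
import re

# Assumed line format: "<heading>, <number or range>[, <number or range> ...]"
INDEX_LINE = re.compile(r"^(?P<heading>.+?),\s*(?P<locators>\d[\d,\s\-–]*)$")


def parse_index_line(line):
    """Turn one index line into a labeled record, or None if it does not fit."""
    match = INDEX_LINE.match(line.strip())
    if match is None:
        return None
    entries = []
    for part in match.group("locators").split(","):
        bounds = [b.strip() for b in re.split(r"[-–]", part) if b.strip()]
        if not bounds:
            continue
        start, end = int(bounds[0]), int(bounds[-1])
        # Abbreviated ranges such as "301-3" are not expanded; keep the start only.
        if end < start:
            end = start
        entries.extend(range(start, end + 1))
    return {
        "place": match.group("heading").strip(),
        "label": "CHICAGO_PLACE",  # hypothetical label name
        "entries": entries,
    }


# Invented sample lines; the second has no locators and is skipped.
sample_lines = ["Hyde Park, 12, 45, 301-303", "See also Loop"]
records = [r for r in map(parse_index_line, sample_lines) if r is not None]
```

Each resulting record links a place name to the bibliography entries that mention it, which is the kind of pairing a labeled dataset would need before any sentences are attached.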

How do libraries innovate? From print index to labeled data

Over the last decade, libraries have pioneered services related to the development and preservation
of digital scholarship projects. Librarians frequently assist faculty and students with the
development of digital humanities and digital scholarship projects. They point patrons to resources
and portals where they can find data and help with licensing. Librarians also procure datasets,
and some perform data cleaning and pre-processing tasks. Yet it is still uncommon
for librarians to participate in the creation of a dataset. A relatively recent initiative, however,
Collections as Data,[12] directly tackles the issue of treating research, library, and cultural
heritage collections as data and providing access to them. This ongoing initiative aims to create
12 projects that can serve as a model to other libraries for making collections accessible as data.

The data that undergird the mechanisms of library workings (circulation records for physical
and digital objects, metadata records, and the like) are not commonly available as datasets
open to machine learning tasks. If they were, not only could libraries refer others to the already
created and annotated physical and digital objects, but they could also participate in creating
objects that are local to their settings. Creation and curation of such datasets could in turn help
establish new relationships between area libraries and local communities. One can imagine a
“data challenge,” for instance, in which libraries assemble a community by building a dataset
relevant to that community. Such an effort would need to be preceded by an assessment of the data
needs and interests of that particular community. In the case of a Chicago place name dataset
challenge, efforts could revolve around local communities adding sentences to the dataset from
literary sources. A second step might involve organizing a crowdsourced data challenge to build a
place name recognizer model (e.g., a Chicago place name recognizer) based on the sentences
gathered.

One can also imagine turning metadata records into curated datasets that are shared with
local communities and with teachers and university lecturers for use in the classroom. Once a
dataset is built, scenarios can be invented for using it. This kind of work invites conversations
with faculty members about their needs and about potential datasets that would be of particular
interest. Creation of datasets based on unique materials at their disposal will enrich the palette
of services already offered by libraries.

[12] See https://collectionsasdata.github.io/part2whole/.

One of the main goals of the Reading Chicago Reading project was the creation of a model
that can predict the circulation of a One Book One Chicago program book selection given
parameters such as prior circulation for the book, its text characteristics, and the geographical
locality of the work. We are not aware of other predictive models that integrate circulation records
with text features extracted from the books in this way. Given that circulation records are not
commonly integrated with other data sources when they are analyzed, linking different data sources
with circulation records is another challenging opportunity that this paper envisions.

Ultimately, libraries can play a dynamic role in both managing and creating data and datasets
that can be shared with the members of local communities. Using back-of-the-book indexes as
a source of labeled place name data is an approach that we have begun to prototype but that still
requires further exploration and troubleshooting. While organizing a data challenge takes a lot of
effort, it can be an effective way of reaching out to one’s local community and identifying
its data needs. To this end, we aim to make freely available our curated list of sentences and
associated sentiment scores for Chicago place names in the three OBOC selections centered on
Chicago. We will invite scholars and the general public to add more Chicago location sentences
extracted from other literature. Our end goal is a labeled training dataset for the creation of a
Chicago place name recognizer, which, we hope, will enable new avenues of research.

References

American Library Association. n.d. “One Book One Community.” Programming & Exhibitions
(website). Accessed May 31, 2020. http://www.ala.org/tools/programming/onebook.

Bird, Steven, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python.
Sebastopol, CA: O’Reilly Media Inc.

Chicago Public Library. n.d. “One Book One Chicago.” Accessed May 31, 2020.
https://www.chipublib.org/one-book-one-chicago/.

“Collections as Data: Part to Whole.” n.d. Accessed May 31, 2020.
https://collectionsasdata.github.io/part2whole/.

Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005. “Incorporating Non-local
Information into Information Extraction Systems by Gibbs Sampling.” In Proceedings
of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005),
363-370. https://www.aclweb.org/anthology/P05-1045/.

HathiTrust Digital Library. n.d. Accessed May 31, 2020. https://www.hathitrust.org/.

Kaser, James A. 2011. The Chicago of Fiction: A Resource Guide. Lanham: Scarecrow Press.

Library of Congress. n.d. “Local/Community Resources.” Read.gov. Accessed May 31, 2020.
http://read.gov/resources/.

LinkedGeoData. n.d. “About / News.” Accessed May 31, 2020. http://linkedgeodata.org/About.

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and
David McClosky. 2014. “The Stanford CoreNLP Natural Language Processing Toolkit.”
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics:
System Demonstrations, 55-60. https://www.aclweb.org/anthology/P14-5010/.

OpenStreetMap. n.d. Accessed May 31, 2020. https://www.openstreetmap.org/.

Reading Chicago Reading. n.d. “About Reading Chicago Reading.” Accessed May 31, 2020.
https://dh.depaul.press/reading-chicago/about/.