Applying Topic Modeling for Automated Creation of Descriptive Metadata for Digital Collections

Monika Glowacka-Musial

INFORMATION TECHNOLOGY AND LIBRARIES | JUNE 2022
https://doi.org/10.6017/ital.v41i2.13799

Monika Glowacka-Musial (monikagm@nmsu.edu) is Assistant Professor/Metadata Librarian, New Mexico State University Library. © 2022.

ABSTRACT

Creation of descriptive metadata for digital objects tends to be a laborious process. In particular, subject analysis, which seeks to classify the intellectual content of digitized documents, typically requires considerable time and effort to determine the subject headings that best represent the substance of those documents. This project examines the use of topic modeling to streamline the workflow for assigning subject headings to the digital collection of New Mexico State University news releases issued between 1958 and 2020. The optimized workflow enables timely scholarly access to unique primary source documentation.

INTRODUCTION

Digital scholarship relies on digital collections and data. In the influential book Digital_Humanities, Anne Burdick and her associates affirm that humanistic knowledge production depends on collection building and curation.1 Access to historical documents and data resources is essential for the development of new research questions and methodologies.2 This project uses topic modeling to support building a digital collection of institutional news releases. It is one of several initiatives to apply digital technologies to library workflows.

NEW MEXICO STATE UNIVERSITY NEWS RELEASES

In response to growing scholarly and public interest in original university press announcements, the digitization of past NMSU print news releases was approved in September 2013. Sixty years of news releases, from the late 1950s to the present, were to be included.
One of the arguments presented in justification of the project was that these institutional news briefs have truly unique historical value. Researchers view university press announcements as anchors in the history of NMSU and the region, particularly for dating events and initiatives. They also find official communications essential for studying how the news was framed by participants and the university administration. Historically, the relationship between the university and the local media has always been a major concern of college administrators: how to respect the freedom of the press while ensuring responsible and factual journalism, and how to build an effective partnership that would benefit both sides?3 To address these questions, the administration early on established the college’s Information Services, which has issued news releases about campus events, programs, and developments in the college’s research, teaching, and service. These formal news reports, representing the perspective of the university, have been regularly distributed to local and worldwide media for many decades. The collection has become one of the most popular primary sources documenting the history of this Southwestern educational institution.

Since the beginning of the digitization project, thousands of press releases have been scanned, described, and added to the digital collection. Currently, the collection features press releases issued by the university between 1958 and 1974. There is still a lot to be done. The most time-consuming step in the process is adding metadata, including Library of Congress Subject Headings, to individual news releases. With decreasing personnel, dwindling library resources, and competing work priorities, progress on the project has slowed substantially.
Its revitalization requires a fresh, problem-solving approach that would significantly reduce the time catalogers spend on metadata creation. In search of a viable solution, topic modeling, a computational tool for classifying large collections of texts, was put to the test and generated promising results. The following sections describe the tools, data, and process created for this experiment in some detail.

TOPIC MODELING AND ITS APPLICATIONS

Topic modeling (TM) is one of the methodologies used in natural language processing (NLP). It was specifically designed for text mining and for discovering hidden patterns in huge collections of documents, images, and networks.4 According to practitioners, topic modeling is best viewed as a statistical tool for text exploration and open-ended discovery.5 It has been used extensively in computer science, genetics, marketing, political science, journalism, and digital humanities for the last two decades. A growing literature on topic modeling applications provides clear evidence of its viability.6

Examples of TM applications in digital social sciences and humanities include finding geographic themes in GPS-associated documents on social media platforms such as Flickr and Twitter,7 selecting news articles on opposition to the euro currency from Financial Times data,8 identifying paragraphs on epistemological concerns in English and German novels,9 tracking research trends in different disciplines,10 and revealing dominant themes in newspapers,11 governance literature,12 and Wikipedia entries.13 Topic modeling has also been applied, alongside text mining, to enhance access to large digital collections by providing minimal description and enriching metadata, including subject headings.14 The possibility of using topic modeling to determine subject headings for books on Project Gutenberg has also been explored.15

Topic modeling in a nutshell

Topic models help to identify the contents of document collections.
Topic modeling is a process of discovering clusters of words that best represent a set of topics. Figure 1 shows the basic idea behind topic modeling. A large collection of text documents (the scrolls on top) consists of thousands of words (shown symbolically at the bottom). The algorithm seeks the most frequent words that tend to occur in proximity and clusters them together. Each cluster, referred to as a topic, is a set of words, each with a probability of belonging to that topic. Each document in the collection combines these topics to different degrees. Documents are thus seen as mixtures of topics, and topics as mixtures of words.16 Topics also provide context to words. Documents that have similar combinations of topics tend to be related. As a result, a large collection of text documents can be represented by a limited set of topics (presented by icons in the middle of the figure).

Figure 1. Basic idea behind topic modeling.

Topics and subject headings combined

The original purpose of topic modeling, as formulated by David Blei and his associates in 2003, was to make large collections of texts more approachable for scholars by organizing texts automatically based on latent topics.17 These hidden topics can be discovered, measured, and consequently used by scholars to navigate the collection. The purpose of assigning subject headings is to identify “aboutness,” or simply the subject concepts covered by the intellectual content of a given work, and, again, to collocate related works.18 Since topic models and subject headings share a similar purpose, although with very different methodology and scale, we decided to combine them and make topic models a prerequisite for assigning subject headings.
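The mixture view described above (documents as mixtures of topics, topics as mixtures of words) can be expressed in a few lines. The following Python sketch uses invented topics, words, and probabilities purely for illustration; it is not the article's R code:

```python
# Toy illustration of the mixture view. All topic names, words, and
# probabilities below are invented for demonstration.

# beta: probability of each word within each topic
beta = {
    "theater":   {"play": 0.5, "stage": 0.4, "award": 0.1},
    "athletics": {"game": 0.6, "award": 0.3, "stage": 0.1},
}

# gamma: the topic mixture of one document
gamma = {"theater": 0.8, "athletics": 0.2}

def word_probability(word, gamma, beta):
    # P(word | document) = sum over topics of gamma[topic] * beta[topic][word]
    return sum(g * beta[topic].get(word, 0.0) for topic, g in gamma.items())

# "award" gets probability 0.8 * 0.1 + 0.2 * 0.3 = 0.14 in this document
print(round(word_probability("award", gamma, beta), 3))
```

Documents whose gamma vectors are similar share dominant topics, which is what makes topics usable as a compact representation of a large collection.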
In such a scenario, the computer deals with text collections at a scale beyond human reading capacity, and catalogers then fine-tune the results generated by the algorithm. The following Methods section shows the subsequent stages involved in the process of semiautomated assignment of subject headings to documents.

METHODS

Overview

For topic modeling, we used the Latent Dirichlet Allocation (LDA) algorithm.19 LDA takes a document-term matrix, with rows corresponding to documents and columns corresponding to terms (words), and, based on semirandom exploration, finds optimal probabilities of topics in documents (called gammas) and probabilities of terms in topics (called betas). After LDA generates a set of topics that best represent the collection of news releases, each topic is associated with several subject headings that were previously assigned to news releases by catalogers. For a new news release, LDA finds the set of most representative topics. Subject headings associated with the dominant topics are combined into a list of subject candidates presented to a cataloger. In the last step, a cataloger uses this short list of candidates to select subject headings for the news release.

Training data

The training data used in this project consists of over 6,000 news releases (from 1958 to 1967) annotated with metadata. Only two metadata properties, titles and subject headings, were considered. Created by catalogers, both properties accurately reflect the content of the news releases, although mistakes occasionally happen. The values from the titles field were converted into a document-term matrix that, in turn, became the input for the algorithm. Texts produced by OCR on the original news releases were not included in the analysis due to their poor quality.

Detailed steps of the proposed method:

1. Topic modeling on training data:
   a. Run standard preprocessing of the training text data, including tokenization, stop word removal, and stemming.
   b. Run topic modeling (LDA), in which each document from the training data set is assigned a set of topics (subsets of words), each with a measurable contribution to the document.

2. Assignment of subject headings to topics.20 For each topic:
   a. Select a number of documents with the highest probability (gamma) for the topic. We used 400.
   b. Gather the set of subject headings assigned to the documents selected in 2.a and arrange them by decreasing frequency (freq) of occurrence in the set.

3. Assignment of subject headings to a new document:
   a. Assign to the new document the gammas (probabilities) of topics using the LDA model trained in 1.b.
   b. For each subject heading in the assigned topics, calculate its weight in the document as the product of its frequency in the topic (freq) and the probability of the topic (gamma) in the document; for subject headings duplicated across topics, sum their weights across topics.
   c. Create a list of the candidate subject headings from 3.b in descending order of their weights in the document.
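The weighting in steps 2 and 3 reduces to simple arithmetic. The Python sketch below, with invented subjects, frequencies, and gammas (the article's implementation is in R), shows how candidate subject headings would be scored and ranked:

```python
from collections import defaultdict

# Step 2 output (invented mini-data): for each topic, subject headings with
# their frequency (freq) among the topic's top-gamma training documents.
topic_subjects = {
    0: {"Theater": 30, "Students": 12},
    1: {"Scholarships": 25, "Students": 20},
}

# Step 3a: gammas (topic probabilities) for a new document, as produced by
# the trained LDA model.
doc_gammas = {0: 0.7, 1: 0.3}

def candidate_subjects(doc_gammas, topic_subjects):
    """Steps 3b-3c: weight each subject heading by freq * gamma, summing
    duplicates across topics, and return candidates sorted by weight."""
    weights = defaultdict(float)
    for topic, gamma in doc_gammas.items():
        for subject, freq in topic_subjects[topic].items():
            weights[subject] += freq * gamma
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# Theater: 30*0.7 = 21.0; Students: 12*0.7 + 20*0.3 = 14.4; Scholarships: 25*0.3 = 7.5
print(candidate_subjects(doc_gammas, topic_subjects))
```

Note how "Students", which appears in both topics, accumulates weight from each; this is the summation rule of step 3.b.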
IMPLEMENTATION

There is a growing number of tools used for topic modeling.21 For this project, we used the R programming language, which has many packages for data preprocessing and topic modeling.22 The R packages used for this project are listed below:

• topicmodels, with the functions LDA() for producing topic models, posterior() for assigning topics to test documents using pretrained models, and perplexity() for perplexity calculation23
• tidytext, with tidying functions that allow for rearranging and exploring data as well as for interpreting the models
• textstem for preprocessing data, including stemming and lemmatization
• tidyr, dplyr, and stringr for data and string manipulation and arrangement
• ggplot2 for data visualizations

The code related to topic modeling was mostly reused from the DataCamp class on topic modeling.24 Occasionally, the data.table data structure was used instead of data.frame. In addition to standard stop words, custom stop words, including initials, names of weekdays, and dates, were removed from the corpus using the function anti_join(). For finding topics in test documents with a pretrained model, the function posterior() from the R package topicmodels was used.25 An extra step needed before using posterior() was to align the new document with the document-term matrix used for training the LDA model.26

RESULTS

For assessing the method’s performance, we adopted the idea of recall. In this specific context, recall is defined as the fraction of original subject headings (i.e., those assigned to a document manually by a cataloger) that are present on the list of candidate subject headings produced by the method.
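The alignment step required before calling posterior(), described in the Implementation section above, amounts to recasting the new document's term counts over the training vocabulary. A minimal Python sketch, with an invented vocabulary and document (the article performs this step in R):

```python
from collections import Counter

# Invented training vocabulary (the columns of the training document-term matrix).
training_vocab = ["concert", "student", "theater", "scholarship"]

def align_to_vocabulary(tokens, vocab):
    """Build a count vector over the training vocabulary: out-of-vocabulary
    tokens are dropped, and unseen vocabulary terms get a zero count."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocab]

new_doc = ["concert", "concert", "orchestra", "student"]  # "orchestra" is out of vocabulary
print(align_to_vocabulary(new_doc, training_vocab))
```

The resulting vector has exactly the columns the trained model expects, which is why terms the model has never seen must be discarded rather than appended.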
The average recall is estimated using a leave-one-out setting.27 A single test document is set aside, the LDA model is trained on the remaining documents, and recall is calculated for the test document using the list of candidate subject headings produced by the method. Recall is then averaged over a set of test documents. This approach estimates the method’s performance on a new document.

Figure 2. Average recall as a function of the size of the list of candidate subject headings.

Figure 2 shows how average recall depends on the length of the candidate list produced by the method. Recall is averaged over 1,500 randomly selected test documents. The dashed line represents chance-level performance, i.e., the recall expected if the method produced a random subset of all subject headings available in the data. For a list of 100 suggested subject headings, recall is on average above 0.6, and for a list of 500 candidate subject headings, above 0.8. Even though the average recall stays noticeably below 1 (a recall of 1 would mean perfect performance), it is still considerably above the chance level.

The results presented in figure 2 were produced by an LDA model trained with 16 topics. One of the parameters affecting the method’s performance is the number of topics used by the LDA model. To find the number of topics corresponding to the highest recall, an overall measure of recall across different lengths of the candidate list was defined as the cumulative recall for the first 100 subject candidates. We assumed that 100 is a likely size of candidate list that catalogers would be willing to go through. Figure 3 shows the cumulative recall for different numbers of topics, based on which 16 was chosen as the optimum.
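The recall computation above can be sketched in a few lines of Python. The documents and candidate lists below are invented, and the sketch shows only the recall arithmetic, not the leave-one-out retraining of the LDA model:

```python
def recall(original_subjects, candidate_list):
    """Fraction of the cataloger-assigned subject headings that appear on
    the method's candidate list for the same document."""
    if not original_subjects:
        return 0.0
    hits = sum(1 for s in original_subjects if s in candidate_list)
    return hits / len(original_subjects)

# Each pair: (cataloger's subjects, method's candidate list), invented examples.
test_docs = [
    ({"Theater", "Students"}, ["Theater", "Concerts", "Students"]),  # recall 1.0
    ({"Scholarships", "Awards"}, ["Theater", "Awards"]),             # recall 0.5
]

average_recall = sum(recall(orig, cands) for orig, cands in test_docs) / len(test_docs)
print(average_recall)
```

Because most documents carry only a few subject headings, per-document recall takes a small set of discrete values (0, 0.5, 1.0 for a document with two headings), which explains the wide distribution reported in figure 5.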
Interestingly, this corresponds well with the dependence of perplexity on the number of topics (fig. 4). Perplexity, a measure of the model’s surprise at the data, shows how well the model fits the data: a smaller value means a better fit, i.e., a better topic model.28

Figure 3. Cumulative recall as a function of the number of topics in the LDA model.

Figure 4. Perplexity of the LDA model as a function of the number of topics in the LDA model.

To give a better idea of the method’s performance, figure 5 shows the distribution of recall across individual test documents for a list of 100 subject headings. Since most documents in the training data have just a few subject headings, only a small set of discrete recall values is possible for an individual document. The distribution is wide, with a fraction of documents having no subject heading present on the proposed list (recall = 0) but also a bigger fraction of documents fully covered by the list (recall = 1).

Figure 5. Distribution of recall across 1,500 test documents, for 100 subject candidates (for 16 topics).

The following examples show the sets of subject headings selected by the algorithm that include subject headings (in bold blue) chosen originally by catalogers.

Example 1

Title of news release: “‘Romeo and Juliet’ play - part of campus celebration for 400th anniversary of Shakespeare's birth”

Subjects  Weights
New Mexico State University. Playmakers  0.280
Theater  0.143
Students  0.080
Academic achievement  0.080
Theater--Production and direction  0.075
High school students  0.052
Competitions  0.048
New Mexico State University. College of Engineering  0.042
Plays  0.041
Debates and debating  0.038
New Mexico State University. Aggie Forensic Festival  0.036
Zohn, Hershel  0.034
Shakespeare, William, 1564-1616. A Midsummer Night's Dream  0.034
Forensics (Public speaking)  0.034
Frisch, Max, 1911-1991. Firebugs  0.027
Tickets  0.027
Theater rehearsals  0.027
New Mexico State University. College of Agriculture and Home Economics  0.022
Shakespeare, William, 1564-1616. Romeo and Juliet  0.020
Frisch, Max, 1911-1991  0.020
Performances  0.020
Garcia Lorca, Federico, 1898-1936. Casa de Bernarda Alba. English  0.020
Molière, 1622-1673. Bourgeois gentilhomme. English  0.020
Anniversaries  0.014
New Mexico State University. College of Teacher Education  0.012

Example 2

Title of caption to photo: “Locals Barbara Gerhard, Donna Herron, Lillian Jean Taylor rehearse for upcoming concert”

Subjects  Weights
Concerts  0.123
New Mexico State University. University-Civic Symphony Orchestra  0.085
INSTITUTION. Playmakers  0.077
United States. Air Force ROTC  0.073
United States. Army. Reserve Officers' Training Corps  0.062
Military cadets  0.058
Award presentations  0.054
Theater  0.039
Award winners  0.038
Scholarships  0.035
Music  0.035
Musicians  0.031
Awards  0.027
New Mexico State University. Department of Military Science  0.023
Theater--Production and direction  0.021
Kennecott Copper Corporation  0.019
Students  0.019
Glowacki, John  0.019
New Mexico State University Symphonic Band  0.015
New Mexico State University. University-Community Chorus  0.015
Lynch, Daniel  0.015
Drath, Jan  0.015
Performances  0.015
Military art and science  0.012
United States. Army--Inspection  0.012

DISCUSSION

The major advantage of the method described above is that it reduces the long list of Library of Congress Subject Headings that catalogers need to consult before assigning subject headings to news releases. It is important to note that the method produces only subject headings that are already present in the training data. The list of available subject headings can be expanded by periodically updating the training data to include all entries in the catalog, assuming catalogers add, where needed, subjects not yet present in the data set.

In this project we utilized metadata from just two fields: titles and subject headings. Although document titles are supposed to compactly represent the content of documents, we expect that the presented approach would give better results if the full text (OCR) were analyzed. In this project, the limiting factors were the quality of the print copies and the robustness of available OCR tools.

In some cases, subject annotations are imperfect, depending on the skills and experience of catalogers. That also affects the performance of our method, which relies on the quality of subject assignments. On the other hand, there are cases in which the method suggests subjects that fit the content of news releases but were not selected by catalogers. This indicates that the method can also be used to refine existing annotations.

CONCLUSION

We propose a way to streamline the workflow of metadata creation for university news releases by applying topic modeling. First, we use this digital technology to identify topics in a large collection of text documents. Then, we associate the discovered topics with sets of subject headings.
Finally, to a new document, we assign those subject headings that are associated with the document’s most dominant topics.

The proposed method facilitates the process of document annotation. It produces short lists of candidate subject headings that account for a significant part of the original labeling performed by catalogers. This approach can be applied to support annotation of any large digital collection of text documents.

One of the advantages of applying topic modeling is that it produces numeric representations of text documents. These numeric representations can be used by advanced analytical methodologies, including machine learning, for numerous practical purposes in library workflows, such as text categorization, collocation of similar materials, enhancing metadata for digital collections, and finding trends in government literature. In addition, librarians’ mastery of digital methodologies may open new ways of collaboration between them and digital scholars across university campuses. As Johnson and Dehmlow argue, “... digital humanities represent a clear opportunity for libraries to offer significant value to the academy, not only in the areas of tool and consultations, but also in collaborative expertise that supports workflows for librarians and scholars alike.”29 Digital technologies are best learned in hands-on practice. If librarians are to contribute to the development of digital scholarship, then they need to learn how to apply new technologies to their own work. And since both librarians and humanists work with texts, they might have much to offer each other.

Correction

On November 21, 2022, the URLs in references 24 and 26 were updated at the author’s request to avoid user login.

ENDNOTES

1 Anne Burdick et al., Digital_Humanities (Cambridge, Massachusetts: The MIT Press, 2012), 32–33.

2 Thomas G. Padilla, “Collections as Data Implications for Enclosure,” ACRL News 79, no. 6 (2018), https://crln.acrl.org/index.php/crlnews/article/view/17003/18751; Rachel Wittmann, Anna Neatrour, Rebekah Cummings, and Jeremy Myntti, “From Digital Library to Open Datasets: Embracing a ‘Collections as Data’ Framework,” Information Technology and Libraries 38, no. 4 (December 2019), https://doi.org/10.6017/ital.v38i4.11101.

3 Gerald W. Thomas, Academic Ecosystem: Issues Emerging in a University Environment (Gerald W. Thomas, 1998), 159–64.

4 David M. Blei, Andrew Ng, and Michael Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research 3, no. 1 (2003); David M. Blei, “Topic Modeling and Digital Humanities,” Journal of Digital Humanities 2, no. 1 (Winter 2012), http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/.

5 Megan R. Brett, “Topic Modeling: A Basic Introduction,” Journal of Digital Humanities 2, no. 1 (Winter 2012), http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/; Jordan Boyd-Graber, Yuening Hu, and David Mimno, “Applications of Topic Models,” Foundations and Trends® in Information Retrieval 11, no. 2–3 (2017): 143–296.

6 Boyd-Graber, Hu, and Mimno, “Applications of Topic Models,” 143–296; Rania Albalawi, Tet Hin Yeap, and Morad Benyoucef, “Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis,” Frontiers in Artificial Intelligence 3 (2020): 42, https://doi.org/10.3389/frai.2020.00042; Hamed Jelodar, Yongli Wang, Chi Yuan, Xia Feng, “Latent Dirichlet Allocation (LDA) and Topic Modeling: Models, Applications, a Survey” (2017), https://www.ccs.neu.edu/home/vip/teach/DMcourse/5_topicmodel_summ/notes_slides/LDA_survey_1711.04305.pdf.
7 Zhijun Yin et al., “Geographical Topic Discovery and Comparison,” in WWW: Proceedings of the 20th International Conference on the World Wide Web (2011), https://doi.org/10.1145/1963405.1963443.

8 David Andrzejewski and David Buttler, “Latent Topic Feedback for Information Retrieval,” in KDD '11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011), https://dl.acm.org/doi/10.1145/2020408.2020503.

9 Matt Erlin, “Topic Modeling, Epistemology, and the English and German Novel,” Cultural Analytics 1, no. 1 (May 1, 2017), https://doi.org/10.22148/16.014.

10 Cassidy R. Sugimoto et al., “The Shifting Sands of Disciplinary Development: Analyzing North American Library and Information Science Dissertations Using Latent Dirichlet Allocation,” Journal of the American Society for Information Science and Technology 62, no. 1 (January 2011), https://doi.org/10.1002/asi.21435; David Mimno, “Computational Historiography: Data Mining in a Century of Classics Journals,” Journal on Computing and Cultural Heritage 5, no. 1 (April 2012): 3:1–3:19; Andrew J. Torget and Jon Christensen, “Mapping Texts: Visualizing American Historical Newspapers,” Journal of Digital Humanities 1, no. 3 (Summer 2012), http://journalofdigitalhumanities.org/1-3/mapping-texts-project-by-andrew-torget-and-jon-christensen/; Andrew Goldstone and Ted Underwood, “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us,” New Literary History 45 (2014): 359–84; Carlos G. Figuerola, Francisco Javier Garcia Marco, and Maria Pinto, “Mapping the Evolution of Library and Information Science (1978–2014) Using Topic Modeling on LISA,” Scientometrics 112 (2017): 1507–35, https://doi.org/10.1007/s11192-017-2432-9; Jung Sun Oh and Ok Nam Park, “Topics and Trends in Metadata Research,” Journal of Information Science Theory and Practice 6, no. 4 (2018): 39–53; Manika Lamba and Margam Madhusudhan, “Metadata Tagging of Library and Information Science Theses: Shodhganga (2013–2017),” paper presented at ETD 2018: Beyond the Boundaries of Rims and Oceans Globalizing Knowledge with ETDs, National Central Library, Taipei, Taiwan, https://doi.org/10.5281/zenodo.1475795; Manika Lamba and Margam Madhusudhan, “Author-Topic Modeling of DESIDOC Journal of Library and Information Technology (2008–2017), India,” Library Philosophy and Practice (2019): 2593, https://digitalcommons.unl.edu/libphilprac/2593.

11 David J. Newman and Sharon Block, “Probabilistic Topic Decomposition of an Eighteenth-Century American Newspaper,” Journal of the American Society for Information Science and Technology 57, no. 6 (April 1, 2006): 753–67; Robert K. Nelson, “Mining the Dispatch,” last modified November 2020, https://dsl.richmond.edu/dispatch/about; Tze-I Yang, Andrew Torget, and Rada Mihalcea, “Topic Modeling on Historical Newspapers,” in LaTeCH '11: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (2011), https://dl.acm.org/doi/10.5555/2107636.2107649; Carina Jacobi, Wouter van Atteveldt, and Kasper Welbers, “Quantitative Analysis of Large Amounts of Journalistic Texts Using Topic Modelling,” Digital Journalism 4, no. 1 (2015), https://doi.org/10.1080/21670811.2015.1093271.

12 Jonathan O. Cain, “Using Topic Modeling to Enhance Access to Library Digital Collections,” Journal of Web Librarianship 10, no. 3 (2016): 210–25, https://doi.org/10.1080/19322909.2016.1193455; Alexandra Lesnikowski et al., “Frontiers in Data Analytics for Adaptation Research: Topic Modeling,” WIREs Climate Change 10, no. 3 (2019): e576, https://doi.org/10.1002/wcc.576.

13 Tiziano Piccardi and Robert West, “Crosslingual Topic Modeling with WikiPDA,” in Proceedings of The Web Conference 2021 (WWW ’21), April 19–23, 2021, Ljubljana, Slovenia (ACM, New York), https://doi.org/10.1145/3442381.3449805.
14 Cain, “Using Topic Modeling to Enhance Access to Library Digital Collections,” 210–25; A. Krowne and M. Halbert, “An Initial Evaluation of Automated Organization for Digital Library Browsing,” in JCDL '05: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (June 7–11, 2005): 246–255; David Newman, Kat Hagedorn, and Chaitanya Chemudugunta, “Subject Metadata Enrichment Using Statistical Topic Models,” paper presented at ACM IEEE Joint Conference on Digital Libraries JCDL’07, Vancouver, BC, June 17–22, 2007.

15 Craig Boman, “An Exploration of Machine Learning in Libraries,” ALA Library Technology Report 55, no. 1 (January 2019): 21–25.

16 Julia Silge and David Robinson, Text Mining with R: A Tidy Approach (Sebastopol, California: O’Reilly Media, Inc., 2017), 90.

17 Blei, Ng, and Jordan, “Latent Dirichlet Allocation.”

18 Arlene G. Taylor, Introduction to Cataloging and Classification, 10th ed. (Westport, Connecticut: Libraries Unlimited, 2006), 19–20, 301–14; Arlene G. Taylor and Daniel N. Joudrey, The Organization of Information, 3rd ed. (Westport, Connecticut: Libraries Unlimited, 2009), 303–28.

19 Blei, Ng, and Jordan, “Latent Dirichlet Allocation.”

20 Silge and Robinson, Text Mining with R, 149.

21 Albalawi, Yeap, and Benyoucef, “Using Topic Modeling Methods for Short-Text Data,” 42.

22 The R Project for Statistical Computing, https://www.r-project.org/.

23 Bettina Grün and Kurt Hornik, “topicmodels: An R Package for Fitting Topic Models,” Journal of Statistical Software 40, no. 13 (2011): 1–30, https://doi.org/10.18637/jss.v040.i13.

24 Topic Modeling in R (DataCamp), https://www.datacamp.com/courses/topic-modeling-in-r.

25 Grün and Hornik, “topicmodels.”

26 Topic Modeling in R (DataCamp), chap. 3, https://www.datacamp.com/courses/topic-modeling-in-r.

27 Christopher M. Bishop, Pattern Recognition and Machine Learning (New York, NY: Springer Science + Business Media, 2006), 32–33.

28 Blei, Ng, and Jordan, “Latent Dirichlet Allocation.”

29 Daniel Johnson and Mark Dehmlow, “Digital Exhibits to Digital Humanities: Expanding the Digital Libraries Portfolio,” in New Top Technologies Every Librarian Needs to Know, ed. Kenneth J. Varnum (Chicago: ALA Neal-Schuman, 2019), 124.