Using Machine Learning and Natural Language Processing to Analyze Library Chat Reference Transcripts

ARTICLE

Yongming Wang

INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2022
https://doi.org/10.6017/ital.v41i3.14967

Yongming Wang (wangyo@tcnj.edu) is Systems Librarian, The College of New Jersey. © 2022.

ABSTRACT

The use of artificial intelligence and machine learning has rapidly become standard technology across industries and businesses for gaining insight and predicting the future. In recent years, the library community has begun looking at ways to improve library services by applying AI and machine learning techniques to library data. Chat reference in libraries generates a large amount of data in the form of transcripts. This study uses machine learning and natural language processing methods to analyze one academic library's chat transcripts over a period of eight years. The resulting machine learning model classifies chat questions into reference or nonreference question categories. The purpose is to use the model to predict the category of future questions, with the hope that incoming questions can be channeled to the appropriate library department or staff.

INTRODUCTION

Since the beginning of this century, artificial intelligence (AI) and machine learning (ML) have been used in almost all industries and businesses to gain knowledge and insights and to predict the future. The large amount of data available has helped accelerate the application of AI and ML at stunning speed. To follow this technology trend, the library community has begun looking at ways to improve library services by applying AI and ML techniques to library data. Stanford University Library is one of the pioneers in the research and application of ML and AI in the library.
The mission of its Library AI Initiative states: "The Library AI initiative is a program to identify, design, and enact applications of artificial intelligence that will help us make our rich collections more easily discoverable, accessible, and analyzable."1 In 2019, Stanford University Library hosted the second International Conference on AI for Libraries, Archives, and Museums, titled Fantastic Futures.2

Many academic libraries have implemented chat reference services as a way to support student learning and academic research on campus. Chat reference serves as an important channel connecting the library's resources and services to the campus community.3 Over the years, libraries have accumulated a large amount of data in the form of chat transcripts. Analyzing the content of transcripts can help the library understand users' information needs, deploy library human resources more efficiently, and improve the quality of the chat reference service.

The College of New Jersey library is a midsize academic library that serves a campus of 7,000 students, most of them undergraduates. The library began to use Springshare's LibChat in 2014. The chat service is freely accessible online from the library's website, and anyone can initiate a chat by asking an initial question through the chat box. Approximately 8,000 chat transactions have accumulated over the past eight years.

This study aims to use machine learning and natural language processing (NLP) techniques to build a classification model that categorizes all available questions into two categories: reference and nonreference.
By doing so, we hope the model can automatically classify future chat questions as either reference or nonreference questions and channel each question to the appropriate library department or staff.

LITERATURE REVIEW

Traditionally, the analysis of chat transcripts has used qualitative or simple quantitative methods (e.g., chat frequency, duration). To better understand chat service quality and patrons' information needs, librarians must manually review and read through chat transcripts, which requires a great deal of time and effort.4 In recent years, however, the library field has started to witness the application of AI and ML techniques to analyze library data, including chat transcripts, in order to quickly and efficiently gain more insight into user information needs and information-seeking patterns.

Megan Ozeran and Piper Martin used topic modeling, an ML method, to analyze library chat reference conversations. The purpose of their project was to identify the most popular topics asked by library patrons in order to improve the chat reference service and to train library staff.5 The Brigham Young University library implemented a machine learning–based tool to perform various text analyses on chat reference transcripts to gauge patron satisfaction levels and to classify patrons' questions into several categories.6 Jeremy Walker and Jason Coleman used ML and NLP techniques to build models that predict the relative difficulty of incoming chat reference questions. They tested a large sample of chat transcripts on hundreds of models. Their aim was to help library professionals and management improve chat reference services.7 Another ML topic modeling project was carried out by HyunSeung Koh and Mark Fienup.
Their study applied pLSA (Probabilistic Latent Semantic Analysis) to library chat data over a period of four years, producing topics and subjects that were more accurate and interpretable than those obtained by human qualitative evaluation.8 Another interesting ML project on chat reference data was conducted by Ellie Kohler. This project used a machine learning model to analyze chat transcripts for sentiment and topic extraction.9

In addition to library chat data, ML has also been used to analyze other library data, including library digital collections and library tweet data. Jeremiah Flannery applied NLP summarization techniques to a special library digital collection of Catholic pamphlets. This project tried to automatically generate a summary for each digitized pamphlet by using NLP's BERT Extractive technique and the Gensim Python package.10 Sultan M. Al-Daihani and Alan Abrahams conducted a text mining analysis of academic libraries' tweets. They used a tool called PamTAT, developed by the Pamplin College of Business at Virginia Polytechnic Institute and State University. PamTAT is a Microsoft Excel–based interface to the NLTK package written in Python. The purpose of their analysis was to identify the most common topics or subject keywords in the tweets of 10 large academic libraries. In addition, they ran the Harvard General Inquirer for semantic and sentiment analysis of the tweets.11

Other applications of ML techniques in the academic library include analyzing library operations such as acquisitions. In 2019, Kevin W. Walker and Zhehan Jiang from the University of Alabama used a machine learning method called adaptive boosting (AdaBoost) to predict demand-driven acquisition (DDA).12 Carlos G.
Figuerola, Francisco Javier Garcia Marco, and Maria Pinto used the topic modeling technique, specifically Latent Dirichlet Allocation, to identify the main topics and categories of the 92,705 publications in the domain of library and information science from 1978 to 2014.13

PAIR (Projects in Artificial Intelligence Registry) is a repository and online global directory of AI projects in higher education, maintained by the University of Oklahoma Libraries. The aim of PAIR is to foster cross-institutional collaboration and to support grant activity in the field of artificial intelligence and machine learning in higher education.14

Public libraries have also started to look seriously at the application and impact of AI in the library. Frisco Public Library in Texas has developed a series of applications and programs to help train library staff in AI. The library also developed artificial intelligence maker kits, including the Google AIY Voice Kit, for circulation, and even provides introductory Python lessons to the public.15

BACKGROUND OF NLP AND ML

Natural language processing is a multidisciplinary field that involves linguistics, computer science, and machine learning. Using computer algorithms, NLP builds machine learning models that are applied to large amounts of data in order to make predictions or decisions. The data in NLP is natural language data, that is, data in plain and unstructured textual form in any language.

There are many applications of NLP and ML in business and in people's daily lives. Especially with the popularity of the internet, there has been a tremendous increase in and accumulation of textual data, such as data from social media networks and customer online chat services. Major applications of NLP include sentiment analysis on social media data, topic modeling in digital humanities, text classification, speech recognition, and search box autocorrect and autocompletion. The use cases are countless.
In general, there are two types of ML: supervised learning and unsupervised learning. In supervised learning, the dataset fed to the model is labelled in advance so that the model can learn to classify data or predict outcomes accurately, whereas unsupervised learning uses algorithms that learn patterns from unlabeled or untagged data. Regardless of the type, all ML and NLP projects involve a series of general steps, also called the ML/NLP pipeline:

1. Data collection, which involves obtaining the raw textual data and usually means downloading data from a remote server or service.
2. Data preprocessing, which is necessary for any project, large or small, because raw textual data is unstructured and not ready to be fed to the model for computing. Data preprocessing usually includes removing punctuation, changing all letters to lowercase, tokenization, removing stop words, and stemming or lemmatization.
3. Feature engineering, which is optional but often very useful.
4. Text vectorization, which is the final step before feeding the data to the model. The purpose is to transform the text into numeric values.
5. Model building, evaluation, and optimization, which involves multiple cycles until the optimal or desired results are achieved.
6. Implementation, which is the final step of deploying the model in the real world.

METHODOLOGY

For this ML/NLP project, the raw data came from the chat transcript repository downloaded from Springshare's server. From 2014 to 2021, a total of 8,000 chat reference transactions were logged. These transactions formed the raw dataset for model building and testing in this project. Because of the textual nature of the data, Python was chosen for this project.
The two major Python packages used in the project are NLTK and scikit-learn. NLTK (Natural Language Toolkit) is a suite of libraries and programs for natural language processing of English-language text. NLTK supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities. Scikit-learn is a Python module built on NumPy, SciPy, and Matplotlib. Featuring various classification, regression, and clustering algorithms, including support-vector machines, random forest, gradient boosting, k-means, and DBSCAN, scikit-learn is a simple and efficient tool for predictive data analysis and one of the most popular Python modules for any ML project.

Data Collection

Data collection includes both data gathering and data preparation. Data gathering is the process of downloading the 8,000 initial questions into an Excel file. Data preparation deals with the initial data cleanup, such as removing blank rows. The most important task of data preparation is data labelling. Because this is a supervised-learning ML project, all questions must be labeled by hand as either reference questions (label = Yes) or nonreference questions (label = No). All labeled questions (the dataset) are then fed to the ML model for either training or testing. See table 1 for an example of data after the preparation step.

Table 1. Sample questions with Yes or No labels

Question sequential number | Label | Question
3979 | Yes | Working on an Alumni Reunion presentation. I need to know …
3980 | Yes | would a book with this call number: DS559.8.D7 G68 1991 …
3981 | No | Would a Rutgers student be able to take out a textbook from …
3982 | Yes | Would I be able to find mathematics textbooks by Pearson on …
3983 | No | Would I be able to log in to find an article if I am an alumni of …
3984 | Yes | Would it be possible to help me find a online essay?
3985 | No | Would like to renew: Huguenots [videorecording] / music by …
3986 | Yes | would like to request for a course description catalog from Fall …
3987 | No | Would someone be able to ask room 414 to quiet down please?
3988 | No | would someone be able to come up to floor 3 and tell people to …

Data Preprocessing

Data preprocessing is the first programming step in the pipeline of this ML/NLP project. It transforms the raw data into a more digestible form so that the ML model can perform better and achieve the desired results. One purpose of data preprocessing is to remove insignificant, nonmeaningful words such as "a," "the," and "and," as well as punctuation, from the textual data. Removing nonmeaningful and stop words from the corpus lets the ML model deal only with significant and meaningful words, which produces a better result. It is also necessary to convert all letters to lowercase. While we as humans know that lowercase and uppercase forms of a word have the same meaning, the computer treats them as different words. For example, "cat" and "Cat" are two different words to the computer. Tokenization splits each sentence into a list of individual words, typically using the spaces between words as boundaries. The last step of data preprocessing is stemming or lemmatizing, which reduces a group of related words to a shared base form. In other words, this process explicitly correlates words with similar meanings. For instance, run, running, and runner become "run"; library and libraries become "librari"; goose and geese become "goose."

Feature engineering involves creating a new feature or transforming a current feature. The purpose of feature engineering is to help the model make better predictions.
This step is optional but often very helpful if done right. In this project, a new feature called "question length" was created, based on the assumption that the average length of reference questions is longer than the average length of nonreference questions. If that is the case, the ML model can use this new feature to make better decisions. Figure 1 is a histogram of the question length distribution; reference questions are shown in blue and nonreference questions in yellow.

Figure 2 shows a sample result list following completion of data preprocessing and feature engineering. From left to right, it lists the result after each step. The question length feature (Question_len column) must immediately follow the original question column because it is computed from the original question before any other steps. The Question_lemma column is the result after all preprocessing steps.

Figure 1. Histogram of question length distribution.

Figure 2. Results from data preprocessing and feature engineering.

Text Vectorization

The purpose of text vectorization is to transform text data into numeric data so that the ML algorithms and Python can understand and use it to build a model. The basic idea is to build an n-dimensional vector of numerical features that represents each object. The three most popular text vectorization methods are count vectorization, n-grams vectorization, and TF-IDF vectorization. TF-IDF stands for term frequency–inverse document frequency. Because TF-IDF weights each term by how rare it is across documents, it is often more accurate. Figure 3 shows the result of TF-IDF vectorization.

Figure 3. Result of TF-IDF vectorization.
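The preprocessing, feature engineering, and vectorization steps described above can be sketched in Python with NLTK and scikit-learn. This is a minimal illustration, not the project's actual code: the stop word list here is a tiny subset (the project used NLTK's full English list), the Porter stemmer is one of several stemmers NLTK provides, and the sample questions are adapted from table 1.

```python
import string
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
# Tiny illustrative stop word list; the project used NLTK's full English list.
stop_words = {"a", "an", "the", "and", "to", "of", "be", "i", "by", "would", "able"}

def preprocess(question):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words, stem."""
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stemmer.stem(t) for t in text.split() if t not in stop_words)

# Sample questions adapted from table 1
questions = [
    "Would I be able to find mathematics textbooks by Pearson online?",
    "Would someone be able to ask room 414 to quiet down please?",
]
question_len = [len(q) for q in questions]   # engineered "question length" feature
lemmas = [preprocess(q) for q in questions]  # e.g., the stemmer maps "libraries" to "librari"

# TF-IDF turns each cleaned question into a weighted numeric vector.
vectors = TfidfVectorizer().fit_transform(lemmas)
print(vectors.shape)  # (number of questions, vocabulary size)
```

In the actual project, the question-length column would sit alongside the TF-IDF matrix as an extra numeric feature before model building.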
Model Building, Testing, and Evaluation

The first step of model building is to divide the dataset into two sets, one for model training and one for model testing. Normally 80% of the data is used for training and 20% for testing. After feeding the training data to the model, we feed the testing data to the model as new data and let it predict the Yes or No label based on the patterns learned from the training data. Because the testing data were labeled by humans, their labels are 100 percent accurate. By comparing the labels predicted by the model with the human-assigned labels in the testing data, we can see how the model performs and, if necessary, change the model's parameters.

Scikit-learn contains several ML models. This project used two popular ones: random forest and gradient boosting. The random forest model builds many decision trees and computes them at the same time; the final decision is made by majority vote. Because the trees are built in parallel, it is fast and efficient. The gradient boosting model builds one tree at a time. Each new tree helps correct errors made by previously trained trees, and the model is then boosted (optimized) by reward or penalty. In theory, gradient boosting should yield better results than random forest, but it is slower and consumes more resources.

The confusion matrix was used to evaluate the performance of the two models. Three metrics are derived from the confusion matrix: accuracy, precision, and recall.

Accuracy = (true positives + true negatives) / total
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)

Usually there is a tradeoff between precision and recall.
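The 80/20 split, model fitting, and metric computation described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the synthetic dataset from make_classification stands in for the real TF-IDF vectors and hand-assigned labels, and the model parameters are scikit-learn defaults rather than the study's tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: in the study, X would be the TF-IDF
# vectors (plus the question-length feature) and y the hand-assigned labels
# (1 = reference, 0 = nonreference).
X, y = make_classification(n_samples=1000, n_features=50, random_state=42)

# 80% of the data for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for model in (RandomForestClassifier(random_state=42),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_train, y_train)      # learn patterns from the training data
    y_pred = model.predict(X_test)   # predict labels for the held-out questions
    print(f"{type(model).__name__}: "
          f"accuracy={accuracy_score(y_test, y_pred):.3f} "
          f"precision={precision_score(y_test, y_pred):.3f} "
          f"recall={recall_score(y_test, y_pred):.3f}")
```

The three printed metrics correspond directly to the confusion-matrix formulas above, with the "Yes" (reference) class treated as positive.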
A low recall indicates many false negatives, and a low precision indicates many false positives. In this context, a false negative means the model predicts a reference question as a nonreference question, and a false positive means the model predicts a nonreference question as a reference question. Which is more important for the model to catch, false positives or false negatives? The answer depends on the actual situation. In our case, a false negative is more serious than a false positive because we did not want real reference questions to be predicted as nonreference questions; it was acceptable for nonreference questions to be predicted as reference questions. Therefore, we wanted the fewest possible false negatives, which means the largest possible recall value.

RESULTS AND ANALYSIS

Table 2 lists the results from both models.

Table 2. Results of random forest model and gradient boosting model

Model | Precision | Recall | Accuracy | Fit time | Predict time
Random forest | 0.914 | 0.964 | 0.912 | 2.489 s | 0.150 s
Gradient boosting | 0.904 | 0.948 | 0.894 | 97.786 s | 0.064 s

In general, values above 0.9 (90%) are very good for any of these metrics. Comparing the results, we can see that both models performed well. Nevertheless, the random forest model produced better results than the gradient boosting model on all three metrics. In addition, the fit time of the random forest model was much shorter than that of the gradient boosting model. Even though the predict time of the random forest model was slightly longer, that difference is insignificant. Therefore, the random forest model was chosen as the final model for this project.

CONCLUSION AND FUTURE WORK

In this pilot study, we used NLP and ML classification modeling to divide patrons' chat questions into two categories: reference questions and nonreference questions.
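As a sketch of how such a two-category classifier might be packaged for use, the vectorizer and chosen random forest model can be bundled in a scikit-learn Pipeline and applied to each new incoming question. All data, names, and the route helper here are hypothetical illustrations, not the study's code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy labeled questions standing in for the 8,000 labeled transcripts
questions = [
    "help me find peer reviewed articles on climate change",
    "can you tell the people on floor 3 to be quiet",
    "how do I cite this book in APA style",
    "the printer on the second floor is out of paper",
]
labels = ["reference", "nonreference", "reference", "nonreference"]

# Bundle vectorizer and classifier so raw text goes in and a label comes out
pipeline = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
pipeline.fit(questions, labels)

def route(question):
    """Hypothetical helper: direct a new chat question to the appropriate staff."""
    category = pipeline.predict([question])[0]
    return "reference librarian" if category == "reference" else "support staff"
```

A chat application plugin could call a function like route on each initial question as it arrives.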
The purpose of the model is to predict the category of future questions received through chat so that library staff and professionals can provide faster, more efficient reference services. Two machine learning models were tested: random forest and gradient boosting. After comparing the results of each model, we concluded that the random forest model performed better.

What is the next step after the model is built? A potential use of this model is to implement it as a plugin or feature enhancement for the online chat application. The model can function as a filter that directs incoming questions either to reference librarians, if the question is predicted to be a reference question, or to library staff or graduate student assistants, if the question is predicted to be a nonreference question. This will be especially useful for libraries with busy online chat services.

Further work can be done to make the model multicategory. For example, a multicategory model could go beyond two categories to include categories for information seeking, citation help, printing help, noise complaints, interlibrary loan questions, spam, etc. The model could then send each question to the relevant department or library personnel accordingly.

ENDNOTES

1 "Stanford University Library AI Initiative," Stanford University Library, https://library.stanford.edu/projects/artificial-intelligence.

2 "Fantastic Futures: 2nd International Conference on AI for Libraries, Archives, and Museums" (2019), Stanford University Library, https://library.stanford.edu/projects/fantastic-futures.

3 Christina M. Desai and Stephanie J. Graves, "Cyberspace or Face-to-Face: The Teachable Moment and Changing Reference Mediums," Reference & User Services Quarterly 47, no.
3 (Spring 2008): 242–55, https://www.jstor.org/stable/20864890.

4 Sharon Q. Yang and Heather A. Dalal, "Delivering Virtual Reference Services on the Web: An Investigation into the Current Practice by Academic Libraries," Journal of Academic Librarianship 41, no. 1 (November 2015): 68–86, https://doi.org/10.1016/j.acalib.2014.10.003.

5 Megan Ozeran and Piper Martin, "Good Night, Good Day, Good Luck: Applying Topic Modeling to Chat Reference Transcripts," Information Technology and Libraries 38, no. 2 (June 2019): 49–57, https://doi.org/10.6017/ital.v38i2.10921.

6 Christopher Brousseau, Justin Johnson, and Curtis Thacker, "Machine Learning Based Chat Analysis," Code4Lib Journal, no. 50 (2021), https://journal.code4lib.org/articles/15660.

7 Jeremy Walker and Jason Coleman, "Using Machine Learning to Predict Chat Difficulty," College & Research Libraries 82, no. 5 (2021), https://doi.org/10.5860/crl.82.5.683.

8 HyunSeung Koh and Mark Fienup, "Topic Modeling as a Tool for Analyzing Library Chat Transcripts," Information Technology and Libraries 40, no. 3 (2021), https://doi.org/10.6017/ital.v40i3.13333.

9 Ellie Kohler, "What Do Your Library Chats Say? How to Analyze Webchat Transcripts for Sentiment and Topic Extraction" (17th Annual Brick & Click Libraries Conference, Maryville, Missouri: Northwest Missouri State University, 2017).

10 Jeremiah Flannery, "Using NLP to Generate MARC Summary Fields for Notre Dame's Catholic Pamphlets," International Journal of Librarianship 5, no. 1 (2020): 20–35, https://doi.org/10.23974/ijol.2020.vol5.1.158.

11 Sultan M. Al-Daihani and Alan Abrahams, "A Text Mining Analysis of Academic Libraries' Tweets," The Journal of Academic Librarianship 42, no. 2 (2016): 135–43, https://doi.org/10.1016/j.acalib.2015.12.014.
12 Kevin W. Walker and Zhehan Jiang, "Application of Adaptive Boosting (AdaBoost) in Demand-Driven Acquisition (DDA) Prediction: A Machine-Learning Approach," The Journal of Academic Librarianship 45, no. 3 (2019): 203–12, https://doi.org/10.1016/j.acalib.2019.02.013.

13 Carlos G. Figuerola, Francisco Javier Garcia Marco, and Maria Pinto, "Mapping the Evolution of Library and Information Science (1978–2014) Using Topic Modeling on LISA," Scientometrics 112 (2017): 1507–35, https://doi.org/10.1007/s11192-017-2432-9.

14 "Projects in Artificial Intelligence Registry (PAIR): A Registry for AI Projects in Higher Ed," University of Oklahoma Libraries, https://pair.libraries.ou.edu/.

15 Thomas Finley, "The Democratization of Artificial Intelligence: One Library's Approach," Information Technology and Libraries 38, no. 1 (2019): 8–13, https://doi.org/10.6017/ital.v38i1.10974.