RAG (retrieval-augmented generation)
RAG (retrieval-augmented generation) -- as one way to implement generative AI -- is something easy for us in libraries to get our heads around because the process is very much like the implementation of our discovery systems:
- create content
- index content
- query content
- return response
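To make the four steps concrete, here is a minimal sketch of the recipe in miniature. It is not my implementation. It assumes a toy word-count vector in place of real embeddings, a few short stand-in snippets in place of real content, and hypothetical function names (vectorize, cosine, query, build_prompt). The last step only assembles the prompt one would hand to a language model; it does not call one.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# 1. create content (short stand-in snippets, not whole chapters)
content = {
    "chapter-01.txt": "Emma Woodhouse was handsome, clever, and rich.",
    "chapter-02.txt": "Mr. Weston was a native of Highbury.",
    "chapter-03.txt": "Mr. Woodhouse was fond of society in his own way.",
}

# 2. index content
index = {name: vectorize(text) for name, text in content.items()}


# 3. query content
def query(question, k=1):
    """Return the names of the k items most similar to the question."""
    q = vectorize(question)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]


# 4. return response (here, only the prompt a language model would receive)
def build_prompt(question, k=1):
    context = "\n".join(content[name] for name in query(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Calling build_prompt("Who is rich and clever?") retrieves the first snippet and wraps it, together with the question, in a prompt. Restricting the model to the retrieved context is what keeps answers rooted in the indexed content.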
While such is the basic RAG recipe, here I will outline a way I have implemented it. You can cut to the chase by perusing a temporary README file as well as an advanced chat session about cataloging.
First create a set of content to be indexed. In this case I simply created a directory filled with plain text files -- chapters from Jane Austen's Emma.
Next, index the content using OpenAI's API. (See index.py) The script reads each file in the configured directory and sends it off to OpenAI. OpenAI indexes ("vectorizes") the content, returns the index, and it is cached for future use.
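The read-vectorize-cache pattern described above might look something like the following. This is a sketch under assumptions, not the contents of index.py: the function names, the JSON cache format, and the cache location are all hypothetical, and a word-count function (toy_embed) stands in for the call to OpenAI so the example runs offline.

```python
import json
from collections import Counter
from pathlib import Path


def toy_embed(text):
    """Hypothetical stand-in for a remote embedding call: plain word counts."""
    return dict(Counter(text.lower().split()))


def build_index(directory, cache_file, embed=toy_embed):
    """Vectorize every .txt file in directory, reusing the cache if it exists."""
    cache = Path(cache_file)
    if cache.exists():
        # A previous run already did the work; skip re-vectorizing.
        return json.loads(cache.read_text(encoding="utf-8"))
    index = {}
    for path in sorted(Path(directory).glob("*.txt")):
        index[path.name] = embed(path.read_text(encoding="utf-8"))
    cache.write_text(json.dumps(index), encoding="utf-8")
    return index
```

Passing the embedding function as a parameter keeps the file handling and caching independent of whichever service does the vectorizing. Caching matters most when each call to embed costs time or money.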
Third, you can query the index with chat.py. One can enter just about anything desired, and the results ought to be rooted in the indexed content -- nothing else. Here is a tiny chat:
$ ./chat.py
Ask anything: Who is Emma?
Emma is a young woman who is described as handsome, clever, and rich. She has lived a comfortable life with very little distress or vexation. Emma is the youngest daughter in her family, with an affectionate and indulgent father. She has been the mistress of her father's house since her sister's marriage. Emma had a governess who was like a mother to her, and she had a close relationship with her. Emma is portrayed as having the power to have her own way and a tendency to think highly of herself, though these traits are not seen as significant issues by her at the moment.

Ask anything: In a sentence, what does Emma like to do?
Emma likes to spend time with Frank Churchill, assist him with tasks, and engage in conversations with him.

Ask anything: How old is Frank Chirchill?
Frank Churchill is around three or four-and-twenty years old.

Ask anything: Where does the story take place?
The story takes place in a room where Mrs. Martin and two girls are present, engaging in conversation and reminiscing about past events.

Ask anything: What is a synopisis of the story?
The story follows Emma as she navigates social interactions and relationships in her community. She becomes involved in a situation where her friend Harriet is misled by Frank Churchill's behavior. Emma reflects on her role in encouraging Harriet's feelings and realizes she should have intervened to prevent any misunderstandings. The narrative also delves into Mr. Knightley's suspicions regarding Frank Churchill's intentions towards Emma and Jane Fairfax. Themes of friendship, social dynamics, and self-awareness are explored among the characters.

Ask anything: ^D
Okay, bye bye, and thank you.
The results are not perfect, but what indexing/searching process is?
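The prompt-and-answer loop seen in the transcript can be sketched as follows. This is an assumption about the shape of such a loop, not the code of chat.py. The chat function and its parameters are hypothetical names, and the answering function is passed in so any retrieval back end could be plugged into it.

```python
import io
import sys


def chat(answer, stdin=sys.stdin, stdout=sys.stdout):
    """Prompt until end-of-input (Ctrl-D), printing answer(question) each time."""
    while True:
        stdout.write("Ask anything: ")
        stdout.flush()
        line = stdin.readline()
        if not line:  # an empty string from readline() means end-of-input
            stdout.write("\nOkay, bye bye, and thank you.\n")
            return
        question = line.strip()
        if question:  # ignore blank lines and simply prompt again
            stdout.write(answer(question) + "\n")


# Demonstration with in-memory streams and a placeholder answering function.
demo_out = io.StringIO()
chat(lambda q: f"(an answer to: {q})", io.StringIO("Who is Emma?\n"), demo_out)
```

From a terminal one would call chat(some_answer_function) and let it read standard input. The in-memory streams are used here only so the sketch can run unattended.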
Getting results such as the ones above is nice, but they raise the question, "From where did the answers originate? Show me citations." This problem is easily addressed through the use of metadata. More specifically, when indexing ("vectorizing") one can associate things like authors, titles, dates, genres, types, etc. with indexed items. Then, when results are returned, it is possible to learn which documents were used to formulate responses.
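A small sketch may make the metadata idea concrete. Everything in it is hypothetical: the records, the field names, and the word-overlap scoring that stands in for vector similarity. Each indexed chunk carries its author, title, and year, so whatever is retrieved can be tallied into a reference list. The trailing count per reference is my assumption about one useful thing to report, namely how many retrieved chunks came from each source.

```python
from collections import Counter

# Each indexed chunk keeps its metadata next to its text (made-up records).
chunks = [
    {"text": "catalogers create bibliographic records",
     "title": "Title One", "author": "Author, A.", "year": 2001},
    {"text": "bibliographic records support discovery",
     "title": "Title One", "author": "Author, A.", "year": 2001},
    {"text": "libraries outsource copy cataloging",
     "title": "Title Two", "author": "Author, B.", "year": 2002},
    {"text": "interlibrary loan statistics",
     "title": "Title Three", "author": "Author, C.", "year": 2003},
]


def retrieve(question, k=3):
    """Rank chunks by words shared with the question (toy scoring)."""
    words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(words & set(c["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def references(hits):
    """Tally hits per source, formatted as 'title / author (year) (count)'."""
    tally = Counter((h["title"], h["author"], h["year"]) for h in hits)
    return [
        f"{i}. {title} / {author} ({year}) ({count})"
        for i, ((title, author, year), count) in enumerate(tally.most_common(), 1)
    ]
```

Because the metadata travels with each chunk, the citation step needs no second lookup. It only counts what the retrieval step already returned.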
For example, I collected about 136 journal articles on the topic of cataloging. I then indexed them using OpenAI and created a query interface. Below are snippets from an example chat; notice how each is associated with one or more author/title combinations -- pointers to the articles whence the answers were generated:
What is the role of cataloging in libraries?
Cataloging in libraries involves creating bibliographic records for various resources, including books and online materials, to facilitate the discovery and access of these resources by library users. Catalogers are responsible for organizing and describing library materials, ensuring that they are accurately represented in the library's catalog. The role of cataloging has evolved with the introduction of automation and technological advancements, leading to changes in the tools and resources used by catalogers. Some libraries also outsource certain cataloging functions, such as copy cataloging and cataloging of foreign-language materials. Overall, cataloging plays a crucial role in ensuring that library users can easily find and access the resources they need.
References:
1. Liaising the Catalog: Collaborating Across Library Departments to Promote Successful Discoverability through Enhanced Cataloging / Busch, Tammie (2023.0) (5)
2. Catalogers in Academic Libraries: Their Evolving and Expanding Roles / Buttlar, Lois (1998.0) (3)

How did OCLC effect the practice of library cataloing?
OCLC had a significant impact on the practice of library cataloging. It provided a valuable resource for cataloging data, allowing libraries to find cataloging copy for various types of materials. Many libraries relied on OCLC as their primary source of cataloging copy, which helped improve the efficiency and effectiveness of their cataloging processes. OCLC's data base also played a role in interlibrary loan, preacquisitions verification, and cataloging data. Overall, OCLC's services had a positive effect on library cataloging practices.
References:
1. Liaising the Catalog: Collaborating Across Library Departments to Promote Successful Discoverability through Enhanced Cataloging / Busch, Tammie (2023.0) (3)
2. A Survey on the Outsourcing of Cataloging in Academic Libraries / Libby, Katherine A. (1997.0) (2)
3. The Availability of Cataloging Copy in the OCLC Data Base / Metz, Paul (1980.0) (1)
4. An Overview of the Current State of Linked and Open Data in Cataloging / Ullah, Irfan (2018.0) (1)
5. Bade, David. The Creation and Persistence of Misinformation in Shared Library Catalogs: Language and Subject Knowledge in a Technological Era. Champaign-Urbana, Ill.: Graduate School of Library and Information Science, Univ. of Illinois (Occasional Papers, no. 211), 2002. 33p. $8 (ISBN 087845120X). / Bland, Robert (2002.0) (1)
Finally, processes such as the ones outlined above could be applied to many different types of library content: MARC records, PNX files, the output of OAI-PMH harvests, LibGuides, special collections exhibits, etc. I'm not saying the results are better, but I am saying the ways to query the content are MUCH easier, and the results are MUCH more readable.
Finally, finally, you can temporarily download the whole of these sketches as a single zip file.
Creator: Eric Lease Morgan <emorgan@nd.edu>
Source: I believe I posted this to the Code4Lib mailing list.
Date created: 2024-04-30
Date updated: 2024-04-30
Subject(s): Retrieval-augmented generation;
URL: https://distantreader.org/blog/rag/