

LlamaIndex Sketches

This directory contains a set of files used to: 1) index content, and 2) query the index. The files are sketches against a Python framework called LlamaIndex, which is designed to exploit large language models. [1] In the end, this directory is intended to demonstrate yet another way to model text for the purposes of use & understanding.


Setup

First, to set up the files in this directory, you MUST have an OpenAI API key. I'm not sure, but I think you can get one for free. From the OpenAI home page, create an account and then generate a key. [2, 3] The resulting key will look something like this:

  sk-Y8bnK8DasdfygmJdEDMLTtWABwzBCUX

Next, you MUST create an environment variable called OPENAI_API_KEY containing the value of your key. On Linux/Macintosh computers you can enter the following command from the terminal:

  export OPENAI_API_KEY=sk-Y8bnK8DasdfygmJdEDMLTtWABwzBCUX
  
On Windows you can do the same thing from the Command Prompt like this:

  set OPENAI_API_KEY=sk-Y8bnK8DasdfygmJdEDMLTtWABwzBCUX
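
If you want to confirm the variable is visible to Python before indexing, a small check along the following lines may help. This is only a sketch and not part of the distribution; the function name key_status is made up for illustration, and the "sk-" test is an assumption based solely on the sample key shown above.

```python
import os


def key_status(environ=os.environ):
    """Return a short message describing whether OPENAI_API_KEY looks usable."""
    key = environ.get("OPENAI_API_KEY", "")
    if not key:
        return "OPENAI_API_KEY is not set"
    # Assumption: keys begin with "sk-", as in the sample key above.
    if not key.startswith("sk-"):
        return "OPENAI_API_KEY is set but does not start with 'sk-'"
    return "OPENAI_API_KEY looks okay"


if __name__ == "__main__":
    print(key_status())
```

The check only looks at the shape of the value; it does not contact OpenAI, so a well-formed but revoked key will still pass.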


Index

To interact with the files you will need to place content in the folder/directory called "corpus". This distribution comes pre-installed with the chapters from Jane Austen's Emma. Alternatively, you can put your own plain text or PDF files into the corpus directory.

Next, you need to actually index the files; run ./bin/index.py. Initially the program will probably croak because you have not installed the necessary Python modules. Do so now, and repeat this step. If all goes well, zero output will be returned, and a bunch o' files will appear in the folder/directory called "index". These files include snippets of the original content as well as vectors ("embeddings") denoting where those snippets exist in an n-dimensional space.
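
To make the idea of snippets and vectors a bit more concrete, here is a toy sketch using nothing but the Python standard library. It is NOT what ./bin/index.py does: simple word counts stand in for the learned embeddings, the three snippets are made-up sentences, and every function name is invented for illustration. The point is only that each snippet becomes a vector, and a query is answered by finding the snippets whose vectors lie closest to the query's vector.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(snippets):
    """Pair each snippet with its vector; roughly what an index stores."""
    return [(snippet, vectorize(snippet)) for snippet in snippets]


def search(index, query, k=1):
    """Return the k snippets whose vectors are most similar to the query's."""
    q = vectorize(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [snippet for snippet, _ in ranked[:k]]


# Made-up sentences standing in for snippets of a corpus.
snippets = [
    "Emma spent the morning at home with her father",
    "Mr. Knightley walked over from the abbey",
    "Frank Churchill came to Highbury",
]
index = build_index(snippets)
print(search(index, "Who came to Highbury?"))
```

Real embeddings differ in that snippets with similar meanings, not just shared words, end up near each other, but the nearest-neighbor lookup is the same in spirit.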


Chat

Once content has been indexed, you can interact with the index; run ./bin/chat.py. The index files will be loaded into RAM, and you will be prompted to enter a query. You can enter anything, but try to enter detailed and grammatically correct sentences/questions. If all goes well, the system will return responses. Repeat this step using words and/or phrases found in the responses. Repeat this whole process for about five minutes. Such will exercise your brain, and I'm sure you will go away with a better understanding of the indexed corpus.
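
The prompt-and-respond cycle described above can be sketched as a simple loop. Again, this is an illustration under assumptions and not the contents of ./bin/chat.py: the echo function is a hypothetical stand-in for a real query against the index, and only the prompt and farewell text are taken from the example interaction in this README.

```python
def chat(answer, read=input, write=print):
    """Prompt for queries until interrupted; return how many were answered."""
    count = 0
    try:
        while True:
            query = read("Ask anything: ").strip()
            if not query:
                continue  # ignore empty input and prompt again
            write(answer(query))
            count += 1
    except (KeyboardInterrupt, EOFError):
        # Control-C (or end of input) ends the session politely.
        write("Okay, bye bye, and thank you.")
    return count


def echo(query):
    """Hypothetical stand-in for a real query against the index."""
    return "You asked: " + query


if __name__ == "__main__":
    chat(echo)
```

Passing the answer function in as an argument keeps the loop independent of whatever actually does the querying, so the stand-in could be swapped for a call into the index.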

Compare this process to the process of traditional reading; if you were to spend five minutes using the traditional reading process against the whole of Emma, do you believe you would acquire the same level of understanding? Don't get me wrong; traditional reading and distant reading complement each other.


Example

Here is an example interaction with the system:

  $ ./bin/index.py
  $ ./bin/chat.py

  Ask anything: What are the over-arching themes in the story?

  The over-arching themes in the story include chance encounters and
  unexpected connections, the power of first impressions, and the
  complexities of human emotions.


  Ask anything: Tell me about the human emotions.

  The characters in the given context go through a range of human
  emotions. These emotions include surprise, disbelief, happiness,
  amusement, indifference, attachment, eagerness, regret, affection,
  tiredness, concern, reflection, doubt, envy, misery, and longing. The
  characters' emotional states are shaped by their situations and
  interactions with each other.


  Ask anything: What are the names of some of the people who are surprised,
  eager, indifferent, or concerned?

  In the given context, there are several people who display different
  emotions. These individuals include Mr. Weston, who is eager and
  concerned, Emma, who is indifferent, Mr. Woodhouse, who is concerned,
  Harriet, who is frightened, Frank Churchill, who is concerned and eager,
  Miss Bates, who is concerned, Mrs. Goddard, who is concerned, Mr.
  Knightley, who is concerned, and Miss Fairfax, who is indifferent.


  Ask anything: Tell me about Frank Churchill.

  Frank Churchill is mentioned in the given context. He is described as
  someone who has the ability to leave the Churchills and is often seen at
  various social events and locations. There are suggestions that he
  enjoys pleasure and is not always consistent in his actions. However,
  his behavior is also seen as a result of his circumstances and the
  expectations placed upon him. Overall, Frank Churchill is a character
  who is of interest to the other characters in the context.


  Ask anything: What are the relationships between Emma and Frank? 

  Emma and Frank Churchill have a friendly relationship. They enjoy each
  other's company and have conversations about various topics. They are
  seen walking together and spending time together at social events.
  However, there is no indication of a romantic relationship between them
  in the given context.


  Ask anything: Is there a moral to the story, and if so, then what is it?

  The story in the given context does not explicitly convey a moral.


  Ask anything: ^C            
  Okay, bye bye, and thank you.


Extra credit

For extra credit, delete all the files in the corpus directory and replace them with your own. For example, do some sort of literature search, download about a dozen PDF files from the results, save them in the corpus directory, and lastly, index them. Next, use the chat program to interact with the newly created index. Alternatively, download a few files from Project Gutenberg -- say, all of the things written by Mark Twain -- and "read" them too. [4]


Librarianship

These sketches exploit some of the latest developments in natural language processing and machine learning, but more importantly, these sketches can be seen as an opportunity for librarianship.

Here in libraries we collect, organize, archive, and disseminate data, information, and knowledge. But with the advent of computers and the Internet, such processes can be done sans libraries and librarians. Yet, libraries and librarians still have vast quantities of data, information, and knowledge at their disposal. Moreover, the associated cataloging ("metadata") is all but indispensable when it comes to thorough analysis. Merely enabling people to search & get library content is not as useful as it once was. On the other hand, if libraries and librarians were to not only curate collections but also provide services akin to the services in this sketch, then libraries and librarians would both evolve and enable people to comprehend, to a greater degree, the things they need to read.


Links

[1] LlamaIndex - https://www.llamaindex.ai
[2] OpenAI home page - https://openai.com
[3] generate an OpenAI API key - https://platform.openai.com/account/api-keys
[4] Project Gutenberg - https://gutenberg.org

--
Eric Lease Morgan <emorgan@nd.edu>
Navari Family Center for Digital Scholarship
University of Notre Dame

January 2, 2024

