Finished Creating a Collection of Carrels
I have finished creating and curating a collection of data sets I call "study carrels". †
As data sets, study carrels are intended to be computed against, and they are akin to "collections as data". On average, each study carrel includes about 100 textual items on a given topic, of a particular genre, or by a given author. They can be analyzed ("read") in a myriad of ways including but not limited to:
- bibliographics
- concordancing
- feature analysis
- full-text indexing
- large-language models
- linked data
- network graph analysis
- semantic indexing
- topic modeling
Through these reading techniques all sorts of research questions can be addressed, and they range from the mundane to the sublime:
- how big is this collection?
- how difficult is it to read the items in this collection?
- what are the most frequent ngrams and named-entities?
- what is discussed, what do those things do, and how are they described?
- what words can be used to denote the aboutness of the collection, and what sentences contain those words?
- what are the latent themes in the collection, and how have those themes ebbed and flowed over time, or across authors?
- how is Penelope in Homer's epics similar and different from the main characters of Jane Austen's works?
- what is the definition of climate change, and how has it been manifested?
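The first and third questions above (size and frequent ngrams) are simple enough to sketch in a few lines of Python. This is only a minimal illustration, not how the carrels themselves are computed: the function names (tokenize, ngrams, summarize) are hypothetical, the tokenizer is deliberately naive, and the two short strings merely stand in for the plain-text items of a carrel.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) found in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def summarize(documents, n=2, top=5):
    """Count words and the most frequent n-grams across a list of strings."""
    total_words = 0
    counts = Counter()
    for document in documents:
        tokens = tokenize(document)
        total_words += len(tokens)
        counts.update(ngrams(tokens, n))
    return total_words, counts.most_common(top)


# Hypothetical stand-in for a carrel: two tiny plain-text "items".
items = [
    "Call me Ishmael. Call me what you will.",
    "You may call me Ishmael.",
]
size, frequent = summarize(items, n=2, top=1)
print(size, frequent)
```

Against a real carrel one would read each item's plain text from disk instead of using literal strings, and would likely want a better tokenizer and a stop word list.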
My next steps are four-fold: 1) describe the collection in greater detail, 2) describe how the collection can be accessed programmatically or through a Web browser, 3) demonstrate how to model carrels, and 4) address big philosophic questions like what is truth, beauty, honor, and justice.
Here are a few fun facts:
- the collection includes 3,000 carrels comprised of 315,000 items for a total of 3.5 billion words
- the largest carrel is on the topic of English literature and is comprised of 72,000,000 words which is equal to 90 Bibles or 280 copies of Moby Dick
- the content of the carrels comes from repositories such as Project Gutenberg, EarlyPrint, a dataset called CORD-19, and journal articles harvested via OAI-PMH
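The averages implied by these figures can be checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are mine, and the per-Bible and per-novel word counts are simply what the stated ratios imply, not independently measured values.

```python
# Figures quoted in the list above.
carrels = 3_000
items = 315_000
total_words = 3_500_000_000
largest_carrel_words = 72_000_000

# Averages across the whole collection.
items_per_carrel = items / carrels        # 105 items, i.e. "about 100"
words_per_item = total_words / items      # roughly 11,000 words per item

# Sizes implied by the comparisons made for the largest carrel.
words_per_bible = largest_carrel_words / 90
words_per_moby_dick = largest_carrel_words / 280

print(f"items per carrel:    {items_per_carrel:,.0f}")
print(f"words per item:      {words_per_item:,.0f}")
print(f"words per Bible:     {words_per_bible:,.0f}")
print(f"words per Moby Dick: {words_per_moby_dick:,.0f}")
```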
Finally, I assert the process of reading something online is inherently different from reading something in an analog form. "Duh!" For example, in an analog form we inherently observe the size of a document and state whether it is long or short. Such is not nearly as easy to do in a digital environment. What are you to do? Measure sizes in bytes? Similarly, analog books include all sorts of tools to assist in the reading process: tables of contents, running chapter headings, page numbers, back-of-the-book indexes, maybe annotations written in the margins, etc. Things like this are poorly manifested in the digital environment. On the other hand, the digital environment does include rudimentary find (control-f). If the process of reading is different in different environments, then we need different tools to do our reading. Heck, I use my glasses to help me read. I read with my pencil in hand. Why not use a computer to help me read? Moreover, traditional -- close -- reading does not scale, but distant reading does. For example, how long would it take you to use traditional reading techniques to outline the characteristics of 280 novel-length things? Study carrels are an attempt to address all of these issues.
Peruse the collection at http://carrels.distantreader.org/
Thank you for listening.
† - All that said, collections are never really finished.
Creator: Eric Lease Morgan <emorgan@nd.edu>
Source: This message was first posted to the Code4Lib mailing list on October 21, 2024
Date created: 2024-10-26
Date updated: 2024-10-26
Subject(s): Distant Reader; study carrels;
URL: https://distantreader.org/blog/collection-of-carrels/