Distant Reader Indexes

I have begun to integrate indexes into the Distant Reader for the purposes of creating new data sets ("study carrels") and downloading previously created ones. This posting introduces the work done to date.

The Reader creates and provides services against data sets affectionately called "study carrels". To create a study carrel the student, researcher, or scholar assembles a set of files for analysis into a single directory. The directory is used as input to the Reader's build command, and the result is a data set that can be modeled for the purposes of addressing research questions. That said, you would be surprised how difficult it is for people to create a directory filled with files; nowadays we seem to distribute links to splash pages instead of the content itself. Sigh! The Reader's set of indexes is intended to make it easier to create a set of files for analysis and thus demonstrate distant reading.

To date, a small set of indexes has been created. They include:

arXiv - More than 2 million pre-print journal articles, mostly in the areas of physics, astronomy, and computer science (interesting query: "computer science is")
Project Gutenberg electronic texts (ebooks) - About 60,000 books, mostly from the Western canon (interesting query: subject:love AND subject:war)
ITAL - The full run (approximately 800) of a journal called Information Technology & Libraries (interesting query: "libraries are")
Distant Reader study carrels - Approximately 3,000 previously and automatically created study carrels (interesting query: love AND sources:freebo)

The idea behind the indexes is this:

query an index
use the HTML output to sort, filter, and refine the query
export the results as CSV or JSON
import the CSV or JSON files into your favorite analysis program (like a database, spreadsheet, or OpenRefine)
curate the results to make them even more exact
programmatically loop through the results and cache content locally
transform the results into a simple metadata (CSV) file
create a study carrel with the cached content and the metadata file
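
The later steps in the process (loop through the results, cache content locally, and write a simple metadata file) might be sketched in Python as follows. This is only an illustration under stated assumptions, not the Reader's own code: it assumes the exported JSON is a list of records, that each record has a "url" field pointing at the content, and that "author", "title", and "date" fields may or may not be present. The function names, the "item-0001.txt" naming scheme, and the metadata columns (file, author, title, date) are hypothetical choices; adjust them to match what a given index actually exports.

```python
import csv
import json
from pathlib import Path
from urllib.request import urlopen


def load_records(path):
    """Read an exported JSON file; assumed to hold a list of records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def fetch(url):
    """Download a URL and return its content as bytes."""
    with urlopen(url) as response:
        return response.read()


def cache_results(records, directory, fetcher=fetch, extension=".txt"):
    """Cache each record's content locally and write metadata.csv.

    The "url", "author", "title", and "date" field names are assumptions
    about the exported search results, not a documented format.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    rows = []
    for index, record in enumerate(records, start=1):
        url = record.get("url")
        if not url:
            continue  # nothing to download for this record
        try:
            content = fetcher(url)
        except OSError:
            continue  # skip items that cannot be retrieved
        name = f"item-{index:04d}{extension}"
        (directory / name).write_bytes(content)
        rows.append({
            "file": name,
            "author": record.get("author", ""),
            "title": record.get("title", ""),
            "date": record.get("date", ""),
        })
    # Write one metadata row per cached file.
    fields = ["file", "author", "title", "date"]
    with open(directory / "metadata.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Given a curated export saved as, say, results.json, calling cache_results(load_records("results.json"), "my-carrel") would leave a directory of files plus a metadata.csv, which is the sort of input the last step calls for. Records without a URL, and items that fail to download, are silently skipped; a more careful version would log them.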

Yes, at first glance, the process seems complicated, but a whole lot of it can be automated.

"Again, why are you doing this?" Because we continue to drink from the proverbial firehose; because we are suffering from information overload; because typical search results return so many relevant items, and we have few means to really read and analyze all of them. The Distant Reader is intended to address this problem and more.

Finally, software is never done. If it were, then it would be called "hardware". That being the case, the links above will probably break sooner than I desire, but the process will remain the same: 1) search index, 2) refine result, 3) create data set, and ultimately, 4) do analysis -- read.

Creator: Eric Lease Morgan <emorgan@nd.edu>
Source: This is the first publication of this posting.
Date created: 2022-11-25
Date updated: 2022-11-25
Subject(s): Distant Reader; indexes;
URL: https://distantreader.org/blog/indexes/