January 05, 2021

Musings

Reading texts through the use of network graphs

You shall know a word by the company it keeps. --John Rupert Firth

I am finally getting my brain around the process of reading texts through the use of network graphs.

Words in and of themselves do not carry very much meaning; the connotation of words is lost without context; the meaning and connotation of words only really present themselves when words are used in conjunction with other words. That is why, in the world of natural language processing, things like n-grams, noun phrases, and grammars are so important. Heck, things like topic modelers (such as MALLET) and semantic indexers (such as Word2Vec) assume the co-occurrence of words is indicative of meaning. With this in mind, network graphs can be used to literally illustrate the relationships between words.

As you may or may not know, network graphs are mathematical models composed of "nodes" and "edges". Nodes denote things, and in my world, nodes are usually words or documents. Edges denote the relationships -- measurements -- between nodes. In the work I do, these measurements are usually the distances between words or the percentage a given document is about a given topic. Once the nodes and edges are manifested in a data structure -- usually some sort of matrix -- they can be computed against and ultimately visualized. This is what I have learned how to do.
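
To make the idea more concrete, here is a minimal sketch, using Python and NetworkX, of how a few words and their co-occurrence counts might be manifested as a graph and then as a matrix; the words and counts are made up for the purposes of illustration:

  # sketch: manifest words (nodes) and co-occurrence counts (edges) as a graph
  import networkx as nx

  # hypothetical words and the number of times they appear near each other
  cooccurrences = [ ( 'love', 'war', 3 ), ( 'love', 'honor', 2 ), ( 'war', 'honor', 1 ) ]

  # create a graph; words become nodes and counts become weighted edges
  G = nx.Graph()
  for ( word, other, count ) in cooccurrences : G.add_edge( word, other, weight=count )

  # manifest the graph as an adjacency matrix, and compute against it
  print( nx.to_numpy_array( G, weight='weight' ) )
  print( nx.degree_centrality( G ) )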

Below is a little Python script called "txt2graphml.py". Given a plain text file, one of two normalization functions, and an integer, the script will output a common network graph data structure called "GraphML". The script does its good work through the use of two Python modules, Textacy and NetworkX. The first takes a stream of plain text, parses it into words, normalizes them by finding their lemmas or lower-casing them, and then calculates the number of times each word is in proximity to other words. The normalized words are the nodes, and the proximities are the edges. The second module simply takes the output of the former and serializes it into a GraphML file. The script is relatively tiny; about a third of the code is comments:

  #!/usr/bin/env python

  # txt2graphml.py - given the path to a text file, a normalizer,
  # and the size of a window, output a graphml file

  # Eric Lease Morgan <emorgan@nd.edu>
  # January 4, 2021 - first cut; because of /dev/stdout, will probably break under Windows

  # configure
  MODEL = 'en_core_web_sm'

  # require
  import networkx as nx
  import os
  import spacy
  import sys
  import textacy

  # get input
  if len( sys.argv ) != 4 : sys.exit( "Usage: " + sys.argv[ 0 ] + " <file> <lemma|lower> <window>" )
  file      = sys.argv[ 1 ]
  normalize = sys.argv[ 2 ]
  window    = int( sys.argv[ 3 ] )

  # get the text to process
  text = open( file ).read()

  # create a model and then use it against the text
  size = ( os.stat( file ).st_size ) + 1
  nlp  = spacy.load( MODEL, max_length=size, disable=( 'tagger', 'parser', 'ner', 'textcat' ) )
  doc  = nlp( text )

  # create a graph; the magic happens here
  G = textacy.spacier.doc_extensions.to_semantic_network(
    doc,
    normalize=normalize, 
    nodes='words', 
    edge_weighting='cooc_freq', 
    window_width=window )

  # output the graph and done
  nx.write_graphml( G, '/dev/stdout' )
  exit()
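
Assuming the script is saved locally and marked executable, invoking it might look something like the following, where the input file's name is hypothetical and the resulting GraphML is captured via redirection:

  ./txt2graphml.py posting.txt lemma 2 > posting.graphml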

One can take GraphML files and open them in Gephi, a program intended to render network graphs and provide a means to interact with them. Using Gephi is not easy; it requires practice, and I have been practicing off and on for the past few years. (Geesh!) In any event, I used both txt2graphml.py and Gephi to "read" a few of my recent blog postings, and I believe the results are somewhat illuminating; they illustrate the salient word combinations of each posting. Files. Functions. Tools. Content. Etc. Each "reading" is presented below:

The tools I use to do my application development
The combined use of two tools to create content
The process I'm employing to read the works of Horace

There are many caveats to this whole process. First, the denoting of nodes & edges is not trivial, but txt2graphml.py helps. Second, like many visualization processes, the difficulty of visualization is directly proportional to the amount of given data; it is not possible to illustrate the relationship of every word to every other word unless a person has a really, really, really big piece of paper. Third, as I already said, Gephi is not easy to use; Gephi has so many bells, whistles, and options that it is easy to get overwhelmed. That said, the linked zip file includes sample data, txt2graphml.py, a few GraphML files, and a Gephi project so you can give it a whirl, if you so desire.

Forever and a day we seem to be suffering from information overload. Through the ages different tools have been employed to overcome this problem. The venerable library catalog is an excellent example. My personal goal is to learn how certain big ideas (love, honor, truth, justice, beauty, etc.) have been manifested over time, but the corpus of content describing these things is... overwhelming. The Distant Reader is a system designed to address this problem, and I am now on my way to adding network graphs to its toolbox.

Maybe you can employ similar techniques in the work you do?

January 05, 2021 12:32 AM

January 01, 2021

Musings

The Works of Horace, Bound

The other day I bound the (almost) complete works of Horace.

For whatever reason, I decided to learn a bit about Horace, a Roman poet who lived between 65 and 8 BC. To commence upon this goal I downloaded a transcribed version of Horace's works from Project Gutenberg. I marked up the document in TEI and transformed the resulting XML into an FO (Formatting Objects) file, and then used an FO processor (Apache FOP) to create a PDF file. The PDF file is simple, with only a title page, a table-of-contents, chapters always starting on the right-hand page, and page numbers. What's really important is the pages' margins. They are wide and thus amenable to lots of annotation. I then duplex printed all 400 pages.
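
For the curious, the XML-to-PDF step might be accomplished with a single Apache FOP command along these lines; the file names and the stylesheet are hypothetical:

  fop -xml horace.xml -xsl tei2fo.xsl -pdf horace.pdf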

Four hundred pages (two hundred sheets when duplex printed) is too large to effectively bind. Consequently I divided the works into two parts and bound them. The binding is simple. I started with two boards just less than the size of the paper. I then wrapped the boards with a single large piece of paper, and I covered up the insides with another piece of paper. I then wrapped the book block within the resulting case. Finally, I used a Japanese stab stitch to hold the whole thing together. Repeat for part #2. The results are very strong, very portable, and very functional, as depicted below:

covers
binding

For better or for worse, I seem to practice and enjoy a wide spectrum of library-esque activities. Moreover, sometimes my vocation is also my avocation. Geesh!

P.S. Why are the works (almost) complete? Because the Gutenberg version does not include something called "Carmen Saeculare". I guess you get what you pay for.

January 01, 2021 05:00 AM

December 30, 2020

Musings

How to write in a book

There are two files attached to this blog posting, and together they outline and demonstrate how to write in a book.

The first file -- a thumbnail of which is displayed below -- is a one-page handout literally illustrating the technique I employ to annotate printed documents, such as books or journal articles.

Handout

From the handout:

For the most part, books are containers for data & information, and as such they are not sacred items to be worshiped, but instead things to be actively used. By actively reading (and writing in) books, a person can not only get more out of their reading, but a person can add value to the material as well as enable themselves to review the material quickly... Here is a list of possible techniques to use in an active reading process. Each assumes you have a pencil or pen, and you "draw" symbols to annotate the text:... The symbols listed above are only guidelines. Create your own symbols, but use them sparingly. The goal is to bring out the most salient points, identify declarative sentences, add value, and make judgements, but not diagram each & every item.

The second file is a journal article, "Sexism in the Academy" by Troy Vettese in N+1, Issue 34 (https://nplusonemag.com/issue-34/essays/sexism-in-the-academy/). The file has been "marked-up" with my personal annotations. Give yourself 120 seconds, which is much less time than it would take for you to even print the file. Look over the document, and then ask yourself three questions:

  1. What problem might the author be addressing?
  2. What are some possible solutions to the problem?
  3. What does the reader (me, Eric) think the most important point is?

I'll bet you'll be able to answer the questions in less than two minutes.

"Reading is FUNdemental."

December 30, 2020 05:00 AM

December 29, 2020

Musings

TEI Toolbox, or "How a geeky librarian reads Horace"

tldnr; By marking up documents in XML/TEI, you create sets of well-structured narrative data, and consequently, this enables you to "read" the documents in new & different ways.

Horace, not
Who was Horace and what did he write about? To answer this question, I suppose I could do some sort of Google search and hope for the best. Through an application of my information literacy skills, I suppose I could read an entry about Horace in an encyclopedia, of which I have many. One of those encyclopedias could be Wikipedia, of which I am a fan. Unfortunately, these approaches rely on the judgements of other people, and while other people have more experience & expertise than I do, it is still important for me to make up my own mind. To answer questions -- to educate myself -- I combine the advice of others with personal experience. Thus, the sole use of Google and/or encyclopedias fails me.

To put it another way, in order to answer my question, I ought to read Horace's works. For this librarian, obtaining the complete works of Horace is a trivial task. Search something like Project Gutenberg, the Internet Archive, Google Books, or the HathiTrust. Download the item. Read it in its electronic form, or print it and read it in a more traditional manner. Gasp! I could even borrow a copy from a library or purchase a copy. In the former case, I am not allowed to write in the item, and in the latter case the format may not be amenable to personal annotation. (Don't tell anybody, but I advocate writing in books. I even facilitate workshops on how to systematically do such a thing.)

Obtaining a copy of Horace's works and reading it in a traditional manner is all well and good, but the process is expensive in terms of time, and the process does not easily lend itself to computer assistance. After all, a computer can remember much better than I can. It can process things much faster than I can. And a computer can communicate with other computers much more thoroughly than I can. Thus, this geeky librarian wants to read Horace with the help of a computer.

This is where the TEI Toolbox comes in. The TEI Toolbox is a fledgling system of Bash, Perl, and Python scripts used to create and transform Text Encoding Initiative (TEI) files into other files, and these other files lend themselves to alternative forms of reading. More specifically, given a TEI file, the Toolbox can:

  • validate it
  • parse it into smaller blocks such as chapters and paragraphs, and save the results for later use
  • mark-up each word in each sentence in terms of parts-of-speech; "morphadorn" it
  • transform it into plain text, for other computing purposes
  • transform it into HTML, for online reading
  • transform it into PDF, specifically designed for printing
  • distill its content into a relational (SQLite) database complete with bibliographics, parts-of-speech, and named-entities
  • create a word-embedding (word2vec) database
  • create a (Solr) full-text index complete with parts-of-speech, named-entities, etc.
  • search the totality of the above in any number of different ways
  • compare & contrast documents in any number of different ways

Thus, given a valid TEI file, I can not only print a version of it amenable to traditional reading (and writing in), but I can also explore & navigate a text for the purposes of scholarly investigation. Such is exactly what I am doing with the complete works of Horace.
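
To make the idea more concrete, here is a minimal sketch -- not the Toolbox itself -- of how a TEI file might be parsed into paragraphs using Python and lxml; the file name is hypothetical:

  # sketch: parse a TEI file into paragraphs; not a part of the Toolbox itself
  from lxml import etree

  # point to a (hypothetical) valid TEI file, and denote the TEI namespace
  NAMESPACES = { 'tei' : 'http://www.tei-c.org/ns/1.0' }
  tree = etree.parse( 'horace.xml' )

  # find each paragraph element and output its plain text
  for p in tree.xpath( '//tei:p', namespaces=NAMESPACES ) :
    print( ' '.join( p.itertext() ) )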

My first step was to identify a plain text version of Horace's works, and the version at Project Gutenberg was just fine. Next, I marked up the plain text into valid TEI using a set of Barebones BBEdit macros of my own design. This process was tedious and took me about an hour. I then used my Toolbox's ./bin/carrel-initialize.sh script to create a tiny file system. I then used the ./bin/carrel-build.sh script to perform most of the actions outlined above. This resulted in a set of platform-independent files saved in a directory named "horace", including, for example, plain text, HTML, and PDF versions of the work as well as the database and index files described above.

To date, I have printed the PDF file, and I plan to bind it before the week is out. I will then commence upon reading (and writing in) it in the traditional manner. In the meantime, I have used the Toolbox to index the whole with Solr, and I have queried the resulting index for some of my favorite themes. Consequently, I have gotten a jump start on my reading. What I think is really cool (or "kewl") is how the search results return pointers to the exact locations of the hits in the HTML file. This means I can view the search results within the context of the whole work, like a concordance on steroids. For example, below are sample results for the query "love AND war". Notice how the results are hyperlinked within the complete work:

  1. While you, great Lollius, declaim at Rome...
  2. O thou fountain of Bandusia, clearer than...
  3. When first Greece, her wars being over, b...

Here are some results for "god AND law":

  1. There was a certain freedman, who, an old...
  2. Orpheus, the priest and Interpreter of th...
  3. O ye elder of the youths, though you are ...

And finally, "(man OR men) AND justice":

  1. What shame or bound can there be to our a...
  2. Damasippus is mad for purchasing antique ...
  3. Have you any regard for reputation, which...
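
The queries above were generated through the Toolbox, but an index like this one could also be searched with a few lines of Python. The sketch below assumes the pysolr module and a Solr core named "horace" running on the local host; the URL and core name are merely illustrative:

  # sketch: query a (hypothetical) Solr index of Horace's works
  import pysolr

  # connect to the index; the URL and core name are assumptions
  solr = pysolr.Solr( 'http://localhost:8983/solr/horace' )

  # search for a favorite theme, and output each of the first few hits
  for hit in solr.search( 'love AND war', rows=3 ) : print( hit )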

All of the above only scratches the surface of what is possible with the Toolbox, but the essence of the Toolbox is this: by marking up a document in TEI you transform a narrative text into a set of structured data amenable to computer analysis. From where I sit, the process of marking up a document is a form of close reading. Printing a version of the text and reading (and writing in) it lends itself to additional methods of use & understanding. Finally, by exploiting derivative versions of the text with a computer, even more methods of understanding present themselves. Hopefully, I will share some of those other techniques in future postings. Now, I'm off to my workshop to bind the book, all 400 pages of it...

"Reading is FUNdemental."

December 29, 2020 05:00 AM

December 27, 2020

Musings

Cool hack with wget and xmllint

I'm rather proud of a cool hack I created through the combined use of the venerable utilities wget and xmllint.

Eye Candy by Eric
A few weeks ago I quit WordPress because it was too expensive, and this necessitated the resurrection of my personal TEI publishing system. Much to my satisfaction, the system still works quite well, and it is very gratifying to easily bring back to life an application which is more than a decade old. The system works like this: 1) write content, 2) mark up content in rudimentary TEI, 3) pour content into database, 4) generate valid TEI, 5) transform TEI into HTML, 6) go to Step #1 until satisfied, and finally, 7) create RSS feed.

But since software is never done, the system was lacking. More specifically, when I wrote my publishing system, RSS feeds did not include content, just metadata. Since then an extension element has been added to RSS, specifically one called "content". [2] This namespace allows a publisher to include HTML in their syndication, but with two caveats: 1) only the true content of an HTML file is included in the syndication, meaning nothing from the HTML head element, and 2) no relative URLs are allowed because if they were, then all the URLs would be broken. ("Duh!") Consequently, if I wanted my content to be truly syndicated, then I would need to enhance my RSS feed generator.
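
To make the caveats more concrete, an item in such a feed might look something like the following, where the content namespace is declared on the root element and the posting's HTML (absolute URLs and all) is wrapped in CDATA; the markup is abbreviated for the purposes of illustration:

  <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    ...
    <item>
      <title>My integrated development environment (IDE)</title>
      <link>http://infomotions.com/musings/my-ide/</link>
      <content:encoded><![CDATA[ ...the posting's HTML, sans head element... ]]></content:encoded>
    </item>
    ...
  </rss>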

This is where wget and xmllint make the scene. Given a URL, wget will... get the content at the other end of the URL, and as an added bonus and through the combined use of the -k and -O switches, wget will also transform all relative URLs of a cached HTML file into absolute URLs. [3] Very nice. Thus, Issue #2, above, can be resolved. To resolve Issue #1, I know that my returned HTML is well-formed, and consequently I can extract the desired content through the use of an XPath statement. Given this XPath statement, xmllint can return the desired content. [4] For a good time, I can also use xmllint to reformat the output into a nicely formatted hierarchical structure. Finally, because both of these utilities support I/O through standard input and standard output, they can be glued together with a few tiny (Bash) commands:

# configure
URL="http://infomotions.com/musings/my-ide/"
TMP="/tmp/file.html"
XPATH='/html/body/div/div/div'
# do the work
CONTENT=$( wget -qkO "$TMP" "$URL"; cat "$TMP" | xmllint --xpath "$XPATH" - | xmllint --format - | tail -n +2 )

Very elegant. The final step is/was to translate the Bash commands into Perl code and thus incorporate the hack into my RSS generator. "Voila!"

Again, software is never done, and if it were, then it would be called "hardware"; software requires maintenance, and after a while the maintenance can become more expensive than the development. It is very satisfying when maintenance is so inexpensive compared to development. Jettisoning WordPress was the right thing for me to do, especially considering the costs -- tiny.

December 27, 2020 05:00 AM

December 20, 2020

Musings

My integrated development environment (IDE)

My integrated development environment (IDE) consists of three items: 1) a terminal application (Mac OS X Terminal), 2) a text editor (Barebones's BBEdit), and 3) a file transfer application (Panic's Transmit). I guess it goes without saying, I do all my work on top of Mac OS X.

Mac OS X Terminal
Barebones BBEdit
Panic Transmit

At the very least, I need a terminal application, and Mac OS X's terminal works just fine. Open a connection to my local host, or SSH to a remote host. Use the resulting shell to navigate the file system and execute (that sounds so violent) commands. Increasingly I write Bash scripts to do my work. Given a relatively sane Linux environment, one would be surprised how much functionality can be harnessed with simple shell scripts.

BBEdit is my most frequently used application. Very rarely do I use some sort of word processor to do any of my writing. "Religious wars" are fought over text editors, so I won't belabor my points. BBEdit will open just about any file, and it will easily open files measured in megabytes in size. Its find/replace functions are full-featured. I frequently use its sort function, duplicate line function, remove line breaks function, markup function, and reformat XML and JSON functions. It also supports the creation of macros, knows about my local shell, and can do AppleScript. BBEdit can even be opened from the command line, meaning it can take STDOUT as input. Fun!

While BBEdit supports SFTP, my go-to file transfer application is Transmit. Transmit knows many file transport protocols, not just SFTP. For example, instead of using a Web browser to navigate a Google Drive (dumb), I can mount the drive with Transmit, and the result is much more responsive. Very similar to my terminal, I use it to connect to a remote host, navigate the file system, and then create, move, rename, and delete files. Simple. One of the coolest bits of functionality is the ability to download a text file, have it opened in my editor, and when I save the text file, it is saved on the remote host. Thus, there is little need to know a terminal-based editor like vi, emacs, or nano, but I do use vi or nano every once in a while.

I have never felt the need for a "real" IDE. Too much overhead. No need to set any debugging points nor trace the value of a variable. I don't feel the need for a bazillion windows, panes, nor panels. An IDE feels too much like a shell for my shell. Yet another thing to learn and an obfuscation of what is really going on. This is just my style. There are many different ways to cook an omelet, paint a painting, sing a song, etc. The same holds true for maintaining computers, running software, and writing programs. To each his^h^h^h their own.

December 20, 2020 05:00 AM

December 19, 2020

Mini-musings

Final blog posting

This is probably my final blog posting using the WordPress software, and I hope to pick up posting on Infomotions’ Musings.

WordPress is a significant piece of software, and while its functionality is undeniable, maintaining the software is a constant process. It has become too expensive for me.

Moreover, blog software, such as WordPress, was supposed to enable two additional types of functionality that have not really come to fruition. The first is/was syndication. Blog software was expected to support things like RSS feeds. While blog software does support RSS, people do not seem to create/maintain lists of blogs and RSS feeds for reading. The idea of RSS has not come to fruition in the expected way. Similarly, blogs were expected to support commenting in the form of academic dialog, but that has not really come to fruition either; blog comments are usually terse and do not really foster discussion.

For these reasons, I am foregoing WordPress, and I hope to return to the use of my personal TEI publishing process. I feel as if my personal process will be more long-lasting.

In order to make this transition, I have used a WordPress plug-in called Simply Static. Install the software, play with the settings, create a static site, review the results, and repeat if necessary. The software seems to work pretty well. Also, playing the role of librarian, I have made an effort to classify my blog postings while diminishing the number of items in the "miscellaneous" category.

By converting my blog to a static site and removing WordPress from my site, I feel as if I am making the Infomotions web presence simpler and easier to maintain. Sure, I am losing some functionality, but that loss is smaller than the amount of time, effort, and worry I incur by running software I know too little about.

by Eric Lease Morgan at December 19, 2020 04:02 PM
