Lexicon Enhancers

This posting describes a number of Python scripts used to enhance a lexicon, where a lexicon is defined as a list of desirable, meaningful words. Compare a lexicon to a stop word list: a stop word list contains words of little use or interest, while a lexicon is a set of words of great significance.

The first script, keywords2lexicon.py, takes the name of a Distant Reader study carrel and an integer (N) as input. It then computes the N most frequent keywords and outputs them to the carrel's etc/lexicon.txt file. This is a decent way to jumpstart a lexicon. Alternatively, create a list of words by hand.
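
The counting step can be sketched in a few lines of Python. This is only an illustration, not the script itself: it assumes the keywords have already been read into a list, the function name top_keywords is hypothetical, and the sample keywords are made up.

```python
from collections import Counter

def top_keywords(keywords, n):
    """Return the n most frequent keywords, most frequent first."""
    # fold case so "Jove" and "jove" are counted together
    counts = Counter(keyword.lower() for keyword in keywords)
    return [keyword for keyword, _ in counts.most_common(n)]

# made-up keywords, one entry per document/keyword pair
keywords = ["jove", "man", "father", "jove", "great", "man", "jove", "ship"]

# one word per line, the shape a lexicon file might take (an assumption)
print("\n".join(top_keywords(keywords, 2)))
```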

Once you have a lexicon, you may want to enhance it, and there are three supported methods:

  1. lexicon2variants.py - given a study carrel, this script will find the lemma of each word in the lexicon, identify the tokens ("words") associated with those lemmas, and send the result to standard output. This is a good way to identify variations in spelling, but only those variations that exist in the carrel's corpus.
  2. lexicon2related.py - given a study carrel and a number (N), this script will loop through the lexicon to identify semantically related words. N can be used in two ways: as a count, to keep the top N most similar words, or as a floating point threshold, to keep the words whose similarity value is greater than N. The script will output the related words. All of this can only be done by first semantically indexing the carrel, and it only really becomes useful if the carrel's size can be measured in millions of words. This technique is a root of the current generative-AI trend.
  3. lexicon2synonyms.py - given a study carrel, this script will loop through the carrel's lexicon and use WordNet to identify synonyms. This technique ought to be seen as complementary to lexicon2related.py because it will introduce words from outside the carrel's corpus.

Used in different orders, with different parameters, and combined with one another, this tiny system of scripts will generate lists of words that may be of interest to the student, researcher, or scholar.

Given a refined lexicon, it is possible to create sophisticated full-text database queries, map the lexicon's words in a networked space, feed the words to a concordance for quick reading, etc. One might even calculate weights of individual documents based on the occurrences of lexicon words. Hmmm...
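
As a thought experiment, the weighting idea might look something like the sketch below. Nothing here is an existing script: the function name weight is hypothetical, and measuring a document by the share of its tokens found in the lexicon is just one assumption about what a "weight" could be.

```python
import re

def weight(text, lexicon):
    """Return the fraction of a document's tokens that appear in the lexicon."""
    # crude tokenization (an assumption): lower-cased runs of letters
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    words = {word.lower() for word in lexicon}
    hits = sum(1 for token in tokens if token in words)
    return hits / len(tokens)

lexicon = ["father", "great", "jove", "man"]
print(weight("Jove spoke", lexicon))
```

Documents could then be sorted by weight, with the heaviest read first.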

As an example, here is a tiny lexicon generated from the list of computed keywords in Homer's Iliad and Odyssey:

father; great; jove; man

Here is a list of these words and their variants found in the corpus:

father; fathers; great; greater; greatest; jove; man; manned; men
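
The lookup behind a list like this one can be sketched as two passes over token/lemma pairs. This is a minimal illustration under assumptions: the pairs are hand-written to mirror the example rather than read from a carrel, and the function name variants is hypothetical.

```python
def variants(lexicon, token_lemma_pairs):
    """Return the lexicon plus every token sharing a lemma with a lexicon word."""
    pairs = [(token.lower(), lemma.lower()) for token, lemma in token_lemma_pairs]
    words = {word.lower() for word in lexicon}
    # pass 1: find the lemmas of the lexicon's words
    lemmas = {lemma for token, lemma in pairs if token in words}
    # pass 2: collect every token in the corpus having one of those lemmas
    found = {token for token, lemma in pairs if lemma in lemmas}
    return sorted(words | found)

# hand-written (token, lemma) pairs; a real carrel would supply its own
pairs = [
    ("father", "father"), ("fathers", "father"),
    ("great", "great"), ("greater", "great"), ("greatest", "great"),
    ("jove", "jove"),
    ("man", "man"), ("men", "man"), ("manned", "man"),
    ("ships", "ship"),
]

print("; ".join(variants(["father", "great", "jove", "man"], pairs)))
```

Because only tokens present in the pairs can be returned, the sketch shares the limitation noted earlier: it finds only the variations that exist in the corpus.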

Here is a list of these words and their semantically related words:

aside; father; great; jove; loud; man; mighty; murderous; redoubtable; sarpedon; shook; thereon
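
The two uses of N, a count or a threshold, can be sketched with cosine similarity over word vectors. Everything here is an assumption made for illustration: the three-number vectors are toys standing in for a real semantic index, the function names are hypothetical, and deciding between count and threshold by checking whether N is an integer is a guess, not a description of how lexicon2related.py works.

```python
import math

def cosine(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related(word, vectors, n):
    """Return words related to word: the top n if n is an int,
    or every word whose similarity is greater than n if n is a float."""
    if word not in vectors:
        return []
    scores = [
        (other, cosine(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    # most similar first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    if isinstance(n, int):
        return [other for other, _ in scores[:n]]
    return [other for other, score in scores if score > n]

# toy vectors; a real index would hold far longer vectors for many more words
vectors = {
    "great": [0.9, 0.1, 0.0],
    "mighty": [0.8, 0.2, 0.1],
    "loud": [0.6, 0.5, 0.1],
    "ship": [0.0, 0.1, 0.9],
}

print(related("great", vectors, 1))    # count: the single most similar word
print(related("great", vectors, 0.8))  # threshold: similarity greater than 0.8
```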

Here is a list of these words and their synonyms:

Church Father; Father; Father-God; Father of the Church; Fatherhood; Isle of Man; Jove; Jupiter; Man; Padre; adult male; bang-up; beget; begetter; beginner; big; bring forth; bully; capital; corking; cracking; dandy; don; enceinte; engender; expectant; father; forefather; founder; founding father; generate; gentleman; gentleman's gentleman; get; gravid; great; groovy; heavy; homo; human; human being; human beings; human race; humanity; humankind; humans; jove; keen; large; majuscule; male parent; man; mankind; military man; military personnel; mother; neat; nifty; not bad; outstanding; peachy; piece; serviceman; sire; slap-up; smashing; swell; valet; valet de chambre; with child; world
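
A synonym lookup of this kind can be sketched with NLTK's WordNet interface. This is an assumption-laden illustration, not the script itself: the function names are hypothetical, NLTK and its WordNet data must be installed for the real lookup to work, and the small stub exists only so the example runs without them.

```python
def wordnet_synonyms(word):
    """Look up the synonyms of a word with NLTK's WordNet interface."""
    # imported here so the rest of the sketch runs without NLTK installed;
    # the WordNet data is also required, e.g. nltk.download("wordnet")
    from nltk.corpus import wordnet
    return [
        lemma.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
    ]

def synonyms(lexicon, lookup=wordnet_synonyms):
    """Return the lexicon plus the synonyms of each word, sorted and de-duplicated."""
    words = set(lexicon)
    for word in lexicon:
        words.update(lookup(word))
    return sorted(words)

# a hand-made stand-in for WordNet, used only for demonstration
stub = {"jove": ["Jove", "Jupiter"]}
print("; ".join(synonyms(["jove"], lookup=lambda word: stub.get(word, []))))
```

Calling synonyms(lexicon) without the lookup argument uses WordNet itself. Since WordNet knows nothing about the carrel, many of the returned words will be off topic, which is why the output needs pruning.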

Finally, and very importantly, the output of these scripts is not intended to be taken whole cloth. Instead, you are expected to get the output, peruse it, and then season it to your own taste. Computers are stupid. You are not.


Creator: Eric Lease Morgan <emorgan@nd.edu>
Source: This is the original publication of this posting.
Date created: 2024-07-12
Date updated: 2024-07-12
Subject(s): hacks; lexicons;
URL: https://distantreader.org/blog/lexicon-enhancers/