I've begun reading the whole of two library-related journals: 1) College & Research Libraries (CRL), and 2) Information Technology and Libraries (ITAL), and this blurb outlines what I've learned so far.

I began by exploiting OAI-PMH to download the whole of the journals. The corpus includes about 6,100 articles and is 26 million words long. [1] CRL dates from 1938, and ITAL from 1968. [2]
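For readers who have not used OAI-PMH, here is a minimal sketch of the harvesting loop, using only Python's standard library. It is not the script I actually ran. The function names and the base URL passed to harvest() are placeholders, and I am assuming the repository serves the common oai_dc metadata format. Note that OAI-PMH returns metadata records; following the links in those records to fetch full text would be a separate step not shown here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_list_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "date": record.findtext(".//dc:date", default="", namespaces=NS),
        })
    # An absent or empty resumptionToken means this was the last page
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield every record from a repository (base_url is a placeholder)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            records, token = parse_list_records(response.read())
        yield from records
        if token is None:
            break
        # Follow-up requests carry only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": token}
```

The parsing is kept separate from the network loop so it can be tried against a saved response before pointing harvest() at a live repository.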

I then did some topic modeling against the corpus, and for grins, I limited the number of topics to eight. This resulted in the following themes:

     themes  weights                                           features
      books  0.43275  books use materials work subject collections s...
   american  0.27160  american history books index bibliography refe...
   academic  0.26037  academic work true false management change kno...
    faculty  0.21954  faculty state staff academic committee status ...
    catalog  0.15720  catalog data records use systems used search s...
   students  0.14023  students academic study student instruction fa...
   journals  0.11281  journals study use science articles data acade...
       data  0.07789  data digital web users technology search conte...
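
To give a sense of what limiting a model to eight topics means in practice, below is a small, pure-Python sketch of one common topic modeling technique: latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling. This is an illustration only and not necessarily the tool or algorithm that produced the table above. The function name, the hyperparameters (alpha, beta), and the toy documents are all assumptions. The "weight" here is simply each topic's share of all word tokens, which may differ from how the weights above were computed.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=8, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample in proportion to p(topic | doc) * p(word | topic)
                scores = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=scores)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Summarize each topic by its share of all tokens and its top words
    total = sum(topic_total)
    topics = []
    for k in range(num_topics):
        features = [w for w, n in topic_word[k].most_common(5) if n > 0]
        topics.append({"weight": topic_total[k] / total, "features": features})
    return sorted(topics, key=lambda t: t["weight"], reverse=True)


# Toy documents, invented for illustration; a real run would use the corpus
docs = [
    "books collections materials books subject".split(),
    "data digital web search users data".split(),
    "students faculty instruction students study".split(),
]
for topic in lda_gibbs(docs, num_topics=3):
    print(round(topic["weight"], 3), " ".join(topic["features"]))
```

On a corpus of 26 million words, a production toolkit would be far faster than this loop; the sketch is only meant to show where the number of topics and the number of iterations enter the process.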

'Looks about right, if you ask me. I then visualized the results using the ubiquitous pie chart, and again, it looks pretty much like what I expected.
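
A pie chart shows each theme's weight as a fraction of the total, and since the weights above do not sum to one, it may help to see the arithmetic. The sketch below takes the weights straight from the table; the variable names are my own, and the resulting shares could be handed to any plotting library.

```python
# Theme weights copied from the table above
weights = {
    "books": 0.43275,
    "american": 0.27160,
    "academic": 0.26037,
    "faculty": 0.21954,
    "catalog": 0.15720,
    "students": 0.14023,
    "journals": 0.11281,
    "data": 0.07789,
}

# Each slice of the pie is a weight divided by the sum of all weights
total = sum(weights.values())
shares = {theme: 100 * weight / total for theme, weight in weights.items()}

for theme, share in shares.items():
    print(f"{theme:>10}  {share:5.1f}%")
```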


I then augmented the underlying model with date values, pivoted the table, and visualized the result as a stacked area chart. From the results you can see that the theme of "books" was very prevalent until 1967. Then, starting around 2006, the theme of "data" became much more prominent.
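
The pivot step can be sketched as follows, assuming each article has already been reduced to a (year, dominant theme) pair. That reduction is my simplification for illustration, and the rows below are invented placeholders, not results from the corpus. The output is a year-by-theme table of proportions, which is the shape a stacked area chart expects.

```python
from collections import Counter, defaultdict


def pivot_by_year(rows, themes):
    """Turn (year, theme) rows into {year: {theme: share of that year}}."""
    counts = defaultdict(Counter)
    for year, theme in rows:
        counts[year][theme] += 1
    table = {}
    for year in sorted(counts):
        year_total = sum(counts[year].values())
        # Themes absent from a year get a share of 0.0
        table[year] = {t: counts[year][t] / year_total for t in themes}
    return table


# Invented placeholder rows; real rows would come from the dated model
rows = [
    (1940, "books"), (1940, "books"),
    (1970, "books"), (1970, "catalog"),
    (2010, "data"), (2010, "data"), (2010, "students"),
]
themes = ["books", "catalog", "students", "data"]

for year, row in pivot_by_year(rows, themes).items():
    print(year, "  ".join(f"{t}={row[t]:.2f}" for t in themes))
```

Normalizing within each year keeps the stacked areas comparable even when some years contribute many more articles than others.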


Since my corpus is relatively large, and since I iterated the modeling process relatively few times, these results ought to be considered preliminary. Still, it looks about right to me.

Fun with comprehensive, full text collections.

[1] To put this into context, Moby Dick is about 0.25 million words long.
[2] For additional descriptive statistics-like detail about the collection, see: .

Creator: Eric Lease Morgan <>
Source: This posting was originally shared on the Code4Lib Slack channel (November 2, 2022)
Date created: 2022-11-14
Date updated: 2022-11-14
Subject(s): readings;