Analyzing the University of Notre Dame's Theses & Dissertations

For a number of reasons, I spent some time analyzing a subset of the University of Notre Dame's theses & dissertations, and this missive outlines what I learned.

Theses & dissertations metadata

Since 2003 or so, the University's Graduate School has been encouraging and/or mandating Masters and Doctoral students to upload their completed theses & dissertations to the library's institutional repository. Along with digital (PDF) versions of the theses and/or dissertations, each upload is assoicated with bibliographic as well as topical metdata. This metadata includes things like: authors, titles, dates, abstracts, colleges, degrees, disciplines, subjects/keywords, and contributors. If one is on the University's network, then it is possible to use an application programmer interface (API) to query the institutdional repository and acquire the metadata. This is exactly what I did. After normalizing ("cleaning") the metadata, the result is manifested as comma-separated values (CSV) file.

It is now possible to do some analysis against the metadata. For example, since 2003, about 5,200 theses & dissertations have been granted. This means, since 2003, the University has averaged granting about 260 graduate degrees per year. These degrees have been self-classified with dates, types, colleges, and disciplines. The way these things have been applied can be visualized as simple pie charts, as below:

I find the last two visualizations compelling; clearly, more than half, if not two-thirds of the graduate degrees have been in non-humanities fields of study. When I think aboutd the University of Notre Dame, or when people outside of the University think of the University, then I suspect they associate the University with humanities-related fields of study.

Analysis through text minging and natural langauge processing

The metadata, outlined above, combined with the abstracts of each thesis or dissertation makes for a good dataset where text mining and natural language processing can be applied. Such is exactly what was done. I converted the abstracts to files, fed the files and the metadata to a locally developed tool called the Distant Reader, and learned more about the theses & dissertations.

For example, what do graduate students write about? This question can be addressed by counting, tabulating, and visualizing the frequency of unigrams, bigrams, and statistically significant keywords, as per the following word clouds. An interesting observation, at least I think so, are the names of persons extracted from corpus. The names are most definitely heavy on religion and philosophy. Do scientists not mention people's names?

Additional descriptive-like statistics are demonstrated in the automatically generated index page -- index.htm.

Techniques called "feature reduction" and "topic modeling" can be used to compute aboutness as well. When I applied PCA (principle component analysis) against the corpus, the result grossly falls into two categories. We don't know what the categories are, but the ratio is about 2:1, and I believe this echoes ETDs by College pie chart, above:

Topic modeling is an unsupervised machine learning process used to enumerate latent themes in a corpus. Given an integer (T), the algorithm clusters the corpus into T topics, and each topic can connote a theme. Remember, there is no corect value for T. After all what is the "correct" number of topics describing the whole of Shakespeare's works?

That said, I topic modeled with a value of T equal to 16, simply because it seemed to tell a good story and the results were easier to visualize. Thus, I predict the theses and dissertations are grossly about the following topics and fall into the following ratios:

       topics  weights  features
         data  0.25502  data models methods analysis approach systems ...
    political  0.07543  political social religious american cultural s...
        human  0.05952  human theory moral account theology christian ...
     literary  0.05586  literary texts history century works medieval ...
       memory  0.04758  memory research relationship effects social st...
      surface  0.04402  surface properties energy materials molecular ...
      species  0.03578  species effects water environmental change cli...
         flow  0.03256  flow pressure layer boundary surface plasma fi...
        cells  0.03136  cells expression cancer role gene signaling di...
      systems  0.02846  systems performance control energy networks co...
    equations  0.02795  equations theory problem space problems order ...
      binding  0.02742  binding proteins activity biological imaging m...
         bone  0.02731  bone nuclear reaction energy mechanical proper...
        ionic  0.02358  ionic uranyl metal reaction synthesis compound...
      devices  0.02184  devices optical performance power current magn...
         glia  0.00818  glia standard mass retinal surge coastal higgs...

The underlying model can be supplemented with year and college metadata values, pivoted, and visualized. Thus we can address questions such as, "How did the topics ebb & flow over time?", or "What topics dominate what colleges?" In these cases, the topics seem pretty constant over time with some sort of anomaly in 2016. The science-related theses & dissertations where the most diverse, and they emphasized "data", which is really "data models methods analysis approach systems".

Graph analysis

Network graph analysis excels at analyzing the relationships between things. For example, theses & dissertations have authors. Authors are in schools. And schools are a part of colleges. These things can be combined as sets of nodes and edges, measured, and visualized. In this case, we can first see how degrees, schools, and colleges are related. The College of Arts & Letters includes many disciplines, for example. Second, the nodes and edges can be "clustered", in a process similar to topic modeling, and we can see how they fall into four major neighborhoods: doctoral degrees, masters degrees, humanities schools, and sciences schools:

Summary and conclusion

I took it upon myself to analyze the University of Notre Dame theses & dissertations from 30,000 feet. I was surprised at the ratio of science to humanities degrees earned, and I ask myself, "To what degree do the Libraries's collections and services align with the scholarship being done?" I don't really know the answer to this question, and I suppose it depends on the definition of "collections and services". I believe discussing such definitions would be interesting.

Finally, the entire data set -- raw data, tabulations, models, and visualizations -- used to do this analysis can be downloaded from https://distantreader.org/stacks/carrels/etds/index.zip. The resulting files are amenable to a wide number of desktop applications, database programs, and software tools. Download the file and do your own analysis?


Eric Lease Morgan <emorgan@nd.edu>
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame

June 28, 2023