Systems in Language: Text Analysis of Government Reports of the Irish Industrial School System with Word Embedding

Susan Leavy, University College Dublin, Ireland, susan.leavy@ucd.ie
Mark T. Keane, University College Dublin, Ireland, mark.keane@ucd.ie
Emilie Pine, University College Dublin, Ireland, emilie.pine@ucd.ie

Published in Digital Scholarship in the Humanities, 34(1): i110-i122 (2019). Oxford University Press. DOI: 10.1093/llc/fqz012. This article has been accepted for publication in Digital Scholarship in the Humanities © 2019 the Authors; published by Oxford University Press. All rights reserved.

Abstract

Industrial Memories is a digital humanities initiative to supplement close readings of a government report with new distant readings, using text analytics techniques. The Ryan Report (2009), the official report of the Commission to Inquire into Child Abuse (CICA), details the systematic abuse of thousands of children from 1936 to 1999 in residential institutions run by religious orders and funded and overseen by the Irish State. Arguably, the sheer size of the Ryan Report (over 1 million words) warrants a new approach that blends close readings to witness its findings, with distant readings that help surface system-wide findings embedded in the Report. Although CICA has been lauded internationally for its work, many have critiqued the narrative form of the Ryan Report for obfuscating key findings and providing poor systemic, statistical summaries that are crucial to evaluating the political and cultural context in which the abuse took place (Keenan, 2013, Child Sexual Abuse and the Catholic Church: Gender, Power, and Organizational Culture. Oxford University Press). In this article, we concentrate on describing the distant reading methodology we adopted, using machine learning and text-analytic methods, and report on what they surfaced from the Report. The contribution of this work is threefold: (i) it shows how text analytics can be used to surface new patterns, summaries and results that were not apparent via close reading, (ii) it demonstrates how machine learning can be used to annotate text by using word embedding to compile domain-specific semantic lexicons for feature extraction, and (iii) it demonstrates how digital humanities methods can be applied to an official state inquiry with social justice impact.

Keywords: text analysis, text classification, machine learning, industrial schools, child abuse

1. Introduction

The Ryan Report (Ryan, 2009) details the findings of the Irish Government's Commission to investigate child abuse in Irish Industrial Schools, run by Catholic Religious Congregations from 1936 to 1999.
The Report provides an extensive catalogue of abuse carried out in these schools and had a major societal impact in Ireland with respect to public attitudes to the moral authority of the Roman Catholic Church (Donnelly and Inglis, 2010; Pilgrim, 2012). However, aspects of its narrative structure have been criticised for obscuring as much as they reveal. The anonymisation of the names of clergy, for instance, has been criticised for protecting the religious orders (Powell et al., 2012), and the structure of the document obscures the systematic nature of the abuse (Pine et al., 2017). This paper reports on the use of text analytics to surface heretofore-invisible underlying patterns, enable a system-wide analysis of the contents of the Report and facilitate new kinds of reading through an interactive web-based platform (https://industrialmemories.ucd.ie/). It presents a distant reading methodology whereby word embedding is used to compile domain-specific semantic lexicons for feature extraction, enabling machine learning classifiers to annotate excerpts of the Ryan Report according to their meaning. In the remainder of this introduction, we identify key shortcomings of the Report (section 1.1), specify our motivation for doing the current work (section 1.2), outline the key themes in the Report (section 1.3) and describe the structure of the remainder of the paper.

1.1 Shortcomings of the Ryan Report

The structure and narrative form of the Ryan Report organises information in a way that impedes a system-wide analysis of abuse in the Irish industrial school system. Preliminary chapters describe the historical background of the school system, the terms of the Commission of Investigation and how various selected sources were used (see the Commission to Inquire into Child Abuse (CICA) Report, Vol. 1.1 to Vol. 1.5, available at: http://www.childabusecommission.ie/rpt/). The main body of the Report is then comprised of a collection of chapters organized by school (see the CICA Report, Vol. 1.6 to Vol. 2.16). Each chapter begins with a historical overview of the school and its management. The narrative then moves to a consideration of the events involving clerics or lay staff in the school, about whom accusations of abuse were made. Due to this segregation of information by school, the descriptions of serial abusers and their movements from school to school are distributed across many chapters. This makes it very difficult, if not impossible, for the reader to build a coherent history of a given individual who may have worked at several schools. Indeed, in the context of 432 members of religious orders spread across 66 chapters, even the most assiduous reader cannot easily connect a given individual's sequence of offenses in any coherent way. This narrative structure obscures the movement of staff between institutions, which was a common response of governmental and congregational bodies to allegations of abuse. This structure also makes institutional comparisons difficult, thus obfuscating the system-wide conditions that allowed abuse to emerge and become endemic. Within the chapters on each school, information is further divided into sections on individual perpetrators, detailing evidence of abuse and the response of the religious orders.
While this approach is consistent with the concept of individual responsibility that is fundamental to a retributive justice system (Hogan and Emler, 1981), such individualised narratives deflect from the complex social phenomena that contribute to the occurrence of abuse (Keenan, 2013).

1.2 Motivation for a Distant Reading of the Ryan Report

The motivation for the current work arose from the difficulties in undertaking a cross-institutional, systemic analysis. Hence, we advance a suite of techniques, using word embedding and text analytics (i.e. text classifiers), to perform distant readings of the document and annotate extracts of text based on their content. We outline a methodology for generating a set of domain-specific keywords (doing query expansion from minimal seed keywords) to compile lexicon-based features for classifiers that can be used in conjunction with other features to identify paragraphs based on their semantic content. This methodology could potentially be used to analyse the content of similarly voluminous reports resulting from other investigations (e.g. Royal Commission Report, Australia, 2017, covering 8,000 witness testimonies; Truth and Reconciliation Commission, Canada, 2015, over 7,000 testimonies).

A central motivation also concerned issues with the application of machine learning techniques in the area of digital humanities. In many digital humanities projects, although corpora are too large to conduct comprehensive close readings, they are often not large enough to employ 'big-data' methodologies such as machine learning, mainly due to the cost of compiling sufficient training data (Schöch, 2013). We addressed this issue by outlining a scalable methodology that enables machine learning to be used for annotation with relatively small training datasets.

1.3 Knowledge Categories

Annotating excerpts of the Ryan Report based on their semantic content enabled the existing narrative of the Report to be deconstructed and its findings to be extracted and read in new ways. The following outlines the thematic foci of this analysis and their relevance to gaining a system-wide understanding of the dynamics of abuse at Irish industrial schools:

Witness testimonies: Extracting the accounts of individual witnesses recorded in the text allows us to collate and examine in detail all of the testimony embedded in the Report. Excerpts of testimony of witnesses in the Ryan Report are most commonly presented in the form of block quotes preceded by a colon, along with introductory text contextualising the source of the quotations. Shorter in-text quotations are identified by quotation marks. The same punctuation is also used to signify extracts from historical documentary sources such as reports and letters, necessitating semantic analysis of the text introducing the quotations.

[Excerpt: Vol. 2.9]

Transfer events: These paragraphs deal with the responses to allegations of abuse in the industrial schools where, typically, the cleric involved was transferred from one institution to another. In some cases the cleric involved was moved out of the schools system (to a parish or a Congregational House), dismissed or granted a dispensation from their vows. Extracting the paragraphs recounting the movement of accused abusers enables us to view the transfer trajectories of specific individuals and to surface patterns of movement between institutions obscured by the linear narrative structure of the Report.

[Excerpt: CICA Vol. 1.7]
Abuse events: These paragraphs detail abusive events (i.e., physical, emotional and sexual abuse) and are crucial to understanding the scale and nature of abuse across the industrial school system. The language used to describe abusive events is complex, reflecting the varied experiences of the 1,090 witnesses who gave evidence of abuse experienced at the industrial schools. Extracting such paragraphs allows us to identify, collate and examine in detail the representations of abuse in the Ryan Report.

[Excerpt: CICA Vol. 1.7]

1.4 Outline of Paper

In the remainder of this paper, we present the techniques we used for a distant reading of the Ryan Report. We review the main collections of research relevant to our concerns in Section 2. We then describe the techniques and present the results of our research. In Section 3, we outline how we used word-embedding methods, specifically word2vec (Mikolov et al., 2013), to carry out feature extraction in order to classify the semantic content of excerpts of the Report. Section 4 describes how these domain-specific semantic lexicons, along with other features, can be used in a suite of classifiers designed to automatically identify particular text items in the Report. This section also reports the results evaluating the effectiveness of these classifiers in detecting the semantic content.

2. Background

The approach we adopted in this project encompasses findings from previous studies in relation to the requirements for a digital platform to enable distant reading. It also builds upon previous approaches to using machine learning to automatically classify text.

2.1 Digital Platforms for Humanities Research

Widlöcher et al. (2015) outlined guidelines for platforms for humanities research, demonstrating how enriching data through annotation, segmentation of documents, statistical analysis and comprehensive search functionality enables distant reading. They also emphasise the importance of retaining structural elements of documents to facilitate close readings. This incorporation of both close and distant reading functionality within an exploratory digital interface was demonstrated in work by Hinrichs et al. (2015) and Kopaczyk (2013). Distant reading through the extraction and exploration of relationships between entities in text is a central function of many platforms (Muralidharan and Hearst, 2013; Vuillemot et al., 2009). Jockers and Mimno (2013) emphasise distant reading using methods such as topic modelling and visualisation. In developing an approach to digitally analysing the Ryan Report, we build on requirements outlined in these related digital humanities projects.

2.2 Annotation in Humanities Research

Analysis of the contents of the Ryan Report involved the automatic classification of paragraphs based on their content. In exploring approaches to annotation in humanities research, it is important to appreciate the central role that manual annotation plays in the critical analysis of text (Jackson, 2002). Researchers gain in-depth knowledge of the corpus through the process of evaluating its meaning and annotating the text. The development of distant reading methods must therefore aim not simply to replace this interpretative stage but to enhance it. Incorporating input from domain experts into the process is key to achieving this, as is ensuring the interpretability of the automation so that the classification process itself may be critically analysed.
This is demonstrated in work by Sweetnam and Fennell (2011), who included input from experts in each stage of their annotation process.

2.3 Automated Annotation

There are two main approaches to automated annotation: rule-based and statistical machine learning. Chiticariu et al. (2013) outline how the data-analytics industry primarily employs rule-based approaches to annotation and information extraction despite major developments in academia in using machine learning. This, they found, is largely due to the fact that rule-based methods are interpretable, can incorporate domain knowledge easily and do not require extensive training data. A comparable situation persists in digital humanities where, despite an abundance of research developing automated methods for annotation, many projects rely on manual annotation of text (Mahlow et al., 2012). This is due in large part to the domain-specificity of the language of many digital humanities corpora and the high levels of accuracy required to produce reliable analysis (Frank et al., 2012; Hampson et al., 2013). Compiling sufficient training data to yield accurate results in this context is often costly and error-prone. To address this, we explored an approach to automated classification that ensures high levels of accuracy with limited training data, while also incorporating domain knowledge and emulating the transparency and interpretability of a rule-based approach.

2.4 Annotation Using Word Embedding and Semantic Text Classifiers

The research on automated annotation most relevant to this project pertains to identifying witness testimony. In the Ryan Report this information is represented as excerpts of reported speech. Our methodology therefore builds upon previous approaches to automatically identifying reported speech in text. This commonly relies on pattern-based extraction rules to detect linguistic markers such as quotation marks (Krestel et al., 2008; Pouliquen et al., 2007; Iosif and Mishra, 2014). However, in the Ryan Report, while punctuation does signal speech, the same punctuation also signals other kinds of text, so semantic information from the paragraphs has to be taken into consideration; this makes research on extracting indirect speech more relevant (Krestel et al., 2008). Using machine learning, Schöch et al. (2016) developed an approach that involves semantic analysis using a lexicon of 81 linguistic features associated with direct speech, derived from a corpus of 18th-century French literature. These were used as features to train a classifier, yielding an accuracy of 84.4 percent. Weiser and Watrin (2012) developed a dictionary of verbs that introduce speech in text (reporting verbs) and used this in conjunction with pattern-based extraction rules to annotate indirect speech.

Machine learning approaches to text classification commonly use a bag-of-words approach to feature selection. However, this approach is problematic when instances to be classified are short, giving rise to over-fitting (Brooks et al., 2013). A lexicon-based approach to feature selection can address this, but encounters new issues concerning the domain specificity of some corpora. Existing lexical databases such as WordNet (Miller, 1995) have been used to generate lists of synonyms from seed words to compile semantic lexicons (Argamon et al., 2007). However, they often do not recognise terms specific to particular domains, such as the ecclesiastical discourse used in the Ryan Report.
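For illustration, a minimal sketch of this kind of WordNet-based expansion is given below, using NLTK's WordNet interface. The helper name and the example seed word are ours rather than part of the project's code; the sketch simply shows how general synonyms are gathered from seed words, and why such general-purpose lists fall short for domain-specific terms.

```python
# A minimal sketch, assuming NLTK and its WordNet data (nltk.download('wordnet')),
# of expanding a seed word into general synonyms. Domain-specific equivalences in
# the Ryan Report (e.g. "dispensation" for "dismissal") are not recovered this way.
from nltk.corpus import wordnet as wn

def wordnet_synonyms(seed_word):
    """Collect lemma names from every WordNet synset of the seed word."""
    synonyms = set()
    for synset in wn.synsets(seed_word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' ').lower())
    return synonyms

print(sorted(wordnet_synonyms('said')))  # general synonyms only, no domain terms
```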
Our project therefore required a methodology that used machine learning with lexicon-based features that take account of the specific terms used in the Ryan Report.

In compiling domain-specific lexicons for feature extraction we called upon work by Mikolov et al. (2013), who developed word2vec, a word-embedding algorithm. Word2vec is a set of neural network models that produces distributed representations of words from text that reflect many aspects of their meanings. It implements the distributional semantics notion that the "meaning of a word can be determined by the company it keeps" (see also Latent Semantic Analysis as a related technique, Dumais, 2004; Landauer, 2006; and similar methods in Turney and Pantel, 2010). This technique analyses word co-occurrence over large corpora, representing a given word by a large vector derived from the other words it is found alongside. Using these vectors, one can then establish that two words are "similar" or synonymous by virtue of their vectors being the same or close in a multi-dimensional space. Mikolov et al.'s (2013) work provides a method for uncovering word similarities that are tailored to the language of the Ryan Report. This word-embedding technique was used by Chanen (2016) to identify synonyms and compile lexicons for feature extraction, in order to account for the multiplicity of terms used to refer to the same semantic concept in a corpus of flight incident reports. Their method of identifying semantically related terms involved generating 20 word2vec ensembles, extracting terms that re-occurred across the ensembles and manually filtering antonyms and semantically dissimilar words. Given the domain-specificity of the language in the Ryan Report, this approach suggests a useful way to compile lexicons that are specific to the language of the religious, industrial-school and legal worlds of the Ryan Report.

3. Distant Reading the Ryan Report: Methodology

A central aim of this project was to provide a methodology for identifying the semantic content of text in the Ryan Report and extracting given categories of information. The semantic categories identified were the testimony of witnesses included in the Report (witness testimony), details of the transfer of clergy from school to school (transfer events) and descriptions of abusive events (abuse events). Machine learning classification was used to annotate the text based on domain-specific semantic lexicons along with other features. In order to generate these domain-specific lexicons, word embedding was used to find terms in the Ryan Report that were semantically similar to a given set of seed words; this task can be cast as a type of query expansion or feature extraction.

These text-analytic methods for paragraph identification were extensions to our construction of a digital platform involving an exploratory web interface and database into which the significant parts of the Ryan Report were processed (the platform was developed using the Django framework, https://djangoproject.com). The core record in the database of this web-based system stored each paragraph from the relevant chapters in the Ryan Report. These paragraph records were then linked to other tables detailing actors in the Report (witnesses, clerics, officials), the Schools, the Congregations and time periods. Named actors were extracted using NLTK (Bird et al., 2009) and other information was identified using a rule-based approach. The web interface also had a string-search facility for the paragraphs along with filters for other categories of entity (e.g., one could search on a single school or a diocese).
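As a rough illustration of the kind of named-actor extraction just described, the sketch below uses NLTK's standard tokeniser, tagger and ne_chunk pipeline. The paper does not specify the exact extraction code used in the project, so the function, its name and the example sentence are illustrative assumptions only.

```python
# A minimal sketch, assuming NLTK with its standard models downloaded
# (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words), of pulling
# named actors from a paragraph; not the project's actual extraction code.
import nltk

def extract_named_actors(paragraph):
    """Return (entity text, entity type) pairs found in a paragraph of the Report."""
    actors = []
    for sentence in nltk.sent_tokenize(paragraph):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for subtree in nltk.ne_chunk(tagged):
            if hasattr(subtree, 'label'):  # named-entity subtrees only
                name = ' '.join(token for token, _ in subtree.leaves())
                actors.append((name, subtree.label()))
    return actors

# Hypothetical example sentence, not text from the Report.
print(extract_named_actors("Brother Smith was transferred from Artane to Letterfrack in 1965."))
```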
In the remainder of this section, we report on the other aspects of the methodology we developed to permit automated paragraph identification.

3.1 Method

The Corpus & Paragraph Categories

In the Ryan Report, 22 of the chapters detail events at each school. This dataset, comprising 6,839 paragraphs and 597,651 words, was the corpus annotated according to its semantic content. Each paragraph is a definite unit of analysis in the Report, as paragraphs are systematically numbered and tend to focus on particular events and issues. The characteristic features of each semantic category are outlined in the following subsections and summarised in Table 1.

3.2 Feature Extraction Techniques: Using Word Embedding

Using machine learning with lexicon-based features can address the issue of over-fitting of classification models when instances of text to be classified are short, as is the case with paragraphs in the Ryan Report (see section 2.4). However, the language of the Ryan Report is domain-specific and general thesauri would not identify concepts such as "dispensation" as being synonymous with "dismissal". Hence, we used the word2vec algorithm (Mikolov et al., 2013), supplemented by synonyms generated from WordNet, to find semantically related words from a set of seed keywords, building on the methodology outlined by Chanen (2016). To compile the semantic lexicons, five word2vec ensembles were generated from the seed words. The top 30 words were extracted from each ensemble, the set of words common to every ensemble was identified, and the results were then reviewed by a domain expert to verify their validity as synonyms within the context of the Ryan Report. Using this method, many non-obvious synonyms were found.

General synonyms were collected using the WordNet lexical database. This involved entering each seed word and compiling a list of synonyms from the results of a search in WordNet. The resulting lists were verified manually to ensure they were appropriate synonyms for the context of the Ryan Report.

Seed words were manually selected based on initial readings of the texts. In the case of paragraphs detailing transfers and direct speech, initial seed words were straightforward to compile, as terms such as 'transfer' and 'said' were commonly used in the Report. However, in the case of descriptions of abuse, the language varied widely. A support vector machine learning algorithm was used in this case to generate a classification model from 100 example paragraphs, based on a bag-of-words feature set. Analysis of the support vectors highlighted the words that best distinguished paragraphs describing abuse, and the highest-ranking of these were used as seed words. The domain-specific semantic lexicons that resulted from this word embedding procedure are detailed in the next section.

3.3 Feature Extraction Techniques: Lexicon-Based Features

Domain-specific semantic lexicons were supplemented with other, less domain-specific features. Verbs introducing reported speech, colons or quotation marks signal witness testimony in the Ryan Report. Punctuation such as commas, question marks and word contractions seemed to be used more frequently in testimony, which was also expressed in the first person. This information was therefore included as features to classify excerpts of direct speech in the Ryan Report (Table 1).
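As a concrete illustration of the ensemble procedure described in Section 3.2 above, the following sketch trains several word2vec models and intersects the top-30 neighbours of the seed words. It assumes gensim and a pre-tokenised list of sentences from the Report; the function name and parameter values are illustrative rather than those used in the project.

```python
# Sketch of the ensemble synonym-discovery step (Section 3.2), under assumptions:
# `sentences` is a list of token lists from the Report, and gensim is installed.
from gensim.models import Word2Vec

def ensemble_candidates(sentences, seed_words, n_models=5, topn=30):
    """Return candidate synonyms appearing in the top-n neighbours of every model."""
    neighbour_sets = []
    for run in range(n_models):
        # Each run trains a separate embedding model on the Report text.
        model = Word2Vec(sentences, vector_size=100, window=5,
                         min_count=5, workers=4, seed=run)
        neighbours = set()
        for seed in seed_words:
            if seed in model.wv:
                neighbours.update(w for w, _ in model.wv.most_similar(seed, topn=topn))
        neighbour_sets.append(neighbours)
    # Only terms recurring across every run are kept; these candidates are then
    # reviewed by a domain expert, who removes antonyms and unrelated words.
    return set.intersection(*neighbour_sets)

# e.g. ensemble_candidates(sentences, ['transfer', 'dismiss', 'sack'])
```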
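The seed-word selection for abuse descriptions (also Section 3.2) can be sketched in a similar spirit: fit a linear SVM on bag-of-words features for a small labelled sample and rank the most discriminative words. The paper does not name a toolkit for this step, so scikit-learn is assumed here, coefficient ranking stands in for the support-vector analysis described, and the example paragraphs are placeholders rather than text from the Report.

```python
# Illustrative sketch only: ranking words by the weights of a linear SVM trained
# on a small labelled sample, as an approximation of the analysis in Section 3.2.
# Assumes a recent scikit-learn; the data below are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

paragraphs = ["He was beaten with a leather strap and felt humiliated.",
              "The school farm supplied milk and vegetables to the kitchen."]
labels = [1, 0]  # 1 = describes abuse, 0 = does not

vectoriser = CountVectorizer()
X = vectoriser.fit_transform(paragraphs)
svm = LinearSVC().fit(X, labels)

# Words with the largest positive weights best distinguish the abuse paragraphs;
# the top-ranked of these would be reviewed and used as seed words.
ranked = sorted(zip(svm.coef_[0], vectoriser.get_feature_names_out()), reverse=True)
print(ranked[:6])
```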
A lexicon was also manually generated in order to filter out excerpts from written reports and letters; this lexicon included the terms visitation, visitor, report, letter, wrote and written. 'Visitations', for instance, is the term used for inspections of industrial schools carried out by the Church. Various combinations of all features were examined to identify the optimal feature set.

Generally, the person being transferred is named in a paragraph detailing such an event. Similarly, in describing abuse, the perpetrator is commonly named. Names were therefore included as features for both of these semantic categories. In describing transfer events, the names of the institutions were often mentioned, and the events often seemed to be described in sections which concerned abuse or which named the alleged perpetrator. This information was therefore included as features for classification.

Witness Testimony: reporting verbs (domain-specific semantic lexicon); pronouns; punctuation
Transfer Events: transfer terms (domain-specific semantic lexicon); section heading references to types of abuse; mentions of religious actors; mentions of institutions
Descriptions of Abuse: abuse terms (domain-specific semantic lexicon); mentions of religious actors
Table 1: Feature Sets Extracted from the Ryan Report

3.4 Classifiers Used for Paragraph Identification

Separate classifiers were built for each of the paragraph categories. Using training data for each paragraph type, features were extracted based on the semantic lexicons generated using word embedding and WordNet, building a feature vector for each paragraph based on frequency counts. A random-forest classifier (Breiman, 2001) was then trained, using the Weka toolkit (Holmes et al., 1994), to find the relative weightings of features that predicted the content-class for given paragraphs. The random-forest algorithm was chosen because, as an ensemble learner that creates a 'forest' of decision trees by randomly sampling from the training set, it is well suited to learning from smaller datasets (Polikar, 2006).

3.5 Training Data

Sample paragraphs belonging to each paragraph category were manually selected from the Ryan Report as training data for the classifiers. In order to address the issue of the cost of compiling training data in digital humanities projects (Frank et al., 2012; Hampson et al., 2013), minimising the number of examples required was a guiding principle. The training data consisted of 25 paragraphs detailing transfer events, 150 paragraphs containing direct speech and 100 paragraphs describing abuse. The variance in the numbers of training examples reflected the volume of instances in the Report itself and the cost of compiling training data. Positive examples of each case were selected from across the Report to capture the variety within the category. Negative examples were compiled through a random selection and manual verification of paragraphs.

3.6 Validation & Evaluation

Preliminary testing of the classifiers was done using 10-fold cross-validation. These metrics indicated the most effective combination of features, which was subsequently evaluated on a sample taken from a larger set of unseen data. The sample of unseen data was made up of 600 randomly selected paragraphs from the Report. For transfer paragraphs, given the low number of training examples (25 positive and 50 negative), an interim evaluation stage was conducted by applying the classification model to a balanced set of 200 examples of unseen data from the Report to further verify the optimal combination of features for classification.
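To make the classification pipeline of Sections 3.3 to 3.5 concrete, the sketch below builds lexicon frequency-count feature vectors for paragraphs and trains a random-forest classifier. The project itself used the Weka toolkit; scikit-learn is substituted here purely for illustration, and the lexicon contents, helper names and training paragraphs are placeholders.

```python
# Illustrative sketch, not the project's Weka implementation: paragraphs are
# represented by frequency counts over domain-specific lexicons and classified
# with a random forest. Lexicon contents and example paragraphs are placeholders.
import re
from sklearn.ensemble import RandomForestClassifier

LEXICONS = {
    'transfer_terms': {'transferred', 'dismissed', 'dispensation', 'removed', 'sacked'},
    'religious_actors': {'br', 'fr', 'sr', 'brother', 'father', 'sister'},
}

def lexicon_features(paragraph):
    """Count how many tokens of the paragraph fall within each lexicon."""
    tokens = re.findall(r"[a-z']+", paragraph.lower())
    return [sum(tok in lexicon for tok in tokens) for lexicon in LEXICONS.values()]

# Hypothetical labelled training paragraphs: 1 = transfer event, 0 = other.
train_paragraphs = ["Br X was transferred to another institution after the complaint.",
                    "The school was founded in 1870 and accommodated 150 boys."]
train_labels = [1, 0]

X = [lexicon_features(p) for p in train_paragraphs]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, train_labels)

print(clf.predict([lexicon_features("Fr Y was removed from the school and dismissed.")]))
```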
4 Results & Discussion

The results showed that using word embedding to generate semantic lexicons for feature extraction is effective in yielding high accuracy where the language of a corpus is domain-specific and the volume of training data is limited. This allowed the integration of the semantic annotations in an online search tool for the Report (Fig. 1). In the following sections we outline the results of using word embedding to generate semantic lexicons and then report on the effectiveness of the classifiers.

Figure 1: Search Interface for Ryan Report

4.1 Domain-specific Semantic Lexicons: Using Word Embedding

Semantic lexicons for each category of text were generated from an initial set of seed words derived from readings of the Report. In the case of witness testimony, the seed words were the reporting verbs "said", "told" and "explained". A reading of the Report suggested some obvious key terms to describe the transfer of staff from school to school: "transfer", "dismiss" and "sack". Seed words for abusive events were uncovered through analysis of the support vectors in a model generated by a support vector machine learning algorithm based on the words in a sample of 100 positive and negative paragraphs (see section 3.2 for details of this approach). This showed that terms distinguishing paragraphs describing abuse from the remainder of the Report formed five semantic categories: perpetrator, abusive actions, body parts, emotions engendered in the victims and implements used in the abuse. The highest-ranking support vectors from these word types were selected as seed words to form the semantic lexicon: abuse, beaten, raped, arms, humiliation, implement.

The word lists generated from running the word2vec algorithm on the full text of the Ryan Report are detailed in Table 2, which lists the common terms among the top-30 words across the five word-embedding ensembles generated for each seed word. After the manual verification step, they were supplemented by general synonyms of each seed word generated from a search of the WordNet lexical database.
Witness Testimony
  Seed terms: said, told, explained
  Word embedding: accepted, acknowledged, added, admitted, advised, agreed, alleged, angry, answered, asked, asking, asserted, assured, believed, called, claimed, commented, complained, conceded, concluded, confessed, confirmed, convinced, denied, describes, explained, felt, guarantee, heard, informed, insisted, knew, described, learned, mentioned, presumed, protested, questioned, realised, recalled, recollection, recounted, relieved, remarked, remember, remembered, replied, requested, said, saw, saying, says, screams, stated, stating, suggested, surmised, tells, thinks, thought, told, warned, witnessed, reported
  WordNet: apologise, apology, articulate, articulated, assure, assured, condone, condoned, enounce, enounced, explicate, explicated, express, expressed, narrate, narrated, pardon, pardoned, posit, posited, recite, recited, recount, recounted, said, state, stated, submit, submitted, tell, told, verbalise, verbalised

Transfer Events
  Seed terms: transfer, dismiss, sack
  Word embedding: application, applied, apply, appointed, appointment, arrival, arrived, arriving, assigned, attended, committed, continued, converted, decision, departure, discharge, discharged, dismiss, dismissal, dismissed, dispensation, dispensed, entered, expelled, leaving, move, moved, position, posted, posting, proposal, referring, release, relieved, removal, remove, removed, replaced, request, resignation, resigned, returned, returning, sacked, sanction, seek, sending, sent, served, stayed, suspended, transferred, withdraw, withdrawal
  WordNet: transferred, transfer, moved, remove, dismiss, dismissed, sacked

Descriptions of Abuse
  Seed terms: abuse, beaten, raped, arms, humiliation, implement
  Word embedding: lexicons pertaining to parts of the body, abusive actions, emotion engendered in the victims and implements of abuse

Table 2: Domain-specific Semantic Lexicons

4.2 Classifier Results

The results showed that the semantic lexicons generated using word embedding played a key role in producing accurate classifiers using limited training data. In classifying abuse paragraphs, the words in the semantic lexicons were the sole features used. For transfer paragraphs, the semantic lexicon denoting transfer events featured in each of the combinations yielding the highest classification results. Results for classifying witness testimony were also highest when the semantic lexicon of reporting verbs was used as a feature set. However, as was expected, features based on punctuation such as colons were also important in identifying this category of paragraph.

Classification: Witness Testimonies

In classifying paragraphs containing witness testimony, the model using a combination of all feature sets gained the highest accuracy in 10-fold cross-validation (Table 3). Most combinations of features were well balanced between precision and recall.

Feature Sets | Precision | Recall | F-Measure | Accuracy (%)
Reporting Verbs, Punctuation, Personal Pronouns | .928 | .927 | .927 | 93
Punctuation, Pronouns | .920 | .920 | .920 | 92
Punctuation, Reporting Verbs | .918 | .917 | .917 | 92
Pronouns | .914 | .913 | .913 | 91
Pronouns, Reporting Verbs | .910 | .910 | .910 | 91
Punctuation | .882 | .867 | .865 | 86
Reporting Verbs | .823 | .813 | .812 | 81
Table 3: Results of 10-fold cross-validation for Witness Testimony Classification

The best performing model, as indicated by the 10-fold cross-validation, was then run on the remainder of the Report.
Based on a random sample of 600 paragraphs, an accuracy of 87 percent was achieved (Table 4).

Feature Sets | Precision | Recall | F-Measure | Accuracy (%)
Reporting Verbs, Writing, Punctuation, Personal Pronouns | .685 | .766 | .723 | 87
Table 4: Accuracy on sample of 600 Paragraphs for Witness Testimony Classification

Error analysis showed that false negative results were primarily due to in-text quotations of short phrases. There were no instances of larger blocks of quotations being missed by the classifiers. The rate of false positives was relatively high, primarily due to the misclassification of letters, extracts from inspection reports and diary entries. However, in many cases the source of such content can be challenging to decipher even on reading the Report.

Classification: Transfer Events

When all features were included in the classifier to automatically detect paragraphs detailing the transfers of religious throughout the industrial school system, 88 percent accuracy was gained based on 10-fold cross-validation (Table 5). However, when named entities were excluded as feature sets, accuracy increased to 94%. This counter-intuitive result was verified further by applying the three best performing models to 200 unseen paragraphs consisting of a 50-50 balance between positive and negative examples. This showed that on a balanced set of unseen data, using all features yielded the best results (Table 6).

Feature Sets | Precision | Recall | F-Measure | Accuracy (%)
Transfer terms, section heading info, mentions of school | .941 | .940 | .940 | 94
Transfer terms, section heading info | .941 | .940 | .940 | 94
Transfer terms, mentions of religious actors, section heading info, mentions of school | .882 | .880 | .880 | 88
Transfer terms, mentions of religious actors, section heading info | .880 | .880 | .880 | 88
Transfer terms, mentions of religious actors | .865 | .860 | .859 | 86
Section heading info, mentions of school | .865 | .860 | .859 | 86
Transfer terms, mentions of religious actors, mentions of school | .842 | .840 | .840 | 84
Transfer terms, mentions of school | .825 | .820 | .819 | 82
Mentions of religious actors, section heading info | .821 | .820 | .820 | 82
Mentions of religious actors | .818 | .800 | .797 | 80
Transfer terms | .818 | .800 | .797 | 80
Mentions of religious actors, section heading info, mentions of school | .750 | .740 | .737 | 74
Mentions of religious actors, mentions of school | .744 | .740 | .739 | 74
Section heading info | .720 | .720 | .720 | 72
Mentions of school | .601 | .600 | .599 | 60
Table 5: Results of 10-fold cross-validation for Transfer Events Classification

The test set of 200 sample paragraphs comprised paragraphs that were distinctly positive and negative examples of transfer paragraphs. For this reason, a higher level of accuracy would be expected than on the rest of the Report, where the language can often be more vague.

Feature Sets | Precision | Recall | F-Measure | Accuracy (%)
Transfer Terms, Section Heading Info, Mentions of School | .937 | .804 | .865 | 89
Transfer Terms, Section Heading Info | .913 | .816 | .862 | 86
Transfer Terms, Mentions of Religious Actors, Section Heading Info, Mentions of School | .966 | .832 | .894 | 90
Table 6: Transfer Events Accuracy on Balanced Sample of 200 Paragraphs

The final phase of evaluation for paragraphs pertaining to transfer events involved applying the best performing model, from the results on the balanced set of 200 paragraphs, to the remainder of the Report and manually examining the classification of 600 randomly sampled paragraphs (Table 7). These results showed high levels of recall.
However, there were quite a few false positive results, leading to relatively low levels of precision.

Feature Sets | Precision | Recall | F-Measure | Accuracy (%)
Transfer Terms, Section Heading Info, Mentions of School | .514 | .900 | .655 | 93
Table 7: Transfer Events Accuracy on Random Sample of 600 Paragraphs

Error analysis showed that paragraphs that were falsely categorised as being about the transfer of clergy actually pertained to the transfer of children. However, some false positive results raised potentially new questions regarding the transfer of children throughout the industrial school system as a response of the congregations to allegations of abuse:

[Excerpt: CICA Vol. 1.9]

Classification: Abuse Events

The best performing model for identifying paragraphs describing abuse in 10-fold cross-validation used two of the semantic categories along with the names of the alleged perpetrator (Table 8). The domain-specific semantic lexicons that were most useful included references to the emotions engendered in the victims and references to abusive actions.

Feature Sets | Precision | Recall | F-Measure | Accuracy (%)
Action, emotion, mentions of religious actors | .958 | .958 | .958 | 95.7
Emotion, implement, action, mentions of religious actors | .953 | .953 | .953 | 95.3
Action, implement, mentions of religious actors | .953 | .953 | .953 | 95.2
Action, implement, emotion | .948 | .948 | .948 | 94.8
Implement, emotion, mentions of religious actors | .948 | .948 | .948 | 94.8
Emotion, implement, body, action, mentions of religious actors | .943 | .943 | .943 | 94.3
Body, action, emotion, mentions of religious actors | .939 | .939 | .939 | 93.8
Body, action, implement, mentions of religious actors | .934 | .934 | .934 | 93.4
Body, implement, mentions of religious actors | .906 | .906 | .906 | 90.5
Body, action, implement | .904 | .901 | .901 | 90.1
Actor, body, emotion | .901 | .901 | .901 | 90.0
Emotion, implement, body, action | .884 | .882 | .882 | 88.2
Body, implement, emotion | .816 | .816 | .816 | 81.6
Body, action, mentions of religious actors | .939 | .939 | .939 | 93.8
Body, action, emotion | .881 | .877 | .877 | 87.0
Table 8: Results of 10-fold cross-validation for Abuse Events Classification

The classification model was then run on the remainder of the Report and a random sample of 600 paragraphs was manually verified, yielding an overall accuracy of 82 percent (Table 9).

Feature Sets | Precision | Recall | F-Measure | Accuracy (%)
Action, emotion, mentions of religious actors | .395 | .816 | .532 | 81.8
Table 9: Descriptions of Abuse Tested on Random Sample of 600 Paragraphs

While precision was reduced, reflecting the complexity of the language, recall was high. Error analysis showed that false positives uncovered a similarity in the language used to describe the emotional experience of victims of abuse and some memories of young clergy when they first took up positions in the schools.

5 Conclusions

This research demonstrates how distant reading methodologies can deconstruct an official state report narrative to enable new kinds of analysis of institutional child abuse. Automatic annotation of excerpts of the Report based on the meaning of the text enabled a more focussed close reading of these identified paragraphs, surfacing significant new patterns of events and language in the institutional system (Pine et al., 2017). These insights were previously obscured by the legal constraints on, and narrative form of, the Ryan Report, which emphasised an in-depth case-by-case study in lieu of system-wide analysis.
The feasibility of using machine learning to annotate text for digital humanities projects can be enhanced by using word embedding for feature extraction. The cost of compiling training data and the domain specificity of the text of many projects can often be a barrier to using machine learning approaches to annotation. This research demonstrates how word embedding can be used to compile context-specific semantic lexicons as a method for extracting features for text classifiers to perform automated annotations of text. This is an innovative methodology, building on an approach outlined by Chanen (2016). High accuracy was achieved using a minimal set of training examples, with features based on semantic lexicons generated from the entire dataset.

There have been numerous international state investigations into the abuse of children. Wright et al. (2017) documented 40 historical child abuse inquiries to date, each of which resulted in lengthy reports detailing their findings. In using automated methods to enable distant reading of the Ryan Report, this project presents an approach whereby key information may be extracted and restructured to facilitate a system-wide analysis of the findings of such investigations.

Acknowledgements

This research is part of the Industrial Memories project funded by the Irish Research Council under New Horizons 2015.

References

Argamon, S., Whitelaw, C., Chase, P., Hota, S.R., Garg, N. and Levitan, S., 2007. Stylistic text classification using functional lexical features. Journal of the Association for Information Science and Technology, 58(6), pp.802-822.

Bird, S., Klein, E. and Loper, E., 2009. Natural Language Processing with Python. Sebastopol, CA: O'Reilly Media.

Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp.5-32.

Brooks, M., Kuksenok, K., Torkildson, M.K., Perry, D., Robinson, J.J., Scott, T.J., Anicello, O., Zukowski, A., Harris, P. and Aragon, C.R., 2013. Statistical affect detection in collaborative chat. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (pp. 317-328). ACM.

Chanen, A., 2016. Deep learning for extracting word-level meaning from safety report narratives. In Integrated Communications Navigation and Surveillance (ICNS), 2016 (pp. 5D2-1). IEEE.

Chiticariu, L., Li, Y. and Reiss, F.R., 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In Proceedings of EMNLP 2013 (pp. 827-832).

Django Software Foundation, 2016. Django (Version 1.9.6) [Computer software]. Retrieved from https://djangoproject.com.

Donnelly, S. and Inglis, T., 2010. The media and the Catholic Church in Ireland: Reporting clerical child sex abuse. Journal of Contemporary Religion, 25(1), pp.1-19.

Dumais, S.T., 2004. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), pp.188-230.

Frank, A., Bögel, T., Hellwig, O. and Reiter, N., 2012. Semantic annotation for the digital humanities. Linguistic Issues in Language Technology, 7(1), pp.1-21.

Hampson, C., Munnelly, G., Bailey, E., Lawless, S. and Conlan, O., 2013. Improving user control and transparency in the digital humanities. In Culture and Computing (Culture Computing), 2013 International Conference on (pp. 196-197). IEEE.

Hogan, R. and Emler, N.P., 1981. Retributive justice. In The Justice Motive in Social Behavior: Adapting to Times of Scarcity and Change, pp.125-143.

Holmes, G., Donkin, A. and Witten, I.H.,
1994. Weka: A machine learning workbench. In Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems (pp. 357-361). IEEE.

Iosif, E. and Mishra, T., 2014. From Speaker Identification to Affective Analysis: A Multi-Step System for Analyzing Children's Stories. In CLfL@EACL (pp. 40-49).

Jackson, H.J., 2002. Marginalia: Readers Writing in Books. Yale University Press.

Jockers, M.L. and Mimno, D., 2013. Significant themes in 19th-century literature. Poetics, 41(6), pp.750-769.

Keenan, M., 2013. Child Sexual Abuse and the Catholic Church: Gender, Power, and Organizational Culture. Oxford University Press.

Krestel, R., Bergler, S. and Witte, R., 2008. Minding the source: Automatic tagging of reported speech in newspaper articles. Reporter, 1(5), p.4.

Landauer, T.K., 2006. Latent Semantic Analysis. John Wiley & Sons, Ltd.

Mahlow, C., Grün, C., Holupirek, A. and Scholl, M.H., 2012. A framework for retrieval and annotation in digital humanities using XQuery full text and update in BaseX. In Proceedings of the 2012 ACM Symposium on Document Engineering (pp. 195-204). ACM.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).

Miller, G.A., 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11), pp.39-41.

Muralidharan, A.S. and Hearst, M.A., 2014. Improving the Recognizability of Syntactic Relations Using Contextualized Examples. In ACL (2) (pp. 272-277).

Pilgrim, D., 2012. Child abuse in Irish Catholic settings: A non-reductionist account. Child Abuse Review, 21(6), pp.405-413.

Pine, E., Leavy, S. and Keane, M.T., 2017. Re-reading the Ryan Report: Witnessing via Close and Distant Reading. Éire-Ireland, 52(1), pp.198-215.

Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3), pp.21-45.

Pouliquen, B., Steinberger, R. and Best, C., 2007. Automatic detection of quotations in multilingual news. In Proceedings of Recent Advances in Natural Language Processing (pp. 487-492).

Powell, F., Geoghegan, M., Scanlon, M. and Swirak, K., 2012. The Irish charity myth, child abuse and human rights: Contextualising the Ryan report into care institutions. British Journal of Social Work, 43(1), pp.7-23.

Ryan, S., 2009. Commission to Inquire into Child Abuse Report (Volumes I-V). Dublin: Stationery Office. Available at: http://www.childabusecommission.ie/rpt/.

Schöch, C., 2013. Big? Smart? Clean? Messy? Data in the humanities. Journal of Digital Humanities, 2(3), pp.2-13.

Schöch, C., Schlör, D., Popp, S., Brunner, A., Henny, U. and Tello, J.C., 2016. Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-Century French Novels. In Digital Humanities 2016: Conference Abstracts (pp. 346-353).

Sweetnam, M.S. and Fennell, B.A., 2011. Natural language processing and early-modern dirty data: applying IBM Languageware to the 1641 depositions. Literary and Linguistic Computing, 27(1), pp.39-54.

Turney, P.D. and Pantel, P., 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, pp.141-188.

Vuillemot, R., Clement, T., Plaisant, C. and Kumar, A., 2009. What's being said near "Martha"? Exploring named entities in literary text collections. In Visual Analytics Science and Technology, 2009.
VAST 2009. IEEE Symposium on (pp. 107-114). IEEE.

Weiser, S. and Watrin, P., 2012. Extraction of unmarked quotations in newspapers. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012).

Widlöcher, A., Bechet, N., Lecarpentier, J.M., Mathet, Y. and Roger, J., 2015. Combining Advanced Information Retrieval and Text-Mining for Digital Humanities. In Proceedings of the 2015 ACM Symposium on Document Engineering (pp. 157-166). ACM.

Wright, K., Swain, S. and Sköld, J., 2017. The Age of Inquiry: A Global Mapping of Institutional Abuse Inquiries. Melbourne: La Trobe University. DOI: http://doi.org/10.4225/22/591e1e3a36139