Presenting the Bangor Autoglosser and the Bangor Automated Clause-Splitter

D. M. Carter
The University of British Columbia, Okanagan campus, Kelowna, British Columbia, Canada; Centre for Research on Bilingualism, Bangor University, Gwynedd, Wales

M. Broersma
Centre for Language Studies, Radboud University, Nijmegen, The Netherlands; Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands

K. Donnelly
Centre for Research on Bilingualism, Bangor University, Gwynedd, Wales

A. Konopka
University of Aberdeen, Aberdeen, Scotland

Abstract
Until recently, corpus studies of natural bilingual speech and, more specifically,
codeswitching in bilingual speech have used a manual method of glossing, part-
of-speech tagging, and clause-splitting to prepare the data for analysis. In our
article, we present innovative tools developed for the first large-scale corpus study
of codeswitching triggered by cognates. A study of this size was only possible due
to the automation of several steps, such as morpheme-by-morpheme glossing,
splitting complex clauses into simple clauses, and the analysis of internal and
external codeswitching through the use of database tables, algorithms, and a
scripting language.

Correspondence: D. M. Carter, Faculty of Creative and Critical Studies, Department of Critical Studies, CCS 349, University of British Columbia, Okanagan campus, 1148 Research Road, Kelowna, BC V1V 1V7, Canada.
E-mail: diana.carter@ubc.ca

Digital Scholarship in the Humanities, Vol. 33, No. 1, 2018. © The Author 2017. Published by Oxford University
Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com

doi:10.1093/llc/fqw065 Advance Access published on 27 February 2017

Downloaded from https://academic.oup.com/dsh/article-abstract/33/1/21/3057458
by MPI Psycholinguistics user
on 17 April 2018

1 Introduction

One of the main challenges faced by researchers who
study natural bilingual speech is the amount of time
needed to collect, transcribe, and prepare the corpus
data before any type of linguistic or sociolinguistic
analysis can take place. For instance, previous analyses
of codeswitching patterns found specifically in
the Welsh–English Siarad corpus1 utilized in our
study relied on manual morpheme-by-morpheme
glossing, clause-splitting (i.e. splitting complex
clauses into simple clauses), and data preparation
(Carter et al., 2011; Davies and Deuchar, 2010;
Herring et al., 2010). The manual data preparation
involved processes such as determining a main language
and an embedded language for each bilingual
simple clause (see Section 5 for details on the Matrix
Language Frame model; Myers-Scotton, 1993,
2002). The result was a slow process that limited
the number of clauses included in each analysis
to between a few hundred and a few thousand.
One of the goals of our current study was to
devise more efficient automated tools and techniques
that would allow us to analyze all of the
65,000 clauses in the Siarad corpus in a much
shorter amount of time.

In our article, we present the methodology and
innovative tools that were essential to our study of
codeswitching in the Welsh–English Siarad corpus of
spontaneous bilingual speech. We believe that these
tools will facilitate several steps in the analysis of
monolingual and bilingual corpora. For instance, the
Bangor Autoglosser can be utilized to automatically
gloss corpora that include languages with small
speaker populations, given that tagging systems are
often unavailable for languages with fewer than five
million speakers. The Bangor Automated Clause-
Splitter can be a helpful tool for any researcher who
needs to divide complex clauses into smaller clauses
for analysis and may be used for other languages in
addition to Welsh, such as Spanish, for example.

Previous work has successfully used automated
tools to predict codeswitching in corpora
(Papalexakis et al., 2014; Solorio and Liu, 2008).
In the present study, by contrast, we analyze
actual occurrences of codeswitching. Specifically,
our study employed automated methods with the
aim of analyzing both internal codeswitches (two
languages used within the same clause) and external
codeswitches (switches extending over the clause
boundary) triggered by cognates (Clyne, 1967,
2003). Clyne (2003) defines cognates, or trigger
words, as proper nouns, bilingual homophones,
and lexical transfers (items from one language that
have become part of the lexicon of the speaker’s
second language), and typically the default assump-
tion is that cognates are nouns. However, in our
study, we extended the definition to include all
word types that overlap in form and meaning in
the bilingual’s two languages. Essentially, Clyne’s
triggering hypothesis proposes that cognates facili-
tate codeswitching, an effect that is the result of the
selection of the cognate from the mental lexicon
(Broersma and De Bot, 2006; Broersma, 2009). It
is argued that cognates may be strongly connected
in the mental lexicon and that their conceptual rep-
resentations are more closely connected than those
of non-cognates. Therefore, the activation of a word
that is shared by two languages may lead to a change
in activation of both languages at the lexical level.
This in turn may ‘boost’ the least active language to
the extent that the next time a lemma is selected it
may be one from the boosted language instead of
the previously spoken language.

As in the Welsh–English studies mentioned
above, previous work on the triggering hypothesis
was also performed manually and required
over 250 h to tag and analyze small corpora of
2,000–3,000 words (Broersma and De Bot, 2006; Broersma,
2009). However, through the implementation of the
Bangor Autoglosser, the Bangor Automated Clause-
Splitter, as well as database tables, algorithms, and a
scripting language, we were able to successfully ana-
lyze almost 450,000 words in 65,000 clauses. In the
following sections, we first describe the collection
and transcription of the Siarad corpus and then,
crucially, the autoglossing and clause-splitting pro-
cesses, and final data preparation.

2 Data collection

Here we describe the method followed to collect the
large Welsh–English corpus used in our analysis of
triggered codeswitching. The Welsh–English Siarad
corpus consists of 447,353 words from 151 speakers
across sixty-nine conversations. The corpus was collected
over a 2-year period in Wales by bilingual
Welsh–English researchers who were local members
of the community (Deuchar et al., 2014). The par-
ticipants were recruited through a variety of means,
such as newspaper announcements, and the ‘friend
of a friend’ approach (Milroy, 1987). The speakers
were told that the aim of the study was to record
people having bilingual conversations with another
bilingual friend or family member.

Conversations lasted between 19 and 64 min,
with a mean length of 35 min, and were recorded
using a Marantz hard disk recorder (Carter et al.,
2016; Deuchar et al., 2014). Researchers were not
present at the time of the recording, and the par-
ticipants could discuss any topic of their choice. The
speakers were left alone to minimize the observer’s
paradox, which occurs as the result of having an
interviewer or researcher present during the record-
ings (Labov, 1972). It was important that the par-
ticipants felt comfortable and unhindered given that
informal situations, rather than formal interviews,
are more likely to elicit natural bilingual speech.

After the recordings were finished, the partici-
pants were asked to complete a self-assessed back-
ground questionnaire consisting of twenty
questions. The questionnaire elicited a wide range
of information, such as the participants’ age, gender,
occupation, language of education, age of exposure
to each language, language of social network, lan-
guage proficiency, and attitudes toward codeswitch-
ing. The anonymity of the participants was
protected, so other researchers interested
in studying social variables can also access
the questionnaire data without compromising anonymity.

3 Transcription

All of the recordings were transcribed by Welsh–
English bilinguals using the CHAT2 transcription
system in the Computerized Language Analysis
(CLAN) program (MacWhinney, 2000). Within
the CHAT system, transcribers used language tags
to differentiate between Welsh, English, and cognate
words. Words in the most frequent language in each
conversation were left untagged, while all other
words were tagged according to their corresponding
language using three-letter ISO 639-3 abbreviations
(e.g. @s:eng for English). Transcribers employed
a ‘dictionary method’ to allocate language
tags and ensure consistency in the transcripts.
Words that occurred in the dictionaries of both
English and Welsh were considered cognates and
tagged as @s:cym&eng, with the language tags in
alphabetical order of the abbreviation. These
words included proper nouns, nouns, and verbs as
well as other word classes. In the Siarad corpus,
Welsh is the most frequent language in all
conversations, ranging from 51 to 93% of the
words, with an average of 84%.

One of the key advantages of language tagging is
that it greatly facilitates the identification of cog-
nates and codeswitching in bilingual corpora.
Example (1) below illustrates a transcription tier
with language tags and a translation tier.

(1) ond dw i ddim actually@s:eng isio mynd i
wrando ar y stuff@s:cym&eng.

‘but I don’t actually want to go and listen to the
stuff’.
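
The tagging convention in Example (1) can be read mechanically, as the following Python sketch suggests. It is an illustration only and not part of the Siarad tooling: the function names are hypothetical, the input is assumed to be a bare transcription line without speaker codes or other CHAT markup, and Welsh (cym) is assumed to be the untagged default language.

```python
def parse_token(token, default_lang="cym"):
    """Split a token such as 'actually@s:eng' into (word, language)."""
    word, marker, tag = token.partition("@s:")
    # Untagged words belong to the most frequent language of the conversation.
    return word, (tag if marker else default_lang)


def parse_utterance(line, default_lang="cym"):
    """Return (word, language) pairs for one transcription line."""
    tokens = line.strip().rstrip(".").split()
    return [parse_token(token, default_lang) for token in tokens]


def is_cognate(lang):
    """Cognates carry both language codes joined by '&' (e.g. 'cym&eng')."""
    return "&" in lang


example = "ond dw i ddim actually@s:eng isio mynd i wrando ar y stuff@s:cym&eng."
for word, lang in parse_utterance(example):
    print(word, lang, "cognate" if is_cognate(lang) else "")
```

Keeping the language as a separate field for every word is what later makes it straightforward to count cognates and to detect switches per clause.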

In addition to the tiers of transcribed speech and
the translation tier, another tier was included in the
corpus that was essential to our study: a morpheme-
by-morpheme gloss. Originally, all of these tiers
were entered manually by a team of bilingual
researchers. The newly developed Bangor
Autoglosser provides researchers with a more effi-
cient automatic method of glossing.

4 The Bangor Autoglosser

The transcription tiers were glossed with an innova-
tive automated tool called the Bangor Autoglosser
that followed the Leipzig glossing conventions
(Carter et al., 2016; Donnelly and Deuchar, 2011).
Given that the existing tagging system used in
CLAN only handles larger languages of over five
million speakers (MacWhinney, 2009), it was neces-
sary to create a tool from scratch that could auto-
matically gloss large multilingual Welsh–English
texts. The implementation of the Bangor
Autoglosser involved a combination of digital dic-
tionaries and the application of Constraint
Grammar (Karlsson, 1990; Karlsson et al., 1995).
Constraint Grammar assigns grammatical tags to
text based on context-dependent rules written by a
linguist. Each rule selects, removes, adds, or replaces
the tag on any given word by taking into account
surrounding words and their tags. This was the first
application of Constraint Grammar to mixed-lan-
guage texts. Essentially the procedure involves the
separation of text into words, the lookup of each
word in a dictionary that gives possible lemmas
and part-of-speech (POS) for that word, and the
selection of the correct lemma and POS for the
word in its current context. This is illustrated in
Table 1.
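
The lookup step can be sketched in Python using the rows of Table 1. This is a minimal illustration rather than the Autoglosser's own code: the names `build_lexicon` and `lookup` are hypothetical, and placing 'infin' and 'past' in the tense column and '3s' in the number column is an assumption about how the flattened table should be read.

```python
from collections import defaultdict

FIELDS = ("surface", "lemma", "enlemma", "pos", "gender", "number", "tense")

# Rows taken from Table 1; empty strings mark cells with no value.
ROWS = [
    ("bara", "bara", "bread", "n", "M", "sg", ""),
    ("cathod", "cath", "cat", "n", "F", "pl", ""),
    ("mynd", "mynd", "go", "v", "", "", "infin"),
    ("aeth", "mynd", "go", "v", "", "3s", "past"),
    ("hapus", "hapus", "happy", "adj", "", "", ""),
    ("rhywsut", "rhywsut", "somehow", "adv", "", "", ""),
    ("heb", "heb", "without", "prep", "", "", ""),
]


def build_lexicon(rows):
    """Index dictionary rows by surface form; a form may have several readings."""
    lexicon = defaultdict(list)
    for row in rows:
        entry = dict(zip(FIELDS, row))
        lexicon[entry["surface"]].append(entry)
    return lexicon


def lookup(words, lexicon):
    """Return every candidate reading for each word (empty list if unknown)."""
    return [(word, lexicon.get(word.lower(), [])) for word in words]


lexicon = build_lexicon(ROWS)
for word, readings in lookup(["aeth", "cathod", "xyz"], lexicon):
    print(word, [(r["lemma"], r["enlemma"], r["pos"]) for r in readings])
```

All readings are returned at this stage; choosing among them is left to the disambiguation rules.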

The autoglossing process is as follows. First, the
Bangor Autoglosser imports each utterance from a
transcript into an utterance table, as seen in Fig. 1.
The table facilitates the process of editing or adding
items either directly to the table or to an exported
spreadsheet version of the same table.

Second, the words are imported from the database
into a ‘words table’ and tokenized (Fig. 2). Any
mutations in Welsh are removed (e.g. ‘gath’ is restored
to ‘cath’),3 and any elisions or regular verb endings in
English are also removed (e.g. ‘gonna, I’ll’).

The language tags are used to decide which dic-
tionary is consulted for the gloss. The correct diction-
ary accumulates all matching entries for each word
and writes them in another file that is in the format
required by the Constraint Grammar parser. Next,
the parser applies the Constraint Grammar rules to
the file.4 For example, in the case of the English word
‘dance’, there would be one reading: dance, sv, infin,
meaning that ‘dance’ can be a singular noun or a
verb (with the combined tag ‘sv’), and that if it is a verb, it
is usually an infinitive. The Constraint Grammar
rules then use context to convert the ‘sv’ tag into
‘n.sg’, or ‘v.pres’ (e.g. they dance). The Constraint
Grammar rules for Welsh are applied by the parser
in the same way it would apply rules for any lan-
guage. In other words, there is no need for a special
algorithm to be written specifically for Welsh. This is
one of the features that allows Constraint Grammar
to be used to tag multilingual text. One main differ-
ence between English and Welsh, however, is the
higher number of homonyms present in English.
As a result, in Welsh, each individual meaning
tends to have a separate reading.
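
The effect of a context rule on the ‘sv’ tag can be imitated in a few lines of Python. This is a toy illustration only: real Constraint Grammar rules are written in their own rule language and applied by a dedicated parser, and the two rules, the word lists, and the function name `disambiguate` below are all invented for the example.

```python
# Hypothetical context word lists for the two toy rules.
DETERMINERS = {"the", "a", "this", "that"}
SUBJECT_PRONOUNS = {"i", "you", "we", "they"}


def disambiguate(tokens):
    """Resolve the ambiguous 'sv' tag from the preceding word.

    tokens is a list of (word, tag) pairs; tags other than 'sv' pass through.
    """
    resolved = []
    for index, (word, tag) in enumerate(tokens):
        if tag == "sv" and index > 0:
            previous = tokens[index - 1][0].lower()
            if previous in DETERMINERS:
                tag = "n.sg"      # 'the dance' -> singular noun
            elif previous in SUBJECT_PRONOUNS:
                tag = "v.pres"    # 'they dance' -> present-tense verb
        resolved.append((word, tag))
    return resolved


print(disambiguate([("they", "pron"), ("dance", "sv")]))
print(disambiguate([("the", "det"), ("dance", "sv")]))
```

When no rule fires, the ambiguous tag is left in place, which mirrors the cautious behaviour one would want from a rule-based tagger.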

The results of the application of the grammar
rules are stored in a words table as a combination
of a gloss and POS-tag (Fig. 3).

Finally, the entire CHAT file is written out of the
database with a new autogloss tier that is generated
from the glossed words. This output is illustrated in
Example (2).

‘but I don’t actually want to go and listen to the
stuff’

Using this innovative method, glossed text was
produced at a rate of 1,000 words per minute and
the 40 h Siarad corpus was glossed in approximately
8.5 h. We performed manual checks of the complete
outputs from five random transcription files which
showed that the precision of the glossing was be-
tween 97 and 99%, depending on the language.

Table 1 Welsh dictionary layout

Surface   Lemma     Enlemma   POS    Gender  Number  Tense
bara      bara      bread     n      M       sg
cathod    cath      cat       n      F       pl
mynd      mynd      go        v                      infin
aeth      mynd      go        v              3s      past
hapus     hapus     happy     adj
rhywsut   rhywsut   somehow   adv
heb       heb       without   prep

Fig. 1 Example (1) in the utterance table

In addition to its efficiency, another advantage of
the automated glossing is that it is now possible to
easily access any word or attribute of texts that are
available in the database. Through the use of a
scripting language such as Hypertext Preprocessor
(PHP) (Lerdorf, 2007) or Python (Bird et al., 2009),
researchers can manipulate the database
at this point and begin to analyze the corpus
data. We used a scripting language in most stages of
our study, including the development of the automated
clause-splitter we describe next.
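
As an illustration of this kind of scripted access, the Python sketch below loads the words of Example (1) into an in-memory SQLite table and counts words per language. The table layout, column names, and the choice of SQLite are assumptions made for the example; they are not the schema or database used in the study.

```python
import sqlite3

# Words of Example (1) with their language tags (cym is the untagged default).
WORDS = [
    ("ond", "cym"), ("dw", "cym"), ("i", "cym"), ("ddim", "cym"),
    ("actually", "eng"), ("isio", "cym"), ("mynd", "cym"), ("i", "cym"),
    ("wrando", "cym"), ("ar", "cym"), ("y", "cym"), ("stuff", "cym&eng"),
]


def load_words(conn, utterance_id, words):
    """Create a hypothetical words table and insert one utterance."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(utterance_id INTEGER, position INTEGER, surface TEXT, lang TEXT)"
    )
    rows = [(utterance_id, pos, surface, lang)
            for pos, (surface, lang) in enumerate(words, start=1)]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)


def language_counts(conn):
    """Count the words carrying each language tag."""
    query = "SELECT lang, COUNT(*) FROM words GROUP BY lang ORDER BY lang"
    return dict(conn.execute(query).fetchall())


conn = sqlite3.connect(":memory:")
load_words(conn, 1, WORDS)
print(language_counts(conn))
```

Once the words sit in a table, questions such as "which utterances contain a cognate?" become single queries rather than manual searches through transcripts.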

5 The Bangor Automated
Clause-Splitter

Given that the Siarad corpus was not originally tran-
scribed in simple clauses and no Welsh parser
existed, we needed to devise a way of automatically
splitting complex clauses into simple clauses for our
codeswitching analysis. This was an essential step so
that we could apply the Matrix Language Frame
model (Myers-Scotton, 1993, 2002) and determine
a base or matrix language for each clause. According
to the model, each codeswitched clause contains a
matrix language that provides the morphosyntactic
frame for the clause, and an embedded language
that contains inserted material, mostly consisting
of content morphemes. The matrix language can
usually be determined by the language of the finite
verb in each clause, which was found to be true for
the entire Siarad corpus. The large majority of the
clauses in the Siarad corpus have a Welsh finite
verb, with English providing the inserted material.

As mentioned in the introduction, previous stu-
dies that involved manual clause-splitting took sev-
eral weeks and many researchers to divide only a
few thousand clauses (Carter et al., 2011; Davies and
Deuchar, 2010; Herring et al., 2010). In the present
study, we were able to analyze 65,000 clauses as a
result of the creation of the Automated Clause-
Splitter.

During the initial development phase, the first
version of the clause-splitter was tested on the first
300 utterances of a single file and was checked in
detail, revealing an accuracy rate of 93%. In total,
twenty-one (7%) of the utterances were split incor-
rectly. Out of the twenty-one, eight (3%) were due
to an incorrectly applied rule in the Constraint
Grammar, and another three (1%) because of an
error in the dictionary. The final ten (3%) were
due to the splitter itself. To increase the accuracy
rate of the clause-splitter, we made corrections to the
Constraint Grammar application as well as the dic-
tionaries. Additionally, we revised some of the as-
sumptions that the splitter uses. For example, one
assumption is that inflected verbs have the clause
marker moved to the preceding word when the pre-
ceding word is a conjunction, a subordinator, or an
adverb. The initial list of these words was extended
because it was not exhaustive, which had caused
inaccurate clause-splits.

Fig. 2 Example (1) in the words list table

Fig. 3 Disambiguated words from Example (1) stored in
the words table after the application of the Constraint
Grammar parser

The clause-splitting procedure as applied to the
Siarad corpus can be summarized as follows. First,
for the purpose of the present study, we removed all
conversations containing more than two speakers
leaving us with fifty-two conversations and 105
speakers; this was a preemptive step that would
later facilitate the statistical analysis. Note that the
clause-splitter could handle conversations with
more than two speakers without any problem.
Second, we omitted all interactional markers,
which are utterances such as ‘uhhuh, mmhm’ that
do not fulfill any syntactic role in everyday speech.
Next, we added role indicators in the ‘words table’
(Fig. 3) to every finite verb, which were then moved
where necessary. In the following Example (3), the
finite verbs are underlined, the clause-splits are
marked with a forward slash (/), and the word onto
which the clause-split marker was moved is in bold.
The example illustrates how the marker is moved
from ‘o’n’ to ‘pan’ (when) because ‘pan’ is a con-
junction, following the assumption made by the
clause-splitter that the marker be moved to the
word preceding an inflected verb if that word is a
conjunction.

(3) dw i yn cofio/o’n i yn gweithio ar y nos/pan
o’n i yn gweithio yn Beaumaris

‘I remember/I was working nights/when I was
working in Beaumaris’
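
The marker-moving assumption can be sketched in Python using Example (3). This is a simplified reconstruction for illustration, not the clause-splitter's actual code: the function name, the POS labels, and the hand-assigned finiteness flags are hypothetical, and any words preceding the first finite verb would simply form their own segment here.

```python
# POS values onto which a clause marker may be moved (hypothetical labels).
MOVABLE_POS = {"conj", "sub", "adv"}


def split_clauses(tokens):
    """Split (word, pos, is_finite) tokens into clauses at finite verbs.

    A clause starts at each finite verb, or one word earlier when the
    preceding word is a conjunction, subordinator, or adverb.
    """
    starts = set()
    for index, (_, _, is_finite) in enumerate(tokens):
        if is_finite:
            moved = index > 0 and tokens[index - 1][1] in MOVABLE_POS
            starts.add(index - 1 if moved else index)

    clauses, current = [], []
    for index, (word, _, _) in enumerate(tokens):
        if index in starts and current:
            clauses.append(current)
            current = []
        current.append(word)
    if current:
        clauses.append(current)
    return clauses


# Example (3), with POS and finiteness assigned by hand for the sketch.
EXAMPLE_3 = [
    ("dw", "v", True), ("i", "pron", False), ("yn", "part", False),
    ("cofio", "v", False), ("o'n", "v", True), ("i", "pron", False),
    ("yn", "part", False), ("gweithio", "v", False), ("ar", "prep", False),
    ("y", "det", False), ("nos", "n", False), ("pan", "conj", True and False),
    ("o'n", "v", True), ("i", "pron", False), ("yn", "part", False),
    ("gweithio", "v", False), ("yn", "prep", False), ("Beaumaris", "name", False),
]

for clause in split_clauses(EXAMPLE_3):
    print(" ".join(clause))
```

The third clause begins at ‘pan’ rather than at ‘o'n’ because the marker is moved one word to the left when that word is a conjunction.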

As mentioned previously, spot checks of a
random sample of the splits revealed that this
method was over 97% accurate, which was
deemed an acceptable rate given the speed of the
process and the large number of clauses that were
produced for analysis.

Next, we determined the matrix language of each
clause by detecting the language of the finite verb
within that clause. This step was done automatically
based on the language tagging in the transcripts.
Once the matrix language was assigned, we assessed
whether there were any internal or external codeswitches.
If two languages co-occurred within the
same basic clause, it was considered an internal
switch, but if the subsequent clause had a different
matrix language from the previous clause, then it
was an external switch. Finally, we generated
additional data that characterized the clauses and
the conversations. For example, we wanted to
know the length of each clause in words, whether
the clause contained cognates and, if so, how many,
the type of each clause, the length in letters of each
cognate, and the language of the clause (Welsh,
English, or bilingual). Other key information
included the total number of words, clauses, cog-
nates, and codeswitches in each conversation and
per speaker.
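
The matrix-language and switch definitions above can be expressed compactly in Python. The sketch below is illustrative only: the function names are hypothetical, cognates (tagged cym&eng) are assumed not to count toward either language when testing for an internal switch, and the second clause in the usage example is invented.

```python
def matrix_language(clause):
    """Return the language of the first finite verb, or None if there is none.

    A clause is a list of (word, lang, is_finite) tuples.
    """
    for _, lang, is_finite in clause:
        if is_finite:
            return lang
    return None


def has_internal_switch(clause):
    """True when both Welsh and English words occur in the same clause."""
    languages = {lang for _, lang, _ in clause if lang in ("cym", "eng")}
    return len(languages) > 1


def external_switches(clauses):
    """Flag each clause whose matrix language differs from the previous one."""
    flags, previous = [], None
    for clause in clauses:
        current = matrix_language(clause)
        flags.append(previous is not None and current is not None
                     and current != previous)
        if current is not None:
            previous = current
    return flags


# Example (1) as one clause, followed by an invented English clause.
CLAUSE_1 = [
    ("ond", "cym", False), ("dw", "cym", True), ("i", "cym", False),
    ("ddim", "cym", False), ("actually", "eng", False), ("isio", "cym", False),
    ("mynd", "cym", False), ("i", "cym", False), ("wrando", "cym", False),
    ("ar", "cym", False), ("y", "cym", False), ("stuff", "cym&eng", False),
]
CLAUSE_2 = [("I", "eng", False), ("know", "eng", True)]

print(matrix_language(CLAUSE_1), has_internal_switch(CLAUSE_1))
print(external_switches([CLAUSE_1, CLAUSE_2]))
```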

Once the enriched data had been generated, they
were exported to a comma-separated values (CSV) file and
could be analyzed using statistical software such as R
(R Development Core Team, 2009).
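
A minimal Python sketch of such an export is given below. The column names, the rule for labelling a clause as Welsh, English, or bilingual, and the use of a semicolon to join cognate lengths are all assumptions made for illustration; the study's own export format is not specified here.

```python
import csv
import io

FIELDNAMES = ["clause_id", "n_words", "n_cognates", "cognate_lengths", "language"]


def clause_features(clause_id, clause):
    """Derive per-clause measures from a list of (word, lang) pairs."""
    cognates = [word for word, lang in clause if lang == "cym&eng"]
    languages = {lang for _, lang in clause if lang in ("cym", "eng")}
    if languages == {"cym"}:
        language = "Welsh"
    elif languages == {"eng"}:
        language = "English"
    elif languages == {"cym", "eng"}:
        language = "bilingual"
    else:
        language = "undetermined"
    return {
        "clause_id": clause_id,
        "n_words": len(clause),
        "n_cognates": len(cognates),
        "cognate_lengths": ";".join(str(len(word)) for word in cognates),
        "language": language,
    }


def to_csv(clauses):
    """Write one row of features per clause and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for clause_id, clause in enumerate(clauses, start=1):
        writer.writerow(clause_features(clause_id, clause))
    return buffer.getvalue()


EXAMPLE_1 = [
    ("ond", "cym"), ("dw", "cym"), ("i", "cym"), ("ddim", "cym"),
    ("actually", "eng"), ("isio", "cym"), ("mynd", "cym"), ("i", "cym"),
    ("wrando", "cym"), ("ar", "cym"), ("y", "cym"), ("stuff", "cym&eng"),
]
print(to_csv([EXAMPLE_1]))
```

A flat file with one row per clause can be read directly into R with its standard CSV import functions.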

7 Conclusions

In contrast to previous smaller-scale studies of codes-
witching patterns in bilingual corpora, and specifically
in the Welsh–English Siarad corpus, our research team
was able to analyze the entire corpus of 65,000 clauses
due to the development of innovative tools, namely,
the Bangor Autoglosser, which applied Constraint
Grammar to bilingual text for the first time, and the
Bangor Automated Clause-Splitter that divided thou-
sands of complex clauses into basic clauses at a rapid
rate. All of the data were contained in database tables
and were manipulated and analyzed through the use
of a general-purpose scripting language, rather than a
specific dedicated interface, such as the query applica-
tion found in the CLAN (MacWhinney, 2009) pro-
gram. The scripts were written and utilized
successfully to prepare a large quantity of clauses for
the analysis of several variables pertaining to our
study’s focus on triggered codeswitching. Although a
discussion of the statistical analysis and results is
outside the scope of this article, it should
be noted that without the use of the automated tools
and scripts, it would not have been possible to process
the large Welsh–English Siarad corpus with such
speed, efficiency, and accuracy.

Funding

This work was supported by a Small Research Grant
from the British Academy awarded to the first and

D. M. Carter et al.

26 Digital Scholarship in the Humanities, Vol. 33, No. 1, 2018

Downloaded from https://academic.oup.com/dsh/article-abstract/33/1/21/3057458
by MPI Psycholinguistics user
on 17 April 2018


second authors (grant number 101421). We also
gratefully acknowledge the support of the Max
Planck Institute for Psycholinguistics, the Centre
for Research on Bilingualism in Wales, and the
University of Calgary.

References
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. California: O’Reilly Media, Inc.

Broersma, M. (2009). Triggered codeswitching between cognate languages. Bilingualism: Language and Cognition, 12: 447–62.

Broersma, M. and De Bot, K. (2006). Triggered codeswitching: a corpus-based evaluation of the original triggering hypothesis and a new alternative. Bilingualism: Language and Cognition, 9: 1–13.

Carter, D., Broersma, M., and Donnelly, K. (2016). Applying computing innovations to bilingual corpus analysis. In Valenzuela, E. and de la Fuente, A. A. (eds), Language Acquisition Beyond Parameters: Studies in honour of Juana M. Liceras. Amsterdam: John Benjamins.

Carter, D., Deuchar, M., Davies, P., and Parafita Couto, M. C. (2011). A systematic comparison of factors affecting the choice of matrix language in three bilingual communities. Journal of Language Contact, 4: 153–83.

Clyne, M. (1967). Transference and Triggering: Observations on the Language Assimilation of Postwar German-Speaking Migrants in Australia. The Hague: Martinus Nijhoff.

Clyne, M. (2003). Dynamics of Language Contact: English and Immigrant Languages. Cambridge: Cambridge University Press.

Davies, P. and Deuchar, M. (2010). Using the matrix language frame model to measure the extent of word order convergence in Welsh-English bilingual speech. In Breitbarth, A., Lucas, C., Watts, S., and Willis, D. (eds), Continuity and Change in Grammar. Philadelphia, PA: John Benjamins, pp. 77–96.

Deuchar, M., Davies, P., and Donnelly, K. (2016). Building and using the Siarad corpus: bilingual conversations in Welsh and English. Manuscript.

Deuchar, M., Davies, P., Herring, J., Parafita Couto, M. C., and Carter, D. (2014). Bilingual language use. In Thomas, E. and Mennen, I. (eds), Advances in the Study of Bilingualism. Bristol: Multilingual Matters, pp. 93–110.

Donnelly, K. and Deuchar, M. (2011). Using Constraint Grammar in the Bangor Autoglosser to Disambiguate Multilingual Spoken Text. In Proceedings of the NODALIDA 2011 Workshop Constraint Grammar Applications, Riga, Latvia: NEALT Proceedings Series, Tartu.

Herring, J., Deuchar, M., Parafita Couto, M. C., and Moro Quintanilla, M. (2010). ‘I saw the madre’: evaluating predictions about codeswitched determiner-noun sequences using Spanish-English and Welsh-English data. International Journal of Bilingual Education and Bilingualism, 13: 553–73.

Karlsson, F. (1990). Constraint Grammar as a Framework for Parsing Unrestricted Text. In Proceedings of the 13th International Conference of Computational Linguistics, vol. 3:168–73, Stroudsburg, PA. doi:10.3115/991146.991176.

Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. (1995). Constraint Grammar: A Language-Independent System for Parsing Running Text. Natural Language Processing, 4. Berlin and New York: Mouton de Gruyter.

Labov, W. (1972). Some principles of linguistic methodology. Language in Society, 1: 97–120.

Lerdorf, R. (2007). PHP on Hormones: history of PHP. MySQL Conference. Santa Clara, California. http://web.archive.org/web/20130729204354id_/http://itc.conversationsnetwork.org/shows/detail3298.html.

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk, 3rd edn. Mahwah, NJ: Lawrence Erlbaum Associates.

MacWhinney, B. (2009). Enriching CHILDES for morphosyntactic analysis. Department of Psychology. Paper 175. http://repository.cmu.edu/psychology/17.

Milroy, L. (1987). Language and Social Networks. Oxford: Blackwell.

Myers-Scotton, C. (1993). Common and uncommon ground: social and structural factors in codeswitching. Language in Society, 22: 475–503.

Myers-Scotton, C. (2002). Contact Linguistics: Bilingual Encounters and Grammatical Outcomes. Oxford and New York, NY: Oxford University Press.

Papalexakis, E., Nguyen, D., and Seza Doğruöz, A. (2014). Predicting Code-Switching in Multilingual Communication for Immigrant Communities. In Proceedings of the First Workshop on Computational Approaches to Code Switching. Doha, Qatar, October 2014, pp. 42–50.

R Development Core Team. (2009). R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. http://www.R-project.org.

Solorio, T. and Liu, Y. (2008). Learning to Predict Code-Switching Points. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Honolulu, October 2008, pp. 973–81.

Notes
1 The Siarad corpus of Welsh–English data is available under open license at http://bangortalk.org.uk.

2 At the time the corpus was being collected, the CHAT system was one of the most suitable choices (Deuchar et al., 2016). Currently, there are other options available for multilingual data, such as ELAN (https://tla.mpi.nl/tools/tla-tools/elan/), although as MacWhinney explains in the CHILDES manual (http://childes.psy.cmu.edu/manuals/chat.pdf), the CHAT data can be translated to XML which can then be used in ELAN, among other programs.

3 In Welsh, as in the other Celtic languages, some word-initial consonants change (‘mutate’) to reflect morphological and syntactic relationships between the words of the utterance. For example: siop llyfrau da (a shop [with] good books), but siop lyfrau dda (a good bookshop), where the change d -> dd signifies that the adjective da (good) relates to siop (shop) and not to llyfrau (books). Llyfrau is itself mutated ll -> l to show that it qualifies siop. Another example is seen here where mae o’n marw means ‘he is dying’, but mae o’n farw means ‘he is dead’. The change m -> f signifies that marw (die, dead) is the adjective and not the verb. These mutations have to be removed to get to the underlying lemma.

4 The scripts for the Constraint Grammar rules for Welsh are available at https://github.com/donnekgit/autoglosser.
