key: cord-0018960-j4xijey6
authors: Dediu, Dan
title: Tone and genes: New cross-linguistic data and methods support the weak negative effect of the “derived” allele of ASPM on tone, but not of Microcephalin
date: 2021-06-30
journal: PLoS One
DOI: 10.1371/journal.pone.0253546
sha: bb210e4d1585b4f8dbc4d7acf8b079beb2d8f362
doc_id: 18960
cord_uid: j4xijey6

While it is generally accepted that language and speech have genetic foundations, and that the widespread inter-individual variation observed in many of their aspects is partly driven by variation in genes, it is much less clear if differences between languages may also be partly rooted in our genes. One such proposal is that the population frequencies of the so-called “derived” alleles of two genes involved in brain growth and development, ASPM and Microcephalin, are related to the probability of speaking a tone language or not. The original study introducing this proposal used a cross-linguistic statistical approach, showing that these associations are “special” when compared with many other possible relationships between genetic variants and linguistic features. Recent experimental evidence supports strongly a negative effect of the “derived” allele of ASPM on tone perception and/or processing within individuals, but failed to find any effect for Microcephalin. Motivated by these experimental findings, I conduct here a cross-linguistic statistical test, using a larger and updated dataset of 175 samples from 129 unique (meta)populations, and a battery of methods including mixed-effects regression (Bayesian and maximum-likelihood), mediation and path analysis, decision trees and random forests, using permutations and restricted sampling to control for the confounding effects of genealogy (language families) and contact (macroareas). Overall, the results support a negative weak effect of ASPM-D against the presence of tone above and beyond the strong confounding influences of genealogy and contact, but they suggest that the original association between tone and MCPH1 might have been a false positive, explained by differences between populations and languages within and outside Africa. Thus, these cross-linguistic population-scale statistical results are fully consonant with the inter-individual-level experimental results, and suggest that the observed linguistic diversity may be, at least in some cases, partly driven by genetic diversity.

Introduction matter, this may be a canonical example of a hypothesis-generating exploratory study leading, more than 10 years later, to an experimental hypothesis-testing design, supporting the initial proposal using completely different data and methods. However, this scientific success story generated two nagging questions for me: (i) what is going on with Microcephalin?, and (ii) the data, methods and results in [9] are from 2005-2006: how would they look in 2020?

For (i), both experimental studies [25, 26] fail to find any evidence for an effect of Microcephalin and, while this could be a false negative (its effect is too weak to be detected even with more than 400 participants) or simply not captured by the tasks, it is worthwhile taking it at face value and assuming that Microcephalin could have very well been a false positive in the original [9] study. For (ii), the intervening years have seen a revolution in the methods used to ask cross-cultural questions [27] , ranging from the generalisation of mixed-effects/hierarchical regression [28, 29] , to the use of Bayesian methods [2, 30] , permutation/randomisation [31] , and of phylogenetics [32] and machine learning [33] . Likewise, the availability and quality of linguistic (and cultural) data has dramatically improved, with databases such as WALS Online; [34] , PHOIBLE; [35] , LAPSyD and D-PLACE [36] being easily accessed by humans and machines. On the genetic side, while the genomic coverage of the data has exploded (we now have not only full exomes and genomes, but epigenetic data as well), both in modern and archaic humans (including Neanderthals and Denisovans), it has remained rather circumscribed geographically and ethno-linguistically, despite efforts such as the 1000 genomes project [37] , the Simons Genome Diversity Project [38] , and the The ALlele FREquency Database (ALFRED) [39] . So, if we were to collect new linguistic and genetic data, and use the methods now available, how would the relationship between ASPM, Microcephalin and tone look like?

This paper first summarises the original data, methods and results in [9] , then describes the new data collected (as of early 2020), the methods used and their results, ending with a discussion and conclusions. a reminder, the vast majority of locations on our genome (or loci), including genes such as ASPM and MCPH1, come in pairs, with one copy inherited from the mother and one from the father. Thus, in simple cases as discussed here, each individual can have 0, 1 or 2 "derived" alleles in her genome-independently for each of the two genes. In a population (or group) of people, we can thus compute the frequency of the "derived" allele (for each gene independently), f D , by diving the number of "derived" alleles across individuals, N D , to twice the total number of individuals genotyped in the population, 2N (as each individual can have up to two such alleles): f D ¼ N D 2N . This can vary between 0% (the "derived" allele is absent from the population) and 100% (everybody has it) but, importantly, its estimation can be affected by many types of errors and biases, the most important being the number of people that are genotyped, the genealogical relationships between them, and their representativity for the considered (meta)population. (Please see [46] for a gentle introduction to population and evolutionary genetics). Also, while not directly relevant to the hypothesis in [9] , it is nevertheless important to mention that the original papers claimed that (i) the "derived" alleles of both ASPM and Microcephalin were evolving under positive natural selection (i.e., that there is a selective advantage to the individuals having them in their genomes), and (ii) that this selective advantage was probably related to "cognition". However, the method they used to test for positive selection is not robust, and the selection signal is probably an artefact [47, 48] . More importantly, given the controversies the original papers generated (including accusations of implicit racism) [49] , subsequent work did not find any influence of these "derived" alleles on cognition, brain size or any other such phenotypes [50, 51] (with some debatable exceptions; [52, 53] ). Therefore, while the crucial roles played by these two genes in the evolution and ontogeny of the human brain are clear, with very probable episodes of positive selection in our lineage [43] [44] [45] , it is currently unclear what phenotypic effects their "derived" alleles may have [26, 54] and odds are that the recent and current evolution of these particular alleles is mainly driven by neutral demographic processes and not by selective pressures [47, 48, 54] .

There were three factors that prompted us to explore the relationship between these two "derived" alleles and linguistic tone back in 2005-2006: 1. the striking visual resemblance between the maps of the two "derived" alleles ( Fig 1) and that of linguistic tone (Fig 3 and https://wals.info/feature/13A) 2. the probable involvement of the two genes in brain growth and development, and 3 . the evidence of a genetic basis for language and speech was rapidly increasing, not only in terms of heritability studies [55] , but also of candidate genes [56] , including the relatively recently discovered FOXP2 [57] ; while most of these concerned "universal" aspects of speech and language or their pathologies, a few did focus on normal inter-individual variation (see [58] for a contemporary review and discussion).

Thus, while the suggestion that these two "derived" alleles influence tone was not an a priori hypothesis that pre-existed seeing the data, but instead was prompted by the data itself, it did nevertheless have pre-existing theoretical roots, especially in the prediction that, given their genetic bases and the widespread inter-individual and inter-group genetic variation, there should be aspects of language and speech whose patterns of diversity are influenced by genetic diversity [58, 59] . Therefore, it is important to see [9] as an attempt to reject the hypothesis of a link between the geographic distributions of ASPM-D, MCPH1-D and tone, using a super-set of the data partly responsible for the generation of this hypothesis. The failure to achieve this rejection can be taken, in extremis [60] , as a confirmation that human visual perception is really good at detecting matching patterns and that [9] is a futile and misleading exercise in "dubble dipping"/"circular analysis", or, as repeatedly highlighted [9, 10, 58] , as the generation of a hypothesis from a combination of data and theoretical expectations, followed by the reduction of the probability of false positives.

In order to check this hypothesis, we started from the genetic data in [41, 40] , consisting of the frequency of the two "derived" alleles in 59 populations, relatively widespread across the globe, but with a clear bias towards Africa and Eurasia, and very little data from the Americas and the Pacific, and no data from Australia. We manually mapped these populations to languages (using the meta-data in the two papers), and we collected data concerning not only their tone systems (using a binary coding of "no tone" versus "tone") but also various other structural aspects of language (mostly from [61] , supplemented with data from other secondary and primary sources, including questionnaires sent to specialists in various languages); likewise, we collected data about the frequency of many more genetic loci spread across the genome in these populations from databases such as ALFRED [39] and HGDP [62] (see [9, 58] for details). Due to missing and ambiguous data, we included only 49 of these 59 populations in our analyses (in particular, we did not include the 5 populations from the Americas in the statistical analyses, but we used them as an informal test of the results, in the sense that they have low frequencies of ASPM-D, high of MCPH1-D, and show both tonal and nontonal languages). those of all possible associations between all the linguistic features and all the genetic markers in our database. This procedure quantifies "how special" the relationship between tone and the two "derived" alleles is relative to what would be expected when repeatedly picking random aspects of language and our genome, allowing us to implicitly control for multiple confounds, such as climate, environment, contact, language family diversification, and past demographic processes and events.

On top of this, we also explicitly controlled for two major confounds [27] : shared inheritance and contact. The first refers to the fact that the features of related languages are not independent due to inheritance from their common ancestor ("Galton's problem"; [63] ), a point also applicable to the genetic makeup of populations that descend from a common ancestor. The second captures the fact that languages in contact tend to exchange features, just as populations may exchange genes. For this explicit control, we used the Mantel test [64] , which computes the (partial) correlations between two distance matrices (possibly controlling for others), and repeatedly permutes the data in order to compensate for the non-independence of the observations. The distances used were (the first two are of interest for the hypothesis, the last two are the confounds):

• the "structural distance" between languages, defined as the Euclidean distance on the space defined by one or more structural features: small distances reflect higher structural similarities between the languages • the "genetic distance" between populations, defined as Nei's D [65] : small distances mean that the populations have very similar allele frequencies • the "geographic distance" between languages, computed as the great circle distances constrained by the geography of the continents: presumably languages closer in space would have had more chances for linguistic and genetic exchanges, and

• the "historical linguistic distance", quantifies the degree of genealogical closeness between two languages using the classification in the 15 th edition of the Ethnologue [66] .

The from unambiguous, resulting in uncertain judgements with respect to the linguistic variables (including tone).

While the comparison with the empirical distribution of many other similar associations should control for many potential confounds, the technique used for their explicit control, namely the Mantel test, is far from ideal. First, it does not quantify the association between the actual values, but between distances derived from these values, introducing extra degrees of freedom and potential noise. For example, the way geographical distance is computed assumes a linear effect on contact, and the historical linguistic distance assumes that unrelated languages are only slightly more dissimilar than languages from the same family but in different "genera" (a distance of 4 versus 3). Second, controlling for the historical and geographical distances does not fully model their effects on the relationship between tone and the two "derived" alleles. For these (and other) reasons, the Mantel test is very rarely used nowadays, the preference being to explicitly model the sources of statistical non-independence using, for example, phylogenetic approaches, mixed effects/hierarchical models, permutation/randomisation or restricted sampling.

I now turn to the current study, detailing its data, methods and results.

Please note that all the data, and the R and Rmarkdown code need to reproduce the results reported here, as well as the full analysis report (as a self-contained HTML document) are freely available on Zenodo at doi: 10 .5281/zenodo.4762169 and (except the HTML full analysis report, due to its size) also in the GitHub repository https://github.com/ddediu/tone-genes-update. All external file paths referenced here are relative to the root of this repository (e.g., ./data/ genetics/code/00_preprocesses_genetics.R).

Despite the advances of the last 14 years, the availability of frequency data for the "derived" alleles of ASPM and MCPH1 is still the limiting factor for this update. Therefore, I first collected (hopefully, all) the currently available genetic data, followed by its merging with the linguistic data. Please note that I focus here only on the relationship between tone and the two "derived" alleles, and not on its comparison with other comparable relationships as in the original paper [9] ; therefore, I only collected and analysed data on ASPM-D, MCPH1-D and tone.

Definition. The original paper [40, p. 1720] identified the "derived" allele of ASPM in relation to "haplotype 63" and two of its polymorphic nonsynonymous sites in exon 18 in an open reading frame (ORF), A44871G and C45126A, with ancestral alleles A and C, respectively, and the derived ones, G and A. More recent publications however, use SNP (single nucleotide polymorphism) rs41310927 with ancestral allele T and derived allele C. Likewise, the original paper [41, p. 1717] identified the "derived" allele of MCPH1 (or Microcephalin) in relation to G37995C in exon 8 in an ORF with ancestral allele G and derived one C. More recent publications use SNP rs930557 with ancestral allele G and derived allele C.

Data sources. For the collection of population frequency data concerning the two "derived" alleles, I used a total of 7 sources (see Table 1 ): besides the original papers [41, 40] , I extracted information from the experimental study [26] on Cantonese speakers, as well as from several large genetic databases. However, it is important to note that, while most databases contain information about the actual "derived" alleles or the corresponding SNPs (rs41310927 and rs930557), not all do, but instead contain data on other SNPs that are in very tight LD (linkage disequilibrium) with them. I used LDlink's "LDproxy Tool" to obtain the list of all SNPs in LD with the target ones across all the populations in that database-see Table 2 for the list of retained "proxy" SNPs. For ASPM-D, these 5 "proxy" SNPs represent 289 unique extra genetic samples out of 396 (73%), and for MCPH1-D, the single "proxy" SNPs represents 141 unique extra genetic samples out of 245 (57.6%). Please note that, while increasing the coverage of the data, this procedure is not perfect, in that LD may differ between populations and geographic regions, potentially introducing noise. However, running the analyses excluding the "proxy" SNPs produce very similar, but somewhat weaker results to those obtained including them (see the full HTML analysis report), suggesting that they are, indeed, good proxies for the "derived" alleles.

The samples and (meta)populations. The unit (the "group") for which allele frequency information is given varies between, and even within, these data sources, and cannot be considered a priori equivalent nor unique between and within data sources. For example, 1000 genomes (and LDLink) contain populations such as "African/African-American", "Han Chinese in Beijing, China" and "Mende in Sierra Leone", while ALFRED may contain very specific samples from arguably the same (meta)populations, such as SA004380P and SA004595X, both "Khanty". Thus, I manually matched these units within and between samples using the meta-information available in each database, resulting in 175 unique samples contained in 129 unique (meta)populations, where possible based on those available in ALFRED; each sample has a unique ID, while for each (meta)population, besides a unique ID, I also provide a "readable" name. With these, a (meta)population may contain one or more samples: e.g., "Abkhaz" [PO000844Q] contains just the ALFRED sample SA004584V, while "Adygei" Table 3 , while the full HTML analysis reports also plots their maps. The frequencies of the "derived" alleles. For each unique sample there might be frequency information available for more than one locus; for example, for ASPM-D in the "Adygei" [PO000017I] sample SA001509P, there is information for A44871G from [40] (frequency f = 0.40 from N = 30 alleles) and for rs3762271 from ALFRED (f = 0.41, N = 34). In such cases, I computed the weighted average frequency as in Eq 1:

where i goes over all data sources with relevant information, f i is the allele frequency and N i the number of genotyped alleles (normally twice the number of genotyped individuals) in the data source i (in the example above, wavg(SA001509P) � 0.405). While this procedure does not assume subjective preferences between data sources, nor does it take into account the potentially imperfect LD between some of the "proxy" loci and the "target", it does give more credence to larger samples, resulting in an aggregate frequency estimate that should be more robust than each estimate independently. The distribution of the "derived" alleles. The patterns of distribution remain similar to those in the original Science papers (Fig 1) . The frequency of ASPM-D is globally below �60%, is very low in sub-equatorial Africa and the Americas, higher in eastern Eurasia, and highest in western Eurasia. MCPH1-D is almost absent in sub-equatorial Africa and relatively high everywhere else, reaching fixation (100%) in some samples.

Matching samples to languages. In most cases, the mapping of a given genetic sample to one (or more) language(s), using the provided metadata, is far from obvious. To this end, I used all the available information about the sample in the data source giving the allele frequency, in combination with other sources such as Wikipedia, the Ethnologue, the Glottolog, WALS, and the ISO 639-3 Registration Authority website, plus general knowledge about the distribution of the world's languages and ethnic groups. (For example, the description of the population "Adygei" [PO000017I] in ALFRED, available at https://alfred.med.yale.edu/ alfred/recordinfo.asp?UNID=PO000017I, gives general data about the population, including links to the Ethnologue, as well as more specific information about each particular sample, possibly including references to the actual publications). While this resulted in clear matches in Table 3 . The original genetic samples and (meta)populations (in parentheses) used in [9] , and the new ones added in this study, by macroarea.

Gene 

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin many cases, with 139 samples (80%) being uniquely matched to a single language (e.g., the samples of the "Adygei" [PO000017I] population were uniquely matched to the Adyghe language, Glottocode adyg1241, ISO-639-3 ady), and 23 samples to 2 languages each (e.g., the "Bakola Pygmy" were mapped to either/both Gyele [gyel1242, gyi] and Kwasio [kwas1243, nmg]), there were also more ambiguous mappings: 4 samples mapped to 3 languages each, 4 to 3, 1 each to 5, 6, 7 and 9 languages, 1 to 23 languages (the ALFRED/HGDP sample SA001501H "Papuan", where the best I could do was to map it to a whole set of potential Sepik, Ndu, Ap Ma, and Lower Speik-Ramu languages), and 1 mapping to no less than 144 languages (this "monster" is the SA004382R "Micronesians" sample, which I could only map to a whole subset of the Austronesian family). As this ambiguous mapping may introduce an extra level of uncertainty, for an extra analysis, I removed all the genetic samples that do not uniquely map to a single language. Please note that this procedure is very conservative, given that several "ambiguous" mappings are in fact concordant for tone, as is the case, for example, for , which correspond to SA004371P. The full results are reported in the HTML analysis report, but they are very similar to those obtained using the whole set of genetic samples, suggesting that using all the available information is worth the cost of added noise.

Data sources. I collected data from five sources (see Table 4 ). I used the 2014 version of the WALS Online [34] , which gives tone as a 3-way classification ("Feature 13A") in a machine-readable format (CSV). I manually collected the tone data from the LAPSyD website in early 2020, noting, for each language, the 5-way classification and the given number of tones. The binary classification of tone (present/absent) used in [9] is available in Annex 6 of my PhD thesis [58] ; it was manually curated from several primary and secondary sources (see [9, 58] for details). For PHOIBLE [35] , I downloaded the 2.0.1 version of the database and I extracted only the symbols used for tone (SegmentClass = "tone"), manually removing some symbols that appear only very rarely (see Fig 2) . Unfortunately, the data of The World Phonotactics Database ( [67] ; WPHON) was not accessible as of April 2020, so I used instead the last available snapshot from the Internet Archive/the WayBackMachine from 8 June 2019, Table 4 . Data sources for linguistic tone with the number of languages (the "#" column).

WALS Online [34] https://wals.info/feature/13A 3-way classification: "No tones", "Simple tone system" & "Complex tone system" 513 Tone classifications. Given these sources, I decided to compile information about tone in three formats (codings):

1. a binary coding, opposing languages that do not use any tone system to those that do (i.e., absence vs. presence): "No"/"Yes" 2. a 3-way ordered coding of tone system complexity: "None" < "Simple" < "Complex", and 3. a count of the number of tones (or tone symbols) used to describe a language, ranging from 0 (no tone) to a maximum dependent on the data source.

The rules for obtaining these coding for each data source are given in Table 5 . Please note that the counts were rebased, in the sense that those few languages reported as having 1 tone/ symbol were recoded as having 2 (the fact that they are very few, 6 for LAPSyD, 4 for PHO-IBLE, and 3 for WPHON, coupled with a case-by-case analysis, suggested that they can be safely collapsed with the languages with 2, all being considered as having simple tone systems), followed by the subtraction of 1 for all languages with at least 2 tones/symbols (i.e., the languages with 0 tones/symbols stay at 0, but all languages with n � 2 tones/symbols are recoded as having n − 1 tones/symbols)-in this way, we have a continuum of counts from 0 up to the original maximum-1.

Reconciliation of the sources. While the agreement between these five sources is very good (see S1 Fig) , there are some languages for which they disagree (e.g., Angaataha [anga1290, agm] is coded as "Moderately complex" in LAPSyD, but as "Simple" in WALS). Moreover, few languages are coded in all sources. Therefore, it would be preferable to arrive at an agreement coding of tone that (a) reconciles any existing conflicts and (b) can retain as many languages as possible. To this end, I implemented the following algorithm.

For the categorical classifications (binary and 3-way). I implemented a set of rules based on a hierarchy of the sources and the pattern of agreements and disagreements between them; while still subjective, this is fully replicable, transparent and can be easily modified. I preferred to use manually-curated categorical classifications over the count-level sources, resulting in Table 5 . How I coded tone for each data source. The given rules are of the form original value(s) 7 ! coded value; � means "all other original value(s)"; "as is" means the value as given by the data source without change; "rebase" applies only to the counts (see text for details); "-" means that the data source did not contribute to the coding (does not contain useful information).

WALS "None" 7 ! "No"; � 7 ! "Yes" as is -

LAPSyD "None" 7 ! "No"; � 7 ! "Yes" "None" 7 ! "None"; "Marginal" & "Simple" 7 ! "Simple"; � 7 ! "Complex" rebase DL2007 as is --PHOIBLE 0 7 ! "No"; � 7 ! "Yes" 0 7 ! "None"; �27 ! "Simple"; � 7 ! "Complex" rebase WPHON 0 7 ! "No"; � 7 ! "Yes" 0 7 ! "None"; �37 ! "Simple"; � 7 ! "Complex" rebase https://doi.org/10.1371/journal.pone.0253546.t005

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin the following (rough) ordering in terms of precedence:

The actual rules are coded as patterns of information about a given language in the available sources and the actions to be taken (e.g., if a language has information in LAPSyD, then this is used no matter what other information is available); in this way, the coding conflicts are implicitly solved by picking the "highest" available source. For the counts. I used a similar approach, in that the sources are ordered as LAPSyD � WPHON � PHOIBLE, with the added twist that for the last two (WPHON and PHOIBLE) I do not pick the "raw" value as given by the database itself, but instead a "corrected" value. More exactly, given that I consider the tone counts in LAPSyD as the "gold standard", I performed first the quadratic regression of the LAPSyD counts on the WPHON counts and, separately, on the PHOIBLE counts, and used the (rounded) value predicted by these regression models from the "raw" value in the corresponding database. These regressions are: L ¼ 0:08ð�0:04Þ þ 0:92ð�0:06ÞW À 0:04ð�0:01ÞW 2 ð2Þ L ¼ 0:39ð�0:05Þ þ 0:68ð�0:09ÞP À 0:04ð�0:

where L is the predicted LAPSyD count from the "raw" WPHON (W) or PHOIBLE (P) count. This procedure attempts to "align" and "scale" the counts (also allowing a non-linear quadratic term) so that they better map between sources. (Please note that the distribution of the unrounded predicted counts is extremely similar to that of their rounded values overall and within macroareas, as is their relationship with the frequency of the two "derived" alleles; however, as most of the statistical techniques used here for the counts, Poisson regression in particular, need integers, I did not conduct any further statistical analyses on these unrounded predicted values). The agreement classification. With these, I obtained three agreement classifications (one binary, one 3-way, and one count), that agree very well with the original sources. These have the following distributions (see also Fig 3) :

• binary (321): "No" (251), and "Yes" (70).

• 3-way (314): "None" (248), "Simple" (39) , and "Complex" (27) .

• counts (314): "0" (249), "1" (26), "2" (23), "3" (6), "4" (5), "5" (3), and "6" (2).

There are different numbers of languages with binary (321) versus 3-way and count (314) data because of the different patterns of missing data in the five original sources used, the type of coding they contain, and the possibility of deducing "coarser" codings from more "finegrained" ones (e.g., binary from 3-way and count, and 3-way from count) but not the other way around (e.g., 3-way from binary); 314 languages have data for all three codings.

While arguably justified, the hierarchy of the sources used to derive this agreement classification is only one of the many possible, and to check if it unduly biases the results I compared it to an alternative hierarchy of sources (please see the HTML analysis report for full details). This alternative can also be justified by the nature and coverage of the sources, and is, for binary and 3-way: WALS � WPHON � LAPSyD � DL2007 � PHOIBLE, and for counts: WPHON � LaPSyD � PHOIBLE. However, on the full dataset, these two agreement codings are very similar (only 27 out of 3798 languages disagree for the binary coding, 51 of 3785 for 3-way, and 59 of 3785 differ by more than 1 symbol for tone counts), and on the reduced datasets that also include the "derived" alleles they are virtually identical (2 of 181 differ for tone1, 4 of 180 for tone2, and 6 of 183 by more than 1 for counts).

For the purposes of the analyses reported here, I considered the following variables referring to tone:

• tone1: this is a shorter name for the binary coding of tone as "No" (251) vs "Yes" (70).

• tone2: this is a binary variable derived from the 3-way coding of tone by opposing "Complex" tone systems (coded as "Yes", 27 languages) vs "None" and "Simple" collapsed together (coded as "No", in 287 languages).

• tone counts: the unchanged tone counts from above.

While tone1 is the direct equivalent of the binary presence/absence coding of tone used in the original paper [9] , and allows the re-testing of the negative bias against the presence of tone due to high frequencies of the the "derived" alleles formulated there, tone2 and counts are new codings introduced here, potentially reflecting refinements of the original hypotheses, or even new ones. As such, it is possible that the "derived" alleles may (also) have an effect on the Each circle on the maps represents a unique language, and its colour represents is value. The insets show the overall distributions across all the languages. Top-left: binary tone (inset horizontal axis: "N" = "No", "Y" = "Yes"); top-right: 3-way classification (inset horizontal axis: "N" = "None", "S" = "Simple", "C" = "Complex"); bottom-left: tone counts. The maps were generated with the R package maps which uses public domain data from the Natural Earth project https://www.naturalearthdata.com/.

https://doi.org/10.1371/journal.pone.0253546.g003

complexity of the tone system, in the sense that they may influence the probability that a language has a "complex" tone system or not; the variable tone2 captures precisely this new hypothesis, and its analysis should be seen as complementing and partially overlapping that of tone1. Moreover, it is possible that the "derived" alleles influence not just the presence/absence of tone or of "complex" tone, but, more generally, the actual number of tone distinctions (potentially in a non-linear manner): this justifies the inclusion of the counts variable. In fact, the recent experimental evidence that ASPM-D influences tone perception in speakers of Cantonese (a complex tone language) [26] , seems to suggest that the effects of the "derived" alleles (in particular, ASPM-D's) on tone may go beyond "blocking" or "allowing" the emergence of tone in the first place, and may also affect the internal structure and complexity of tone systems (e.g., their "simplification" or "complexification").

Combining the tone and genetic data inevitably results in a loss of a few observations due to missing data, especially in what concerns the frequency of the "derived" alleles. As a reminder, we have three main units of analysis: the actual genetic sample, representing a particular group of people for which allele frequency data is available (e.g., using the example given above, SA001509P). The (meta)population, which may contain more than one sample (e.g., "Adygei" [PO000017I], containing not only SA001509P but also SA004373R and SA004585W), but one sample belongs to a single (meta)populations. And the "language", or, more precisely, the Glottocode (e.g. the Adyghe language with Glottocode adyg1241 and ISO-639-3 code ady): one sample/(meta)populations might speak more than one "language" (in reality or due to ambiguities in mapping), and one "language" can be spoken by more than one sample/(meta)population.

With these, there are in total 175 unique samples in 129 unique (meta)populations speaking 321 unique Glottocodes with data for at least one of the three tone codings and one of the two "derived" alleles; I will denote such counts as 175:129:321. Of these, 175:129:321 (i.e., all) have data for tone binary, and 170:124:314 for tone 3-way and counts (5 genetic samples from the 5 (meta)populations "Burunge", "Hazara", "Mozabite", "Oroqen" and "Xibe" speaking 7 "languages" buru1320, efee1239, gyel1242, haza1239, oroq1238, tumz1238 and xibe1242 do not have this info). 170:127:319 have info for ASPM-D (5 samples FINRISK, GenDan, GenNed5, KRGDB and Qatari, 2 (meta)populations "Dutch" and "Qatari" and 2 Glottocodes dutc1256 and gulf1241 lack it), 166:128:320 have info for MCPH1-D (9 samples, 1 (meta)populations "Bulgarian" and 1 "language" bulg1262 lack it), and 161:126:318 have info for both ASPM-D and MCPH1-D (14 samples, 3 (meta)populations "Bulgarian", "Dutch", "Qatari" and 3 Glottocodes bulg1262, dutc1256, gulf1241 lack this info).

Requiring also information for tone binary does not change this (161:126:318), while for 3-way and counts it drops to 156:121:311 (19 samples, 8 (meta)populations "Bulgarian", "Burunge", "Dutch", "Hazara", "Mozabite", "Oroqen", "Qatari" and "Xibe" and 10 Glottocodes bulg1262, buru1320, dutc1256, efee1239, gulf1241, gyel1242, haza1239, oroq1238, tumz1238, xibe1242 are lost).

However, as some of these samples map to more than one Glottocode, for each such sample I only retain the corresponding Glottocodes that have different tone values. Moreover, one (meta)population might have more than one sample, and for each such (meta)population I only retain those with different frequencies for ASPM-D and MCPH1-D. While this procedure can be criticised, it is conservative in that it gives more weight to divergent signals. With this, the final dataset on which the analyses were conducted contained a total of 181 observations counts is in large part due to real missing genetic data, but also to the sometimes ambiguous mapping between genetic samples and linguistic varieties, some with the same value for tone. An extreme case is sample SA004382R ("Micronesians") mapped to no less than 144 Glottocodes, all Austronesian languages, the vast majority of which (137) do not use tone-arguably, reducing these to a single "No" observation does not represent a loss of actual information. On the other hand, the loss of populations without genetic information for one of the "derived" alleles (5:2:2 for ASPM-D, 9:1:1 for MCPH1-D and 14:3:3 for both) is a real loss of information, but (a) it is arguably rather small, and (b) I decided against using various missing data imputation techniques as not to introduce biases in the data. (Moreover, my using of "proxy" SNPs for the "derived" alleles already is a form of very specific missing data imputation, informed by genetic theory and data).

I analysed separately the three variables related to tone, namely tone1, tone2 and tone counts. For each variable, I selected only the entries with non-missing data for it as well as for the two "derived" alleles; if, for a given sample, there is more than one possible tone or allele frequency values (i.e., corresponding to more than one language with different tone coding or genetic samples with different allele frequency data), I only kept those entries that have different tone and/or allele information. This procedure maintains the uncertainty in the data, and, while it might be seen as giving too much voice to the "exceptions", it avoids giving too much voice to entries that happen to be similar due to close genealogical relatedness. With these, there are 181 observations among 119 unique Glottolog codes in 35 families for tone1, 165 observations among 108 unique Glottolog codes in 35 families for tone2, and 184 observations among 121 unique Glottolog codes in 35 families for tone counts.

As detailed below, I used several non-Bayesian and Bayesian statistical methods for the analysis of the data.

The (meta)populations are fully included in the language families, and most have just a few samples: of the 129 unique (meta)populations, 100 have only 1 sample, 17 have 2 samples, 8 have 3, 3 ("Finns", "Jews_Ashkenazi" and "Russians") have 4, and there is only 1 ("Han") with 5 samples. Therefore, I did not explicitly model them in the non-Bayesian approaches, as they add too little information to the samples and the language families (and, in very few cases, cause convergence issues), but I did include them explicitly, as embedded within the language families, in the Bayesian mixed-effects regressions.

However, the families do need to be considered (even if most have only one or a few languages: of all the 39 unique families, 16 are represented by a single language, 6 by 2, 5 by 3, 3 by 5, 2 by 7, 3 by 10, and 1 each by 16 ("Afro-Asiatic"), 26 ("Indo-European"), 31 ("Atlantic-Congo") and 146 ("Austronesian") languages, respectively), as it is essential to properly model the genealogical non-independence between related languages (and, arguably, genetic samples): therefore, I modelled them as random effects (in the now standard approach; e.g. [27] ).

Macroareas, on the other hand, need special consideration for several reasons:

• there is a very skewed geographical sampling (e.g, there is nothing from Australia, and very few data points from the Americas), the dataset being dominated by Eurasia and Africa;

• despite the obvious fact that geography strongly influences language contact and results in geographic clustering, this is far from a simple phenomenon captured by a limited number of clear-cut "areas"; therefore, there are several proposals of such areas that are arguably independent of each other for the purposes of language contact (e.g., [69] [70] [71] ), the one used here (based on the macroareas defined in the Glottolog) being relatively widely used but still open to criticism (e.g, "Africa" and "Eurasia" are treated as unitary, while PNG and Oceania are placed together within "Papunesia");

• the distribution of the two "derived" alleles is also highly skewed geographically, being almost completely absent from Africa.

Therefore,

• as "North America" and "South America" each have very few data points, I collapsed them into a single macroarea, "America";

• in order to control for the influence of the macroareas, I modelled them as fixed effects in the non-Bayesian approaches, as random effects (crossed with the families and (meta)populations) in a set of Bayesian mixed-effects regressions, and as a 2-dimensional Gaussian Process (with the families sand (meta)populations) in a different set of Bayesian mixed-effects regressions;

• because the genetic data suggests that the major split is between "Africa" and the rest of the world, coupled with the fact that some methods cannot gracefully handle multi-valued factors, made me dichotomise, for some analyses, macroarea in into "Africa" vs "non-Africa".

Following [9, 10] , and the direct experimental tests in [25, 26] , I am testing here the following hypotheses:

• there is a weak negative influence of the population frequency of ASPM-D on tone1 above and beyond the effects of shared ancestry (language family) and contact (macroarea);

• there should be no effect of MCPH1-D on tone1 independent of language family and macroarea, or it should be much smaller.

There are no clear predictions concerning the complexity of tone systems (tone2 and tone counts), but we might expect a negative effect of ASPM-D:

• there might be a weak negative influence of ASPM-D on tone2 above and beyond language family and macroarea;

• there might be a weak negative influence of ASPM-D on tone counts above and beyond language family and macroarea;

• there is no particular influence of MCPH1-D on tone2 independent of language family and macroarea;

• there is no particular influence of MCPH1-D on tone counts independent of language family and macroarea.

However, the "above and beyond the effects of shared ancestry (language family) and contact (macroarea)" are rather tricky in this particular case, and should be treated with care. These are indeed potential confounds in any cross-linguistic/cross-cultural/cross-population statistical studies and usually result in artificially inflated (i.e., artificially statistically significant) associations [27] . This is due to the fact that the languages/cultures/populations are not statistically independent, but more similar than expected by chance due to "genealogical inertia" ("Galton's problem"; [63] ) and contact. In our case, tone seems to be both stable genealogically (i.e., the daughter languages have a strong tendency to conserve the tonal system of their proto-language) and relatively easy to borrow between neighbouring languages [13, [19] [20] [21] 60] . Likewise, given that the two "derived" alleles, ASPM-D and MCPH1-D, are very probably evolving neutrally [47] , it is to be expected that their frequencies in any given population are shaped by drift and admixture [46] . While in small isolated populations their frequency might fluctuate widely across generations (possibly ending in fixation or complete loss), in larger ones they are expected to be relatively stable. Moreover, genetic exchanges between populations (due to inter-marriage, migrations. . .) lead to more similar frequencies, while various barriers (physical or cultural) might lead to increased differentiation [46, 72] . Thus, any apparent association between tone and the frequency of the "derived" alleles in present-day populations might be non-causal but spurious, due to a fortuitous accident through which it just so happened that a few populations, some generations ago, had a high frequency of the "derived" alleles and no tone, accident later amplified by language expansions, demographic processes and contact, making the proper control for these non-independencies a must [15, 27] .

On the other hand, if the hypothesis that the "derived" alleles induce a weak individual bias that can be amplified by the repeated use and transmission of language across multiple generations [9] [10] [11] is true (as strongly suggested by the experimental evidence; [25, 26] ), then the cross-linguistic/cross-population effects of this causal link might look very similar to the very confounds discussed above. This is so because in populations with a low frequency of the "derived" alleles, tone is "free" to change (i.e., being innovated, complexified, simplified or lost) under the influence of other processes (internally-motivated sound change, language contact, or even climate; [3, 12, 13] ), but in populations with a higher frequency of these alleles, there is an extra (weak) force against tone (thus, either a "push" away from tone through increased rates of tone simplification and loss, or a "pull" towards a lack of tone through a low probability of tonogenesis). The effects of this "extra" force (the negative bias) would not be instantaneous, but would presumably require several generations.

There are four important scenarios to consider: in the first, there is a significant increase in the frequency of the "derived" alleles (due to drift or gene flow), in the second there is a stable high frequency of the alleles, in the third, there is a significant decrease in this frequency, and in the fourth, a stable low frequency of the alleles. However, please note that it is currently unclear what a "significant" change in the frequency of the "derived" alleles should be, if this change is independent or not of the frequency itself, if there are thresholds or more general non-linearities, and if the frequency is modulated by the structure of the communicative network itself [17, 18, 73] -all in need of specific computational and cross-linguistic work that go beyond the scope of this paper; for my purposes here suffices that there are changes in allele frequency that result in changes in the negative bias towards tone at the language level.

In the first scenario, when the frequency of the "derived" alleles increases sufficiently in a population so that the negative bias becomes active, then we would expect that the original language(s) of the population would start simplifying or losing tone (if they had it), or fail to develop or complexify it (even if other conditions hold). If this process is on the same timescale as language differentiation or faster (a few generations/hundreds of years), it might retrospectively look like a "regular" sound change (or, even much harder to ascertain, a failure to change despite favouring conditions), followed (second scenario) by a period of relative stability (lack of tone) across the history of the ensuing language family. Moreover, given that it is highly improbable that such significant changes in allele frequency happen just in a single population that is large enough to support a viable language, or that it would stay confined there, they would likely affect (relatively simultaneously), or spread among, bigger groups of people speaking multiple languages across larger areas; retrospectively, this might look like a period of tone simplification or loss among several languages in contact which can be interpreted as contact-induced, followed by a period of relative stable lack of tone (second scenario).

In contrast, in the third scenario, the frequency of the "derived" alleles decreases sufficiently to remove the negative bias against tone, "allowing" the other forces to resume affecting tone, effectively "opening" up tonogenesis and tone complexification as possible pathways of language change, possibly followed by the fourth scenario, where the frequency of the "derived" alleles is low and the negative bias inactive. Retrospectively, scenario three would be hard to distinguish from the "usual" evolution of tone systems, except by being preceded by a period characterised by a lack of tone in the history of the languages. Scenario four is that of language change "as usual" in the absence of a genetic bias, where tone is gained and lost, complexified and simplified as driven by other factors.

Given what we know about the two "derived" alleles, namely that they emerged relatively recently (ASPM-D � 6,000 years ago, and MCPH1-D � 37,000 years ago) presumably somewhere in Eurasia, and have since spread and increased in frequency (presumably due to demographic processes of expansion, migration and inter-marriage) slowly and unequally around the globe, it is probable that all scenarios are applicable in different circumstances (linguistic families and geographic areas), but that scenario three is rather improbable.

Thus, because of the intrinsic population dynamics of these "derived" alleles, of the weakness of the bias they produce, and of the complex multi-factorial nature of language change (combining many types of influences and a good dose of serendipity), the causal effects of this bias will largely overlap with those of genealogical inertia (because, just as tone, the allele frequencies of the daughter populations are highly correlated between them and with those of the mother population) and areal effects/language contact (because, just as tone, the "derived" alleles are spread by people moving and interacting with each other). Fig 4 is a representation of these relationships using Judea Pearl's Directed Acyclic Graph (DAG) approach to causality [30, 74, 75] .

Given the limitations of the data available (not only in terms of sample size, but also of geographic and linguistic representativeness, and sometimes ambiguous mapping), the complexity of the causal model, and the roles of inheritance and contact discussed above, I used several complementary methods to try to test these hypotheses.

I used both Bayesian and non-Bayesian methods, as each have advantages and disadvantages and comparing them should allow a better estimation of the robustness of the results. On the one hand, the Bayesian methods (as implemented here by the brms package [76] in R [77] and Stan [78] ) are more flexible and, in this case, allow the (meta)population to be included among the random effects, and the macroarea to be modelled as a random effect or as a 2D Gaussian Process, but are more computationally expensive. On the other, the "classic" implementations (as available through (g)lmer and glmmTMB in, respectively, the lme4 [79] and glmmTMB [80] packages in R) have much lower computational costs, especially relevant for the randomisation and restricted sampling approaches. Nevertheless, it is important to point out that the Bayesian and non-Bayesian methods produce largely convergent results.

For tone1 and tone2 (binary variables) I performed logistic regressions (as implemented, in the "classic" approach, by glm and glmer when there is no random structure and where there is one, respectively, and by brms in the Bayesian approach), for the tone counts I used Poisson regression (glmer and brms, respectively), and for the population frequencies of the "derived" alleles I used Beta regression (as implemented by glmmTMB and brms, respectively; please note that, due to the restrictions of Beta regression on the dependent variable, I systematically replaced all 0.0 frequencies by 10 −7 and all 1.0 frequencies by 1.0 − 10 −7 ).

While such an approach would usually require some type of multiple testing correction, I decided here against it because (a) at least for tone1 this is based on a pre-existing hypothesis, and tone2 and tone counts are arguably related to this hypothesis, (b) the various methods and approaches are interpreted together rather than independently, (c) as shown by the power analysis and the results themselves, the dataset is small and the effects weak enough for multiple testing correction to effectively make the whole enterprise futile purely for statistical power reasons, and (d) it is technically difficult to correct across such different methods, and especially across frequentist and Bayesian ones (for the latter, it is not entirely clear even if this is a meaningful issue; e.g. [81] ). However, when interpreting the results the fact that there is no such correction should be kept in mind.

(Mixed-effects) regressions. This implements the largely current "standard" approach in cross-linguistic studies, where the dependent variable (DV) is regressed on the independent variable(s) (IVs), while the potential confounds are modelled as fixed or random effects [27, 28] .

In the Bayesian approach, the DV is either tone1, tone2 or counts, and the IVs of interest are the population frequencies of ASPM-D and MCPH1-D (and their interaction), while the confounds are macroarea, language family and (nested within family), the (meta)population. The last two of these confounds (family and (meta)population) were modelled as random effects, but I modelled macroarea either as yet another (crossed) random effect (in one set of regressions) or as a bi-dimensional (2D) Gaussian Processes [82] . Here, I use a Gaussian Process to model the effects of geographical space as implemented by the gp() function in package brms [76] , using the longitude and the latitude of each sample grouped separately for each macroarea; this allows the continuous modelling of the effects of geographic distance within each macroarea (see, for example, [30] for such an application). However, I consider this approach as "experimental" here as it seems to suffer from too few data for too many parameters and results in very wide posterior distributions. The importance of the IVs of interest is captured by, first, their posterior distribution (plotted and summarised by its mean and 89% Highest Density Interval (HDI); [30] ), second, by formal hypothesis testing of the one-sided expected direction of the effect (here, < 0) and versus the point (i.e., two-sided) value 0 (these tests produce both probabilities and evidence ratios; see function hypothesis in package brms for details), third, through comparisons with the ROPE (the region of practical equivalence, a small interval around 0.0, usually [-0.1, 0.1] but that can vary depending on the particular model) which produces the percent of the HDI within this interval and probabilities that can be interpreted similar to frequentist p-values (see functions rope and p_rope in package bayestestR and [81] for details), and, fourth, by the comparison of the model with the IV and the nested model without it (using multiple criteria: the Bayes Factor, leave-one-out crossvalidation (LOO), the Widely Applicable Information Criterion (WAIC), and K-Fold Cross-Validation; please see, for example, [30] for details).

In the "classic" approach, the DVs and the IVs of interest are the same as above, while the confounds are macroarea (modelled as a fixed effect) and its interactions with ASPM-D and MCPH1-D, and language family (as a random effect); the (meta)populations are not explicitly modelled. I individually tested the significance of each fixed effect separately by performing a likelihood ratio test and comparing the Akaike Information Criteria (AIC) between the model with the fixed effect of interest and the model without it.

In all regression models that include the frequency of either or both "derived" alleles as IVs, these frequencies are z-scored (i.e., transformed into a variable with mean 0.0 and standard deviation 1.0 through f À meanðf Þ sdðf Þ ). Only in the "classic" approach, for each DV:

All data. I first fitted the regression model on all the available data;

Randomization. followed by a permutation approach where I repeatedly (n = 1000) shuffle the data and re-fit the regression model, resulting in a null distribution of model fits that can be compared to the original fit to the unshuffled data. In fact, there are 18 different scenarios resulting from combinations of three parameters that test slightly different null hypotheses (see Tables 6 and 7) ;

Restricted sampling. finally, a popular approach in typology is represented by sampling only one language from each genealogical and/or areal unit [69, 83, 84] , which I implemented here by repeatedly picking only one language from each family, fitting a regression model (without any random effects structure, as each family now has exactly one member) while controlling or not for macroarea, and analysing the distribution of the model fits (this time, the comparison is done against the null hypothesis of no effect).

Please note that because permutations and restricted sampling involve randomness, the results might differ slightly between runs.

Mediation and path analysis. Because macroarea, and especially its dichotomisation as Africa vs the rest of the world, predicts both the distribution of tone and the frequency of the two "derived" alleles, it confounds any potential causal effect that the alleles may have on tone. Mediation analysis [85] allows the partitioning of the total effect (TE) of a variable of interest alleles-together = permute the two alleles together ! destroys the patterning of the two "derived" alleles but not the correlation between them alleles-independent = permute the two alleles independently ! destroys the patterning of the two "derived" alleles and the correlation between them within constraints on permutations unrestricted = all observations are freely permuted ! destroys all structure in the data families = observations are permuted within their family ! destroys the within-family structure, but conserves the between-families variation, i.e., conserves the genealogical signal macroareas = observations are permuted within their macroarea ! destroys the within-area structure, but conserves the between-areas variation, i.e., conserves the areal (contact) signal Table 6 , Name (1 st column) is a 3-letter shorthand: 1 st letter: T = tone, L = alleles-together ("linked") and I = alleles-independent ("independent"); 2 nd letter: U = unrestricted, M = macroareas, F = families; 3 rd letter: N = none, F = fixef. -"-means as above.

TUN tone unrestricted none tone is permuted across the whole sample ! destroys the family and macroarea structure of tone, but keeps it for the alleles; macroarea is not included LUN alleles-together unrestricted none the frequencies of the two "derived" alleles are permuted across the whole sample together ! destroys the family and macroarea structure of the two alleles (while preserving correlations between them), but keeps it for tone; macroarea is not included IUN allelesindependent unrestricted none the two alleles are permuted across the whole sample independently ! destroys the family and macroarea structure of the two alleles and the correlations between them, but keeps it for tone; macroarea is not included (here, dichotomised macroarea: Africa vs the rest of the world) on the outcome (here, tone) into its direct effect (ADE) and its mediated (or indirect) effect (ACME), the latter being "channelled", usually, through one mediator variable (here, ASPM-D or MCPH1-D)-see Fig 5. Here, I performed mediation analysis both in a "classic" approach (using the R package mediation [86] , with logistic regressions for tone1 and tone2, linear regressions for the zscored frequencies of ASPM-D and MCPH1-D, and Poisson regression for tone counts, controlling for language family indirectly, through restricted sampling), and in a Bayesian framework (using the package brms and a custom-written extension of the mediation function from package sjstats [87] , with logistic regressions for tone1 and tone2, Beta regressions for the actual, i.e., non-z-scored, frequencies of ASPM-D and MCPH1-D, and Poisson regression for tone counts, with language family and (meta)population as random effects). This extension of the mediation function allows the simultaneous modelling of two or more mediators, so that I could implement a model similar to the one discussed below for path analysis (Fig 6) while explicitly controlling for family and (meta)population as random effects, but this should be considered as experimental.

Path analysis [88] is even more flexible, here allowing the simultaneous modelling of the mediation effects of both "derived" alleles simultaneously besides the direct effect of macroarea on tone, as shown in Fig 6 ( but, as described above, I experimentally implemented this simultaneous mediation in a Bayesian framework). Again, the macroarea was dichotomised as Africa vs the rest of the world, but due to the limitations of the lavaan package [89] , the binary IV (the dichotomised macroarea) and the binary DV (tone1 and tone2) were coded either as numeric (0 vs 1; "rest of the world" = 0, "Africa" = 1, and "No" = 0, "Yes" = 1) or as ordered categorical ("rest of the world" < "Africa", and "No" < "Yes"), while tone counts were considered numeric; here, the "derived" allele frequencies are z-scored.

Both the mediation (the "classic" approach) and path analyses have trouble modelling the language family as a random effect, so, for each DV and type of model: all data:. I first fitted the model on all the available data, where there is no control for family; restricted sampling. I repeatedly fitted the model to a reduced dataset obtained by picking only one language from each family, the resulting distribution of model fits controlling thus for family.

(Please note that I do not need to control for macroarea, as it is explicitly modelled as the IV).

Machine learning. For the two binary outcomes, tone1 and tone2, I also implemented two classification techniques widely used in machine learning, decision trees and random forests. inside vs outside Africa), DV is the outcome of interest (here, the various codings of tone, tone1, tone2 or tone counts), and MV is the mediator (here, the population frequency of one of the "derived" alleles, ASPM-D or MCPH1-D). The average direct effect ADE = c 0 , the average indirect effect ACME = a × b, and the total effect is their sum, TE = ADE + ACME = c 0 + a × b. The coefficients a, b and c 0 are obtained from fitting two regression models simultaneously (in R formula notation): DV * IV + MV and MV * IV. These path plots are generated using Graphviz (https://graphviz. org/) as implemented by DiagrammeR::grViz in R.

https://doi.org/10.1371/journal.pone.0253546.g005

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin Both are used to find the subset of predictors and rules that best predict the binary outcome, and result not only in measures of how well the model fits/predict the data, but also in a set of explicit rules (decision trees) and ranking of how important the predictors are (random forests).

The measures of fit that I am using are accuracy, sensitivity (or recall), specificity and precision; given the observed (true) binary outcome and the model's predictions (i.e., what the model "thinks" or "labels as"), the relationship between the two is described by four measures: coefficients (a 1 , a 2 , b 1 , b 2 and c 0 ) . Please note that a similar model was also experimentally implemented in a Bayesian framework as simultaneous multiple mediation which explicitly controls for family and (meta)population as random effects.

https://doi.org/10.1371/journal.pone.0253546.g006

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin

Ideally, these measures of fit should be close to 100%, but deviations point to different types of failures and biases.

One important issue for such models is overfitting, where the model "overlearns" the data, fitting it very well, but doing very poorly at predicting new, unseen observations generated by the same processes. One popular technique to overcome this is to repeatedly split the original dataset into two complementary subsets: a training and a testing one. While the training subset is usually larger and is used to fit the model, the testing subset contains observations not yet "seen" by the model, and is used to estimate the fit measures that capture the capacity of the model to generalise to new situations. Here I used an 80%:20% split of the observations between training and testing, stratified by macroarea (i.e., making sure that the distribution of the observations in each subset reflects the distribution of the observations by macroareas in the full dataset), repeated 100 times. Please note that because the random forests have an internal bootstrap mechanism, this repeated training/testing procedure was not applied to them.

The decision trees were implemented by ctree in R's package partykit [90] , while for random forests I used randomForest in package randomForest [91] as well as the conditional random forests implemented by cforest in package partykit. randomForest provides two measures of relative variable importance: one is based on the mean decrease in accuracy if the variable is permuted, while the second is based on the mean decrease in node impurity (measured by the Gini index) when splitting on the variable; while the first captures how much a variable helps in making accurate predictions, the second focuses on producing more homogeneous splits. For cforest, I used the unconditional variable importance, which is similar to the mean decrease in accuracy.

Finally, given the importance of the macroarea as a confound, I fitted these models with and without macroarea as a predictor. (Please note that these models do not control for language family at all).

Phylogenetic approaches. While these hypotheses seem naturally amenable to phylogenetic approaches, including regression while controlling for phylogeny (as implemented, for example, in brms [76] ) and correlated evolution (as implemented, for example, in phytools [92] and RevBayes [93] ), these data are not appropriate for such analyses. First, there are too few language families with enough languages and phylogenetic structure; essentially, just 6 families have at least 5 languages: Turkic (10 languages), Atlantic-Congo (15) , Indo-European (20) , Afro-Asiatic (9), Uralic (8) and Sino-Tibetan (7) . Second, even these families show very little internal variation in tone and in the frequencies of the two "derived" alleles. Third, most phylogenetic methods also require branch lengths, which are notoriously controversial and hard to obtain for language families [5, 94] . However, even collecting more data might still not make such methods applicable, especially if the observation that families are internally very homogeneous for tone and the two "derived" alleles is confirmed, in which case we will either need to use highly controversial language trees above the level of the family [95] [96] [97] or exploit tiny variations within very large families.

The results are presented by outcome and method. While such structure makes this section rather dense and relatively hard to follow, it has the advantage that it does not emphasize any particular narrative thread, outcome or method. Instead, the reader is presented with all the results, leaving the highlighting of the various narratives for the Discussion and conclusions at the end. Each method has its own strength and limitations in relation to these data and questions, and they should be seen as complementary rather than in competition, and their results should be interpreted together.

While being, in certain ways, the fundamental type of approach used here, the individual regressions do not correctly model the postulated flow of causality between tone, the "derived" alleles, and other factors (including historical accidents) influencing the demographic and linguistic processes. Among the multiple types of regression used here, the "classic" mixed-effects models using glmer are probably the least reliable, with the Bayesian approach (brms) modelling macroarea as a random effect the most informative, while using a 2D Gaussian process should be seen as more "experimental" in nature. The randomization approach is very flexible and allows the fine control of various confounds, the most informative for our purposes here being: TUF (tone: unrestricted: macroareas; please see Table 7 for details) and IUF (alleles-independent: unrestricted: macroareas), which check if the overall clustering of tone or the alleles drives the observed association after controlling for macroareas; TMN (tone: macroareas: none) and IMN, which check if the association is due to their clustering specifically within macroareas; and TFF (tone: family: macroareas) and IFF, which check if the assocation is due to their clustering specifically within families after controlling for macroareas. The restricted sampling implements a different way to control for family and macroareas.

Mediation and path analyses do a much better job at modelling the flow of causality, and should therefore have more weight than the individual regressions, but mediation analysis deals with each allele separately, while path analysis and the "experimental" simultaneous mediation analysis treat them simultaneously. The Bayesian mediation analysis is the most appropriate and allows the control for family and macroarea, followed by restricted sampling (which also controls for these confounds), with the "classic" mediation using the full dataset being seen as suggestive. The path analysis on the whole dataset should be seen as suggestive (as it does not control for family and macroarea), but the restricted sampling does allow the control for these confounds.

Finally, the machine learning techniques should be seen as "experimental" and suggestive, complementing the others, as they do not control for the effect of family at all, but allow the explicit modelling of the macroarea and the quantification of the predictive importance of the "derived" alleles and of the macroarea.

Thus, the interpretation of the results is easy when various approaches "say the same thing", but when there are disagreements, it should be driven by the mediation and path analyses (especially the Bayesian and restricted sampling), followed by the regressions (Bayesian, restricted sampling, and the randomisation scenarios highlighted above), but still in the context of all the results taken together.

Before we analyse the relationships between the population frequencies of the "derived" alleles and tone, it is important to understand how they are structured by macroarea (a proxy for contact) and language family (a proxy for genealogy). To this end, I regressed ASPM-D and MCPH1-D separately on macroarea (as fixed effect) using mixed-effects Beta regression with family as random effect (in R's notation: a * 1 + M + (1jF) and m * 1 + M + (1jF), where a is the population frequency of ASPM-D and m of MCPH1-D, M is macroarea, F is family; * is the regression operator linking the DV on the left to the fixed and random effects on the right; 1 represents the intercept, + adds new predictors; (1 j F) denotes the random effects structure, here varying intercepts by family), and I found that their distribution is very strongly clustered within families (the intra-class correlation coefficients are: ICC(ASPM-D) = 71.5% and ICC (MCPH1-D) = 100.0%), and that macroarea predicts their distribution very well (for ASPM-D: p = 3.4 � 10 −16 , R 2 = 49.5%; for MCPH1-D: p = 3.1 � 10 −12 , R 2 = 73.2%). As a reminder, the intra-class correlation coefficient, ICC, represents the proportion of the variance explained by the grouping due to the random effects, and varies between 0% (the grouping contains no information) to 100% (basically all individual observations in a given group are identical). The adjusted ICC only considers the random effects, while the conditional ICC also considers the fixed effects as well, and they are equal when there are no fixed effects (i.e., for the null models DV * 1 + (1 j F)). Here, I report only the adjusted ICC computed on the null models, because we are interested in the clustering of the variance due to the random effects.

Separating Africa from the rest of the world seems to drive most of this effect, due to the overall lower frequencies of these alleles in Africa: for ASPM-D: p = 2.3 � 10 −14 , R 2 = 34.5%; for MCPH1-D: p = 3.2 � 10 −9 , R 2 = 33.7%.

Thus, the population frequencies of the "derived" alleles are strongly confounded by macroarea, and, in fact, seem mostly driven by the difference between Africa and the rest of the world.

When selecting all unique observations with non-missing data for tone1, ASPM-D and The relationship between tone1 and the population frequency of the two "derived" alleles is shown in Fig 8. It can seen that, globally, there seems to be a difference between languages with and without tone: while tone languages (tone1 == "Yes") tend to be found when ASPM-D has a low frequency, the others (tone1 == "No") tend to be found at high frequencies of both "derived" alleles. Zooming in on each macroarea shows different patterns: while there seems to be a difference for ASPM-D in Eurasia (higher for "No") and Papunesia (lower for "No"),

Please note that the vertical axes have different scales to avoid the macroareas with small sample sizes to be visually "squashed" by the whole database.

https://doi.org/10.1371/journal.pone.0253546.g007 there seems to be no differences in Africa and America. However, such plots can be very misleading because these points represent related languages and/or alternative genetic and linguistics values for the same sample-actual statistical analyses are needed.

Regressions. I fitted a "classic" mixed-effects logistic regression model of tone1 on macroarea, the z-scored population frequencies of the two "derived" alleles and their interaction as fixed effects, and family as random effect to the whole data (building on the notations introduced previously: t1 * 1 + a + m + M + a: m + (1 j F), where t1 is tone1, and: represents interaction). I performed hierarchical model comparisons to test each relevant predictor by comparing the models with and without the predictor (including with the null model with no fixed effects) producing p-values using an appropriate test as implemented by R's anova() function (here, for logistic regressions, a χ 2 test); I also considered the quadratic effects of the "derived" allele frequencies but while I will report here only on their linear effects these are in the full HTML report document. First, as expected, tone1 is strongly clustered within families (ICC(tone1) = 70.4%). Second, the interaction does not contribute (p(ASPM-D:MCPH-D) = 0.86) and was removed from the model. Third, as expected, macroarea predicts tone1 by itself (p = 0.00082, R 2 = 23.3%; for mixed-effects models, the proportion of variance explained by the model is represented by Nakagawa's R 2 , where the marginal estimate considers only the fixed effects, while the conditional also considers the random effects as well. Here, I only show Permuting the observations (Fig 9 and Table 8 ), shows that ASPM-D has a negative effect on tone1, detectable even when controlling for macroarea as a fixed effect, except when restricting the permutations within families (which is due, on the one hand, to the many families with few languages for which permuting the data does not change much, and, on the other, to the strong within-family clustering of both ASPM-D and tone1). In contrast, for MCPH1-D, there seems to be a negative effect only for unrestricted permutations when not controlling for macroarea.

Repeated restricted sampling, where only one sample is randomly chosen per family, also shows a clear negative effect of ASPM-D on tone1 even when controlling for macroarea and MCPH1-D (�82% of the β's < 0, with a mean � b � À 0:45), and a weaker but still discernible one for MCPH1-D even when controlling for macroarea and ASPM-D (�68% of the β's < 0, with a mean � b � À 0:37); see Fig 10 . , and the distribution of the permuted β for the three variables to be permuted (the coloured areas): tone by itself (red), the two "derived" alleles together as a unit (green), and the two "derived" alleles shuffled individually (blue). Please note that the y-axes vary between plots, but the x-axes are fixed for comparability. If a distribution is centred on 0.0 (e.g., "ASPM"/"unrestricted"/"none" in the top left) then the permuted data show no effect of the "derived" allele (here, ASPM-D) on tone (i.e., the regression slope β is 0.0). The less overlap is between the observed β (dashed black line) and such a distribution, the more "special" the relationship between the "derived" allele and tone is relative to the structure destroyed by that specific permutation; conversely, if the observed β is largely included in a distribution, the more such a value is expected by chance given the constraints of the permutation.

https://doi.org/10.1371/journal.pone.0253546.g009 Table 8 . The results of the regressions on the randomisation scenarios. The 3-letter scenario name as in 7. For each of tone1, tone2 and tone counts, I show the percent of models fitted on permuted data that, compared to the model fitted on the original data, have a better AIC, a smaller (i.e., more negative) regression coefficient β for ASPM-D, and a smaller β for MCPH1-D, respectively. The lower these percentages, the more "special" is the relationship on the original dataset relative to the structure that is destroyed in each scenario. In bold are the values < 5%. . Please note that the posterior probability for the directional hypothesis p(β < 0) should be as large as possible (this is the actual probability that the effect is negative) and the associated evidence ratio as large as possible, the % HDI inside ROPE should be as small as possible (a value of 0% meaning the HDI entirely falls outside the ROPE), and that p ROPE should also be as small as possible (a value of 0.0 meaning that the whole posterior distribution falls entirely outside the ROPE) and can be interpreted somewhat like a frequentist p-value by comparing it to a threshold of say, 0.05. Modelling macroarea as a 2D Gaussian Processes in the Bayesian mixed-effects logistic regression model found that the interaction is not needed, but that each "derived" allele seems to have a negative effect on tone. Mediation analysis. I ran two main separate mediation analyses (see Fig 5) with the treatment (IV) = dichotomised macroarea (Africa vs the rest of the world), the outcome (DV) = tone1, and the mediator (MV) = either ASPM-D or MCPH1-D, respectively, as well as the "experimental" Bayesian simultaneous bi-mediated model (see Fig 6) with both ASPM-D or MCPH1-D as mediators.

Using the full dataset and the "classic" approach, both main models find a . Taken together, this shows that there's a positive effect of being in Africa on tone, but that about half of it is mediated by the negative effect of ASPM-D, but not by MCPH1-D; moreover, the frequency of MCPH1-D is shaped much more by being in or outside Africa than that of ASPM-D.

Restricted sampling (repeated 1,000 times) finds the same pattern (see Fig 11 and Table 9 ), with only ASPM-D mediating part of the influence of macroarea on tone, but not MCPH1-D;

also, macroarea has a much stronger effect on the latter. This procedure does control for language family, but, on the other hand, drastically reduces the sample size to only N = 35 unique families, resulting in a low power of the individual mediation analyses, with relatively few effect sizes being large enough for statistical significance, but even so, there are many more . Please note that the y-axes vary between plots, but the x-axes of the four left panels (but not of the two right panels) are fixed for comparability. Mediation estimates (leftmost panels): ACME (blue) should be away from 0 for a sizeable mediation through the "derived" allele. Mediation p-values (mid panels): ACME (blue) should be below 0.05 (or 0.10) for a statistically significant mediation through the "derived" allele. (Partial) regression slopes β (rightmost panels): should be different from 0. See Table 9 for the numeric summaries.

https://doi.org/10.1371/journal.pone.0253546.g011 Table 9 . Summaries of the mediation analysis for 1,000 restricted samples for tone1. For each "derived" allele the table shows the three types of effects (total, direct and mediated) and the two (partial) regression coefficients. For each, it gives the mean, median and the results of the relevant comparison with 0.0 (smaller or bigger) in terms of the percent of the estimates smaller (or bigger) than 0.0 and the one-sided t-test (df = 999); the direction of the comparison is based on a priori expectancies: positive effects but negative βs. Path analysis. I ran path analyses (see Fig 6) with the treatment (IV) = dichotomised macroarea (Africa vs the rest of the world), the outcome (DV) = tone1, and the mediators MV 1 = ASPM-D and MV 2 = MCPH1-D, separately for the binary and the ordered codings of the IV and the DV (but given that the results are extremely similar, I only report here the numeric coding).

Using the full dataset, the model fits the data very well (w 2 1 ¼ 0:22, p = 0.64; CFI = 1.00, TLI = 1.01, NNFI = 1.01 and RFI = 1.00), and, as shown in Fig 13, being in Africa has a significant positive direct effect on tone1, a significant negative effect on ASPM-D which has a negative significant effect on tone1, but while it has a stronger significant negative effect on MCPH1-D this has no effect on tone1. Please note that, for path analyses/SEM models, we want the χ 2 goodness-of-fit test to be not significant, meaning that there is no reason to reject the hypothesis that the model fits the data; this is a binary decision. On the other hand, there are several fit indices (I only show a few: CFI, TLI, NNFI and RFI), which are continuous, the closer to 1.00, the better the model fits to the data.

As for the mediation analysis above, restricted sampling finds a similar pattern (see Fig 14  and Table 10 ).

Decision trees. If, besides ASPM-D and MCPH1-D, macroarea is also included, the decision tree fits the full dataset well (accuracy = 77.3%, sensitivity = 71.7%, specificity = 79.3%, precision = 54.1%, and recall = 71.7%), and generalises across 100 training/testing sets (mean ± sd: accuracy = 77.1% ±6.6%, sensitivity = 71.6% ±15.2%, specificity = 79.3% ±6.8%, precision = 52.9% ±12.4%, recall = 71.6% ±15.2%), and suggests that within Eurasia and Papunesia, the frequency of ASPM-D is a good predictor of tone1 (with p = 0.003). When macroarea is not included, then the fit for the whole dataset drops a bit (accuracy = 75.1%, sensitivity = 75.0%, specificity = 75.2%, precision = 39.3%, and recall = 75.0%) and generalises slightly less well (accuracy = 70.1% ±7.6%, sensitivity = 61.3% ±19.1%, specificity = 76.2% ±8.7%, precision = 44.0% ±23.0%, recall = 61.3% ±19.1%); now, ASPM-D is predictive overall (p < 0.001), and MCPH1-D is relevant (p = 0.004) only for low frequencies of ASPM-D (see Fig 15) .

Random forests. If macroarea is included as a predictor, the random forests fit the data well (accuracy = 77.7% ±0.8%, sensitivity = 68.9% ±1.9%, specificity = 81.6% ±0.4%, Leaving macroarea out slightly reduces the fit (random forests: accuracy = 70.7% ±1.0%, sensitivity = 56.3% ±1.4%, specificity = 78.6% ±0.8%, precision = 59.0% ±1.8%, recall = 56.3% ±1.4%; conditional random forests: accuracy = 82.1% ±0.5%, sensitivity = 81.8% ±0.7%, specificity = 82.3% ±0.6%, precision = 60.5% ±1.6%, recall = 81.8% ±0.7%), but now all three criteria agree that ASPM-D > MCPH1-D.

Summary. Thus, tone1, capturing the presence of any type of tone system, is negatively influenced by each of the two "derived" alleles by themselves (i.e., when not controlling for shared inheritance and contact), while controlling for these confounds still finds these negative influences but they are now much weaker. (Interestingly, when contact is modelled in a continuous manner, using a 2D Gaussian Process, then these effects are stronger than when modelling it as a standard random effect). The strengths of the effects of the two "derived" alleles are comparable and are affected in similar ways by controlling for the confounds. However, when properly modelling the difference between Africa and the rest of the world in allele frequencies and tone I found that only ASPM-D has a negative effect on tone above and beyond that due to macroarea and language family, suggesting that the frequency of ASPM-D partially mediates the influence of macroarea on tone. Thus, there seems to be a weak negative effect of ASPM-D on the presence/absence of tone above and beyond the confounding effects of contact and shared inheritance, but less convincingly for MCPH1-D.

While the original hypothesis in [9] concerns strictly the presence of tone, here I capitalise on the availability of data allowing the comparison of languages with a complex tone system versus those without tone or with a simple system-this is embodied in the tone2 variable, where "Yes" represents complex tone systems.

When selecting all unique observations with non-missing data for tone2, ASPM-D and MCPH1-D, there are 180 observations, distributed among 118 unique Glottolog codes (languages) in 35 families (the number of languages per family ranges from 1 to 47, with mean 5.1 

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin 

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin and median 2). There are 29 (16.1%) languages with complex tone ("Yes") in the dataset (w 2 1 ¼ 82:7, p < 2.2 � 10 −16 ), and the distribution by macroarea is 37 (9 = 24.3% "Yes") in Africa, 123 (18 = 14.6%) in Eurasia, 10 (1 = 10%) in America, and 10 (1 = 10%) in Papunesia (between macroareas w 2 3 ¼ 2:6, p = 0.46); see Fig 16 . The relationship between tone2 and the population frequency of the two "derived" alleles is shown in Fig 17. It can seen that, globally, it seems that languages with complex tone tend to be found when ASPM-D has a low frequency, which seems to be the case also in Eurasia.

Regressions. On the whole data and using the "classic" approach, tone2 is almost completely clustered within families (ICC(tone2) = 95.6%). The interaction between ASPM-D and MCPH-D does not contribute (p = 0.76), macroarea does not predict tone2 (p = 0.50, R 2 = 2.0%), and neither do the two "derived" alleles individually (ASPM-D: β = −0.87 ± 0.69, p = 0.19, R 2 = 1.3%; MCPH1-D β = −1.01 ± 0.73, p = 0.16, R 2 = 1.5%).

Permuting the observations (Fig 18 and Table 8 ), shows that ASPM-D has a negative effect on tone2 in all conditions, less clear when restricting the permutations within families, but MCPH1-D seems to have a negative effect only for unrestricted permutations when not controlling for macroarea.

Repeated restricted sampling also shows a clear negative effect of ASPM-D on tone2 even when controlling for macroarea and MCPH1-D (99.9% of the β's <0, with a mean � b ¼ À 0:96), and arguably no effect for MCPH1-D (35.1% of the β's <0, with a mean � b ¼ 0:67); see Fig 19 .

Fitting a Bayesian mixed-effects logistic regression model of tone2 on ASPM-D, MCPH1-D and their interaction as fixed effects, and macroarea, family and (meta)population (nested within family) as random effects, found that the interaction is not needed, but that each "derived" allele seems to have a negative effect on tone. Restricted sampling (Fig 20 and Table 11 ), finds that while for both "derived" alleles there are positive and comparably strong total (TE) and direct (ADE) effects, only ASPM-D shows a positive indirect effect (ACME), but pretty much none of these effects are individually significant.

The Table 11 for the numeric summaries.

https://doi.org/10.1371/journal.pone.0253546.g020 Table 11 . Summaries of the mediation analysis for 1,000 restricted samples for tone2. Same conventions as in Table 9 . Fig 22, being in Africa has no direct effect on complex tone, but has a significant negative effect on ASPM-D which has a negative significant effect on complex tone; however, while it has a stronger significant negative effect on MCPH1-D, this has no effect on complex tone. The same pattern is also found by restricted sampling, except that here there is also a hint of a negative effect of MCPH1-D on complex tone (see Fig 23 and Table 12 ).

Decision trees. The decision trees including or excluding macroarea are the same and trivial in that they uniformly predict just the majority value "No" and have only one split depending on the frequency of ASPM-D (� or > � 0.27, p < 0.001).

Random forests. If macroarea is included as a predictor, the random forests fit the data well (accuracy = 84.2% ±0.5%, sensitivity = 58.4% ±11.2%, specificity = 84.9% ±0.4%, precision = 8.8% ±2.6%, recall = 58.4% ±11.2%), and the conditional random forests are even better (accuracy = 87.2% ±0.7%, sensitivity = 97.5% ±4.8%, specificity = 86.9% ±0.7%, precision = 21.2% ±5.2%, recall = 97.5% ±4.8%), and two measures of variable importance (mean decrease in accuracy and mean decrease in the Gini index) find that ASPM-D > MCPH1-D > macroarea, while the third (unconditional importance) finds that ASPM-D > macroarea > MCPH1-D. 

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin Leaving macroarea out has very little effect on the fit (random forests: accuracy = 82.2% ±1.0%, sensitivity = 42.8% ±4.4%, specificity = 87.3% ±0.6%, precision = 30.3% ±3.5%, recall = 42.8% ±4.4%; conditional random forests: accuracy = 87.3% ±0.2%, sensitivity = 81.2% ±3.1%, specificity = 87.7% ±0.0%, precision = 27.6% ±0.0%, recall = 81.2% ±3.1%), and all three criteria agree that ASPM-D > MCPH1-D.

Summary. Complex tone is much rarer than no tone and simple tone, so that the distribution of tone2 is very skewed, and the results much less clear-cut than for tone1, but it seems that there is also a negative effect of ASPM-D on complex tone (while for MCPH1-D the evidence is much more ambiguous).

There are 184 unique observations with non-missing data for tone counts, ASPM-D and MCPH1-D, distributed among 121 unique Glottolog codes (languages) in 35 families (the number of languages per family ranges from 1 to 47, with mean 5.3 and median 2); see Fig 24. The relationship between the tone counts and the population frequency of the two "derived" alleles is shown in Fig 25. It can seen that, globally, languages with higher tone counts tend to be found when ASPM-D has a low frequency, but this does not seem to be a linear relationship. 

On the whole data and using the "classic" approach, the tone counts are strongly clustered within families (ICC(tone2) = 65.3%). The macroarea predicts the tone counts (p = 0.013, R 2 = 23.8%), and while ASPM-D has no significant effect by itself (−0.37 ± 0.19, p = 0.061, R 2 = 7.6%), MCPH1-D seemingly does (−0.46 ± 0.19, p = 0.016, R 2 = 9.9%), but this seems to fully overlap with that of macroarea. Path analysis. On the full dataset, the model fits the data very well (w 2 1 ¼ 0:29, p = 0.59; CFI = 1.00, TLI = 1.01, NNFI = 1.01 and RFI = 1.00), and, as shown in Fig 30, being in Africa has no direct effect on tone counts, but has a significant negative effect on ASPM-D which has a negative significant effect on tone counts, but while it has a stronger significant negative effect on MCPH1-D this has no effect on tone counts. The same pattern is also found by restricted sampling, except that here there is also a hint of a negative effect of MCPH1-D on complex tone (see Fig 31 and Table 14) .

Summary. The actual number of tones/tone symbols has a skewed distribution, and the results are less clear, but it seems that there is a negative effect of ASPM-D on tone counts Table 13 . Summaries of the mediation analysis for 1,000 restricted samples for tone counts. Same conventions as in Table 9 . Table 11 for the numeric summaries.

https://doi.org/10.1371/journal.pone.0253546.g028 https://doi.org/10.1371/journal.pone.0253546.g031

Tone and genes: New cross-linguistic data and methods support the weak effect of ASPM but not of Microcephalin above and beyond the confounding effects of macroarea and family, but not so for MCPH1-D (except when using a Gaussian Process to model contact).

Given the importance of the Bayesian models in this analysis, I checked if the Bayesian regressions of tone1, tone2 and tone counts on the z-scored population frequency of each "derived" allele separately while controlling for family, (meta)population and macroarea as random effects (i.e., the models used above) are sensitive to the particular choice of prior distributions used. This is a standard technique in Bayesian data analysis (e.g., [30] ) that, first, makes sure that the chosen prior distribution does not unduly affects the results (in effect, creating artefacts) and, second, is a different way of getting a "feeling" of how much information there is in the data. In the analyses presented above I systematically used a student's t(df = 3, μ = 0, σ = 3) prior for the intercept α and the slopes β, where μ is the location parameter (the median) and σ is the scale parameter, resulting in a relatively wide spread around 0 for z-scored predictors (see also the prior recommendations for Stan at https://github.com/stan-dev/stan/wiki/ Prior-Choice-Recommendations). To this, I compared a set of 13 other priors explicitly chosen for their properties and relevance (see Table 15 ) by fitting the exact same regression model to the exact same data (varying only the intercept and slope prior) separately for the z-scored population frequency of ASPM-D and of MCPH1-D, respectively. For each such model I focused on the slope β of the predictor, and I collected the point estimate with its 89%HDI (see Table 16 ; for full results please see Appendix II in the S1 File). First, it can be seen that the estimates are negative for all priors, including for the extreme positive one (except for MCPH1-D ! tone2), suggesting that indeed, the "derived" alleles probably have a negative effect on tone after controlling for family and macroarea. Second, the "default", the "flat" and the "default normal" priors produce similar estimates for tone1 and tone counts (for both alleles) and slightly less so for tone2, suggesting that the "default" prior does not bias the results too much.

Thus, there is enough information in the data to override even the extreme priors (and clearly the "default" prior used), and this information consistently points to a negative effect of the "derived" alleles on the three codings of tone.

I used a simulation approach (as implemented by the simr package [98] ) to conduct a posthoc power analysis of the effect of ASPM-D on tone1. More precisely, I fitted a "classic" When counterfactually changing the number of languages per family (while keeping everything else constant) suggests that the 80% power threshold would be achieved only if there were at least �700 languages per family, while changing the number of families (while keep everything else constant) suggests that we reach 80% power only at �350 families! Independently changing the number of families and languages suggests that the region with > 80% power needs either many languages per family or many families, with the best trade-off reached for �100 to �200 families with �250 to �100 languages each.

For comparison, currently there are �150 families in the Ethnologue and �420 in Glottolog, the latter listing a maximum of �1400 languages per family (Atlantic-Congo), with a mean of �20, and a median of 2 or of 5 (when excluding isolates). Of course, this analysis must be taken as only an indication of the rough order of magnitude but, given the distribution of linguistic diversity, it suggests that using such a model would require an almost exhaustive sampling and even then its success would not be guaranteed. Interestingly, for the rather different question of absolute universals (i.e., the non-existence of logically possible feature values), the bootstrapping and Bayesian analyses in [99] similarly suggest that at least hundreds of statistically independent languages might be required.

The fact that the population frequencies of the two "derived" alleles is so different between Africa and the rest of the world, on the one hand, while being relatively uniform within each of these two geographic regions (particularly strong for MCPH1-D, raises the question of performing the analysis (a) excluding Africa, and (b) within each of the four macroareas Table 15 . The priors used for the sensitivity analyses of the Bayesian regressions. The student's t(df, μ, σ) distribution has parameters df = degrees of freedom, μ = location (the median) and σ = scale, while the normal(μ, σ) has parameters μ = mean and σ = standard deviation. Remember that these distributions refer to the intercept and slope of a z-scored predictor (i.e., one with mean 0 and standard deviation 1) that is expected to have a weak and possibly negative effect.

Distribution Description default t(3, 0, 3) the distribution used in the analyses reported above; relatively wide spread around 0 following recommendations flat normal(0, 10) very wide around 0 (i.e., "uninformative") default normal normal(0, 5) very similar to the "default" but using the normal distribution instead of student's t narrow zero t(3, 0, 1) relatively narrow around 0 (i.e., assuming the parameter is 0) very narrow zero t(3, 0, 0.1) very narrow around 0 (i.e., strongly assuming the parameter is 0) default negative t(3, −1, 3) wide around -1 (i.e., weakly assuming the parameter is negative) narrow negative t(3, −1, 1) relatively narrow around -1 (i.e., assuming the parameter is negative) separately. The full results can be found in the Appendix IV: Macroareas as units of analysis of the S1 File analysis report, but, briefly, as America and Papunesia have very few unique observations (each 10 in 6 families, respectively), they cannot be analysed independently, resulting in three analyses: (a) Eurasia + America + Papunesia (i.e., excluding Africa), (b1) just Africa, and (b2) just Eurasia. For these analyses, I did only the regression analyses (no Gaussian Process Bayesian regressions). In all these analyses, there is no detectable effect of the "derived" alleles on tone (except for restricted sampling and Bayesian repressions, which seems to find a negative effect of ASPM-D in certain cases). While these essentially null results at the level of the macroareas and when excluding Africa can be taken to show that the effect of ASPM-D on tone obtained using the full dataset are an artefact of the differences between the macroareas, and between Africa and non-Africa in particular (akin to Simpson's paradox [30, 74] ), I do not think that this interpretation is warranted. Instead, not finding a relationship within the individual macroareas and when excluding Africa is rather unsurprising given the small expected effect size, coupled with the very uniform population frequency of the "derived" alleles, and the small number of observations and their high clustering within families. Therefore, we should instead use all the data while modelling the family and macroarea structure, in particular that between Africa and non-Africa, as I have done in the regression, mediation and path analysis approaches.

The three types of mixed-effects regression models using the full dataset, (a) the "classic" (maximum likelihood, considering family as a random effect but macroarea as a fixed effect), (b) the Bayesian (which models family, (meta)population and macroarea as random effects), and (c) the "experimental" Bayesian with Gaussian Process (which models family and (meta)population as random effects but contact as a continuous Gaussian Process split by macroarea), find effects of relatively similar sizes and of the same sign for the influence of each of the two "derived" alleles on each of the three measures of tone, but with different probabilities of being "significant" (for (a) they are uniformly n.s. when controlling for macroarea, for (b) the posterior probability p(β < 0) of a strictly negative effect is > 80%, the % HDI inside ROPE <21% and p ROPE < 0.2, while for (c) p(β < 0) > 78%, the % HDI inside ROPE <24% and p ROPE < 0.2 across both alleles and the three measures of tone). Thus, while all three suggest weak negative effects on tone of both "derived" alleles, the Bayesian approaches seem more powerful at detecting these effects above and beyond the confounds (especially macroarea); moreover, overall there seems to be a stronger effect of ASPM-D relative to MCPH1-D. Moreover, the Bayesian regression models not only are not very sensitive to the prior (including extremely biased ones), suggesting that there is enough signal in the data, but their results consistently point to a negative effect of the "derived" alleles on tone. Performing maximum likelihood mixed-effects regressions with family as a random effect on repeatedly permuted data (under various constraints) suggests that there is a weak negative effect of ASPM-D on tone1 and tone2 that is not entirely due to macroarea, but that there is strong similarity within language families (both confirming the high intra-class correlations for family and suggesting that a stratified sampling of one language per family is sufficiently appropriate). However, MCPH1-D has no "special" effect on neither of the three tone variables, nor does ASPM-D on tone counts, except when all structure is destroyed through unrestricted permutations (i.e., when each sample is considered as statistically independent of all the others). Likewise, repeatedly fitting maximum likelihood regressions on restricted samples (where only one language is randomly picked per family) while controlling for macroarea and the other "derived" allele as fixed effects, shows that there is a weak negative effect of ASPM-D on all three measures of tone, and possibly of MCPH1-D on tone1, but arguably not on tone2 or tone counts. Mediation analysis looks at the total influence of being an African language or not (split justified by the marked differences in the frequencies of the two "derived" alleles for populations within and outside Africa) on the three measures of tone mediated separately by the frequency of the two "derived" alleles. On the full dataset, I used both a "classic" maximum likelihood approach (which does not control for language families directly and are more restricted in terms of regression models available) and a Bayesian one (where family is modelled as a random effect and allows the use of more appropriate regression models, as well as the "experimental" mediation through both "derived" alleles simultaneously), and the two agree except in one case, where the Bayesian approach is more powerful at detecting a mediation. For tone1, both find a positive mediation through ASPM-D (composed of a negative effect of being in Africa on ASPM-D, and a negative effect of ASPM-D on tone1), and both fail to find mediation through MCPH1-D. For tone2, both fail to find mediation through MCPH1-D, but they disagree for ASPM-D: while the maximum-likelihood approach fails to find a significant mediation (even if both its components are each significant), the Bayesian one does find a positive one (composed of a two negative effects, as above). For tone counts, both find a positive mediation through ASPM-D (composed of a two negative effects, as above), and both fail to find mediation through MCPH1-D. Repeatedly conducting "classic" maximum likelihood mediation analyses when extracting only one language per family at random (restricted sampling, implementing a control for family), paints a very similar picture: there is a clear positive mediation effect of ASPM-D on all three measures of tone (composed of negative partial effects of being an African language on ASPM-D, and a negative effect of ASPM-D on tone), but much less so for MCPH1-D.

Path analysis has the added advantage that it models the mediation of the effect of within versus outside Africa on tone through both "derived" alleles simultaneously, but at the cost of less flexibility in the types of models that can be used, and of not directly controlling for language families (the "experimental" simultaneous mediation technique can satisfy both desiderata but it is not, at the moment, sufficiently well tested to be considered otherwise). On the full dataset and on repeated restricted samples of a single language per family, the results are very similar, in that there is a clear mediation through ASPM-D for all three measures of tone, composed of a negative effect of being in Africa on ASPM-D, and a negative effect of ASPM-D on tone, but not for MCPH1-D.

The decision trees and random forests suggest that ASPM-D and, to a lesser extent, MCPH1-D, contain information with regard to tone1 and tone2 above that provided by macroarea (but there is no control for language family at all).

The post-hoc power analysis of the results obtained from the maximum-likelihood mixedeffects regression of tone1 on ASPM-D with family as a random effect and macroarea as a fixed effect suggest that this method would require too much data to be feasible.

Finally, the separate analyses of each macroarea independently and when excluding Africa, suggest that our interpretation should be based on the whole dataset at hand but modeling appropriately the areal and family structure of the data, as done in the mixed-effects regression, mediation and path analyses.

Using updated data and methods relative to the original paper [9] , I found, first, that the population frequencies of the two "derived" alleles (denoted ASPM-D and MCPH1-D), as well as the distribution of tone, coded as the presence/absence of any type of tone system (denoted tone1), of complex tone systems (tone2), or as the actual number of tones/tone symbols (tone counts), are strongly clustered within language families and macroareas. This is not surprising given that they are all shaped by similar processes, with a strong vertical transmission component (inheritance) coupled with various degrees of horizontal transmission (contact). However, this is a very important result, as it confirms the need to properly disentangle the confounding effects of inheritance (here, depending on the method, using the language families and the (meta)populations as proxies) and contact (also depending on the method, using the macroareas and the geographic distances within them as proxies), from the proposed causal effects of the two "derived" alleles on tone. On the other hand, we need to exercise care because we may risk "throwing the baby with the bathwater" in the sense that this type of causal effect is expected to act on comparable timescales to the processes affecting languages and the genetic structure of populations and, thus, to be similar to the confounding effects of inheritance and contact.

Therefore, I used multiple methods of controlling for these confounding effects, including the "standard" mixed-effects regression approach of modelling family as a random effect, and macroarea as a fixed effect, as a random effect, or through a bi-dimensional Gaussian Process of the geographic coordinates of the samples, but also by permuting the language and genetic data according to multiple types of constraints (none, within families, and within macroareas), and by repeatedly sampling only one data point from each family. Moreover, besides the regression approach which quantifies the "extra" effect of the alleles on tone left after removing the effects of macroarea and family, I also used mediation and path analysis, which allow the explicit modelling of the interplay between macroarea, tone and the alleles, as well as decision trees and random forests, which quantify the capacity and relative importance of macroarea and the two alleles for predicting tone.

With these, it is interesting to note that the results are largely consistent between the methods, and that, despite their differences, when considered together, they do paint a rather coherent picture. This overall picture is one of a weak negative effect of the population frequency of ASPM-D on tone above and beyond the effects of contact and inheritance, but these effects are largely overlapping in the sense that, on the one hand, samples from the same language family tend to be very similar to each other genetically and linguistically, while, on the other, being a sample from Africa or from outside Africa has a major effect on ASPM-D and (much less so) on tone. This "extra" weak negative effect of ASPM-D on tone means that languages spoken by populations with a high frequency of ASPM-D have, overall, a slight tendency to not use tone, and, if they do, to have a simpler tone system. It is interesting to note that the distribution of tone languages is clearly skewed relative to the high frequencies of ASPM-D, and that there seems to be a qualitative change around 25% from Abau (glottocode abau1245, Sepik; simple system; 57.5% ASPM-D), Swedish (swed1254, Indo-European; simple "pitch accent"; two samples with 49.5% and 43.8%), Western Balochi (west2368, Indo-European; simple; 32.0%), Eastern Panjabi (panj1256, Indo-European; simple system; 30.7%), Burushaski (buru1296, Burushaski; probably not?; 28.0%), Awngi (awng1244, Afro-Asiatic; simple; 27.4%), and a set of 7 Austronesian languages (cemu1238, fwai1237, kara1486, labu1248, kuma1276, xara1244 and yabe1254) ambiguously matching the "Micronesians" (ALFRED SA004382R) genetic sample with 27.3% ASPM-D with simple tone systems, to Lü and Tai Nüa (luuu1242 and tain1252, Tai-Kadai; complex tone systems; 25.0%), Sichuan Yi (sich1238, Sino-Tibetan; simple; 25.0%), She (shee1238, Hmong-Mien; complex; 23.4%) and Mandarin Chinese and Yue Chinese (mand1415 and yuec1235, Sino-Tibetan; complex; 23.1%). At the other extreme, a low frequency of ASPM-D clearly does not require the presence of tone. Concerning MCPH1-D, while it apparently also has a negative effect on tone on its own, this effect seems to be in large part driven by its very skewed distribution within versus outside Africa, and does not survive most of the correction techniques used here.

However, there are differences between methods that might give hints about the nature of the effects and confounding factors. One of the most important concerns the mixed-effects regressions on the full dataset, as they implement one of the currently de facto standard methods of controlling for the confounding effects of inheritance and contact [27, 28] . While the "classic" maximum-likelihood approach, with family as a random effect and macroarea as a fixed effect, fails to find any significant effect of ASPM-D on tone at the standard α-level of 0.05, the Bayesian ones, with family and (meta)population as random effects, and macroarea as random effect or as modelled by a bi-dimensional Gaussian Process, do find some evidence for a weak negative effect of both ASPM-D and MCPH1-D on all three measures of tone, but this evidence is far from overwhelming on its own. As the randomisation, restricted sampling, and the mediation and path analyses show, this failure very probably emerges, in large part, from, on the one hand, the very high similarity of the languages and of the genetic samples from the same family and macroarea (especially, for the genetic samples, Africa versus the rest of the world), and, on the other, from the inadequate modelling of the relationship between macroarea and the frequency of the two "derived" alleles. To end, the fact that modelling contact through a set of continuous 2D Gaussian Processes (one per macroarea) over the geographic coordinates of the samples, finds a stronger negative effect of MCPH1-D on tone1 and tone counts, and comparable on tone2, than ASPM-D, suggests that the story might be even more complex, with either an actual effect of MCPH-D on tone in certain circumstances or pointing to issues with using a Gaussian Processes to control for contact in such a way.

Thus, the more (and, in some cases, more refined) data that became available since 2007 concerning both the population frequency of the two "derived" alleles (either directly, or inferred through proxies in high linkage disequilibrium that seem to induce minimal noise) and tone, and the newer methods that became popular since, support a weak negative effect of ASPM-D but probably not of MCPH1-D after removing the effects of contact and inheritance, modelled as macroareas and language families, respectively. Nevertheless, the work presented here shows the main limiting factor remains the availability of good quality genetic data with good geographic and linguistic coverage (hopefully, such data will become available from more populations across the globe, and maybe even from past groups using ancient DNA techniques), coupled with appropriate methods of analysis. For example, there are only 10 data points from the Americas and 10 from Papunesia, and none from Australia, the latter being a particularly interesting case [54] . If, indeed, ASPM-D has a weak negative effect on tone, and if ASPM-D has a low frequency among the Aboriginal Australian populations (as could be reasonably expected given the age of the allele and the apparently long genetic isolation of Australia [40, 100] ), then we would expect to find at least some tone languages among its Aboriginal languages, but this is clearly not the case [34, 35] . However, there are at least two solutions to this (potential) paradox [54] : first, the small negative bias of ASPM-D on tone is inherently probabilistic and, moreover, does not place constraints on language change and evolution at (very) low frequencies of this allele, so that it is not entirely inconceivable that accidents of history and linguistic expansions and extinctions have resulted in a linguistic Australian landscape that does not include tone at this moment. The second is based on the suggested longterm effects of a high incidence of Chronic Otits Media (COM) among the Aborigine populations on the phonetics and phonology of Australian languages [101], in particular the loss of sensitivity in the low frequency range which, naturally, should explain the absence of tone distinctions (just as it explains a lack of voicing contrasts).

Interestingly, while the original data and analyses in [9] did not find anything that would suggest a qualitative difference between ASPM-D and MCPH1-D in how they affect tone, the results reported here do, with the effect of MCPH1-D apparently confounded, in large part, by its skewed distribution inside versus outside Africa. This is consonant with the experimental findings of [25] and [26] , which supported a (negative) effect of ASPM-D on tone perception and/or processing, but failed to find any effect of MCPH1-D.

Supporting information S1 Fig. The agreement between the 5 sources for tone. Each panel shows a pair of sources (e.g., the top-left panel shows LAPSyD on the vertical axis and DL2007 on the horizontal axis); please note that the pairs are symmetric, so that the DL2007 vs LAPSyD panel is not shown; likewise, the identity panels (e.g., LAPSyD vs LAPSyD) on the diagonal are also not shown. Each panel shows the number of languages with each possible combination of values from the two sources (e.g., for "None" vs "No", there are 12 languages, but there's no language for "None" vs "Yes"). The shade of blue varies between white (the lowest count) to light blue (highest count). A high agreement between two sources results in little discrepancy between corresponding values (e.g., all "None" in LAPSyD map to "No" in DL2007 and vice-versa, while "Marginal", "Simple", "Moderately complex" and "Complex" map to "Yes"). 

Language is not isolated from its wider environment: vocal tract influences on the evolution of speech and language. Language and Communication

Human sound systems are shaped by post-Neolithic changes in bite configuration

Language evolution and climate: the case of desiccation and tone

Evidence for Direct Geographic Influences on Linguistic Sounds: The Case of Ejectives

The evolution of language families is shaped by the environment beyond neutral drift

Language as shaped by the brain

Language change: Contributions to the study of its causes. Berlin: Mouton de Gruyter

Origins of Sound Change: Approaches to Phonologization

Linguistic tone is related to the population frequency of the adaptive haplogroups of two brain size genes, ASPM and Microcephalin

Languages and Genes: Reflections on Biolinguistics and the Nature-Nurture Question

Are languages really independent from genes? If not, what would a genetic bias affecting language diversity look like?

Phonetic explanations for the development of tones

Cambridge Textbooks in Linguistics

Towards a Theory of Tonogenesis: An Empirical, Physiologically and Perceptually Based Account of the Development of Tonal Contrasts in Languages

Linguistic Diversity and Traffic Accidents: Lessons from Statistical Studies of Cultural Traits

Circular analysis in systems neurosciencethe dangers of double dipping

The role of genetic biases in shaping language-genes correlations

Genetic biasing through cultural transmission: do simple Bayesian models of language evolution generalize

A Bayesian phylogenetic approach to estimating the stability of linguistic features and the genetic biasing of tone

Some Structural Aspects of Language Are More Stable than Others: A Comparison of Seven Methods

Geospatial distributions reflect rates of evolution of features of language

Patterns of individual differences in the perception of missing-fundamental tones

Factors Influencing Sensitivity To Lexical Tone In An Artificial Language

Repetition Suppression in the Left Inferior Frontal Gyrus Predicts Tone Learning Performance

The Derived Allele of ASPM Is Associated with Lexical Tone Perception

ASPM-lexical tone association in speakers of a tone language: Direct evidence for the genetic-biasing hypothesis of language evolution

Correlational Studies in Typological and Historical Linguistics

Mixed effect models for genetic and areal dependencies in linguistic typology

Data Analysis Using Regression and Multilevel/Hierarchical Models

Statistical Rethinking: A Bayesian Course with Examples in R and Stan

Zú ñiga F. Randomization tests in language typology

Mapping the Origins and Expansion of the Indo-European Language Family

A Statistical Explanation of the Distribution of Sortal Classifiers in Languages of the World via Computational Classifiers

Leipzig: Max Planck Institute for Evolutionary Anthropology

Leipzig: Max Planck Institute for Evolutionary Anthropology

D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity

The 1000 Genomes Project Consortium. A global reference for human genetic variation

The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

ALFRED: the ALelle FREquency Database

Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens

Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans

What primary microcephaly can tell us about brain growth

Positive selection in ASPM is correlated with cerebral cortex evolution across primates but not with whole-brain size

Adaptive Evolution of Four Microcephaly Genes and the Evolution of Brain Size in Anthropoid Primates

Microcephaly genes evolved adaptively throughout the evolution of eutherian mammals

Human evolutionary genetics

Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens" and "Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans

Comment on papers by Evans et al. and Mekel-Bobrov et al. on Evidence for Positive Selection of MCPH1 and ASPM

Superior: The Return of Race Science

The ongoing adaptive evolution of ASPM and Microcephalin is not explained by increased intelligence

No evidence that polymorphisms of brain regulator genes Microcephalin and ASPM are associated with general mental ability, head circumference or altruism

The relationship between Microcephalin, ASPM and intelligence: A reconsideration

The spread of alphabetical writing may have favored the latest variant of the ASPM gene

Genetic influences on tone? New experimental evidence and its consequences for linguistics. PsyArXiv; 2020

The Heritability of Language: A Review and Metaanalysis of Twin, Adoption, and Linkage Studies

Tangled webs: tracing the connections between genes and cognition

A forkhead-domain gene is mutated in a severe speech and language disorder

Non-spurious correlations between genetic and linguistic diversities in the context of human evolution

Evolution of Syntax: Memes-Genes Coevolution in the Multiregional Context

Real and spurious correlations involving tonal languages

The World Atlas of Language Structures

The Human Genome Diversity Project: past, present and future

The comparative method in anthropology

The detection of disease clustering and a generalized regression approach

Genetic Distance between Populations

Ethnologue: Languages of the World. 15th ed. SIL International

World phonotactics database

Glottolog 3.2. Jena: Max Planck Institute for the Science of Human History

Large Linguistic Areas and Language Sampling

Linguistic Diversity in Space and Time

Some Principles on the Use of Macro-Areas in Typological Comparison

Who We Are and How We Got Here: Ancient DNA and the new science of the human past

Interindividual variation refuses to go away: a Bayesian computer model of language change in communicative networks

Causality: Models, Reasoning, and Inference

The Book of Why: The New Science of Cause and Effect

brms: An R Package for Bayesian Multilevel Models Using Stan

R: A Language and Environment for Statistical Computing

Stan Modeling Language Users Guide and Reference Manual

Fitting Linear Mixed-Effects Models Using lme4

glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling

The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective

Gaussian Processes for Machine Learning

The Oxford Handbook of Linguistic Typology

Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots

Mediation Analysis. Annual review of psychology

mediation: R Package for Causal Mediation Analysis

Statistical Functions for Regression Models (Version 0.18.0)

Principles and Practice of Structural Equation Modeling

lavaan: An R Package for Structural Equation Modeling

partykit: A Modular Toolkit for Recursive Partytioning in R

Classification and Regression by randomForest. R News

phytools: An R package for phylogenetic comparative biology (and other things)

Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language

Making Genealogical Language Classifications Available for Phylogenetic Analysis: Newick Trees, Unified Identifiers, and Branch Length

Support for linguistic macrofamilies from weighted sequence alignment

Abstract Profiles of Structural Stability Point to Universal Tendencies, Family-Specific Factors, and Ancient Connections between Languages

Ultraconserved words point to deep language ancestry across Eurasia

simr: an R package for power analysis of generalised linear mixed models by simulation

Quantitative Standards for Absolute Linguistic Universals

Genomic insights into the human population history of Australia and New Guinea

Special thanks D. Robert Ladd and Patrick Wong. I also wish to thank Huma-num (https:// www.huma-num.fr) for the use of their computing infrastructure for parts of this project. I wish to thank the three anonymous reviewers whose comments helped improve the analysis and the paper. Finally, as far as possible from a "thank you", but I have to mention that this work was largely done during the three full lock-downs that France experienced in spring and autumn 2020 and spring 2021 due to Covid-19.

Conceptualization: Dan Dediu.