Significance Testing of Word Frequencies in Corpora

Author: Jefrey Lijffijt
Affiliation: Aalto University
Current affiliation: University of Bristol
Mail: University of Bristol, Department of Engineering Mathematics, MVB Woodland Road, Bristol, BS8 1UB, United Kingdom. E-mail: jefrey.lijffijt@bristol.ac.uk

Author: Terttu Nevalainen
Affiliation: University of Helsinki

Author: Tanja Säily
Affiliation: University of Helsinki

Author: Panagiotis Papapetrou
Primary affiliation for this manuscript: Aalto University
Current affiliation: Stockholm University

Author: Kai Puolamäki
Primary affiliation for this manuscript: Aalto University
Current affiliation: Finnish Institute of Occupational Health

Author: Heikki Mannila
Affiliation: Aalto University

Abstract

Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (2005), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (2001) and Paquot and Bestgen (2009), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article, we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence of the data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.

1. Introduction

Comparison of word frequencies is among the core methods in corpus linguistics and is frequently employed as a tool for different tasks, including generating hypotheses and identifying a basis for further analysis. In this study, we focus on the assessment of the statistical significance of differences in word frequencies between corpora. Our goal is to answer questions such as ‘Is word X more frequent in male conversation than in female conversation?’ or ‘Has word X become more frequent over time?’.

Statistical significance testing is based on computing a p-value, which indicates the probability of observing a test statistic that is equal to or greater than the test statistic of the observed data, under the assumption that the data follow the null hypothesis. If a p-value is small (i.e. below a given threshold α), then we reject the null hypothesis. In the case of comparing the frequencies of a given word in two corpora, the test statistic is the difference between these frequencies and, put simply, the null hypothesis is that the frequencies are equal. However, to employ a test, the data have to be represented in a certain format, and by choosing a representation we make additional assumptions. For example, to employ the χ² test, we represent the data in a 2×2 table, as illustrated in Table 1.
We refer to this representation as the bag-of-words model. This representation does not include any information on the distribution of the word X within the corpora. When using this representation and the χ² test, we implicitly assume that all words in a corpus are statistically independent samples. The reliance on this assumption when computing the statistical significance of differences in word frequencies has been challenged previously; see, for example, Evert (2005) and Kilgarriff (2005).

Table 1  The 2×2 table that is used when employing the χ² test

            Corpus S   Corpus T
Word X      A          B
Not word X  C          D

Hypothesis testing as a research framework in corpus linguistics has been debated but remains, in our view, a valuable tool for linguists. A general account of how to employ hypothesis testing or keyword analysis for comparing corpora can be found in Rayson (2008). We observe that the discussion regarding the usefulness of hypothesis testing in the field of linguistics has often been conflated with discussions pertaining to the assumptions made when employing a certain representation and statistical test. Kilgarriff (2005) asserts that the ‘null hypothesis will never be true’ for word frequencies. In response, Gries (2005) argues that the problems posed by Kilgarriff can be alleviated by looking at (measures of) effect sizes and confidence intervals, and by using methods from exploratory data analysis.

Our main point is different from that of Gries (2005). While we endorse Kilgarriff’s conclusion that the assumption that all words are statistically independent is inappropriate, the invalidity of one assumption does not imply that there are no comparable representations and tests that are based on credible assumptions. As pointed out in Kilgarriff (2001) and Paquot and Bestgen (2009), it is possible to represent the data differently and employ other tests, such as the t-test or the Wilcoxon rank-sum test, such that we assume independence at the level of texts rather than individual words. An alternative approach to the 2×2 table presented above is to count the number of occurrences of a word per text, and then compare the list of (normalized) counts from one corpus against the list of counts from the other corpus. An illustration of this representation is given in Table 2. This approach has the advantage that we can account for the distribution of the word within the corpus.

Table 2  The frequency lists that are used when employing the t-test. The lists do not have to be of equal length, as the corpora may contain an unequal number of texts.

Corpus S                        Text S1   Text S2   …   Text S|S|
Normalized frequency of word X  S1        S2        …   S|S|

Corpus T                        Text T1   Text T2   …   Text T|T|
Normalized frequency of word X  T1        T2        …   T|T|

We emphasize that the utility of hypothesis testing critically depends on the credibility of the assumptions that underlie the statistics. We share Kilgarriff’s (2005) concern that application of the χ² test leads to finding spurious results, and we agree with Kilgarriff (2001) and Paquot and Bestgen (2009) that there are more appropriate alternatives, which, however, have not been implemented in current corpus linguistic tools. We re-examine the alternatives and provide new insights by analysing the differences between six statistical tests in a controlled resampling setting, as well as in a practical setting.
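To make the two representations concrete, the following minimal Python sketch builds both from a pair of toy corpora. The texts and helper names are our own illustrations, not part of any existing tool; in practice, each text would be a full corpus text rather than a single sentence.

    # Toy corpora: each corpus is a list of texts; each text is a list of tokens.
    corpus_s = ["the cat sat on the mat".split(), "the dog barked".split()]
    corpus_t = ["a cat and a dog".split(), "the end".split()]

    def bag_of_words_table(word, s_texts, t_texts):
        """2x2 table (Table 1): word vs. non-word counts per corpus."""
        size_s = sum(len(t) for t in s_texts)
        size_t = sum(len(t) for t in t_texts)
        a = sum(t.count(word) for t in s_texts)
        b = sum(t.count(word) for t in t_texts)
        return [[a, b], [size_s - a, size_t - b]]

    def per_text_frequencies(word, texts):
        """Frequency lists (Table 2): normalized frequency of the word per text."""
        return [t.count(word) / len(t) for t in texts]

    print(bag_of_words_table("the", corpus_s, corpus_t))
    print(per_text_frequencies("the", corpus_s), per_text_frequencies("the", corpus_t))

The first representation discards the per-text distribution that the second one retains; the remainder of the paper examines the consequences of that choice.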
The question of which method is most appropriate for assessing the significance of differences in word frequencies, or of other statistics, is not new. Dunning (1993) and Rayson and Garside (2000) suggest that a log-likelihood ratio test is preferable to a χ² test because the latter is inaccurate when the expected values are small (< 5). Rayson et al. (2004) propose using the χ² test with a modified version of Cochran’s rule. Kilgarriff (2001) concludes that the Wilcoxon rank-sum test is more appropriate than the χ² test for identifying differences between two corpora, but his study is limited to a qualitative analysis of the top 25 words identified by the two methods. Kilgarriff (2005) criticizes the hypothesis testing approach because the χ² test finds numerous significant results, even in random data. Hinneburg et al. (2007) study methods based on bootstrapping and Bayesian statistics for comparing small samples. Paquot and Bestgen (2009) present a study of the similarities and differences between the t-test, the log-likelihood ratio test, and the Wilcoxon rank-sum test; however, their study is also limited to a qualitative analysis of the differences. They recommend using multiple tests, or the t-test if only one method is to be applied. Lijffijt et al. (2011) illustrate that the bootstrap and inter-arrival time tests provide more conservative p-values than those provided by bag-of-words-based models (i.e. tests based on the assumption that all words are statistically independent), which include the χ² and log-likelihood ratio tests. Lijffijt et al. (2012) conduct a detailed study of lexical stability over time in the Corpus of Early English Correspondence, using both the log-likelihood ratio and bootstrap tests, and conclude that the log-likelihood ratio test marks spurious differences as significant. Relevant, but not discussed further here, is the need for balanced corpora when comparing word frequencies (Oakes and Farrow, 2007).

We find that some statistical tests that are commonly used in corpus linguistics, such as the χ² and log-likelihood ratio tests (Dunning, 1993; Rayson and Garside, 2000), are anti-conservative, that is, their p-values are excessively low, when we assume that a corpus is a collection of statistically independent texts. We perform experiments based on a subcorpus of the British National Corpus (BNC, 2007) that contains all texts from the prose fiction genre. We quantify the potential bias of the tests based on the uniformity of p-values when we randomly divide the set of texts into two groups. This method is further explained in Section 3. Moreover, we show that the errors in the estimates differ from word to word, depending on the dispersion of each word in the corpus. To quantify the dispersion of a word, we use the measure DPnorm, which was introduced in Gries (2008) and refined in Lijffijt and Gries (2012). Because the bias that we observe does not solely depend on word frequency, we cannot simply use higher cut-off values in the χ² or log-likelihood ratio tests to correct for it. Notably, even the ranking of words in terms of their significance changes between tests. Finally, we perform a keyword analysis of the differences between male and female authors, as annotated by Lee (2001), using two methods. We find that the differences between the methods are substantial and thus necessitate the use of a representation and statistical test such that the distribution of the frequency over texts is properly taken into account (the t-test, Wilcoxon rank-sum test, or the bootstrap test).
2. Why the Bag-of-Words Model is Inappropriate

The χ² and log-likelihood ratio tests are based on the bag-of-words model (illustrated in Table 1), in which all words in a corpus are assumed to be statistically independent. From the perspective of any word, the corpus is modelled as a Bernoulli process, i.e. a sequence of biased coin flips, which results in word frequencies that follow a binomial distribution (Dunning, 1993). The bag-of-words model thus implicitly assumes both a mean frequency and a certain variance of the frequency over texts, and hence an expected dispersion. Figure 1 shows the observed frequency distribution of the word I in the British National Corpus and the expected frequency distribution under the bag-of-words model. The observed distribution and the distribution predicted by the bag-of-words model clearly differ.

Fig. 1  The frequency distribution of I in the British National Corpus. The grey bars show a histogram of the observed distribution, and the black dotted line shows the expected distribution under the bag-of-words model, on which the χ² and log-likelihood ratio tests are based. Compared with the prediction, the observed distribution has much greater variance, which demonstrates that the bag-of-words model is not an appropriate choice when comparing corpora, even for highly frequent words.

Another example is presented in Table 3, which shows p-values for the hypothesis that the name Matilda is used at an equal frequency by male and female authors in the prose fiction subcorpus of the British National Corpus. This subcorpus is presented in Section 4. The frequency for male authors is 56.7 per million words (absolute frequency 408), and the frequency for female authors is 20.2 per million words (absolute frequency 169). With more than 500 occurrences in the fiction subcorpus, we might easily trust the results of the χ² and log-likelihood ratio tests, which show that male authors use this name more often than female authors. However, the other tests (the t-test, Wilcoxon rank-sum test, inter-arrival time test, and bootstrap test) indicate that the observed frequency difference is not unlikely to occur at random. The reason that the methods disagree is that the word is used in only 5 of the 409 texts (1 text written by a male author and 4 texts written by female authors), with an uneven frequency distribution: one text contains 408 instances, and the other texts contain 155, 11, 2, and 1 instances, respectively. This uneven distribution should lead to an uncertain estimate of the mean frequency. In other words, the variance of the frequency of Matilda is very high. The χ² and log-likelihood ratio tests do not account for the uneven distribution, as these tests use only the total number of words in a corpus, and as a result they underestimate the uncertainty.

Table 3  p-values for the hypothesis that male and female authors use the name Matilda at an equal frequency, based on the prose fiction subcorpus of the British National Corpus

χ² test   Log-likelihood ratio test   Welch’s t-test   Wilcoxon rank-sum test   Inter-arrival time test   Bootstrap test
< 0.0001  < 0.0001                    0.4393           0.1866                   0.5826                    0.7768
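The disagreement in Table 3 is easy to reproduce in outline. The following Python sketch contrasts a χ² test on the corpus totals with Welch’s t-test on per-text frequencies, using the per-text Matilda counts given above. For simplicity it assumes all texts have equal length, so the resulting p-values only approximate those in Table 3, but they agree in direction.

    import numpy as np
    from scipy import stats

    words_per_text = 35_000                  # assumed uniform text length
    male = np.zeros(203); male[0] = 408      # one male-authored text holds all 408 hits
    female = np.zeros(206); female[:4] = [155, 11, 2, 1]

    # Bag-of-words view: only the corpus totals enter the 2x2 table.
    size_m, size_f = 203 * words_per_text, 206 * words_per_text
    table = [[408, 169], [size_m - 408, size_f - 169]]
    chi2_stat, p_chi2, dof, expected = stats.chi2_contingency(table)

    # Text-level view: compare the normalized per-text frequencies.
    t_stat, p_t = stats.ttest_ind(male / words_per_text,
                                  female / words_per_text, equal_var=False)
    print(p_chi2, p_t)    # p_chi2 is tiny; p_t is far from significant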
The remainder of this paper is structured as follows. In Section 3, we present the significance testing methods, the uniformity test, and the dispersion measure. In Section 4, we describe the data that are used. In Section 5, we compare the methods in a series of experiments based on random divisions of the corpus, and in Section 6 we describe the differences between male and female authors that were identified using the various methods. Section 7 briefly concludes the paper.

3. Methods

In this section, we briefly discuss the mathematical models and assumptions that underlie each of the six methods discussed in the introduction. A summary of the essential differences is given in Section 3.8. The statistical test employed in the controlled random sampling experiment (Section 5) is presented in Section 3.9, and the measure of dispersion that we use is presented in Section 3.10. Readers less interested in the specifics of the statistical tests may proceed directly to Section 3.8 and then to Section 4.

3.1 Notation

We use q to denote the word that we intend to compare in two corpora, and S and T to denote the two corpora. Corpus S contains |S| texts and size(S) words. We use subscripts to indicate individual texts: S1, S2, …, S|S|. We express the relative frequency of word q in corpus S as freq(q,S). Each of the following six methods computes a p-value for the hypothesis that a word has an equal frequency in the two corpora, freq(q,S) = freq(q,T), against the alternative hypothesis that the frequencies are not equal: freq(q,S) > freq(q,T) or freq(q,S) < freq(q,T). Thus, conforming to the tradition in corpus linguistics, all methods provide two-tailed p-values.

3.2 Method 1: Pearson’s χ² Test

Pearson’s χ² test, which is also known as the χ² test for independence or simply as the χ² test, is based on the assumption that a text or corpus can be modelled as a sequence of independent Bernoulli trials. Each Bernoulli trial is a random event with a binary outcome; thus, the entire sequence is similar to a sequence of biased coin flips. Under the assumption of independent Bernoulli trials, the probability distribution for the word frequency is given by the probability mass function of the binomial distribution. Let n be the size of the corpus and p the relative frequency of a word. The probability of observing this word exactly k times is given by

    \Pr(K = k) = \binom{n}{k} p^k (1 - p)^{n-k}.    (1)

This distribution is approximately normal with mean np and variance np(1 - p) when np(1 - p) > 5 (Dunning, 1993). The fact that this distribution is well approximated by a normal distribution is used in the χ² test.

The test is conducted as follows. Let O1 = freq(q,S) · size(S) and O2 = freq(q,T) · size(T), which are the observed frequencies of q in S and T, respectively. Let p be the relative frequency over the combined corpora, i.e. p = (O1 + O2) / (size(S) + size(T)). We define the expected frequencies in S and T as E1 = p · size(S) and E2 = p · size(T), respectively. The test statistic X² using Yates’ correction is given by

    X^2 = \frac{(|O_1 - E_1| - 0.5)^2}{E_1} + \frac{(|O_2 - E_2| - 0.5)^2}{E_2}.    (2)

The test statistic asymptotically follows a χ² distribution with one degree of freedom. The p-value can be obtained by comparing the test statistic to a table of χ² distributions. The χ² test is available in most statistical software programs and is implemented in tools such as WordSmith Tools (Scott, 2012) and BNCweb (Hoffmann et al., 2008).
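As a minimal sketch of this procedure, the following Python function implements Equation (2) directly, with scipy’s χ² distribution supplying the p-value. The counts and corpus sizes in the example are the Matilda figures from Section 2.

    from scipy.stats import chi2

    def chi_squared_test(o1, size_s, o2, size_t):
        """Two-tailed p-value for equal relative frequency of a word in S and T."""
        p = (o1 + o2) / (size_s + size_t)        # pooled relative frequency
        e1, e2 = p * size_s, p * size_t          # expected frequencies E1 and E2
        x2 = ((abs(o1 - e1) - 0.5) ** 2 / e1     # Equation (2), Yates-corrected
              + (abs(o2 - e2) - 0.5) ** 2 / e2)
        return chi2.sf(x2, df=1)                 # upper tail of chi2 with 1 df

    print(chi_squared_test(408, 7_200_000, 169, 8_400_000))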
3.3 Method 2: Log-Likelihood Ratio Test

The χ² test is based on two approximations: the normal distribution approximates the binomial distribution, and the test statistic asymptotically follows a χ² distribution. Because of this double approximation, the χ² test is inapplicable when the word frequency is small (< 5). For this reason, Dunning (1993) introduces a test that is based not on the normality approximation but on the likelihood ratio. This test is called the log-likelihood ratio test and is also known as the G² test.

The likelihood function H(p; n, k) is the same as Pr(K = k) in Equation (1); the only difference is that we explicitly mention the parameter p. The likelihood ratio is the probability of the data when we have only one parameter, p (for both corpora), divided by the probability when we have two parameters, p1 and p2 (one for each corpus). The precise mathematical formulation is given by p1 = freq(q,S), n1 = size(S), k1 = freq(q,S) · size(S), p2 = freq(q,T), n2 = size(T), k2 = freq(q,T) · size(T), and p = (k1 + k2) / (n1 + n2). The likelihood ratio is defined as

    \lambda = \frac{H(p; n_1, k_1) \cdot H(p; n_2, k_2)}{H(p_1; n_1, k_1) \cdot H(p_2; n_2, k_2)}.    (3)

We set the parameters p1, p2, and p to the values that maximize the likelihood function. The full derivation can be found in Dunning (1993). The log-likelihood ratio test is based on the fact that the quantity -2 log λ asymptotically follows a χ² distribution with degrees of freedom equal to the difference in the number of parameters between the two models (i.e. one in this instance). The quantity -2 log λ is used as the test statistic. Dunning (1993) claims that this test statistic approaches its asymptotic distribution much faster than the test statistic of the χ² test and is thus preferable, especially when the expected frequency is low. Again, the final p-value is computed by comparing the test statistic to a table of χ² distributions. The log-likelihood ratio test is available in many statistical software programs and is implemented in tools such as Wmatrix (Rayson, 2008), WordSmith Tools (Scott, 2012), and BNCweb (Hoffmann et al., 2008).

Similar to the χ² test, this method is based on the bag-of-words model, the representation illustrated in Table 1, and thus on the assumption that each word can be modelled as an independent Bernoulli trial. As a result, the test ignores all structure in the corpus, and even in texts and sentences. We refer to any method that is based on this assumption as a bag-of-words test. There exist other bag-of-words tests that are not based on approximations of the probability mass function given in Equation (1) but instead on direct summation of its values. Such tests provide more accurate probabilities, especially for small frequencies, under the bag-of-words assumption. Examples include Fisher’s exact test and the binomial test. We expect these methods to perform similarly to the χ² and log-likelihood ratio tests for low word frequencies, and as the frequency increases, the results will converge, because all of these tests are based on the bag-of-words assumption and Equation (1). For brevity, we do not consider other bag-of-words tests in this paper.
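A compact Python sketch of this test: the function below computes -2 log λ from binomial log-likelihoods (the binomial coefficients cancel in the ratio and are therefore omitted) and refers it to a χ² distribution with one degree of freedom. The example reuses the Matilda counts.

    import math
    from scipy.stats import chi2

    def log_lik(p, n, k):
        """log H(p; n, k) without the constant binomial coefficient."""
        if p in (0.0, 1.0):                      # guard against log(0)
            return 0.0 if k in (0, n) else float("-inf")
        return k * math.log(p) + (n - k) * math.log(1 - p)

    def llr_test(k1, n1, k2, n2):
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)
        stat = 2 * (log_lik(p1, n1, k1) + log_lik(p2, n2, k2)
                    - log_lik(p, n1, k1) - log_lik(p, n2, k2))   # -2 log lambda
        return chi2.sf(stat, df=1)

    print(llr_test(408, 7_200_000, 169, 8_400_000))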
3.4 Method 3: Welch’s t-Test

A t-test is a significance test in which the test statistic follows a Student’s t-distribution. We intend to compare two groups of samples while making a minimum number of assumptions. We use Welch’s t-test, which is based on the assumption that the mean frequency follows a Gaussian distribution. Welch’s t-test is more general than Student’s t-test because the former does not assume equal variance in the two populations. Welch’s t-test provides a p-value for the hypothesis that the means of the two distributions are equal. The test statistic is the normalized difference between the means of the word frequencies. Let x̄1 be the mean of the frequency of q over the texts in S, and let s1 be the corresponding standard deviation. Likewise, let x̄2 be the mean of the frequency of q over the texts in T, and let s2 be the standard deviation. The test statistic t is given by

    t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / |S| + s_2^2 / |T|}}.    (4)

The test statistic follows a Student’s t-distribution with degrees of freedom that depend on the variances of the populations. An exact solution to this problem is unknown, but Welch’s t-test is based on the Welch–Satterthwaite equation, which provides an approximate solution (Welch, 1947). Implementations of this test are available in statistical software programs, including R and Microsoft Excel.

NB. It is often claimed that Student’s and Welch’s t-tests are applicable only if the data follow a normal distribution. This is not true; the assumption is that the test statistic follows a normal distribution. In this case, the test statistic is the difference between the two means, and this statistic does not in general follow a normal distribution. However, the Central Limit Theorem (CLT) states that, under very general conditions, the mean of a set of independent random variables approaches normality very quickly as the number of samples increases. Since the frequency of a word per text is bounded, the conditions for the CLT are met, and the means x̄1 and x̄2, as well as their difference, are approximately normal when the number of texts is sufficiently large. For small corpora, it is a priori not clear whether the test is an appropriate choice.

3.5 Method 4: Wilcoxon Rank-Sum Test

The Wilcoxon rank-sum test, which is also known as the Mann-Whitney U-test, is a statistical test that does not make any assumption regarding the shape of the distribution of the quantity of interest. It is based on the fact that if the distributions of q for the two corpora are equal, then it is possible to induce a probability distribution over the rank orders (Wilcoxon, 1945; Mann and Whitney, 1947). The test is performed as follows. We order all samples based on the frequency of word q, regardless of the corpus in which these samples are located. This gives us a ranked series, an example of which is shown in Table 4.

Table 4  Example of a ranked series

Rank    1  2  3  4  5  6  7  8  9  10
Corpus  S  T  T  T  S  S  S  T  T  S

The test statistic U is then defined as the sum of the ranks of the texts of the smaller corpus. In this example, because both corpora have a size of 5, we can select either S or T. We find that US = 1 + 5 + 6 + 7 + 10 = 29 and UT = (n² + n)/2 - US = 55 - 29 = 26, where n = 10 is the total number of texts. We obtain a p-value for small n by comparing the test statistic with a statistical table, and if n > 20, then the distribution of the test statistic is well approximated by a Gaussian distribution with known parameters. Implementations of this test are available in statistical software programs, such as R.

Multiple texts may have equal frequencies for a word. Particularly for infrequent words, numerous texts in a corpus may have a frequency of zero. The Wilcoxon rank-sum test accounts for texts with equal frequencies by assigning to each text the average rank over all equal-frequency texts. For example, if there are five texts with a frequency of zero, then each of these texts is assigned a rank of 3.
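Both of these text-level tests (Sections 3.4 and 3.5) operate on the frequency lists of Table 2 and are available directly in scipy. A minimal sketch, with hypothetical per-text frequencies expressed per million words:

    from scipy import stats

    # Hypothetical normalized frequencies of word q per text (Table 2 layout).
    s = [56.1, 40.2, 0.0, 12.7, 33.3]     # texts of corpus S
    t = [10.0, 0.0, 5.5, 0.0, 2.1]        # texts of corpus T

    # Welch's t-test: Student's t-test without the equal-variance assumption.
    t_stat, p_welch = stats.ttest_ind(s, t, equal_var=False)

    # Wilcoxon rank-sum / Mann-Whitney U test; ties receive average ranks.
    u_stat, p_wilcoxon = stats.mannwhitneyu(s, t, alternative="two-sided")

    print(p_welch, p_wilcoxon)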
3.6 Method 5: Inter-Arrival Time Test

A novel significance test that is specifically designed for frequency counts in sequences is the inter-arrival time test, which was introduced by Lijffijt et al. (2011). This test is based on the spatial distribution of a word in a corpus, as modelled by the distribution of inter-arrival times between occurrences of the word. The assumption is that the inter-arrival time distribution of a word captures the behavioural pattern of the word in a corpus. Savický and Hlaváčová (2002) use the inter-arrival time distribution to define a corrected frequency that captures whether words that are frequent in a corpus are ‘common’ or not, and Altmann et al. (2009) report that the inter-arrival time distribution of a word, as summarized in a burstiness parameter, is a good predictor of word class.

The significance test is performed as follows. The inter-arrival times are obtained by counting the number of words between each pair of consecutive occurrences of word q, plus one. The texts in the corpus are ordered randomly, and the corpus is treated as though it were placed on a ring: the end of the corpus is attached to the beginning. We begin counting at the first occurrence and continue until we again reach the first occurrence. For example, assume that we have a corpus with ten words and two occurrences of word q (Table 5).

Table 5  Example of a small corpus

Index  1  2  3  4  5  6  7  8  9  10
Word   x  x  q  x  x  x  q  x  x  x

The inter-arrival times for this corpus are 3 + 1 = 4 and 5 + 1 = 6; thus, the empirical inter-arrival time distribution is {4, 6}. By definition, the number of inter-arrival times is equal to the number of occurrences in the corpus, and the sum of the inter-arrival times equals the size of the corpus.

The significance test is based on the production of random corpora by repeatedly sampling inter-arrival times from the empirical inter-arrival time distribution. The first occurrence must be sampled from a different distribution (Lijffijt et al., 2011). After we obtain the index of the first occurrence, we sample uniformly at random an inter-arrival time from the empirical inter-arrival time distribution and insert a new occurrence of q at the position given by this inter-arrival time. This process is repeated until we exceed the length of the corpus.

In Lijffijt et al. (2011), the significance test is based on a foreground corpus S and a background corpus T. The test is performed by comparing the observed frequency of q in S to the frequency in randomized corpora with sizes equal to S but based on the inter-arrival time distribution of T. The test is one-tailed, and the alternative hypothesis is freq(q,S) > freq(q,T). The test is also asymmetrical, in that the p-value for freq(q,S) > freq(q,T) is not necessarily the same as that for freq(q,S*) < freq(q,T*) if we set S* = T and T* = S, because only one corpus is randomized. We adopt a slightly different approach that does not have these issues. We create random corpora S1 to SN, based on the inter-arrival time distribution of S, and random corpora T1 to TN, based on the inter-arrival time distribution of T, with all sizes equal to the size of the smaller corpus. The one-tailed p-value is given by the mid-P test (Berry and Armitage, 1995):

    p = \frac{1}{N} \sum_{i=1}^{N} H\bigl( freq(q, T_i) - freq(q, S_i) \bigr),    (5)

where

    H(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0.5 & \text{if } x = 0, \\ 0 & \text{if } x < 0. \end{cases}

We can convert this to a two-tailed p-value (Dudoit et al., 2003) using the following equation:

    p_{two} = 2 \cdot \min(p, 1 - p).    (6)

Because the p-value is an empirical estimate and the real p-value that we are approximating may be small, the use of smoothing is appropriate (North et al., 2002). Thus, the final p-value is computed as follows:

    p^* = \frac{p_{two} \cdot N + 1}{N + 1}.    (7)

The value p* is used as the p-value for this test in our experiments. Obtaining the p-values takes longer than with the other methods, as it requires sampling many pseudorandom numbers. Specifically, computing the p-values for all types takes on the order of N times the number of tokens in the corpus steps. For example, for the experiment presented in Section 6, this process takes several minutes.
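The following Python sketch condenses this procedure. It is an approximation of the method as described: in particular, the position of the first occurrence is drawn uniformly at random here, rather than from the separate distribution used by Lijffijt et al. (2011), and the occurrence positions in the example are the Table 5 toy data.

    import random

    def inter_arrival_times(positions, corpus_size):
        """Gaps between consecutive occurrences on the ring; they sum to corpus_size."""
        k = len(positions)
        return [(positions[(i + 1) % k] - positions[i]) % corpus_size or corpus_size
                for i in range(k)]

    def sampled_count(iats, corpus_size, rng):
        """Word count in one randomized corpus built by resampling inter-arrival times."""
        pos, count = rng.randrange(corpus_size), 0   # simplified first occurrence
        while pos < corpus_size:
            count += 1
            pos += rng.choice(iats)
        return count

    def iat_test(pos_s, size_s, pos_t, size_t, n=9999, seed=0):
        rng = random.Random(seed)
        size = min(size_s, size_t)                   # both series use the smaller size
        iat_s = inter_arrival_times(sorted(pos_s), size_s)
        iat_t = inter_arrival_times(sorted(pos_t), size_t)
        diffs = [sampled_count(iat_t, size, rng) - sampled_count(iat_s, size, rng)
                 for _ in range(n)]
        p = sum(1.0 if d > 0 else 0.5 if d == 0 else 0.0 for d in diffs) / n  # Eq. (5)
        p_two = 2 * min(p, 1 - p)                                             # Eq. (6)
        return (p_two * n + 1) / (n + 1)                                      # Eq. (7)

    print(inter_arrival_times([3, 7], 10))   # Table 5 example: [4, 6]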
3.7 Method 6: Bootstrap Test

Bootstrapping (Efron and Tibshirani, 1994) is a statistical method for estimating the uncertainty of some quantity in a data sample by resampling the data several times. We can employ bootstrapping to create a significance test as follows. Similar to the procedure used in the inter-arrival time test, we create a series of corpora S1 to SN, where each random corpus is produced by sampling |S| texts from S with replacement. Likewise, we create a series T1 to TN by repeatedly sampling |S| texts from T. The p-value is again obtained using Equations (5) through (7). This method makes no assumptions regarding the shape of the frequency distribution of words and is thus generally applicable. The method is almost identical to the bootstrap test used by Lijffijt et al. (2011), but it differs in that we use a two-tailed p-value and resample both S and T concurrently. Implementations in R and Matlab can be found in Lijffijt (2012).
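A minimal Python sketch of this test, operating on per-text relative frequencies (the Table 2 representation). It assumes, for simplicity, texts of equal length, so that the frequency of a resampled corpus equals the mean of the resampled per-text frequencies; the example frequencies are hypothetical.

    import random

    def bootstrap_test(freqs_s, freqs_t, n=9999, seed=0):
        """Two-tailed bootstrap p-value for freq(q,S) = freq(q,T)."""
        rng = random.Random(seed)
        k = len(freqs_s)                      # |S| texts drawn from each corpus

        def resampled_mean(freqs):
            return sum(rng.choice(freqs) for _ in range(k)) / k  # with replacement

        diffs = [resampled_mean(freqs_t) - resampled_mean(freqs_s) for _ in range(n)]
        p = sum(1.0 if d > 0 else 0.5 if d == 0 else 0.0 for d in diffs) / n  # Eq. (5)
        p_two = 2 * min(p, 1 - p)                                             # Eq. (6)
        return (p_two * n + 1) / (n + 1)                                      # Eq. (7)

    print(bootstrap_test([56.1, 40.2, 0.0, 12.7], [10.0, 0.0, 5.5, 0.0, 2.1]))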
3.8 Summary of Methods

Table 6 summarizes the assumptions underlying the six methods described above. The χ² and log-likelihood ratio tests represent the data in a 2×2 table, while Welch’s t-test, the Wilcoxon rank-sum test, and the bootstrap test take as input a list of frequencies per text for each word. The inter-arrival time test is based on the spatial distribution of a word in the corpora. The Wilcoxon rank-sum and bootstrap tests make the fewest assumptions regarding the frequency distribution and are thus the most generally applicable.

Table 6  Summary of the six methods presented in this paper and the assumptions regarding the frequency distribution for each test

Test                       Assumption regarding the frequency distribution
Pearson’s χ² test          All words are statistically independent (bag-of-words model)
Log-likelihood ratio test  All words are statistically independent (bag-of-words model)
Welch’s t-test             All texts are statistically independent, and the mean frequency follows a normal distribution
Wilcoxon rank-sum test     All texts are statistically independent
Inter-arrival time test    Spaces between occurrences of the same word are statistically independent
Bootstrap test             All texts are statistically independent

3.9 Test for Uniformity of p-Values

All of the previously discussed methods yield p-values for the hypothesis that the frequencies of a word q in S and T are equal. Several studies, including Kilgarriff (2001), Rayson et al. (2004), and Paquot and Bestgen (2009), have previously compared some of these methods. These studies have shown that the p-values in the same setting are not equal: the significance assigned to a given frequency difference differs from one method to another. This finding is alarming because we do not know which test yields the best results. We study the utility of these tests based on the following criterion: if the data follow the distribution that is assumed in the null hypothesis and the test is unbiased, then the p-values given by the method should be uniformly distributed on the interval [0, 1]. This criterion follows from the definition of p-values: the probability of encountering a p-value of x or less is x itself. For example, there is a 10% chance of observing a p-value of 0.1 or less, and a 1% chance of observing a p-value of 0.01 or less. If this criterion is not fulfilled, then the test is either anti-conservative (the probability of encountering a p-value of x or smaller is more than x) or conservative (the probability of encountering a p-value of x or smaller is less than x). See, for example, Blocker et al. (2006).

When assessing a statistical testing procedure, testing for the uniformity of p-values, either visually or by a statistical test, is common practice in many disciplines, such as particle physics; see, e.g., Figures 2–6, 8–9, and 11–12 in Beaujean et al. (2011). A similar kind of experiment has been published in Lijffijt (2013), while, for example, Schweder and Spjøtvoll (1982) study the uniformity of p-values for multiple-hypotheses adjustment procedures, and L’Ecuyer and Simard (2007) use the Kolmogorov-Smirnov test (also used here) to measure the uniformity of random number generators.

Numerous statistical tests can be utilized to determine whether a distribution is uniform. We employ the Kolmogorov-Smirnov test (Massey, 1951), which can be used to compare two distributions. The reference distribution that we use is the uniform distribution on [0, 1]. The test is based on a simple statistic: the maximum distance between the empirical cumulative distribution function Fn(x), which is based on the observed data, and the theoretical uniform cumulative distribution function F(x):

    D_n = \sup_x |F_n(x) - F(x)|.    (8)

The quantity √n · Dn asymptotically follows a Kolmogorov distribution. The associated p-value can be found by comparing √n · Dn to a table containing critical values for the Kolmogorov distribution. Implementations of this test are available in statistical software programs, including R.

3.10 Measure of Dispersion: DPnorm

Gries (2008) presents an overview of several dispersion measures and the disadvantages of each measure, and proposes a simple alternative that is reliable and easy to interpret: the deviation of proportions (DP). The measure is based on the difference between observed and expected relative frequencies, where the expected relative frequency of a text is equal to its relative size. Let v1, …, vn be the relative frequencies that are observed in texts S1, …, Sn, and let s1, …, sn be the relative sizes of the texts. DP is defined as

    DP = \frac{1}{2} \sum_{i=1}^{n} |s_i - v_i|,    (9)

and the normalized measure DPnorm is given by

    DP_{norm} = \frac{DP}{1 - \min_i(s_i)}.    (10)

The normalized measure, as presented by Lijffijt and Gries (2012), has a minimum value of 0 and a maximum value of 1, regardless of the corpus structure, whereas DP also has a minimum of 0, but its maximum depends on the corpus structure. Because dispersion is quantified as the difference between the expected and observed frequencies, a dispersion of 0 indicates that a word is dispersed as expected, whereas a dispersion of 1 indicates that the word is minimally dispersed. A word is minimally dispersed when it occurs only in the shortest text.
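To close this section, a short Python sketch of Equations (8)–(10): the Kolmogorov-Smirnov uniformity check is available directly in scipy, and DPnorm takes only a few lines. The counts are the Matilda figures from Section 2, with equal text sizes assumed for illustration.

    from scipy.stats import kstest

    def dp_norm(counts, sizes):
        """DPnorm: counts[i] = occurrences in text i, sizes[i] = length of text i."""
        total_count, total_size = sum(counts), sum(sizes)
        v = [c / total_count for c in counts]    # observed relative frequencies
        s = [z / total_size for z in sizes]      # relative text sizes (expected)
        dp = sum(abs(si - vi) for si, vi in zip(s, v)) / 2       # Equation (9)
        return dp / (1 - min(s))                                 # Equation (10)

    counts = [408, 155, 11, 2, 1] + [0] * 404    # Matilda over the 409 texts
    print(dp_norm(counts, [35_000] * 409))       # close to 1: poorly dispersed

    # Kolmogorov-Smirnov test against the uniform distribution on [0, 1]:
    p_values = [0.42, 0.77, 0.05, 0.91, 0.33]    # in practice, 500 values per word
    print(kstest(p_values, "uniform"))           # statistic D_n and its p-value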
4. Data

For the purposes of our study, we require a relatively large and homogeneous data set containing information on the gender of the authors of the texts. To fulfil this requirement, we have selected a subcorpus of the British National Corpus (BNC, 2007), namely the prose fiction genre. As categorized by Lee (2001), the genre excludes drama but includes both novels and short stories. Lee (2001, p. 57) notes that ‘where further sub-genres can be generated on-the-fly through the use of other classificatory fields, they are not given their own separate genre labels, to avoid clutter’; thus, e.g. children’s prose fiction is not separated from adult prose fiction, because these two types of fiction can be distinguished through the ‘audience age’ field. As the sub-genres of prose fiction may differ from one another considerably, our material can be regarded as homogeneous only in relation to other super-genres, such as academic prose.

The prose fiction subcorpus contains 431 texts, or c. 16 million words, of present-day British English. According to Burnard (2007, Section 1.4.2.2), most of the texts are continuous extracts with a target sample size of 40,000 words, but several texts are included in their entirety. The gender of the authors is known for 409 texts, or c. 15.6 million words, which are divided fairly evenly between male and female authors: 203 texts were written by men, and 206 texts were written by women (c. 7.2 and 8.4 million words, respectively). These 409 texts form our data set. For the uniformity experiments in the following section, we use the first 2,000 words of each text, while for the gender study, we analyse the full texts. We preprocess the data set by lowercasing all text; furthermore, punctuation, lemmatization, parts of speech, and multi-word tags are ignored, and only the word forms (i.e. running words) are considered.

5. Uniformity of p-Values

5.1 Randomly Assigning the Texts to Two Sets

The first experiment that we conducted involves testing the uniformity of the p-values for each method. We have employed the following procedure. We randomly assign 200 texts to corpus S and 200 texts to corpus T, such that the corpora do not overlap. We then apply each method to all words with a frequency of 50 or greater in the fiction corpus (there are 3,302 such words). The entire process is repeated 500 times.

Because the corpus is split into two parts at random, the null hypothesis, that there is no difference between these parts, is by definition true. Notice that two random samples from a population are almost always different, as long as there is variation in the population from which the samples are drawn. This means that we expect there to be differences between the two samples. However, since the assignment is random, any observed structure is fully explained by the artefacts of random sampling, and there is no true discriminative structure present in the data. This procedure is very similar to permutation testing; see, for example, Good (2005).

For example, assume that we have drawn two samples, and we observe that the word would is more frequent in S than in T. If we also find that it has a low p-value, we may think that there is a real difference between the two populations. However, since S and T are drawn from the same population, we know that there is no true difference between the two populations with respect to the frequency of would. Performing many comparisons aggravates this problem, because we are then liable to find many large differences, while there are in fact none.

A statistical test quantifies how likely an observation is under the null hypothesis. Perhaps counter-intuitively, this does not mean that a p-value is always 1 when there is no true difference between the populations; it means that the distribution of the p-values should be approximately uniform on the range [0, 1].
That is, there is a 50% probability that a p-value is 0.5 or lower, a 10% probability that it is 0.1 or lower, a 1% probability that it is 0.01 or lower, and so on. In that case, the test is neither conservative nor anti-conservative. When we perform multiple tests, we can use the Bonferroni correction, or a more powerful alternative, to ensure that the smallest p-value of a set of tests has a uniform distribution. The probability distribution of the minimum corresponds to the family-wise error rate. Other post-hoc corrections may have different aims.

Due to the random sampling, the p-values will not be exactly uniform, but, as discussed in Section 3.9, we can employ the Kolmogorov-Smirnov test to summarize the uniformity of the 500 p-values for one word under one test in a single p-value. We repeat this procedure for each word and obtain, for each of the 3,302 words, six p-values that express the uniformity of the p-values for each of the six tests. This results in a total of 3,302 · 6 = 19,812 p-values.

We use a minimum frequency of 50 because the frequency influences the uniformity of the p-values, and the influence differs per method. We do not claim that the significance tests are inapplicable to lower frequencies (in fact, we would argue the opposite), but this experiment is not meaningful for lower frequency words. We have not optimized the frequency threshold, and, as shown below, a frequency of 50 is often too low. Further details regarding why the experiments are not meaningful for less frequent words can be found in the discussion of the experimental results below.

A low p-value for the Kolmogorov-Smirnov test indicates that the p-value distribution over the random corpus assignments is not uniform. However, because we test 19,812 hypotheses, we do not expect all p-values of the Kolmogorov-Smirnov test to be high. To correct for multiple hypotheses, we apply the Bonferroni correction by multiplying each p-value by the number of hypotheses. If a p-value is greater than one after multiplication, then we set the value to one. The Bonferroni correction ensures that there is only an α probability of falsely rejecting any sample. The correction is conservative, but we also prefer to be cautious and not reject any samples as being non-uniform unless we are certain of their lack of uniformity. For a review of multiple hypothesis correction methods, see Shaffer (1995) or Dudoit et al. (2003).
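In outline, the experiment can be coded as follows. This is a hedged sketch: texts stands for the 2,000-word prefixes of the 409 fiction texts, and test for any of the six methods above, taking a word and two lists of texts and returning a p-value; neither is defined here.

    import random
    from scipy import stats

    def uniformity_experiment(texts, words, test, repeats=500, seed=0):
        rng = random.Random(seed)
        p_values = {w: [] for w in words}
        for _ in range(repeats):
            split = rng.sample(texts, 400)          # non-overlapping random split
            s, t = split[:200], split[200:]
            for w in words:
                p_values[w].append(test(w, s, t))
        # One Kolmogorov-Smirnov uniformity p-value per word, Bonferroni-corrected:
        m = len(words)
        return {w: min(1.0, stats.kstest(ps, "uniform").pvalue * m)
                for w, ps in p_values.items()}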
Figure 2 shows an overview of the performance of each method. In the following discussion, we write, for brevity, that samples or words are rejected in the uniformity test, where we actually mean that the null hypothesis that the p-values follow a uniform distribution is rejected.

Fig. 2  The results of the uniformity test for all six methods based on random text assignments. Each dot corresponds to a word, which has a frequency (x-axis) and a dispersion (y-axis). Light grey dots correspond to rejected samples. A sample is rejected if the corrected p-value of the Kolmogorov-Smirnov test for uniformity is < 0.01. The Wilcoxon rank-sum and bootstrap tests demonstrate the best performance, with 3.6% rejected samples.

We observe that 57.6% of the samples are rejected for the χ² test, including even the most frequent, well-dispersed words. The log-likelihood ratio test performs even worse: 65% of the words are rejected, and these also include the most frequent and best dispersed words. The difference between the two is probably caused by Yates’ correction in the χ² test. The t-test, Wilcoxon rank-sum test, and bootstrap test perform much better: although 3.6% to 4.8% of the samples are rejected, these rejected samples consist of infrequent, poorly dispersed words. Thus, testing words with sufficient frequency and/or dispersion yields appropriate results. Because of Zipf’s law, we know that the number of infrequent words greatly exceeds the number of frequent words; thus, if we had selected a lower frequency threshold, the percentage of rejected samples would have been much higher. The inter-arrival time test has more rejected samples (16.3%), and these samples again include frequent and well-dispersed words. This result indicates that the test does not capture all of the structure that is present in the texts, possibly because inter-arrival times are correlated within texts and these correlations are not captured by the test.

The Wilcoxon rank-sum and bootstrap tests demonstrate the best performance: frequent and well-dispersed words always yield a uniform distribution. When comparing the bootstrap and t-tests, we observe that the samples for which the t-test does not provide a uniform distribution comprise all instances in which the bootstrap test does not provide a uniform distribution, plus a few more. Especially for infrequent but relatively well-dispersed words, the bootstrap test appears to outperform the t-test. In contrast, the Wilcoxon rank-sum test appears to provide a tighter boundary for the rejected samples.

Finally, we have also tested the performance of all tests on words with frequencies between 20 and 50. Figure 3 displays the results. We observe that the χ² and log-likelihood ratio tests fail to yield uniform p-values in almost all cases. The t-test and Wilcoxon rank-sum test fail in nearly half of the instances; almost all words that have frequencies below 30 or that are poorly dispersed are rejected. The inter-arrival time and bootstrap tests are more successful in yielding uniform p-values for low frequency words, with the bootstrap test being the most successful.

Fig. 3  The results of the uniformity test for all six methods, based on random text assignments, for low frequency words. Each dot corresponds to a word, which has a frequency (x-axis) and a dispersion (y-axis). Light grey dots correspond to samples for which the null hypothesis that the p-values follow a uniform distribution has been rejected. The null hypothesis is rejected if the corrected p-value of the Kolmogorov-Smirnov test for uniformity is < 0.01.

5.2 Randomly Assigning the Words to Two Sets

The second experiment that we conducted is based on the random assignment of individual words to two sets, rather than the assignment of entire texts. This approach should lead to a smoother distribution of frequencies, and we expect all methods to yield unbiased (i.e. uniform) p-values in this setting. We have used the following procedure to test this hypothesis: we randomly assign half of the 810,000 words to corpus S and the other half to corpus T. We then apply each method to all words with a frequency of 50 or greater in the fiction corpus (i.e. the same 3,302 words that were used in the previous experiment). The entire process is repeated 500 times.
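Under this token-level randomization, where every token is assigned to S or T independently of the text it came from, the count of a given word in S follows a hypergeometric distribution. A small numpy sketch of one such random split; the word’s total count of 577 is a hypothetical example:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 810_000                        # tokens in the randomized pool (see text)
    k = 577                            # hypothetical total count of one word
    in_s = rng.hypergeometric(ngood=k, nbad=n - k, nsample=n // 2)
    print(in_s, k - in_s)              # the word's counts in S and T for one split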
Again, we expect the p-value distribution for each word to be approximately uniform over the 500 repetitions. We use the Kolmogorov-Smirnov test, as discussed above, to obtain 3,302 · 6 = 19,812 p-values. We use the Bonferroni correction for multiple hypotheses to compute the final p-values. Figure 4 shows an overview of the performance of each method.

Fig. 4  The results of the uniformity test for all six methods based on random word assignments (rather than texts, as in Fig. 2). Each dot corresponds to a word, which has a frequency (x-axis) and a dispersion (y-axis). Light grey dots correspond to samples for which the null hypothesis has been rejected. The null hypothesis is rejected if the corrected p-value of the Kolmogorov-Smirnov test for uniformity is < 0.01.

Surprisingly, we observe that the χ² test fails to yield uniform p-values for nearly 70% of the words. This result may have occurred because the test statistic only asymptotically follows a χ² distribution; another contributing factor could be Yates’ correction, which makes the p-values more conservative (perhaps excessively conservative). The latter explanation is easy to verify, because the Kolmogorov-Smirnov test can also be employed as a one-tailed test. We computed the p-values again by testing only whether the p-values for the frequency test are excessively low. Table 7 presents the results. We now observe that 0% of the samples are rejected for the χ² test; this result confirms that Yates’ correction leads to conservative p-values, which is not necessarily a disadvantage.

Table 7  For each method, the percentage of samples for which the null hypothesis under the one-tailed Kolmogorov-Smirnov test is rejected, based on random word assignments as in Fig. 4. The alternative hypothesis is that the p-values are anti-conservative.

Test                       Percentage of rejected samples
χ² test                    0.0%
Log-likelihood ratio test  3.9%
Welch’s t-test             3.9%
Wilcoxon rank-sum test     3.9%
Inter-arrival time test    0.0%
Bootstrap test             0.0%

Fig. 5  Cumulative distribution of p-values for each method for the word trip. The diagonal line indicates the uniform distribution, which we expect to be close to the actual distribution. The p-values of the uniformity tests are presented in parentheses. The first four tests show a jagged pattern because of the deterministic nature of these tests, i.e. the limited number of different inputs leads to a limited number of different output values. This behaviour causes the uniformity test to yield low p-values. The inter-arrival time and bootstrap tests are less affected by this limitation.

Table 7 also shows that 3.9% of the samples are rejected for the log-likelihood ratio test, t-test, and Wilcoxon rank-sum test, despite our use of the conservative Bonferroni correction. Perhaps surprisingly, the inter-arrival time and bootstrap tests have no rejected samples; thus, we can conclude that these tests consistently yield reasonably uniformly distributed p-values. Figure 4 shows that all of the rejected samples are infrequent words. Because this difference is unexpected, let us examine an example of the p-values given by each method for an infrequent word.

Figure 5 illustrates the p-values for the word trip. We notice a problem here: the first four tests do not yield the expected uniform distributions. The cause is visible in the figure: the number of unique p-values that these tests yield is limited, and the tests give the same p-value for many of the randomized inputs, because the number of distinct inputs is also very low. This behaviour is not necessarily unfavourable; if we accept that only a certain number of p-values are possible, then the observed distribution may be ‘as uniform as possible’ under the constraints.
The reference distribution in our test, the uniform distribution on [0, 1], does not assume a finite set of possible values. This mismatch could have caused the uniformity test to be slightly inappropriate and to reject many samples, especially those corresponding to infrequent or very poorly dispersed words. Thus, we should not necessarily interpret the smoother curves given by the inter-arrival time and bootstrap tests as superior. However, we are not aware of any significance test that would be more appropriate in this situation, and we leave this issue for further research.

Figure 6 illustrates a comparison of the p-values for the frequent word would. We continue to observe the jagged pattern, but it is now less severe. The high p-values for each of the tests demonstrate that the uniformity test now functions properly. This result corroborates the evidence in Fig. 4 that, in this randomization setting (assigning each word in the subcorpus randomly to S or to T), none of the frequent words is rejected.

We conclude that all of the methods yield uniform p-values in this setting, in which we randomly sample words rather than texts. Thus, the differences between the methods in the first experiment are fully explained by the additional structure of the texts. This finding is important because, when creating a corpus, one usually samples texts from various sources rather than individual words. As a note of caution, the jagged patterns put the first four tests at a disadvantage in the uniformity test; thus, we cannot conclude that these four methods are all inferior. Nonetheless, the evidence does not suggest that any test is superior to the bootstrap test either.

Based on the experiments that have been discussed thus far, we can conclude that, under the assumption of randomly sampled texts, the χ² and log-likelihood ratio tests may lead to spurious conclusions, and we therefore recommend the use of a representation of the data and a statistical test that take into account the distribution of the word within the corpus.

Fig. 6  Cumulative distribution of p-values for each method for the word would. The diagonal line indicates the uniform distribution, which we expect to be close to the actual distribution. The p-values of the uniformity tests are presented in parentheses. The first four tests show a jagged pattern because of the deterministic nature of these tests, i.e. the limited number of different inputs leads to a limited number of different output values. Nonetheless, at this frequency, the uniformity test works properly.

6. Differences between Male and Female Writing

6.1 The Bootstrap Test

Past research on the BNC reports statistically significant gender differences in word-frequency distributions in conversation (e.g. Rayson et al., 1997) and in both the fiction and non-fiction genres (e.g. Argamon et al., 2003). We next consider the extent to which word-frequency distributions display statistically significant gender differences in the BNC prose fiction texts, using the bootstrap test. After controlling the false discovery rate (FDR; Benjamini and Hochberg, 1995) at α = 0.05, which bounds the expected proportion of false positives among all positives, the bootstrap test returns 74 words (occurring 5,000 times or more in both subcorpora) whose frequency differs significantly between the male- and female-authored subcorpora.
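The Benjamini-Hochberg step-up procedure behind this FDR control is short enough to sketch: sort the p-values, find the largest rank i with p(i) ≤ αi/m, and reject the corresponding hypotheses. A minimal Python version (our own illustration, not the authors’ implementation):

    def benjamini_hochberg(p_values, alpha=0.05):
        """Indices of hypotheses rejected at FDR level alpha (step-up procedure)."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        cutoff = 0
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= alpha * rank / m:
                cutoff = rank              # largest rank with p <= alpha * rank / m
        return {order[r] for r in range(cutoff)}

    print(benjamini_hochberg([0.0001, 0.0004, 0.03, 0.2, 0.9]))   # -> {0, 1, 2}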
The minimum frequency of 5,000 was chosen for ease of illustration, as the list of significant words would have been considerably longer if lower frequencies had been considered (cf. Fig. 7, below). Tables 8 and 9 list the words that are most significantly overrepresented in male and female prose fiction, respectively.

Table 8  High-frequency words that are significantly overrepresented in male-authored prose fiction in the BNC according to the bootstrap test

Word     Males    M/million  Females  F/million  DPnorm  Bootstrap
a        164,254  22,823.55  179,376  21,442.46  0.06    0.0001
another  5,293    735.48     5,285    631.76     0.14    0.0001
by       20,971   2,913.98   20,687   2,472.91   0.13    0.0001
first    7,211    1,001.99   7,145    854.11     0.13    0.0001
from     29,201   4,057.56   29,279   3,499.99   0.10    0.0001
in       103,423  14,370.92  113,461  13,563.04  0.06    0.0001
its      7,031    976.98     5,863    700.86     0.26    0.0001
man      11,533   1,602.54   10,626   1,270.22   0.21    0.0001
of       161,802  22,482.84  165,196  19,747.39  0.09    0.0001
on       54,122   7,520.40   58,075   6,942.24   0.07    0.0001
one      22,641   3,146.03   23,432   2,801.04   0.09    0.0001
some     11,887   1,651.73   11,839   1,415.22   0.14    0.0001
the      417,501  58,012.94  379,234  45,333.32  0.09    0.0001
their    15,044   2,090.41   13,912   1,663.03   0.20    0.0001
they     37,660   5,232.96   35,721   4,270.06   0.17    0.0001
through  9,117    1,266.83   8,300    992.18     0.16    0.0001
two      9,592    1,332.84   8,402    1,004.37   0.17    0.0001
us       6,744    937.10     5,059    604.75     0.26    0.0001
we       26,275   3,650.99   22,273   2,662.50   0.21    0.0001
were     26,899   3,737.69   27,088   3,238.08   0.12    0.0001
is       32,539   4,521.39   30,015   3,587.97   0.21    0.0003
left     5,803    806.34     5,994    716.52     0.14    0.0005
other    8,843    1,228.76   9,170    1,096.17   0.12    0.0005
there    29,585   4,110.92   30,533   3,649.89   0.13    0.0005
are      15,878   2,206.29   15,541   1,857.76   0.18    0.0007
where    9,333    1,296.85   9,596    1,147.10   0.15    0.0013
he       124,464  17,294.62  130,393  15,587.07  0.14    0.0045

Table 9  High-frequency words that are significantly overrepresented in female-authored prose fiction in the BNC according to the bootstrap test

Word     Males    M/million  Females  F/million  DPnorm  Bootstrap
’ll      9,340    1,297.82   14,921   1,783.64   0.24    0.0001
’m       9,263    1,287.12   14,500   1,733.32   0.24    0.0001
’ve      8,092    1,124.41   12,258   1,465.31   0.23    0.0001
be       32,481   4,513.33   43,381   5,185.73   0.10    0.0001
come     7,742    1,075.77   10,737   1,283.49   0.15    0.0001
could    20,573   2,858.68   27,724   3,314.10   0.12    0.0001
did      19,633   2,728.06   26,923   3,218.35   0.14    0.0001
eyes     6,955    966.42     12,757   1,524.96   0.26    0.0001
face     7,206    1,001.29   10,427   1,246.44   0.21    0.0001
for      46,664   6,484.09   59,191   7,075.64   0.07    0.0001
go       9,104    1,265.03   12,736   1,522.45   0.16    0.0001
her      49,768   6,915.40   146,675  17,533.41  0.29    0.0001
how      9,714    1,349.79   13,231   1,581.62   0.13    0.0001
if       20,859   2,898.42   27,324   3,266.29   0.11    0.0001
knew     5,700    792.03     8,264    987.87     0.18    0.0001
made     7,094    985.73     9,772    1,168.14   0.13    0.0001
make     5,341    742.15     7,379    882.08     0.13    0.0001
much     6,613    918.89     9,195    1,099.16   0.15    0.0001
must     6,054    841.22     8,325    995.16     0.18    0.0001
n’t      45,068   6,262.33   66,842   7,990.24   0.20    0.0001
never    6,969    968.36     10,827   1,294.25   0.17    0.0001
not      33,130   4,603.51   45,580   5,448.60   0.16    0.0001
own      5,403    750.76     8,078    965.64     0.17    0.0001
she      57,200   7,948.10   164,039  19,609.09  0.28    0.0001
should   5,417    752.71     7,962    951.77     0.16    0.0001
so       20,460   2,842.97   29,023   3,469.39   0.12    0.0001
thought  8,753    1,216.25   13,774   1,646.53   0.19    0.0001
to       178,154  24,755.00  223,827  26,756.10  0.05    0.0001
too      8,348    1,159.98   11,448   1,368.48   0.14    0.0001
want     6,050    840.66     8,956    1,070.59   0.20    0.0001
when     17,667   2,454.88   23,864   2,852.68   0.13    0.0001
with     48,613   6,754.91   62,689   7,493.79   0.07    0.0001
would    23,077   3,206.61   32,428   3,876.42   0.14    0.0001
you      79,286   11,017.01  119,301  14,261.14  0.16    0.0001
your     12,257   1,703.14   18,688   2,233.95   0.18    0.0001
had      63,597   8,836.98   85,125   10,175.77  0.15    0.0003
look     6,476    899.86     9,045    1,081.23   0.16    0.0003
take     5,467    759.66     7,181    858.41     0.13    0.0003
very     8,570    1,190.83   12,089   1,445.11   0.22    0.0003
do       28,665   3,983.08   38,382   4,588.15   0.15    0.0005
because  5,599    778.00     8,054    962.77     0.23    0.0007
put      5,415    752.43     7,195    860.08     0.18    0.0023
that     76,457   10,623.91  95,829   11,455.32  0.10    0.0029
little   7,654    1,063.54   10,360   1,238.43   0.19    0.0047
’re      8,584    1,192.77   11,813   1,412.12   0.24    0.0049
have     30,736   4,270.85   38,696   4,625.69   0.11    0.0053
well     9,511    1,321.58   12,540   1,499.02   0.18    0.0057

Tables 8 and 9 are consistent with earlier research that has found gender differences based on word frequencies in prose fiction. Overall, the tables suggest that male-authored fiction is dominated by more frequent use of noun-related forms than female-authored fiction, which is verb-oriented. Male authors overuse articles (a, the) and prepositions (by, from, in, of, on, through), both of which are associated with nouns. Similarly, male-authored fiction overuses other function words that are typically associated with noun phrases and nominal functions, such as another, first, one, some, two, and other. However, it is noteworthy that the list of significant items for male authors is shorter than that for female authors.

The personal pronouns that are overrepresented in male-authored fiction are the first-person plural forms us and we and the third-person pronouns its, their, and they, while women’s fiction overuses the second-person forms you and your, which can have singular and plural referents. Stereotypically, men tend to write about man and he, and women about her and she. These pronoun findings are consistent with those of Argamon et al. (2003, pp. 325–327) but deviate from them in that, contrary to the earlier findings, women do not significantly favour the first-person pronoun I. When the bootstrap method is used, personal pronouns do not emerge as unequivocal female-style markers in contemporary prose fiction.

Table 9 shows that female-authored fiction is marked by frequent verb use: there are more than twenty verb forms among the items overused by women (forms of be, do, and have; modals, such as could, should, must, and would; and activity and mental verbs, including come, go, make, knew, and thought). Only three such verb forms are overused in male-authored fiction (were, is, and are). Particularly salient features in women’s fiction are contracted forms (’ll, ’m, ’ve, n’t, ’re), negative particles (n’t, never, not), and intensifiers (much, so, too, very). These are all indicators that female-authored fiction employs a more involved, colloquial style than male-authored fiction, which, by contrast, is marked by features associated with an informational, noun-oriented style (for these distinctions, see Biber, 1995, pp. 107–120; Biber and Burges, 2000).

However, these style markers may not be a simple reflection of the gender of the authors; rather, the differences may be correlated with differences in target audience. Both the male and female authors sampled for the BNC wrote for adults, and only a small minority wrote for children. However, c. 5 million of the total of 7.2 million words in the male-authored fiction subcorpus were intended for a mixed readership, whereas half of the female-authored subcorpus
Tables 8 and 9 are consistent with earlier research that has found gender differences in word frequencies in prose fiction. Overall, the tables suggest that male-authored fiction makes more frequent use of noun-related forms, whereas female-authored fiction is verb-oriented. Male authors overuse articles (a, the) and prepositions (by, from, in, of, on, through), both of which are associated with nouns. Similarly, male-authored fiction overuses other function words that are typically associated with noun phrases and nominal functions, such as another, first, one, some, two, and other. It is noteworthy, however, that the list of significant items for male authors is shorter than that for female authors.

The personal pronouns that are overrepresented in male-authored fiction are the first-person plural forms us and we and the third-person pronouns its, their, and they, while women’s fiction overuses the second-person forms you and your, which can have singular and plural referents. Stereotypically, men tend to write about man and he, and women about her and she. These pronoun findings are consistent with those of Argamon et al. (2003, pp. 325–327), except that women do not significantly favour the first-person pronoun I, as the earlier findings would suggest. When the bootstrap method is used, personal pronouns do not emerge as unequivocal female-style markers in contemporary prose fiction.

Table 9 shows that female-authored fiction is marked by frequent verb use: there are more than twenty verb forms among the items overused by women (forms of be, do, and have; modals, such as could, should, must, and would; and activity and mental verbs, including come, go, make, knew, and thought). Only three such verb forms are overused in male-authored fiction (were, is, and are). Particularly salient features in women’s fiction are contracted forms (’ll, ’m, ’ve, n’t, ’re), negative particles (n’t, never, not), and intensifiers (much, so, too, very). These are all indicators that female-authored fiction employs a more involved, colloquial style than male-authored fiction, which, by contrast, is marked by features associated with an informational, noun-oriented style (for these distinctions, see Biber, 1995, pp. 107–120; Biber and Burges, 2000).

However, these style markers may not be a simple reflection of the gender of the authors; rather, they may correlate with differences in target audience. Both the male and female authors sampled for the BNC wrote for adults, and only a small minority wrote for children. However, c. 5 million of the total of 7.2 million words in the male-authored fiction subcorpus were intended for a mixed readership, whereas half of the female-authored subcorpus (c. 4.4 million of 8.4 million words) targeted female audiences and may hence include more female characters and female-oriented topics than the male-authored subcorpus. Previous research indicates that audience design is relevant in spoken interaction, where style shifting is typically a response to the speaker’s audience (Bell, 1984). In weblogs, for example, the diary subgenre is reported to display more ‘female’ stylistic features and the filter subgenre more ‘male’ stylistic features, in both cases independently of the gender of the author (Herring and Paolillo, 2006). It is plausible that different subgenres of fiction and their target audiences also play a role in the word-distribution differences observed in the BNC prose fiction genre.

6.2 Comparing the χ2 Test with the Bootstrap Test

The above analysis is based on words that are ranked as significant by the bootstrap test. Most of these words are also significant according to the other tests, including those based on the bag-of-words model. However, how should we evaluate words that are ranked as significant by the bag-of-words tests, such as the χ2 test, but not by the more valid tests, such as the bootstrap test? Tables 10 and 11 list high-frequency words (occurring 5,000 times or more in both subcorpora) for which the χ2 and bootstrap p-values differ by at least a factor of ten. After controlling the false discovery rate (FDR) at α = 0.05 (Benjamini and Hochberg, 1995), the χ2 p-values for these words are significant, but the bootstrap p-values are not. All of the listed words are also significant according to our other bag-of-words test, the log-likelihood ratio test.
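The FDR step referred to here is the step-up procedure of Benjamini and Hochberg (1995). As an illustration only (the function name is ours), it can be implemented in a few lines:

    import numpy as np

    def bh_significant(pvals, alpha=0.05):
        """Benjamini-Hochberg step-up procedure: returns a boolean mask of
        the hypotheses that remain significant at FDR level alpha."""
        p = np.asarray(pvals, dtype=float)
        m = p.size
        order = np.argsort(p)
        # Find the largest k with p_(k) <= (k / m) * alpha; reject the k
        # hypotheses with the smallest p-values.
        below = p[order] <= alpha * np.arange(1, m + 1) / m
        mask = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()
            mask[order[:k + 1]] = True
        return mask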
Table 10 High-frequency words that are significantly overrepresented in male-authored prose fiction in the BNC according to the χ2 test but not according to the bootstrap test

Word       Males    M/million  Females  F/million  DPnorm  χ2      Bootstrap
an          18,513   2,572.43   20,422   2,441.23  0.11    0.0000  0.1027
back        17,159   2,384.29   18,863   2,254.87  0.13    0.0000  0.0951
down        14,405   2,001.62   15,483   1,850.83  0.13    0.0000  0.0207
has          6,595     916.39    6,553     783.34  0.26    0.0000  0.0519
his         72,681  10,099.23   76,064   9,092.63  0.16    0.0000  0.0131
I          125,809  17,481.51  141,074  16,863.87  0.20    0.0000  0.5232
into        18,468   2,566.18   20,505   2,451.15  0.12    0.0000  0.1477
my          25,143   3,493.69   24,885   2,974.73  0.30    0.0000  0.0585
off          8,869   1,232.37    9,379   1,121.16  0.15    0.0000  0.0205
old          6,455     896.94    6,895     824.22  0.24    0.0000  0.1931
or          17,248   2,396.66   17,442   2,085.00  0.17    0.0000  0.0139
out         24,466   3,399.62   26,980   3,225.17  0.11    0.0000  0.0749
people       6,342     881.24    6,243     746.28  0.26    0.0000  0.0135
them        18,592   2,583.41   19,973   2,387.56  0.15    0.0000  0.0509
this        24,230   3,366.83   26,699   3,191.58  0.14    0.0000  0.1537
up          25,018   3,476.32   27,754   3,317.69  0.12    0.0000  0.1525
which       13,030   1,810.56   12,809   1,531.18  0.25    0.0000  0.0185
who         14,583   2,026.35   15,619   1,867.08  0.15    0.0000  0.0329
then        19,598   2,723.20   21,899   2,617.79  0.16    0.0001  0.3891
looked       9,904   1,376.19   10,995   1,314.33  0.21    0.0009  0.4287
something    7,457   1,036.17    8,191     979.15  0.17    0.0004  0.1911
just        13,760   1,911.99   15,393   1,840.07  0.19    0.0011  0.4473
turned       5,738     797.31    6,311     754.41  0.18    0.0025  0.2917

Table 11 High-frequency words that are significantly overrepresented in female-authored prose fiction in the BNC according to the χ2 test but not according to the bootstrap test

Word       Males    M/million  Females  F/million  DPnorm  χ2      Bootstrap
all         25,813   3,586.79   31,323   3,744.33  0.11    0.0000  0.1765
and        184,332  25,613.45  222,854  26,639.78  0.09    0.0000  0.0873
any          7,879   1,094.81    9,837   1,175.91  0.15    0.0000  0.1033
as          45,322   6,297.62   56,365   6,737.83  0.10    0.0000  0.0063
away         8,152   1,132.74   10,130   1,210.93  0.14    0.0000  0.0615
been        20,639   2,867.85   25,253   3,018.72  0.13    0.0000  0.1319
but         42,393   5,890.63   50,780   6,070.20  0.11    0.0000  0.2905
’d          12,340   1,714.68   17,259   2,063.13  0.34    0.0000  0.0565
day          5,369     746.04    6,788     811.43  0.19    0.0000  0.0899
going        7,539   1,047.57    9,628   1,150.92  0.20    0.0000  0.0753
him         34,197   4,751.77   42,555   5,086.99  0.15    0.0000  0.0883
last         5,116     710.88    6,620     791.35  0.16    0.0000  0.0077
might        5,960     828.16    7,630     912.08  0.20    0.0000  0.0655
no          21,170   2,941.63   26,348   3,149.62  0.10    0.0000  0.0093
now         14,568   2,024.26   18,450   2,205.50  0.13    0.0000  0.0141
only        10,668   1,482.35   13,320   1,592.26  0.12    0.0000  0.0239
said        35,208   4,892.25   46,938   5,610.93  0.28    0.0000  0.0681
seemed       5,036     699.77    6,518     779.16  0.23    0.0000  0.0789
think        9,406   1,306.99   12,231   1,462.08  0.17    0.0000  0.0145
time        13,072   1,816.39   16,112   1,926.02  0.10    0.0000  0.0215
told         5,509     765.49    7,455     891.16  0.20    0.0000  0.0065
was        111,268  15,461.00  132,703  15,863.21  0.10    0.0000  0.3401
why          7,034     977.39    8,955   1,070.47  0.16    0.0000  0.0433
room         5,708     793.14    7,107     849.57  0.22    0.0001  0.2215
know        14,188   1,971.46   17,191   2,055.00  0.15    0.0003  0.2985
about       18,742   2,604.25   22,573   2,698.36  0.14    0.0003  0.3357
even         8,156   1,133.30    9,947   1,189.06  0.16    0.0013  0.2625
after        8,541   1,186.80   10,371   1,239.74  0.12    0.0029  0.1553
long         6,326     879.02    7,740     925.23  0.12    0.0026  0.1111
tell         5,557     772.16    6,792     811.91  0.16    0.0057  0.2347

Some of the words in Tables 10 and 11 appear to corroborate the above analysis: the writing style of women is more verb-oriented, whereas men overuse masculine and collective personal pronouns, such as his and them. However, the list of words for female-authored fiction also includes a male personal pronoun, him, and men appear to significantly overuse the first-person singular pronouns I and my, which is surprising in view of our general knowledge of gendered styles (Argamon et al., 2003; Newman et al., 2008). Furthermore, men appear to overuse directional adverbs, such as back, down, out, and up; this result could easily be misinterpreted as an interesting discovery about the focus of male prose writing on spatial orientation.

If words of all frequencies are considered, the most salient category of words that are ranked as significant by the χ2 test but not by the bootstrap test is proper nouns, as in the Matilda example above. Some of these words, too, are easily misinterpreted as genuine differences between the subcorpora. Even an experienced linguist cannot determine which bag-of-words results are genuinely significant, and our comparisons show that such results can lead to conflicting interpretations. Therefore, it is advisable to avoid the noise inherent in bag-of-words methods and to use a more valid test, such as the bootstrap test.
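An implementation of the bootstrap test for R and Matlab is available (Lijffijt, 2012). For orientation, the sketch below shows a generic two-sample bootstrap test on the per-text normalized frequencies of Table 2, in the style of Efron and Tibshirani (1994); it is a minimal illustration and not necessarily identical to the implementation used in our experiments:

    import numpy as np

    def bootstrap_test(s, t, n_resamples=9_999, seed=0):
        """Two-sample bootstrap test on per-text normalized frequencies.
        s, t: one normalized frequency per text (cf. Table 2).
        Returns a two-sided empirical p-value."""
        rng = np.random.default_rng(seed)
        s, t = np.asarray(s, dtype=float), np.asarray(t, dtype=float)
        observed = s.mean() - t.mean()
        # Simulate the null hypothesis of equal means by shifting both
        # samples to the pooled mean before resampling with replacement.
        pooled = np.concatenate([s, t]).mean()
        s0 = s - s.mean() + pooled
        t0 = t - t.mean() + pooled
        count = 0
        for _ in range(n_resamples):
            diff = (rng.choice(s0, size=s0.size).mean()
                    - rng.choice(t0, size=t0.size).mean())
            if abs(diff) >= abs(observed):
                count += 1
        # Empirical p-value with the (r + 1) / (n + 1) estimator
        # (North et al., 2002).
        return (count + 1) / (n_resamples + 1)

Note that under the (r + 1)/(n + 1) estimator the smallest attainable p-value with 9,999 resamples is 0.0001, which is one plausible explanation of the floor visible in Tables 8 and 9.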
6.3 Comparing the Tests According to Significance Threshold

Figure 7 summarizes the number of significant words returned by each test at varying significance thresholds. The t-test yields the fewest significant words, followed by the Wilcoxon rank-sum and bootstrap tests, in both panels. Only the curve for the inter-arrival time test differs substantially between Figs 7a and 7b: the test appears to have difficulty comparing zero with non-zero frequencies and always deems such cases significant. We also observe that the χ2 and log-likelihood ratio tests mark several orders of magnitude more words as significant than the t-test, the Wilcoxon rank-sum test, and the bootstrap test.

Fig. 7 Comparison of the number of significant words for the six methods. For each method, the curve shows how the number of significant words increases as the significance threshold is increased in the male vs. female author comparison, without correction for multiple hypotheses. The x-axis shows the p-value threshold, and the y-axis shows the percentage of words marked as having significantly different frequencies between genders; both axes have a logarithmic scale. The figure on the left (a) is based on all words in the prose fiction subcorpus, and the figure on the right (b) includes only those words with frequencies greater than zero for both genders.

7. Conclusion

Many current corpus tools use the χ2 and log-likelihood ratio tests. We suggest that other tests be added to these tools for the reasons discussed in this paper. The core difference between the bag-of-words tests (the χ2 and log-likelihood ratio tests) and the other four tests (the t-test and the Wilcoxon rank-sum, inter-arrival time, and bootstrap tests) is the representation of the data, and thus the unit of observation: for the bag-of-words tests, the data are represented in a 2x2 table (Table 1) and the number of samples equals the number of words in a corpus, whereas for the other four tests, the data are represented either by a frequency list (Table 2) or by a list of inter-arrival times. In those cases, the number of samples is much lower than the number of words in a corpus. For the t-test, the Wilcoxon rank-sum test, and the bootstrap test, the number of samples equals the number of texts in a corpus, and for the inter-arrival time test, it equals the number of occurrences of the word being tested rather than the total number of words. The number of samples largely determines our certainty about the estimates, and the experimental results show that, in the context of the statistical comparison of two corpora, the bag-of-words tests have excessively high confidence in the estimates of mean word frequencies.

By studying the uniformity of the p-values given by each of the tests, we have shown that the choice of how to define independent samples, and hence how to represent the data, plays a major role in the outcome of a significance test. We have shown that bag-of-words-based tests may lead to spurious conclusions when assessing the significance of differences in frequency counts between corpora. Note, however, that we are not suggesting that there is anything wrong with the χ2 and log-likelihood ratio tests as such, but only that their application in this context is problematic. We have also shown that appropriate alternatives exist: Welch’s t-test, the Wilcoxon rank-sum test, and the bootstrap test.
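For tool builders, both of the recommended classical tests operate directly on the Table 2 representation and are available in standard libraries. A minimal sketch with hypothetical per-text frequencies (the values below are illustrative only):

    import numpy as np
    from scipy import stats

    # Hypothetical per-text normalized frequencies (per million words)
    # of one word, one value per text, as in Table 2; the two lists
    # need not be of equal length.
    s = np.array([1203.5, 950.1, 1430.2, 880.7, 1105.0])  # corpus S
    t = np.array([702.3, 1100.8, 640.9, 990.1])           # corpus T

    # Welch's t-test (Welch, 1947): does not assume equal variances.
    t_stat, p_t = stats.ttest_ind(s, t, equal_var=False)

    # Wilcoxon rank-sum / Mann-Whitney U test (Wilcoxon, 1945;
    # Mann and Whitney, 1947): rank-based, no normality assumption.
    u_stat, p_u = stats.mannwhitneyu(s, t, alternative='two-sided')

    print(f'Welch p = {p_t:.4f}, rank-sum p = {p_u:.4f}')

Welch’s variant of the t-test is the natural choice here because the two corpora need not have equal variances or equal numbers of texts.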
We have considered the choice of statistical tests for comparing moderate-sized or large corpora (at least 100 texts each). Owing to space limitations, we have not included a discussion of how to compare small corpora. This problem is briefly addressed in Lijffijt et al. (2012); it appears that, for small corpora, the advice on which statistical test to use is not as straightforward as for large corpora. The objections raised in this paper against the bag-of-words tests hold for corpora of any size. However, in small corpora, counting all occurrences of a word in the same text as a single sample, i.e. treating each text as one sample, may preclude the detection of many patterns. We would expect the inter-arrival time test to be a tempting alternative in that setting, but further investigation into the use of statistical tests for comparing small corpora or individual texts is warranted.

Notes
1 Kilgarriff refers to this test as the Mann-Whitney ranks test.
2 In Lijffijt et al. (2012) we set out to explore the question of lexical variation over time in a historical single-genre corpus of personal correspondence. Comparing the log-likelihood ratio and bootstrap tests, we found that the two successive half-a-million-word subperiods of the corpus that we examined were more similar to each other with regard to their lexis than a bag-of-words method might lead one to postulate. We also observed that, besides the choice of method and the size of the corpus, the observed degree of similarity depends on several other factors, notably the type of post-hoc correction and the frequency cut-off and significance thresholds used.
3 Both p-values are actually 0 using double-precision floating-point numbers; thus, these values are much smaller than 0.0001.

Acknowledgements
We thank the anonymous reviewers for their valuable comments and suggestions.

Funding
This work was supported by the Academy of Finland [grant numbers 129282, 129350]; the Finnish Centre of Excellence for Algorithmic Data Analysis (ALGODAN); the Finnish Centre of Excellence for the Study of Variation, Contacts and Change in English (VARIENG); the Finnish Doctoral Programme in Computational Sciences (FICS); the Academy of Finland’s Academy professorship scheme; and the Finnish Graduate School in Language Studies (Langnet).

References
Altmann, E. G., Pierrehumbert, J. B., and Motter, A. E. (2009). Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words, PLoS One, 4(11): e7678.
Argamon, S., Koppel, M., Fine, J., and Shimoni, A. R. (2003). Gender, genre, and writing style in formal written texts, Text, 23(3): 321–46.
Beaujean, F., Caldwell, A., Kollár, D., and Kröninger, K. (2011). P-values for model evaluation, Physical Review D, 83(1): 012004.
Bell, A. (1984). Language style as audience design, Language in Society, 13: 145–204.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B, 57(1): 289–300.
Berry, G. and Armitage, P. (1995). Mid-P confidence intervals: a brief review, The Statistician, 44(4): 417–23.
Biber, D. (1995). Dimensions of Register Variation: A Cross-linguistic Comparison. Cambridge: Cambridge University Press.
Biber, D. and Burges, J. (2000). Historical change in the language use of women and men: gender differences in dramatic dialogue, Journal of English Linguistics, 28(1): 21–37.
Blocker, C., Conway, J., Demortier, L., Heinrich, J., Junk, T., Lyons, L., and Punzi, G. (2006). Simple facts about p-values, Technical Report CDF/MEMO/STATISTICS/PUBLIC/8023, Laboratory of Experimental High Energy Physics, The Rockefeller University.
BNC = The British National Corpus, version 3 (BNC XML Edition) (2007). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/ (accessed 26 November 2012).
Burnard, L. (2007).
Reference Guide for the British National Corpus (XML Edition). Published for the British National Corpus Consortium by the Research Technologies Service at Oxford University Computing Services. http://www.natcorp.ox.ac.uk/docs/URG/ (accessed 26 November 2012).
Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments, Statistical Science, 18(1): 71–103.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence, Computational Linguistics, 19: 61–74.
Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. New York: Chapman and Hall/CRC.
Evert, S. (2005). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für Maschinelle Sprachverarbeitung, University of Stuttgart.
Good, P. (2005). Permutation, Parametric, and Bootstrap Tests of Hypotheses. 3rd edn. New York/Heidelberg: Springer.
Gries, S. Th. (2005). Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff, Corpus Linguistics and Linguistic Theory, 1(2): 277–94.
Gries, S. Th. (2008). Dispersions and adjusted frequencies in corpora, International Journal of Corpus Linguistics, 13(4): 403–37.
Herring, S. C. and Paolillo, J. C. (2006). Gender and genre variation in weblogs, Journal of Sociolinguistics, 10(4): 439–59.
Hinneburg, A., Mannila, H., Kaislaniemi, S., Nevalainen, T., and Raumolin-Brunberg, H. (2007). How to handle small samples: bootstrap and Bayesian methods in the analysis of linguistic change, Literary and Linguistic Computing, 22(2): 137–50.
Hoffmann, S., Evert, S., Smith, N., Lee, D., and Berglund Prytz, Y. (2008). Corpus Linguistics with BNCweb—a Practical Guide. Frankfurt am Main: Peter Lang.
Kilgarriff, A. (2001). Comparing corpora, International Journal of Corpus Linguistics, 6(1): 1–37.
Kilgarriff, A. (2005). Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 1(2): 263–76.
L’Ecuyer, P. and Simard, R. (2007). TestU01: a C library for empirical testing of random number generators, ACM Transactions on Mathematical Software, 33(4): 22.
Lee, D. Y. W. (2001). Genres, registers, text types, domains and styles: clarifying the concepts and navigating a path through the BNC jungle, Language Learning & Technology, 5(3): 37–72.
Lijffijt, J. (2012). Bootstrap test for R and Matlab. http://users.ics.aalto.fi/lijffijt/bootstraptest/ (accessed 26 November 2012).
Lijffijt, J. (2013). A fast and simple method for mining subsequences with surprising event counts. In Blockeel, H., Kersting, K., Nijssen, S., and Železný, F. (eds), Proceedings of ECML-PKDD 2013—Part I. Berlin: Springer-Verlag, pp. 385–400.
Lijffijt, J. and Gries, S. Th. (2012). Correction to Stefan Th. Gries’ “Dispersions and adjusted frequencies in corpora”, International Journal of Corpus Linguistics, 17(1): 147–9.
Lijffijt, J., Papapetrou, P., Puolamäki, K., and Mannila, H. (2011). Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. In Gunopulos, D., Hofmann, T., Malerba, D., and Vazirgiannis, M. (eds), Proceedings of ECML-PKDD 2011—Part II. Berlin: Springer-Verlag, pp. 341–57.
Lijffijt, J., Säily, T., and Nevalainen, T. (2012). CEECing the baseline: lexical stability and significant change in a historical corpus. In Tyrkkö, J., Kilpiö, M., Nevalainen, T., and Rissanen, M.
(eds), Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources. Studies in Variation, Contacts and Change in English, Vol. 10. Helsinki: VARIENG. http://www.helsinki.fi/varieng/journal/volumes/10/lijffijt_saily_nevalainen/ (accessed 26 November 2012).
Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other, Annals of Mathematical Statistics, 18(1): 50–60.
Massey, F. (1951). The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association, 46(253): 68–78.
Newman, M. L., Groom, C. J., Handelman, L. D., and Pennebaker, J. W. (2008). Gender differences in language use: an analysis of 14,000 text samples, Discourse Processes, 45: 211–36.
North, B. V., Curtis, D., and Sham, P. C. (2002). A note on the calculation of empirical p-values from Monte Carlo procedures, The American Journal of Human Genetics, 71(2): 439–41.
Oakes, M. P. and Farrow, M. (2007). Use of the chi-squared test to examine vocabulary differences in English-language corpora representing seven different countries, Literary and Linguistic Computing, 22(1): 85–100.
Paquot, M. and Bestgen, Y. (2009). Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, pp. 247–69.
Rayson, P. (2008). From key words to key semantic domains, International Journal of Corpus Linguistics, 13(4): 519–49.
Rayson, P., Berridge, D., and Francis, B. (2004). Extending the Cochran rule for the comparison of word frequencies between corpora. In Purnelle, G., Fairon, C., and Dister, A. (eds), Le poids des mots: Proceedings of the 7th International Conference on Statistical Analysis of Textual Data (JADT 2004). Louvain-la-Neuve: Presses universitaires de Louvain, pp. 926–36.
Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Kilgarriff, A. and Berber Sardinha, T. (eds), Proceedings of the Workshop on Comparing Corpora. Stroudsburg: Association for Computational Linguistics, pp. 1–6.
Rayson, P., Leech, G., and Hodges, M. (1997). Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus, International Journal of Corpus Linguistics, 2(1): 133–52.
Savický, P. and Hlaváčová, J. (2002). Measures of word commonness, Journal of Quantitative Linguistics, 9(3): 215–31.
Schweder, T. and Spjøtvoll, E. (1982). Plots of p-values to evaluate many tests simultaneously, Biometrika, 69(3): 493–502.
Scott, M. (2012). WordSmith Tools, version 6. Liverpool: Lexical Analysis Software.
Shaffer, J. P. (1995). Multiple hypothesis testing, Annual Review of Psychology, 46: 561–84.
Welch, B. L. (1947). The generalization of ‘Student’s’ problem when several different population variances are involved, Biometrika, 34(1–2): 28–35.
Wilcoxon, F. (1945). Individual comparisons by ranking methods, Biometrics Bulletin, 1(6): 80–3.