key: cord-0978139-ldzteeix
authors: Siegelman, Noam; Schroeder, Sascha; Acartürk, Cengiz; Ahn, Hee-Don; Alexeeva, Svetlana; Amenta, Simona; Bertram, Raymond; Bonandrini, Rolando; Brysbaert, Marc; Chernova, Daria; Da Fonseca, Sara Maria; Dirix, Nicolas; Duyck, Wouter; Fella, Argyro; Frost, Ram; Gattei, Carolina A.; Kalaitzi, Areti; Kwon, Nayoung; Lõo, Kaidi; Marelli, Marco; Papadopoulos, Timothy C.; Protopapas, Athanassios; Savo, Satu; Shalom, Diego E.; Slioussar, Natalia; Stein, Roni; Sui, Longjiao; Taboh, Analí; Tønnesen, Veronica; Usal, Kerem Alp; Kuperman, Victor
title: Expanding horizons of cross-linguistic research on reading: The Multilingual Eye-movement Corpus (MECO)
date: 2022-02-02
journal: Behav Res Methods
DOI: 10.3758/s13428-021-01772-6
sha: dd02f5af70dbac0900fda147c7513bf3b706170e
doc_id: 978139
cord_uid: ldzteeix

Scientific studies of language behavior need to grapple with a large diversity of languages in the world and, for reading, a further variability in writing systems. Yet, the ability to form meaningful theories of reading is contingent on the availability of cross-linguistic behavioral data. This paper offers new insights into aspects of reading behavior that are shared and those that vary systematically across languages through an investigation of eye-tracking data from 13 languages recorded during text reading. We begin by reporting a bibliometric analysis of eye-tracking studies showing that the current empirical base is insufficient for cross-linguistic comparisons. We respond to this empirical lacuna by presenting the Multilingual Eye-Movement Corpus (MECO), the product of an international multi-lab collaboration. We examine which behavioral indices differentiate between reading in written languages, and which measures are stable across languages.
One of the findings is that readers of different languages vary considerably in their skipping rate (i.e., the likelihood of not fixating on a word even once) and that this variability is explained by cross-linguistic differences in word length distributions. In contrast, if readers do not skip a word, they tend to spend a similar average time viewing it. We outline the implications of these findings for theories of reading. We also describe prospective uses of the publicly available MECO data and plans for its further development.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.3758/s13428-021-01772-6.

Any field of research in human cognition must account for natural variability in physiological, psychological, and behavioral traits and states of individuals. A few fields, however, also need to account for the profound and inherent variability in the very object of cognitive processing. A prime example of such a field is the study of language. A generalizable account of how language is learned, produced, comprehended, or represented in the brain or mind also needs to grapple with the world's astounding diversity of languages. In the case of reading, this diversity is further compounded by the variability of orthographies, i.e., solutions developed for representing speech in print (Daniels & Bright, 1996; Daniels & Share, 2018). Thus, one of the central goals of reading research is to find what universal and specific aspects exist across the written languages of the world, and subsequently, to study how these aspects influence reading development and processes (for recent reviews see, among others, Frost, 2012; Koda & Zehler, 2008; Share, 2014; Verhoeven & Perfetti, 2017). This goal places extensive demands on the quantity and quality of empirical evidence and, importantly, on its cross-linguistic coverage, which is not always guaranteed in an Anglo-centric scientific literature on language (Share, 2014).
It is uncontroversial that the availability of high-quality, comparable behavioral data from diverse languages and writing systems is both a driving engine and a prerequisite of meaningful and generalizable theories of reading. The history of reading research shows that the field has been propelled greatly by data that came from cross-linguistic multi-lab coordinated efforts. Consider, for instance, Ziegler and Goswami's (2005) influential psycholinguistic grain size theory: a proposal that languages with inconsistent (opaque) orthographies (e.g., English) are more difficult to learn and are preferentially learned via bigger orthographic chunks than languages with relatively consistent (transparent) orthographies (e.g., Finnish). This proposal draws on several multilingual studies, including in particular a joint investigation of real word and non-word reading in 13 European alphabetic languages (Seymour, Aro, & Erskine, 2003). Most research producing either cross-linguistic data or comparable single-language data so far has employed tasks revolving around single word recognition (e.g., the English Lexicon Project database of lexical decision and word naming by Balota et al., 2007). Yet proficient natural reading is the reading of continuous texts to achieve comprehension, i.e., building a mental representation of the text content in one's memory and integrating it with one's prior knowledge through inferential processing (e.g., Wooley, 2011). This set of highly coordinated cognitive operations necessarily includes, but also goes far beyond, identification of individual words in the text in terms of complexity and breadth of demands on the visuo-oculomotor, perceptual, and information-processing systems in the reader (e.g., Liversedge et al., 2012; Rayner & Liversedge, 2011). For such higher-level language processing, comparable cross-linguistic data are far scarcer and barely available.
In line with the goal of studying natural real-time behavior during reading for comprehension, in this study we focus on silent reading of running texts, using eye tracking as the experimental paradigm. Eye tracking is the registration of eye movements as they unfold in real time, and its output is a demonstrably reliable and ecologically valid record of reading behavior (Kliegl et al., 2006; Rayner, 1998; Rayner et al., 2012). A rich literature shows that eye-movement control is an integral part of information processing that takes place during reading (see review in Radach & Kennedy, 2013), and thus, it is reflective both of the cognitive processes of comprehension and of the multiple components that underlie those processes (e.g., Kennedy et al., 2000; Rayner et al., 2006; Rayner et al., 2012). One of the important advantages of eye tracking is that it enables a fine-grained real-time account of both the temporal (when) and spatial (where) aspects of text reading. The when of eye-movement control determines how long to fixate on a word with the eye gaze, allowing for viewing and uptake of visual and linguistic information, and when to break the fixation and initiate a saccadic movement to another location. The where aspect relates to decisions of which word to select as a target for the next fixation and which to skip, and what amplitude of a saccadic oculomotor movement to generate to attain this target (Radach et al., 2007; Rayner, 1998). Given vast differences in the surface characteristics of (written) languages of the world, one can expect readers of different languages to vary systematically in both the temporal and spatial dimensions of their reading behavior. An examination of such systematic patterns requires a resource of comparable eye-tracking reading data across languages. Out of thousands of experimental studies using eye tracking (see below), very few have addressed this need for cross-linguistic comparison.
One of these seminal exceptions is an eye-tracking study by Liversedge et al. (2016), which examined the eye movements of native speakers reading closely matched written passages in three languages (Chinese, English, and Finnish) representing widely different language families and writing systems. Other studies provided corpora with comparable cross-linguistic eye-tracking data in two languages. Such studies include the Dundee corpus of texts read in English and French (Pynte & Kennedy, 2006); the GECO corpus of eye movements (Cop et al., 2017) collected from English and Dutch participants reading the same book in its original and translated versions; and Whitford and Titone's (2012) study of English-French bilinguals reading passages in both languages (see also English and German comparative data in Rau et al., 2015, and Chinese and English data in Sun & Feng, 1999, and Feng et al., 2009). The English Lexicon Project also pioneered a type of large-scale multi-lab data collection resulting in a series of mega-studies in multiple languages (see Keuleers & Balota, 2015, for a review); an up-to-date list of relevant resources is maintained at http://crr.ugent.be/programs-data/megastudy-data-available. Several additional studies offer monolingual databases of eye-tracking data, including, among others, corpora in Chinese (Pan et al., 2021), English (Frank et al., 2013; Luke & Christianson, 2018), German (Kliegl et al., 2004), Hindi (Husain et al., 2014), and Russian (Laurinavichyute et al., 2019). As we show below, these and similar studies are relatively limited from the viewpoint of cross-linguistic coverage. They gravitate heavily, in line with the trend in the entire field of language research, towards alphabetic languages of Europe and especially English (Share, 2008, 2014).
Moreover, whereas all of the above studies aimed to specifically compare reading in a small number of target languages, our goal here was, for the first time, to generate a database of reading behavior across a much larger number of languages and writing systems. This database was collected using similar technology and analyzed with unified software from comparable populations of readers exposed to comparable textual stimuli. The current work thus builds upon the comparative studies cited above and extends them to investigate eye movements during reading across multiple languages. The structure of the paper is as follows. Part I is a bibliometric analysis of scholarly publications on eye movements in reading. We review the data available for various languages and the studies providing primary data on more than one language. Part II describes the Multilingual Eye-movement Corpus, or MECO, the product of an international multi-lab collaboration of research groups in 13 countries. The goal of MECO is to supply theories of reading with primary behavioral data from a large number of diverse writing and linguistic systems. The resulting data are made freely available to empirically address a range of research questions about reading across a wide variety of languages. In Part II we also address the technological, methodological, and experimental decisions that went into this corpus creation. Part III uses MECO data to directly tackle the key theoretical goal of reading research (addressed in Liversedge et al., 2016, among others) and of this paper: quantifying similarities and differences in reading behavior across a variety of written languages. These analyses offer new insights into aspects of behavior that are shared and those that vary systematically across languages. In the General Discussion, we summarize our findings and outline limitations and plans for MECO's further development. 
To estimate the cross-linguistic coverage of studies of reading that use eye tracking, we conducted a bibliometric analysis of 1078 papers (published from 2000 to 2018) in the Web of Science citation database, which were manually coded for the investigated language(s). The search used the following parameters: TOPIC: ("reading" AND ("eye tracking" OR "eye movements")), refined by DOCUMENT TYPES: (ARTICLE OR REVIEW OR PROCEEDINGS PAPER); Timespan: 2000-2018; Indexes: SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, ES. This returned 1956 results. We then manually removed papers on topics unrelated to reading of written materials (e.g., reading of emotions), papers without eye-tracking data (i.e., conducted using other paradigms), and papers not reporting primary empirical data (e.g., reviews, meta-analyses). Note that our search should not be taken as an exhaustive list, nor does it follow the accepted protocols for meta-analyses (Moher et al., 2015). Rather, it is meant to provide an estimate of the current state of the field based on a large number of papers published over the last two decades. The full bibliometric database is available at the project's OSF page (see Data availability section in Part II, below).

Figure 1 presents the distribution of studied languages across the 1078 papers. Note that some studies included more than one language (see below), and therefore the sum of this distribution is larger than the number of studies. Perhaps unsurprisingly, Fig. 1 points to English as the most studied language, accounting for the majority of the eye-tracking research on reading (studied in 620/1078 papers, 57.5%). Other languages with a prevalence of more than 1% of the total (i.e., 11/1078 studies or more) are (in descending order): Chinese (11%), German (9.7%), French (5.2%), Spanish (4.1%), Finnish (3.9%), Dutch (3.3%), Italian (2.1%), Japanese (1.5%), and Korean (1%). All other languages combined appear in only 5.6% of total publications. These comprise a total of 18 languages: Hebrew, Swedish, Thai (7 studies each), Arabic, Portuguese, Russian (4), Polish (3), Afrikaans, Serbo-Croatian (2), Catalan, Croatian, Greek, isiZulu, Norwegian, Persian, Romanian, Sesotho, Uighur, and Urdu (1).

Together, these results show that in the last two decades, most available data on eye movements in reading has come from English, in line with Share's (2008) criticism. With the laudable exception of Chinese and (in a much more limited way) Japanese and Korean, there is a strong bias in the field towards Indo-European languages, in line with Share's (2014) critical observation. This bias raises a serious question about the generality of any theory built mainly on data from Indo-European languages. In sum, at present, the scientific community has access to little or no eye-tracking reading data from the vast majority of the world's languages and writing systems.

Next, we estimated the prevalence of coordinated cross-linguistic studies. We found that the vast majority of studies in our bibliometric database examined only one language: 1038 of the 1078 studies with primary data. In other words, only 40 out of 1078 studies in the database (3.7%) conducted a direct cross-linguistic comparison. Of these, 37 studies included data from two languages, and only three had data from three languages (Fukuda & Fukuda, 2009; Liversedge et al., 2016; Sagarra & Ellis, 2013). No studies in our database report data from four languages or more. Clearly, reading research does not have a sufficient empirical basis for investigating reading for comprehension across languages, neither in the diversity and number of represented languages nor in the availability of comparative cross-linguistic studies.
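For illustration, prevalence and cross-linguistic-comparison figures of the kind reported above can be computed from a coded bibliometric database with a few lines of code. Below is a minimal Python sketch over a toy three-paper database; the record format and field names are hypothetical (the real, manually coded database is available on the project's OSF page).

```python
from collections import Counter

# Toy stand-in for the coded bibliometric database: one record per paper,
# each listing the language(s) it studied (field names are hypothetical).
papers = [
    {"id": 1, "languages": ["English"]},
    {"id": 2, "languages": ["Chinese", "English"]},  # a cross-linguistic study
    {"id": 3, "languages": ["German"]},
]

def language_prevalence(papers):
    """Percent of papers studying each language. Because one paper may
    study several languages, the percentages sum to more than 100,
    just as the distribution in Fig. 1 does."""
    n = len(papers)
    counts = Counter(lang for p in papers for lang in p["languages"])
    return {lang: 100 * c / n for lang, c in counts.items()}

def cross_linguistic_share(papers):
    """Proportion of papers directly comparing more than one language."""
    return sum(len(p["languages"]) > 1 for p in papers) / len(papers)
```

On the toy data, English appears in two of three papers (about 67%), and one of three papers (about 33%) is a direct cross-linguistic comparison, mirroring the 620/1078 and 40/1078 calculations in the text.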
Part II addresses this deficit by reporting MECO, a coordinated eye-tracking study of reading in multiple diverse languages, designed specifically for cross-linguistic comparisons. Table 1 presents the languages included in the current release of MECO. At present, MECO includes samples from a total of 13 languages, selected due to the availability of partner labs; these will be complemented in the future by further contributing researchers. Table 1 lays out the diversity of the investigated languages in terms of their typological classes and genetic groups, as well as scripts, morphological types, and orthographic transparency (as classified in Dryer & Haspelmath, 2013; Seymour et al., 2003; Verhoeven & Perfetti, 2017). It also shows that many of the presently reported languages are under-studied: More than half (7/13) of the languages have an estimated prevalence of 1% or less in previous eye-tracking research, as reflected in the bibliometric search of Part I above. The present database, therefore, constitutes a considerable extension of the existing empirical data pool.

Participants

All participating laboratories aimed to reach n = 45-55 participants with usable data (see Data editing and cleaning below for details regarding inclusion of participants and trials), and indeed, the presently available and reported data sets in most languages reached this range. In some laboratories, however, the final stages of data collection were cut short by COVID-19-related closures; therefore, in two languages, the samples are smaller (n ~ 30 each). We plan to increase these samples in future releases of the MECO project; see Future directions. Table 2 lists the number of participants per site, the country and institution where the data was collected, and details regarding the participants' compensation.
Table 2 also includes summaries of some of the basic background information collected using the Language Experience and Proficiency Questionnaire (LEAP-Q; see Additional questionnaires and tests below). This information includes age, years of education, and self-ratings of L1 proficiency in speaking, oral comprehension, and reading. Participants' full demographic information is available at the project's OSF page (see Data availability below). Ethics clearance was obtained at each participating site from the ethics research board of the corresponding institution or country.

Materials

At each site, participants read a set of 12 texts in their first and dominant language (L1). All texts were Wikipedia-style encyclopedic entries on a variety of topics, including historical figures, events, and natural or social phenomena. Topics were chosen such that they did not rely on specialized academic knowledge and did not have a specific cultural bias making them more or less familiar to some of the participating sites. At the first stage, texts were created in English, loosely based on the Wikipedia entries. Five of the 12 texts (44 sentences in total) were chosen to serve as sources for translation. These texts were translated from the English original into the corresponding L1 by the team at each site to create translation equivalents across all languages. The quality of translation and content similarity were ensured through back-translation from the target language to English (never done by the same person who produced the original translation) and an iterative process of introducing changes to the text in L1 and a subsequent back-translation. In a few cases, when the authors' team had professional translators with native knowledge of both English and the target language, they would evaluate the translation (made by a different person) directly in the target language.
In this situation, the iterative process of aligning the source and target texts omitted the intermediate step of back-translating. See below for a quantitative evaluation of the translation quality of texts across languages. The remaining seven texts were not translated. Instead, participating sites were instructed to use non-matched texts on the same topic as the English originals (e.g., country flags, beekeeping), in the same prosaic genre (i.e., encyclopedic entries), of similar length (5-12 sentences, 10-15 lines), and of a comparable level of difficulty (e.g., by avoiding uncommon grammatical constructions). These texts were typically compiled by each team using Wikipedia or similar open resources in the corresponding L1. Below we show that the language-original texts, as far as we can attest, are similar to the English-translated texts in terms of their complexity and readability. Still, we provide English back-translations of all texts used so that users of MECO can further evaluate these and other text properties and potentially decide to focus on particular texts in their analyses based on these characteristics. To evaluate the quality of translations in matched texts and ensure that there were no systematic differences in text readability or complexity across sites, the authors' team at each site prepared back-translations into English for all texts in all languages. Note that we had to use back-translations to estimate complexity/readability and translation equivalence because, at present, there are no (comparable) computational tools that can estimate these text-level metrics for all MECO languages. First, to estimate complexity, we tested the comparability of both matched and unmatched texts across languages in terms of readability and complexity using the back-translations. As reported in Supplementary Materials S1, a set of 10 readability and complexity metrics did not differ statistically across languages in either matched or unmatched texts.
This finding suggests that the texts' complexity/readability was similar across sites and eliminates these factors as potential confounds. Second, to estimate the translation quality of the English-original texts, we quantified the text-wise cosine semantic similarity between back-translations and the English originals (using pretrained latent semantic analysis [LSA] vectors). This analysis revealed that back-translated matched texts were highly similar to the English originals (mean cosine = 0.88), significantly more than the similarity of unmatched texts to the unmatched originals (mean cosine = 0.66, p < .001) and statistically on par with the similarity of back-translated Finnish texts to the English originals in the study of Liversedge et al. (2016; mean cosine = 0.93; p > .1; see Supplementary Materials S2 for details). Please refer to the project's OSF repository to access all back-translations along with the estimates of readability/complexity and similarity to the English originals. As a general point, we note that the decision to make some of the texts translated and others more loosely related was motivated by three considerations. First, this step ensured that the materials represent a wider natural variety of orthographic, morphological, and syntactic constructions in each language, unconstrained by the demands of translation accuracy and sentence-by-sentence alignment of content across materials in the corpus. Second, we expected a greater diversity of texts to give rise to greater variability in individual reading strategies and patterns, which is desirable for characterizing natural reading behavior within and across languages. A third consideration was to enable a direct investigation of a methodological issue in cross-linguistic research, namely, what degree of control over materials is needed to make a cross-linguistic comparison meaningful (see also Papadopoulos et al., 2021).
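The cosine similarity computation is straightforward once each text is mapped to a vector. Below is a minimal Python sketch: the analysis in the paper used pretrained LSA vectors, for which we substitute a simple bag-of-words representation here purely for illustration (the cosine formula itself is identical in either case).

```python
import math
from collections import Counter

def bow_vector(text):
    """Word-count vector for a text. (The paper's analysis used pretrained
    LSA vectors; a bag-of-words vector is a simplified stand-in.)"""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts:
    dot(u, v) / (|u| * |v|), ranging from 0 (no overlap) to 1 (identical
    direction)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A back-translation that is word-for-word identical to its original scores 1.0, texts sharing no words score 0.0, and partial overlap falls in between; the mean cosines of 0.88 (matched) versus 0.66 (unmatched) reported above are interpreted on this scale.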
It is clear that some types of analyses require close matching of semantic equivalence across languages through translation (e.g., whether sentences with the same meaning require the same time to read in different languages, Liversedge et al., 2016). Yet it is still unclear whether, and to what extent, cross-linguistic differences in the global text contents influence eye-movement patterns over and above other well-known factors at the level of characters, morphemes, words, and larger multiword units (e.g., Schuster et al., 2016). The importance of this point is hard to overestimate. A radical methodological stance on this issue may be that any credible cross-linguistic comparison of oculomotor patterns must be based on semantically matched texts; otherwise, the diverging semantics of texts in different languages would present a confound. This view would invalidate virtually all existing knowledge of cross-linguistic differences in reading because only a very small portion of prior work is based on translated texts (see above). An alternative stance, however, is that semantic similarity is required only when investigating particular effects of interest and that other cross-linguistic differences in oculomotor behavior generalize regardless of the specific contents of the text. By including both semantically matched and unmatched cross-linguistic materials in the design, MECO enables researchers to examine whether various comparative effects of interest generalize beyond the tight semantic control and are thus more representative of reading behavior in general (see also examples in Part III below). Each text was followed by four yes/no comprehension questions: these were simple questions that tapped into factual knowledge obtained from the read materials and served as an attention check.
The comprehension questions were similar in content across languages in matched texts, but naturally differed for non-matched texts, reflecting the differences in text content. Table 3 details the number of words and sentences in each text in each language. A word is defined in this study as a unit in writing separated by a space.

Additional questionnaires and tests

In addition to the reading task, participants at all sites completed a battery of individual-differences tests and questionnaires. Two identical instruments were used at all sites: (1) the nonverbal IQ test from the Culture Fair Test-3 (CFT20, Subset 3 Matrices, short version, Form A, timed at 3 minutes, Weiß, 2006), and (2) an abridged version of the Language Experience and Proficiency Questionnaire (LEAP-Q; Marian et al., 2007). The CFT20 aimed at providing a comparable measure of nonverbal intelligence across all sites, and the LEAP-Q at collecting basic demographic and linguistic information about participants. Furthermore, each site used a short battery of (non-eye-tracking) measures of individual differences in L1 reading and proficiency. The goal in collecting these additional measures was to enable correlational analyses of the relations between individual differences in component skills of reading and oculomotor reading behavior within samples. Given the variability in which individual-differences tests are available for specific languages, the tasks were not identical across sites. Most commonly, the tests examined participants' vocabulary size, word and pseudoword naming, phonological/morphological awareness, and other component skills of reading. The full individual-differences data from each site, along with short task descriptions, are available at the project's OSF page (see Data availability).

Procedure

In all sites, the experimental session began with participants signing a consent form and filling out the LEAP-Q questionnaire.
Then, participants proceeded to the reading task, during which their eye movements were recorded. Following the reading task, participants took the individual-differences battery, including the CFT-20 and any L1 individual-differences tests. The entire procedure lasted no more than an hour, and breaks were provided when needed. Note that at the conclusion of the experimental session, participants in all samples (except for the South Korean one) proceeded to participate in an English-language eye-tracking study. The goal of that study was to create an additional eye-tracking corpus of reading in English as a non-dominant language, which can be used to examine the L2 reading behavior of participants with different L1s. This additional study is beyond the scope of the current paper and is therefore reported elsewhere (Kuperman et al., in press).

Apparatus and procedure

Information regarding the apparatus used at the different sites and additional settings can be found in Supplementary Material S3. Eye movements were recorded with an EyeLink Portable Duo, 1000, or 1000+ eye tracker (SR Research, Kanata, Ontario, Canada) with a sampling rate of 1000 Hz. A chin rest and a head restraint were used to minimize head movements. Calibration was performed using a series of nine fixed targets distributed around the display, followed by a 9-point accuracy test to validate eye position. Stimuli were viewed binocularly, but eye-movement data from only one eye (the right eye in most participants) were analyzed. Prior to the presentation of the trial stimuli, a dot appeared on the monitor screen, slightly to the left (or right, in the case of Hebrew, which is a right-to-left writing system) of the first word in the passage. Once the participant had fixated on it, the trial would begin. This drift check and correction took place at the beginning of each trial, and calibration was monitored by the experimenter throughout and redone if necessary.
Each of the 12 texts appeared on a separate screen. Participants were instructed to read the passages silently for comprehension and press the space bar when their reading of a passage was completed. A mono-spaced font with 1.5 spacing was used in the reading task in all languages, with a font size between 16 and 24 points (see Supplementary Materials S3). Due to inevitable differences in equipment (e.g., screen size and type) and the spatial configuration of the participating eye-tracking labs, maintaining an identical font size, distance from the screen, and screen resolution was unfeasible. Instead, we required that each lab test participants under the conditions most comfortable for visual inspection of the reading materials, as established by the prior practice of these labs and adjustments based on pilot participants (for the chosen settings, see Supplementary Materials S3). For reference, in the longest matched text ("wine tasting"), the number of text lines varied from 9 to 14 (M = 12.38), with a maximal number of characters per line varying from 93 to 114 (M = 105.75), except in Korean, where this number was substantially smaller (60). The 12 texts were presented in the same fixed order in all languages (see text number in Table 3). Each text was followed by four yes/no comprehension questions, each shown on a separate screen one after another. Participants responded by pressing "0" for no or "1" for yes, and their answers were recorded.

Table 3. Number of sentences (#sent) and words (#word) in each text across languages. Translated texts are marked with an asterisk; other texts were language-specific. Note that some small deviations in the number of sentences per text in matched texts are due to differences in spelling conventions (e.g., using a colon or period before "For example").

Topic       |        DU  EE  EN  FI  GE  GR  HE  IT  KO  NO  RU  SP  TR
1* Janus    | #sent  10  10  10  10  10  10  10  10  10  10  10   9  10
            | #word 186 131 183 128 174 189 130 185 142 177 151 210 146
2  Shaka    | #sent   7   9   6   8   9   6  11   7   7   8   7   7   7
            | #word 194 133 185 116 161 171 209 174 150 169 145 190 131
3* Doping   | #sent   9   9   9  10   9   9   9   9   9   9   9   9   9
            | #word 185 …

Data editing and cleaning

In paragraph reading, there is often a need to correct eye fixation locations and assign fixations to text lines within a passage. This is commonly done using a manual procedure (but see Carr, 2021; Cohen, 2013; Tang et al., 2012). One of our methodological objectives was to maintain high replicability in all aspects of the experimental setup and data analysis, in line with principles of Open Science. For this reason, we opted for automatic correction of fixation locations using the popEye software (implemented in R, version 0.6.4, Schroeder, 2019). The popEye software is an integrated environment for preprocessing and analyzing eye-tracking data from reading experiments. During preprocessing, popEye assigns fixations to lines, words, and letters. For the present study, an algorithm was used in which individual fixations are first grouped into sequences based on their spatial and temporal proximity. In the next step, sequences are assigned to the closest line based on their average vertical location (see Beymer & Russell, 2005; Carr et al., 2021; Špakov et al., 2019, for similar approaches). Following this automatic procedure, the software's output was visually inspected by members of the research team to assess the quality of the resulting data. This step was necessary but may have introduced subjective judgment. We argue, however, that this process involves fewer "researcher degrees of freedom" (Simmons et al., 2011) than an alternative process in which fixation alignment is done fully manually. Trials (texts) where fixations were erroneously assigned to lines (typically due to poor calibration or software failures) were deemed unusable and were removed from the analysis.
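The two-step line-assignment logic (grouping fixations into sequences, then snapping each sequence to the nearest line) can be illustrated with a minimal Python sketch. This is not popEye's implementation: the pixel thresholds, the sequence-splitting rule, and the data format below are simplifications chosen purely for illustration.

```python
def group_into_sequences(fixations, sweep_dx=200, max_dy=40):
    """Step 1: group consecutive fixations into sequences, starting a new
    sequence at a large leftward jump (a likely return sweep) or a large
    vertical jump. Thresholds (in pixels) are illustrative only."""
    sequences, current = [], [fixations[0]]
    for prev, fix in zip(fixations, fixations[1:]):
        if fix["x"] - prev["x"] < -sweep_dx or abs(fix["y"] - prev["y"]) > max_dy:
            sequences.append(current)
            current = []
        current.append(fix)
    sequences.append(current)
    return sequences

def assign_to_lines(sequences, line_ys):
    """Step 2: assign each sequence to the text line whose vertical position
    is closest to the sequence's mean y; returns one line index per fixation."""
    assigned = []
    for seq in sequences:
        mean_y = sum(f["y"] for f in seq) / len(seq)
        line = min(range(len(line_ys)), key=lambda i: abs(line_ys[i] - mean_y))
        assigned.extend([line] * len(seq))
    return assigned
```

Operating on whole sequences rather than isolated fixations is what makes this approach robust to vertical drift: a fixation that strays toward a neighboring line is still pulled to the line its sequence as a whole belongs to.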
Participants with fewer than five usable trials were then removed from the analysis altogether. The number and percentage of trials retained after data cleaning at each site can be found in Table 2 above. In the current release of MECO, we only report data from usable participants and trials, as determined by the current version of popEye. The usable data, so defined, comprise approximately 70% of the complete data. This is in line with the estimated upper limit that can be achieved by any automated algorithm in the present setup (see Carr et al., 2021, for a comparison of different line-assignment algorithms). Since the popEye software is under development and may improve its algorithms for correcting fixation locations, future releases of MECO may supplement the current samples with data from some of the trials or participants that are presently removed (see Limitations and future directions in the General Discussion). For the analyses below (reported in Part III), we additionally removed data points with either very short (< 80 ms) first fixations or very long total fixation times (top 1% of the participant-specific distribution). The current (and first) release of MECO includes full interest-area reports from usable participants and trials, as well as full data from the individual differences tests and background questionnaires. Additionally, we report data at the passage and sentence level, broken down by participant. We also include the analytical code used for Part III. The data, materials, and code are available at the project's OSF page, https://osf.io/3527a/ (now a public version of the OSF repository). In Part III below, we consider a number of variables reflecting oculomotor behavior at the word level during reading.
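The word-level trimming step described above can be sketched as follows. The record layout, the helper name, and the nearest-rank percentile convention are assumptions for illustration; the paper does not specify the exact percentile method:

```python
from collections import defaultdict

def trim_word_data(records):
    """Drop word-level data points with first fixations < 80 ms or total
    fixation times in the top 1% of the participant-specific distribution
    (hypothetical record format; nearest-rank percentile assumed)."""
    # Collect each participant's distribution of total fixation times.
    by_participant = defaultdict(list)
    for r in records:
        by_participant[r["participant"]].append(r["totalFixationDuration"])

    # Participant-specific cutoff: value at the 99th percentile (nearest
    # rank); anything above it falls in the top 1% and is removed.
    cutoffs = {}
    for p, times in by_participant.items():
        ordered = sorted(times)
        k = max(0, int(len(ordered) * 0.99) - 1)
        cutoffs[p] = ordered[k]

    return [r for r in records
            if r["firstFixationDuration"] >= 80
            and r["totalFixationDuration"] <= cutoffs[r["participant"]]]
```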
Note that the output of the popEye software includes several additional variables not discussed here, including fixation locations and information at the sentence and passage levels. For future users of MECO, we provide a description of the variables included in the database at the project's OSF page. Returning to the variables used in Part III below, those defined at the word level included: skipping 5 (a binary index of whether the word was not fixated even once during the entire reading of the text [and not only during the first pass], labeled skipping); first fixation duration (the duration of the first fixation landing on the word, firstFixationDuration); gaze duration (the summed duration of fixations on the word in the first pass, i.e., before the gaze leaves it for the first time, gazeDuration); total fixation duration (the summed duration of all fixations on the word, totalFixationDuration); first-run number of fixations (the number of fixations on a word during the first pass, nFixationsFirstRun); total number of fixations (the number of fixations on a word overall, nFixationsTotal); regression (a binary index of whether the gaze returned to the word after inspecting further textual material, i.e., material to the right of the word in left-to-right orthographies, regressionIn); and rereading (a binary index of whether the word elicited fixations after the first pass, i.e., after the gaze left the word for the first time, rereading). See Inhoff and Radach (1998) and Rayner (1998) for a detailed discussion of these variables. At the participant level, the following variables were defined: comprehension accuracy (percent of correct responses to all 48 questions, accuracy), matched comprehension accuracy (percent of correct responses to the 20 questions on the five translated passages, accuracyMatched), and reading rate (in words per minute, readingRate), as well as mean word-level variables (e.g., a participant's mean skipping rate, mean first fixation duration, etc.).
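Several of these word-level measures can be derived from a fixation record with simple bookkeeping. The sketch below, under hypothetical input assumptions (a temporally ordered list of fixated word indices and matching durations), computes skipping, first fixation duration, gaze duration, total fixation duration, and rereading; it is an illustration, not the popEye algorithm, and omits the regression and fixation-count measures:

```python
def word_measures(fixation_words, durations, n_words):
    """Derive some of the word-level measures listed above from a fixation
    record. fixation_words gives the 0-based word index of each fixation in
    temporal order; durations gives the matching fixation durations in ms.
    (Hypothetical input format, for illustration only.)"""
    measures = {w: {"skipping": 1, "firstFixationDuration": 0,
                    "gazeDuration": 0, "totalFixationDuration": 0,
                    "rereading": 0} for w in range(n_words)}
    first_pass_over = set()   # words whose first pass has already ended
    prev = None
    for w, dur in zip(fixation_words, durations):
        m = measures[w]
        if m["skipping"]:                 # first-ever fixation on this word
            m["skipping"] = 0
            m["firstFixationDuration"] = dur
        if prev is not None and prev != w:
            first_pass_over.add(prev)     # gaze left prev: its pass is over
        if w in first_pass_over:
            m["rereading"] = 1            # fixated again after the first pass
        else:
            m["gazeDuration"] += dur      # still within the first pass
        m["totalFixationDuration"] += dur
        prev = w
    return measures
```

For example, with fixations on words 0, 1, 1, 3, 1, word 2 is scored as skipped, and the final return to word 1 counts toward its total fixation duration and rereading but not its gaze duration.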
While reading rate is not an oculomotor measure and is closely related to total fixation duration (though it additionally accounts for skipped words and saccade durations), we include this variable to ensure comparability of the present data with the cross-linguistic educational and psychological literature that uses reading rate (see the review by Brysbaert, 2019). It also opens the opportunity for researchers who do not have access to an eye tracker to use our materials to collect information about reading rate and reading comprehension in their labs. Additionally, we used scores from the CFT test of nonverbal intelligence (cft). The 13-level categorical variable Language was a critical independent variable in all of our analyses. Furthermore, we considered word length in characters as a benchmark predictor of reading. Since all languages in our current corpus use spacing for segmentation, word length was defined as the number of characters between spaces, excluding punctuation marks.

Reliability

Correlational research is pointless without information about the reliability of the variables, because the observed correlation between two variables depends on both the theoretical correlation and the reliability of the measured variables. The reliability of the eye-tracking data was estimated in two ways. First, we examined the reliability of the eye-tracking variables at the participant level. For most variables, this was done using a split-half procedure in which, for each language, we examined the correlation between mean values for "odd" and "even" words within a participant. These reliability estimates reflect the extent to which each eye-tracking measure provides a stable measure of individual differences in each language.
The only exception to this procedure was the estimation of reliability for reading rate, which was examined by calculating intra-class correlation coefficients (ICC) across the reading rates from the 12 texts in each language for each participant. Second, we estimated split-half reliability at the word token level (i.e., the level of individual word occurrences). This was done by examining the correlation between means for "odd" and "even" participants within each word token for each language and eye-tracking measure. This metric represents reliability values relevant for word-level investigations (e.g., effects of word length or frequency) 6. For both types of reliability and for each measure, we computed both raw correlations and Spearman-Brown-corrected values (Spearman, 1910). The latter reflect reliability estimates for the full sample of participants/words (rather than for half of the participants/words, which are the basis for the uncorrected correlations). The full breakdown of reliability estimates by language is reported in Supplementary Materials S4 and S5 (subject- and word-token-level estimates, respectively). Below we provide a description of the main findings. The reliability of eye-tracking measures at the participant level was very high (all corrected rs > 0.93), as may be expected given the large number of words read by each participant (for related estimates and discussion, see Staub, 2021). Reliability at the word token level was somewhat lower but still within recommended ranges for most measures and languages (see, for instance, the reliabilities in GECO; Cop et al., 2017), with some eye-tracking measures (e.g., total fixation duration, skips, number of fixations) having higher reliability than others (e.g., first fixation duration, rereading).
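The odd/even split-half procedure and the Spearman-Brown correction can be sketched as follows. This is a minimal stdlib implementation; the function names and input layout are assumptions, not the project's analysis code (which is written in R):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(values_per_participant):
    """Odd/even split-half reliability with Spearman-Brown correction.

    values_per_participant: for each participant, a list of item-level
    values (e.g., gaze durations on each word, in presentation order).
    Returns (raw_r, corrected_r).
    """
    even = [mean(v[0::2]) for v in values_per_participant]  # "even" items
    odd = [mean(v[1::2]) for v in values_per_participant]   # "odd" items
    r = pearson(even, odd)
    corrected = 2 * r / (1 + r)   # Spearman-Brown prophecy formula
    return r, corrected
```

The correction estimates the reliability of the full-length measure from the half-length correlation, which is why corrected values are reported alongside the raw split-half correlations.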
As expected, reliability at the word token level was somewhat lower for sites with smaller sample sizes (see, e.g., the estimates for Turkish), but at all sites the average reliability across measures was high (all mean corrected rs > 0.7). In addition to the estimates for eye-movement measures, we calculated the reliability of the offline participant-level measures collected at all sites: CFT scores and comprehension accuracy. The former was estimated using a split-half procedure on the available data collected across languages (as the test was identical at all sites). It was found to be r = 0.4 uncorrected and r = 0.57 after Spearman-Brown correction. Although these values are far from perfect (which is unsurprising for a short test with only 12 items), they still point to reasonable reliability and suggest that CFT scores can be used (with caution) as a metric of individual differences. In contrast, the reliability estimates for comprehension accuracy (both for all texts and for matched texts only; see Supplementary Materials S4) were generally lower, with substantial variability across languages. This is expected: the goal of the comprehension questions was not to provide a measure of individual differences but rather to motivate participants to attend to the texts and to serve as a group-level metric. These reliability values should be taken as a warning not to use comprehension scores from MECO as a proxy for individual differences (at least not in most languages). We envision MECO as a resource that can generate and test hypotheses at different degrees of resolution, from a single language to a group of languages. Such groups may be defined genetically, e.g., Germanic or Romance, or typologically, e.g., morphologically agglutinative languages such as Finnish and Turkish. Finally, as demonstrated below, analyses can be applied to the entire set of languages. Equally, the units of linguistic interest may vary from a single character or sound to phenomena defined at the passage level. Moreover, researchers will be able to consult the data at the participant level, both within and across languages.

[Footnote 6: Note that our word token-level estimates of reliability differ from the estimates provided for the GECO corpus (Cop et al., 2017). Cop et al.'s calculations were based on the word type level, i.e., they averaged values across all occurrences of a word. Our choice is motivated by the fact that the morphological variability of different linguistic systems greatly affects how many tokens are associated with each word type, and makes word type-level reliability less comparable across languages.]

As stated in the Introduction, this part of the paper aims to quantify differences and similarities between all 13 languages, promoting the long-standing agenda of cross-linguistic psychological research (see, among others, reviews in Frost, 2012; Liversedge et al., 2016; Verhoeven & Perfetti, 2017). This analysis offers new insights into a key theoretical question in cross-linguistic reading research: which aspects of reading behavior are shared across writing systems, and which are language-specific. This section provides an overview of reading behavior across languages. To this end, we calculated the mean values of each dependent variable for each participant in each sample. Detailed summaries are available as auxiliary files at the project's OSF page, including a breakdown of each eye-tracking variable by language. We then computed the correlations between behavioral measures of reading calculated from these by-participant means across all languages (Table 4). Next, we calculated the means and standard errors for all eye-movement measures and comprehension accuracy by language based on these by-participant averages (Fig. 2; the values used to create this plot are available under "auxiliary files" at the project's OSF page).
A visual inspection of this figure points to substantial variability in eye-movement behavior across languages. Note that similar descriptive patterns were observed when matched and unmatched texts were examined separately (see Supplementary Materials S6), and that by-participant means of eye-movement measures calculated for matched and unmatched texts in each language correlated very highly (mean r = 0.90, range: 0.68-0.97; see Supplementary Materials S7). This suggests that the observed language differences did not arise because some texts were not perfectly matched translations. While a detailed analysis of specific oculomotor patterns is left for future research, we note a few findings here. The Norwegian sample appears to stand out: these readers showed relatively low accuracy on the comprehension questions (65% on matched texts), shorter and fewer fixations, and a higher rate of skipping. This might indicate that this sample engaged in a relatively superficial kind of reading, investing less in the inferential and integrative processes required for comprehension than readers at other sites. Another noteworthy pattern emerged in Estonian: these readers made a large number of fixations on the words they read, along with relatively long fixations and a high rereading rate. This stands in contrast to the typical trade-off between the number of fixations and their duration or the number of passes. A final observation is that Korean readers demonstrated short reading times and a high skipping rate, presumably due to the very short words in this orthography (see below). These patterns may be important to take into account when drawing cross-linguistic comparisons. The cross-sample variability in Fig. 2 leads to the first key theoretical question we ask in this section: What behavioral measures account for the most cross-linguistic variability in reading performance? We address this central question in several complementary ways in the remainder of this paper.
In the initial analysis, we fitted ordinary linear regression models to the by-participant means of each reading measure, with language as the predictor, and compared how much variance in each measure language accounted for. Measures capturing the spatial distribution of fixations, most notably skipping rate, showed substantially more between-language variability than durational measures. This pattern indicates that most cross-linguistic differences in oculomotor behavior at the word level materialize in the spatial distribution of fixations over words (e.g., which words attract fixations and which do not). Once a word is fixated, cross-linguistic variability in how long it is viewed is substantially lower, despite the diversity of the studied languages. We return to this finding below. Another approach to identifying the relative importance of predictors of reading behavior draws on a conditional inference analysis of the MECO data. The outcome of this analysis is a decision tree that identifies a hierarchy of the reading measures that most strongly predict differences between languages. More specifically, in this analysis, by-participant mean values of all reading measures serve as input to a recursive partitioning classification tree with language as the response variable. At each recursion, this procedure identifies the reading measure with the strongest association with the language variable (the response). It then implements a binary split of that measure at the value that offers the best binary partition of participants into classes representing languages. Inferential statistics for the associations between reading variables and language as the response variable, as well as the best values for partitions into classes, are estimated using permutation tests. Partitions are implemented within classes until the permutation-test p-values for the splits are no longer statistically significant. For further technical details, see Matsuki et al. (2016) and references therein. We used the ctree function from the party package (Hothorn et al., 2006) in the statistical software platform R. Figure 3 visualizes the resulting conditional inference tree.
Variables that account for splits higher up in the tree are more important than those closer to the bottom, i.e., they are more strongly associated with language as the response variable. Again, skipping rate emerged as the single most important factor in accounting for cross-linguistic variability: it was identified in the first recursion (from top to bottom) as the variable with the strongest association with language as the response variable. Only at lower portions of the tree does an additional variable (first-run number of fixations) come into play. Durational variables did not emerge as significant predictors of language as the response variable. It is noteworthy that each terminal node (representing the distribution of participants over languages in the bottom part of Fig. 3) accounts for a nontrivial percentage of readers from multiple languages. Thus, for example, the majority of Estonian and Finnish readers (29/52 Estonians, 25/49 Finns) were concentrated in the leftmost node in Fig. 3 (i.e., had a skipping rate lower than 21.6% and fewer than 1.39 first-pass fixations), but more than 40% of the participants from these two samples were scattered across other nodes. This suggests that there is no specific combination of oculomotor parameters (skipping rate, durations, regression rate, etc.) that uniquely identifies reading in any given language. Separate analyses of matched and unmatched texts converged on skipping as the strongest predictor of cross-linguistic differences (see Supplementary Materials S9). In sum, skipping rate and, to a smaller degree, the number of first-run fixations account for the most behavioral variability between languages. The salient role of skipping in predicting cross-linguistic variability in reading performance calls for further investigation.
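The core of the partitioning step can be illustrated with a simplified sketch of a single recursion: finding the measure and cut-point whose binary split best separates the language labels. Note that the analysis above used conditional inference trees with permutation tests (ctree from R's party package); the Gini impurity criterion and all names below are simplifications introduced for illustration only:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 = pure)."""
    counts = Counter(labels)
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """One recursion step of a partitioning tree, simplified: find the
    measure and cut-point whose binary split best separates the labels.

    rows:   list of dicts mapping measure name -> by-participant mean
    labels: language label for each row
    Returns (measure, threshold, impurity_reduction).
    """
    base = gini(labels)
    best = (None, None, 0.0)
    for measure in rows[0]:
        values = sorted({r[measure] for r in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2  # midpoint between adjacent observed values
            left = [l for r, l in zip(rows, labels) if r[measure] <= cut]
            right = [l for r, l in zip(rows, labels) if r[measure] > cut]
            n = len(labels)
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = base - child
            if gain > best[2]:
                best = (measure, cut, gain)
    return best
```

Applied recursively to each resulting subset (with a stopping rule such as the permutation-test criterion used in the paper), this yields a classification tree like the one in Fig. 3.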
Consistent with the literature on this aspect of oculomotor control (e.g., Brysbaert et al., 2005; Drieghe et al., 2004; Kliegl et al., 2004; Rayner & McConkie, 1976; Reilly & O'Regan, 1998; Vitu, 2011), we link skipping rate to one of the benchmark predictors of reading: word length. Specifically, we expect cross-linguistic differences in skipping rate to reflect variability in the distribution of word lengths across languages. It is a well-established finding that, within a language, longer words are skipped less often (see the references above). In fact, Kuperman et al. (2018) showed that word length has the greatest relative importance of all predictors of skipping rate in English. Accordingly, with regard to cross-linguistic variability, we expect written languages with shorter words on average to demonstrate proportionally higher skipping rates. While the majority of the reported languages are letter-based, Korean is an important exception: our calculation of word length for Korean is based on syllable-based characters 8. Separately for each language, we fitted a logistic mixed-effects regression model to the binary variable of whether the word was skipped, with word length as the sole predictor and by-participant and by-word random intercepts. Word lengths were centered (but not scaled), such that the intercept of a regression model estimated the predicted skipping rate for a word of average length in a given language; these estimates, obtained in logit units, were transformed into percentage points (i.e., the estimated percentage of skips for a word of average length). The slopes of the regression models estimated the change in skipping rate (in logit units) associated with a one-character increase in word length. Supplementary Materials S10 includes descriptive statistics of word lengths and the estimated slopes and intercepts.
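The logit-to-percentage transformation used for these estimates can be sketched as follows; the function name and any example coefficient values are hypothetical, not the fitted MECO estimates:

```python
import math

def predicted_skipping(intercept, slope, delta_length=0):
    """Predicted percentage of skips from a logistic model with centered
    word length: `intercept` is the estimate (in logits) for a word of
    average length, `slope` the change in logits per additional character,
    and `delta_length` the deviation from the average length in characters.
    (Illustrative helper; coefficient values used with it are hypothetical.)
    """
    logit = intercept + slope * delta_length
    # inverse-logit: probability = 1 / (1 + exp(-logit)), scaled to percent
    return 100 / (1 + math.exp(-logit))
```

For instance, an intercept of 0 logits corresponds to a 50% skipping rate, and a negative slope means that words longer than the language's average are skipped less often than average-length words.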
The correlation between mean word length in a given language and skipping rate estimated for that length (the model intercept, transformed to percent) was negative and very strong: r = −0.88, p < 0.001. Figure 4 illustrates the finding. Korean was a language on one extreme with a mean word length of 2.92 characters (SD = 1.27) and an estimated skipping rate of 29%. This is because one Korean character typically represents 2-3 phonological elements. On the other extreme was Finnish, with the mean word length of 7.82 characters (SD = 3.90) and the estimated skipping rate of 6%. The remaining languages followed the linear trend almost perfectly 9 . Interestingly, the correlation between mean length and model slope (i.e., the rate of change in skipping rate as a function of word length) was very weak and not significant (r = 0.19, p = 0.539). Taken together, these findings point to a strong role of visuo-oculomotor factors in explaining what makes eyemovement behavior vary across languages the most. It is well known that longer words elicit fewer skips within a language. We see that this finding generalizes across languages: i.e., a preference of a given written language for longer words comes with a lower skipping rate in reading. Conversely, it appears that every language responds to an increase of word length by one character with a roughly similar decrease in skipping rate, regardless of the language's overall gravitation towards longer or shorter words. This indicates a strong reliance of readers' probability of fixating versus skipping words on visual characteristics of the linguistic input. Since this characteristic varies widely between languages (by a factor of 2.7 in mean word lengths in our sample), so does the value of skipping rate (by a factor of 5.8). At the same time, durational measures for fixated words differ much less between languages. While mean viewing times are markedly different for some pairs of languages (see Fig. 
2), the overall cross-linguistic variability in viewing times accounts for a relatively small amount of variance compared to the within-language variability and to other eye-movement measures. Put differently, if one imagines a hypothetical reader who is equally proficient in all written languages in the present sample, most of their oculomotor accommodation to the characteristics of specific languages will be driven by word lengths and will go into adjusting the rate of skipping. Once a word is fixated, cross-linguistic differences in word lengths or other characteristics will lead to a smaller adjustment in viewing time. Our analyses so far revealed that some eye-tracking measures (i.e., skipping and, to a lesser extent, refixation) vary considerably across languages, while other measures vary less. A related question is whether some languages are overall more similar to one another in terms of the eye-movement behavior of their readers. A reasonable starting point is that reading behavior may be more similar among languages that are more similar in their structure.

[Footnote 8: Korean writing is based on a syllabic unit in which a syllable's structure of onset, nucleus, and coda is visually represented. Thus, orthographic representations of syllables like 고 and 공 consist of two or three alphabetic components that are spatially combined into one character: ㄱ and ㅗ, or ㄱ, ㅗ, and ㅇ.]

[Footnote 9: The estimates of the skipping rate for an average word length in a language were not related to the number of characters subtended by one degree of visual angle, font size, or screen size. While different across testing sites, these parameters did not underlie the observed correlation between skipping rate and average word length.]
[Figure 4. Estimated skipping rate for a word of average length as a function of mean word length in a language.]

As follows from Table 1, several languages in our sample have a similar historical origin (e.g., Germanic: Dutch, English, German, and Norwegian; Romance: Italian and Spanish); a similar writing-system type and script (e.g., nine of the 13 languages in our sample are alphabetic and written in Latin-based scripts); or a similar type of morphology and level of orthographic transparency. We examined whether linguistic similarity between languages translated into similarity in the oculomotor patterns of their readers. To this end, we selected mean values of three eye-movement measures that represent different aspects of oculomotor behavior, skipping rate, gaze duration, and total number of fixations, as a vector representing every participant. These variables were selected to reflect, with little redundancy, the probability of fixating versus skipping a word, the time spent viewing the word in the first pass, and the total effort of viewing the word. (A solution that additionally included total fixation duration was also run and produced the same result; we report the more parsimonious solution below.) We calculated the Euclidean distance between all pairs of (scaled) participant vectors and aggregated these participant-level data to compute an average distance between each pair of languages. This distance matrix was supplied as input to a hierarchical cluster analysis using the Ward clustering criterion (the hclust function in R; Langfelder & Horvath, 2012). Figure 5 reports the clustering solution. The first partition (from top to bottom) separates Finnish, Estonian, and Turkish from the remainder of the languages. A lower partition on the right side of the tree (which contains the remaining ten languages) separated two clusters of five languages each (Spanish, Dutch, Korean, English, and Russian versus Greek, Norwegian, Hebrew, German, and Italian). This clustering rules out several logical possibilities for behavioral commonalities. Thus, languages that have largely overlapping lexicons and broad similarities in their phonology, morphology, and orthography appear to be no closer to each other in their behavioral patterns than to other languages. In particular, Germanic and Romance languages are dispersed over multiple clusters rather than grouped together. Furthermore, similarities in script were inconsequential: Hebrew, Korean, Russian, and Greek were dispersed among languages using Latin-based scripts rather than grouped together. In fact, the only potential criterion that separated some clusters from others was word length, which was in turn related to skipping rate (see above): Finnish, Estonian, and Turkish, which form a cluster distinct from the other languages, are the languages with the longest words in the sample. They are also agglutinative and highly orthographically transparent, factors that both contribute to increased word length. It is possible that a clear-cut organizing principle exists in the clustering of languages based on reading behavior but is masked from us by the relatively small sample sizes and possible sampling biases in the respective languages. At this point, we can only conclude that even if similarities of a linguistic nature do lead to cross-linguistic similarities in reading behavior, these tendencies are subtle.

The inspiration for this paper is that empirical science both drives and is driven by accessibility to high-quality and large-scale data.
The open science movement in the cognitive sciences has adopted this notion, leading to a constantly growing number of collaborative multi-lab studies aimed at providing theories with such data (e.g., Hagger et al., 2016; Open Science Collaboration, 2015; ManyBabies Consortium, 2020). However, beyond the typical requirements of multi-lab investigations, a collaborative study of reading must additionally reflect the striking diversity of languages (which vary in their phonology, morphology, and syntax), including written languages (which embody a range of solutions for representing speech in print). This is essential because theories of reading that claim any degree of cross-linguistic coverage must be tested with comparable data from multiple languages, obtained using comparable designs in format, content, task, and data collection methods. The present paper provides the field of reading with such necessary data. We specifically focus on eye-tracking methodology, which is arguably the most ecologically valid and temporally sensitive record of reading behavior; indeed, eye movements are part and parcel of reading itself. We began by examining whether the need for cross-linguistic data has already been satisfied in studies of eye movements during reading, using a bibliometric analysis of relevant publications over the last two decades to estimate the field's cross-linguistic coverage. The analysis reported in Part I revealed clear biases towards a handful of languages: with the exception of Chinese, well-represented languages tend to be alphabetic, Roman script-based, and European (mostly Indo-European, with an expected further bias towards English). Moreover, the number of studies that conducted a coordinated examination of more than one language is very small, and no study has covered more than three languages.
In Part II of this paper, we introduced the Multilingual Eye-movement Corpus (MECO): a collaborative international project aimed at addressing the need for comparable cross-linguistic data. MECO comprises eye-tracking data for reading in the first (dominant) language, reading in English (a non-dominant language for all but one sample), and a battery of individual differences tests both in the readers' first language and in English. In the current first release of MECO, we report first-language reading data from laboratories in 13 countries and languages. These 13 languages exemplify a typologically wide range of phonological, morphological, and syntactic systems, originating from multiple language families. MECO thus makes possible a direct comparison between different writing systems (alphabets and abjads) and scripts (alphabetic Roman- and non-Roman-based, Hebrew, and Hangul). The reading materials were 12 encyclopedic texts, including both translation-equivalent and untranslated materials. Participants were university students in their respective countries with the language of testing as their dominant language. The MECO eye-movement record includes information on a broad range of oculomotor measures. It is further supplemented by data on comprehension accuracy, demographic and linguistic background, and tests of individual differences, some of which were shared across all samples while others were specific to each language. In the spirit of open science, the MECO data, materials, and code are made available to promote cross-linguistic collaborative research on reading and to advance reproducibility. MECO therefore constitutes a valuable tool for addressing novel reading research questions across a wide variety of languages without the need to collect (eye-tracking) data. It is also accessible to researchers working on less-studied languages who may not have the necessary equipment to run eye-tracking experiments at their disposal.
In Part III of the paper, we demonstrated the utility of the MECO data with a comparative analysis aimed at characterizing similarities and differences in cross-linguistic reading patterns across all languages in the corpus. The main finding is that the oculomotor measure that most strongly differentiates reading behavior across languages is skipping rate. That is, languages differ in the likelihood that readers fixate a word at least once versus skip it altogether. In turn, we found that the skipping rate in a language is very strongly determined by the average word length in that language, with languages gravitating towards longer words (e.g., Finnish, Estonian, or Turkish) showing an overall lower skipping rate than those with shorter words (e.g., Korean or Hebrew). These systematic patterns were observed in both semantically matched and unmatched texts, suggesting that they are robust to natural variability in topic and propositional content. Remarkably, neither the differences in word length nor other linguistic characteristics of the current set of languages showed a noticeable systematic influence on any other oculomotor measure. In particular, there were only minor differences in fixation durations across languages (in either first fixation durations or total fixation durations). In all languages, if readers select a word for fixation, they tend to spend a similar time viewing it on average. This suggests that viewing times mainly reflect core language processes rather than surface characteristics of languages at the linguistic levels examined in this paper. Of course, this finding does not imply that multilingual investigations of reading times (and of all other behavioral measures of reading) are unnecessary. Our results pertain to the effect of general linguistic features on reading times and do not necessarily extend to other phenomena that are language-specific and could (and should) be investigated cross-linguistically.
It may be tempting to couch the discussion of the cross-linguistic impact of word length and skipping rate entirely in visual terms, with the count of characters and the space they occupy on a screen driving oculomotor planning and execution (see references above). It is important to realize, however, that cross-linguistic differences in word length reflect fundamental properties of written languages (see discussion in Liversedge et al., 2016). Whether the writing system a language adopts uses its symbols to represent all individual sounds (alphabetic), some types of sounds (consonantal alphabets, or abjads), syllables (syllabaries), or entire words (logographic) has a profound impact on word lengths in that system. Similarly, how the characters of a language package phonological information visually also affects word length (as in Korean Hangul, which combines letters into syllabic blocks). Other influential factors include orthographic transparency (the degree of completeness and consistency with which orthographic words reflect words' phonology), the use of function words (e.g., articles, prepositions), and the type of morphology (e.g., agglutinative languages like Finnish or Turkish, where markers of syntactic functions are affixed to the word, versus isolating languages like English, where they are expressed as separate function words). Moreover, specific orthographic conventions within a language affect word length. For instance, Hebrew does not allow single-letter orthographic words. In the same vein, German, Dutch, Norwegian, and Finnish allow very long unspaced compounds, while English introduces spaces between some constituents. Thus, word length, and skipping rate as its behavioral counterpart, are strongly related to the architecture of a written language and its relationship with the oral language.
With the present set of languages, we do not yet have sufficient cross-language coverage and variability to assess the systematic influence of specific linguistic features like script, typological family, morphological type, or orthographic transparency. This question can be addressed as a wider range of languages is added to MECO. Another noteworthy finding concerns the surprising lack of similarity in reading behavior between written languages that are genetically or typologically related. That is, the clustering solution based on major predictors of eye-movement behavior grouped languages together in a way that, to our knowledge, does not reflect any accepted classification of either oral languages or writing systems (Daniels & Bright, 1996; Dryer & Haspelmath, 2013). This finding hints at the possibility that behavioral patterns during reading are mostly guided by features of input texts that are not accounted for and do not easily translate into existing language classifications. If so, a new, behaviorally relevant map of language structures may be required. Our conclusions are based on texts, some of which were translated from English into the other MECO languages, and some of which were not (though they were constructed to represent the same topic and genre). This manipulation pursued the goal of determining how essential close semantic matching is for different cross-linguistic investigations. This question is methodologically critical. If translated texts are imperative for reaching sound comparative conclusions about any aspect of reading, no previous data are valid material for cross-linguistic comparisons unless based on translations. A full exploration of which cross-linguistic reading behavior patterns hold in both matched and unmatched materials is beyond the scope of this first paper. Still, in all the analyses of MECO data above, highly similar results are obtained in matched and unmatched materials.
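The kind of clustering solution referred to above can be illustrated with a short sketch. The per-language numbers below are invented and only mimic the qualitative pattern discussed in the text, and Ward agglomeration is one reasonable method choice rather than a claim about the analysis pipeline used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Invented per-language profiles: (skipping rate, first fixation ms, total ms).
langs = ["fi", "tr", "en", "he", "ko"]
X = np.array([
    [0.10, 210, 420],
    [0.12, 205, 410],
    [0.25, 215, 330],
    [0.35, 200, 300],
    [0.38, 195, 290],
])

# z-score each measure so that no single scale dominates the distances
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward agglomeration on Euclidean distances between language profiles
tree = linkage(pdist(Z), method="ward")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(langs, clusters)))
```

With these invented profiles, the short-word languages (here, "he" and "ko") separate from the long-word languages, showing how a behavioral grouping can emerge that need not coincide with genetic or typological classifications.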
Namely, we found similar descriptive patterns in matched and unmatched texts; similar estimates of the variance explained by Language when estimated based on reading of matched texts, unmatched texts, or their combination; and a similar role of skipping as the strongest predictor of cross-linguistic differences. Thus, in pursuing the present set of analytical goals, mostly tied to the level of the word, we did not find matching through translation to be a relevant factor, at least not within the genre of encyclopedic expository passages. We further showed high correlations between by-participant means of eye-movement measures computed on matched and unmatched texts at the various sites (see Supplementary Materials S2). This suggests that for a certain range of research questions and phenomena (in particular, those that employ by-participant means of different eye-movement measures), the requirement of close semantic matching across languages may be relaxed. We emphatically do not imply that all questions can be answered without resorting to translated texts. To give one example, Liversedge et al. (2016) demonstrated that a fruitful study of global, cumulative eye-movement patterns at the sentence and passage level demands thorough semantic matching across languages. For such questions, researchers should use only the matched portion of the MECO corpus. Importantly, since it may not be clear a priori which aspects of reading are or are not critically bound to close semantic similarity between materials across languages, MECO can serve as a testbed for addressing this question and thus guide design decisions in future work. We view MECO as a living organism that undergoes evolution, partly as a way to remedy its limitations, and thus discuss current limitations and future directions jointly. An important goal for the MECO project is to expand its coverage of individual languages, language groups, and writing systems.
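The matched-unmatched stability check summarized in Supplementary Materials S2 amounts to correlating two vectors of by-participant means. The sketch below uses invented numbers standing in for real by-participant means: each reader's stable level plus independent text-specific noise.

```python
import numpy as np

# Invented by-participant mean fixation durations (ms) for one sample,
# computed once on matched (translated) texts and once on unmatched texts.
rng = np.random.default_rng(0)
trait = rng.normal(250, 30, 50)             # each reader's stable level
matched = trait + rng.normal(0, 10, 50)     # plus text-specific noise
unmatched = trait + rng.normal(0, 10, 50)

r = np.corrcoef(matched, unmatched)[0, 1]
print(f"matched-unmatched correlation: r = {r:.2f}")
```

When between-reader variance dominates text-specific noise, as in this toy setup, the correlation is high, which is the signature that licenses relaxing the semantic-matching requirement for by-participant analyses.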
This will correct the current over-representation of alphabetic languages and of languages that use spacing for overt segmentation of characters into words or syllables. A future release of MECO will contain an additional range of languages. A further limitation is that there is currently no systematic way to disentangle behaviors specific to a given sample (e.g., the selectivity of its university, the variability in reading proficiency within a country, and the testing procedures in specific labs) from behaviors dictated by particulars of the language. While we view this as an inherent limitation of cross-language research, one way to mitigate it is to collect multiple samples for each language. Within-language samples can represent regional varieties, differences in the educational or social backgrounds of readers, and differences between universities in how participants are asked to read. Importantly, we invite additional collaborators to participate in this multi-lab initiative. Both new languages and additional samples from currently included languages are welcome and are critical for expanding the present resource and increasing its variability and reliability. We hope that public access to all MECO materials and procedures will facilitate this expansion. Guidelines for how to participate are provided at the MECO project website, www.meco-read.com. A related limitation has to do with statistical power. Current samples (mostly 45-55 participants) afford sufficient power for some types of analyses but limited power for others. The exact power estimates obviously depend on the unit and type of analysis and the expected effect size, but the general estimates are as follows. MECO is expected to provide sufficient statistical power to observe effects of medium size (|d| ≥ 0.4) in sentence-level analyses, even when only two language samples are considered and even when matched and unmatched texts are examined separately.
This is because each such condition would generally meet the 80% power requirement of 40 participants × 40 observations estimated by Brysbaert and Stevens (2018). By extension, sufficient power is also expected at the word level, where a much larger number of observations is available in each language sample, even when split into matched and unmatched texts. Note, however, that analyses in which each participant contributes only one data point, and in which languages are examined individually or compared pairwise, may be characterized by limited power. In future releases of MECO, this will be addressed in two ways. First, some laboratories will increase their samples, especially those where pandemic-related closures thwarted data collection. A second increase in the number of observations may come from further developing the popEye software package (Schroeder, 2019), which we use here for automatic analysis of eye fixation locations (thus avoiding an error-prone manual process). Refined algorithms may also reduce the number of observations the software excludes due to minor calibration deficiencies or head movements. Despite these limitations, it is clear that even this first release of MECO provides unprecedented statistical power for cross-language analyses, with more than 500 participants providing almost 800,000 data points. As stated above, MECO is a resource that can be used to pursue a variety of research questions. For instance, additional work is needed to investigate cross-linguistic effects of benchmark predictors of reading behavior, including the effects of word length, frequency, and predictability in context (Rayner, 2009). We expect future MECO updates to incorporate frequency and predictability estimates for most languages and to provide analyses comparing benchmark effects across languages and writing systems.
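A rough intuition for why 40 participants × 40 observations can suffice for |d| ≥ 0.4 is that averaging many observations per participant shrinks trial-level noise, boosting the effective effect size of a between-group comparison of by-participant means. The Monte Carlo sketch below illustrates this; the 30%/70% split of between- versus within-participant variance is an assumption for illustration, not an estimate from the MECO data.

```python
import numpy as np
from scipy import stats

# Monte Carlo sketch of a 40-participants x 40-observations two-group design
# with a true effect of |d| = 0.4 in total-SD units.
rng = np.random.default_rng(1)
n_subj, n_obs, d = 40, 40, 0.4
sd_subj, sd_resid = np.sqrt(0.3), np.sqrt(0.7)  # assumed variance split

def group_means(shift):
    """By-participant means: participant level plus averaged trial noise."""
    subj = rng.normal(shift, sd_subj, n_subj)
    # averaging n_obs trials shrinks residual noise by sqrt(n_obs)
    return subj + rng.normal(0, sd_resid / np.sqrt(n_obs), n_subj)

n_sims = 500
hits = sum(stats.ttest_ind(group_means(0.0), group_means(d)).pvalue < 0.05
           for _ in range(n_sims))
power = hits / n_sims
print(f"estimated power: {power:.2f}")
```

Under these assumptions the simulated power comfortably exceeds 80%, whereas a design with a single observation per participant would fall well short of it for the same effect size.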
Also, our present focus was on word-level reading behavior across languages: a follow-up is recommended to tackle cross-linguistic variability at the sentence and passage levels. An additional examination of spatial information regarding landing positions and amplitudes of saccadic eye movements will shed further light on the "where" aspect of oculomotor control. Furthermore, we expect analyses relating skill tests in individual languages to patterns of reading behavior to contribute to the growing literature on individual differences in reading (e.g., Radach & Kennedy, 2013). Another example of an interesting prospective study derives from the finding that skipping versus fixation probability accounts for most of the cross-linguistic differences. Further research may examine whether and how these differences reflect trade-offs between spatial fixation density and reading times (e.g., Radach & Heller, 2000), possibly based on the orthographic or morphological complexity of individual languages (we thank Ralph Radach for this and many other helpful suggestions). A final future direction that we outline here departs from the present focus on a bird's-eye view of reading behavior across written languages of the world. We believe that MECO can also be useful for more particular questions of psychological and linguistic interest and for more specific examinations of individual languages and language groups.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.3758/s13428-021-01772-6.

References
The English Lexicon Project
WebGazeAnalyzer: a system for capturing and analyzing web reading behavior using eye gaze
How many words do we read per minute? A review and meta-analysis of reading rate
Word skipping: Implications for theories of eye movement control in reading
Power analysis and effect size in mixed effects models: A tutorial
Algorithms for the automated correction of vertical drift in eye-tracking data
Software for the automatic correction of recorded eye fixation locations in reading experiments
Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading
Writing system variation and its consequences for reading and dyslexia
Word skipping in reading: On the interplay of linguistic and visual factors
The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology
Orthography and the development of reading processes: An eye-movement study of Chinese and English
Reading time data for evaluating broad-coverage models of English sentence processing
Towards a universal model of reading
Comparison of reading capacity for
A multilab preregistered replication of the ego-depletion effect
Unbiased recursive partitioning: A conditional inference framework
Integration and prediction difficulty in Hindi sentence comprehension: Evidence from an eye-tracking corpus
Definition and computation of oculomotor measures in the study of cognitive processes
Reading as a perceptual process
Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments
Length, frequency, and predictability effects of words on eye movements in reading
Tracking the mind during reading: The influence of past, present, and future words on fixation durations
Learning to read across languages: Cross-linguistic relationships in first- and second-language literacy development
Contributions of reader- and text-level characteristics to eye-movement patterns during passage reading
Fast R functions for robust correlations and hierarchical clustering
Russian Sentence Corpus: Benchmark measures of eye movements in reading in Russian
Beyond isolated word recognition
Universality in eye movements and reading: A trilingual investigation
The Provo Corpus: A large eye-tracking corpus with predictability norms
Quantifying sources of variability in infancy research using the infant-directed-speech preference
The Language Experience and Proficiency Questionnaire (LEAP-Q): Assessing language profiles in bilinguals and multilinguals
The Random Forests statistical technique: An examination of its value for the study of reading
Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P)
Estimating the reproducibility of psychological science
The Beijing Sentence Corpus: A Chinese sentence corpus with eye movement data and predictability norms
Methodological issues in literacy research across languages: Evidence from alphabetic orthographies
An influence over eye movements in reading exerted from beyond the level of the word: Evidence from reading English and French
Relations between spatial and temporal aspects of eye movement control
Models of oculomotor control in reading: Toward a theoretical foundation of current debates
Effects of orthographic consistency on eye movement behavior: German and English children and adults process the same words differently
Eye movements in reading and information processing: 20 years of research
Eye movements and attention in reading, scene perception, and visual search
Eye movements as reflections of comprehension processes in reading. Scientific Studies of Reading
Linguistic and cognitive influences on eye movements during reading
What guides a reader's eye movements? Vision Research
Psychology of reading
Eye movement control during reading: A simulation of some word-targeting strategies
popEye: Analysis of eye-tracking data from reading experiments
Words in context: The effects of length, frequency, and predictability on brain responses during natural reading
Foundation literacy acquisition in European orthographies
False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant
On the Anglocentricities of current reading research and practice: The perils of overreliance on an "outlier" orthography
Alphabetism in reading science
Improving the performance of eye trackers with limited spatial accuracy and low sampling rates for reading analysis by heuristic fixation-to-word mapping
Correlation calculated from faulty data
How reliable are individual differences in eye movements in reading?
Eye movements in reading Chinese and English text
EyeMap: A software system for visualizing and analyzing eye movement data in reading
Learning to read across languages and writing systems
On the role of visual and oculomotor processes in reading
Grundintelligenzskala 2 mit Wortschatztest und Zahlenfolgetest [Basic intelligence scale 2 with vocabulary knowledge test and sequential number test]
Second-language experience modulates first- and second-language word frequency effects: Evidence from eye movement measures of natural paragraph reading
Reading acquisition, developmental dyslexia, and skilled reading across languages: A psycholinguistic grain size theory

Open Practices Statement MECO's data and materials are made available at the project's OSF page; see the "data availability" section above for details.

Acknowledgments We wish to thank the following individuals: Mariam Bekhet, Paige Cater, John Connolly, Melda Coskun Karadag, Connie Imbault, Alyssa Janes, Shani Kahta, Minji Kang, Evgenia-Peristera Kouki, Elizaveta Kuzmina, Nadia Lana, Sean McCarron, Kelly Nisbet, Victoria Ong, Anat Prior, Eva Saks, Elisabet Service, Anna Swain, Heather Wild, Sophia Yang, and Laoura Ziaka. Thanks are due to Ralph Radach and two anonymous reviewers for valuable comments on earlier drafts.