key: cord-0784210-4pulps6g
authors: So, Kevin K. H.; To, Carol K. S.
title: Systematic Review and Meta-Analysis of Screening Tools for Language Disorder
date: 2022-02-23
journal: Front Pediatr
DOI: 10.3389/fped.2022.801220
sha: a8255ed4e06f5a67d64592042ef672fe43b53e45
doc_id: 784210
cord_uid: 4pulps6g

Language disorder is one of the most prevalent developmental disorders and is associated with long-term sequelae. However, routine screening is still controversial and is not universally part of early childhood health surveillance. Evidence concerning the detection accuracy, benefits, and harms of screening for language disorders remains inadequate, as shown in a previous review. In October 2020, a systematic review was conducted to investigate the accuracy of available screening tools and the potential sources of variability. A literature search was conducted using CINAHL Plus, ComDisDome, PsycInfo, PsycArticles, ERIC, PubMed, Web of Science, and Scopus. Studies describing, developing, or validating screening tools for language disorder under the age of 6 were included. QUADAS-2 was used to evaluate risk of bias in individual studies. Meta-analyses were performed on the reported accuracy of the screening tools examined. The performance of the screening tools was explored by plotting hierarchical summary receiver operating characteristic (HSROC) curves. The effects of the proxy used in defining language disorders, the test administrators, the screening-diagnosis interval, and the age of screening on screening accuracy were investigated by meta-regression. Of the 2,366 articles located, 47 studies involving 67 screening tools were included. About one-third of the tests (35.4%) achieved at least fair accuracy, while only a small proportion (13.8%) achieved good accuracy.
HSROC curves revealed a remarkable variation in sensitivity and specificity for the three major types of screening, which used the child's actual language ability, clinical markers, and both as the proxy, respectively. None of these three types of screening tools achieved good accuracy. Meta-regression showed that tools using the child's actual language as the proxy demonstrated better sensitivity than those using clinical markers. Tools using long screening-diagnosis intervals had a lower sensitivity than those using short screening-diagnosis intervals. Parent-report tools showed a level of accuracy comparable to that of tools administered by trained examiners. Screening tools used under and above 4yo appeared to have similar sensitivity and specificity. In conclusion, there are still gaps between the available screening tools for language disorders and the adoption of these tools in population screening. Future tool development can focus on maximizing accuracy and identifying metrics that are sensitive to the dynamic nature of language development. SYSTEMATIC REVIEW REGISTRATION: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=210505, PROSPERO: CRD42020210505.

Language disorder refers to persistent language problems that can negatively affect social and educational aspects of an individual's life (1). It is prevalent and estimated to affect around 7.6% of the population (2). Children with language disorder may experience difficulties in comprehension and/or in the use of expressive language (3). Persistent developmental language disorder not only has a negative impact on communication but is also associated with disturbance in various areas such as behavioral problems (4), socio-emotional problems (5), and academic underachievement (6). Early identification of persistent language disorder is challenging. There is substantial variability in the trajectories of early language development (7, 8).
Some children display consistently low language ability, some appear to resolve their language difficulties as they grow older, and some demonstrate apparently typical early development but develop late-emerging language disorder. This dynamic nature of early language development has introduced difficulties in the identification process in practice (9). Therefore, rather than receiving a one-off assessment, late talkers under 2 years old are recommended to be reassessed later. Referral for evaluation may not be based on positive results in universal screening, but mainly on concerns from caregivers, the presence of extreme deviation in development, or the manifestation of behavioral or psychiatric disturbances under 5 years old (9). Those who have language problems in the absence of the above conditions are likely to be referred for evaluation after 5 years old. Only then will they usually receive diagnostic assessment. Ideally, screening should identify at-risk children early enough to provide intervention and avoid or minimize adverse consequences for them, their families, and society, improving the well-being of the children and the health outcomes of the population at a reasonable cost.

Despite the high prevalence and substantial impact of language disorder, universal screening for language disorder is not part of every child health surveillance program. Screening in the early developmental stages is controversial (10). While early identification has been advocated to support early intervention, there are concerns about the net cost and benefits of these early screening exercises. For example, the US Preventive Services Task Force reviewed evidence concerning screening for speech and language delay and concluded that there was inadequate evidence regarding the accuracy, benefits, and harms of screening. The Task Force therefore did not support routine screening in asymptomatic children (11).
This has raised concerns in the professional community among those who believe in the benefits of routine screening (12). However, it is undeniable that another contributing factor for the recommendation of the Task Force was that screening tools for language disorder vary greatly in design and construct, resulting in variability in identification accuracy.

Previous reviews of screening tools for early language disorders have shown that these tools make use of different proxies for defining language issues, including a child's actual language ability, clinical markers such as non-word repetition, or both (13). Screening tools have been developed for children at different ages [e.g., toddlers (14) and preschoolers (15)] given the higher stability of language status at a later time point (16, 17). Screening tools also differ in the format of administration. For example, some tools take the form of a parent-report questionnaire, while others have to be administered by trained examiners via direct assessment or observation. Besides the test design, methodological variations have also been noted in primary validation studies, such as the validation sample, the reference standards (i.e., the gold standard for language disorder), and the screening-diagnosis interval. These variations might eventually lead to different levels of screening accuracy, as has been pointed out in previous systematic reviews (10, 13). These variations have been examined in terms of screening accuracy (13). Parent-report instruments and trained-examiner screeners have been found to be comparable in screening accuracy. In longitudinal studies in which language disorder status has been validated at various time points, accuracy appears to be lower for longer-term prediction than for concurrent prediction.
Although the reviews have provided a comprehensive overview of the variations in different language screening tools, the analyses have mainly been based on qualitative and descriptive data. In the current study we performed a systematic review of all currently available screening tools for early language disorders that have been validated against a reference standard. First, we report on the variations noted in terms of (1) the type of proxy used in defining language disorders, (2) the type of test administrators, (3) the screening-diagnosis intervals, and (4) the age of screening. Second, we conducted a meta-analysis of the diagnostic accuracy of the screening tools and examined the contributions of the above four factors to accuracy.

The protocol for the current systematic review was registered at PROSPERO, an international prospective register of systematic reviews (Registration ID: CRD42020210505; the record can be found at https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=210505). Due to COVID-19, the registration was published with basic automatic checks of eligibility by the PROSPERO team. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses for Diagnostic Test Accuracy (PRISMA-DTA) (18) checklist was used as a guide for the reporting of this review.

A systematic search of the literature was conducted in October 2020 using the following databases: CINAHL Plus, ComDisDome, PsycINFO, PsycArticles, ERIC, PubMed, Web of Science, and Scopus. The major search terms were as follows: Child* OR Preschool* AND "Language disorder*" OR "language impairment*" OR "language delay" AND Screening OR identif*. To be as exhaustive as possible, the earliest studies available in the databases and those up to October 2020 were retrieved and screened. Appendix A Table A1 shows the detailed search strategies in each database. Articles from the previous reviews were also retrieved.
The titles, abstracts, and then the full texts were assessed for relevance and eligibility. Cross-sectional or prospective studies validating screening tools or comparing different screening tools for language disorders were included in the review. The focus was on screening tools validated with children aged 6 or under from the general population or those with referral, regardless of the administration format of the tools or how language disorder was defined in the studies. Studies that did not report adequate data on the screening results, and in which accuracy data could not be deduced from the data reported, were excluded from the review (see Appendix A Table A2 for details).

Data were extracted by the first author using a standard data extraction form. The principal diagnostic accuracy measures extracted were test sensitivity and specificity. The numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were also extracted. Where the text description and the reported data disagreed, sensitivity and specificity were calculated from the 2 by 2 contingency tables. The data extraction process was repeated after the first extraction to improve accuracy. Screening tools with both sensitivity and specificity exceeding 0.90 were regarded as good and those with both measures exceeding 0.80 but below 0.90 were regarded as fair (19).

Quality assessment of included articles was conducted by the first author using QUADAS-2 by Whiting et al. (20). QUADAS-2 assists in assessing risk of bias (ROB) in diagnostic test accuracy studies with signaling questions concerning four major domains. The ROB was evaluated in patient selection, patient flow, the index tests (the screening tools in the current review), and the reference standard tests. Ratings of ROB for individual studies were illustrated using a traffic light plot.
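As an illustration of the accuracy measures and the good/fair thresholds described above, the following is a minimal sketch rather than the authors' extraction procedure. The names `TwoByTwo` and `grade_accuracy` and all counts are hypothetical, and the sketch assumes that a tool with both measures above 0.80, but not both above 0.90, is graded as fair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2 by 2 contingency table (index test vs. reference standard)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def sensitivity(self) -> float:
        # Proportion of children with language disorder who screen positive.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of children without language disorder who screen negative.
        return self.tn / (self.tn + self.fp)


def grade_accuracy(sensitivity: float, specificity: float) -> str:
    """Grade a tool using the 0.90 and 0.80 thresholds (boundary handling is an assumption)."""
    if sensitivity > 0.90 and specificity > 0.90:
        return "good"
    if sensitivity > 0.80 and specificity > 0.80:
        return "fair"
    return "below fair"


# Hypothetical counts, not taken from any included study.
table = TwoByTwo(tp=17, fp=15, fn=3, tn=85)
print(table.sensitivity, table.specificity,
      grade_accuracy(table.sensitivity, table.specificity))
```

Recomputing both measures from the raw counts in this way is what allows a discrepancy between a study's text and its reported data to be resolved.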
A summary ROB figure weighted by sample size was generated using the R package "robvis" (21). Due to the large discrepancy in sample size across studies, an unweighted summary plot was also generated to show the ROB of the included studies.

The overall accuracy of the tools was compared using descriptive statistics. Because sensitivity and specificity are correlated, increasing either one of them by varying the cut-off for test positivity would usually result in a decrease in the other. Therefore, a bivariate approach was used to jointly model sensitivity and specificity (22) in generating hierarchical summary receiver-operating characteristic (HSROC) curves to assess the overall accuracy of screening by proxy and by screening-diagnosis interval. HSROC is a more robust method that accounts for both within-study and between-study variability (23).

Three factors that could be associated with screening accuracy, chosen a priori, were included in the meta-analysis: proxy used, test administrators, and screening-diagnosis interval. The effect of screening age on accuracy was also evaluated. The effect of each variable was evaluated using a separate regression model. The variable of proxy used was categorical, with the categories being "child's actual language," "performance in clinical markers," and "using both actual language and performance in clinical markers." Test administrator was also a categorical variable, with the categories being "parent" and "trained-examiners." The variable of screening-diagnosis interval was dichotomously defined: intervals within 6 months were categorized as evaluating concurrent validity, whereas intervals of more than 6 months were categorized as evaluating predictive validity. The variable of screening age was also dichotomously defined with age 4 as the cut-off: screening for children under the age of 4yo and screening for children above 4yo.
This categorization was primarily based on the age range of the sample, or the target screening age reported by the authors. Studies with an age range that spanned age 4 were excluded from the analysis. Considering the different thresholds used across studies and the correlated nature of sensitivity and specificity, meta-regression was conducted using a bivariate random effect model based on Reitsma et al. (22). For studies examining multiple index tests and/or multiple cut-offs using the same population, only one screening test per category per study was included in the HSROC and meta-regression models. The test or cut-off with the highest Youden's index was included in the meta-analytical models. Youden's index, J, was defined as

$$J = \mathrm{Sensitivity} + \mathrm{Specificity} - 1$$

All data analyses were conducted with RStudio Version 1.4.1106 using the package mada (24). Sensitivity analysis was carried out to exclude studies with a very high ROB (those with 2 or more ratings indicating a high risk) to assess its influence on the results.

A total of 2,351 articles, including 815 duplicates, were located using the search strategies, and an additional 15 articles were identified from previous review articles. After the inclusion and exclusion criteria were applied, a final sample of 47 studies was identified for inclusion in the review. Figure 1 shows the number of articles included and excluded at each stage of the literature search.

The weighted overall ROB assessment for the 47 studies is shown in Figure 2A, and the individual rating for each study is shown in Appendix B. Overall, half of the data was exposed to a high ROB in the administration and interpretation of the reference standard test, while almost two-thirds of the data had a high ROB in the flow and timing of the study.
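To make concrete the selection rule based on Youden's index described in the methods, the following minimal sketch picks, from several cut-offs reported for one tool, the one with the highest J. The function names and the cut-off labels and values are hypothetical and are not drawn from the included studies; the analyses in the review itself were run in R with the mada package.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's index: J = Sensitivity + Specificity - 1."""
    return sensitivity + specificity - 1.0


def select_best_cutoff(candidates: list[dict]) -> dict:
    """Return the candidate cut-off with the highest Youden's index."""
    return max(
        candidates,
        key=lambda c: youden_index(c["sensitivity"], c["specificity"]),
    )


# Hypothetical cut-offs reported for a single screening tool in one study.
candidates = [
    {"cutoff": "10th percentile", "sensitivity": 0.70, "specificity": 0.95},
    {"cutoff": "15th percentile", "sensitivity": 0.85, "specificity": 0.88},
    {"cutoff": "20th percentile", "sensitivity": 0.92, "specificity": 0.70},
]

best = select_best_cutoff(candidates)
print(best["cutoff"], round(youden_index(best["sensitivity"], best["specificity"]), 2))
```

Because the best-performing cut-off is chosen after the fact, a rule of this kind tends to favor optimistic estimates, which is worth keeping in mind when reading the pooled results.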
As indicated by the unweighted overall ROB summary plot in Figure 2B, half of the 47 studies were unclear about whether the administration and interpretation of the reference standard test would introduce bias. This was mainly attributable to a lack of reporting of the reference standard test performance. About half of the studies had a high ROB in the flow and timing of the study. This usually arose from a highly variable or lengthy follow-up period.

A total of 67 different index tests (or indices) were evaluated in the 47 included articles. The tests were either individual tests per se or part of a larger developmental test. The majority (50/67, 74.6%) of the screening tools examined children's actual language. Thirty of these index tests involved parents or caregivers as the main informants. Some of these screening tools were in the form of a questionnaire with Yes-No questions regarding children's prelinguistic skills, receptive language, or expressive language based on parents' observations. Some used a vocabulary checklist (e.g., CDI, LDS) in which parents checked off the vocabulary their child was able to comprehend and/or produce. Some tools also asked parents to report their child's longest utterances according to their observation and generated indices.

[Table notes: a Age of screening is reported as a range or mean in the form of X1-X2 and M = X3; where the range or mean is not reported, the intended age for screening of the tool is reported as X4. b Based on Plante and Vance (19), Fair = over 0.8 in both sensitivity and specificity; Good = over 0.9 in both sensitivity and specificity. c Not included because the sample was identical to Klee et al. (65).]

The other 20 index tests on language areas were administered by trained examiners such as nurses, pediatricians, health visitors, or speech language pathologists (SLPs).
These screening tools were constructed as checklists, observational evaluations, or direct assessments, tapping into children's developmental milestones, their word combinations, and/or their comprehension, expression, and/or articulation. Some of these direct assessments involved the use of objects or pictures as testing stimuli for children.

A small proportion (3/67, 4.48%) of tests evaluated performance on clinical markers, including non-word repetition and sentence repetition, rather than children's actual structural language skills or communication skills. About nine percent (6/67, 8.96%) screened for both language abilities and clinical markers. Both types of tests required trained examiners to administer them. The tests usually made use of a sentence repetition task, and one test also included non-word repetition.

Another nine percent (6/67, 8.96%) utilized indices from language sampling, such as percentage of grammatical utterances (PGU), mean length of utterances in words (MLU3-W), and number of different words (NDW), as proxies. These indices represented a child's syntactic, semantic, or morphological performance.

The smallest proportion (2/67, 2.99%) of the tests elicited parental concerns about their children being screened for language disorder. One asked parents to rate their concern using a visual analog scale, while the other involved interviews with the parents by a trained examiner.

Sixty-five of the 67 screening tools reported concurrent validity. Tables 1-5 summarize the characteristics of these 65 studies by the proxy used. Nine studies investigated the predictive validity of screening tools. Table 6 summarizes the studies. All the studies used the child's actual language ability as the proxy.

Screening tools based on children's actual language ability had a sensitivity ranging from 0.46 to 1 (median = 0.81) and a specificity of 0.45 to 1 (median = 0.86).
About 30% of the studies showed that their tools achieved at least fair accuracy, while 8.89% achieved good accuracy. Screening tools using clinical markers had a sensitivity ranging from 0.3 to 1 (median = 0.71) and a specificity of 0.45 to 1 (median = 0.91).

[Table notes: For tests that were validated against multiple cut-offs, only the one with the highest Youden's index is shown. Sc. Age, screening age; LI2, language impairment at age 2; LI3, language impairment at age 3.]

Two of the five studies (40%) evaluating screening tools based on clinical markers showed that their tools had good sensitivity and good specificity, but the other three studies showed a sensitivity and a specificity below fair. Concerning screening tools based on both actual language ability and clinical marker performance, the sensitivity ranged from 0.36 to 1 (median = 0.84) and the specificity ranged from 0.81 to 0.96 (median = 0.93). More than half of these studies (4/7, 57.1%) achieved at least fair performance in both sensitivity and specificity, and 3 of the 7 studies achieved good performance.
Screening tools based on indices from language sampling had sensitivity ranging from 0.59 to 1 (median = 0.865) and specificity ranging from 0.67 to 0.92 (median = 0.825). Half of these six screening tools achieved fair accuracy, but none achieved good accuracy. Neither of the two screening tools based on parental concern achieved at least fair screening accuracy.

Fifteen of the 65 studies also reported predictive validity, with a sensitivity ranging from 0.32 to 0.94 (median = 0.81) and a specificity ranging from 0.61 to 0.93 (median = 0.85). Three of the tools (20%) achieved at least fair accuracy in both sensitivity and specificity, but none of them were considered to have good accuracy.

Among the screening tools assessing concurrent validity, three HSROC curves were generated for tools based on language ability, clinical markers, and both language ability and clinical markers, respectively. Two HSROC curves were generated for screening tools administered by trained examiners and parents/caregivers, respectively. Two HSROC curves were generated for screening under and above the age of 4, respectively. A separate HSROC curve was generated for screening tools assessing predictive validity. Screening tools based on indices from language sampling (n = 3) or parental concern (n = 2) were excluded from the HSROC analysis due to the small number of primary studies.

Figure 3 shows the overall performance of screening tools based on language ability, clinical markers, and both. Visual inspection of the plotted points and confidence regions revealed considerable variation in accuracy in all three major types of screening tools. The summary estimates and confidence regions indicated that the overall performance of screening tools based on language ability achieved fair specificity (<0.2 in false positive rate) but fair-to-poor sensitivity. Screening tools based on clinical markers showed considerable variation in both sensitivity and specificity, in that both measures ranged from good to poor.
Screening tools based on both language ability and clinical markers achieved good-to-fair specificity, but fair-to-poor sensitivity. Figure 4 shows the overall performance of screening tools administered by parents/caregivers or trained examiners. Visual inspection revealed that both types of screening tools achieved fair-to-poor sensitivity and good-to-fair specificity. Figure 5 shows the overall performance of screening for children under and above 4yo, respectively. Visual inspection revealed that screening under 4yo achieved good-to-poor sensitivity and specificity, while screening above 4yo achieved good-to-poor sensitivity and good-to-fair specificity. Figure 6 shows the performance of the screening tools evaluating predictive validity. These screening tools achieved fair-to-poor sensitivity and specificity.

The effects of screening proxy, test administrator, screening-diagnosis interval, and age of screening on screening accuracy were investigated using bivariate meta-regression. Table 7 summarizes the results. Screening tools with a screening-diagnosis interval of less than 6 months (i.e., concurrent validity) were associated with higher sensitivity than those with an interval longer than 6 months (i.e., predictive validity). Tools using language ability as the proxy showed a marginally significantly higher sensitivity than those based on clinical markers. Screening tools based on language ability and those based on both language ability and clinical markers appeared to show a similar degree of sensitivity. For tools assessing concurrent validity, screening under the age of 4 had a higher sensitivity, with marginal statistical significance, but showed similar specificity to screening above 4yo. As for tools assessing predictive validity, screening under and above 4yo appeared to show similar sensitivity and specificity.
Similarly, screening tools relying on parent report and those conducted by trained examiners appeared to show a similar sensitivity. Despite the large variability in specificity, none of the factors in the meta-regression model explained this variability.

Results of the sensitivity analysis after excluding studies with high ROB are shown in Table 8. The observed higher sensitivity for screening tools using actual language as the proxy compared with those using clinical markers became statistically significant. The difference in sensitivity between screening tools assessing concurrent validity and those assessing predictive validity appeared to be larger than before the removal of the high ROB studies. However, the observed marginal difference between screening under and above 4yo became non-significant after the exclusion of high-risk studies. As in the results without excluding studies with high ROB, none of the factors included in the sensitivity analysis explained variation in specificity.

The present review shows that currently available screening tools for language disorders during the preschool years vary widely in their design and screening performance. Large variability in screening accuracy across different tools was a major issue in screening for language disorder. The present review also revealed that the variations arose from the choices of proxy and screening-diagnosis interval.

Screening tools based on children's actual language ability were shown to have higher sensitivity than tools based on clinical markers. The fact that screening tools based on clinical markers did not prove to be sensitive may be related to the mixed findings from primary studies. Notably, one of the primary studies using non-word repetition and sentence repetition tasks showed perfect accuracy in classifying all children with and without language disorder (110).
The findings, however, could not be replicated in another study using exactly the same test, which identified only 3 of the 10 children with language disorder (104). The difference highlighted the large variability in the performance of non-word and sentence repetition even among children with language disorders, in addition to the inconsistent difference found between children with and without language disorder (149). Another plausible explanation for the relatively higher sensitivity of using the child's actual language skills lies in the resemblance between the items used for screening based on the child's actual language and the diagnostic tests used as the reference standard. Differences in task design and test item selection across studies may have further increased the inconsistencies (149). Therefore, in future tool development or refinement, great care should be taken in the choice of screening proxy. More systematic studies directly comparing how different proxies and factors affect screening accuracy are warranted.

There was no evidence that other factors related to tool design, such as the test administrators of the screening tools, explained variability in accuracy. In line with a previous review (13), parent-report screening appeared to perform similarly to screening administered by trained examiners. This seemingly comparable accuracy supports parent-report instruments as a viable tool for screening, in addition to their apparent advantage of lower cost of administration. Primary studies directly comparing both types of screening in the same population may provide stronger evidence concerning the choice of administrators.

As predicted, long-term prediction was harder to achieve than estimating concurrent status. Meta-analysis revealed that screening tools reporting predictive validity showed a significantly lower sensitivity than tools reporting concurrent validity, which was also speculated in the previous review (13).
One possible explanation lies in the diverse trajectories of language development in the preschool years. Some of the children who perform poorly in early screening may recover spontaneously at a later time point, while some who appear to be on the right track at the beginning may develop language difficulties later on (7). Current screening tools might not be able to capture this dynamic change in language development in the preschool years, resulting in lower predictive validity than expected. Hence, language disorder screening should concentrate on identifying or introducing new proxies or metrics that are sensitive to the dynamic nature of language development. Vocabulary growth estimates, for example, might be more sensitive to long-term outcomes than a single point estimation (150). Although the current review has shown that different proxies have been used in screening for language disorder, only a limited number of studies have examined how proxies other than children's actual language ability perform in terms of predictive validity. It would be useful to investigate the interaction between the proxy used and the screening-diagnosis interval in future studies.

Age of screening was expected to be affected by the varying developmental trajectories. Screening at an earlier age might have lower accuracy than screening at a later age, when language development becomes more stable. This expected difference was not found in the current meta-analysis. However, it is worth noting that screening tools used at different ages differed not only in the age of screening, but also in other domains. In the meta-analysis, over half (55%, 16/29) of the screening under 4 relied on parent reports and used tools such as vocabulary checklists and reported utterances, while none of the screening above 4 (0/8) was based on parent reports. Inquiry into the effect of screening age on screening accuracy is crucial, as it has direct implications for the optimal time of screening.
Future studies that compare screening accuracy at different ages while keeping the method of assessment constant (e.g., using the same screening tool) may reveal a clearer picture.

Overall, only a small proportion of all the available screening tools achieved good accuracy in identifying both children with and without language disorder. Yet, there is still insufficient evidence to recommend any screening tool, especially given the presence of ROB in some studies. Besides, the limited number of valid tools may partly explain why screening for language disorder has not yet been adopted as a routine surveillance exercise in primary care, in that the use of any one type of screening tool may result in considerable over-identification and missed cases, which can lead to long-term social consequences (19). As shown in the current review, in the future development of screening tools, the screening proxy should be carefully chosen in order to maximize test sensitivity. However, as tools that have good accuracy are limited, there remains room for discussion on whether future test development should aim at maximizing sensitivity even at the expense of specificity. The cost of over-identifying a false-positive child for a more in-depth assessment might be less than that of under-identifying a true-positive child and depriving the child of further follow-ups (104). If this is the case, the cut-off for test positivity can be adjusted. The more stringent the criteria used in screening, the higher the sensitivity the test yields, but with the trade-off of a decrease in specificity. However, the decision should be made by fully acknowledging the harms and benefits, which have not been addressed in the current review. While an increase in sensitivity by adjusting the cut-off might lead to the benefit of better follow-ups, the accompanying increase in the false positive rate might lead to the harms of stigmatization and unnecessary procedures.
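The trade-off between sensitivity and specificity when the cut-off for test positivity is moved can be illustrated with a small sketch. All scores below are invented for illustration, and the sketch assumes a screening score on which lower values indicate weaker language, so that a child screens positive when the score falls below the cut-off; `sens_spec_at_cutoff` is a hypothetical helper, not part of any screening tool.

```python
def sens_spec_at_cutoff(scores_disordered, scores_typical, cutoff):
    """Return (sensitivity, specificity) when scores below `cutoff` screen positive."""
    tp = sum(score < cutoff for score in scores_disordered)  # correctly flagged
    fn = len(scores_disordered) - tp                         # missed cases
    fp = sum(score < cutoff for score in scores_typical)     # over-identified
    tn = len(scores_typical) - fp                            # correctly passed
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical screening scores, grouped by reference-standard diagnosis.
disordered = [62, 70, 74, 78, 81, 85, 88, 92]
typical = [76, 83, 87, 90, 94, 97, 101, 104, 108, 112]

# Raising the cut-off flags more children: sensitivity rises, specificity falls.
for cutoff in (75, 85, 95):
    sensitivity, specificity = sens_spec_at_cutoff(disordered, typical, cutoff)
    print(f"cut-off {cutoff}: sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```

In this toy example the highest cut-off misses no child with language disorder but refers half of the typically developing children for assessment, which is the kind of harm-benefit balance the preceding paragraph describes.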
Given the highly variable developmental trajectories in asymptomatic children, another direction for future studies could be to evaluate the viability of targeted screening in a higher-risk population and compare it with universal screening.
This is the first study to use meta-analytical techniques specifically to evaluate the heterogeneity in screening accuracy of tools for identifying children with language disorder. Nonetheless, the study had several limitations. One limitation relates to the variability and validity of the gold standard, that is, the reference standard tests. Different countries or regions use different localized standardized or nonstandardized tools and criteria to define language disorder. There is no single consensual or true gold standard. More importantly, the significance, sensitivity, and specificity of the procedures used to identify children with language disorders in those reference tests were not examined. Some reference tests may employ arbitrary cut-offs (e.g., −1.25 SD) to define language disorders, while some researchers advocate children's well-being as the outcome, such that children are considered to have language disorders when their lives are negatively impacted by their language skills (151). This lack of consensus might further explain the diverse results or lack of agreement in replication studies. Another limitation of the study was that nearly all the included studies had at least some ROB. This was mainly due to many unreported aspects in the studies. It is suggested that future validation studies on screening tools should follow reporting guidelines such as STARD (152). A third limitation was that the rating of ROB involved only one rater; involving more raters might have minimized potential bias. Lastly, not all included screening tools were analyzed in the meta-analysis. Some studies evaluated multiple screening tools at a number of cut-offs or times of assessment.
Only one data point per study was included in the meta-analysis, and the data used in the meta-analysis were chosen based on Youden's index. Because the best-performing data point from each study was selected, this would inevitably inflate the accuracy shown in the meta-analysis. With the emergence of new methods for meta-analysis of diagnostic studies, more sophisticated methods for handling this complexity of data structure may be employed in future reviews.
This review shows that current screening tools for developmental language disorder vary widely in accuracy, with only some achieving good accuracy. Meta-analytical data identified some sources of heterogeneity. Future development of screening tools should aim at improving overall screening accuracy by carefully choosing the proxy or designing items for screening. More importantly, metrics that are more sensitive to persistent language disorder should be sought. To fully inform surveillance for early language development, future research in the field can also consider broader aspects, such as the harms and benefits of screening, as there is still a dearth of evidence in this respect.
Publicly available datasets were analyzed in this study. These data can be found in the reference lists of the article.
Phase 2 of CATALISE: A multinational and multidisciplinary Delphi consensus study of problems with language development: Terminology The impact of nonverbal ability on prevalence and clinical presentation of language disorder: evidence from a population study Long-term consistency in speech/language profiles: II. Behavioral, emotional, and social outcomes Childhood language disorder and social anxiety in early adulthood Long-term consistency in speech/language profiles: I. Developmental and academic outcomes Trajectories of language delay from age 3 to 5: Persistence, recovery and late onset Late talkers and later language outcomes: Predicting the different language trajectories CATALISE: A multinational and multidisciplinary Delphi consensus study.
Identifying language impairments in children Predictive validity of preschool screening tools for language and behavioural difficulties: A PRISMA systematic review Screening for speech and language delay and disorders in children aged 5 years or younger: US Preventive Services Task Force recommendation statement Universal Screening of Young Children for Developmental Disorders: Unpacking the Controversies. Occasional Paper Screening for speech and language delay in children 5 years old and younger: a systematic review Telehealth measures screening for developmental language disorders in Spanish-speaking toddlers Screening for the identification of oral language difficulties in Brazilian preschoolers: a validation study Stability of core language skill from early childhood to adolescence: A latent variable approach The stability of primary language disorder Preferred reporting items for systematic review and meta-analysis of diagnostic test accuracy studies (PRISMA-DTA): explanation, elaboration, and checklist Selection of preschool language tests: A data-based approach QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies Risk-of-bias VISualization (robvis): An R package and Shiny web app for visualizing risk-of-bias assessments Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Meta-Analysis of Diagnostic Accuracy Concurrent validity of two language screening tests Sequenced Inventory of Communication Development Two language screening tests compared with developmental sentence scoring Bankson Language Screening Test Developmental Sentence Analysis: A Grammatical Assessment Procedure for Speech and Language Clinicians The Cambridge Language and Speech Project (CLASP). I.
Detection of language difficulties at 36 to 39 months Developmental profile II Persistence of the open syllable in defective articulation The Bus Story: A Test of Continuous Speech A pilot study to evaluate a new early screening instrument for speech and language delays Receptive-Expressive Emergent Language Test Screening effectiveness of the Minnesota child development inventory expressive and receptive language scales: sensitivity, specificity, and predictive value Reynell Developmental Language Scales, Revised (Windsor: NFER) Minnesota Child Development Inventory Sensitivity and specificity of a low-cost screening protocol for identifying children at risk for language disorders American Speech-Language and Hearing Association Teste de Language Infantil nas Áreas de Fonologia, Vocabulário, Fluência e Pragmática Early language screening in City and Hackney: work in progress. Child Care Health Dev Reynell Developmental Language Scales, 2nd revision The Symbolic Play Test The diagnostic accuracy of four vocabulary tests administered to preschool-age children Expressive One-Word Picture Vocabulary Test, Revised Peabody Picture Vocabulary Test: PPVT-IIIB Receptive One-Word Picture Vocabulary Test Expressive Vocabulary Test (EVT) Measurement properties and classification accuracy of two Spanish parent surveys of language development for preschool-age children Concurrent validity of a parent survey measuring communication skills of Spanish speaking preschoolers with and without delayed language The ASQ User's Guide Classification accuracy of brief parent report measures of language development in Spanish-speaking toddlers Ages and Stages Questionnaire MacArthur Inventarios del Desarrollo de Habilidades Comunicativas: User's Guide and Technical Manual Accuracy of telehealth-administered measures to screen language in Spanish-speaking preschoolers Utility of the MacArthur-bates communicative development inventory in identifying language abilities of late-talking and
typically developing toddlers The MacArthur-Bates Communicative Development Inventory: Words and Sentences The Preschool Language Scale-3 Concurrent and predictive validity of an early language screening program The language development survey: A screening tool for delayed language in toddlers Infant MSEL Manual: Infant Mullen Scales of Early Learning Improving the positive predictive value of screening for developmental language disorder Evaluation of a structured test and a parent led method for screening for speech and language problems: prospective population based study The Reynell Developmental Language Scales III Early language screening in City and Hackney: The concurrent validity of a measure designed for use with 2½-year-olds Which three year olds need speech therapy? Uses of the Levett-Muir language screening test. Health Visitor Test of Articulation The Grammatical Analysis of Language Disability: A Procedure for Assessment and Remediation Validation of the early language scale Schlichting test voor taalbegrip Schlichting Test voor Taalproductie-II: voor Nederland en Vlaanderen Handleiding CCC-2-NL The Dutch well child language screening protocol for 2-year-old children was valid for detecting current and later language problems Evaluation of a language-screening programme for 2.5-year-olds at Child Health Centres in Sweden An investigation into aspects of the Mayo early language screening test. Child Care Health Dev The Mayo Early Language Screening Test. 
Western Health Board: Mayo Speech and Language Therapy Department Development and validation of language evaluation scale Trivandrum for children aged 0-3 years - LEST (0-3) Receptive-Expressive Emergent Language Scale Modifying a language screening tool for three-year-old children identified severe language disorders six months earlier ABFW: teste de linguagem infantil nas áreas de fonologia, vocabulário, fluência e pragmática The Test for Reception of Grammar, Version 2 (TROG-2) Validation of the Brazilian Children's Test of Pseudoword Repetition in Portuguese speakers aged 4 to 10 years Validation of the language development survey (LDS): A parent report tool for identifying language delay in toddlers Early identification of language delay by direct language assessment or parent report? Elternfragebögen für die Früherkennung von Risikokindern. ELFRA: Hogrefe Sprachentwicklungstest für zweijährige Kinder - SETK-2 und für drei- bis fünfjährige Kinder - SETK 3-5 Secondary prevention of paediatric language disability: a comparison of parents and nurses as screening agents Screening for speech and language disorders: The reliability, validity and accuracy of the General Language Screen Ontwikkelingsonderzoek op het consultatiebureau: handboek bij het vernieuwde Van Wiechenonderzoek Early Language Milestone Scale and language screening of young children The Early Language Milestone Scale.
Pro-Ed Early identification of children with communication disorders: Concurrent and predictive validity of the CSBS Developmental Profile Communication and Symbolic Behavior Scales: Developmental Profile Non-word repetition performance in Slovak-speaking children with and without SLI: novel scoring methods Evaluating the GAPS test as a screener for language impairment in young children Grammar and Phonology Screening Test: (GAPS) London: DLDCN Clinical Evaluation of Language Fundamentals-Preschool Preschool speech and language screening: further validation of the sentence repetition screening test Elicited imitation: its effectiveness for speech and language screening Illinois Test of Psycholinguistic Abilities An investigation to validate the grammar and phonology screening (GAPS) test to identify children with specific language impairment The design and standardization of a speech and language screening test for use with preschool children Sequenced Inventory of Communication Development Two grammatical tasks for screening language abilities in Spanish-speaking children Preschool Language Scale-Fifth Edition Spanish Screening Test (PLS-5 Spanish Screening Test) Fluharty Preschool Speech and Language Screening Test: Teaching Resources Developmental Sentence Scoring Screening Kit of Language Development Screening kit of language development: A preschool language screening instrument Development of a language screening instrument for Swedish 4-year-olds Investigation of the language tasks to include in a short-language measure for children in the early school years The development and validation of the Short Language Measure (SLaM): A brief measure of general language ability for children in their first year at school Clinical Evaluation of Language Fundamentals-Fourth Edition, Australian Standardised Edition A preschool articulation and language screening for the identification of speech disorders Differentiating children with and without language impairment based 
on grammaticality SPELT-P 2: Structured Photographic Expressive Language Test How grammatical are three-year-olds? Identifying children at risk for language impairment: screening of communication at 18 months Neurolingvistisk undersökningsmodell för språkstörda barn. Utbildningsproduktion AB (kommer inom kort att ges ut i ny, något omarbetad upplaga av Pedagogisk Design TROG svensk manual [svensk översättning och bearbetning: Eva Holmberg och Eva Lundälv Why screening Canadian preschoolers for language delays is more difficult than it should be AGS Early Screening Profiles Preschool Language Scale-Fourth Edition (PLS-4) Battelle Developmental Inventory. Itasca: Riverside Bracken Basic Concept Scale-Revised Technical Report for Brigance Screens Teacher identification of speech and language impairment in kindergarten students using the Kindergarten Development Check Predicting later language outcomes from the language use inventory Use Inventory: An Assessment of Young Children's Pragmatic Language Development for 18- to 47-Month-Old Children Diagnostic Evaluation of Language Variation Clinical Evaluation of Language Fundamentals CCC-2: Children's Communication Checklist-2 Can severely language delayed 3-year-olds be identified at 18 months? Evaluation of a screening version of the MacArthur-Bates communicative development inventories Communicative development in Swedish children 16-28 months old: The Swedish early communicative development inventory-words and sentences Swedish early communicative development inventories: Words and gestures Westerlund M, Sundelin C. Screening for developmental language disability in 3-year-old children.
Experiences from a field study in a Swedish municipality Mullen Scales of Early Learning Consistency of a nonword repetition task to discriminate children with and without developmental language disorder in Catalan-Spanish and European Portuguese speaking children Understanding Individual Differences in Language Development Across the School Years STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fped.2022.801220/full#supplementary-material
Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Copyright © 2022 So and To. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.