key: cord-0990112-exhgl95b
authors: Tercero-Hidalgo, Juan R.; Khan, Khalid S.; Bueno-Cavanillas, Aurora; Fernández-López, Rodrigo; Huete, Juan F.; Amezcua-Prieto, Carmen; Zamora, Javier; Fernández-Luna, Juan M.
title: Artificial intelligence in COVID-19 evidence syntheses was underutilized, but impactful: a methodological study
date: 2022-05-02
journal: J Clin Epidemiol
DOI: 10.1016/j.jclinepi.2022.04.027
sha: b2431ccff95909db6d40a266c4461bfc6d2d7ec1
doc_id: 990112
cord_uid: exhgl95b

Objective: A rapidly developing scenario like a pandemic requires the prompt production of high-quality systematic reviews, which can be automated using artificial intelligence (AI) techniques. We evaluated the application of AI tools in COVID-19 evidence syntheses.

Study design: After prospective registration of the review protocol, we automated the download of all open-access COVID-19 systematic reviews in the COVID-19 Living Overview of Evidence database, indexed them for AI-related keywords, and located those that used AI tools. We compared their journals' JCR Impact Factor, citations per month, screening workloads, completion times (from pre-registration to preprint or submission to a journal) and AMSTAR-2 methodology assessments (maximum score 13 points) with a set of publication-date-matched control reviews without AI.

Results: Of the 3999 COVID-19 reviews, 28 (0.7%, 95% CI 0.47-1.03%) made use of AI. On average, compared to controls (n=64), AI reviews were published in journals with higher Impact Factors (median 8.9 vs 3.5, p<0.001), screened more abstracts per author (302.2 vs 140.3, p=0.009) and fewer abstracts per included study (189.0 vs 365.8, p<0.001), and inspected fewer full texts per author (5.3 vs 14.0, p=0.005). No differences were found in citation counts (0.5 vs 0.6, p=0.600), inspected full texts per included study (3.8 vs 3.4, p=0.481), completion times (74.0 vs 123.0, p=0.205) or AMSTAR-2 scores (7.5 vs 6.3, p=0.119).

Conclusion: AI was an underutilized tool in COVID-19 systematic reviews. Its usage, compared to reviews without AI, was associated with more efficient screening of the literature and higher publication impact. There is scope for the application of AI in automating systematic reviews.

WHAT IS NEW?
• The use of artificial intelligence (AI) in COVID-19 systematic reviews was very low.
• COVID-19 reviews using AI tools showed higher publication impact and workload savings.
• There is scope for the application of AI in automating systematic reviews going forward.

Evidence-based medicine depends on the production of timely systematic reviews to guide and update health care practice and policies [1]. This is a resource-intensive undertaking, requiring teams of multiple reviewers to interrogate numerous repositories and databases, screen through thousands of potentially relevant citations and articles, extract the pertinent data from the selected studies, and then prepare cohesive summaries of the findings [2,3]. In the context of the SARS-CoV-2/COVID-19 pandemic, methods to speed up this lengthy process were urgently needed [4,5]. Systematic evidence synthesis relies on robust and standardized procedures to achieve dependable results. However, the call to accelerate research output during the pandemic led to a decrease in reviews' methodological quality [6,7] and the rise of "rapid reviews" [8,9], which shorten the usual timeframes by sacrificing search depth, screening robustness or data-extraction thoroughness, at the expense of an increased risk of errors.
Are these unavoidable tradeoffs for timelier results? Instead, artificial intelligence (AI) based solutions (which automate parts of the workflow by mimicking human problem-solving, and comprise machine learning, natural language processing, data mining and other subfields) [10] are now available to either complement or substitute human efforts with limited risk of bias [11-13], and have been previously (but scarcely) [14] employed in evidence synthesis to enhance screening [15] and data extraction [16,17]. Their aims are to shorten production times, allow for broader screenings of the literature and reduce reviewers' workloads without compromising on methodological quality. Here, we evaluated the use of AI techniques among COVID-19 evidence syntheses to empirically determine whether, compared to COVID-19 evidence syntheses without AI, they impacted the production, the quality, and the publication of systematic reviews.

This methodological study [18] is reported following the PRISMA 2020 guidelines [19] (checklist provided as Supplementary material 1-A), and its protocol was prospectively registered at the Open Science Framework (OSF) Registries (DOI 10.17605/OSF.IO/H5DAW) [20]. We considered for inclusion all COVID-19 related systematic reviews that could have made use of any AI tool (machine learning, deep learning, or natural language processing) to accelerate, improve or complement any aspect of the review conduct (search, screening, data extraction and synthesis). We implemented a script (available at DOI […]) that automatically downloaded the open-access full texts of the COVID-19 systematic reviews indexed in the COVID-19 Living Overview of Evidence database through their links. The process was repeated 3 times since the publication of our protocol to reduce the loss of articles due to server-side errors (last searched on August 17th, 2021). To capture reviews which deployed AI, we constructed a list of keywords with a high probability of appearing in papers using AI tools. To increase the statistical power of the comparison with reviews without AI, for each included review we used the obtained records to randomly select 3 controls with the same publication date (within a one-day margin if not enough articles were available for a given date). In addition, we located and included for analysis all previous versions of reviews labelled as living or "updated".

The following data were manually extracted independently by two authors (JRTH and RFL) from each review: type of review (as described by its authors: standard, rapid/scoping, living, or update of a prior version); disclosed funding and conflicts of interest information; publication status, 2020 Journal Citation Reports (JCR) Impact Factor of the publishing journal and number of citations received (up to August 17th, 2021); number of abstracts screened, full texts reviewed and included studies; number of authors and of reviewers participating in the screening; and dates of protocol registration (if available) and of the review's earliest version. For living and updated reviews, we computed the increase in records screened and included between each of their versions and attributed their citation count to the newest one (to avoid double counting). Excel was used to record all variables.

Three authors (JRTH and RFL, assisted by CAP) graded all reviews with the AMSTAR-2 quality appraisal and risk of bias rating [24]. We excluded items 11-12 and 15, which apply to meta-analyses (as pre-specified by our protocol), and gave 0.5 points for "partial YES" answers when applicable, making for a maximum score of 13 points.
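The selection pipeline described above (keyword-based flagging of AI use in downloaded full texts, followed by publication-date-matched control sampling) can be illustrated with a minimal sketch. This is not the authors' actual script: the function names, the abbreviated keyword list, and the dictionary-based record format are assumptions made only for illustration.

```python
import random
from datetime import timedelta

# Hypothetical, abbreviated keyword list; the study's actual list was longer and expert-curated.
AI_KEYWORDS = ["machine learning", "deep learning", "natural language processing",
               "text mining", "robotsearch", "eppi-reviewer", "abstrackr"]

def flags_ai_use(full_text: str) -> bool:
    """Flag a downloaded review for manual inspection if any AI-related keyword appears."""
    text = full_text.lower()
    return any(keyword in text for keyword in AI_KEYWORDS)

def sample_controls(ai_review: dict, candidates: list[dict], k: int = 3, margin_days: int = 1) -> list[dict]:
    """Randomly pick k control reviews with the same publication date,
    widening the window by +/- margin_days when that date has too few candidates."""
    pool = [c for c in candidates if c["pub_date"] == ai_review["pub_date"]]
    if len(pool) < k:
        lo = ai_review["pub_date"] - timedelta(days=margin_days)
        hi = ai_review["pub_date"] + timedelta(days=margin_days)
        pool = [c for c in candidates if lo <= c["pub_date"] <= hi]
    return random.sample(pool, min(k, len(pool)))
```

In this sketch each record is a plain dictionary with a "pub_date" date field; flagged reviews would still be inspected manually, as in the study.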
For living and updated reviews, we only evaluated their most recent version (to avoid double counting). For reviews that included both randomized controlled trials and observational studies, question 9 (assessment of the risk of bias of individual studies) was graded separately for each study type. The list of the quality items evaluated is provided as Supplementary material 1-C.

We calculated the ratios of abstracts screened and full texts inspected per author (as a workload measurement) and per included study (as a measure of screening precision). The number of reviewers participating in the screening was reported inconsistently between studies and was therefore not used in these calculations. We calculated the completion time of the pre-registered reviews as the difference between their protocol's registration date and the publication date of their first preprint (or the reception date at the journal, for published articles with no preprints available). Living and updated reviews' completion times were calculated as the difference between the publication dates of each of their versions. We excluded non-pre-registered reviews from this metric due to heterogeneity in the reporting of their starting dates.

We used Pearson's chi-square test to compare the percentages of rapid, living, funded, and published reviews between groups. Publishing journals' JCR Impact Factor, citation counts, screening workloads, completion times and AMSTAR-2 ratings were presented as medians with interquartile ranges (IQR), represented using box-and-whisker diagrams, and compared using the Wilcoxon-Mann-Whitney test. R version 4.0.5 was used for statistical computing, and GraphPad Prism 9.2.0 for graphing. We also provided a narrative description of the reviews using artificial intelligence, detailing which parts of the review process were automated and what software they used, how the AMSTAR-2 ratings differed among them, and how the authors justified the use of AI tools or what impact they attributed to them.

As outlined in Figure 1, we identified 7050 bibliographic records of COVID-19 systematic reviews, successfully downloaded 3999, and manually inspected 580 that matched some of our keywords. We selected 20 reviews and located 8 prior versions of them, making a total of 28 reviews (0.7% of the total, 95% CI 0.47-1.03%) with use of AI. Of the 60 articles selected as publication-date-matched controls, we located another 4 prior versions, making a total of 64 articles without use of AI. The complete list of selected articles is provided as an Excel document (Supplementary Material 2, sheet "Included reviews") with all the extracted variables and the AMSTAR-2 quality appraisal's breakdown for each question. The full list of manually inspected and finally discarded articles is also provided (sheet "Excluded reviews").

Extracted variables are summarized in Table 1 and can be visualized in Figure 2. Of the 20 reviews selected for using AI, there were 5 rapid reviews (25%, including 1 scoping review and 1 rapid evidence map) and 5 living reviews (25%). Fifteen reviews provided a conflicts of interest statement, of which 12 (60%) declared having received external funding; 12 (60%) were published. Of the 60 control reviews, there were 6 rapid reviews (10%, including 1 scoping review) and 3 living reviews (5%). Fifty-seven reviews provided a conflicts of interest statement, of which 27 (45%) declared having received external funding; 48 (80%) were published.
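The group comparisons reported below follow the calculations described in the Methods. As a purely illustrative re-implementation sketch (the authors used R 4.0.5; this Python/scipy version and its function names are assumptions, not their code):

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

def workload_ratios(review: dict) -> dict:
    """Per-author workload and per-included-study screening precision, as defined above."""
    return {
        "abstracts_per_author": review["abstracts_screened"] / review["n_authors"],
        "full_texts_per_author": review["full_texts_inspected"] / review["n_authors"],
        "abstracts_per_included": review["abstracts_screened"] / review["included_studies"],
        "full_texts_per_included": review["full_texts_inspected"] / review["included_studies"],
    }

def compare_continuous(ai_values, control_values):
    """Group medians plus a two-sided Wilcoxon-Mann-Whitney comparison."""
    _, p = mannwhitneyu(ai_values, control_values, alternative="two-sided")
    return np.median(ai_values), np.median(control_values), p

def compare_proportion(ai_yes, ai_no, control_yes, control_no):
    """Pearson's chi-square test (no continuity correction) on a 2x2 group-by-trait table."""
    _, p, _, _ = chi2_contingency([[ai_yes, ai_no], [control_yes, control_no]], correction=False)
    return p
```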
JCR Impact Factors and citation counts showed high variability in the AI group, mainly due to the inclusion of 3 BMJ [25-27], 2 Cochrane [28,29] and 1 Lancet [30] reviews. Furthermore, only 10 reviews in the AI group (50%) and 22 in the controls (36%) pre-registered a protocol, making for a total of 44 data points for the completion-time calculation. The AI group included a higher proportion of living reviews than the controls. We observed no differences in the pre-registered reviews' times to completion. According to the step of the review process where AI was used, the 20 reviews in the AI group can be classified into three categories, as shown in Table 2.

Three reviews [31-33] complemented their search procedures with open-ended question queries on CORD-19 [45], an open dataset of COVID-19 related articles structured to facilitate the use of text mining and machine learning systems: Zaki et al. [32] used a GitHub repository based on the Okapi BM25 search algorithm; Zaki et al. [33] employed BioBERT, a peer-reviewed [46] and open-source text mining system pre-trained for biomedical content analysis; and Parasa et al. [31] provided no details on the search engine employed. Additionally, Michelson et al. [34] used proprietary software from the "GenesisAI" company to produce a "rapid meta-analysis" as a proof of concept of their product. Daley et al. [35] disclosed no information on the software employed. Only 2 reviews in this sub-group were published, and none registered a protocol. The average AMSTAR-2 score was 3.7/13.

Seven articles [25,26,36-40] employed RobotSearch, a peer-reviewed [47] and open-source software tool to identify randomized controlled trials (RCTs) from a citation list. It is based on a neural network trained with data from Cochrane's reviews and stands out for its ease of use (no installation is required) and flexibility (it allows for different levels of sensitivity, including one developed specifically for systematic reviews, as well as integration with other scripts). In our sample, RobotSearch was often incorporated in the workflows of living or partially automated reviews. Two of the reviews that made use of RobotSearch were Bartoszko et al. [25], a network meta-analysis of the evidence for COVID-19 prophylaxis, and Siemieniuk et al. [26], a living meta-analysis of randomized trials to inform the World Health Organization (WHO) Living Guidelines on drugs for treatment of COVID-19, of which Izcovich et al. [38] and Zeraatkar et al. [39] are separate sub-studies. Both are part of the "BMJ Rapid Recommendations" project and maintain a website where summaries of the available evidence and interim analyses are published. The average AMSTAR-2 score was 7.5/13.

We found eight articles [27-30,41-44] that made use of AI-powered screening procedures.
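The tools described next all implement some variant of priority screening: a classifier is trained on the reviewers' accept/reject decisions so far and used to re-rank the records not yet screened. The sketch below is purely illustrative; it assumes scikit-learn TF-IDF features and logistic regression, and the function name is hypothetical — none of the cited tools is documented as using exactly this model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_unscreened(labeled_abstracts, labels, unscreened_abstracts):
    """Re-rank unscreened abstracts by predicted probability of inclusion (highest first).
    labels: 1 = included by the reviewers so far, 0 = excluded."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train = vectorizer.fit_transform(labeled_abstracts)
    model = LogisticRegression(max_iter=1000).fit(X_train, labels)
    scores = model.predict_proba(vectorizer.transform(unscreened_abstracts))[:, 1]
    order = scores.argsort()[::-1]
    return [(unscreened_abstracts[i], float(scores[i])) for i in order]
```

In practice such tools retrain after each batch of reviewer decisions and stop either when an estimated recall target is reached (the approach described for SWIFT-Active Screener below) or after a pre-specified run of consecutive exclusions.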
Five of them [27-29,41,42] used EPPI-Reviewer, a web-based tool (distributed as shareware) that assists in the production of all kinds of literature reviews. It offers a wide variety of features, from bibliographic management to collaborative working, as well as study identification capabilities, automatic clustering of articles, and text mining. In particular, the included reviews used its "SGCClassifier" module to prioritize the screening of articles more likely to be included. As a result, Wynants et al. [27] and two Cochrane reviews [28,29] quoted an 80% reduction in the screening burden due to this tool. Two other, proprietary tools were used by another two articles: SWIFT-Active Screener [48] by Elmore et al. [43], which was set to achieve a certain study-recall objective as the screening's stopping criterion; and Evidence Prime by Chu et al. [30], to double-check the screening process. Finally, Alkofide et al. [44] used Abstrackr, the only open-source software in this category, which uses feedback from previously selected and rejected articles to guide the screening process. Evaluations of this tool published in the literature [49] suggest high workload savings in the production of systematic reviews at the cost of a 0.1% false-negative rate. Among the reviews analyzed in this study, this subgroup presented the highest scores in the AMSTAR-2 appraisal tool (9.1/13), most notably two Cochrane reviews [28,29] (12 points) and a rapid meta-analysis published in the Lancet [30] (10.5 points). In contrast to reviews in the other categories, which prioritized search depth, the use of AI-powered tools in this subgroup was motivated by the screening burden faced by the reviewers: quoting Dinnes et al. [28], "a more efficient approach [was needed] to keep up with the rapidly increasing volume of COVID-19 literature".

We evaluated whether the potential benefits of deploying AI in evidence syntheses have been realized in COVID-19 reviews. We found that AI was rarely utilized, appearing in only 0.7% of the studied reviews, but that its use was significantly associated with reductions in authors' screening workload and with publication in journals with a higher Impact Factor. Being a living review was associated with using AI, with the most common use cases being the optimization of screening (prioritizing studies with a high likelihood of being relevant) and the selection of randomized controlled trials.

As a limitation of our study, we would highlight its low statistical power due to the small number of reviews using AI. Anticipating the limited availability of reviews with AI, we adopted a highly sensitive screening procedure, processing more than 7000 bibliographic references of COVID-19 systematic reviews (combining expert advice in the selection of keywords and a fully-featured search engine), and chose a 3:1 control group size to minimize the risk of type II errors. We also note that reporting workloads "per author" instead of "per reviewer participating in the screening" may underestimate workload measurements for large teams (when not all of their authors participate in the screening). A higher author count might also be related to resource availability, and thus access to expert advice regarding AI. Likewise, better-resourced groups with AI expert support might have greater access to well-indexed journals, potentially biasing the Impact Factor analyses in favor of AI.
The AMSTAR-2 tool was inevitably applied without blinding the reviewers to the use or non-use of AI, which, given the subjectivity of certain aspects of the methodology assessment, might have influenced this evaluation. Finally, the use of citation counts to measure reviews' impact has known deficiencies, such as being influenced by citation bias or the authority of the authors [50], and this approach may underestimate the impact of recently published reports.

On average, it takes 15 months for teams of 5 reviewers to complete a traditional systematic review [51], with estimated screening error rates of around 10% [52]. Facing the COVID-19 pandemic demanded robust evidence summaries urgently, as delays incurred costs in terms of lost lives and economic damage. However, despite the explosive growth that the AI and machine learning fields have experienced in recent years, they played a surprisingly limited role in COVID-19 evidence synthesis. Our findings are consistent with previous reports [14] that the benefits AI can provide in the conduct of systematic reviews are unknown to most review authors, and that the relative unorthodoxy of its methods might initially hinder their acceptance by the research community. Open-source software, more prone to community adoption, will be essential in this respect. Hopefully, our article will raise the profile of AI in evidence syntheses.

Our narrative description of the reviews included in this study showed that none made use of more than one AI tool. A more cohesive approach, seamlessly merging AI into every step of the review process, would save reviewers the time spent interconnecting different tools with sometimes incompatible formats. Semi-automated screening procedures were one of the areas where AI showed the most adoption, and where the variety of software options (such as EPPI-Reviewer, already adopted as a Cochrane Review Production Tool) was greatest. In contrast, full automation was represented only by RobotSearch (an extensively appraised randomized-trial identifier), suggesting that the adoption of increasingly automated solutions may be hindered by the need to further weigh their potential costs in recall and risk of bias against their productivity contributions.

The need for automated solutions in research synthesis is obvious, as reviewers' workload is growing with the rapidly expanding biomedical field. Adoption of new technologies can take time, but realizing AI's potential in evidence synthesis should be a priority. Going forward, AI must be incorporated into systematic reviews as the next step towards timely, better, and more responsive decision-making.

References
Cochrane Handbook for Systematic Reviews of Interventions version 6.3. Cochrane
Systematic review automation technologies
Resource use during systematic review production varies widely: a scoping review
We need clinical guidelines fit for a pandemic
Methodological challenges in studying the COVID-19 pandemic crisis
Reporting and methodological quality of COVID-19 systematic reviews needs to be improved: an evidence mapping
Methodological quality of COVID-19 clinical research
Rapid review methods more challenging during COVID-19: commentary with a focus on 8 knowledge synthesis steps
A QuESt for speed: rapid qualitative evidence syntheses as a response to the COVID-19 pandemic
Artificial intelligence and automation of systematic reviews in women's health
Using text mining for study identification in systematic reviews: a systematic review of current approaches
Toward systematic review automation: A practical guide to using machine learning tools in research synthesis
Living systematic reviews: 2. Combining human and machine effort
Systematic review automation tools improve efficiency but lack of knowledge impedes their adoption: a survey
Machine learning reduced workload with minimal risk of missing studies: development and evaluation of a randomized controlled trial classifier for Cochrane Reviews
Data extraction methods for systematic review (semi)automation: A living systematic review
Automating data extraction in systematic reviews: a systematic review
A tutorial on methodological studies: the what, when, how and why
The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
Covid-19 systematic evidence synthesis with Artificial Intelligence: a Review of Reviews
COVID-19 evidence syntheses with artificial intelligence: an empirical study of systematic reviews
Evidence synthesis relevant to COVID-19: a protocol for multiple systematic reviews and overviews of systematic reviews
AMSTAR 2: A critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both
Prophylaxis against covid-19: Living systematic review and network meta-analysis
Drug treatments for covid-19: living systematic review and network meta-analysis
Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal
point-of-care antigen and molecular-based tests for diagnosis of SARS-CoV-2 infection
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19
Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: a systematic review and meta-analysis
Prevalence of Gastrointestinal Symptoms and Fecal Viral Shedding in Patients with Coronavirus Disease 2019: A Systematic Review and Meta-analysis
The influence of comorbidity on the severity of COVID-19 disease: A systematic review and analysis
The estimations of the COVID-19 incubation period: A scoping reviews of the literature
Ocular toxicity and Hydroxychloroquine: A Rapid Meta-Analysis
A Systematic Review of the Incubation Period of SARS-CoV-2: The Effects of Age, Biological Sex, and Location on Incubation Period
Impact of remdesivir on 28 day mortality in hospitalized patients with COVID-19
Impact of systemic corticosteroids on hospitalized patients with COVID-19
Adverse effects of remdesivir, hydroxychloroquine, and lopinavir/ritonavir when used for COVID-19: Systematic review and meta-analysis of randomized trials
Tocilizumab and sarilumab alone or in combination with corticosteroids for COVID-19: A systematic review and network meta-analysis
Clinical trials in COVID-19 management & prevention: A meta-epidemiological study examining methodological quality
Impacts of school closures on physical and mental health of children and young people: a systematic review
Are medical procedures that induce coughing or involve respiratory suctioning associated with increased generation of aerosols and risk of SARS-CoV-2 infection? A rapid systematic review
Risk and Protective Factors in the COVID-19 Pandemic: A Rapid Evidence Map
Tocilizumab and Systemic Corticosteroids in the Management of Patients with COVID-19: A Systematic Review and Meta-Analysis
CORD-19: The Covid-19 Open Research Dataset
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Machine learning for identifying Randomized Controlled Trials: An evaluation and practitioner's guide
Accelerated document screening through active learning and integrated recall estimation
Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool
Citation bias and other determinants of citation in biomedical research: findings from six citation networks
Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry
Error rates of human reviewers during abstract screening in systematic reviews

All authors contributed to the editing of the paper and approved its final version to be submitted.