key: cord-1050254-46uiy8wl
authors: Perlman-Arrow, S.; Loo, N.; Bobrovitz, N.; Yan, T.; Arora, R. K.
title: A real-world evaluation of the implementation of NLP technology in abstract screening of a systematic review
date: 2022-02-25
journal: nan
DOI: 10.1101/2022.02.24.22268947
sha: fc882a8ffdc968f99ca43abea48c1d54a5eed8c2
doc_id: 1050254
cord_uid: 46uiy8wl

The laborious and time-consuming nature of systematic reviews hinders the dissemination of up-to-date evidence synthesis. Well-performing natural language processing (NLP) tools for systematic reviews have been developed, showing promise to improve efficiency. However, the feasibility and value of these tools have not been comprehensively demonstrated in a real-world review. We developed an NLP-assisted abstract screening tool that provides text inclusion recommendations, keyword highlights, and visual context cues. We evaluated this tool in a living systematic review on SARS-CoV-2 seroprevalence, conducting a quality improvement assessment of screening with and without the tool. We evaluated changes to abstract screening speed, screening accuracy, characteristics of included texts, and user satisfaction. The tool improved efficiency, reducing screening time per abstract by 45.9% and decreasing inter-reviewer conflict rates. The tool conserved precision of article inclusion (positive predictive value; 0.92 with the tool vs 0.88 without) and recall (sensitivity; 0.90 vs 0.81). The summary statistics of included studies were similar with and without the tool. Users were satisfied with the tool (mean satisfaction score of 4.2/5). We evaluated an abstract screening process where one human reviewer was replaced with the tool's votes, finding that this maintained recall (0.92 one-person, one-tool vs 0.90 two tool-assisted humans) and precision (0.91 vs 0.92) while reducing screening time by 70%. Implementing an NLP tool in this living systematic review improved efficiency, maintained accuracy, and was well-received by researchers, demonstrating the real-world effectiveness of NLP in expediting evidence synthesis.

1 Background

These tools have largely been evaluated using data from completed reviews. [15, 11, 12, 21] Only one study to date has evaluated these tools in the context of an ongoing review with user interactions; that evaluation involved only one reviewer, was conducted after traditional screening was completed, and focused exclusively on screening time. [10]

Furthermore, few reports have evaluated the impact of implementing NLP tools in living literature reviews, [13] and none have assessed user-tool interactions or user satisfaction in this context. Living reviews would benefit from sustained screening efficiency and lend themselves well to the integration of NLP tools: an initial manual review yields a large number of screened texts, which can serve as a training set to develop an algorithm that in turn expedites continuous review updates.

SeroTracker conducts a living systematic review of global SARS-CoV-2 seroprevalence and publishes results on an interactive dashboard (Serotracker.com). [22, 23] Each week, our team screens 800-1000 new abstracts and extracts approximately 30 articles. To optimize the efficiency of our screening process, we developed an NLP-assisted software tool and conducted a quality improvement (QI) project assessing the efficiency changes and usability of integrating this tool into our usual methods.
We evaluated changes in the time taken to conduct screening, the accuracy of the screening process, the characteristics of included texts in our overall review, user interactions with the tool, and user satisfaction with the process. Moreover, we assessed different combinations of reviewer and tool pairing to determine how best to improve our screening process. As an evaluation of NLP-based tools in an ongoing living systematic review, our report provides novel and comprehensive evidence regarding the feasibility of NLP for screening and its real-world performance benefits.

2 Methods

In line with process improvement measures at SeroTracker, reviewers were interviewed to assess satisfaction with the screening process. Team members noted its time-consuming nature and identified the following key challenges: difficulty in tracking the number of texts screened, the inability to reverse a vote after a misclick, and the inability to identify key information at a glance to determine whether a text should be included.

We developed an NLP-enabled tool that adds features to Covidence to allow more efficient identification of text eligibility. The tool included (1) an inclusion recommendation indicator, which displays a confidence rating ranging from "not recommended" to "strongly recommended" in the form of a coloured circle beside the abstract title. This was developed using the transformer-based pre-trained NLP model PubMedBERT. [18] We fine-tuned the model on a set of 25,000 previously screened abstracts from the living systematic review. We also included (2) a feature that highlights the Population, Intervention, and Outcome (PIO) components of the abstract in different colours, using the same model.

The tool also incorporated features to streamline screening and improve the user experience: (3) a screening progress tracker, (4) a button to undo a user's most recent votes on a text, (5) a feature that displays abstracts separated by their section headings (e.g., "Background", "Methods"), and (6) a feature highlighting reviewer-specified keywords (Appendix Figure C3).
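The inclusion recommendations are generated by the fine-tuned PubMedBERT classifier described above. The paper does not include training code; the snippet below is only a minimal sketch of how such a fine-tuning setup could look using the Hugging Face transformers and datasets libraries. The checkpoint name, file name, column names, and hyperparameters are illustrative assumptions rather than details reported by the authors.

```python
# Minimal sketch (not the authors' code): fine-tune PubMedBERT to classify
# abstracts as include (1) vs exclude (0), given a CSV of previously screened
# abstracts with "abstract" and "label" columns (hypothetical file layout).
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

df = pd.read_csv("screened_abstracts.csv")      # ~25,000 labelled abstracts (assumed file)
ds = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    # Truncate long abstracts to the model's 512-token limit.
    return tokenizer(batch["abstract"], truncation=True, max_length=512)

ds = ds.map(tokenize, batched=True).rename_column("label", "labels")

args = TrainingArguments(
    output_dir="pubmedbert-screening",          # hyperparameters are illustrative only
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,                        # enables dynamic padding via the default collator
)
trainer.train()
```

At inference time, the model's probability for the "include" class could be bucketed into the four colour-coded confidence bands shown beside each abstract title; the paper does not state the cut-points, so any mapping would be an implementation choice.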
We conducted a project with an AB design to assess the feasibility and impact of tool implementation on abstract screening. We selected a set of 400 abstracts ("pilot abstracts") to evaluate tool performance. These abstracts had previously been screened using the same inclusion criteria as part of SeroTracker's review; 309/400 had been excluded and 91/400 had been included in the review.

The project was conducted over a five-week period in three stages (Figure 1). In the first two weeks, team members conducted screening without the tool ("without-tool stage"), with 200 pilot abstracts added to the regular primary searches each week. We then implemented a week-long washout period, during which no pilot abstracts were added and reviewers installed and familiarized themselves with the tool. In the final two weeks, team members used the tool for screening ("with-tool stage"), using the features they felt were most helpful; 200 pilot abstracts were again added to the regular primary searches each week.

Three sets of reviewer votes on the pilot abstracts were collected: votes from the initial screen in April ("pre-project votes"), votes from the without-tool stage, and votes from the with-tool stage. We used the pre-project votes as the reference standard for comparison.

We evaluated key process, outcome, and structure measures. Process measures included (1) efficiency metrics, namely screening time and the conflict rate with and without the tool, and (2) accuracy metrics, namely the precision (positive predictive value) and recall (sensitivity) of screening, as well as the performance of the tool's inclusion recommendations. To evaluate precision and recall, we first calculated the baseline variability expected from human error in screening by comparing the texts included and excluded in the pre-project and without-tool stages, as there is inherent human error in the systematic review process. [26] We then assessed whether the outcomes of the with-tool stage fell within this expected level of variability.

The first outcome measure was the tool's impact on the results of the review, assessed by comparing summary descriptive statistics for included seroprevalence estimates in the pre-project, without-tool, and with-tool stages. We also assessed reviewers' usage of the different features, such as voter alignment with the NLP recommendations and the frequency of use of each feature, and we surveyed users to understand overall satisfaction with the tool.

Finally, one structure measure was evaluated, comparing tool performance across different combinations of human and tool votes. We assessed the tool's performance using data from the project in a "one-person and one-tool" (OPOT) screening process, a simulated abstract screening scenario in which one human reviewer is replaced with the tool's votes. Results from this work were used to inform whether to integrate the tool into regular practice at SeroTracker and to inform further improvements.

3 Results

Screening time per abstract decreased with the tool (Table 2). Lastly, we repeated the analysis using only abstracts ultimately included at the abstract screening stage; these took a similar time to screen with or without the tool (Table 2).

The tool's impact on the conflict rate was also assessed; an increased number of conflicting votes decreases the efficiency of abstract screening, as a third reviewer must resolve them. The conflict rate decreased when the tool was added (exact test).
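The accuracy metrics defined in the Methods above (precision as positive predictive value, recall as sensitivity, both measured against the pre-project reference votes) reduce to standard binary-classification quantities. The sketch below is one illustrative way to compute them; the variable names and example vectors are assumptions, not the authors' analysis code or data.

```python
# Minimal sketch: precision (PPV) and recall (sensitivity) of a screening
# stage's include/exclude decisions against the pre-project reference votes.
from sklearn.metrics import precision_score, recall_score

def stage_accuracy(reference_labels, stage_labels):
    """Both arguments are equal-length lists of 0/1 decisions (1 = include)
    over the same pilot abstracts, in the same order."""
    return (precision_score(reference_labels, stage_labels),
            recall_score(reference_labels, stage_labels))

# Toy example with made-up votes (the real comparison used the 400 pilot
# abstracts: 91 previously included, 309 previously excluded).
reference = [1, 1, 0, 0, 1, 0]
with_tool = [1, 1, 0, 1, 1, 0]
print(stage_accuracy(reference, with_tool))  # -> (0.75, 1.0)
```

The same comparison applied to the pre-project and without-tool stages gives the baseline variability attributable to ordinary human error, against which the with-tool stage was judged.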
The tool's inclusion recommendations achieved an F1 score of 0.905, with a precision of 0.827 and a recall of 1.0 at this threshold; scores remained high at all thresholds.

Table 6 summarizes the statistics of the seroprevalence estimates included in the pre-project, without-tool, and with-tool stages. The majority of the statistics remained consistent with and without the tool. Compared with the pre-project votes, the with-tool stage did not exclude any estimates deemed to have a low or moderate risk of bias.

After the project's conclusion, reviewers were sent a satisfaction survey; 9 of 13 team members provided feedback (Table 7). The self-reported usage information did not align perfectly with the computer-recorded usage, which could be due to recall bias, as the survey was conducted one week after the conclusion of the project. Reviewers who used features rated their usefulness out of five (Table 7). The inclusion recommendation feature was rated the most useful (mean score of 4.70/5), while keyword highlighting (3.88/5) and PIO highlighting (4.00/5) were rated the least useful.

Reviewers reported that the tool improved perceived screening speed by allowing them to rapidly identify key information that qualifies or disqualifies abstracts for inclusion, specifically through the bolded section headings, PIO highlighting, and keyword highlighting. While the undo feature was rarely used, users reported that it provided them with more security, allowing them to correct mistakes that would otherwise be permanent. While many users found the inclusion recommendations useful, they noted that these could give a false sense of security and cause users to blindly trust the tool rather than carefully read through the abstract. Users also noted that the PIO highlighting feature often highlighted incorrect information, making it distracting at times; this complaint was reflected in the feature's low adoption.

OPOT results are reported in Table 4 for texts that were included in the abstract screen and for texts that were included in the full-text screen. None of the false positives were ultimately included in the review; all were excluded during extraction. Recall for included texts was …

1 If, during full-text extraction, a text that was included in screening is found to lack the information needed for extraction, it can be excluded.

The tool-only screening scenario, in which votes are provided by the tool while a human reviewer conducts only conflict resolution and full-text screening, performed comparably well to both OPOT and to two human reviewers. Precision, however, was reduced with this system (Table 4).
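The OPOT comparison above comes from a simulation in which the tool's recommendation stands in for one of the two human reviewers. The decision rule below is a minimal sketch of how such a simulation could be expressed; the conflict-resolution rule and the confidence cut-point used to turn a recommendation into a binary vote are assumptions for illustration, not procedures stated in the paper.

```python
# Minimal sketch of a "one-person, one-tool" (OPOT) screening decision:
# the tool replaces the second human reviewer, and disagreements are sent
# to a human adjudicator (assumed rule, for illustration only).

CONFIDENCE_BANDS = ["not recommended", "weakly recommended",
                    "recommended", "strongly recommended"]  # middle labels are assumed

def tool_vote(confidence_band: str) -> bool:
    # Assumed cut-point: treat "recommended" or stronger as an include vote.
    return CONFIDENCE_BANDS.index(confidence_band) >= 2

def opot_decision(human_vote: bool, band: str, adjudicator_vote: bool) -> bool:
    """Final include/exclude decision for one abstract."""
    if human_vote == tool_vote(band):
        return human_vote          # agreement: decision stands
    return adjudicator_vote        # conflict: a human resolves it
```

Applying such a rule across the pilot abstracts and comparing the resulting inclusions against the pre-project reference votes yields OPOT precision and recall figures of the kind reported in Table 4.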
4 Discussion

The pilot set was enriched with previously included texts, meaning that the tool's precision at the abstract screening stage would likely be lower in practice. SeroTracker plans to adopt the OPOT(W) system into regular screening practice to reduce screening time while maintaining a level of quality control through human monitoring. We will incorporate reviewer feedback to further augment the tool's utility. These improvements can ultimately increase the efficiency of the entire living systematic review, allowing us to maintain up-to-date information.

There are also factors, such as whether the abstract received an "include" or "exclude" vote or whether a reviewer had seen the abstract before, that were not accounted for. Furthermore, the AB project design could induce order effects. While we showed in Section 3.1.1 that the removal of duplicate abstract-vote pairs did not affect the decrease in abstract screening time, we could not definitively demonstrate that order effects did not influence votes, particularly for texts that were accepted into full-text screening and subsequently interacted with multiple times. Finally, the sample size was limited to reduce the screening load placed on the team, and some conclusions lacked the sample size needed for statistical significance. This screening load constraint also resulted in a higher inclusion rate in the pilot abstract set (23%) than is typically observed in screening (5%).

Beyond design, there are limitations to assessing the impact of our tool as a whole. Firstly, there is no definitive "gold standard" for the inclusion or exclusion of abstracts; we assumed the "pre-project" screening labels to be accurate. Furthermore, our precision and …

6 Tables

Table 1: Key methods and results of process, outcome and structure measures evaluated.
Figure: Operating characteristics of the tool evaluated on the pilot abstracts, with "True" labels taken as the outcome of the previous full screening and the predicted labels taken as the tool's inclusion likelihood. (a) The ROC curve, with the four confidence thresholds given by the tool marked on the curve. (b) Precision, recall, and F1 scores as a function of the tool's confidence threshold, from the lowest confidence (at least not recommended/red) to the highest (at least strongly recommended/dark green).

References

Knowledge Synthesis in Evidence-Based Medicine.
Users' Guides to the Medical Literature: IX. A Method for Grading Health Care Recommendations.
Analysis of the Time and Workers Needed to Conduct Systematic Reviews of Medical Interventions Using Data from the PROSPERO Registry.
Seventy-Five Trials and Eleven Systematic Reviews a Day: How Will We Ever Keep Up?
How COVID Broke the Evidence Pipeline.
Living Systematic Reviews: An Emerging Opportunity to Narrow the Evidence-Practice Gap.
Feasibility and Acceptability of Living Systematic Reviews: Results from a Mixed-Methods Evaluation.
Toward Systematic Review Automation: A Practical Guide to Using Machine Learning Tools in Research Synthesis.
A Machine Learning Tool to Semi-Automate Abstract Screening for Systematic Reviews. Syst Rev.
Performance and Usability of Machine Learning for Screening in Systematic Reviews: A Comparative Evaluation of Three Tools.
An Evaluation of DistillerSR's Machine Learning-Based Prioritization Tool for Title/Abstract Screening - Impact on Reviewer-Relevant Outcomes.
Automatic Screening Using Word Embeddings Achieved High Sensitivity and Workload Reduction for Updating Living Network Meta-Analyses.
Rayyan - a Web and Mobile App for Systematic Reviews.
Semi-Automated Screening of Biomedical Citations for Systematic Reviews.
Natural Language Processing Was Effective in Assisting Rapid Title and Abstract Screening When Updating Systematic Reviews.
Pre-Training of Deep Bidirectional Transformers for Language Understanding. CoRR.
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.
PICO Element Detection in Medical Text without Metadata: Are First Sentences Enough?
Reducing Workload in Systematic Review Preparation Using Automated Citation Classification.
An Open Source Machine Learning Framework for Efficient and Transparent Systematic Reviews.
SeroTracker: A Global SARS-CoV-2 Seroprevalence Dashboard.
Global Seroprevalence of SARS-CoV-2 Antibodies: A Systematic Review and Meta-Analysis.