key: cord-0319545-0gvk7gwj
authors: Schoot, Rens van de; Bruin, Jonathan de; Schram, Raoul; Zahedi, Parisa; Boer, Jan de; Weijdema, Felix; Kramer, Bianca; Huijts, Martijn; Hoogerwerf, Maarten; Ferdinands, Gerbrich; Harkema, Albert; Willemsen, Joukje; Ma, Yongchao; Fang, Qixiang; Hindriks, Sybren; Tummers, Lars; Oberski, Daniel
title: Open Source Software for Efficient and Transparent Reviews
date: 2020-06-22
journal: nan
DOI: 10.1038/s42256-020-00287-7
sha: 223846b7b56e250e1b6f521997b4c1b809cc0da7
doc_id: 319545
cord_uid: 0gvk7gwj

To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a tool (ASReview) to accelerate the step of screening titles and abstracts. For many tasks - including but not limited to systematic reviews and meta-analyses - the scientific literature needs to be checked systematically. Currently, scholars and practitioners screen thousands of studies by hand to determine which studies to include in their review or meta-analysis. This is error-prone and inefficient because of extremely imbalanced data: only a fraction of the screened studies is relevant. The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We therefore developed an open source machine learning-aided pipeline applying active learning: ASReview. We demonstrate by means of simulation studies that ASReview can yield far more efficient reviewing than manual reviewing, while providing high quality. Furthermore, we describe the options of the free and open source research software and present the results from user experience tests. We invite the community to contribute to open source projects such as our own that provide measurable and reproducible improvements over current practice.

would like to learn from, thereby drastically reducing the total number of records that require manual screening. 16-18

In most so-called Human-in-the-Loop (HITL) 19 machine learning applications, the interaction between the machine learning algorithm and the human is used to train a model with a minimum number of labeling tasks. Unique to systematic reviewing is that not only should all relevant records (i.e., titles and abstracts) be seen by a researcher, but also that an extremely diverse range of concepts needs to be learned, requiring flexibility in the modeling approach as well as careful error evaluation. 11 In the case of systematic reviewing, the algorithm(s) are interactively optimized for finding the most relevant records, instead of finding the most accurate model. Therefore, the term Researcher-In-The-Loop (RITL) was introduced 20 as a special case of HITL with three unique components: (1) the primary output of the process is a selection of the records, not a trained machine learning model; (2) all records in the relevant selection are seen by a human at the end of the process 21; and (3) the process requires a reproducible workflow and complete transparency. 22

Existing tools implementing such an active learning cycle for systematic reviewing are described in Table 1; see the Appendix for an overview of all the software we considered (note that this list was based on a review of software tools 12). However, existing tools have two main drawbacks. First, many are closed source applications with black box algorithms. This is problematic as transparency and data ownership are essential in the era of open science 22.
Second, to our knowledge, existing tools lack the necessary flexibility to deal with the large range of possible concepts to be learned by a screening machine. For example, in systematic reviews, the optimal type of classifier will depend on variable parameters, such as the proportion of relevant publications in the initial search and the complexity of the inclusion criteria used by the researcher. 23 For this reason, any successful system must allow for a wide range of classifier types. Benchmark testing is crucial to understand the real-world performance of any ML-aided system, but currently such benchmark options are mostly lacking.

In this paper, we present an open source ML-aided pipeline with active learning for systematic reviews called ASReview. The goal of ASReview is to help scholars and practitioners get an overview of the most relevant records for their work as efficiently as possible, while being transparent in the process. The open, free and ready-to-use software ASReview addresses all concerns mentioned above: it is open source, uses active learning, and allows multiple ML models. It also has a benchmark mode, which is especially useful for comparing and designing algorithms. Furthermore, it is intended to be easily extensible, allowing third parties to add modules that enhance the pipeline. Although we focus this paper on systematic reviews, ASReview can handle any text source.

In what follows, we first present the pipeline for manual versus ML-aided systematic reviews. Subsequently, we present how ASReview has been set up and how it can be used in different workflows by presenting several real-world use cases. Then, we present the results of simulations that benchmark performance and the results of a series of user experience tests. Last, we discuss future directions.

Traditionally, the pipeline of a systematic review without active learning starts with researchers doing a comprehensive search in multiple databases 24, using free text words as well as controlled vocabulary to retrieve potentially relevant references. The researcher then typically verifies that key papers they expect to find are indeed included in the search results. The researcher downloads a file with records containing the text to be screened into a reference manager; in the case of systematic reviewing, this file contains the titles and abstracts, and potentially other metadata such as authors, journal, and DOI, of potentially relevant references. Ideally, two or more researchers then screen the records' titles and abstracts based on eligibility criteria established beforehand. 4 After all records have been screened, the full texts of the potentially relevant records are read to determine which will ultimately be included in the review. Most records are excluded in the title and abstract phase. Typically, only a small fraction of the records belongs to the relevant class, making title and abstract screening an important bottleneck in the systematic reviewing process 25. For instance, a recent study analyzed 10,115 records and excluded 9,847 after title and abstract screening, a drop of more than 95%. 26 Therefore, ASReview focuses on this labor-intensive step.

The research pipeline of ASReview is depicted in Figure 1. The researcher starts with a search exactly as described above and subsequently uploads a file containing the records (i.e., metadata containing the text of the titles and abstracts) into the software.
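To make the expected input concrete, the sketch below shows how such a records file could be inspected before it is uploaded. It is only an illustration: the file name records.csv and the column names title and abstract are assumptions, and the formats ASReview actually accepts are described in its documentation 28.

```python
# Minimal sketch: inspect a records file before screening.
# Assumes a hypothetical CSV "records.csv" with "title" and "abstract" columns;
# consult the ASReview documentation for the exact formats it accepts.
import pandas as pd

records = pd.read_csv("records.csv")
print(f"{len(records)} records loaded")
print(records[["title", "abstract"]].head())

# Records lacking both a title and an abstract cannot contribute text features.
missing = records["title"].isna() & records["abstract"].isna()
print(f"{missing.sum()} records have neither title nor abstract")
```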
Prior knowledge is then selected, which is used to train the first model and to present the first record to the researcher. Because screening is a binary classification problem, the reviewer must, based on background knowledge, select at least one key record to include and at least one to exclude. More prior knowledge may result in improved efficiency of the active learning process. Based on the prior knowledge, a machine learning classifier is trained to predict study relevance (labels) from a representation of the record containing text (feature space). In order to prevent "authority bias" in the inclusions, we have purposefully chosen not to include author names or a citation network representation in the feature space. In the active learning cycle, the software presents one new record to be screened and labeled (1 - "relevant" vs. 0 - "irrelevant") by the user. The user's binary label is subsequently used to train a new model, after which a new record is presented to the user. This cycle continues until a user-specified stopping criterion has been reached. The user then has a file with (1) records labeled as either relevant or irrelevant and (2) unlabeled records ordered from most to least probable to be relevant, as predicted by the current model. This setup helps the user move through a large database much more quickly than in the manual process, while, at the same time, the decision process remains transparent.

The source code 27 is openly available. Each record in the input dataset should hold metadata on, for example, a paper. Mandatory metadata is text, which can, for example, be the titles or abstracts of scientific papers. If available, both are used to train the model, but at least one is needed. An advanced option is available that splits the titles and abstracts in the feature extraction step and weights the two feature matrices independently (for TF-IDF only). Other metadata such as author, date, URL, DOI, and keywords are optional but not used for training the models.

When using ASReview in simulation or exploration mode, an additional binary variable indicating historical labeling decisions is required. This column, which is automatically detected, can also be used in the oracle mode as background knowledge for prior selection of relevant papers before entering the active learning cycle. If it is not available, the user has to select at least one relevant record, which can be identified by searching the pool of records. Also, at least one irrelevant record should be identified; the software allows the user to search for specific records or presents random records, which are most likely to be irrelevant given the extremely imbalanced data.

The software has a simple yet extensible default model: a naive Bayes classifier, TF-IDF feature extraction, the Dynamic Resampling balance strategy 31, and certainty-based sampling 17,32 for the query strategy. These defaults were chosen based on their consistently high performance in benchmark experiments across several datasets 31. Moreover, the low computation time of these default settings makes them attractive in applications, given that the software should be able to run locally. Users can change the settings, shown in Table 2, and technical details are described in our documentation 28. Users can also add their own classifiers, feature extraction techniques, query strategies, and balance strategies. ASReview has a number of implemented features (see Table 2).
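The cycle described above can be sketched in a few lines of Python. The code below is an illustrative re-implementation in the spirit of the defaults (TF-IDF features, a naive Bayes classifier, and certainty-based sampling), not the ASReview code itself; the records, the ask_reviewer helper, and the stopping rule are placeholders, and the Dynamic Resampling balance strategy is omitted for brevity.

```python
# Illustrative active learning loop in the spirit of ASReview's defaults
# (TF-IDF features, naive Bayes classifier, certainty-based sampling).
# A sketch only, not the ASReview implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def ask_reviewer(text):
    # Placeholder for the screening decision made by the human reviewer.
    return int(input(f"Relevant? (1/0)\n{text[:300]}\n> "))

# Placeholder records (titles/abstracts); in practice these come from the uploaded file.
texts = [
    "effect of intervention X on outcome Y: a randomized trial",
    "viral metagenomic sequencing in livestock: a survey",
    "an unrelated study about something else entirely",
    "another record on intervention X and outcome Y",
]
X = TfidfVectorizer(lowercase=True).fit_transform(texts)

labels = {0: 1, 2: 0}  # prior knowledge: one relevant and one irrelevant record

while len(labels) < len(texts):  # in practice: until a user-defined stopping criterion
    train_idx = np.array(sorted(labels))
    clf = MultinomialNB().fit(X[train_idx], [labels[i] for i in train_idx])

    unlabeled = np.array([i for i in range(len(texts)) if i not in labels])
    proba = clf.predict_proba(X[unlabeled])[:, list(clf.classes_).index(1)]
    query = int(unlabeled[np.argmax(proba)])  # certainty-based: most likely relevant

    labels[query] = ask_reviewer(texts[query])  # the reviewer labels the queried record
```

In the software itself, the same kind of loop runs behind the graphical front end, with the model retrained after each label and the stopping criterion left to the user, as described above.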
First, there are several classifiers available: (1) naive Bayes, (2) support vector machines, (3) logistic regression, (4) neural networks, (5) random forests, (6) LSTM-base, which consists of an embedding layer, an LSTM layer with one output, a dense layer, and a single sigmoid output node, and (7) LSTM-pool, which consists of an embedding layer, an LSTM layer with many outputs, a max pooling layer, and a single sigmoid output node. Feature extraction techniques available are Doc2Vec 33, embedding with IDF or TF-IDF 34 (the default is unigram, with the option to run n-grams, while other parameters are set to the defaults of Scikit-learn 35), and sBERT 36. The available query strategies for the active learning part are listed in Table 2.

There are several balance strategies that rebalance and reorder the training data. This is necessary because the data is typically extremely imbalanced; we have therefore implemented the following balance strategies: (1) full sampling, which uses all the labeled records; (2) undersampling the irrelevant records, so that the included and excluded records are in some particular ratio (closer to one); and (3) "Dynamic Resampling", a novel method similar to undersampling in that it decreases the imbalance of the training data 31. In Dynamic Resampling, the number of irrelevant records is decreased, whereas the number of relevant records is increased by duplication such that the total number of records in the training data remains the same. The ratio between relevant and irrelevant records is not fixed over iterations, but dynamically updated, depending on the number of labeled records, the total number of records, and the ratio between relevant and irrelevant records. Details on all the described algorithms can be found in the code and documentation referred to above.

By default, ASReview converts the records' texts into a document-term matrix; terms are converted to lowercase and no stop words are removed (both defaults can be changed). Because the document-term matrix is identical in each iteration of the active learning cycle, it is generated in advance of model training and stored in the (active learning) state file. The indexed records can easily be requested from the document-term matrix in the state file. Internally, records are identified by their row number in the input dataset. In "oracle mode", the record that is selected to be classified is retrieved from the state file, and the record text and other metadata (such as title and abstract) are retrieved from the original dataset (from file or computer memory).

ASReview can run on your local computer, or on a (self-hosted) local or remote server. Data - all records and their labels - remain on the user's computer. Data ownership and confidentiality are crucial, and no data is processed or used in any way by third parties. This stands in contrast to some of the existing systems, as shown in the last column of Table 1.

Below we highlight a number of real-world use cases and high-level function descriptions for using the pipeline of ASReview. ASReview can be integrated in classic systematic reviews or meta-analyses. Such reviews or meta-analyses entail several explicit and reproducible steps, as outlined in the PRISMA guidelines. 4 Scholars identify all likely relevant publications in a standardized way, screen retrieved publications to select eligible studies based on defined eligibility criteria, extract data from eligible studies, and synthesize the results.
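To give a flavor of such a rebalancing step, the sketch below duplicates relevant records and subsamples irrelevant ones so that the training set keeps its original size. It is not the ASReview implementation: the fixed target_ratio and the helper name rebalance are illustrative assumptions, whereas the actual Dynamic Resampling ratio is updated dynamically as described above.

```python
# Illustrative rebalancing in the spirit of undersampling/Dynamic Resampling;
# the real ASReview implementation and its ratio schedule are documented in the package.
import numpy as np

def rebalance(train_idx, labels, target_ratio=0.25, seed=42):
    """Duplicate relevant records and subsample irrelevant ones so the
    rebalanced training set keeps its original size (illustrative only)."""
    rng = np.random.default_rng(seed)
    rel = np.array([i for i in train_idx if labels[i] == 1])
    irr = np.array([i for i in train_idx if labels[i] == 0])

    n_total = len(train_idx)
    n_rel = max(1, int(target_ratio * n_total))  # in ASReview this ratio is dynamic
    n_irr = n_total - n_rel

    rel_sample = rng.choice(rel, size=n_rel, replace=True)
    irr_sample = rng.choice(irr, size=n_irr, replace=n_irr > len(irr))

    out = np.concatenate([rel_sample, irr_sample])
    rng.shuffle(out)
    return out

# Example: 3 labeled relevant and 37 labeled irrelevant records.
labels = {i: (1 if i < 3 else 0) for i in range(40)}
balanced = rebalance(list(labels), labels)
```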
ASReview fits into this process, particularly in the abstract screening phase. ASReview does not replace the initial step of collecting all potentially relevant studies. As such, results from ASReview depend on the quality of the initial search process, including the selection of databases 24 and the construction of comprehensive searches using keywords and controlled vocabulary. However, ASReview can be used to broaden the scope of the search, by keyword expansion or by omitting limitations in the search query, resulting in a higher number of initial papers and limiting the risk of missing relevant papers during the search (i.e., more focus on recall instead of precision).

Also, when analyzing very large literature streams, many reviewers nowadays move towards meta-reviews, that is, systematic reviews of systematic reviews. 37 This can be problematic, as the various reviews included could use different eligibility criteria and are therefore not always directly comparable. Because of the efficiency of ASReview, scholars using the tool could conduct the study by analyzing the papers directly instead of relying on the systematic reviews.

Furthermore, ASReview supports the rapid updating of a systematic review. The included papers from the initial review are used to train the machine learning model before screening of the updated set of papers starts. This allows the researcher to quickly screen the updated set of papers based on decisions made in the initial run.

As an example case, consider the current literature on COVID-19 and the coronavirus. An enormous number of papers are being published on these topics, and it is very time consuming to manually find relevant papers, for example to develop treatment guidelines. This is especially problematic as urgent overviews are required. Medical guidelines rely on comprehensive systematic reviews, but the medical literature is growing at breakneck pace, and the quality of the research is not universally adequate for summarization into policy. 38 Such reviews must entail adequate protocols with explicit and reproducible steps, including identifying all potentially relevant papers, extracting data from eligible studies, assessing potential for bias, and synthesizing the results into medical guidelines. Researchers need to screen (tens of) thousands of COVID-19 related studies by hand to find relevant papers to include in their overview. Using ASReview, this can be done far more efficiently by selecting key papers that match their (COVID-19) research question in the first step; this starts the active learning cycle and leads to the most relevant COVID-19 papers for their research question being presented next. Therefore, a plug-in was developed for ASReview 39 that provides COVID-19 related datasets within the software, among them a dataset of COVID-19 preprints containing metadata of preprints from over 15 preprint servers across disciplines, published since January 1, 2020. 41 The preprint dataset is updated weekly by the maintainers and then automatically updated in ASReview as well. As this dataset is not readily available to researchers through regular search engines (e.g., PubMed), its inclusion in ASReview provided added value to researchers interested in COVID-19 research, especially if they want a quick way to screen preprints specifically.

To evaluate the performance of ASReview on a labeled dataset, users can employ the simulation mode. As an example, we ran simulations based on four labeled datasets with version 0.7.2 of ASReview. All scripts to reproduce the results in this paper can be found on Zenodo 43.
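The idea behind the simulation mode can be expressed compactly: the reviewer's decision is replaced by a lookup of the known label, and the order in which records are screened is recorded for later evaluation. The sketch below is an illustrative replay, not the ASReview simulation code; train_and_rank stands in for retraining the model and ranking the remaining records, as in the loop sketched earlier.

```python
# Illustrative "simulation mode": replay known labels instead of asking a reviewer.
# `train_and_rank(labeled)` is a placeholder that retrains the model on the records
# labeled so far and returns the unlabeled record judged most likely to be relevant.
def simulate(known_labels, prior, train_and_rank):
    """Replay a screening run on a fully labeled dataset (illustrative only)."""
    labeled = dict(prior)                     # prior knowledge, e.g. {3: 1, 17: 0}
    screening_order = list(prior)
    while len(labeled) < len(known_labels):
        query = train_and_rank(labeled)       # next record the model ranks highest
        labeled[query] = known_labels[query]  # oracle lookup replaces the reviewer
        screening_order.append(query)
    return screening_order                    # used afterwards to compute WSS and RRF
```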
Datasets. First, we analyzed the performance on data from a study systematically describing studies that performed viral Metagenomic Next-Generation Sequencing (mNGS) in common livestock such as cattle, small ruminants, poultry, and pigs. 44

Performance Metrics. We evaluated the four datasets using three performance metrics. First, we assess the "Work Saved over Sampling" (WSS), which is the percentage reduction in the number of records needed to screen that is achieved by using the program instead of screening records at random. WSS is measured at a given level of recall of relevant records, for example 95%, indicating the work reduction in screening effort at the cost of failing to detect 5% of the relevant records (WSS@95%). For some researchers it is essential that all relevant literature on the topic is retrieved; this entails that the recall should be 100% (i.e., WSS@100%). Note that to be certain of detecting 100% of the relevant records, all records need to be screened, therefore leading to no time savings. We also propose the number of Relevant References Found after having screened the first 10% of the records (RRF@10%). This is a useful metric for getting a quick overview of the relevant literature.

For every dataset, 15 runs were performed with one random inclusion and one random exclusion (see Figure 2). The classical review performance with randomly found inclusions is shown by the dashed line. The average work saved over sampling at 95% recall for ASReview is 83% and ranges from 67% to 92%. Hence, 95% of the eligible studies will be found after screening only 8% to 33% of the studies. Furthermore, the number of relevant abstracts found after reading 10% of the abstracts ranges from 70% to 100%. In short, our software would have saved many hours of work.

We conducted a series of user experience tests to learn from end users how they experience the software and implement it in their workflow. The study was approved by the Ethics Committee of the Faculty of Social and Behavioral Sciences of Utrecht University (ID 20-104). One test was conducted with an academic research team in a substantive research field (public administration and organizational science) that has conducted various systematic reviews and meta-analyses. It was composed of three university professors (ranging from assistant to full) and three PhD candidates. In one 3.5-hour session, the participants used the software and provided feedback via unstructured interviews and group discussions. The goal was to provide feedback on installing the software and on testing the performance on their own data. The feedback from these sessions was collected in notes. To analyze the notes, thematic analysis was used, which is a method to analyze data by dividing the information into subjects that each have a different meaning 54, using the software Nvivo 12 55. When something went wrong, the text was coded as "showstopper". When something did not go smoothly, the text was coded as "doubtful". When something went well, the subject was coded as "superb". The features the participants requested for future versions of the ASReview tool were discussed with the lead engineer of the ASReview team and were submitted to GitHub as issues or feature requests.

The answers to the quantitative questions can be found at the Open Science Framework 56. The participants (N = 11) rated the tool with a grade of 7.9 (SD = 0.9) on a scale from one to ten (Table 2). The inexperienced users rated the tool on average with an 8.0 (SD = 1.1, N = 6). The experienced users rated the tool on average with a 7.8 (SD = 0.9, N = 5).
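For completeness, the two metrics defined above can be computed directly from a recorded screening order; the sketch below follows those definitions and is not the evaluation code used for the reported results.

```python
# Sketch of WSS and RRF computed from a screening order over a labeled dataset.
import numpy as np

def wss(order, labels, recall=0.95):
    """Work Saved over Sampling at a given recall level."""
    found = np.cumsum([labels[i] for i in order])     # relevant records found so far
    n_relevant = found[-1]
    target = int(np.ceil(recall * n_relevant))        # e.g. 95% of the relevant records
    n_screened = int(np.argmax(found >= target)) + 1  # records screened to reach target
    # Screening at random needs roughly `recall` * N records to reach the same recall.
    return recall - n_screened / len(order)

def rrf(order, labels, fraction=0.10):
    """Proportion of relevant records found after screening the first `fraction`."""
    n_screened = int(np.ceil(fraction * len(order)))
    found = sum(labels[i] for i in order[:n_screened])
    return found / sum(labels.values())

# Tiny worked example: 10 records, 2 relevant, both presented first by the model.
labels = {i: int(i in (4, 7)) for i in range(10)}
order = [4, 7, 0, 1, 2, 3, 5, 6, 8, 9]
print(wss(order, labels, recall=1.0))  # 1.0 - 2/10 = 0.8
print(rrf(order, labels))              # 1 of 2 relevant found in the first record -> 0.5
```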
The participants described the usability test with words such as "helpful", "accessible", "fun", "clear" and "obvious". The UX tests resulted in the new releases v0.10 57 and v0.10.1 58, and in the major release v0.11 59, which is a substantial revision of the GUI. The documentation has been upgraded to make installing and launching ASReview more straightforward. We made setting up a project, selecting a dataset, and finding prior knowledge more intuitive and flexible. In addition, we added a project dashboard with information on screening progress and advanced settings.

To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a system to accelerate the step of screening titles and abstracts. Our system uses active learning to train an ML model that predicts relevance from texts using a limited number of labeled examples.

Drawbacks of ML-based screening systems, including our own, remain. First, while the active learning step greatly reduces the number of papers that must be screened, it also prevents a straightforward evaluation of the system's error rates without further onerous labeling. Providing users with an accurate estimate of the system's error rate in the application at hand is therefore a pressing open problem. Second, while, as argued above, the use of such systems is not limited in principle to reviewing, to our knowledge no empirical benchmarks of actual performance in these other situations yet exist. Third, ML-based screening systems automate the screening step only; while the screening step is time-consuming and a good target for automation, it is just one part of a much larger process, including the initial search, data extraction, coding for risk of bias, summarizing results, and so on. While some other work, similar to our own, has looked at (semi-)automating some of these steps in isolation 60,61, to our knowledge the field is still far removed from an integrated system that would truly automate the review process while guaranteeing the quality of the produced evidence synthesis. Integrating the various tools that are currently under development to aid the systematic reviewing pipeline is therefore a worthwhile topic for future development.

Possible future research could also focus on performance when screening full-text articles with different document lengths and domain-specific terminology, or even other types of text, such as newspaper articles and court cases. When the selection of prior knowledge is not possible based on expert knowledge, alternative methods could be explored. For example, unsupervised learning or pseudo-labeling algorithms could be used to improve training 62,63. In addition, as the NLP community pushes forward the state of the art in feature extraction methods, these can easily be added to our system as well. In all cases, performance benefits should be carefully evaluated using benchmarks for the task at hand. To this end, common benchmark challenges should be constructed that allow for an even comparison of the various tools now available. To facilitate such a benchmark, we have constructed a repository of publicly available systematic reviewing datasets 64.

The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We invite the community to contribute to open source projects such as our own, as well as to common benchmark challenges, so that we can provide measurable and reproducible improvements over current practice.
Stopping: When the model predicts none of the remaining abstracts to be relevant. "We do not have a limit on how long we retain your account information and/or data." "We do not share any information with third parties."

Classifier: NB; SVM; DNN; LR; LSTM-base; LSTM-pool; RF. Model inputs: piece of text, for example title and abstract. Active learning starts after: One label. Stopping: Is currently left to the reviewer. Software does not have access to user data, because the program runs locally.

Classifier: SVM with SGD learning. Model inputs: User-provided key terms and citation (abstract, title, keywords). Feature extraction: Word2Vec. Label options: Include; exclude. Balance strategy: Reweighting. Active learning starts after: 100 inclusions and 100 exclusions. Retraining: Every 30 abstracts. Stopping: Is left to the reviewer. No terms and conditions available. The Colandr team was contacted and they ensured that the user can remove data at any time. In the future, user data will be used to improve Colandr, but only if permission is granted by the project owner.

FASTREAD 68 Classifier: SVM. Feature extraction: TF-IDF. Query strategy: Uncertainty sampling; Certainty sampling. Users are allowed to switch between active learning types after 30 inclusions. Balance strategy: Mix of weighting and aggressive undersampling. Active learning starts after: One relevant abstract is retrieved (through querying random abstracts). Retraining: Every 10 abstracts. Stopping: The number of relevant abstracts is estimated by semi-supervised learning. Software does not have access to user data, because the program runs locally.

Classifier: SVM. Model inputs: User-provided key terms and citation (title and abstract). http://crebp-sra.com/#/ N N N

Notes: URL = webpage affiliated with the tool; Open Source = whether source materials of the application are openly available on a collaborative platform such as GitHub; Open Source URL = link to open source code; Machine Learning = whether the tool implements machine learning with the goal of reducing the number of abstracts needed to screen; Active Learning = whether the tool implements active learning (iteratively retraining the model during screening); Y = Yes, N = No.

References

Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references
An introduction to systematic reviews
Research Synthesis and Meta-Analysis: A Step-by-Step Approach
The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration
Systematic reviews: what have they got to offer evidence based policy and practice? (ESRC UK Centre for Evidence Based Policy and Practice, London)
Systematic reviews: making them policy relevant. A briefing for policy makers and systematic reviewers
Systematic reviews from astronomy to zoology: myths and misconceptions
Searching for Studies
Precision of healthcare systematic review searches in a cross-sectional sample
Error rates of human reviewers during abstract screening in systematic reviews
Toward systematic review automation: a practical guide to using machine learning tools in research synthesis
Software tools to support title and abstract screening for systematic reviews in healthcare: an evaluation
Using text mining for study identification in systematic reviews: a systematic review of current approaches
Semi-automated screening of biomedical citations for systematic reviews
Reducing Workload in Systematic Review Preparation Using Automated Citation Classification
Active learning with support vector machines
Reducing systematic review workload through certainty-based screening
Active Learning Literature Survey
Interactive machine learning for health informatics: when do we need the human-in-the-loop?
Researcher-in-the-loop for systematic reviewing of text databases
Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec
Promoting an open research culture
Towards Automatic Recognition of Scientifically Rigorous Clinical Research Evidence
Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources
Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry
Innovation in the Public Sector: A Systematic Review and Future Research Agenda
ASReview: Active learning for systematic reviews
ASReview Software Documentation 0.14 (Zenodo, 2020)
ASReview Core Development Team. ASReview PyPI package
ASReview Core Development Team. Docker container for ASReview
Active learning for screening prioritization in systematic reviews - A simulation study
Certainty-Enhanced Active Learning for Improving Imbalanced Data Classification
Distributed Representations of Sentences and Documents
Using TF-IDF to Determine Word Relevance in Document Queries
Scikit-learn: Machine Learning in Python
Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084 [cs]
Methodology in conducting a systematic review of systematic reviews of healthcare interventions
Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal
Extension for COVID-19 related datasets in ASReview
CORD-19: The Covid-19 Open Research Dataset
Scripts for 'ASReview: Open source software for efficient and transparent active learning for systematic reviews'
Results for 'ASReview: Open Source Software for Efficient and Transparent Active Learning for Systematic Reviews'
A Systematic Literature Review on Fault Prediction Performance in Software Engineering
The GRoLTS-Checklist: Guidelines for reporting on latent trajectory studies
Bayesian PTSD-Trajectory Analysis with Informed Priors Based on a Systematic Literature Search and Expert Elicitation
Feature generation, feature selection, classifiers, and conceptual drift for biomedical document triage
ASReview Core Development Team
ASReview Core Development Team
ASReview Core Development Team
ASReview Core Development Team. Release v0
Human-moderated remote user testing: Protocols and applications
Thematic analysis
ASReview Core Development Team. Release v0
ASReview Core Development Team
ASReview Core Development Team. Release v0
RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials
Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
Unsupervised data augmentation for consistency training
Snorkel: rapid training data creation with weak supervision
Deploying an interactive machine learning system in an evidence-based practice center: abstrackr
Using machine learning to advance synthesis and use of conservation and environmental evidence
Finding better active learners for faster literature reviews
Rayyan-a web and mobile app for systematic reviews
Prioritising references for systematic reviews with RobotAnalyst: A user study
ASReview Core Development Team
RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials
Proceedings of the 30th annual ACM symposium on applied computing
Proceedings of the 16th International Conference on Enterprise Information Systems
Proceedings of the 20th international conference on evaluation and assessment in software engineering
Abstracts of the 24th Cochrane Colloquium

We would like to thank the Utrecht University Library, focus area Applied