key: cord-0710701-6s5gh2pj authors: Sun, Yingcheng; Butler, Alex; Lin, Fengyang; Liu, Hao; Stewart, Latoya A; Kim, Jae Hyun; Idnay, Betina Ross S; Ge, Qingyin; Wei, Xinyi; Liu, Cong; Yuan, Chi; Weng, Chunhua title: The COVID-19 Trial Finder date: 2020-11-20 journal: J Am Med Inform Assoc DOI: 10.1093/jamia/ocaa304 sha: 79a42d9d715228f9c26fc6bfb00448d7b12a387a doc_id: 710701 cord_uid: 6s5gh2pj Clinical trials are the gold standard for generating reliable medical evidence. The biggest bottleneck in clinical trials is recruitment. To facilitate recruitment, tools for patient search of relevant clinical trials have been developed, but users often suffer from information overload. With nearly 700 COVID-19 trials conducted in the United States as of August 2020, it is imperative to enable rapid recruitment to these studies. The COVID-19 Trial Finder was designed to facilitate patient-centered search of COVID-19 trials, first by location and radius distance from trial sites, and then by brief, dynamically generated medical questions that allow users to pre-screen their eligibility for nearby COVID-19 trials with minimal human-computer interaction. A simulation study using 20 publicly available patient case reports demonstrates its precision and effectiveness. Patient-to-trial matching remains a critical bottleneck in clinical research, largely due to the free-text format of clinical trial information, 1 particularly eligibility criteria, which are indispensable for screening patient eligibility and yet not amenable to even simple computation. 2 Existing clinical trial search systems are either keyword-based or questionnaire-based. 3 Keyword-based search engines, such as ClinicalTrials.gov, FindMeCure.com, the Janssen Global Trial Finder, 4 or ResearchMatch, 5 require users to search for trials using keywords, which imposes challenges for query formulation and generates information overload.
6 Static questionnaire systems, such as Fox Trial Finder, 7 filter out irrelevant trials by asking users to answer a long list of preselected questions, which can be laborious and is not user friendly. The COVID-19 pandemic is one of the greatest challenges modern medicine has faced. As of August 2020, there have been more than 6 million confirmed cases and 180,000 reported deaths in the United States, with few approved treatments. 8, 9 In response to the COVID-19 emergency, clinical trials assessing the efficacy and safety of COVID-19 treatments are being created at an unprecedented rate. As of August 31, 2020, well over 3,100 clinical trials have been registered in ClinicalTrials.gov, the largest clinical trial registry in the world. The need for rapid and accessible trial search tools has never been more apparent. In this paper, we describe an open-source semantic search engine for COVID-19 clinical trials conducted in the United States, called the "COVID-19 Trial Finder," built by extending our previously published method for using dynamically generated questionnaires to enable efficient clinical trial search. 6 This interactive COVID-19 trial search engine minimizes the questionnaire by generating questions dynamically in response to user-provided answers in real time. It is powered by a regularly updated machine-readable dataset of all COVID-19 trials in the US. It is also enhanced with a web-based visualization of the geographic distribution of COVID-19 trials in the US to enable user-friendly navigation of the trial space. By facilitating search for appropriate COVID-19 trials in specific geographic areas, the system enables research volunteers to perform self-screening against the eligibility criteria of these COVID-19 trials.
Further, it allows clinical trialists to assess the landscape of COVID-19 trials by eligibility criteria and geographic location in order to identify collaboration opportunities for similar COVID-19 studies and improve trial responsiveness to evolving case surges. The system is accessible at https://covidtrialx.dbmi.columbia.edu, with its source code at https://github.com/WengLab-InformaticsResearch/COVID19-TrialFinder. We evaluated the system on 20 published COVID-19 case reports and demonstrated its precision and efficiency. The COVID-19 Trial Finder consists of two modules, for trial indexing and trial retrieval, respectively. The trial indexing module works offline to extract entities and attributes from eligibility criteria text and to create a trial index using semantic tags, which are the extracted terms mapped to standardized clinical concepts. The retrieval module dynamically generates medical questions and iteratively filters out trials based on user answers until a sufficiently short list of trials is generated. Figure 1 shows the system architecture. Clinical trial eligibility criteria exist largely as free text, so they must be formalized into a machine-readable syntax to allow for semantic trial retrieval. In the trial indexing module, COVID-19-related trials are acquired from ClinicalTrials.gov by querying all trials indexed with "COVID-19" as their condition. Using a semi-automated method, their eligibility criteria are structured and formatted using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), 10,11 first by the automated tool Criteria2Query, 12 and then verified and corrected manually as needed by medical domain experts (AB, JK, LS) to overcome the limitations of Criteria2Query. By incorporating international terminologies, the OMOP CDM provides a comprehensive standard vocabulary for representing the clinical concepts commonly available in patient data.
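The offline indexing step described above can be sketched as follows. The term-to-concept table, trial records, and function names here are illustrative stand-ins for Criteria2Query's NER and entity-normalization output, not the system's actual implementation:

```python
from collections import defaultdict

# Toy concept-normalization table standing in for Criteria2Query's NER and
# entity-normalization modules (illustrative, not the real mapping).
TERM_TO_CONCEPT = {
    "shortness of breath": ("Dyspnea", "Condition"),
    "pregnant": ("Pregnancy", "Condition"),
    "mechanical ventilation": ("Mechanical ventilation", "Procedure"),
}

def extract_tags(criteria_text):
    """Return the semantic tags (concept, domain) found in free-text criteria."""
    text = criteria_text.lower()
    return {tag for term, tag in TERM_TO_CONCEPT.items() if term in text}

def build_index(trials):
    """Map each semantic tag to the set of trial IDs whose criteria mention it."""
    index = defaultdict(set)
    for trial_id, criteria in trials.items():
        for tag in extract_tags(criteria):
            index[tag].add(trial_id)
    return index

# Hypothetical trial records with free-text eligibility criteria.
trials = {
    "NCT001": "Inclusion: shortness of breath. Exclusion: pregnant.",
    "NCT002": "Inclusion: mechanical ventilation.",
}
index = build_index(trials)
```

An inverted index of this shape lets the retrieval module look up, for any semantic tag, exactly which trials a user's answer should retain or filter out.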
For example, in our system the phrase "shortness of breath" in criteria text is first extracted by Criteria2Query through its named entity recognition (NER) module and mapped to the concept "Dyspnea" as the semantic tag through the entity normalization module, which is then used to index the associated trials. We provided a dataset with this paper, 13 including 581 COVID-19 trials annotated with 10,223 semantic tags, or 17.6 tags per trial on average. After removing duplicates, 1,811 distinct tags are used for the trial index. Supplementary Material Table 1 lists a subset of the dataset. Newly registered COVID-19 trials are added to the database weekly, and the trial index is updated accordingly. Entities from the following five domains are used for question generation: Condition, Device, Drug, Measurement, and Procedure. Domain-specific question templates are provided in Supplementary Material Table 2. For example, "Have you ever been diagnosed with (condition concept)?" is the question template for "Condition" entities. The templates do not exhaustively cover all possible questions but are designed to triage trials by building upon prior knowledge of common eligibility criteria. 6 The trial retrieval module interacts with users and asks criteria-related questions to facilitate eligibility determination. Users first enter their location (e.g., zip code) and then select a study type (e.g., interventional or observational). Next, the five most frequently used criteria, concerning current age, high-risk status (e.g., hospital worker), COVID-19 status, current hospitalization or ICU admission, and pregnancy status, are formulated as "standard questions" and posed to all users. Because these criteria are frequently used across all COVID-19 trials, they serve as an important participant stratification step and are therefore posed together on a single page rather than via dynamically generated questions.
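The template-based question rendering might look like the sketch below. Only the "Condition" wording is quoted in the text above; the other four templates and all function names are hypothetical placeholders:

```python
# Domain-specific question templates; only the "Condition" wording is quoted
# from the paper, the other four are plausible stand-ins.
TEMPLATES = {
    "Condition": "Have you ever been diagnosed with {concept}?",
    "Drug": "Are you currently taking {concept}?",
    "Procedure": "Have you ever undergone {concept}?",
    "Measurement": "Do you know your most recent {concept} value?",
    "Device": "Do you currently use {concept}?",
}

def render_question(tag):
    """Render a (concept, domain) semantic tag into a yes/no screening question."""
    concept, domain = tag
    return TEMPLATES[domain].format(concept=concept.lower())

q = render_question(("Dyspnea", "Condition"))
```

Because every semantic tag already carries its OMOP domain, one template per domain suffices to turn any indexed criterion into a user-facing question.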
Afterwards, the eligibility criterion with the maximum information gain (i.e., the highest entropy) is selected and rendered into a question using the corresponding template. Based on the user's answer to the presented question, ineligible trials are filtered out. This process iterates, each time narrowing down the candidate trial pool and visualizing the recruiting sites of the remaining eligible trials on an interactive map, until the user reaches a short list of trials. Four main web interfaces are shown in Figure 2. On the index page, users can specify the geographic area in which to search for recruiting sites by entering a zip code and an adjustable radius in (1). Section (2) is a collapsible section containing advanced search options such as trial type, recruiting status, and keyword search. The five "standard questions" are expected to be answered in (3), and users can skip them by clicking the "skip questions" button in (4) to enter the dynamic questionnaire page. Users can select the question type and answer one question at a time in section (5), and the candidate trial list is dynamically updated and shown in (6). The answered questions are recorded in (7), where users can return to previous questions to make updates. After clicking the "Show Eligible Trials" button, users can visualize the trials geographically. Eligible trials are listed by their titles in (8), and all recruiting sites within the user-delimited area are marked with small green icons on the map (9). The embedded map is powered by the Google Maps API. Users can select any trial to review additional details such as study type, description, contact information, and location(s) in (10), and its recruiting sites within the geographic area specified by the user in (1) are also highlighted and pinpointed on the map. Additionally, participants interested in learning more about a study can access a link to ClinicalTrials.gov.
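One plausible reading of the entropy-based criterion selection is sketched below, assuming each tag yields a binary split of the candidate pool by presence or absence; the data structures, tie-breaking, and example index are illustrative, not the deployed logic:

```python
import math

def entropy(p):
    """Binary entropy of the fraction p of candidate trials carrying a tag."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_next_tag(index, candidates, asked):
    """Select the unasked semantic tag whose presence/absence splits the
    current candidate pool closest to 50/50 (highest binary entropy)."""
    best_tag, best_h = None, -1.0
    for tag, trial_ids in index.items():
        if tag in asked:
            continue
        p = len(trial_ids & candidates) / len(candidates)
        h = entropy(p)
        if h > best_h:
            best_tag, best_h = tag, h
    return best_tag

# Hypothetical index: "Dyspnea" splits the pool 2 vs 1, while "COVID-19"
# covers all three trials and so carries no information for filtering.
index = {
    "Dyspnea": {"NCT001", "NCT002"},
    "COVID-19": {"NCT001", "NCT002", "NCT003"},
}
candidates = {"NCT001", "NCT002", "NCT003"}
next_tag = pick_next_tag(index, candidates, asked=set())
```

Whatever the user answers, the chosen tag removes the largest expected share of remaining trials, which is why each question shrinks the pool quickly.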
The initial version of the COVID-19 Trial Finder was released in May 2020, and more than 690 page visits from 20 countries were recorded by the end of August 2020, including 615 visits from 40 states in the United States, according to Google Analytics. We evaluated the effectiveness of the system by assessing its precision in identifying appropriate trials for users. We selected 20 patient cases in the U.S. from COVID-19 case reports curated by LitCOVID, 14 with consideration for diversity in location, age, sex, and comorbidities, and ran simulations on our system based on these patient cases. Detailed information on the 20 cases can be found in Supplementary Material Table 3. For each case report, the zip code was based on the corresponding author's address stated in the case report. We set the default radius to 100 miles to ensure adequate coverage of available trials. The five standard questions and multiple dynamic questions were answered based on the patient profile and reported symptoms. We continued answering the dynamic questions until no more questions could be generated and the system prompted a review of the returned trial list. An example of the question-answering process is shown in Supplementary Material Table 4. We then manually reviewed each identified clinical trial in the final list to confirm its relevance to the user query by examining the inclusion and exclusion criteria available at ClinicalTrials.gov. Table 1 includes all the above results. For one case (No. 10), precision was normalized by the number of trials after screening. Next, we evaluated the efficiency of the system by comparing the number of trials identified at each step of the search. The fraction of trials filtered out was computed from the number of trials remaining after the five standard questions and the number remaining after the dynamically generated questions.
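Under the assumption that precision is the fraction of returned trials confirmed relevant on manual review, and that the filtered fraction compares pool sizes before and after the dynamic questions, the two evaluation metrics can be sketched as:

```python
def precision(identified, relevant):
    """Fraction of trials in the final returned list confirmed relevant."""
    if not identified:
        return 0.0
    return len(set(identified) & set(relevant)) / len(identified)

def fraction_filtered(n_after_standard, n_after_dynamic):
    """Share of the post-standard-question pool removed by dynamic questions."""
    return 1 - n_after_dynamic / n_after_standard

# Hypothetical case: 3 trials returned with 2 confirmed relevant; a pool of
# 23 trials after the standard questions shrinks to 15 after the dynamic
# questions, i.e. 8/23 (roughly 34.8%) filtered out.
p = precision(["NCT001", "NCT002", "NCT003"], ["NCT001", "NCT002"])
f = fraction_filtered(23, 15)
```

The trial IDs and pool sizes are invented for illustration; the paper's actual per-case numbers appear in Table 1.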
On average, 34.8% of trials were filtered out after answering 9 dynamic questions, which is consistent with the experimental results of DQueST. 6 A small number of identified trials were irrelevant, as confirmed by manual review. In review, the imprecision encountered in finding eligible trials was largely caused by the inability to generate relevant questions. For a few criteria, no questions were asked to filter out ineligible trials; these limitations can be summarized into three types, Location, Identity, and Condition, with examples alongside the unmatched criteria and causes of the errors described in Table 2. The "Location" limitation refers to insufficient granularity or specificity in our question template for locations. The "Identity" limitation signifies insufficient specificity about the identity of the participant, such as some trials recruiting clinical therapists instead of COVID-19-infected patients. For the "Condition" limitation, extraction may be incorrect or missed, so that concepts are mismatched to terms. 12 For example, the word "severe" can be a qualifier for a condition as opposed to being part of the condition definition. To improve the relevance and precision of trial filtering, the system could utilize a more granular annotation model covering more entities and attributes, as well as a wider range of domain types for these annotations, such as Visit, Person, or Observation within the OMOP CDM. Further, additional question templates would allow more questions to be posed. Considering the tradeoff between finer granularity in the annotation model and the increase in annotation cost, we did not add more questions in this study, but future efforts can explore how to efficiently annotate more types of criteria to boost the precision of trial matching while maintaining a high level of usability and ease of access to sustain user participation.
Currently, we include only COVID-19 trials conducted in the United States in the Trial Finder application, simply to keep the scope manageable for evaluation purposes and to avoid the engineering work of translating the system into other languages. Our open-source method is available for adoption and implementation by researchers across the world. We compared the inclusion and exclusion criteria of 777 COVID-19 trials in the US and 2,318 COVID-19 trials outside the US registered on ClinicalTrials.gov by Oct 1, 2020, and found a 42.3% overlap (87.0% when infrequent criteria, defined as criteria that appear in fewer than 10 trials, are excluded). The non-overlapping criteria can also be indexed with standard concepts and searched via corresponding questions, since the OMOP CDM includes international terminologies. It will be interesting and feasible to apply the Trial Finder system to non-US trials in the future. The COVID-19 Trial Finder facilitates fast search and self-eligibility screening for COVID-19 trial seekers. Despite its limitations, a preliminary evaluation with simulated case reports demonstrates its precision and efficiency, showing its potential as a user-friendly COVID-19 trial search engine. Supplementary material is available at the Journal of the American Medical Informatics Association online.
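The overlap computation is not spelled out in detail in the text above; a Jaccard-style sketch under that assumption, with the infrequent-criteria cutoff exposed as a parameter (the paper's threshold would be `min_count=10`), might look like:

```python
from collections import Counter

def criteria_overlap(us_criteria, intl_criteria, min_count=1):
    """Jaccard overlap of criterion concepts between two trial populations,
    ignoring criteria that appear in fewer than min_count trials. The metric
    choice is an assumption; the paper does not define its overlap formula."""
    us = {c for c, n in Counter(us_criteria).items() if n >= min_count}
    intl = {c for c, n in Counter(intl_criteria).items() if n >= min_count}
    union = us | intl
    if not union:
        return 0.0
    return len(us & intl) / len(union)

# Toy inputs: one entry per trial in which a criterion concept appears.
us = ["age >= 18", "age >= 18", "pregnancy"]
intl = ["age >= 18", "icu admission"]
j_all = criteria_overlap(us, intl)
j_frequent = criteria_overlap(us, intl, min_count=2)
```

With the toy data, one of three distinct concepts is shared; raising `min_count` drops the rare criteria from both sides before the comparison, mirroring the paper's frequent-criteria analysis.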
EliIE: An open-source information extraction system for clinical trial eligibility criteria
Information extraction from free text in clinical trials with knowledge-based distant supervision
A conversational agent-based clinical trial search engine
Global Trial Finder: Why It Just Got Easier to Enroll in a Janssen Clinical Study
Connecting the public with clinical trial options: the ResearchMatch Trials Today tool
DQueST: dynamic questionnaire for search of clinical trials
Fox Trial Finder: an innovative web-based trial matching tool to facilitate clinical trial recruitment
An interactive online dashboard for tracking COVID-19 in US counties, cities, and states in real time
COVID-19 Dashboard by the
OHDSI Common Data Model v6.0 Specifications
Criteria2Query: a natural language interface to clinical databases for cohort definition
Data from: The COVID-19 Trial Finder

None.

YS, AB, and CW conceived the system design together. YS, AB, FL, and CL designed and implemented the system. CW supervised the design and implementation. HL, LS, JK, and CY contributed to the data annotation. BI, QG, and XW contributed to the evaluation of the system. All authors edited and approved the manuscript.

The data underlying this article are available in the Dryad Digital Repository, at