key: cord-0692221-o6ndhlsn
authors: Du, Jingcheng; Wang, Qing; Wang, Jingqi; Ramesh, Prerana; Xiang, Yang; Jiang, Xiaoqian; Tao, Cui
title: COVID-19 Trial Graph: A Linked Graph for COVID-19 Clinical Trials
date: 2021-04-24
journal: J Am Med Inform Assoc
DOI: 10.1093/jamia/ocab078
sha: 1fc8142310c66512d5747544effeddd36318f9cb
doc_id: 692221
cord_uid: o6ndhlsn

OBJECTIVE: Clinical trials are an essential part of the effort to find safe and effective prevention and treatment for COVID-19. Given the rapid growth of COVID-19 clinical trials, there is an urgent need for better clinical trial information retrieval that supports searching over both eligibility criteria and structured trial information.

MATERIALS AND METHODS: We built a linked graph for registered COVID-19 clinical trials, the COVID-19 Trial Graph, to facilitate retrieval of clinical trials. Natural language processing (NLP) tools were leveraged to extract and normalize clinical trial information from both the eligibility criteria free texts and the structured information on ClinicalTrials.gov. We linked the extracted data in the COVID-19 Trial Graph and imported it into a graph database, which supports both query and visualization. We evaluated the trial graph using case queries and graph embedding.

RESULTS: The graph currently (as of 10-05-2020) contains 3,392 registered COVID-19 clinical trials, with 17,480 nodes and 65,236 relationships. Manual evaluation of case queries found high precision and recall scores for retrieving relevant clinical trials by both eligibility criteria and structured trial information. We observed clustering of clinical trials via graph embedding, which also outperformed the baseline (0.8704 vs. 0.8199) in evaluating whether a trial can complete its recruitment successfully.

CONCLUSIONS: The COVID-19 Trial Graph is a novel representation of clinical trials that allows diverse search queries and provides a graph-based visualization of COVID-19 clinical trials. The high-dimensional vectors produced by graph embedding for clinical trials could benefit many downstream applications, such as prediction of trial end recruitment status and trial similarity comparison. Our methodology is also generalizable to other clinical trials, such as cancer clinical trials.

The coronavirus disease 2019 (COVID-19) pandemic is an ongoing global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). As of September 15, 2020, more than 29 million cases had been reported worldwide, resulting in over 900,000 deaths. The United States alone has reported 6.5 million cases, with more than 194,000 deaths. [1] Unfortunately, because COVID-19 is so new, no drugs or other therapeutics have so far been approved by the U.S. Food and Drug Administration to treat or prevent it. [2] As part of the essential effort to identify safe and effective approaches for the prevention and treatment of COVID-19, clinical research communities are accelerating their response to this pandemic. To date, more than 3,300 clinical trials have been registered at ClinicalTrials.gov for COVID-19, and more than half are actively recruiting participants. [3] A registered clinical trial typically contains both structured information (e.g., recruitment status, study type, intervention/treatment) and unstructured information, including eligibility criteria.
Given the high volume and rapid growth of COVID-19 clinical trials, it is critical to have an integrated repository that links clinical trial information to facilitate better trial searching, matching, and accrual. Existing systems for tracking and managing COVID-19 trials (e.g., the clinical trials registry [4], ClinicalTrials.gov) focus more on the retrieval of structured information (e.g., title, target condition or disease, study type). Eligibility criteria, which define the subject cohort of each clinical study in free-text format, are often not searchable in these systems. Although ClinicalTrials.gov offers advanced search functions for clinical trials (e.g., https://clinicaltrials.gov/ct2/results/refine?cond=COVID-19), most of its search capabilities focus on structured information. Due to the free-text nature of eligibility criteria, ClinicalTrials.gov provides only very limited search functionality over them, covering only age, sex, and whether a trial accepts healthy volunteers. [3] Other important information in the cohort definition, such as existing conditions of potential participants (e.g., diabetes, pregnancy), use of prior medications, and lab tests/measurements, is not searchable in ClinicalTrials.gov.

In this study, we built the COVID-19 Trial Graph, a graph-based clinical trial data repository, to link structured and unstructured (i.e., eligibility criteria) information for registered COVID-19 clinical trials. The COVID-19 Trial Graph supports diverse search queries, with a particular focus on eligibility criteria, and provides a graph-based visualization of COVID-19 clinical trials. In addition, as one of the first such efforts, we trained a clinical trial graph embedding on the COVID-19 Trial Graph. We evaluated the trained trial embedding through visualization and one supervised machine learning task: prediction of trial end recruitment status. We used trials that have already ended for evaluation; the model could also be used to predict whether an ongoing trial will complete recruitment successfully.

An overview of the study design is shown in Figure 1. For each COVID-19 clinical trial collected from ClinicalTrials.gov, we first leveraged natural language processing (NLP) tools to extract and normalize structured and unstructured (i.e., eligibility criteria) information. Then, we linked the extracted data and entities to construct the COVID-19 Trial Graph. As part of the evaluation, we tested several case queries and evaluated the precision of the results. We also trained a clinical trial graph embedding, which can be used to represent clinical trials for downstream applications.

[Insert Figure 1 about here]

We collected structured data from COVID-19-relevant clinical trials on ClinicalTrials.gov on 09-22-2020. The data included NCT ID, title, intervention, study type, location, phase, sponsor, outcome measures, etc. For the intervention target, we manually normalized the most frequent terms (top 300) to their full names; for example, "hcq," "hydroxychloroquine (hcq)," and "hydroxychloroquine sulfate 200 mg" were all mapped to "hydroxychloroquine." For the location, we leveraged the Google Maps API to extract country/region information.
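A minimal sketch of this preprocessing step is given below, assuming the googlemaps Python client and a valid API key. The mapping dictionary and function names are illustrative only (the study manually curated the top 300 intervention terms); they are not the authors' actual code.

```python
import googlemaps

# Illustrative excerpt of a manually curated mapping; the study normalized the
# top 300 most frequent intervention terms to their full names.
INTERVENTION_MAP = {
    "hcq": "hydroxychloroquine",
    "hydroxychloroquine (hcq)": "hydroxychloroquine",
    "hydroxychloroquine sulfate 200 mg": "hydroxychloroquine",
}

def normalize_intervention(raw_term: str) -> str:
    """Map a raw intervention string to its curated full name when one is known."""
    key = raw_term.strip().lower()
    return INTERVENTION_MAP.get(key, key)

def extract_country(facility_location: str, api_key: str):
    """Geocode a trial location string and return its country, if resolvable."""
    client = googlemaps.Client(key=api_key)  # requires a valid Google Maps API key
    results = client.geocode(facility_location)
    if not results:
        return None
    for component in results[0]["address_components"]:
        if "country" in component["types"]:
            return component["long_name"]
    return None

print(normalize_intervention("Hydroxychloroquine (HCQ)"))  # -> "hydroxychloroquine"
```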
For the unstructured data in clinical trial eligibility criteria free text, Criteria2Query [5], a hybrid information extraction pipeline for parsing eligibility criteria, was adopted to extract a variety of named entities from both inclusion and exclusion criteria, including Condition, Drug, Measurement, Procedure, and Observation. All of the extracted entities were then mapped to the OMOP Common Data Model, a popular standardized data representation for observational data [6]. Specifically, we indexed all the standard concept names and corresponding synonyms of the OMOP standard vocabulary using Apache Lucene. We then applied the BM25 algorithm to retrieve candidate concepts from the Lucene index and leveraged the text processing and rule-based concept re-ranking functions of CLAMP [7] to map each mention to a final standard concept ID.

Figure 1(b) shows the metadata-level design of the COVID-19 Trial Graph. There are nine types of concepts (i.e., nodes) defined in the graph: one concept for the clinical trial, five concepts for eligibility criteria, and three concepts for structured trial information. The nine node types and their statistics can be seen in Table 1. All other relevant information (e.g., status, study type, phase) was added as attributes of the clinical trial concept. For each pair of clinical trial and eligibility criteria concepts, two relationship types, inclusion and exclusion, were defined, depending on where the concept pair appears. The full list of relationships and their statistics can be seen in Table 2. We leveraged Neo4j, a leading graph database, to import all of the graph data and host the COVID-19 Trial Graph. [8] Cypher is a declarative graph query language supported by Neo4j. [9] We designed several Cypher case queries, manually checked the returned results, and calculated the precision of these queries. The graph data files and Cypher case queries are publicly available at: https://github.com/UT-Taogroup/clinical_trial_graph .

Embedding was first proposed for NLP tasks to map words or phrases in a vocabulary to vectors of real numbers based on word-word co-occurrences, so as to enable mathematical computation between text pieces. [10] Similarly, graph embedding refers to a set of algorithms that transform graphs or graph elements into vectors of real numbers by capturing the topology and node-to-node relationships of the graph, to facilitate a series of downstream computational operations. In this study, we leveraged node2vec to learn the node representations in the COVID-19 Trial Graph. Node2vec adopts a random-walk sampling strategy to learn node representations by optimizing a neighborhood-preserving objective. [11, 12] We empirically set the embedding dimension to 100, the walk length to 10, the number of walks to 2,000, and the number of iterations to 1. We used the implementation from Grover and Leskovec. [13] To visualize the trial graph embedding, we first reduced the embedding dimension to 50 using principal component analysis (PCA) and then further reduced it to two using t-distributed stochastic neighbor embedding (t-SNE). t-SNE is a machine learning-based algorithm for dimensionality reduction that has been widely adopted for visualizing high-dimensional embeddings of biomedical concepts. [14, 15] The perplexity of t-SNE was set at 50, and the number of iterations was set at 10,000.
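As a rough illustration of the concept-normalization step described above, the sketch below retrieves the best-scoring candidate for an extracted mention with BM25. It substitutes the rank_bm25 package for the Apache Lucene index and omits the CLAMP re-ranking step; the concept IDs and names are placeholders, not real OMOP concept IDs.

```python
from rank_bm25 import BM25Okapi

# Miniature dictionary standing in for the indexed OMOP standard vocabulary.
# IDs are placeholders for illustration, not real OMOP concept_ids.
candidates = [
    (1001, "dyspnea"),
    (1001, "shortness of breath"),   # synonym of the same placeholder concept
    (1002, "type 2 diabetes mellitus"),
    (1003, "hydroxychloroquine"),
]
bm25 = BM25Okapi([name.split() for _, name in candidates])

def map_mention(mention: str):
    """Return (concept_id, matched name, score) for the best BM25 candidate."""
    scores = bm25.get_scores(mention.lower().split())
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best][0], candidates[best][1], scores[best]

print(map_mention("shortness of breath"))  # maps to the dyspnea placeholder concept
```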
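The actual case queries are available in the repository linked above; the sketch below only shows the general pattern of running a Cypher query against the hosted graph with the official neo4j Python driver. The node labels, relationship type, and property names used here (Trial, Condition, HAS_INCLUSION_CONDITION, nct_id) are assumptions for illustration, not the repository's actual schema.

```python
from neo4j import GraphDatabase

# Hypothetical case query: interventional trials whose inclusion criteria mention
# a given condition. Labels, relationship types, and properties are assumptions.
QUERY = """
MATCH (t:Trial)-[:HAS_INCLUSION_CONDITION]->(c:Condition {name: $condition})
WHERE t.study_type = 'Interventional'
RETURN t.nct_id AS nct_id, t.title AS title
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY, condition="Diabetes mellitus"):
        print(record["nct_id"], record["title"])
driver.close()
```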
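A minimal sketch of the embedding and visualization pipeline, assuming the node2vec, networkx, and scikit-learn packages. The toy graph and node names are made up; the embedding parameters follow the paper (100 dimensions, walk length 10, 2,000 walks), while the PCA component count and t-SNE perplexity are scaled down here only so the toy example runs (the paper used 50 PCA components, perplexity 50, and 10,000 iterations).

```python
import networkx as nx
from node2vec import Node2Vec
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Toy stand-in for the COVID-19 Trial Graph (the real graph, exported from Neo4j,
# has 17,480 nodes and 65,236 relationships).
G = nx.Graph()
G.add_edges_from([
    ("NCT_A", "hydroxychloroquine"), ("NCT_A", "dyspnea"),
    ("NCT_B", "hydroxychloroquine"), ("NCT_B", "diabetes"),
    ("NCT_C", "convalescent plasma"), ("NCT_C", "dyspnea"),
])

# Embedding parameters as reported in the paper; `workers` is an implementation detail.
n2v = Node2Vec(G, dimensions=100, walk_length=10, num_walks=2000, workers=1)
model = n2v.fit(window=10, min_count=1)
vectors = [model.wv[str(node)] for node in G.nodes()]

# Dimensionality reduction for visualization: PCA first, then t-SNE to 2 dimensions.
# Component count and perplexity are reduced here to fit the tiny toy graph.
reduced = PCA(n_components=min(50, len(vectors))).fit_transform(vectors)
coords = TSNE(n_components=2, perplexity=2).fit_transform(reduced)
print(coords.shape)  # (number of nodes, 2)
```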
We implemented t-SNE using publicly available code.

To further test whether the learned clinical trial graph embedding conveys critical trial information, we conducted a pilot experiment to evaluate the efficacy of the trial graph embedding in predicting trial status (i.e., whether a trial can complete recruitment successfully). We excluded ongoing clinical trials (i.e., those with status "Not yet recruiting," "Recruiting," "Enrolling by invitation," or "Active, not recruiting") and trials whose status is unknown in ClinicalTrials.gov. Based on the recruitment status provided, we split the trials into two groups: (1) completed, comprising the recruitment status "Completed" in ClinicalTrials.gov; and (2) stopped, comprising the recruitment statuses "Terminated" (i.e., the study stopped early and will not start again), "Withdrawn" (i.e., the study stopped early, before enrolling its first participant), and "Suspended" (i.e., the study stopped early but may start again). We took the Clinical Trial node embedding from the whole trial graph embedding as the input for each clinical trial and evaluated several machine learning algorithms that predict the recruitment status directly from the embedding vectors, including logistic regression (LR), extra trees (ET), support vector machines (SVM), random forest (RF), and gradient boosting (GB). Ten-fold cross-validation was conducted, and the average accuracy was reported.

A total of 3,392 clinical trials targeting COVID-19, from 114 countries or regions, were collected from ClinicalTrials.gov. The COVID-19 Trial Graph currently contains 17,480 nodes of nine types and 65,236 relationships of 13 types. The statistics of the nodes and relationships can be seen in Tables 1 and 2, respectively. Condition is the most prevalent concept extracted from the eligibility criteria of COVID-19-related clinical trials, and Condition concepts appeared more often in exclusion criteria texts than in inclusion criteria texts.

We designed the following example queries to evaluate whether the COVID-19 Trial Graph can support diverse searches across structured information and eligibility criteria. The queries ranged from simple to complex, depending on the criteria involved. As the COVID-19 Trial Graph is hosted on a Neo4j database, we implemented these queries using the Cypher graph query language. We used a hybrid method of keyword search followed by manual review to identify relevant clinical trials as the ground truth for each case query (details are available in supplementary material S1), and then compared the manually reviewed results with the clinical trials retrieved by the queries. Precision and recall scores were calculated against these gold standards. Table 3 shows the precision and recall scores for each query; the case queries achieved high scores in both precision and recall.

The errors were mainly caused by false negatives of the NLP models (i.e., failures to identify and normalize specific clinical terms in the eligibility criteria texts). For example, there were five false negatives for case query 3. Three of them included phrases such as "difficult to breath" and "respiratory distress" in their eligibility criteria texts, but the existing NLP models were not able to map these phrases to "shortness of breath" (Dyspnea in OMOP).
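A minimal sketch of the recruitment-status evaluation protocol described above, assuming scikit-learn. The feature matrix here is random placeholder data with the paper's class sizes (355 completed vs. 78 stopped trials); in the study, each row would instead be a trial's 100-dimensional graph embedding vector. A majority-class dummy classifier is included to reproduce the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder features: random vectors with the paper's class sizes. In the study,
# X holds each ended trial's 100-dimensional Clinical Trial node embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(433, 100))
y = np.array([1] * 355 + [0] * 78)   # 1 = completed, 0 = stopped

classifiers = {
    "baseline (majority class)": DummyClassifier(strategy="most_frequent"),
    "LR": LogisticRegression(max_iter=1000),
    "ET": ExtraTreesClassifier(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(),
    "GB": GradientBoostingClassifier(),
}

# Ten-fold cross-validation, reporting average accuracy, as in the paper.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```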
In the t-SNE visualization of the trial graph embedding (Figure 2), we observed clustering of clinical trials by intervention target; for example, some clinical trials related to hydroxychloroquine and to plasma treatment formed clusters. Clustering can also be observed for study status (i.e., recruitment status) and study type (i.e., interventional versus observational).

[Insert Figure 2 about here]

In our experiments, we assigned 355 clinical trials to the "completed" group and 78 trials to the "stopped" group, based on their status reported at ClinicalTrials.gov. The accuracy of the machine learning algorithms in predicting recruitment status can be seen in Table 4. As this dataset is highly imbalanced, we defined the baseline accuracy as that of a classifier that assigns the "completed" label in every prediction, which is 0.820 (355/433). The embedding generated from the COVID-19 Trial Graph was effective in predicting the recruitment status for almost all of the algorithms. SVM with an RBF kernel achieved the highest accuracy, 0.870, with the COVID-19 Trial Graph embedding, a 5-point increase over the baseline accuracy. We also compared graph embedding vectors with randomly initialized vectors: the highest accuracy achieved by the machine learning algorithms on random vectors was 0.820 (+/- 0.018), the same as the baseline accuracy.

Information extraction from free-text eligibility criteria is considered a challenging task, as eligibility criteria descriptions are often arbitrary and ambiguous. [5] Although we leveraged state-of-the-art NLP tools to extract mentions of clinical concepts from eligibility criteria, errors could occur in both the entity recognition and normalization tasks. In addition, the present study ignores temporal and mathematical operators in eligibility criteria, such as "Mechanically ventilated" > "5 days." Although such information is important for defining the study cohort, the accurate recognition and inference of these logical operators is considerably more complicated and, thus, beyond the scope of this study. Clinical trials also contain other free-text information, such as the study description and arms, which was not represented in the COVID-19 Trial Graph. As an early attempt to represent clinical trial information in a graph, this work leaves some interesting research questions open. For example, we used node2vec to generate the graph embedding of the clinical trials, but it might not be the optimal graph embedding algorithm because it ignores relationship types; novel embedding algorithms, such as deep learning-based approaches, might be applied. [17] In addition, although we demonstrated the potential efficacy of predicting trial recruitment status through graph embedding, it is unclear from our existing experiments which features or data in the clinical trials have high predictive power. This would also be worth investigating in the future.

Formal representation is essential for managing the heterogeneity and diversity of data in clinical trials. This study reports our pilot effort to represent clinical trials, using COVID-19 as the use case. The COVID-19 Trial Graph is a linked graph that captures essential information in clinical trials related to COVID-19. Evaluations demonstrated its potential to expand conventional search abilities and to represent clinical trials through graph embedding, which could be helpful for downstream tasks such as predicting recruitment status or finding similar trials. Our methodology is also generalizable to other clinical trials, such as cancer clinical trials.
The data underlying this article and the case queries are available at https://github.com/UT-Taogroup/clinical_trial_graph

DISCLAIMER
The National Institutes of Health and the Cancer Prevention and Research Institute of Texas had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

REFERENCES
COVID-19 Map - Johns Hopkins Coronavirus Resource Center
COVID-19 Hub | Covid-19 Treatment Hub
COVID-19 - Modify Search - ClinicalTrials.gov
A real-time dashboard of clinical trials for COVID-19
Criteria2Query: A natural language interface to clinical databases for cohort definition
CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines
Neo4j Graph Platform - The Leader in Graph Databases
Cypher Query Language - Neo4j Graph Database Platform
Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases
Node2vec: Scalable feature learning for networks
Ten quick tips for effective dimensionality reduction
Distributed representation of genes based on coexpression