key: cord-0426091-exnddv0q authors: Lobbé, Quentin; Chavalarias, David; Delanoë, Alexandre; Ferrand, Gabriel; Cohen-Boulakia, Sarah; Ravaud, Philippe; Boutron, Isabelle title: Toward an observatory of the evolution of clinical trials through phylomemy reconstruction: the COVID-19 vaccines example date: 2022-05-27 journal: nan DOI: 10.1016/j.jclinepi.2022.05.004 sha: 296b2e3d20242e8dd098435cfa6793add3a538f2 doc_id: 426091 cord_uid: exnddv0q Objective: To visualize the evolution of all registered COVID-19 vaccine trials. Study Design and Setting: As part of the living mapping of the COVID-NMA initiative, we identify biweekly all COVID-19 vaccine trials and automatically extract data from the EU clinical trials registry, ClinicalTrials.gov, IRCT and the WHO International Clinical Trials Registry Platform. Data are curated and enriched by epidemiologists. We used the phylomemy reconstruction process to visualize the temporal evolution of COVID-19 vaccine trial descriptions. We analyzed the textual contents of 1,794 trial descriptions (last search in October 2021) and explored their collective structure along with their semantic dynamics. Results: The structures highlighted by the phylomemy reconstruction process synthesize the complexity of the knowledge produced by the research community. The reconstructed phylomemy clearly retrieves the five major COVID-19 vaccine platforms in the form of complete branches. The interactions between branches reflect the exploration of a new approach to vaccine implementation, moving from homologous prime vaccination to heterologous prime vaccination. Phylomemies also clearly identify shifts in research questions, from vaccine efficacy to booster efficacy. Conclusion: This new method provides important insights for the global coordination between research teams, especially in crisis situations such as the COVID-19 pandemic.
Over the past two years, the ongoing COVID-19 pandemic has impacted a wide range of human domains, from economy to education and from public health to politics. Among others, science swung into action early on to find both a cure and an effective vaccine. This has resulted in an unprecedented volume of publications and an information overload for the medical community. One of today's challenges is to synthesize the huge variety of research avenues explored about COVID-19 in order to improve coordination between the different research streams. Related work in this domain has focused on visualizing structured data or meta-data to follow the mutation of the virus [8], the worldwide evolution of the pandemic [5] or the discovery of new treatments [11]. However, two points should be noted. First, none of these works has sought to analyze the temporal evolution of knowledge on COVID-19 (including the emergence of vaccines and their usage) using visualization techniques. Such an analysis of past research may provide very important hints for understanding current and future research investigations. Second, related work has focused on structured (meta-)data and has rarely exploited the richness of the content of clinical trials. Yet the clinical trials available in the set of international primary and secondary trial registries [1, 11] (i.e., all trials registered in the International Clinical Trials Registry Platform (ICTRP), ClinicalTrials.gov and the EU clinical trials registry) contain a large amount of valuable information. In the present work, we have thus designed a solution to visualize the content of not yet annotated textual fields (such as full trial descriptions) to reveal how knowledge evolves in pandemic times. More precisely, we have used the phylomemy reconstruction process [3] to reconstruct the temporal evolution of the semantic landscape of timestamped corpora of textual documents.
We have applied our solution to the vaccine data set of the COVID-NMA database, which results from an international initiative and provides a unique collection of highly curated, reviewed and complete data on clinical trials based on mapping and reviewing trial registries. This work has been intrinsically interdisciplinary, involving expertise in epidemiology, complex systems, visualization and data science. The challenges to address are the following: How can we make the most of the crucial information stored in mutable and ever-growing databases of clinical trials? How can we create visualizations that render the evolution of the database content and help epidemiologists interpret them? Our paper aims at applying a new text mining method, phylomemy reconstruction, to reconstruct the temporal evolution of COVID-19 vaccine research. To that end, we chose to analyze a set of 1,794 clinical trial descriptions extracted from the COVID-NMA database. Our data set has been collected and curated by the combined effort of epidemiologists, data integration researchers and complex systems researchers. The COVID-NMA project is an international initiative aimed at providing a living mapping and a living systematic review of all trials assessing treatments and preventive interventions for COVID-19 [1, 11]. The development of the COVID-NMA database relies on a full methodology designed to generate and make available a complete, comprehensive, integrated, non-redundant and carefully annotated data set on clinical trials. We automatically extract data from clinical registries on a weekly basis and provide assistance to epidemiologists in the curation and annotation process. Raw data are extracted from the EU clinical trials register, from ClinicalTrials.gov, managed by the U.S.
National Library of Medicine, from the Iranian Registry of Clinical Trials (IRCT) and from the WHO International Clinical Trials Registry Platform (ICTRP), an international registry that assembles information on clinical trials registered in 17 primary registries, in order to identify new trials assessing COVID-19 vaccines and updates of previously registered trial records. Data are extracted from registries, annotated and enriched by epidemiologists, then stored and made available through the COVID-NMA database. In order to be used as input data, the COVID-NMA database needs to be pre-processed. But the unprecedented volume of trials related to COVID-19 has challenged our capacity to build a time-consistent and insightful visualization. We thus faced two main issues: dealing with the mutable nature of trial registries and collaboratively curating a core vocabulary from the trial descriptions. We first note that international trial registries can be updated after the fact by research teams. Contrary to regular scientific publications, their textual contents are mutable by nature: the description of a given trial can be updated weeks after having been recorded, for example to detail an experimental protocol or simply to link some results to the trial. Usually, researchers can deal with such updates manually, but in our case the tight temporality of the COVID-19 pandemic forced us to reconstruct our visualization every month on top of hundreds of changeable trials. We have thus developed a time-consistent strategy: once a trial has been recorded, we do not update its textual description. As subsequent trials referred to the first registered version of each vaccine description, we chose to keep those first versions as references. By doing so, we do not break the temporal continuity of the phylomemy reconstruction process, as we preserve the natural evolution of the descriptions.
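The time-consistent strategy can be illustrated with a minimal sketch. It assumes each registry extraction arrives as a list of dictionaries; the `TrialRecord` class, the `merge_snapshot` function and the field names are hypothetical and are not taken from the COVID-NMA pre-processing script. The description is frozen at first registration, while the meta-data (which the authors keep current) are overwritten by the latest values.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TrialRecord:
    """One trial as stored for the reconstruction (hypothetical layout)."""
    trial_id: str
    registered_on: date
    description: str                               # frozen at first registration
    metadata: dict = field(default_factory=dict)   # always the latest values


def merge_snapshot(store: dict, snapshot: list) -> dict:
    """Merge a new registry extraction into `store`, keyed by trial id.

    A trial seen for the first time is stored as is. For a known trial,
    only the meta-data are refreshed; the first description is kept.
    """
    for row in snapshot:
        known = store.get(row["trial_id"])
        if known is None:
            store[row["trial_id"]] = TrialRecord(
                trial_id=row["trial_id"],
                registered_on=row["registered_on"],
                description=row["description"],
                metadata=dict(row.get("metadata", {})),
            )
        else:
            # Ignore any edited description, refresh phases, funding, etc.
            known.metadata.update(row.get("metadata", {}))
    return store
```

Calling `merge_snapshot` once per extraction keeps the corpus stable across monthly reconstructions, since a later edit of a description never changes the period in which its terms first appear.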
However, we have kept the meta-data (i.e., trial phases, funding, associated publications, etc.) up to date with their most recent version. The phylomemy reconstruction process requires the creation of a core vocabulary to visualize the evolution of the trial descriptions. To that end, we first filtered from the COVID-NMA database a set of 1,794 records exclusively related to vaccination. Within each selected trial, we merged the sections 'pharmacological treatment', 'treatment type' and 'treatment name' to create a normalized description. The resulting corpus was later curated collectively and collaboratively by epidemiologists using the free software Gargantext [6]. This software makes it possible to follow a human-driven approach in which epidemiologists validate and annotate each term they want to extract from the descriptions. We thus created a core vocabulary as a list of 175 expressions, called root terms, each of which can have several variants.
The phylomemy reconstruction process [2, 3] combines advanced text-mining methods, scientometrics and methods for the reconstruction of evolving complex networks in order to reconstruct the latent semantic structures of a timestamped corpus of textual documents. In our case study, the corpus is a set of 1,794 trial descriptions. The roots are all the technical and equivalent names (including characteristic variations and any misspellings) given to the same vaccine. For instance, the technical expressions "rad5" and "rad26" were aggregated into "gam-covid-vac". The corpus is then sliced into periods of interest $T^{*} = \{T_i\}_{1 \le i \le K}$, $T_i \subset T$, for which the co-occurrences of roots are computed. In our case study, we consider two-week periods starting every Monday from February 2020 to October 2021, and the output is a series of matrices of root co-occurrences. 2. Similarity measure.
Within each period of time and on the basis of its co-occurrence matrix, we estimate the semantic similarity between roots using the confidence measure [7]. The completion of this task results in a temporal series of similarity graphs (Figure 1.2). 3. Fields detection. For each period, a community detection algorithm, the frequent item set method [12], is applied to detect subsets of densely connected roots within the similarity graphs. These subsets are called fields (Figure 1.3) and their aggregated root expressions describe consistent research topics that were explored at a given period. In our case study, the fields correspond to one or more descriptions of clinical trials sharing the same vaccine strategy. The output of this field detection step is a temporal series of clusterings.
The phylomemy reconstruction process makes it possible to draw the knowledge lineages at different resolutions through the tuning of a level of observation [3]. The complexity of the resulting semantic landscape can range from a wide 'continent' to an 'archipelago' of specialized branches of knowledge. The structures highlighted by a phylomemy reconstruction process synthesize the complexity of the knowledge produced by a research community. In order to make this newly reconstructed knowledge actionable and explorable, a phylomemy can be visualized as a temporal network [10]. Fields are represented by full circles, and solid dark lines represent their kinship connections. Emerging terms (i.e., terms appearing for the first time in the phylomemy) are displayed over the whole structure according to the combined coordinates of their period and field of appearance. The size of a term depends on its frequency in the original corpus of trials. Branches are sorted from left to right so that closely related ones lie side by side.
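The period slicing, root co-occurrence and similarity steps described above can be sketched as follows. This is an illustration under stated assumptions, not the GarganText implementation: the function names are hypothetical, the periods are taken as non-overlapping 14-day windows counted from a Monday origin, variants are matched by plain substring search, and the confidence measure is symmetrised as max(P(j|i), P(i|j)), which is one common definition and may differ from the one used in [7]. Only the "rad5"/"rad26" aggregation comes from the text; the other vocabulary entries are illustrative.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta
from itertools import combinations

# Variant -> root mapping. Only the gam-covid-vac pair is from the paper;
# the two other entries are illustrative placeholders.
VARIANT_TO_ROOT = {
    "rad5": "gam-covid-vac",
    "rad26": "gam-covid-vac",
    "bnt162b2": "bnt162b2",
    "chadox1": "chadox1",
}


def roots_in(description, vocabulary):
    """Return the set of roots whose variants occur in a description."""
    text = description.lower()
    return {root for variant, root in vocabulary.items() if variant in text}


def period_start(day, origin, days=14):
    """Start date of the fixed-length period containing `day`."""
    offset = (day - origin).days // days
    return origin + timedelta(days=offset * days)


def cooccurrences_by_period(trials, vocabulary, origin):
    """Count root occurrences and pairwise co-occurrences per period.

    `trials` is an iterable of (registration date, description) pairs.
    """
    occ = defaultdict(Counter)
    cooc = defaultdict(Counter)
    for registered_on, description in trials:
        period = period_start(registered_on, origin)
        roots = sorted(roots_in(description, vocabulary))
        occ[period].update(roots)
        cooc[period].update(combinations(roots, 2))  # pairs in sorted order
    return occ, cooc


def confidence(pair_count, occ_i, occ_j):
    """Symmetrised confidence between two roots (assumed definition)."""
    if pair_count == 0:
        return 0.0
    return max(pair_count / occ_i, pair_count / occ_j)
```

For each period, evaluating `confidence` on every pair of roots that co-occur yields the weighted edges of that period's similarity graph, on which a field detection step could then be run.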
Interactive features can be used to reveal the entire content of the fields, follow the dissemination of a given term throughout the phylomemy or simplify the scale of description of a selected branch.
After having explored and analyzed Figure 2 alongside epidemiologists, we noticed that the reconstructed phylomemy clearly retrieves five major COVID-19 vaccine platforms in the form of complete branches. These platforms include the classical vaccine platforms, i.e., 'non-replicating viral vector', 'inactivated virus' and 'protein subunit', as well as the next-generation vaccine platforms, i.e., 'dna based vaccines' and 'rna based vaccines'. The visualization shows the continuous development of each branch and the way some of them started to interact and eventually blended while others stopped. Interestingly, trials of 'rna based vaccines' were registered very early in the course of the pandemic (February 2020), with trials evaluating the vaccine developed by Moderna TX (mRNA-1273) followed by the vaccine developed by Pfizer/BioNTech (BNT162b2) and sibling ones like BNT162b1 or BNT162b2sa that were not tested for much longer (see Figure 2.a). The number of trials increased rapidly and interactions with other widely explored techniques were observed shortly afterwards, notably with the 'non-replicating viral vector' family (ChAdOx1, AstraZeneca; see Figure 2.b). The latest interaction involved the 'protein subunit' branch in July 2021. In contrast, 'dna based vaccines', with a first trial registered in April 2020, had a very limited number of trials planned and the whole branch stopped rapidly in 2020. Similarly, other platforms such as 'replicating viral vector vaccine', 'virus-like particle vaccine' and 'live attenuated virus vaccine' showed very limited development. As the development and approval of COVID-19 vaccines was expected to take time, researchers also explored repurposing non-COVID vaccines.
Considering the lower severity of the disease in children and young adults, some researchers hypothesized a possible heterologous protective effect of these vaccines. Some evidence shows that live-attenuated vaccines such as Bacille Calmette-Guerin (BCG) and Measles, Mumps, Rubella (MMR) can induce protective innate immunity, which could be central in controlling SARS-CoV-2 [4]. While this hypothesis was appealing, it did not seem to expand into a wider research domain. The branch of 'non-COVID vaccines' appears and expands at the beginning of the pandemic but progressively declines towards the end of 2020 as other, more promising vaccines arose. Nevertheless, some researchers highlighted the need to adequately assess the use of non-COVID live-attenuated vaccines, as they could potentially boost the response in high-risk populations, be used in addition to COVID vaccines to increase the effectiveness and durability of their effect, or be used to protect people exposed to COVID-19 patients [4].
Figure 4: Phylomemy of the randomized-only COVID-19 vaccine trials. In blue, we highlight all the trials with an associated publication (i.e., preprint or peer-reviewed article).
The interactions between branches reflect the exploration of a new approach to vaccine implementation, moving from homologous prime vaccination (i.e., injection of two doses of the same vaccine) to heterologous prime vaccination (i.e., injection of a first dose of a given vaccine and a second dose of another vaccine). This is clearly shown in Fig
Phylomemies are essential in identifying shifts in research questions. While evidence of the beneficial effect of vaccines is mounting, research questions are moving toward exploring the effect of boosters to overcome the waning of vaccine efficacy over time.
Early in 2021, new trials assessing the impact of administering a third dose (see Figure 2, red outline at the bottom) were registered, particularly for 'rna based vaccines' and 'non-replicating viral vector' [9]. An important part of the research on the effects of boosters considers heterologous boosters.
By using meta-data from the trial registries, we can filter the current phylomemy and thus push faceted observations to the fore or identify the current main research questions. Phylomemies also provide important information on research planning and reporting. As shown in Figure 2, most registered trials are randomized controlled trials. Early in the pandemic, non-randomized trials were primarily early phase trials, while those registered in 2021 include both early phase trials exploring new vaccines and phase 4 trials assessing vaccine safety. We can explore the visualization to better understand how different countries participated in the overall research effort over time. For example, when filtering on the country (see maps.gargantext.org/phylo/vaccines/countries), we see that trials conducted in the USA explored all vaccine platforms and that the first registered trials frequently involved a center in the USA, confirming the leading role of the USA in clinical research (e.g., 'dna based vaccine', 'rna based vaccine', 'protein subunit'). Other important trial characteristics such as funding sources can also be highlighted (see maps.gargantext.org/phylo/vaccines/fundings). We also address the question of the publication of trial results (i.e., preprint or peer-reviewed articles). As shown in Figure 4, we currently have access to the results of a very limited number of planned trials. While most of the COVID vaccine trials registered in early 2020 are published, most of the non-COVID vaccine trials are still unpublished.
This visualization can thus orient scientists toward new questions regarding the organisation of the trials: understanding whether these trials were actually conducted but remain unpublished, or were unable to recruit, is an important issue. Phylomemies also offer the flexibility to explore specific questions that may arise over time. Using meta-data and filters is a way to address specific questions through the phylomemy by highlighting and quickly finding subsets of trials within the registries. For example, there is currently interest in, and concern about, a decreased efficacy of vaccines in immunocompromised populations. It is possible to filter on this specific population and show that it has been considered in clinical trials only recently (for the first time in 2021) and mainly for RNA-based vaccines. We can also make use of the 'type of patient' field to push trials related to specific populations ('newborn', 'children', 'adolescent', 'pregnant women', etc.) to the fore of the visualization. A dedicated phylomemy can be explored at maps.gargantext.org/phylo/vaccines/publics.
Using the example of COVID-19 vaccines, our results show that phylomemies can improve our understanding and knowledge of research planning at a global level. At this stage, two main questions arise from our work: Is our method generalizable? Can we infer or predict future trends from the visualizations? The COVID-19 pandemic has been a clear illustration of the huge amount of avoidable research waste related to the lack of global coordination of clinical research [11]. To adequately plan clinical research and avoid research waste, we need new approaches. Researchers need tools to understand the current research landscape and prioritize research questions. However, the available data are massive and one cannot easily synthesize the large amount of data generated (research planned and results produced).
We need to develop infrastructures such as a global observatory of clinical research based on high-quality data, as well as tools to help stakeholders explore these data and improve our understanding of the ecosystem. Our case study is entirely focused on COVID-19 vaccine trials. But we argue that our approach is now generalizable to various contexts without extra effort. In the following, we develop this point by incrementally generalizing future domains of application as described hereafter:
• Integrating other COVID-NMA meta-data: in consultation with epidemiologists, we have enriched our visualizations with the possibility to filter on a selected set of meta-data such as the participants' characteristics (age, pregnancy, etc.) (see 4.5). This list can be extended to all the structured fields of the COVID-NMA database by simply changing the pre-processing script (see 2.2). Furthermore, as phylomemies are designed to reveal structure in unstructured data, the COVID-NMA database could even be enriched with new fields such as sub-categories of vaccine trials: 'heterologous', 'boosters', etc. In due time, our approach could influence the way scientists share trial information in registries by standardizing new meta-data and unstructured textual content.
• Working on COVID-19 treatments instead of vaccines: choosing between visualizing COVID-19 vaccines or treatments is also a matter of selecting the right field in the COVID-NMA database. Yet, we will still have to create a new core vocabulary (see 2.2). But thanks to GarganText (the free text mining software used to annotate the vaccine descriptions upstream of the phylomemies), it will only take a few days to collaboratively achieve this task and annotate hundreds of trial descriptions. A preliminary study can be found in [3].
• Visualizing trials related to another disease: since the beginning of this study, the process designed to fill and integrate the COVID-NMA database has evolved, making it easy to construct new databases gathering data on other kinds of diseases. Indeed, the process to extract raw data from registries has a large generic step in which we extract the description of each trial, including information on its design, the inclusion/exclusion criteria for patients, the description of arms, the set of outcomes, etc. As an example, we have reconstructed the phylomemy of 1,798 trials related to Alzheimer's disease and extracted from the WHO international registries. The resulting visualization can be explored at maps.gargantext.org/phylo/alzheimer/.
• Analyzing publication data: phylomemies were first designed to visualize the content of scientific publications. Thanks to the recent integration of phylomemies into the free software GarganText, one can already reconstruct the temporal structure of various corpora of thousands of papers extracted from PubMed or from the Web of Science. In the context of a pandemic like COVID-19, it would be relevant to explore in parallel both the phylomemy on trials and the phylomemy on papers or preprints to get a comprehensive view of the scientific landscape. A first exploration of COVID-19 literature can be found at maps.gargantext.org/maps/covid-19 in the form of a semantic graph.
• Monitoring various types of clinical trials on the fly: if placed end to end, the continuous integration of WHO international registries within the COVID-NMA enrichment pipeline and through the phylomemy reconstruction process would enable the dynamic analysis of any kind of trial content as it arises.
Such an analytical workflow will require the creation of two teams: an integration team able to deal with the possible evolution of the registries and an annotation team dedicated to the creation and update of core vocabularies. But it would allow better coordination of scientific teams around the world in all medical fields and accelerate medical discoveries.
Phylomemies do not aim at predicting future trends, although the analysis of their dynamics could give some hints. They help us understand the present states of dynamical processes in light of their former evolution. Inferring upcoming developments can be pursued by combining the knowledge of field experts (like epidemiologists) and the comprehension of the hidden dynamics revealed by the phylomemies. In the case of this paper, the nature of the input data already places us ahead of regular scientific time, since we visualize registry data instead of publication data. Indeed, except in pandemic times, there is usually a gap of five years between the first trial and the publication of a new treatment. Phylomemies associated with trial registries might consequently be scientific tools for understanding possible future treatments.
Global coordination between research teams is key to accelerating innovation in science, especially during crisis situations such as the COVID-19 pandemic. Reducing redundancies and providing heuristics to find new search paths as they arise can save time and lives [11]. We claim that phylomemy reconstruction could be instrumental in guiding trialists, funders and decision makers in biomedical research. In times of crisis, it would enable them to better adapt to the evolution of the situation by following the main research questions and identifying less promising domains. It could also facilitate the identification of research gaps, of research questions that may have been abandoned prematurely and of redundancy in research.
In a world where experts are increasingly specialized, our approach could draw attention to alternative solutions developed in other branches of science or to problems already encountered in research directions yet to be explored. It could also lead to new conceptual operations to be performed on a knowledge database, such as "give me all the branches of knowledge that are merging" or "suggest a promising combination of compounds to test". This could both accelerate research by making the latent structure of innovation tangible, and promote collaborations between teams that would not otherwise be interested in each other's work. Phylomemy reconstructions may thus become collective and reflective tools to foster worldwide coordination between researchers. This revolution in clinical trial processing is within reach. Nevertheless, it would imply having access to high-quality data on research planning and protocols. Our case study focuses on a single disease, but this approach is fully generic and we call for a worldwide observatory for monitoring the dynamics of clinical trials. As it scales up, our approach could be implemented for any disease or research field.
Phylomemetic patterns in science evolution - the rise and fall of scientific fields
Draw me science - multi-level and multi-scale reconstruction of knowledge dynamics with phylomemies
Old vaccines for new infections: Exploiting innate immunity to control COVID-19 and prevent future pandemics
Early indicators of intensive care unit bed requirement during the COVID-19 epidemic: A retrospective study in Ile-de-France region, France
Mining the digital society - Gargantext, a macroscope for collaborative analysis and exploration of textual corpora
Mapping general-specific noun relationships to WordNet hypernym/hyponym relations
Phylogenetic network analysis of SARS-CoV-2 genomes
Considerations in boosting COVID-19 vaccine immune responses
Exploring, browsing and interacting with multi-level and multi-scale dynamics of knowledge
LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets
The original COVID-NMA database can be downloaded at covid-nma.com. The pre-processing script can be downloaded at https://doi.org/10.7910/DVN/JTRI7A. The full list of root terms is available at https://doi.org/10.7910/DVN/JTRI7A. The reconstructed phylomemy is available for live exploration at maps.gargantext.org/phylo/vaccines/publications and downloadable at https://doi.org/10.7910/DVN/JTRI7A.