key: cord-1014902-dx9zdkc5 authors: Balkányi, László; Lukács, Lajos; Cornet, Ronald title: Investigating the Scientific 'Infodemic' Phenomenon Related to the COVID-19 Pandemic - a Position Paper from the IMIA Working Group on "Language and Meaning in BioMedicine" date: 2021-04-21 journal: Yearb Med Inform DOI: 10.1055/s-0041-1726483 sha: b5206538f7e23bd2a40a4640a08f7be35fd5b0f5 doc_id: 1014902 cord_uid: dx9zdkc5

Objectives: The study aims at understanding the structural characteristics and content features of the COVID-19 literature and public health data from the perspective of the 'Language and Meaning in Biomedicine' Working Group (LaMB WG) of IMIA. The LaMB WG is interested in the conceptual characteristics, transparency, comparability, and reusability of medical information, both in science and in practice.

Methods: A set of methods was used: (i) investigating the overall speed and dynamics of COVID-19 publications; (ii) characterizing the concepts of COVID-19 (text mining, visualizing a semantic map of related concepts); and (iii) assessing the (re)usability and combinability of data sets and paper collections (as textual data sets), checking whether information is Findable, Accessible, Interoperable, and Reusable (FAIR). A further method tested the practical usability of the FAIR requirements by setting up a common data space of epidemiological, virus genetics, and governmental public health measures' stringency data of various origins, in which complex data points were visualized as scatter plots.

Results: Never before were so many papers and data sources dedicated to one pandemic. Worldwide research shows a plateau at ~2,200 papers per week, with slightly different dynamics across the areas of study. The share of epidemic-modelling papers is rather low (~1%). A few 'language and meaning' methods, such as using integrated terminologies, applying data and metadata standards for processing epidemiological and case-related clinical information, and, in general, the principles of FAIR data handling, could contribute to better results, such as improved interoperability and meaningful knowledge sharing in a virtuous cycle of continuous improvements.

Defining the 'What to study': The last decade has seen epidemics of worldwide significance (2009: H1N1; 2012: MERS; 2015: Zika), but they neither had a comparable global impact on the world economy and everyday life nor generated a comparable amount of related research (see Table 3 in Results for details). The COVID-19 pandemic generated a previously unseen intensity of scientific research and, as a result, an unseen number of publications (see Figure 1 in Results) and of related public health data (openly available, mostly epidemiological). We consider both the textual publications and the COVID-19-related data published by relevant public health authorities as raw material for further study. We call this phenomenon a 'scientific infodemic', narrowing the Merriam-Webster general definition of infodemic [1]: '… a blend of "information" and "epidemic" that typically refers to a rapid and far-reaching spread of both accurate and inaccurate information about something, such as a disease…' to the more restricted world of scientific publications and available public health data.
We think that using the term 'scientific infodemic', rather than simply calling the papers of the observed period the 'scientific literature', emphasizes uncertainty, caused in part by the unusual number and ratio of publications that were available only as preprints, lacking peer review, yet made available under the time pressure of the need to control the COVID-19 pandemic. Not surprisingly, according to the Retraction Watch Database, over 60 papers were retracted in 2020 alone.

Explaining the 'Who': The authors of this paper partly constitute the leadership of the 'Language and Meaning in Biomedicine' Working Group of IMIA (LaMB WG), which has an interest in medical concept representation (see the detailed history of the WG in [2]). We were therefore motivated to investigate the conceptual characteristics, transparency, comparability, and reusability of the COVID-19 research material. Our WG background helped us realize that, to gain understanding, both quantitative and qualitative aspects of the COVID-19 scientific infodemic should be investigated.

The background for 'Why' and 'How' to study COVID-19: Merely reading and observing the research literature reveals peculiarities, such as divergent or even contradictory data on epidemiology, clinical features, and responses to the various therapeutic interventions for COVID-19. Inconsistencies and contradictory bits and pieces of information have even led to the retraction of some papers (e.g., [3]). Some of these peculiar features (e.g., speed and sheer volume) are judged positively by the research community, others negatively. However, observation and reading lead only to anecdotal evidence. A systematic, analytic study of conceptual structures and of quantitative content characteristics might help explain why the peculiarities occur, and might also support the fight against the COVID-19 pandemic. Examples of quantitative studies in this paper are measuring the speed and amount of accumulating scientific information; an example of an analytic study is investigating the (appreciated) openness of data related to studies of a new pathogen and the disease it causes (see the Methods section for full details).

Let us note that countries and various international agencies share a remarkable amount of almost real-time, fine-grained epidemic and pathogen-related data. Countries and supranational organizations had, and still have, a chance to design and execute efficient public health, economic, and social responses. However, COVID-19 pandemic research and responses have realized this potential only partially. As mentioned by Flack and Mitchell [4], "the response has been ambivalent, uneven and chaotic - we are fumbling in low light, but it's the low light of dawn". Science and the applied public health response are only part of the global activities related to COVID-19, and they are not the only sources for decision making, either for the general public or for policy makers.

The objective of this study is to gain a better understanding of the structure and content of the COVID-19 literature (both its textual and its data resources). We also strive for a better understanding and clarification of the (possible) role of 'language and meaning' paradigms, such as homogeneous semantics, conceptual interoperability, standardized data, the role of ontologies, description logics and hybrid architectures, and the possible role of knowledge representation.
We think that all these objectives can be achieved by looking for measurable qualitative and quantitative characteristics of the COVID-19 research literature and the related open data resources. We also aim to check whether the above-mentioned impressions of unusual speed, amount, and conceptual detail of the literature are correct. Once the peculiarities are characterized by these measurements, the secondary objective of this position paper is to check whether (and to what extent) 'language and meaning' paradigms, tools, and methods could help establish better transparency, adding reliability and credibility to COVID-19 research results for the scientific community and beyond.

We used a well-defined, pre-processed literature curation: numbers were retrieved from the 'COVID19 Article Collection' of LitCovid/PubMed. In LitCovid, the dynamics were checked and analyzed for the following subdomains: COVID-19 in general, its (patho-)mechanism, disease transmission, diagnosis, treatment, prevention, specific case reports, and forecasting/modelling. Publication numbers were aggregated by week. Our added value here was the evaluation of tendencies and the comparison of the various areas. In addition, we compared overall COVID-19 numbers to the numbers for similar events of the past decade using simple, date-limited searches.

We established and studied a corpus using a text mining tool (Voyant [6]), based on the above-mentioned LitCovid article collection. The titles of the listed ~45,000 papers were compiled, allowing us to see the semantics applied in the research domain. The Voyant tool was used to visualize a semantic map of related concepts, using quasi-clusters of article title word frequencies (the words serving as labels for the concepts). Of the variety of analytic tools Voyant offers, we used T-distributed Stochastic Neighbour Embedding (t-SNE) scatter plotting. t-SNE is a non-linear dimensionality-reduction algorithm; visualizing its results shows how the terms are arranged in a high-dimensional conceptual space.

For the qualitative study, we collected and studied several existing independent data sets and paper collections (understood as textual data sets). As Bakken remarks in her paper [7]: 'Regardless of the type of biomedical and health informatics research conducted (e.g., computational, randomized controlled trials, qualitative, mixed methods), transparency, reproducibility, and replicability are crucial to scientific rigor, open science, and advancing the knowledge base of our field and its application across practice domains'. In agreement with that view, we focused on methods dealing with such aspects of information quality as transparency, reproducibility, and replicability. As a first step, we assessed the FAIR-ness [8] of these information sets, checking whether they are findable, accessible, interoperable, and reusable. FAIR compliance can be checked with a well-defined, easy-to-apply checklist covering all criteria. The FAIR principles are embraced by many science communities as well as by the European Commission. Checking the FAIR requirements allowed us to investigate how these information sources could be (re)used and combined. The details of the FAIR-ness checking are explained in Results, in Table 5.
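As a side note on the literature-dynamics method above: the simple, date-limited PubMed searches can also be reproduced programmatically. The sketch below is only an illustration, using Biopython's Entrez interface to the PubMed E-utilities; the e-mail address, terms, and date windows are placeholders, and the actual search strings used for this study are those reported in Table 3.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact address

def count_hits(term: str, start: str, end: str) -> int:
    """Return the PubMed hit count for `term`, limited by publication date."""
    query = f'({term}) AND ("{start}"[PDAT] : "{end}"[PDAT])'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

# Placeholder queries in the spirit of Table 3, not the paper's exact strings.
print(count_hits("COVID-19", "2020/01/01", "2020/08/25"))
print(count_hits("zika virus", "2015/01/01", "2016/12/31"))
```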
As a second step, the usability of the FAIR requirements was challenged by a real test case: we performed a combined analysis, setting up a common data space [9] using independent data sources, as detailed below.

COVID-19-related information (both textual and data collections) can be grouped into various types: information related to the spread of the disease at the population level (field epidemiology and modelling); information describing the pathogen (e.g., virus RNA sequences); information related to clinical manifestation (case numbers, comorbidity epidemiology, diagnosis, and therapy of the disease); and information characterizing public health and policy actions (e.g., a complex stringency measure of the various government measures against the spread of COVID-19). Table 1 lists the information sources we studied, by type:

- Epidemiological data sources: five independent data sources, each with worldwide coverage, collecting data from multiple sources and partially cross-checking each other for the COVID-19 pandemic. Two of them are provided by the relevant international organizations (WHO, ECDC), one by an internationally acknowledged academic source, and the last two are independent data sources.
- Virus genomics data sources: these collect RNA sequences independently from each other, obviously with overlapping sequences from all around the world.
- Public health and policy data sources: these present data on the responses of societies around the globe (e.g., stringency, economics, etc.).

We checked the FAIR requirements using the EUDAT FAIR Data Checklist [26]. Each requirement was checked against a set of relevant EUDAT checklist conditions (detailed in Results). In order to evaluate these checklist conditions, all sources were accessed, opened, and read, and the data sources were downloaded as well. If all the specific conditions of a given area (e.g., 'interoperability') are met, the requirement is fulfilled. If none of them are met, the requirement is obviously not fulfilled. If some of the conditions are met, the requirement is 'partially' fulfilled.

• Testing FAIR-ness in practice: compiling epidemiological, virus genomics, and stringency (of government measures against the COVID-19 pandemic) data

As a test of practical interoperability, reusability, and semantic consistency, the authors of this paper also compiled data from the above-mentioned data sources. We set up a common data space [9] in which epidemiological, virus genetics, and governmental measures' stringency data are combined to characterize the COVID-19 pandemic. We used the 'Oxford Covid-19 Government Response' stringency data, a complex index calculated from 19 indicators organized into four groups [20]: C - containment and closure policies; E - economic policies; H - health system policies; and M - miscellaneous policies. The source data were compiled into data points, each expressing numerical values for a given week in a given country: (i) a mean genetic divergence of the virus RNA samples, as a kind of index of which strains were active then and there; (ii) the cumulative number of deaths per million inhabitants over the following two to five weeks (a measure of the severity of the outbreak); and (iii) the value of the complex stringency measure, indicating the strength of the government response measures.
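A minimal sketch of this compilation step, assuming hypothetical CSV exports keyed by (country, week); the file and column names are placeholders rather than the sources' actual schemas, and the forward-looking severity window (weeks t+2 to t+5) is computed with a shifted rolling sum.

```python
import pandas as pd

# Assumed exports; the real sources need cleaning and label reconciliation first.
epi = pd.read_csv("epi.csv")          # country, week, new_deaths_per_million
gen = pd.read_csv("divergence.csv")   # country, week, mean_divergence
gov = pd.read_csv("stringency.csv")   # country, week, stringency_index

epi = epi.sort_values(["country", "week"])
# Sum of weeks t+2 .. t+5: shift the series 5 steps forward, so a 4-wide
# trailing window at position t covers the original weeks t+2 .. t+5.
epi["severity"] = epi.groupby("country")["new_deaths_per_million"].transform(
    lambda s: s.shift(-5).rolling(4, min_periods=1).sum()
)

# The merge keys are exactly where the interoperability problems reported in
# Results appear: country labels differ across the sources.
points = epi.merge(gen, on=["country", "week"]).merge(gov, on=["country", "week"])
```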
For the second objective, the authors draw on earlier publications of IMIA Language and Meaning in Biomedicine Working Group members on how these 'language and meaning' paradigms could and should be applied in the case of COVID-19. The main paradigms are: (a) the role of homogeneous semantics and inherent interoperability (terminologies) used in publications; (b) the need for standardized data in field and clinical epidemiology, enabling large-scale predictive data analysis, both in related papers and in databases; (c) the role of ontologies, description logics, and hybrid architectures; and (d) the role of knowledge representation, especially in studies related to artificial intelligence. We scrutinized earlier publications by LaMB WG authors for examples and answers to questions such as:

• which of these paradigms are relevant for improving research quality and mitigating inconsistencies in COVID-19-related data collection and interpretation;
• what 'language and meaning' methods and tools can do to connect different fields, e.g., genetics and pathophysiology data of SARS-CoV-2 and COVID-19;
• whether 'language and meaning' methods can improve the clinical response to COVID-19 (diagnostic and therapeutic issues);
• how 'language and meaning' tools in the broad sense can help overcome the problems of transparency and comprehensibility caused by the sheer amount of research information related to COVID-19.

• Amount and dynamics of COVID-19 literature

The results of the snapshot taken on 25 August 2020, covering all papers published since the beginning of the year as listed in PubMed, are reported in Figure 1. At that date, the number of papers on COVID-19 in PubMed was 45,499. The dynamics (the first papers appearing already in January 2020) show rapid growth from March 2020 to mid-May 2020. The ratios of the various areas, shown in Table 2, are also instructive. These results are published (and regularly updated) by the LitCovid curated paper collection; the LitCovid curation process handles subdomain classification as well. We added red tendency arrows to show how the data are changing, together with the analysis below. These are not calculated trendlines based on the data, just visual aids to capture the direction of change.

The dynamics of the various areas are shown in Figure 2. Note that the axes were scaled differently to make the charts visually comparable. The red linear tendency arrows show the differing speeds and dynamics. The arrows help one realize that in almost all areas there is a kind of "saturation" effect, but research reaches its maximum "throughput" (i.e., number of publications) at different levels, most probably mirroring the available scientific capacities. It is also important to note the difference between the sheer number of COVID-19 publications and the numbers of publications produced for similar events of the past decade (for the other outbreaks, we covered two years to collect all publications related to the given outbreak). We applied simple, date-limited PubMed searches; see the results and the search strings in Table 3.

• Results for characterizing the most important and emerging concepts describing the COVID-19 pandemic

A T-distributed Stochastic Neighbour Embedding (t-SNE) scatter plot of article title word frequencies, generated with the corresponding Voyant tool, resulted in a concept map, shown in Figure 3. The coloring of the dots shows the (quasi-)clustering, while dot sizes represent the rates of word occurrences.
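A concept map like Figure 3 can be approximated outside Voyant. The following is a minimal sketch, assuming a plain-text export of the LitCovid titles (one per line) and substituting scikit-learn's t-SNE and K-means for Voyant's internal quasi-clustering; the file name and parameter values are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# Assumed input: one LitCovid paper title per line (hypothetical export).
titles = open("litcovid_titles.txt", encoding="utf-8").read().splitlines()

# Term-document matrix restricted to the most frequent title words.
vec = CountVectorizer(stop_words="english", max_features=150)
X = vec.fit_transform(titles)      # shape: (n_titles, n_terms)
term_vectors = X.T.toarray()       # one title-occurrence profile per term
freqs = term_vectors.sum(axis=1)

# Project the terms to 2-D; clusters approximate the colored areas of Figure 3.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(term_vectors)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

plt.scatter(coords[:, 0], coords[:, 1], s=300 * freqs / freqs.max(),
            c=clusters, alpha=0.6)
for (x, y), term in zip(coords, vec.get_feature_names_out()):
    plt.annotate(term, (x, y), fontsize=7)
plt.show()
```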
Concepts in the blue area (such as patient, treatment, respiratory) indicate papers focusing mostly on clinical aspects; the purple-area concepts (such as emerging, covid, health, epidemic) depict mostly epidemiological aspects; while the green area reaches out to the response and other broad aspects of the state of society during COVID-19 (concepts: implications, lessons, perspective, etc.).

Table 3: Numbers of outbreak-related publications (worldwide) of the last decade in PubMed

• Metadata issues as a crucial element of research transparency

It is important to note that (practical) data transparency and reusability should be judged on at least two levels: data syntax and data semantics. (Pragmatics, the third level, is obviously not a transparency or reusability issue.) Regarding syntax, the data sources listed below are all transparent and reusable, provided in either CSV or XML/JSON formats. However, the semantic layer is much less transparent, which makes data comparison and compilation problematic. The items in Table 4 show that the same conceptual entities (such as country names) are labelled and coded differently (e.g., in the epidemiological sources).

Checking the FAIR-ness of these information sets provides a structured, well-proven approach to overcoming the complexity of investigating information sources. Table 5 below shows that most of the sources only partially meet the FAIR requirements, foreshadowing difficulties for future COVID-19 meta-analysis studies. This analysis shows that possibly the most critical requirement is interoperability. The sources usually fail the "controlled vocabularies, keywords, thesauri or ontologies are used where possible" requirement, and some also lack standard metadata formats.

• Result of testing FAIR-ness in practice: compiling epidemiological, genomics, and governmental measures' stringency data

Figure 4 demonstrates the use of the FAIR-ness assessment. We use this visualization here as an illustration of applying the FAIR principles in practice. Data sources complying with the FAIR requirements were used. In this composite figure, COVID-19 epidemiological data are combined with data on the COVID-19-related stringency of government measures and with SARS-CoV-2 genetic divergence, all from different, independent sources. Not surprisingly, all the actual steps (finding the data sets, downloading, data cleansing, importing into a spreadsheet, processing the data, and building meaningful visualizations) proved to be doable tasks.

In Figure 4, each composite data point reflects the values of a given country in a given week. The x-axis presents an 'epidemic severity' index, showing the cumulative new deaths per million people over the 2-5 weeks following the given week. The y-axis presents the stringency of the country's public health government measures, while the sizes of the dots indicate the mean genetic divergence of the viral strains present in the country that week. The colors of the dots show a K-means clustering along the timeline, computed with the Kmc tool [27], allowing us to watch for the influence of the pandemic's progress over the timeline. In this paper we analyzed the way such data compilations can be established, not the content of the figure itself. In particular, we emphasize the (FAIR-related) problems of compiling these data.
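For the clustering step that colors the dots of Figure 4, the paper used the Kmc tool [27]; the sketch below shows the same idea with scikit-learn's KMeans as an assumed stand-in. The input file layout and the cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed export of the composite points described above:
# columns = week_index, severity, stringency, mean_divergence
points = np.loadtxt("data_points.csv", delimiter=",", skiprows=1)

# Cluster along the timeline; labels can then color a Figure-4-style plot.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(points)
```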
Regarding findability and accessibility, all the sources were of equally high value, even though they do not fully comply with the formal requirements of the EUDAT FAIR requirements checklist [26]. For each of them (Our World in Data, Worldometer, Nextstrain - GISAID EpiCoV, Oxford Government Response Tracker) the data were findable and accessible, and the metadata were well defined and rich; however, regarding a persistent ID for each data element, only the genetic data fulfilled this requirement. Regarding interoperability, though, we discovered some problems. For instance, basic and crucial data elements such as labels and standard abbreviations (e.g., ISO codes) for countries differed across the sources (e.g., in the case of the United Kingdom or Macedonia). Also, none of them named the source of their country list properly, hampering reusability in the long run. In the same way, the normalization of the data (e.g., for population size) was not based on transparent data sources. The lack of explicit use of permalink-type globally unique identifiers is also an issue for long-term reusability.

• Overcoming inconsistencies in related data collection and in descriptive, textual interpretations in papers

In [28], one of the main conclusions was that the benefits of integrated terminologies, in terms of homogeneous semantics and inherent interoperability, outweigh the complexity they add to the system. This statement is relevant for COVID-19, which has proved to be a disease with a broad set of clinical manifestations; indeed, various possible pathomechanism pathways were and are being investigated. Using homogeneous semantics via integrated terminologies, both in the related papers and in the related databases, could have prevented the inconsistent presentation of signs and symptoms and of disease progression. For example, controversial anecdotal reports on the use of various hydroxychloroquine or chloroquine compounds were incomparable for various reasons, among them incomplete and inconsistent terminologies. For connecting different fields of methods, e.g., genetics and pathophysiology data of SARS-CoV-2 and COVID-19, applying the FAIR principles consistently across the available information sources would be of great value. In [29], Jacobsen et al. declare that, by intent, the 15 guiding principles of FAIR do not dictate specific technological implementations. They note that this has also resulted in inconsistent interpretations, which carry the risk of leading to incompatible implementations. They also conclude that, while the FAIR principles are formulated at a high level, true interoperability requires supporting convergence on implementation choices that are widely accessible and (re)usable. Our own findings support this as well.

• Improving the clinical response to COVID-19 (diagnostic and therapeutic issues)

Here, Schulz et al. [30] guide us: they explain that the interpretation of clinical data is highly dependent on context, that data are often un- or semi-structured, and that it is difficult to repurpose even standardized data, e.g., for clinical epidemiology, data analysis, or decision support. However, they emphasize that data interoperability has gained attention due to the value of large-scale predictive data analysis.
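The country-label discrepancies noted above call for an explicit reconciliation step before any merge. A minimal sketch follows, assuming the pycountry package as the ISO 3166 registry; the alias table is illustrative and deliberately incomplete, since a real pipeline would need a documented, versioned alias list.

```python
import pycountry  # assumed dependency providing the ISO 3166 registry

# Illustrative aliases for labels that differ across the sources.
ALIASES = {
    "UK": "United Kingdom",
    "Macedonia": "North Macedonia",
    "South Korea": "Korea, Republic of",
}

def to_iso3(label: str):
    """Map a source-specific country label to an ISO 3166-1 alpha-3 code."""
    try:
        return pycountry.countries.lookup(ALIASES.get(label, label)).alpha_3
    except LookupError:
        return None  # flag for manual review instead of silently dropping rows
```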
• A broader approach to 'language and meaning'

Regarding the lack of further growth in literature numbers since May (Figure 1), the authors of this paper suggest that this shows a kind of scientific community "bandwidth", or capacity of the scientific publication channels. This ceiling appears at roughly 2,000-2,200 papers per week (see also the red tendency arrows). The relatively low number of forecasting/modelling papers probably shows that, so far, significantly fewer scientific resources have been available for this important area than for other aspects of the COVID-19 pandemic. It could also mean that the human capacity for forecasting and modelling was fully occupied with daily operational tasks; in that case, the number of modelling papers should increase in the future. A third reason could be the lack of sufficient validated, controlled data for developing performant models. Another debatable question is whether the growth of the COVID-19-related literature, perceived as unprecedented and impressive by the numbers shown in Table 3, merely matches the general growth tendency of scientific publications or outpaces it; this should be further investigated.

FAIR-ness testing: While obtaining the results of the practical usability test, we came across some issues in compiling the epidemiological, genomics, and government measures' stringency data, as described in the Results section. In addition to the problems described there, we have to consider some interesting cognitive dissonances, e.g.: is the 'number of infections' data or metadata of an outbreak? Obviously, for an index-based disease surveillance database this number is 'data', while for an outbreak event database the same number is descriptive metadata of a given outbreak. Similarly, an aggregate of cases per country could be considered metadata if one takes the patient level as data, but it is data if the study object is the "country".

Could 'language and meaning' methods improve the clinical response, in connection with field epidemiology, to COVID-19 (diagnostic and therapeutic issues)? The authors of this paper agree with [30] that the difficulties in interpreting the COVID-19 pandemic literature highlight the need for data standards that make clinical data interoperable and shareable in a virtuous cycle of continuous improvement, for field epidemiology as well. We also support the application of the eStandards methodology, aligning reusable interoperability components, specifications, and tools.

Limitations: In this paper we focus only on the scientific part of the 'infodemic' phenomenon; we do not deal with the general media infodemic. Specifically, due to scope and mandate limits, we use one (albeit outstanding) source, the 'COVID19 Article Collection' of PubMed, from 'LitCovid', which is "a curated literature hub for tracking up-to-date scientific information about the 2019 novel Coronavirus. It is (declared to be) the most comprehensive resource on the subject, providing a central access to …. a growing (number of) relevant articles in PubMed. The articles are updated daily and are further categorized by different research topics and geographic locations for improved access." [5].
In addition, we do not offer a detailed quality analysis of the information in the studied literature, as this is out of the scope of the present study; however, we recognize the need for such further investigations. A possibly promising comparison of the contents of the relevant bioRxiv/medRxiv COVID-19 selection with the LitCovid preprints is also missing from this study.

Summing up the characteristics and some peculiarities of the COVID-19 literature, it is remarkable that never were so many papers dedicated to a pandemic. A certain "saturation" (~2,200-2,300 papers per week) might show either the upper limit of scientific capacities around the globe or that of the scientific publication "bandwidth". As various online channels were and are available for scientific publication, even without peer review, we think that the number of papers stabilizing at that level is an indicator of what the science community can produce. Never before has such an amount of varied open-access database content appeared in such a short period of time. At the same time, the vast potential of using these data has not been fully realized, and their quality is often dubious.

We argue that 'language and meaning' methods and paradigms would contribute to better results. Specifically, using integrated terminologies with homogeneous semantics would make it easier to detect the interconnectedness of the results of various studies. Applying data and metadata standards for processing epidemiological and case-related clinical information would make data from various sources more comparable, as data normalization, validation, and cleaning would require less effort. In general, the principles of FAIR data handling would further enable machine- and technical-level interoperability and meaningful knowledge sharing in a virtuous cycle of continuous improvements.

References:
Words We're Watching: 'Infodemic'
History and charter of IMIA Working Group 'Language and Meaning in Biomedicine', earlier called 'Medical Concept Representation'
Retraction: Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19
Complex Systems Science Allows Us to See New Paths Forward
Keep up with the latest coronavirus research
The journey to transparency, reproducibility, and replicability
The FAIR Guiding Principles for scientific data management and stewardship
Data set, combining epidemiological, genetics, and government stringency data of COVID-19 pandemic
How FAIR are your data? Zenodo
Kmc and kmc User Guide. kmc version 001
Recent Developments in Clinical Terminologies - SNOMED CT, LOINC, and RxNorm
FAIR Principles: Interpretations and Implementation Considerations
Fundamentals of Clinical Data Science
The Interplay of Knowledge Representation with Various Fields of Artificial Intelligence in Medicine