key: cord-0684612-np6z6w3c
authors: Martínez Beltrán, Enrique Tomás; Pérez, Mario Quiles; Pastor-Galindo, Javier; Nespoli, Pantaleone; García Clemente, Félix Jesús; Mármol, Félix Gómez
title: COnVIDa: COVID-19 multidisciplinary data collection and dashboard
date: 2021-03-30
journal: J Biomed Inform
DOI: 10.1016/j.jbi.2021.103760
sha: 62afd3fb7751b8a5333dcf1b8aec5600b5a0be36
doc_id: 684612
cord_uid: np6z6w3c

Since the first reported case in Wuhan in late 2019, COVID-19 has rapidly spread worldwide, dramatically impacting the lives of millions of citizens. To deal with the severe crisis resulting from the pandemic, worldwide institutions have been forced to make decisions that affect the socio-economic realm. In this sense, researchers from diverse knowledge areas are investigating the behavior of the disease in a race against time. In both cases, the lack of reliable data has been an obstacle to carrying out such tasks with accuracy. To tackle this challenge, COnVIDa (https://convida.inf.um.es) has been designed and developed as a user-friendly tool that easily gathers rigorous multidisciplinary data related to the COVID-19 pandemic from different data sources. In particular, the pandemic expansion is analyzed with variables of a health nature, but also social ones, mobility, etc. Besides, COnVIDa makes it possible to smoothly join such data, compare them, and download them for further analysis. Due to the open-science nature of the project, COnVIDa is easily extensible to any other region of the planet. In this way, COnVIDa becomes a data facilitator for decision-making processes, as well as a catalyst for new scientific research related to this pandemic.

On December 31, 2019, the World Health Organization (WHO) China Country Office was informed of a number of patients with pneumonia of unknown nature, detected in the city of Wuhan in the province of Hubei, China. Later on, this phenomenon was identified as the outbreak of a novel coronavirus-caused respiratory disease (i.e., COVID-19). COVID-19 rapidly spread worldwide and was declared a pandemic by the WHO on March 11, 2020 (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/events-as-they-happen).

The severe health and social crisis resulting from the COVID-19 pandemic has forced institutions and authorities to make decisions of great consequence for the lives of millions of citizens [1]. In fact, people from many different countries have suffered mobility restrictions, even lockdowns in the worst cases, which have harshly affected them from a socio-economic perspective [2, 3]. In this regard, scientists from the most diverse research disciplines have been striving ever since to investigate the behavior of the disease against the clock. For instance, virologists and epidemiologists from the medical sciences are committed to a continuous battle to find novel treatments to counteract the effects of the virus [4]. Likewise, economists are working to mitigate the negative economic impact of the epidemic [5], while researchers in computational fields study the evolution of the pandemic to predict potential changes [6].

In both cases, the lack of rigorous and reliable data has undoubtedly been an unfortunate impediment to carrying out these tasks with accuracy and precision [7]. The reason for such data scarcity lies mainly in the novelty of the virus: every individual is currently facing a threat without precedent. In such an uncertain context, causal inference from observational data [8] becomes an allied methodology to understand how and why this health crisis is unfolding. Additionally, the management and communication of official data by the public authorities have often been questionable. Therefore, the acquisition and analysis of data from trusted sources are needed more than ever, endowing humans with useful information against both COVID-19 and the spread of disinformation [9].

Under these premises, COnVIDa (https://convida.inf.um.es) has come to fill this gap by offering a user-friendly tool to easily collect relevant COVID-19 pandemic data in the context of Spain from different and trusted data sources. Additionally, COnVIDa makes it possible to join such data, compare them on a geographic basis and, if desired, download them in commonly used data analysis formats for subsequent analysis. Besides, COnVIDa has been designed and developed following open science principles [10], including open data sources during the project lifecycle [11]. Given the open science nature of the project, the tool is easily extensible to any other region of the planet. Furthermore, COnVIDa is implemented as a robust and modular web dashboard, with different components working in synergy to offer an excellent user experience [12].

In this way, COnVIDa becomes a data facilitator for decision-making processes, as well as a catalyst for new scientific research related to this pandemic. In this sense, the tool can help users to study the underlying correlations between several parameters, such as the evolution of the pandemic, the containment measures adopted by the authorities, socio-economic and educational implications, and so forth [13]. More specifically, thanks to COnVIDa, it is possible to study the impact of lockdowns or preventive measures on the population from different viewpoints. In particular, the pandemic expansion is analyzed not only with respect to variables of a health nature (e.g., increase in cases, deaths, etc.), but also in relation to parameters such as employment, consumption, tourism, and so on.

Furthermore, it will also help to predict possible future outbreaks by observing, measuring, and cross-relating certain outstanding variables (e.g., the number of active cases with Internet searches related to COVID-19 symptoms).

It is worth mentioning that any detected correlation must be analyzed and processed with rigor, and that correlation does not in any case imply causality. In this sense, it is impossible to deduce a cause-effect relationship among different parameters merely by observing a correlation or association among them [14].

This tool effectively responds to the needs of the many individuals who must work with reliable and multidisciplinary data from different sources, all linked to the COVID-19 pandemic in Spain. In our vision, those people do not necessarily possess the technical skills to automate the tedious task of data collection and representation.

At the time of writing this manuscript, COnVIDa incorporates, in addition to data of COVID-19 in Spain, data from INE (i.e., Spanish National Institute of Statistics), Mobility (provided by Google and Apple), MoMo (i.e., Daily Mortality Monitoring System from the Spanish Health Institute Carlos III), and AEMET (i.e., Spanish State Agency of Meteorology). Nonetheless, due to its modular architecture, it is easily extensible to integrate additional relevant data sources. It is, in fact, a very versatile tool, easily adaptable to geographic regions with different granularity (region, province, health area, municipality, etc.), and even to other countries.

The remainder of this paper is organized as follows. Section 2 depicts the principal characteristics of COnVIDa, which are compared with other similar tools. In Section 3, the COnVIDa data sources are presented and explained, and their choice is motivated. Next, Section 4 presents the modular architecture of COnVIDa with a particular focus on the back-end modules. Then, Section 5 describes the COnVIDa dashboard, designed to be user-friendly and highly intuitive. Also, Section 6 describes the core element of the framework, which implements the logic to store and provide data from different data sources. In Section 7, the REST API developed within the COnVIDa framework is briefly explained. Additionally, Section 8 presents an in-depth discussion about COnVIDa, arguing the utility of the proposed platform and the lessons learned throughout its design and development. Finally, Section 9 concludes the paper, presenting some interesting future lines to further improve this useful tool.

COnVIDa is a web-based platform that displays day-to-day updated data related to the impact and conditions

Going into detail, COnVIDa is strongly characterized by the following key features:

and support different types of information. In the current version, we include (i) COVID-19 medical statistics to monitor the epidemiological evolution [15], (ii) citizens' lifestyle and health measurements to reveal potential correlations [16], (iii) mobility figures to analyze agglomerations or the impact of lockdowns/reopenings [17], (iv) records on excess mortality to study actual deaths due to the pandemic [18], and (v) weather statistics to explore a possible environmental sensitivity [19]. These data sources are discussed in detail in Section 3.

F5. Dual. COnVIDa uses two visualization panels to draw the selected input conditions. The first one maintains the focus on a temporal analysis, while the other deals with a regional analysis. The temporal visualization shows a plot with the selected days on the X-axis and daily values on the Y-axis. The regional visualization draws boxplots with the selected regions on the X-axis to represent descriptive statistics of the daily values for the selected date range.

Additionally, a map is deployed to visualize the selected metrics geographically per region.

F7. Downloadable. COnVIDa allows the manual download of the filtered data in standardized CSV or XLS files to facilitate further analysis by end-users, data analysts, or scientists. The framework also implements an application programming interface (API) endpoint for programmers to retrieve the data in JSON format. The downloaded data can be used to predict pandemic dynamics such as COVID-19 infections and hospitalizations, or to apply any machine learning analysis, as evidenced by [20]. It is also possible to generate a table in HTML format (which can be useful for journalists). Moreover, the generated graphs can be downloaded in PNG format.

F8. Easy-to-use. COnVIDa implements a simple interface that intuitively offers the filtering and plotting functionalities. The design has been improved through several releases, considering the recommendations and impressions of both beta testers and end-users. A summary table with the most remarkable statistics (counts, means, standard deviations, min, max, and percentiles) of the represented data has also been included under the regional visualization panel.
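As an illustration of the summary table mentioned in F8, the following minimal sketch computes the same kinds of statistics (count, mean, standard deviation, min, max, and percentiles) for one series of daily values, using only the Python standard library. The function name `summarize` and the sample values are hypothetical and not taken from the COnVIDa code; the actual tool may compute percentiles with a different interpolation rule.

```python
import statistics


def summarize(values):
    """Return descriptive statistics for a list of daily values (hypothetical helper)."""
    # quantiles(n=4) returns the three quartile cut points; the "inclusive"
    # method treats the values as the whole population (an assumption here).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


# Made-up daily values for one data item in one region.
daily_values = [10, 12, 15, 11, 20, 18, 14]
summary = summarize(daily_values)
```

The quartiles returned here are also the values a boxplot of the same series would draw as its box edges and center line.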

The properties mentioned above make COnVIDa an innovative alternative among the existing tools. Along these lines, Table 1 analyzes other similar pandemic-monitoring platforms that provide (some of) the features mentioned above. Despite the many tools that emerged at the beginning of the pandemic, we only con-

Finally, the internal engine of our solution is easily extensible, so developers can program new data sources. Thus, everyone can contribute to our platform or test their theories beyond the data sources that we offer as standard. We have not observed such a feature in any other analyzed application, except for OWiD Explorer, which maintains a highly active and well-documented GitHub repository.

As mentioned above, one of the strengths of the COnVIDa framework is the multidisciplinary nature of the data. These relevant data come from different official data sources to provide reliable information about the pandemic effects and avoid the spread of disinformation.

In addition, each selected data source is composed of one or more data items. A data item is a fine-grained resource that encodes a specific piece of information and belongs to one of the aforementioned data sources.
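The relationship between data sources and data items can be pictured with a small data model. The sketch below is only illustrative: the class names `DataSource` and `DataItem`, their fields, and the example entries are assumptions made here, not the structures used in the COnVIDa code base.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataItem:
    """One specific piece of information offered by a data source."""
    name: str
    units: str


@dataclass
class DataSource:
    """A provider of one or more data items (hypothetical structure)."""
    name: str
    temporal: bool  # True if the source offers daily series
    items: list = field(default_factory=list)

    def add_item(self, item: DataItem) -> None:
        self.items.append(item)


# Example entries; the item names and units are illustrative only.
weather = DataSource(name="AEMET", temporal=True)
weather.add_item(DataItem(name="Maximum temperature", units="degrees Celsius"))
weather.add_item(DataItem(name="Rainfall", units="mm"))

item_names = [item.name for item in weather.items]
```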

Next, each of the COnVIDa data sources is described in detail and motivated. For easy reference, see Table 2.

The current version of COnVIDa includes five data sources related to the COVID-19 pandemic in Spain.

The first data source is esCOVID19data 10, which offers daily data at community and province levels about the COVID-19 evolution and its most essential as-

In the same way, information about the mobility of citizens is collected from two relevant sources: on the one hand, mobility data gathered from Google 12 and, on the other hand, the information provided by Apple 13. The temporal granularity of these data is daily, while their regional granularity is constrained to Spanish communities. This information helps to understand the spread of the virus throughout those regions, which may affect the normal mobility of citizens across distinct areas (e.g., parks, residential, etc.). In this case, mobility values are updated by the official data sources every day.

To evaluate the real mortality caused by the COVID- 

To determine how the climate and weather conditions may eventually affect the pandemic, meteorological data are obtained from the AEMET 15 ("Agencia Estatal de Meteorología", the Spanish State Agency of Meteorology). The retrieved metrics corresponding to this data source (e.g., minimum/maximum temperature, rainfall, solar radiation, etc.) are organized by day and Spanish community. It is worth mentioning that, every day, the agency publishes the weather statistics corresponding to five days earlier.
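Because of this publication delay, the most recent day for which weather data can be expected lags behind the current date. A minimal sketch of that date arithmetic follows; the constant and function names are hypothetical, and the five-day lag is the one described in the text.

```python
from datetime import date, timedelta

AEMET_LAG_DAYS = 5  # publication delay described in the text


def latest_available_day(today: date, lag_days: int = AEMET_LAG_DAYS) -> date:
    """Return the most recent day whose statistics should already be published."""
    return today - timedelta(days=lag_days)


latest = latest_available_day(date(2021, 1, 8))
```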

From a technical perspective, COnVIDa is a synergy of various elements implemented strategically in a web server to provide and display data securely, promptly, and reliably, as shown in Figure 2. The resulting architecture is composed of the following modules:

This container-based distributed design benefits from the advantages of virtualization [21], such as scalability, dynamic management, and agile deployment. Therefore, the modules are interconnected in the same internal virtualized network.

As previously mentioned, COnVIDa implements a web- is to make any user perceive a sense of order within the page, thus facilitating its wide use and acceptance.

The dashboard itself is in turn composed of the following main panels:

M1. Data selection panel. It is in charge of selecting the specific data to be represented. As illustrated in Table 3, although the INE data source cannot be considered for temporal representation, the rest of the data items offer daily metrics that can be crossed together. For instance, a user could select one data item from each temporal data source, two different regions, and a time window from 21/02/2020 until 8/03/2020. As a result, the X-axis will show all the days between those two dates. In contrast, the Y-axis will show the corresponding values for these data items in the regions. Therefore, this panel allows obtaining a graphical representation of the information (as reported in Figure 4).

shown in Figure 5a. Additionally, a summary table is also built characterizing each selected data item (see Figure 5c).

Figure 5: Display of regional data. (a) Regional visualization (boxplot chart); (b) regional visualization (geographical map); (c) summary table of selected data.
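The temporal example given earlier (a time window from 21/02/2020 until 8/03/2020) can be sketched as follows: the days of the window form the X-axis, and each selected region contributes one series of values for the Y-axis. The record layout, the function names, and the numbers are hypothetical and serve only to illustrate the filtering step.

```python
from datetime import date, timedelta


def days_in_window(start: date, end: date) -> list:
    """Return every day from start to end, both inclusive (the X-axis)."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def series_for(records, item, region, days):
    """Return the Y-axis values of one data item in one region.

    A day without a record is represented by None.
    """
    lookup = {(r["day"], r["region"], r["item"]): r["value"] for r in records}
    return [lookup.get((day, region, item)) for day in days]


# Made-up records in a flat (day, region, item, value) layout.
records = [
    {"day": date(2020, 2, 21), "region": "Murcia", "item": "cases", "value": 1},
    {"day": date(2020, 3, 8), "region": "Murcia", "item": "cases", "value": 2},
    {"day": date(2020, 3, 8), "region": "Madrid", "item": "cases", "value": 5},
]

x_axis = days_in_window(date(2020, 2, 21), date(2020, 3, 8))
y_murcia = series_for(records, "cases", "Murcia", x_axis)
```

Since 2020 was a leap year, this window contains 17 days, so each series has 17 points.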

Firstly, it is worth remarking that all data sources are programmed and integrated into COnVIDa through the same methodology, which makes integration between data items much more direct. In particular, to guarantee homogeneous data modeling, data sources must respect the interface with the tool through minimum configuration requisites and implementation guidelines. These details are explained in depth in Section 6.3.

Additionally, COnVIDa defines reference units of time and regions. Currently, we consider the 'day' as the standard time unit (due to its importance in monitoring the pandemic), and both 'communities' and 'provinces' as valid regional units (for their relevance in socially and politically managing the pandemic). These formats should be respected by any data source integrated into COnVIDa for homogeneity. However, some points should still be considered when using COnVIDa and mixing different types of data sources, as summarized in Table 3.

With respect to the temporal integration across data sources, any temporal data source can be selected, cross- 

Regardless of whether the selected data items are temporal or regional, the graphical representation of the data can be changed. First, the data can be shown in a line or bar chart, adapting to the user's purpose. Then, the scale can be changed (linear or logarithmic, the latter being useful when comparing data series of significantly different scales). Any selection made is applied directly to the data, immediately and automatically.

19 COVID-19 data items can also be compared per province.

Furthermore, COnVIDa includes the functionality to download the data in a generic format (CSV, XLS, JSON, or HTML) (see Figure 6). Both the raw data and the summary table of the selected data can be downloaded.
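A minimal sketch of such an export step is shown below, using only the Python standard library. It serializes the same rows to CSV and to JSON; the column names, the sample rows, and the function names are assumptions for illustration, and XLS and HTML output are omitted.

```python
import csv
import io
import json


def to_csv(rows, columns):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same rows to a JSON array."""
    return json.dumps(rows, ensure_ascii=False)


# Made-up rows; the column names are illustrative.
rows = [
    {"day": "2021-01-01", "region": "Murcia", "value": 1.5},
    {"day": "2021-01-02", "region": "Murcia", "value": 2.0},
]
csv_text = to_csv(rows, ["day", "region", "value"])
json_text = to_json(rows)
```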

The COnVIDa library is the core Python-based implementation of end-user request management and external data source collection. In particular, the library offers functions that, given a range of dates, a list of data items, and a list of regions, return the tabular data in a standard format from a local cache or external repositories. The main elements of the COnVIDa library are the following:

• COnVIDa-server. Server side of the COnVIDa service, which implements the logic to process the dashboard/API requests and query the Data Cache to ultimately return the results. Moreover, it is programmed to update the Data Cache daily. This element is described in depth in Section 6.1.

• Data Cache. Structure that stores the data locally for the dashboard and REST API. This structure permits the dispatch of dashboard/API requests locally on our server, without the need of externally downloading the requested data every time. It is explained in Section 6.2.

• COnVIDa-lib. Core implementation of the crawling functionality. It offers functions to retrieve the data supported by the tool. Each request using this library entails the downloading of data. This submodule can be augmented with new data sources without altering the functioning of the rest of the elements, as suggested in Section 6.3.
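The interplay of the three elements can be summarized as a local cache that is filled by the crawling library and queried by the server. The sketch below illustrates that pattern under stated assumptions: the class `DataCache`, the `fetch` callable standing in for COnVIDa-lib, and the key layout are all hypothetical, and the real Data Cache may be organized differently.

```python
from datetime import date, timedelta


class DataCache:
    """Local store that answers requests without contacting external repositories."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable (day, item, region) -> value; stands in for the crawler
        self._store = {}
        self.downloads = 0   # number of external retrievals performed

    def update(self, days, items, regions):
        """Refresh the cache (e.g. once a day) by downloading every combination."""
        for day in days:
            for item in items:
                for region in regions:
                    self._store[(day, item, region)] = self._fetch(day, item, region)
                    self.downloads += 1

    def query(self, start, end, items, regions):
        """Return rows for a date range, served only from the local store."""
        n_days = (end - start).days + 1
        days = [start + timedelta(days=i) for i in range(n_days)]
        return [
            {"day": d, "item": i, "region": r, "value": self._store.get((d, i, r))}
            for d in days for i in items for r in regions
        ]


def fake_fetch(day, item, region):
    """Placeholder for an external download; returns a deterministic dummy value."""
    return day.day


cache = DataCache(fake_fetch)
window = [date(2021, 1, 1), date(2021, 1, 2)]
cache.update(window, ["cases"], ["Murcia", "Badajoz"])
rows = cache.query(date(2021, 1, 1), date(2021, 1, 2), ["cases"], ["Murcia"])
```

In this arrangement, only `update` triggers downloads, so any number of dashboard or API queries can be served between two refreshes at no external cost.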

COnVIDa-server 20 , whose class diagram is presented in Figure 7 , with the refresh time in the availability of the data in the original repositories.

COnVIDa-lib 22 is responsible for collecting the data from external repositories. As mentioned above, the regular update of the data is performed through the functions of this library.

An appealing feature of this package is that it can be used in isolation without deploying the rest of the web server with the associated architecture. In this sense, the classes and functions implemented can be used by any programmer to launch gathering processes in their applications.
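One way to picture such a reusable, object-oriented gathering package is a parent class that fixes the common workflow and leaves the source-specific processing to subclasses. The following sketch is illustrative only: the names `BaseSource`, `ToyWeatherSource`, `_read_config`, and `_process_partial_data` are hypothetical and do not reproduce the actual COnVIDa-lib classes, and the configuration is a plain dictionary rather than configuration files.

```python
from abc import ABC, abstractmethod


class BaseSource(ABC):
    """Hypothetical parent class; each subclass implements one external source."""

    DATA_ITEMS = None  # filled lazily from the configuration on first use

    @classmethod
    def _load_config(cls):
        if cls.DATA_ITEMS is None:
            cls.DATA_ITEMS = cls._read_config()["data_items"]

    @classmethod
    @abstractmethod
    def _read_config(cls):
        """Return the configuration of the source as a dict."""

    @abstractmethod
    def _process_partial_data(self, raw):
        """Convert raw rows to (day, region, item, value) tuples."""

    def get_data(self, raw):
        self._load_config()
        rows = self._process_partial_data(raw)
        # Keep only the data items declared in the configuration.
        return [row for row in rows if row[2] in self.DATA_ITEMS]


class ToyWeatherSource(BaseSource):
    DATA_ITEMS = None

    @classmethod
    def _read_config(cls):
        return {"data_items": ["rainfall"]}

    def _process_partial_data(self, raw):
        return [(r["date"], r["community"], r["metric"], r["value"]) for r in raw]


# Made-up raw rows, as they might arrive from an external repository.
raw_rows = [
    {"date": "2021-01-01", "community": "Murcia", "metric": "rainfall", "value": 0.0},
    {"date": "2021-01-01", "community": "Murcia", "metric": "wind", "value": 12.0},
]
result = ToyWeatherSource().get_data(raw_rows)
```

The subclass can be instantiated and called directly from any script, without a web server or dashboard being present.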

Moreover, COnVIDa-lib constitutes an object-oriented package ready to be extended. Considering the principal elements and terminology, implementing a new data source (and its associated data items) should be simple. Figure 8 shows the relationships to be considered when implement-

• Extend the parent data source class.

• Set the class attributes to None (on the first execution of the class, these class attributes will load their values from the configuration files).

• Define and implement the functions of the class. Specifically, the function that processes partial data should apply the necessary transformations to return data compliant with the standard temporal and regional granularity.

• If extra configuration elements are needed for the data source, they should be read.

The COnVIDa REST API implements two different routes for obtaining data, as reported in Table 4. The first one is used to retrieve information from temporal data where, as previously mentioned, it is necessary to indicate a range of dates. The second is employed to retrieve regional data and, more precisely, to extract information from the INE data source.

To better understand the functioning of this module, an example is presented next. The COnVIDa REST API is asked for "Cumulative incidence in the last 14 days" data in two regions: Murcia and Badajoz. To obtain such data, a POST request is made to the URL https://convida.inf.um.es/api/temporal. The parameter data is set to the value "Cumulative incidence in the last 14 days", regions to the values "Murcia" and "Badajoz", start_date to the date "2021-01-01", and end_date to "2021-01-08". Once the request has been made, the API responds with a 200 status code and provides the requested data for each region and day of the chosen range (see Figure 9). The request made by the user goes directly to the REST service without going through the dashboard (see Figure 2). This offers a faster response since no graphs or tables have to be generated. Moreover, this service fosters the possibility of efficiently implementing third-party applications leveraging the modularity of the COnVIDa library.
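The request described above can be assembled in Python with the standard library alone. This is a sketch under assumptions: the text does not state how the parameters are encoded, so sending them as a JSON body is a guess (form encoding is equally plausible), and the helper names `build_temporal_request` and `send` are hypothetical. Only the URL, the HTTP method, and the four parameter names come from the description above.

```python
import json
import urllib.request

API_URL = "https://convida.inf.um.es/api/temporal"


def build_temporal_request(data, regions, start_date, end_date):
    """Build (without sending) a POST request for temporal data."""
    payload = {
        "data": data,
        "regions": regions,
        "start_date": start_date,  # yyyy-mm-dd
        "end_date": end_date,      # yyyy-mm-dd
    }
    body = json.dumps(payload).encode("utf-8")  # JSON encoding is an assumption
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode a JSON response (requires network access)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_temporal_request(
    ["Cumulative incidence in the last 14 days"],
    ["Murcia", "Badajoz"],
    "2021-01-01",
    "2021-01-08",
)
```

Calling `send(request)` would perform the actual exchange; it is kept separate so that the request can be inspected or tested without network access.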

In the timely battle against the spread of COVID-19, COnVIDa has been designed and developed to gather, manage, and visualize the pandemic data. Throughout the previous sections, COnVIDa has been analyzed in depth, focusing on its most relevant features and modular architecture. In this section, an in-depth discussion is presented, explicitly arguing the utility of the proposed platform and the lessons learned through its design and development.

Table 4: COnVIDa REST API routes

/api/temporal POST
• data: List of data items.
• regions: List of regions.
• start_date: First day in the range, yyyy-mm-dd format (e.g. 2020-01-01).
• end_date: Last day in the range (e.g. 2020-05-13).

/api/regional POST
• data: List of data items (e.g. [Physical activity]).
• regions: List of regions (e.g. [Murcia, Andalucía]).

Germany, Argentina, and others, as shown in Figure 10. Furthermore, COnVIDa has achieved more than 2,000 active users worldwide, with a particular emphasis on Spain.

Thanks to its design, users have accessed the COnVIDa website from both mobile devices and computers, according to Google Analytics statistics.

In turn, Figure 11 shows the preferred query selectors for COnVIDa users. Based on the 2,800 queries made in total, we can see that positive cases, PCR cases, deaths, ICU patients, and hospitalized patients are the most popular selectors. Other selectors related to Mobility or Weather are also used, but to a lesser extent.

In Figure 12 , it can be seen that the most consulted regions through the website are Murcia, Andalucía, Madrid, and Navarra, surpassed only by the consultation of data for the whole country of Spain. Similarly, a graph has been included on the most consulted provinces (Figure 13) Furthermore, the development of the modular COn-VIDa architecture and functions has required extensive efforts. Once the challenge mentioned above of identifying valuable data sources has been accomplished, the platform needs to acquire such data with a certain frequency and manage them. To achieve this, COnVIDa architecture contains two core modules, namely, COnVIDa-server

and COnVIDa-lib, as detailed in Section 6. In particular, COnVIDa-lib has been designed as a class hierarchy to form an object-oriented package ready to be extended.

In fact, more expert users with programming skills can reuse the functions and classes implemented in the library within their workspaces. Moreover, the entire COnVIDa

Besides, COnVIDa has been proposed to reach user categories with different expertise and intentions. Thus, it is designed to provide a great user experience regardless of user purposes and skills. To this end, COnVIDa undergoes periodic usage tests run by volunteer beta testers. Notably, the platform has been submitted to the evaluation of various users with different profiles (i.e., normal users, managers, researchers, journalists), aiming to receive meaningful feedback on improving the dashboard and its functionalities. Thanks to their work, COnVIDa has been enhanced iteratively and is ready to support diverse users.

This paper has described COnVIDa, a web-based tool for compiling and displaying multidisciplinary data about the current COVID-19 pandemic. The user can select the range of dates, the regions, and data items to inspect, and the framework smoothly plots the associated graphics. At the time of writing, the platform can collect clinical/pandemic, social, mobility, excess mortality, and weather information from official Spanish sources. Moreover, thanks to the adoption of the open science philosophy, our solution is technically designed to be easily extended with additional data sources and geographical regions. In this sense, we have found very few projects whose functionality can be directly extended by third-party contributions.

Having analyzed other similar platforms, we can conclude that COnVIDa is the most powerful tool in terms of flexibility when filtering specific scenarios and cross-comparing any set of variables, hence enabling scientists or decision-makers to consult cross-cutting relationships and, ultimately, expose correlations in specific situations. The mass media have reported widely on this novel dashboard, and the usage statistics confirm that it is proving particularly useful in exploring places where the virus is most prevalent.

As for future directions, the authors are preparing different releases that aim to increase the functionalities and scope of the website. In this sense, the main upcoming milestones are: i) to allow further types of granularity when filtering places, such as countries, health areas, or even municipalities, or dates, such as week-level or month-level monitoring; ii) to add data items related to the use of public transport, economy, employment, and Google search trends; iii) to support user authentication, manage users' query records, logs, and statistics, and even permit those users to propose new data series or specific correlations; and iv) to introduce basic analytical capabilities by incorporating options for applying machine learning algorithms, making causal inference from observational data, running automatic correlation studies, and suggesting prediction models.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


[1] The cancellation of mass gatherings (MGs)? Decision making in the time of COVID-19
[2] The economic effects of COVID-19 containment measures
[3] COVID-19 social distancing in the Kingdom of Saudi Arabia: Bold measures in the face of political, economic, social and religious challenges
[4] Tocilizumab treatment in COVID-19: A single center experience
[5] The socio-economic implications of the coronavirus pandemic (COVID-19): A review
[6] Data-based analysis, modelling and forecasting of the COVID-19 outbreak
[7] Data gaps and the policy response to the novel coronavirus, Working Paper 26902
[8] Causal inference in public health
[9] Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set
[10] Tackling the challenges of 21st-century open science and beyond: A data science lab approach
[11] The not yet exploited goldmine of OSINT: Opportunities, open challenges and future trends
[12] Automatic generation and easy deployment of digitized laboratories
[13] Digital technology and COVID-19
[14] Thinking clearly about correlations and causation: Graphical causal models for observational data
[15] How decision makers can use quantitative approaches to guide outbreak responses
[16] Obesity and impaired metabolic health in patients with COVID-19
[17] Aggregated mobility data could help fight COVID-19
[18] Excess all-cause mortality during the COVID-19 pandemic in Europe: Preliminary pooled estimates from the EuroMOMO network
[19] Misconceptions about weather and seasonality must not misguide COVID-19 response
[20] COVID-19 diagnostics, tools, and prevention
[21] Containerization and the PaaS cloud
[22] Association between mobility patterns and COVID-19 transmission in the USA: A mathematical modelling study
[23] A framework for research linking weather, climate and COVID-19

This study was partially funded by grants from the Spanish Government with codes FPU18/00304 and RYC-2015-18210, co-funded by the European Social Fund, and by a predoctoral grant from the University of Murcia.