key: cord-131678-rvg1ayp2
authors: Ponce, Marcelo; Sandhel, Amit
title: covid19.analytics: An R Package to Obtain, Analyze and Visualize Data from the Corona Virus Disease Pandemic
date: 2020-09-02
doc_id: 131678
cord_uid: rvg1ayp2

With the emergence of a new pandemic worldwide, a novel strategy to approach it has also emerged. Several initiatives under the umbrella of "open science" are contributing to tackling this unprecedented situation. In particular, the "R Language and Environment for Statistical Computing" offers an excellent tool and ecosystem for approaches focusing on open science and reproducible results. Hence it is not surprising that, with the onset of the pandemic, a large number of R packages and resources were made available for researchers working on the pandemic. In this paper, we present an R package that allows users to access and analyze worldwide data from publicly available resources. We introduce the covid19.analytics package, focusing on its capabilities, and present a particular case study where we describe how to deploy the "COVID19.ANALYTICS Dashboard Explorer".

In 2019 a novel type of Corona Virus was first reported, originally in the province of Hubei, China. In a time frame of months, this new virus was capable of producing a global pandemic of the Corona Virus Disease (CoViD19), which can end up in a Severe Acute Respiratory Syndrome; the virus responsible for it is referred to as SARS-CoV-2. The origin of the virus is still unclear [4, 5, 6], although some studies based on genetic evidence suggest that it is quite unlikely that this virus was human-made in a laboratory, and instead point towards cross-species transmission [7, 8]. Although this is not the first time in human history that humanity faces a pandemic, this pandemic has unique characteristics. To begin with, the virus is "peculiar", as not all the infected individuals experience the same symptoms. Some individuals display symptoms similar to those of a common cold or flu, while others experience serious symptoms that can cause hospitalization with different levels of severity, including stays in intensive-care units (ICU) for several weeks or even months, or even death. A recent medical survey shows that the disease can transcend pulmonary manifestations, affecting several other organs [9]. Studies also suggest that the level of severity of the disease can be linked to previous conditions [10], gender [11], or even blood type [12], but the fundamental and underlying reasons still remain unclear. Some infected individuals are completely asymptomatic, which makes them ideal vectors for disseminating the virus. This also makes it very difficult to precisely determine the transmission rate of the disease, and it is argued that, in part due to these peculiar characteristics of the virus, some initial estimates underestimated its actual value [13]. The elderly are the most vulnerable to the disease, and reported mortality rates vary from 5 to 15% depending on the geographical location. In addition to this, the high connectivity of our modern societies makes it possible for a virus like this to spread widely around the world in a relatively short period of time. What is also unprecedented is the pace at which the scientific community has engaged in fighting this pandemic on different fronts [14]. Technology and scientific knowledge are playing, and will continue to play, a fundamental role in how humanity faces this pandemic, helping to reduce the risk of individuals being exposed to, or suffering serious illness from, the disease.
Techniques such as DNA/RNA sequencing, computer simulations, model generation and prediction are nowadays widely accessible, and can greatly help to evaluate and design the best course of action in a situation like this [15]. Public health organizations are relying on mathematical and data-driven models (e.g. [16]) to draw policies and protocols, in order to try to mitigate the impact on societies without overwhelming their health institutions and resources [17]. Specifically, mathematical models of the evolution of the virus spread have been used to establish strategies, like social distancing, quarantines, self-isolation and staying at home, to reduce the chances of transmission among individuals. Usually, vaccination also emerges as a possible contention strategy; however, this is still not a viable option in the case of CoViD19, as no vaccine has been developed yet [18, 19]. Simulations of the spread of the virus have also shown that the most efficient ways to reduce it include [20]: increasing social distancing, which refers to staying apart from other individuals so that the virus cannot disperse so easily among them; improving hygiene routines, such as proper hand washing, use of hand sanitizer, etc., which reduce the chances of the virus remaining effective; and quarantine or self-isolation, again to reduce unnecessary exposure to other potentially infected individuals. Of course, these recommendations based on simulations and models can only be as accurate and useful as the simulations themselves, which ultimately depend on the values of the parameters used to set up the initial conditions of the models. Moreover, these parameters strongly depend on the actual data, which can also be sensitive to many other factors, such as data collection or reporting protocols, among others [21]. Hence, having accurate, reliable and up-to-date data is critical when trying to understand the conditions for the spread of the virus, but also for predicting possible outcomes of the epidemic, as well as for designing proper containment measures. Similarly, being able to access and process the huge amount of genetic information associated with the virus has proven to shed light on the disease's path [22, 23]. Accompanying these unprecedented times, another interesting phenomenon has also occurred, in part related to a contemporaneous trend in how science can be done, emphasizing transparency, reproducibility and robustness: an open approach to the methods and the data, usually referred to as open science. In particular, this approach has been part of the software developer community for quite some time, in the so-called open source projects or codes. This way of developing software offers many advantages in comparison to the more traditional, closed, proprietary approaches. To begin with, it allows any interested party to look at the actual implementation of the code, and to criticize, complement or even contribute to the project. It improves transparency and, at the same time, guarantees higher standards due to public scrutiny, which in the end benefits everyone: the developers, by increasing their reputation and reach and by consolidating a widely validated product, and the users, by having direct access to the sources and details of the implementation. It also helps with the reproducibility of results and with bug reports and fixes.
Several approaches and initiatives have taken these openness concepts and implemented them in their platforms. Specific examples of this have flooded the Internet, e.g. the surge of open-source-powered dashboards [24], open data repositories, etc. Other examples are the number of scientific papers related to CoViD19 published since the beginning of the pandemic [25], and the amount of data and tools developed to track the evolution of the pandemic [26]. As a matter of fact, scientists are now drowning in publications related to CoViD19 [27, 28], and some collaborative and community initiatives are trying to use machine learning techniques to facilitate identifying and digesting the most relevant sources for a given topic [29, 30, 31]. The "R Language and Environment for Statistical Computing" [1, 2] is no exception here. Moreover, promoting and building upon open source and open community principles, R has empowered scientists and researchers since its inception. Not surprisingly, then, the R community has already contributed more than a dozen packages related to the CoViD19 pandemic to the official CRAN [32] repository since the beginning of the crisis. In particular, in this paper we introduce and discuss the covid19.analytics R package [3], which is designed with an open and modular approach, focused on providing researchers quick access to the latest reported worldwide data on CoViD19 cases, as well as analytical and visualization tools to process this data. This paper is organized as follows: in Sec. 2 we describe the covid19.analytics package; in Sec. 3 we present some examples of data analysis and visualization; in Sec. 4 we describe in detail how to deploy a web dashboard employing the capabilities of the covid19.analytics package, providing full details on the implementation so that this procedure can be repeated and followed by users interested in developing their own dashboards. Finally, we summarize some conclusions in Sec. 5. The covid19.analytics R package [3] allows users to obtain live worldwide data from the novel CoViD19. It does this by accessing and retrieving the data publicly available and published by two main sources: the "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University" [33] for the worldwide and US data, and the City of Toronto for the Toronto data [34]. The package also provides basic analysis and visualization tools and functions to investigate these datasets, and other ones structured in a similar fashion. The covid19.analytics package is an open source tool, whose main implementation and API is the R package [3]. In addition to this, the package has a few more add-ons:
• a central GitHub repository, https://github.com/mponce0/covid19.analytics, where the latest development version and source code of the package are available; users can also submit tickets for bugs, suggestions or comments using the "issues" tab;
• a rendered version with live examples and documentation, also hosted at GitHub pages, https://mponce0.github.io/covid19.analytics/;
• a dashboard for interactive usage of the package, with extended capabilities for users without any coding expertise, https://covid19analytics.scinet.utoronto.ca (we discuss the details of this implementation in Sec. 4);
• a "backup" data repository hosted at GitHub, https://github.com/mponce0/covid19analytics.datasets, where replicas of the live datasets are stored for the sake of redundancy and robust accessibility (see Fig. 1).
One of the main objectives of the covid19.analytics package is to make the latest data from the reported cases of the current CoViD19 pandemic promptly available to researchers and the scientific community. In what follows we describe the main functionalities of the package regarding data accessibility. The covid19.data function allows users to obtain real-time data about the CoViD19 reported cases from the JHU's CSSE repository, in the following modalities:
• aggregated data for the latest day, with a high level of 'granularity' of geographical regions (i.e. cities, provinces, states, countries);
• time series data for larger accumulated geographical regions (provinces/countries);
• deprecated: we also include the original data style in which these datasets were reported initially.
The datasets also include information about the different categories (status) of the cases reported daily per country/region/city: "confirmed"/"deaths"/"recovered". This data-acquisition function will first attempt to retrieve the data directly from the JHU repository with the latest updates. If for whatever reason this fails (e.g. problems with the connection), the package will load a preserved "image" of the data, which is not the latest one but will still allow the user to explore this older dataset. In this way, the package offers a more robust and resilient approach to the quite dynamic situation with respect to data availability and integrity. In addition to the data of the reported cases of CoViD19, the covid19.analytics package also provides access to genomics data of the virus. The data is obtained from the National Center for Biotechnology Information (NCBI) databases [35, 36]. Table 1 shows the functions available in the covid19.analytics package for accessing the reported cases of the CoViD19 pandemic. The functions can be divided into different categories, depending on what data they provide access to; for instance, they are distinguished between aggregated and time series data sets. They are also grouped by specific geographical locations, i.e. worldwide, United States of America (US), and City of Toronto (Ontario, Canada) data. The "Time Series" data is structured in a specific manner, with a given set of fields or columns resembling the following format:

"Province.State" | "Country.Region" | "Lat" | "Long" | ... sequence of dates ...

One of the modular features this package offers is that if a user has data structured in a data.frame organized as described above, then most of the functions provided by the covid19.analytics package for analyzing "Time Series" data will just work with the user's data. In this way it is possible to add new data sets to the ones that can be loaded using the repositories predefined in this package, and extend the analysis capabilities to these new datasets. Sec. 3.5 presents an example of how external or synthetic data has to be structured so that it can be used with the functions of the covid19.analytics package. It is also recommended to check the compatibility of such datasets using the Data Integrity and Consistency Check functions described in the following section.
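As a quick complement to those checks, a minimal structural inspection of a user-supplied data.frame could look as follows. This is just an illustrative sketch, not part of the package; ts.format.ok is a hypothetical helper, and it assumes the date columns are named in the YYYY-MM-DD format used in the synthetic example of Sec. 3.5:

# sketch: check that a user-supplied data.frame follows the TS structure
ts.format.ok <- function(df) {
   required.cols <- c("Province.State", "Country.Region", "Lat", "Long")
   has.cols <- all(required.cols %in% names(df))
   # the remaining columns should be a sequence of dates
   date.cols <- setdiff(names(df), required.cols)
   has.dates <- length(date.cols) > 0 &&
                !any(is.na(as.Date(date.cols, format = "%Y-%m-%d")))
   has.cols && has.dates
}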
Due to the ongoing and rapidly changing situation with the CoViD-19 pandemic, the reported data has sometimes been observed to change its internal format, or even to show some anomalies or inconsistencies. For instance, in some cumulative quantities reported in the time series datasets, it has been observed that, instead of increasing monotonically, these quantities sometimes decrease, which should not happen; we refer to this as an inconsistency of "type II". Some negative values have been reported in the data as well, which is also not possible or valid; we call this an inconsistency of "type I". When this occurs, it happens at the level of the origin of the dataset, in our case the one obtained from the JHU/CSSE repository [33]. In order to make the user aware of this, we implemented two consistency- and integrity-checking functions:
• consistency.check: this function attempts to determine whether there are consistency issues within the data, such as negative reported values (inconsistency of "type I") or anomalies in the cumulative quantities of the data (inconsistency of "type II");
• integrity.check: this determines whether there are integrity issues within the datasets, or changes to the structure of the data.
Alternatively, we provide a data.checks function that will execute the previously described functions on a specified dataset.
Data Integrity. It is highly unlikely that the user would face a situation where the internal structure of the data, or its actual integrity, is compromised. However, if there is any suspicion about this, it is possible to use the integrity.check function to verify it. If anything like this is detected, we urge users to contact us about it, e.g. via https://github.com/mponce0/covid19.analytics/issues.
Data Consistency. Data consistency issues and/or anomalies in the data have been reported several times. These are claimed, in most cases, to be misreported data, and usually represent an insignificant fraction of the total number of cases. Having said that, we believe that users should be aware of these situations, and we recommend using the consistency.check function to verify the dataset you will be working with.
Nullifying Spurious Data. In order to deal with the different scenarios arising from incomplete, inconsistent or misreported data, we provide the nullify.data function, which will remove any entry in the data suspected of these inconsistencies. In addition to that, the function accepts an optional argument stringent=TRUE, which will also prune any incomplete cases (e.g. with NAs present).
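The following snippet illustrates a typical workflow combining the checking functions described above:

# read time series data for confirmed cases
ts.data <- covid19.data("ts-confirmed")

# look for "type I"/"type II" inconsistencies in the data
consistency.check(ts.data)
# check the internal structure/integrity of the dataset
integrity.check(ts.data)
# or run both checks at once
data.checks(ts.data)

# remove any entries suspected of inconsistencies;
# stringent=TRUE also prunes incomplete cases (e.g. containing NAs)
clean.data <- nullify.data(ts.data, stringent = TRUE)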
Similarly to the rapid developments and updates in the reported cases of the disease, the sequencing of the virus is moving at almost equal pace. That is why the covid19.analytics package provides access to a good number of the genomics datasets currently available. The covid19.genomic.data function allows users to obtain the CoViD19 genomics data from the NCBI databases [36]. The type of genomics data accessible from the package is described in Table 2. Although the package attempts to provide the latest available genomic data, there are a few important details and differences with respect to the reported-cases data. To begin with, the amount of genomic information available is far larger than the data reporting the number of cases, which adds additional constraints when retrieving this data. In addition to that, the hosting servers for the genomic databases impose certain limits on the rate and amount of downloads. In order to mitigate these factors, the covid19.analytics package employs a couple of different strategies, as summarized below:
• most of the data will be attempted to be retrieved live from the NCBI databases -- same as using src='livedata';
• if that is not possible, the package will attempt to obtain the data from a mirror repository where the datasets are updated on a regular basis, but not necessarily with the latest updates -- same as using src='repo';
• as a last resort, the package keeps a local version of some of the largest datasets (i.e. genomes, nucleotides and proteins), which might not be up-to-date -- same as using src='local'.
This sequence of steps is implemented in the package using tryCatch() exceptions in combination with recursion, i.e. the data-retrieval function calling itself with different arguments indicating which data source to use; a sketch of this fallback pattern is shown below. As the covid19.analytics package tries to present the user with the latest data sets possible, different strategies (as described above) may be in place to achieve this. One way to improve the reliability of the access to, and availability of, the data is to use a series of replicas of the datasets hosted at different locations. Fig. 1 summarizes the different data sources and points of access that the package employs in order to retrieve the data and keep the latest datasets available.

Figure 1: Schematic of the data acquisition flows between the covid19.analytics package and the different sources of data. Dark solid/dashed lines represent API functions provided by the package and accessible to the users; dotted lines are "internal" mechanisms employed by the package to synchronize and update replicas of the data. Data acquisition from the NCBI servers is mostly done utilizing the ape [37] and rentrez [38] packages.

Genomic data, as mentioned before, is accessed from the NCBI databases. This is implemented in the covid19.genomic.data function employing the ape [37] and rentrez [38] packages. In particular the proteins dataset, with more than 100K entries, is quite challenging to obtain "live". As a matter of fact, the covid19.genomic.data function accepts an argument to specify whether this should be attempted or not. If the src argument is set to 'livedata', the function will attempt to download the proteins list directly from the NCBI databases. If this fails, we recommend using the argument src='local', which will provide a staged copy of this dataset from the moment the package was submitted to the CRAN repository, meaning that this dataset will quite likely be incomplete and most likely outdated. Additionally, we offer a second replica of the datasets, located at https://github.com/mponce0/covid19analytics.datasets, where all datasets are updated periodically; this can be accessed using the argument src='repo'.
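The following is a minimal sketch of the fallback pattern just described; it is not the package's actual internal implementation, and retrieve.from is a hypothetical routine standing in for the package's internal retrieval calls:

# sketch of the fallback sequence: livedata -> repo -> local
get.data.robustly <- function(src = "livedata") {
   tryCatch(
      retrieve.from(src),   # hypothetical data-retrieval routine
      error = function(e) {
         # on failure, recursively retry with the next data source
         nxt <- switch(src, livedata = "repo", repo = "local")
         if (is.null(nxt)) stop("all data sources failed: ", conditionMessage(e))
         get.data.robustly(src = nxt)
      }
   )
}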
In addition to the access and retrieval of the data, the covid19.analytics package includes several functions to perform basic analysis and visualizations. Table 3 shows the main functions in the package; the excerpt below lists each function, its description and its main type of output.

Table 3 (excerpt): main functions of the covid19.analytics package.
Data Acquisition:
• covid19.data: obtain live* worldwide data for the CoViD19 virus, from the JHU's CSSE repository [33]; returns data frames/lists with the collected data.
• covid19.Toronto.data: obtain live* data for CoViD19 cases in the city of Toronto, ON Canada, from the City of Toronto reports [34]; returns a data frame/list with the collected data.
• covid19.US.data: obtain live* US-specific data for the CoViD19 virus, from the JHU's CSSE repository [33]; returns a data frame with the collected data.
Genomics:
• covid19.genomic.data, c19.refGenome.data, c19.fasta.data, c19.ptree.data, c19.NPs.data, c19.NP_fasta.data: obtain genomic data from the NCBI databases -- see Table 2.

In the reported data, geographical locations are mostly given by the Province/City and/or Country/Region. In order to facilitate the processing of locations that are geo-politically close, the covid19.analytics package provides a way to identify regions by indicating the corresponding continent where they are located, i.e. "South America", "North America", "Central America", "America", "Europe", "Asia" and "Oceania" can be used to process all the countries within each of these regions. The geographicalRegions function is in charge of determining which countries are part of which continent, and will display them when executing geographicalRegions(). In this way, it is possible to specify a particular continent and all the countries in it will be processed, without needing to explicitly specify all of them.
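For instance, a whole continent can be passed to the analysis functions via the geo.loc argument; a brief example based on the continent names listed above:

# display the countries included in each continent
geographicalRegions()

# process all the countries in a continent at once,
# without having to list them individually
TS.data <- covid19.data("ts-confirmed")
tots.per.location(TS.data, geo.loc = "South America")

# summary report restricted to the European countries
report.summary(geo.loc = "Europe")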
Reports. As the amount of data available for the recorded cases of CoViD19 can be overwhelming, and in order to get a quick insight into the main statistical indicators, the covid19.analytics package includes the report.summary function, which generates an overall report summarizing the main statistical estimators for the different datasets. It can summarize the "Time Series" data (when indicating cases.to.process="TS"), the "aggregated" data (cases.to.process="AGG") or both (cases.to.process="ALL"). By default it will display the top 10 entries in each category, or the number indicated in the Nentries argument; for displaying all the records just set Nentries=0. The function can also target specific geographical location(s) using the geo.loc argument. When a geographical location is indicated, the report will include an additional "Rel.Perc" column for the confirmed cases, indicating the relative percentage among the locations indicated; similarly, the totals displayed at the end of the report will be for the selected locations. In each case ("TS" or/and "AGG") the report will present tables ordered by the different types of cases included, i.e. confirmed infected, deaths, recovered and active cases. The date when the report is generated and the date of the recorded data are included at the beginning of each table. The report will also compute totals, averages (mean values), standard deviations and percentages of various quantities:
• it will determine the number of unique locations processed within the dataset;
• it will compute the total number of cases per case type;
• percentages, which are computed as follows:
  - for the "Confirmed" cases, as the ratio between the corresponding number of cases and the total number of cases, i.e. a sort of "global percentage" indicating the percentage of infected cases with respect to the rest of the world;
  - for the "Confirmed" cases, when geographical locations are specified, a "Relative percentage" is given as the ratio of the confirmed cases over the total of the selected locations;
  - for the other categories, "Deaths"/"Recovered"/"Active", the percentage of a given category is computed as the ratio between the number of cases in that category and the "Confirmed" number of cases, i.e. a relative percentage with respect to the number of confirmed infected cases in the given region;
• for "Time Series" data:
  - it will show the delta (change or variation) in the last day, as well as the daily changes the day before that (t − 2), three days ago (t − 3), a week ago (t − 7), two weeks ago (t − 14) and a month ago (t − 30);
  - when possible, it will also display the percentages of "Recovered" and "Deaths" with respect to the "Confirmed" number of cases;
  - the column "GlobalPerc" is computed as the ratio between the number of cases for a given country and the total number of cases reported;
  - the "Global Perc. Average (SD: standard deviation)" is computed as the average (standard deviation) of the number of cases among all the records in the data;
  - the "Global Perc. Average (SD: standard deviation) in top X" is computed as the average (standard deviation) of the number of cases among the top X records.
A typical output of the report.summary function for the "Time Series" data is shown in Lst. 4 in Sec. 3. In addition to this, the function also generates some graphical outputs, including pie and bar charts representing the top regions in each category; see Fig. 2.
Totals per Location & Growth Rate. It is possible to dive deeper into a particular location by using the tots.per.location and growth.rate functions. These functions are capable of processing different types of data, as long as these are "Time Series" data. They can either focus on one category (e.g. "TS-confirmed", "TS-recovered", "TS-deaths") or all of them ("TS-all"); when these functions detect different types of categories, each category is processed separately. Similarly, the functions can take multiple locations, i.e. just one, several, or even "all" the locations within the data. The locations can be countries, regions, provinces or cities. If a specified location includes multiple entries, e.g. a country with several cities reported, the functions will group them and process all these regions as the single location requested.
Totals per Location. The tots.per.location function will plot the number of cases as a function of time for the given locations and type of categories, in two plots: a log-scale scatter plot and a linear-scale bar plot. When the function is run with multiple locations, or all of them, the figures are adjusted to display multiple plots in one figure, in a mosaic-type layout. Additionally, the function will attempt to generate different fits to match the data (a minimal sketch of these fits is given below):
• an exponential model, using a linear regression method;
• a Poisson model, using a generalized linear regression method;
• a Gamma model, using a generalized linear regression method.
The function will plot the data, add the values of the coefficients of the models to the plots, and display a summary of the results in the console. It is also possible to instruct the function to draw a "confidence band" based on a moving average, so that the trend is displayed together with a region of higher confidence based on the mean value and standard deviation, computed over a time interval obtained by dividing the total time range into 10 equally spaced intervals. The function will return a list combining the results for the totals for the different locations as a function of time.
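As an illustration of the three fits listed above, the following minimal sketch reproduces them on synthetic data using base R; this is not the package's internal code, and the data and coefficient values are purely illustrative:

# illustrative vector of cumulative counts 'cases' indexed by day 't'
t     <- 1:60
cases <- round(5 * exp(0.12 * t))   # synthetic, roughly exponential growth

# exponential model via linear regression on the log-counts
exp.fit <- lm(log(cases) ~ t)
# Poisson model via a generalized linear regression
pois.fit <- glm(cases ~ t, family = poisson(link = "log"))
# Gamma model via a generalized linear regression
gamma.fit <- glm(cases ~ t, family = Gamma(link = "log"))

# coefficients reported by each model
summary(exp.fit)$coefficients
coef(pois.fit)
coef(gamma.fit)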
Growth Rate. The growth.rate function allows users to compute the daily changes and the growth rate, defined as the ratio of the daily changes between two consecutive dates; i.e. if C(t) is the cumulative number of cases on day t and δ(t) = C(t) − C(t − 1) the corresponding daily change, the growth rate on day t is δ(t)/δ(t − 1). The growth.rate function shares all the features of the tots.per.location function described above, i.e. it can process the different types of cases and multiple locations. The graphical output will display two plots per location:
• a scatter plot with the number of changes between consecutive dates as a function of time, combining a linear scale (left vertical axis) and a log scale (right vertical axis);
• a bar plot displaying the growth rate for the particular region as a function of time.
When the function is run with multiple locations, or all of them, the figures are adjusted to display multiple plots in one figure, in a mosaic-type layout. In addition, when there is more than one location, the function will also generate two different styles of heatmaps comparing the changes per day and the growth rate among the different locations (vertical axis) and time (horizontal axis). Furthermore, if the interactiveFig=TRUE argument is used, interactive heatmaps and 3D-surface representations are generated too. Some of the arguments of this function, as well as of many of the other functions that generate both static and interactive visualizations, can be used to indicate the type of output to be generated; Table 4 lists some of these arguments. In particular, the arguments controlling the interactive figures, interactiveFig and interactive.display, can be used in combination to compose an interactive figure to be captured and used in another application: when interactive.display is turned off but interactiveFig=TRUE, the function will return the interactive figure, so that it can be captured and used for later purposes. This is the technique employed for capturing the resulting plots in the covid19.analytics Dashboard Explorer, as presented in Sec. 4.2. Finally, when not returning an interactive figure, i.e. when interactiveFig is not specified or is set to FALSE (its default value), or when interactive.display=TRUE, the growth.rate function will return a list combining the results for the "changes per day" and the "growth rate" as a function of time.
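For instance, following the mechanism just described, an interactive figure can be captured without being displayed, e.g. to be embedded in another document or application (the htmlwidgets call is shown only as a commented-out suggestion):

# obtain time series data for confirmed cases
TS.data <- covid19.data("ts-confirmed")

# return the interactive figure instead of displaying it
ifig <- growth.rate(TS.data, geo.loc = "Canada",
                    interactiveFig = TRUE, interactive.display = FALSE)

# the captured (plotly) object can then be stored or embedded later, e.g.
# htmlwidgets::saveWidget(ifig, "growth-rate-Canada.html")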
Trends in Daily Changes. The covid19.analytics package provides three different functions to visualize the trends in the daily changes of reported cases from time series data:
• single.trend: allows users to inspect one single location; it can be used with the worldwide data sliced by the corresponding location, the Toronto data, or the user's own data formatted as "Time Series" data;
• mtrends: very similar to the single.trend function, but accepts multiple or single locations, generating one plot per requested location; it can also process multiple cases for a given location;
• itrends: generates an interactive plot of the trend in daily changes, representing the changes in the number of cases vs. the total number of cases, in log scale, using spline techniques to smooth the abrupt variations in the data.
The first two functions generate "static" plots composed of different insets:
• the main plot represents the daily changes as a function of time;
• the insets at the top, from left to right: the total number of cases (in linear and semi-log scales), the changes in the number of cases vs. the total number of cases, and the same quantity in log scale;
• the second row of insets represents the "growth rate" (as defined above) and the normalized growth rate, defined as the growth rate divided by the maximum growth rate reported for the location.
Plotting Totals. The totals.plt function will generate plots of the total number of cases as a function of time. It can be used for the total data or for one or multiple specific locations. The function can generate static and/or interactive plots, as well as linear and/or semi-log plots.
Plotting Cases in the World. The live.map function will display the different cases in each corresponding location, all around the world, on an interactive world map. It can be used with time series data or aggregated data; aggregated data offers much more detailed information about the geographical distribution.
The covid19.analytics package allows users to model the dispersion of the disease by implementing a simple Susceptible-Infected-Recovered (SIR) model [39, 40]. The model is implemented as a system of ordinary differential equations (ODE), as shown in Eq. (1):

dS/dt = −β S(t) I(t) / N
dI/dt = β S(t) I(t) / N − γ I(t)      (1)
dR/dt = γ I(t)

where S represents the number of individuals susceptible to infection, I the number of infected individuals, and R the number of recovered ones at a given moment in time. The coefficients β and γ are the parameters controlling the transition rates from S to I and from I to R, respectively; N is the total number of individuals, i.e. N = S(t) + I(t) + R(t), which should remain constant, i.e. dN/dt = 0. Eq. (1) can also be written in terms of the normalized quantities s = S/N, i = I/N, r = R/N, which satisfy s + i + r = 1. Although the ODE SIR model is non-linear, analytical solutions have been found [41]. However, the approach we follow in the package implementation is to solve the ODE system of Eq. (1) numerically. The generate.SIR.model function implements the SIR model of Eq. (1) using the actual data from the reported cases. The function will try to identify the data points where the onset of the epidemic began, and use the subsequent data points to generate proper guesses for the two parameters describing the SIR ODE system, i.e. β and γ. It does this by minimizing the residual sum of squares (RSS), assuming one single explanatory variable, i.e. the sum of the squared differences between the number of infected cases I(t) and the quantity predicted by the model, Ĩ(t):

RSS(β, γ) = Σ_t [ I(t) − Ĩ(t) ]²

The ODE given by Eq. (1) is solved numerically using the ode function from the deSolve package, and the minimization is tackled using the optim function from base R. After the solution of Eq. (1) is found, the function will provide details about the solution, as well as plot the quantities S(t), I(t), R(t) in static and interactive plots. The generate.SIR.model function also estimates the value of the basic reproduction number (or basic reproduction ratio), R0, defined as

R0 = β / γ

which can be considered a measure of the average expected number of new infections from a single infection, in a population where all subjects can be susceptible to infection.
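As an illustration of the numerical approach just described, the following is a minimal sketch of solving Eq. (1) with deSolve's ode function; this is not the package's internal code, the parameter values and population size are purely illustrative, and the optimization step via optim is omitted:

library(deSolve)

# right-hand side of the SIR system of Eq. (1)
SIR.odes <- function(time, state, parameters) {
   with(as.list(c(state, parameters)), {
      dS <- -beta * S * I / N
      dI <-  beta * S * I / N - gamma * I
      dR <-  gamma * I
      list(c(dS, dI, dR))
   })
}

N     <- 1e6                                  # total population (illustrative)
init  <- c(S = N - 1, I = 1, R = 0)           # initial conditions: one infected individual
pars  <- c(beta = 0.5, gamma = 0.1, N = N)    # transition rates (illustrative guesses)
times <- seq(0, 120, by = 1)                  # time grid, in days

sol <- ode(y = init, times = times, func = SIR.odes, parms = pars)
head(sol)

# basic reproduction number for these parameters: R0 = beta/gamma
pars[["beta"]] / pars[["gamma"]]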
The function also computes, and plots on demand, the force of infection, defined as F_infection = β I(t), which measures the transition rate from the compartment of susceptible individuals to the compartment of infectious ones. For exploring the parameter space of the SIR model, it is possible to produce a series of models by varying the conditions, i.e. the range of dates considered for optimizing the parameters of the SIR equation, which will effectively "sweep" a range of values for the parameters β, γ and R0. This is implemented in the function sweep.SIR.models, which takes a range of dates to be used as starting points for the number of cases fed into generate.SIR.model, producing as many models as date ranges are indicated. One could even use this in combination with other resampling or Monte Carlo techniques to estimate the statistical variability of the parameters of the model.

In this section we present some basic examples of how to use the main functions from the covid19.analytics package. We will begin by installing the covid19.analytics package. This can be achieved in two alternative ways:
1. installing the latest stable version of the package directly from the CRAN repository; this can be done within an R session using the install.packages function, i.e.

> install.packages("covid19.analytics")

2. installing the development version from the package's GitHub repository, https://github.com/mponce0/covid19.analytics, using the devtools package [42] and its install_github function, i.e.

# begin by installing devtools if not installed in your system
> install.packages("devtools")
# install the covid19.analytics package from the GitHub repo
> devtools::install_github("mponce0/covid19.analytics")

After having installed the covid19.analytics package, its functions can be accessed by loading the package using R's library function, i.e.

> library(covid19.analytics)

The covid19.analytics package uses a few additional packages, which are installed automatically if they are not present in the system. In particular, readxl is used to access the data from the City of Toronto [34]; ape is used for pulling the genomics data from NCBI; plotly and htmlwidgets are used to render the interactive plots and save them in HTML documents; deSolve is used to solve the differential equations modelling the spread of the virus; and gplots and pheatmap are used to generate heatmaps. Lst. 1 shows how to use the covid19.data function to obtain data in different cases.

# obtain all the records combined for "confirmed", "deaths" and "recovered" cases
# for the global (worldwide) *aggregated* data
covid19.data.ALLcases <- covid19.data()

# obtain time series data for global "confirmed" cases
covid19.confirmed.cases <- covid19.data("ts-confirmed")

# reads all possible datasets, returning a list
covid19.all.datasets <- covid19.data("ALL")

# reads the latest aggregated data of the global cases
covid19.ALL.agg.cases <- covid19.data("aggregated")

# reads time series data for global casualties
covid19.TS.deaths <- covid19.data("ts-deaths")

# read "Time Series" data for the city of Toronto
Toronto.TS.data <- covid19.data("ts-Toronto")
# this can also be done using the covid19.Toronto.data() fn
Tor.TS.data <- covid19.Toronto.data()
# or get the original data as reported by the City of Toronto
Tor.DF.data <- covid19.Toronto.data(data.fmt="ORIG")

# retrieve US time series data of confirmed cases
US.confirmed.cases <- covid19.data("ts-confirmed-US")
# retrieve US time series data of death cases
US.deaths.cases <- covid19.data("ts-deaths-US")
# or both cases combined
US.cases <- covid19.US.data()

Listing 1: Reading data from reported cases of CoViD19 using the covid19.analytics package.

In general, the reading functions will return data frames. Exceptions to this are when the functions need to return a more complex output, e.g. when combining "ALL" types of data or when requesting the original data from the City of Toronto (see details in Table 3). In these cases, the returned object will be a list containing, in each element, data frames corresponding to the particular type of data. In either case, the structure and overall content can be quickly assessed by using R's str or summary functions. One useful piece of information to look at after loading the datasets is which locations/regions have reported cases. There are at least two main fields that can be used for that: the columns containing the keywords 'country' or 'region', and 'province' or 'state'. Lst. 2 shows examples of how to achieve this using partial matches for column names, e.g. "Country" and "Province".

# read a data set
data <- covid19.data("TS-confirmed")
# look at the structure and column names
str(data)
names(data)

# find 'Country' column
country.col <- pmatch("Country", names(data))
# slice the countries
countries <- data[, country.col]
# list of countries
print(unique(countries))
# sorted table of countries, may include multiple entries
print(sort(table(countries)))

# find 'Province' column
prov.col <- pmatch("Province", names(data))
# slice the provinces
provinces <- data[, prov.col]
# list of provinces
print(unique(provinces))
# sorted table of provinces, may include multiple entries
print(sort(table(provinces)))

Listing 2: Identifying geographical locations in the data sets.

An overall view of the current situation at a global or local level can be obtained using the report.summary function. Lst. 3 shows a few examples of how this function can be used.

# a quick function to overview top cases per region for time series and aggregated records
report.summary()

# save the tables into a text file named 'covid19-SummaryReport_CURRENTDATE.txt'
# where *CURRENTDATE* is the actual date
report.summary(saveReport=TRUE)

# summary report for a specific location with default number of entries
report.summary(geo.loc="Canada")

# summary report for a specific location with top 5
report.summary(Nentries=5, geo.loc="Canada")

# it can combine several locations
report.summary(Nentries=30, geo.loc=c("Canada","US","Italy","Uruguay","Argentina"))

A typical output of the report generation tool is presented in Lst. 4 (typical output of the report.summary function). This particular example was generated using report.summary(Nentries=5, graphical.output=TRUE, saveReport=TRUE), which indicates to consider just the top 5 entries, generate a graphical output as shown in Fig.
2, and save a text file including the report, which is the one shown here.

##### TS-CONFIRMED Cases -- Data dated: 2020-06-25 :: 2020-06-26 13:10:01
##### TS-DEATHS Cases -- Data dated: 2020-06-25 :: 2020-06-26 13:10:02
##### TS-RECOVERED Cases -- Data dated: 2020-06-25 :: 2020-06-26 13:10:02
##### AGGREGATED Data -- ORDERED BY CONFIRMED Cases -- Data dated: 2020-06-26 :: 2020-06-26 13:10:03
##### AGGREGATED Data -- ORDERED BY DEATHS Cases -- Data dated: 2020-06-26 :: 2020-06-26 13:10:03
##### AGGREGATED Data -- ORDERED BY RECOVERED Cases -- Data dated: 2020-06-26 :: 2020-06-26 13:10:03
##### AGGREGATED Data -- ORDERED BY ACTIVE Cases -- Data dated: 2020-06-26 :: 2020-06-26 13:10:03

* Statistical estimators computed considering 3781 independent reported entries *

***** OVERALL SUMMARY *****
* Statistical estimators computed considering 266/266/253 independent reported entries per case-type *

[only the headers of the report are reproduced here; the corresponding tables are omitted]

A daily generated report is also available from the covid19.analytics documentation site, https://mponce0.github.io/covid19.analytics/. The covid19.analytics package allows users to investigate total cumulative quantities per geographical location with the tots.per.location function. Examples of this are shown in Lst. 5.

# totals for confirmed cases for "Ontario"
tots.per.location(covid19.confirmed.cases, geo.loc="Ontario")
# totals for confirmed cases for "Canada"
tots.per.location(covid19.confirmed.cases, geo.loc="Canada")
# total nbr of confirmed cases in Hubei including a confidence band based on moving average
tots.per.location(covid19.confirmed.cases, geo.loc="Hubei", confBnd=TRUE)
# total nbr of deaths for "Mainland China"
tots.per.location(covid19.TS.deaths, geo.loc="China")

###
# read the time series data for all the cases
all.data <- covid19.data('ts-ALL')
# run on all the cases
tots.per.location(all.data, "Japan")

###
# totals for death cases for "ALL" the regions
tots.per.location(covid19.TS.deaths)
# or just
tots.per.location(covid19.data("ts-confirmed"))

Listing 5: Calculation of totals per Country/Region/Province.

In addition to the graphical output shown in Fig. 3, the function will provide details of the models fitted to the data. Similarly, utilizing the growth.rate function it is possible to compute the actual growth rate and daily changes for specific locations, as defined in Sec. 2.2. Lst. 6 includes examples of these.

# read time series data for confirmed cases
TS.data <- covid19.data("ts-confirmed")

# compute changes and growth rates per location for all the countries
growth.rate(TS.data)

# compute changes and growth rates per location for 'Italy'
growth.rate(TS.data, geo.loc="Italy")
# compute changes and growth rates per location for 'Italy' and 'Germany'
growth.rate(TS.data, geo.loc=c("Italy","Germany"))

#####
# Combining multiple geographical locations:
# obtain Time Series data
TSconfirmed <- covid19.data("ts-confirmed")

# explore different combinations of regions/cities/countries
# when combining different locations, heatmaps will also be generated comparing the trends among these locations
growth.rate(TSconfirmed, geo.loc=c("Italy","Canada","Ontario","Quebec","Uruguay"))
growth.rate(TSconfirmed, geo.loc=c("Hubei","Italy","Spain","United States","Canada","Ontario","Quebec","Uruguay"))
growth.rate(TSconfirmed, geo.loc=c("Hubei","Italy","Spain","US","Canada","Ontario","Quebec","Uruguay"))

# turn off static plots and activate interactive figures
growth.rate(TSconfirmed, geo.loc=c("Brazil","Canada","Ontario","US"), staticPlt=FALSE, interactiveFig=TRUE)

# static and interactive figures
growth.rate(TSconfirmed, geo.loc=c("Brazil","Italy","India","US"), staticPlt=TRUE, interactiveFig=TRUE)

Listing 6: Calculation of growth rates and daily changes per Country/Region/Province.

In addition to the cumulative indicators described above, it is possible to estimate the global trends per location employing the functions single.trend, mtrends and itrends. The first two functions generate static plots of different quantities that can be used as indicators, while the third generates an interactive representation of a normalized, a-dimensional trend. Lst. 7 shows examples of the use of these functions; Fig. 6 displays the graphical output they produce.

# single location trend, in this case using data from the City of Toronto
tor.data <- covid19.Toronto.data()
single.trend(tor.data[tor.data$status == "Active Cases",])

# or data from the province of Ontario
ts.data <- covid19.data("ts-confirmed")
ont.data <- ts.data[ts.data$Province.State == "Ontario",]
single.trend(ont.data)

# or from Italy
single.trend(ts.data[ts.data$Country.Region == "Italy",])

# multiple locations
ts.data <- covid19.data("ts-confirmed")
mtrends(ts.data, geo.loc=c("Canada","Ontario","Uruguay","Italy"))

# multiple cases
single.trend(tor.data)

# interactive plot of trends
# for all locations and all types of cases
itrends(covid19.data("ts-ALL"), geo.loc="ALL")

# or just for confirmed cases and some specific locations, saving the result
# in an HTML file named "itrends_ex.html"
itrends(covid19.data("ts-confirmed"),
        geo.loc=c("Uruguay","Argentina","Ontario","US","Italy","Hubei"),
        fileName="itrends_ex")

Listing 7: Calculation of trends for different cases, utilizing the single.trend, mtrends and itrends functions. The typical representations can be seen in Fig. 6.

Most of the analysis functions in the covid19.analytics package already have plotting and visualization capabilities. In addition to the previously described ones, the package also has specialized visualization functions, as shown in Lst. 8. Many of them will generate static and interactive figures; see Table 3 for details of the type of output. In particular, the live.map function is a utility function which allows users to plot the locations of the recorded cases around the world.
This function allows for several customizable features, such as the type of projection used in the map, selecting different projection operators from a pull-down menu, displaying or hiding the legend of the regions, and specifying rescaling factors for the sizes representing the numbers of cases, among others. The function will generate a live representation of the cases utilizing the plotly package, and ultimately open the map in a browser, where the user can explore the map, drag the representation, zoom in/out, turn legends on/off, etc.

# retrieve time series data
TS.data <- covid19.data("ts-ALL")

# static and interactive plot
totals.plt(TS.data)

# totals for Ontario and Canada, without displaying totals and one plot per page
totals.plt(TS.data, c("Canada","Ontario"), with.totals=FALSE, one.plt.per.page=TRUE)

# totals for Ontario, Canada, Italy and Uruguay; including global totals with the
# linear and semi-log plots arranged one next to the other
totals.plt(TS.data, c("Canada","Ontario","Italy","Uruguay"), with.totals=TRUE, one.plt.per.page=FALSE)

# totals for all the locations reported in the dataset; the interactive plot will
# be saved as "totals-all.html"
totals.plt(TS.data, "ALL", fileName="totals-all")

# retrieve aggregated data
data <- covid19.data("aggregated")

# interactive map of aggregated cases -- with more spatial resolution
live.map(data)
# or
live.map()

# interactive map of the time series data of the confirmed cases with less spatial
# resolution, i.e. aggregated by country
live.map(covid19.data("ts-confirmed"))

Listing 8: Examples of some of the interactive and visualization capabilities of the plotting functions. The typical representations can be seen in Fig. 7.

Last but not least, one of the novel features added by the covid19.analytics package is the ability to model the spread of the virus by incorporating real data. As described in Sec. 2.2, the generate.SIR.model function implements a simple SIR model employing the data reported from a specified dataset and a particular location. Examples of this are shown in Lst. 9. The generate.SIR.model function is complemented by the plt.SIR.model function, which can be used to generate static or interactive figures, as shown in Fig. 8. As described in Sec. 2, the generate.SIR.model function will attempt to obtain proper values for the parameters β and γ by inferring the onset of the epidemic using the actual data. This is also listed in the output of the function (see Lst. 10), and it can be controlled by setting the parameters t0 and t1, or deltaT, which are used to specify the range of dates to be considered when determining the values of β and γ. The (constant) fatality rate can also be indicated via the fatality.rate argument, as well as the total population of the region with tot.population.

# read time series data for confirmed cases
data <- covid19.data("ts-confirmed")

# run a SIR model for a given geographical location
generate.SIR.model(data, "Hubei", t0=1, t1=15)
generate.SIR.model(data, "Germany", tot.population=83149300)
generate.SIR.model(data, "Uruguay", tot.population=3500000)
generate.SIR.model(data, "Ontario", tot.population=14570000, add.extras=TRUE)

# the function will aggregate data for a geographical location, like a country with multiple entries
generate.SIR.model(data, "Canada", tot.population=37590000, add.extras=TRUE)
Fig. 8 also raises an interesting point regarding the accuracy of the SIR model. We should recall that this is the simplest approach one could take to model the spread of diseases; usually, more refined and complex models are used to incorporate several factors, such as vaccination, quarantines, effects of social clusters, etc. However, in some cases, especially when the spread of the disease appears to have entered the so-called exponential growth phase, this simple SIR model can capture the main trend of the dispersion (e.g. left plot of Fig. 8). In other cases, when the rate of spread is slower than free exponential dispersion, the model clearly fails to track the actual evolution of cases (e.g. right plot of Fig. 8). Finally, Lst. 11 shows an example of the generation of a sequence of values for R0, and in fact of any of the parameters (β, γ) describing the SIR model. In this case, the function takes a range of values for the initial date t0 and generates different date intervals; this allows the function to generate multiple SIR models and return the corresponding parameters for each model. The results are then bundled in a "matrix"/"array" object, which can be accessed by column for each model, or by row for each parameter set.

# read TimeSeries data
TS.data <- covid19.data("TS-confirmed")

# select a location of interest, e.g. France
# France has many entries, just pick "France"
FR.data <- TS.data[(TS.data$Country.Region == "France") & (TS.data$Province.State == ""),]

# sweep values of R0 based on the range of dates to consider for the model
ranges <- 15:25
deltaT <- 35
params_sweep <- sweep.SIR.models(data=FR.data, geo.loc="France", t0_range=ranges, deltaT=deltaT)

# the parameters -- beta, gamma, R0 -- are returned in a "matrix"/"array" object
print(params_sweep)

As mentioned before, the functions from the covid19.analytics package also allow users to work with their own data, when the data is formatted in the Time Series structure discussed in Sec. 2.1.2. This opens a large range of possibilities for users to import their own data into R and use the functions already defined in the covid19.analytics package. A concrete example of how the data has to be formatted is shown in Lst. 12. The example shows how to structure, in TS format, "synthetic" data generated by randomly sampling different distributions. However, this could instead be actual data from other places or locations not accessible from the datasets provided by the package, or researchers' own private datasets. The example also shows two cases: whether the data includes the "status" column or not, and whether there is more than one location. As a matter of fact, we left the "Long" and "Lat" fields empty, but if one includes the actual coordinates, the mapping function live.map can also be used with such structured data.

# TS data structure:
# "Province.State" "Country.Region" "Lat" "Long" dates...

# First let's create a 'fake' location
fake.locn <- c(NA, NA, NA, NA)
# names for these columns
names(fake.locn) <- c("Province.State","Country.Region","Lat","Long")

# let's set the dates
dates.vec <- seq(as.Date("2020/1/1"), as.Date("2020/4/09"), "days")

# data.vecX would be the actual values/cases
data.vec1 <- rpois(length(dates.vec), lambda=25)
# can also add more cases
data.vec2 <- abs(rnorm(length(dates.vec), mean=135, sd=15))
data.vec3 <- abs(rnorm(length(dates.vec), mean=35, sd=5))

# this will name the columns as your dates
names(data.vec1) <- dates.vec
names(data.vec2) <- dates.vec
names(data.vec3) <- dates.vec

# merge them into a data frame with multiple entries
synthetic.data <- as.data.frame(rbind(
                      rbind(c(fake.locn, data.vec1)),
                      rbind(c(fake.locn, data.vec2)),
                      rbind(c(fake.locn, data.vec3))
                  ))

# finally set your locn to something unique, so you can use it in the generate.SIR.model fn
synthetic.data$Country.Region <- "myLocn"

# one could even add "status"
synthetic.data$status <- c("confirmed","death","recovered")

# OR just one case per locn
synthetic.data2 <- synthetic.data[, -ncol(synthetic.data)]
synthetic.data2$Country.Region <- c("myLocn","myLocn2","myLocn3")

# now we can use this 'synthetic' dataset with any of the TS functions
# data checks
integrity.check(synthetic.data)
consistency.check(synthetic.data)
data.checks(synthetic.data)

# quantitative indicators
tots.per.location(synthetic.data)
growth.rate(synthetic.data)
single.trend(synthetic.data2[3,])
mtrends(synthetic.data)

# SIR models
synthSIR <- generate.SIR.model(synthetic.data2, geo.loc="myLocn")
plt.SIR.model(synthSIR, interactiveFig=TRUE)
sweep.SIR.models(synthetic.data2, geo.loc="MyLocn")

Listing 12: Example of structuring data in a TS format, so that it can be used with any of the TS functions from the covid19.analytics package.

The covid19.analytics package provides access to genomics data available at the NCBI databases [35, 36]. The covid19.genomic.data function is the master function for accessing the different variations of the genomics information available, as shown in Lst. 13:

gtypes <- c("genome", "fasta", "tree",
            "nucleotide", "protein",
            "nucleotide-fasta", "protein-fasta",
            "genomic")

Each of these functions returns a different type of object; Lst. 14 shows an example of the structures of some of these objects. The most involved object is obtained from covid19.genomic.data when combining different types of datasets; an abridged version of its structure looks as follows:

> str(results)
List of 7
 $ refGenome : List of 3
  ..$ livedata: chr [1:29903] "a" "t" "t" "a" ...
  ..$ repo    : chr [1:29903] "a" "t" "t" "a" ...
  ..$ local   : chr [1:29903] "a" "t" "t" "a" ...
 $ ...       : List of ...
  ..$ ...     : chr [1:29903] "a" "t" "t" "a" ...
  .. ..- attr(*, "species")= chr "Severe_acute_respiratory_syndrome_coronavirus_2"
  ..$ local   : List of 1
  .. ..$ NC_045512.2: chr [1:29903] "a" "t" "t" "a" ...
  .. .. ..- attr(*, "species")= chr "Severe_acute_respiratory_syndrome_coronavirus_2"
 $ ptns      : List of 2
  ..$ repo : chr [1:117619] "YP_009742608" "YP_009742609" "YP_009742610" "YP_009742611" ...
  ..$ local: chr [1:117619] "YP_009742608" "YP_009742609" "YP_009742610" "YP_009742611" ...
 $ nucleotides: 'data.frame' with the nucleotide records, including:
  ..$ Release_Date    : chr [1:11142] "2020-01-13T00:00:00Z" "2020-07-17T00:00:00Z" "2020-07-17T00:00:00Z" "2020-07-17T00:00:00Z" ...
Each of these functions returns a different object; Lst. 14 shows an example of the structure of some of these objects. The most involved object is obtained from covid19.genomic.data when combining different types of datasets.

# str(results)
List of 7
 $ refGenome  : List of 3
  ..$ livedata: chr [1:29903] "a" "t" "t" "a" ...
  ..$ repo    : chr [1:29903] "a" "t" "t" "a" ...
  ..$ local   : chr [1:29903] "a" "t" "t" "a" ...
 $ genome     : List of 2
  ..$ repo : chr [1:29903] "a" "t" "t" "a" ...
  .. ..- attr(*, "species")= chr "Severe_acute_respiratory_syndrome_coronavirus_2"
  ..$ local: List of 1
  .. ..$ NC_045512.2: chr [1:29903] "a" "t" "t" "a" ...
  .. .. ..- attr(*, "species")= chr "Severe_acute_respiratory_syndrome_coronavirus_2"
 $ ptns       : List of 2
  ..$ repo : chr [1:117619] "YP_009742608" "YP_009742609" "YP_009742610" "YP_009742611" ...
  ..$ local: chr [1:117619] "YP_009742608" "YP_009742609" "YP_009742610" "YP_009742611" ...
 $ nucleotides: 'data.frame': 11142 obs., among its variables:
  ..$ Release_Date    : chr [1:11142] "2020-01-13T00:00:00Z" "2020-07-17T00:00:00Z" "2020-07-17T00:00:00Z" ...
  ..$ Species         : chr [1:11142] "Severe acute respiratory syndrome-related coronavirus" ...
  ..$ Host            : chr [1:11142] "Homo sapiens" "Homo sapiens" ...
  ..$ Isolation_Source: chr [1:11142] "" "" ...
  ..$ Collection_Date : chr [1:11142] "2019-12" "2020-06-14" "2020-06-13" "2020-06-13" ...
  ..$ BioSample       : chr [1:11142] "" "SAMN15488404" "SAMN15488426" "SAMN15488426" ...
  ..$ GenBank_Title   : chr [1:11142] "Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome" ...
 $ proteins   : 'data.frame': 117619 obs. of 20 variables, among them:
  ..$ Accession    : chr [1:117619] "YP_009742608" "YP_009742609" "YP_009742610" "YP_009742611" ...
  ..$ SRA_Accession: chr [1:117619] "" "" ...
  ..$ Release_Date : chr [1:117619] "2020-03-30T00:00:00Z" "2020-03-30T00:00:00Z" ...
  ..$ Species      : chr [1:117619] "Severe acute respiratory syndrome-related coronavirus" ...
  ..$ GenBank_Title: chr [1:117619] "leader protein [Severe acute respiratory syndrome coronavirus 2]" "nsp2 [Severe acute respiratory syndrome coronavirus 2]" "nsp3 [Severe acute respiratory syndrome coronavirus 2]" ...
 $ SRA        : List of 2
  ..$ sra_info: chr [1:6] "This download (via FTP) provides Coronaviridae family-containing SRA runs detected with NCBI's kmer analysis (STAT) tool." "It provides corresponding SRA run (SRR), sample (SRS), and submission (SRA) accessions, as well as BioSample an"| __truncated__ "The STAT kmer analysis was performed via a two-step process with a 32-mer coarse database and a 64-mer fine database." "The database is generated from RefSeq genomes and the viral genome set from nt using a minhash-based approach." ...
  ..$ sra_runs: 'data.frame': 26776 obs. of 5 variables:
  .. ..$ acc       : chr [1:26776] "ERR1857120" "ERR1857132" "ERR3955908" "ERR3955909" ...
  .. ..$ sample_acc: chr [1:26776] "ERS1572626" "ERS1572627" "ERS2783634" "ERS2783635" ...
  .. ..$ biosample : chr [1:26776] "SAMEA100883668" "SAMEA100884418" "SAMEA4965217" "SAMEA4965218" ...
  .. ..$ sra_study : chr [1:26776] "ERP021740" "ERP021740" "ERP111280" "ERP111280" ...
  .. ..$ bioproject: chr [1:26776] "" "" ...
 $ references : List of 2
  ..$ : chr "covid19.analytics -- local data"
  ..$ : chr "/Users/marcelo/Library/R/4.0/library/covid19.analytics/extdata/"

Listing 14: Objects composition for the example presented in Lst. 13.

One aspect that should be mentioned with respect to the genomics data is that, in general, these are large datasets which are continuously being updated, hence increasing their sizes even more. This can ultimately pose practical challenges, such as long processing times or even starvation of memory resources. We will not dive into more involved examples, such as DNA sequencing analysis or the construction of phylogenetic trees; packages such as ape, adegenet, phylocanvas, and others can be used for these and other analyses. One simple example we can present is the creation of dynamical categorization trees based on different elements of the sequencing data. Consider, for instance, the nucleotide data as reported by NCBI. The example in Lst. 15 shows how to retrieve either nucleotide (or protein) data and generate categorization trees based on different elements, such as the hosting organism, geographical location, sequence length, etc. In these examples we employed the collapsibleTree package, which generates interactive trees browsable through web browsers.

# retrieve the nucleotides data
nucx <- covid19.genomic.data(type = 'nucleotide', src = 'repo')

# identify specific fields to look at
len.fld    <- "Length"
acc.fld    <- "Accession"
geoLoc.fld <- "Geo_Location"
seq.fld    <- "Sequence_Type"
host.fld   <- "Host"

seq.limit1 <- 29600
seq.limit2 <- 31000

# selection criteria: nucleotides with seq. length between 29600 and 31000
selec.ctr.1 <- nucx$Length < seq.limit2 & nucx$Length > seq.limit1

# remove nucleotides without specifying a "host"

Listing 15: Example of how to generate a dynamic browsable tree using some of the information included in the nucleotides dataset.
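As extracted here, Lst. 15 stops right after defining the selection criteria, before the tree itself is built. The following is a minimal sketch of how the remaining steps could look with the collapsibleTree package; the host filter and the chosen hierarchy are assumptions made for illustration, not the manuscript's exact code.

# hedged completion of the selection started in Lst. 15
selec.ctr.2 <- nucx[, host.fld] != ""            # keep entries with a declared host
nucx.sel <- nucx[selec.ctr.1 & selec.ctr.2, ]

# interactive browsable tree: grouped by host organism, then geographic location
library(collapsibleTree)
collapsibleTree(nucx.sel,
                hierarchy = c(host.fld, geoLoc.fld),
                root = "nucleotides")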
Some of these tree representations are shown in Fig. 9.

In this section we will present and discuss how the covid19.analytics Dashboard Explorer is implemented. The main goal is to provide enough details about how the dashboard is implemented and works, so that users can modify it as they see fit, or even develop their own. To do so, we will focus on three main points:

• the front-end implementation, also known as the user interface, mainly developed using the shiny package
• the back-end implementation, mostly using the covid19.analytics package
• the web server installation and configuration where the dashboard is hosted

The covid19.analytics Dashboard Explorer is built using the Shiny package [43] in combination with the covid19.analytics package. Shiny allows users to build interactive dashboards that work through a web interface. The dashboard mimics the covid19.analytics package commands and features but enhances them, as it allows users to use dropdowns and other control widgets to easily input the data rather than using a command terminal. In addition, the dashboard offers some unique features, such as a Personal Protective Equipment (PPE) model estimation, based on realistic projections developed by the US Centers for Disease Control and Prevention (CDC). The dashboard interface offers several features:

1. The dashboard can be run on the cloud/web, allowing multiple users to simultaneously analyze the data with no special software or hardware requirements. The Shiny package makes the dashboard mobile and tablet compatible as well.
2. It aids researchers to share and discuss analytical findings.
3. The dashboard can be run locally or through the web server.
4. No programming or software expertise is required, which reduces technical barriers to analyzing the data. Users can interact with and analyze the data without any software expertise, and can therefore focus on the modeling and analysis. In these times the dashboard can be a monumental tool, as it removes barriers and allows a wider and more diverse set of users to have quick access to the data.
5. Interactivity. One feature of Shiny and other graphing packages, such as Plotly, is interactivity, i.e. the ability to interact with the data. This allows one to display complex data in a concise manner and focus on specific points of interest. Interactive options such as zooming, panning and mouse hover all help in making the user interaction enjoyable and informative.
6. Fast and easy to compare. One advantage of a dashboard is that users can easily analyze and compare the data quickly and multiple times. For example, users can change the slider or dropdown to select multiple countries and see the total daily counts effortlessly. This allows the display to be updated as the users' analysis requirements change.

The dashboard can be launched locally on a machine with R, either through an interactive R session or in batch mode using Rscript or R CMD BATCH, or through the web server by accessing the following URL: https://covid19analytics.scinet.utoronto.ca. For running the dashboard locally, the covid19.analytics package also has to be installed. For running the dashboard within an R session, the package has to be loaded and then invoked using the following sequence of commands:

> library(covid19.analytics)
> covid19Explorer()

The batch mode can be executed using an R script containing the commands listed above. When the dashboard is run locally, the browser will open a port on the local machine (a localhost:port connection, i.e. http://127.0.0.1). It should be noted that if the dashboard is launched interactively within an R session the port used is 5481 (http://127.0.0.1:5481), while if it is launched through an R script in batch mode the port used will be different.

To implement the dashboard and enhance some of the basic functionalities offered, the following libraries were specifically used in its implementation:

• shiny [43]: The main package that builds the dashboard.
• shinydashboard [44]: A package that assists in building the dashboard with respect to themes, layouts and structure.
• shinycssloaders [45]: A package that adds loader animations to shiny outputs, such as plots and tables, when they are loading or (re)calculating. In general, these are wrappers around base CSS-style loaders.
• plotly [46]: Charting library used to generate interactive charts and plots. Although extensively used in the core functions of covid19.analytics, we reiterate it here as it is a great tool for developing interactive plots.
• DT [47]: A DataTable library to generate interactive table output.
• dplyr [48]: A library that helps to apply functions and operations to data frames. This is important for calculations, specifically the PPE calculations.
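For completeness, the following is a minimal sketch of a one-time local setup of the libraries just enumerated; it only assumes that the packages are available on CRAN.

# hedged sketch: one-time installation and loading of the libraries listed above
pkgs <- c("shiny", "shinydashboard", "shinycssloaders",
          "plotly", "DT", "dplyr", "covid19.analytics")
new.pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(new.pkgs) > 0) install.packages(new.pkgs)

# load them all
invisible(lapply(pkgs, library, character.only = TRUE))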
The R Shiny package makes developing dashboards easy and seamless and removes many common hurdles. For example, setting the layout of a dashboard is typically challenging, as it requires knowledge of front-end technologies such as HTML, CSS3 and Bootstrap4 to position elements and change their aesthetic properties. Shiny simplifies this problem by providing a built-in box controller widget, which allows developers to easily group elements, tables, charts and widgets together. Many of the CSS properties, such as widths or colors, are input parameters to the functions of interest. The sidebar feature is simple to implement, and the Shiny package makes it easy to remain compatible across multiple devices, such as tablets or cellphones. The Shiny package also has built-in layout constructs, such as fluidRow or column, making it easy to position elements on a page.

The library does have some challenges as well. One challenge faced is theme design: shinydashboard does not make it easy to change the whole color theme of the dashboard beyond the white or blue themes provided by default. The issue is resolved by having the developer write custom CSS and change each of the various properties manually.

The dashboard contains two main components: a sidebar and a main body. The sidebar contains a list of all the menu options. Options which are similar in nature are grouped in a nested format. For example, the dashboard menu section called "Datasets and Reports", when selected, displays a nested list of further options the user can choose from, such as the World Data or the Toronto Data. Grouping similar menu options together is important to help the user understand the data. The main body displays the content of a page; which content it displays depends on the sidebar and the menu option the user selects.

There are three main generic elements needed to develop a dashboard: layouts, control widgets and output widgets. The layout options are components needed to lay out the features or components on a page. In this dashboard the layout widgets used are the following:

• Box: Boxes are the main building blocks of a dashboard, and allow us to group content together.
• TabPanels: TabPanels allow us to create tabs that divide one page into several sections. This allows for multiple charts, or multiple types of data, to be displayed on a single page. For example, the Indicators page has four tabs which display four different charts, with the mosaic tab displaying the charts in different configurations.
• Header and Title: These are used to display text and page titles in the appropriate sizes and fonts.

An example describing these elements and their implementation is shown in Lst. 16.

...
    h4('World data of all covid cases across the globe'),
    column(4, selectInput(ns("category_list3"), label = h4("Category"),
                          choices = category_list)),
    column(4, downloadButton(ns('downloadData'), "Download")),
    withSpinner(DT::dataTableOutput(ns("table_contents")))
  )
}

Listing 16: Snippet of the code describing the various features used in generating a dashboard. The ns(id) is a namespaced id for inputs/outputs; withSpinner, from shinycssloaders, generates a loading animation while the chart is being loaded.

Shiny modules are used when a shiny application grows larger and more complicated; they can also be used to fix the namespacing problem. Moreover, shiny modules allow for code reusability and code modularity, as the code can be broken into several pieces called modules. Each module can then be called in different applications, or even multiple times in the same application. In this dashboard we break the code into two main groups: user interface (UI) modules and server modules.
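To illustrate the module pattern just described, here is a minimal, self-contained sketch of a paired UI/server module; all names in it (covidTableUI, covidTableServer, the input/output ids) are hypothetical and chosen for illustration, not taken from the dashboard's actual code.

# minimal sketch of the UI/server module pattern (hypothetical names)
library(shiny)

covidTableUI <- function(id) {
  ns <- NS(id)                      # namespace all input/output ids
  tagList(
    numericInput(ns("n"), "Number of entries", value = 10, min = 1),
    tableOutput(ns("tbl"))
  )
}

covidTableServer <- function(id, dat) {
  moduleServer(id, function(input, output, session) {
    # render the first n rows of the supplied data frame
    output$tbl <- renderTable({ head(dat, input$n) })
  })
}

# usage: each menu option gets its own UI/server pair
ui <- fluidPage(covidTableUI("world"))
server <- function(input, output, session) {
  covidTableServer("world",
                   dat = data.frame(country = c("A", "B"), cases = c(1, 2)))
}
# shinyApp(ui, server)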
Each menu option has its own dedicated set of UI and associated server modules. This makes the code easy to build and expand: for each new menu option, a new pair of UI and server module functions is built. Lst. 16 is also an example of a UI module, where it specifies the design and look of an element and connects it with the active parts of the application. Lst. 20 shows an example of a server function called reportServer. This type of module can update and display charts, tables and value boxes based on the user selections. The same scenario occurs for all menu options within the UI/server paradigm. Another way to think about the UI/server separation is that the UI modules are in charge of laying out the look of a particular element in the dashboard, while the server is in charge of dynamically 'filling in' the dynamic elements and the data that populate them.

Control widgets, also called input widgets, are widgets users employ to input data, information or settings in order to update charts, tables and other output widgets. The following control widgets were used in this dashboard:

• NumericalInput: A textbox that only allows numerical input, used to select a single numerical value.
• SelectInput: A dropdown, which may be multi-select, allowing users to choose multiple options, as in the case of the country dropdown.
• Slider: The slider in our dashboard is purely numerical and is used to select a single numerical value from a given min/max range.
• Download Button: A button which allows users to download and save data in various formats, such as csv.
• Radiobuttons: Used to select only one from a limited number of choices.
• Checkbox: Similar in purpose to radio buttons, allowing users to select one option from a limited number of options.

Figure 10: Screenshot from the "covid19.analytics Dashboard Explorer", "Mosaic" tab from the 'Indicators' category. Four interactive figures are shown in this case: the trends (generated using the itrends function), the totals (generated using the totals.plt function) and two world map representations of CoViD19 reported cases (generated using the live.map function). The two upper plots are adjusted and re-rendered according to the selection of the country and data category from the input boxes.

Output control widgets are widgets used to display content/information back to the user. There are three main output widgets used in this dashboard:

• PlotlyOutput: This widget outputs and creates Plotly charts. Plotly is a graphical package library used to generate interactive charts.
• RenderTable: An output that generates an interactive table with search, filter and sort capabilities provided out of the box.
• ValueBox: A fancy textbox with border colors and descriptive font text, used to display descriptive values to users, such as the total number of deaths.
The dashboard contains the menus and elements shown in Lst. 17 and described below:

• Indicators: This menu section displays different CoViD19 indicators to analyze the pandemic. There are four notable indicators, Itrend, Total Plot, Growth Rate and Live Map, which are displayed in the various tabs. Itrend displays the "trend" in a log-log plot; Total Plot shows a line graph of total numbers; Growth Rate displays the daily number of changes and the growth rate (as defined in Sec. 2.2); Live Map shows a world map of infections in an aggregated or time series format. These indicators are shown together in the "Mosaic" tab.
• Models: This menu option contains a sub-menu presenting models related to the pandemic. The first model is the SIR (Susceptible-Infected-Recovered) model, which is implemented in the covid19.analytics package. SIR is a compartmental model used to describe how a disease infects a population. The other two models are used to estimate the amount of PPE needed due to infectious diseases, such as Ebola and CoViD19.
• Datasets and Reports: This section provides reporting capabilities, outputting reports as csv and text files. The World Data subsection displays all the world data as a table which can be filtered, sorted and searched; the data can also be saved as a csv file. The Toronto Data subsection displays the Toronto data in tabular format, while also displaying the current pandemic numbers. The Data Integrity section checks the integrity and consistency of the dataset, e.g. when the raw data contains negative numbers or when cumulative quantities decrease. The Report section is used to generate a report as a text file.
• References: The references section displays information on the GitHub repo and documentation, along with an External Dashboards section containing hyperlinks to other dashboards of interest. Dashboards of interest are the Vaccine Tracker, which tracks the progress of vaccines being tested for CoViD19; the Johns Hopkins University dashboard; and the Canada dashboard built by the Dalla Lana School of Public Health at the University of Toronto.
• About Us: Contact information and information about the developers.

In addition to implementing some of the functionalities provided by the covid19.analytics package, the dashboard also includes a PPE calculator. The hospital PPE calculator is a quantitative model designed to estimate the amount of PPE needed for a single CoViD19 patient over a hospitalization duration. The PPE calculation implemented in the covid19.analytics Dashboard Explorer is derived from the CDC's studies for infectious diseases, such as Ebola and CoViD19. The rationale is that Ebola and CoViD19 are both contagious infections, and PPE is used to protect staff and patients and prevent transmission of both of these diseases. The hospital PPE calculation estimates and models the amount of PPE a hospital will need during the CoViD19 pandemic. There are two analysis methods a user can choose from to determine hospital PPE requirements. The first method is to determine the amount of PPE needed for a single hospitalized CoViD19 patient. This first model requires two major components: the size of the healthcare team needed to take care of a single CoViD19 patient, and the amount of PPE used by hospital staff per shift over the hospitalization duration. The model is based on the CDC Ebola crisis calculation [49]. Although Ebola is a different disease compared to CoViD19, there is one major similarity: both require PPE for the protection of healthcare staff and the infected patient. To compensate for the differences, the user can change the amount of PPE a healthcare staff member uses per shift; this can be adjusted by changing the slider values in the Advanced Settings tab. The calculation itself is straightforward: it takes the amount of PPE used per shift, multiplies it by the number of healthcare staff, and then by the hospitalization duration.
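Stated as code, this first estimate reduces to a single product. The function below is a hedged sketch of that calculation, with function and argument names of our own choosing (not the dashboard's), followed by a hypothetical example.

# hedged sketch of the single-patient PPE estimate described above
ppe.single.patient <- function(ppe.per.staff.per.shift, n.staff, hospitalization.days) {
  # total PPE = (PPE per staff per shift) x (team size) x (hospitalization days)
  ppe.per.staff.per.shift * n.staff * hospitalization.days
}

# hypothetical example: 4 PPE sets per shift, a team of 5, a 14-day stay
ppe.single.patient(4, 5, 14)   # -> 280 sets of PPE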
The first model has two tabs. The first tab displays a stacked bar chart showing the amount of PPE used by each type of hospital staff over the total hospitalization of a single patient, breaking the PPE equipment down by stacks. The second tab, called Advanced Settings, has a series of sliders, one for each hospital staff type (for example, nurses), which users can use to change the amount of PPE that staff member uses per shift.

The second model is a more recent calculation developed by the CDC [50]. It calculates the burn rate of PPE equipment for hospitals over a one-week period, and is designed specifically for CoViD19. The CDC has created an Excel file for hospital staff to input their information, as well as an Android app which can also be utilized. This model, as implemented in our dashboard, is simplified to calculate the PPE for a one-week setting. The one-week limit was adopted for two reasons: first, to limit the amount of input data a user has to enter into the system, as too much data can overwhelm and confuse a user; and second, because the CoViD19 pandemic is a highly fluid situation, and hospital staff forecasts of PPE and resource equipment beyond a one-week period may not be accurate. Note that this model is not accurate if the facility receives a resupply of PPE; for resupplied PPE, a new calculation should be started.

There are four tab panels in the burn rate calculation, displaying charts and settings. The first tab, Daily Usage, displays a multi-line chart of the amount of PPE used daily, ∆PPE_daily. This is a simple subtraction between two consecutive days, i.e. the second day (j + 1) from the first day (j), as noted in Eq. (6). The tab panel called Remaining Supply shows, as a multi-line chart, the number of days the remaining PPE equipment will last in the facility. How long the PPE can last in a given facility depends inversely on the number of CoViD19 patients admitted to the hospital. To calculate the remaining PPE, one computes the average amount of PPE used over the one-week duration and then divides the amount of PPE at the beginning of the day by the average PPE usage, as shown in Eq. (7), where ⟨ ⟩_T denotes the time average over a period of time T. The third panel, called PPE per Patient, displays a multi-line chart of the burn rate, i.e. the amount of PPE used per patient per day; Eq. (8) expresses this calculation as the remaining PPE supply divided by the number of CoViD19 patients in the hospital during that exact day. The fourth tab, called Advanced Settings, is a series of show-and-hide "accordions" where users can input the amount of PPE equipment they have at the start of each day. There are six collapsible boxes, one for each PPE equipment type and one for the CoViD19 patient count; expanding a box displays seven numericalInput textboxes which allow users to input the number of PPE items or the patient count for each day.

The equations describing the PPE needs, Eqs. (6, 7, 8), are implemented in the shiny dashboard using the dplyr library, which allows one to work with dataframe-like objects in a quick and efficient manner. The three equations are implemented using a single dataframe: the Advanced Settings inputs of the Burn Rate Analysis tab are saved into a dataframe, and the PPE equations, Eqs. (6, 7, 8), are then evaluated over its columns.
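The following is a hedged sketch of how Eqs. (6)-(8) can be computed with dplyr over such a dataframe; the column names and the one-week numbers are hypothetical, and the dashboard's exact implementation may differ.

# hedged sketch: Eqs. (6)-(8) over a hypothetical one-week input dataframe
library(dplyr)

ppe <- data.frame(
  day      = 1:7,
  supply   = c(500, 440, 385, 330, 270, 215, 160),  # PPE on hand at start of day
  patients = c(10, 12, 15, 15, 18, 20, 22)          # CoViD19 patients per day
)

ppe <- ppe %>%
  mutate(daily.usage = supply - lead(supply)) %>%             # Eq. (6): PPE_j - PPE_(j+1)
  mutate(remaining.days = supply /
           mean(daily.usage, na.rm = TRUE)) %>%               # Eq. (7): supply over avg. usage
  mutate(burn.rate = supply / patients)                       # Eq. (8): remaining PPE per patient

print(ppe)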
The back-end implementation of the dashboard is achieved using the functions presented in Sec. 2 within the server module of the dashboard. The main strategy is to take a particular function, connect it with the input controls to feed the needed arguments into the function, and then capture the output of the function and render it accordingly.

Let's consider the example of the globe map representation shown in the dashboard, which is done using the live.map function. Lst. 19 shows how this function connects with the other elements in the dashboard: the input elements are accessed using input$..., which in this case are used to control the particular options for displaying the legends or projections based on checkboxes. The output returned from this function is captured through the renderPlotly({...}) function, which is aimed at taking plotly-type plots and integrating them into the dashboard.

# livemap plot charts on the possible combinations
output$ts_livemap <- output$ts2_livemap <- output$ts3_livemap <- output$ts4_livemap <-
    renderPlotly({
        legend <- input$sel_legend
        projections <- input$sel_projection
        live.map(covid19.data("ts-confirmed"),
                 interactive.display = FALSE,
                 no.legend = legend,
                 select.projctn = projections)
    })

Listing 19: Example of how the live.map function is used to render the interactive figures displayed on the dashboard.

Another example is the report generation capability, using the report.summary function, as shown in Lst. 20. As mentioned before, the input arguments of the function are obtained from the input controls. The output in this case is rendered using the renderText({...}) function, as the output of the original function is plain text. Notice also that there are two invocations of report.summary: one for rendering the report on screen, and a second one for making the report available for download, which is handled by the downloadHandler function.

reportServer <- function(input, output, session, result) {
    output$report_output_default <- renderText({
        # extract the variables from the inputs
        nentries <- input$txtbox_Nentries
        geo_loc  <- input$geo_loc_select6
        ts       <- input$ddl_TS
        capture.output(report.summary(graphical.output = FALSE,
                                      Nentries = nentries,
                                      geo.loc = geo_loc,
                                      cases.to.process = ts))
    }, sep = '\n')

    report <- reactive({
        nentries <- input$txtbox_Nentries
        geo_loc  <- input$geo_loc_select6
        ts       <- input$ddl_TS
        report <- capture.output(report.summary(graphical.output = FALSE,
                                                Nentries = nentries,
                                                geo.loc = geo_loc,
                                                cases.to.process = ts))
        return(report)
    })

    output$downloadReport <- downloadHandler(
        filename = function() {
            paste("Report-", Sys.Date(), ".txt", sep = "")
        },
        content = function(file) {
            writeLines(paste(report()), file)
        }
    )
}

Listing 20: Report capabilities implemented in the dashboard using the report.summary function.

The final element in the deployment of the dashboard is the actual setup and configuration of the web server where the application runs. The actual implementation of our web dashboard, accessible through https://covid19analytics.scinet.utoronto.ca, relies on a virtual machine (VM) in a physical server located at SciNet headquarters. We should also note that there are other ways to "publish" a dashboard; for shiny-based dashboards, the most common and perhaps most straightforward one is to deploy the dashboard on https://www.shinyapps.io. Alternatively, one could implement the dashboard in a cloud-based solution, e.g. https://aws.amazon.com/blogs/big-data/running-r-on-aws/.
Each approach has its own advantages and disadvantages. For instance, depending on a third-party solution (like the ones previously mentioned) implies some cost to be paid to, or dependency on, the provider, but it will certainly eliminate some of the complexity and special attention one must devote when running one's own server. On the other hand, a self-deployed server allows for full control, in principle a cost-effective or cost-controlled expense, and full integration with the end application. In our case, we opted for a self-controlled and configured server, as mentioned above. Moreover, it is quite a common practice to deploy (multiple) web services via VMs or "containers".

The VM for our web server runs CentOS 7 and has R version 4.0 installed, built from sources compiled on the VM. After that, we proceeded to install the shiny server from sources, i.e. https://github.com/rstudio/shiny-server/wiki/Building-Shiny-Server-from-Source. After the installation of the shiny-server is completed, we proceed by creating a new user on the VM from which the server is going to be run. For security reasons, we recommend avoiding running the server as root. In general, the shiny server can use a user named "shiny": a local account is created for this user, and then, logged in as this user, one can proceed with the installation of the required R packages in a local library for this user. All the packages needed for running the dashboard, as well as the covid19.analytics package, need to be installed. Lst. 21 shows the commands used for creating the shiny user and finalizing the configuration and details of the log files.

# Place a shortcut to the shiny-server executable in /usr/bin
sudo ln -s /usr/local/shiny-server/bin/shiny-server /usr/bin/shiny-server

# Create shiny user
sudo useradd -r -m shiny

# Create log, config, and application directories
sudo mkdir -p /var/log/shiny-server
sudo mkdir -p /srv/shiny-server
sudo mkdir -p /var/lib/shiny-server
sudo chown shiny /var/log/shiny-server
sudo mkdir -p /etc/shiny-server

Listing 21: List of commands used on the VM to finalize the setup of the shiny user and server. Source: https://github.com/rstudio/shiny-server.

The R script containing the shiny app to be run should be placed in /etc/shiny-server, and configuration details about the shiny interface are adjusted in the /etc/shiny-server/shiny-server.conf file. Permissions of the application file have to match the identity of the user launching the server, in this case the shiny user. At this point, if the installation was successful and all the pieces were placed properly, a shiny-hosted app will be accessible at localhost:3838 when the shiny-server command is executed.

Since the shiny server listens on port 3838 in plain http, it is necessary to set up an Apache web server to act as a reverse proxy, receiving the connection requests from the Internet on ports 80 and 443 (the regular http and https ports) and redirecting them to port 3838 on the same host (localhost).

For dealing with the Apache configuration on port 80, we added the file /etc/httpd/conf.d/rewrite.conf, as shown in Lst. 22.

RewriteEngine On
RewriteCond %{REQUEST_SCHEME} =http
RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [QSA,R=permanent]

Listing 22: Modifications to the Apache configuration, specified in the file rewrite.conf. These three lines rewrite any incoming request from http to https.

For handling the Apache configuration on port 443, we added the file /etc/httpd/conf.d/shiny.conf, as shown in Lst. 23. This VirtualHost receives the https requests from the Internet on port 443, establishes the secure connection, and redirects all input to port 3838 using plain http. All requests to "/" are redirected to "http://0.0.0.0:3838/app1/", where app1 in this case is a subdirectory where a particular shiny app is located. There is an additional configuration file, /etc/httpd/conf.d/ssl.conf, which contains the configuration for establishing secure connections, such as protocols, certificate paths, ciphers, etc.

The main tool we use to communicate updates between the different elements employed in the development and maintenance of the covid19.analytics package and dashboard web interface is orchestrated via GIT repositories.
In this way, we have version control systems in place, but also a decentralized set-up with multiple replicas. Fig. 11 shows a schematic of how our network of repositories and services is connected. The central hub for our package is located at the GitHub repo, https://github.com/mponce0/covid19.analytics; we then have (and users can have too) our own local clones of this repo, which we usually use for development and testing. When a stable and substantial contribution to the package is reached, we submit it to the CRAN repository. Similarly, when an update is made to the dashboard, we can synchronize the VM via git pulls and deploy the updates on the server side.

Figure 11: Schematic of the different repositories and systems employed by the covid19.analytics package and dashboard interface: the central GitHub repository (https://github.com/mponce0/covid19.analytics), its web rendering via GitHub.io (https://mponce0.github.io/covid19.analytics/), the CRAN repository (https://cran.r-project.org/package=covid19.analytics), local copies and private instances, and the shiny server running on the VM (https://covid19analytics.scinet.utoronto.ca).

In this paper we have presented and discussed the R covid19.analytics package, an open source tool to obtain, analyze and visualize data of the CoViD19 pandemic. The package also incorporates a dashboard to facilitate access to its functionalities for less experienced users. As of today, there are a few dozen other packages in the CRAN repository that also allow users to gain access to different datasets of the CoViD19 pandemic. In some cases, these packages just provide access to data from specific geographical locations, or their approach to the data structure in which the data is presented differs from the one used here. Nevertheless, having a variety of packages that users can try, and possibly combine, is an important and crucial element in data analysis. Moreover, different cases of data misuse and misinterpretation have been reported, due to issues such as erroneous metadata or data formats [51], in some cases even ending in articles' retractions [52]. Therefore, providing additional functionalities to check the integrity and consistency of the data, as the covid19.analytics package does, is paramount. This is especially true in a situation where the unfolding of events and data availability is flowing so fast that it is sometimes hard to keep track of all the changes.

Moreover, the covid19.analytics package offers a modular and versatile approach to the data, by allowing users to input their own data, to which most of the package functions can be applied as long as the data is structured using the time series format described in this manuscript. The covid19.analytics package is also capable of retrieving genomics data, and it does so by incorporating a novel, more reliable and robust way of accessing and designing different pathways to the data sources. Another unique feature of this package is the ability to incorporate models to estimate the disease spread using the actual data. Although a simple model, it has shown interesting results in agreement with the actual evolution for certain cases. Of course, there are more sophisticated approaches to shed light on the analysis of this pandemic; in particular, novel "community" approaches have been catalyzed by the pandemic as well [53].
However, all of these approaches face new challenges as well [54]; in that regard, having access to a variety of tools, in particular open source ones, and direct access to the data, might help on this front.

References

R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing
R: A language for data analysis and graphics
covid19.analytics: Load and Analyze Live Data from the CoViD-19 Pandemic
The biggest mystery: what it will take to trace the coronavirus source
Animal source of the coronavirus continues to elude scientists
A pneumonia outbreak associated with a new coronavirus of probable bat origin
The proximal origin of SARS-CoV-2
Bat-borne virus diversity, spillover and emergence
Extrapulmonary manifestations of COVID-19
OpenSAFELY: factors associated with COVID-19 death in 17 million patients
Considering how biological sex impacts immune responses and COVID-19 outcomes
Coronavirus blood-clot mystery intensifies
Using influenza surveillance networks to estimate state-specific prevalence of SARS-CoV-2 in the United States
Consolidation in a crisis: Patterns of international collaboration in early COVID-19 research
Critiqued coronavirus simulation gets thumbs up from code-checking efforts
Timing social distancing to avert unmanageable COVID-19 hospital surges
Special report: The simulations driving the world's response to COVID-19
COVID-19 vaccine design: the Janus face of immune enhancement
COVIDep: a web-based platform for real-time reporting of vaccine target recommendations for SARS-CoV-2
Social network-based distancing strategies to flatten the COVID-19 curve in a post-lockdown world
Asymptotic estimates of SARS-CoV-2 infection counts and their sensitivity to stochastic perturbation
Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic
An interactive web-based dashboard to track COVID-19 in real time
Pandemic publishing poses a new COVID-19 challenge
Will the pandemic permanently alter scientific publishing?
How swamped preprint servers are blocking bad coronavirus research
Advancing scientific knowledge in times of pandemics
COVID-19 Risk Factors: literature database & meta-analysis
CoronaWhy: Building a Distributed, Credible and Scalable Research and Data Infrastructure for Open Science, SciNLP: Natural Language Processing and Data Mining for Scientific Text
The Comprehensive R Archive Network
COVID-19 Data Repository by the Center for Systems Science and Engineering
COVID-19: Status of Cases in Toronto
Database resources of the National Center for Biotechnology Information
ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R
rentrez: an R package for the NCBI EUtils API
A contribution to the mathematical theory of epidemics
The SIR model for spread of disease: the differential equation model, Loci (originally Convergence)
Exact analytical solutions of the susceptible-infected-recovered (SIR) epidemic model and of the SIR model with equal death and birth rates
devtools: Tools to Make Developing R Packages Easier
shiny: Web Application Framework for R
shinydashboard: Create Dashboards with 'Shiny'
shinycssloaders: Add CSS Loading Animations to 'shiny' Outputs
Interactive Web-Based Data Visualization with R, plotly, and shiny, Chapman and Hall/CRC, 2020
DT: A Wrapper of the JavaScript Library 'DataTables'
dplyr: A Grammar of Data Manipulation
Estimated Personal Protective Equipment (PPE) Needed for Healthcare Facilities
Personal Protective Equipment (PPE) Burn Rate Calculator
High-profile coronavirus retractions raise concerns about data oversight
COVID-19 pandemic reveals the peril of ignoring metadata standards
Artificial intelligence cooperation to support the global response to COVID-19
The challenges of deploying artificial intelligence models in a rapidly evolving pandemic

Acknowledgements

MP wants to thank all his colleagues at SciNet, especially Daniel Gruner for his continuous and unconditional support, and Marco Saldarriaga, who helped us set up the VM for installing the shiny server.