title: Understanding the COVID19 Outbreak: A Comparative Data Analytics and Study
authors: Koubaa, Anis
date: 2020-03-29
The Coronavirus, also known as the COVID-19 virus, emerged in Wuhan, China, in late November 2019. Since then, it has been spreading on a large scale all around the world. It is currently recognized as the world's most viral and severe epidemic of the last twenty years, as compared to Ebola 2014, MERS 2012, and SARS 2003. Although we are still in the middle of the outbreak, there is an urgent need to understand the impact of COVID-19. The objective is to clarify how it spread so fast worldwide, in such a short time and in an unprecedented fashion. This paper represents a first initiative to achieve this goal, and it provides a comprehensive analytical study of the Coronavirus. The contribution of this paper consists in providing descriptive and predictive models that give insights into the impact of COVID-19 through the analysis of extensive, daily updated data on the outbreak in all countries. We aim to answer several open questions: How does COVID-19 spread around the world? What is its impact in terms of confirmed and death cases at the continent, region, and country levels? How does its severity compare with other epidemic outbreaks, including Ebola 2014, MERS 2012, and SARS 2003? Is there a correlation between the number of confirmed cases and death cases? We present comprehensive analytics visualizations to address these questions. To the best of our knowledge, this is the first systematic analytical paper that paves the way towards a better understanding of COVID-19. The analytical dashboards and collected data of this study are available online [1].
The Coronavirus (COVID-19) outbreak is currently the most critical event worldwide.
It was declared a Global Public Health Emergency by the World Health Organization (WHO) at the end of January 2020, and then a global pandemic in March 2020. The remarkably fast spread of the virus is unprecedented and has exceeded all expectations. Containing the virus is increasingly challenging as almost all countries in the world become affected. The virus originated in Wuhan, China, where the first confirmed case was reported to have occurred on November 17, 2019 [2]. Initially, the confirmed cases in China increased continually. On January 31, the total number of infections reached slightly fewer than 10,000 confirmed cases, with 214 recovered and 213 reported deaths (a death rate of about 2%, and a similar recovery rate). Although the Chinese authorities took incremental and prompt preventive measures to avoid an exponential outbreak, the virus continued to spread not only within Chinese borders but also worldwide. The virus was transmitted around the world through travelers. One of the most dangerous aspects of the Coronavirus is that it has an incubation period of 2-14 days, during which the patient can transmit the virus without showing any symptoms. All these circumstances have favored the exponential growth of the infection, leading to a world health emergency crisis. As a consequence, only two months after the official declaration of COVID-19 as a Global Public Health Emergency, and despite the numerous exceptional preventive measures that every country has taken to avoid the outbreak, the virus has spread to almost all countries of the world.
Figure 1 illustrates the evolution of the number of countries affected by the Coronavirus outbreak from January 22 until March 26, 2020, based on the daily report data provided by the Johns Hopkins University repository [3]. The figure shows that the peak growth of the outbreak started towards the end of February, almost four weeks after the disease was declared a Global Public Health Emergency. After Asia, the first continent to be severely affected, the outbreak spread to other continents during March 2020, first putting Europe into a crisis, followed by the Americas, and finally the African countries. At the time of writing this paper, a total of 173 countries have reported confirmed cases of varying severity, whereas only 60 countries had confirmed cases at the end of February 2020, and only 25 countries at the end of January 2020. This means that the number of affected countries grew by a factor of between 2.4 (60/25) and 2.9 (173/60) each month.
Almost all countries worldwide are currently affected, but the impact of the COVID-19 virus has varied widely between continents, regions, and countries. This is the motivation of this data analytics study. Our objective is to shed light on the COVID-19 virus and understand its evolution in the world. We aim to determine the distribution of confirmed and death cases across the continents, regions, and countries, and the correlation between them. Furthermore, we compare the impact of COVID-19 with that of other epidemic outbreaks, namely Ebola 2014, MERS 2012, and SARS 2003. The rest of the paper is organized as follows.
Since the start of the outbreak, the scientific community has launched several initiatives to investigate the impact of COVID-19. In [4], the authors analyzed the use of social media to exchange information about the Coronavirus. They proposed to identify situational information in order to investigate the propagation of COVID-19 related information in social media.
They used natural language processing techniques to classify COVID-19 information into several types of situational information. In [5], the authors developed a predictive model to forecast the propagation of COVID-19 in Wuhan and its impact on public health, taking social preventive measures into account. In [6], the authors analyzed the COVID-19 outbreak during its early phases in Italy and provided estimates of the reproduction number and serial intervals. In [7], the authors investigated the impact of preventive measures, such as social distancing and lockdown, on the containment of the virus outbreak. They developed prediction models that forecast how these measures can reduce mortality among elderly people. The authors of [8] addressed the question of how the virus spread from its epicenter in Wuhan city to the whole world. They also analyzed the impact of preventive measures such as quarantine and city closure in mitigating the adverse impact of the spread, presented visual graphs, and developed a mathematical model of the disease transmission pattern. In [9], the author analyzed the virus outbreak in Italy based on early data in order to predict the outcome of the process. He argued that there is a strong correlation between the situation in Italy and that of Hubei Province.
Some researchers have attempted to use deep learning and artificial intelligence in the context of COVID-19. In [10], the authors proposed COVID-Net, a deep convolutional neural network for the detection of COVID-19 infection from an open-source dataset of chest radiography images. The dataset contains 5941 chest radiography images from 2839 patient cases. In [11], the authors developed an image processing technique for the detection, quantification, and tracking of the COVID-19 virus.
They utilized deep neural network models for the classification of suspected COVID-19 thoracic CT features, using data from 157 patients from the USA and China. The classification area under the curve (AUC) of the study was found to be 0.996. In [12], the authors investigated drop-weights based Bayesian Convolutional Neural Networks (BCNN) and their effect on improving the diagnostic performance on COVID-19 chest X-rays. They showed that the uncertainty in prediction is highly correlated with the accuracy of prediction.
In this paper, we propose a detailed data analytics study of the COVID-19 virus to understand its impact. In addition, we compare its severity against Ebola 2014, MERS 2012, and SARS 2003. To achieve this objective, we collected data from authoritative sources that are widely accepted by the scientific community. In what follows, we present the datasets used in this study.
We searched for datasets that provide credible data about the COVID-19 outbreak. The 2019 Novel Coronavirus COVID-19 Data Repository provided by Johns Hopkins University [3] is the most comprehensive, up-to-date, and complete dataset that gives daily reports of the COVID-19 outbreak in terms of confirmed cases, death cases, and recovered cases. In addition, Johns Hopkins University maintains an active dashboard that reports daily updates on the Coronavirus [13]. The same dataset is also being extensively used by the Kaggle data science community to develop analytics notebooks and dashboards about COVID-19 [14]. Each row in the COVID-19 dataset contains the following relevant data:
• Observation date: the date on which the corresponding data row was reported.
• Country: the country from which the data originated
• Confirmed cases: the number of COVID-19 confirmed cases
• Death cases: the number of COVID-19 death cases
• Recovered cases: the number of COVID-19 recovered cases
In addition to this data, we processed the dataset to add the following information:
• Continent: the continent of the country related to the collected data. We considered five continents: Africa, the Americas (North and South), Asia, Europe, and Oceania (Australia and New Zealand).
• Region: a level between country and continent. We considered the following regions in our study: (Northern/Southern/Eastern/Western/Middle) Africa, (Northern/Southern/Eastern/Western) Europe, (Northern/Southern/South-Eastern/Eastern/Western) Asia, (Northern/South/Central) America, Arabic Gulf, Caribbean, Australia and New Zealand, Melanesia, and Micronesia.
The mapping between countries and their corresponding regions and continents was performed using the CSV file available at https://www.kaggle.com/statchaitya/country-to-continent.
We have also collected datasets for the other epidemic diseases, namely: [16]
• Ebola 2014-2016 Outbreak Complete Dataset [17]
All the datasets provide time-series information about confirmed and death cases per country per day during the observation period, except the MERS dataset, which provides only the final statistics of the disease for the confirmed cases (no death case reports). We could not find any credible data source for the time-series evolution of MERS, nor for its death cases.
We processed these datasets to clean the data and to add the mapping of the countries to their region and continent, in order to develop region-level and continent-level statistics. We also created an all-in-one dataset with all data combined for comparative purposes.
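The enrichment step described above (attaching a continent and a region to each daily report row, then aggregating at those levels) can be sketched as follows. This is only an illustrative sketch, not the processing pipeline used in the study: the column names of the mapping file, the `Unlisted` country, and all counts are assumptions or placeholders chosen for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a country-to-region mapping file. The column names
# are assumptions, not the actual header of the CSV file used in the study.
MAPPING_CSV = """country,continent,region
Italy,Europe,Southern Europe
Spain,Europe,Southern Europe
China,Asia,Eastern Asia
"""

# Placeholder daily-report rows: the counts are made up for illustration only.
REPORT_ROWS = [
    {"date": "2020-03-27", "country": "Italy", "confirmed": 10, "deaths": 2},
    {"date": "2020-03-27", "country": "Spain", "confirmed": 7, "deaths": 1},
    {"date": "2020-03-27", "country": "China", "confirmed": 5, "deaths": 1},
    {"date": "2020-03-27", "country": "Unlisted", "confirmed": 1, "deaths": 0},
]


def load_mapping(text):
    """Parse the mapping CSV into {country: (continent, region)}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["country"]: (row["continent"], row["region"]) for row in reader}


def enrich(rows, mapping):
    """Return copies of the rows with continent and region added.

    Countries missing from the mapping are labeled 'Unknown' so that they
    can be spotted and fixed instead of being silently dropped.
    """
    enriched = []
    for row in rows:
        continent, region = mapping.get(row["country"], ("Unknown", "Unknown"))
        enriched.append({**row, "continent": continent, "region": region})
    return enriched


def totals_by(rows, level):
    """Sum confirmed and death cases per value of `level` (e.g. 'continent')."""
    totals = defaultdict(lambda: {"confirmed": 0, "deaths": 0})
    for row in rows:
        totals[row[level]]["confirmed"] += row["confirmed"]
        totals[row[level]]["deaths"] += row["deaths"]
    return dict(totals)


mapping = load_mapping(MAPPING_CSV)
enriched = enrich(REPORT_ROWS, mapping)
continent_totals = totals_by(enriched, "continent")
region_totals = totals_by(enriched, "region")
```

Keeping unmapped countries under an explicit `Unknown` label is a design choice of this sketch: country names often differ between data sources, and a visible bucket makes such mismatches easy to detect before computing region-level or continent-level statistics.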
In this work, we used Tableau Professional software to analyze the collected data and develop visualization dashboards about the Coronavirus disease. Our methodology consists in creating descriptive models of the Coronavirus outbreak using statistical charts to understand the nature of the spread and its impact. We develop our analysis at three levels: country level, region level, and continent level. Each level provides a different granularity for understanding the distribution of the disease around the world. The descriptive model provides different types of statistical charts, including bar charts, geographic maps, heat maps, box plots, and packed bubbles, to represent different features of the COVID-19 outbreak. We also develop predictive models using linear and polynomial regressions to predict the evolution of the outbreak from the historical data.
In this study, we also compare COVID-19 with the three other most critical world epidemic outbreaks, namely Ebola 2014, MERS 2012, and SARS 2003. We visualize the differences in the impact of these diseases in terms of confirmed and death cases, and analyze the characteristics of each disease. The lessons learned in this data analytics study serve as a basis for further data science investigation of the COVID-19 epidemic outbreak.
In this section, we present the results of this data analytics study. The dashboards of this study are also available online [1].
A. How does COVID-19 evolve?
Figure 2 depicts the evolution of the COVID-19 outbreak on a logarithmic scale during the period from January 22, 2020, to March 27, 2020, i.e., a two-month period. Let us consider January as the reference month. We performed a linear regression analysis on the different curves shown in Figure 2, and we determined the confirmed/recovered/death/active rates during the observation periods. These rates are shown in Table I. The rates are the slopes of the regression lines.
They are shown as the first parameter in parentheses in the table below. By observing the trend lines of the linear regression models on the different curves for each month, we draw the following observations:
This is also confirmed by the ratio between the recovered rate and the death rate. Comparing the trend of death cases with the trend of recovered cases, it can be observed that the ratio of recovered to death rates was 0.86 (19/22) in January 2020, meaning that the death rate was slightly higher than the recovered rate in January. However, in February 2020, the ratio of recovered to death rates increased to 13.33 (1373/103), and it stands at 3.67 (2906/792) in March 2020. Thus, the general trend is that the disease is becoming better controlled in terms of fatality rates, owing to the increasing emergency procedures that the different countries have implemented.
The results presented above are coarse-grained in the sense that they give a global assessment of the evolution. However, the evolution of the COVID-19 infection depends strongly on the country, the region, and the continent. It is, therefore, important to assess the evolution at these levels to get a better understanding of it. Figure 4 presents the cumulative confirmed/recovered/death/active cases reported as of March 27, 2020 for the top-10 countries, then for their regions and continents. We observe that, among the top-10 countries, six are from Europe (Italy, Spain, Germany, France, United Kingdom, and Switzerland), three are from Asia (China, Iran, and South Korea), and the remaining one is the United States of America, which recently became the top country in terms of the number of infections. Nonetheless, the highest number of deaths is in Italy, with more than 9000 death cases reported, because Italy has been severely affected by the virus since the end of February, well before the USA. However, the USA currently has a death rate of 39 deaths per day, whereas it had no deaths in February 2020.
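The rates discussed in this subsection are slopes of linear trend lines, and the recovered-to-death ratios are quotients of those slopes. A minimal sketch of both computations is given below. It assumes an ordinary least-squares fit of the raw daily series against the day index, which may differ from the trend-line settings used in the dashboards. The helper names are hypothetical, the synthetic series is generated only to exercise the slope function, and the monthly slopes are the ones quoted in the text.

```python
def least_squares_slope(values):
    """Slope of the ordinary least-squares line through (0, v0), (1, v1), ...

    With one value per day, the slope is expressed in cases per day.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def recovered_to_death_ratios(rates):
    """Map each month to recovered_rate / death_rate."""
    return {month: recovered / deaths for month, (recovered, deaths) in rates.items()}


# Monthly (recovered, death) trend-line slopes as quoted in the text.
QUOTED_RATES = {"January": (19, 22), "February": (1373, 103), "March": (2906, 792)}

# Synthetic, exactly linear series (100 new cases per day). It is NOT real
# outbreak data and only checks that the helper recovers a known slope.
synthetic_series = [50 + 100 * day for day in range(31)]
synthetic_rate = least_squares_slope(synthetic_series)

ratios = recovered_to_death_ratios(QUOTED_RATES)
```

A ratio below 1 means that deaths were growing faster than recoveries in that month, and a ratio above 1 means the opposite, which is how the monthly values are interpreted in the discussion above.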
Based on the collected data, there is a strong indication that COVID-19 takes almost one month to move from one continent to another, in the direction from East to West: the peak was in China at the end of January, then in Italy (Southern Europe) at the end of February, and it reached the USA at the end of March, where the peak of infection is in New York, located on the eastern side of the USA. It can be observed in Figure 3 that the eastern side of the USA, and mainly New York, is the most affected, considering that it is closer to Europe. If the trend holds, the western side of the USA would be expected to reach its peak by the end of April 2020.
In Figure 4, it can be observed that Europe currently has the largest number of confirmed cases and the most severe fatality figures. The maximum is reached in Southern Europe, with more than 14000 death cases, of which more than 9000 are in Italy. Italy currently accounts for about a third of the fatalities in the whole world. It is also observed that Oceania and Africa are less affected by the virus than the other continents. Finally, the distribution of the different cases is illustrated in the heatmap presented in Figure 5. Dark colors mean a high concentration of cases, and lighter colors mean a lower concentration of cases.
In the previous section, we presented a comprehensive analysis of the COVID-19 virus and gained a better understanding of how it evolved and of its impact at the country, region, and continent levels. In this section, we address the question: How does COVID-19 compare to other epidemics? Several other epidemics have emerged in the last 20 years, in particular Severe Acute Respiratory Syndrome (SARS 2003) in 2003 in Hong Kong, the Middle East Respiratory Syndrome (MERS 2012) in Saudi Arabia and the Middle East, and Ebola 2014 on the Western African coast, namely in Guinea, Liberia, and Sierra Leone.
These three epidemics, in addition to COVID-19, are the most remarkable world diseases of the last 20 years, and we propose to compare and analyze them.
1) Comparative evolution over time: Figure 6 presents a dashboard that compares the four epidemic outbreaks. At the top, we observe the geographic heat map for the four diseases. It is visually apparent that COVID-19 is the largest outbreak by a considerable margin, followed by SARS 2003.
It can be observed that COVID-19 can be regarded as a more acute form of SARS 2003, as they share some common features: (1) both started in China, (2) they belong to the same family of Coronavirus syndromes affecting the respiratory system, and (3) they have the highest contamination rates as compared to other epidemics. Based on these observations, it seems that the complete containment of COVID-19 will take longer than that of SARS 2003. The third row of Figure 6 shows the daily confirmed cases for COVID-19, Ebola 2014, and SARS 2003. The trend of COVID-19 is exponential, whereas the trends of Ebola 2014 and SARS 2003 are high at the start of the disease and then start to decrease two months after the first confirmed cases. This shows that the behavior of COVID-19 is more aggressive than that of the other epidemics.
2) Comparative Impact: We address the question: how do the impacts of the epidemics compare to each other in terms of confirmed cases and death cases? Figure 7 shows the comparative impact with respect to the confirmed cases, and Figure 10 shows the comparative impact with respect to the death cases, at continent level, region level, and country level. The blue color refers to COVID-19, the red color refers to Ebola 2014, and the yellow color refers to SARS 2003.
Looking at the two figures, we can conclude that COVID-19 is exceptionally more aggressive in terms of confirmed cases, with more than 90% of the share of the heatmap, whereas its share is around 80% in terms of fatalities. At the country level, the USA has the largest share of confirmed cases (as of March 27, 2020) with 16.43%, followed by Italy with 13.98%, then China with 13.15%. We can also observe that the number of confirmed cases of Ebola 2014 in Sierra Leone is similar to the COVID-19 spread in South Korea and in countries of Western Europe, namely the Netherlands, Belgium, and Austria.
The impact in terms of death cases differs from that of the confirmed cases. At the continent level, the highest death impact of COVID-19 is in Europe with 56.26%, then in Asia with 18.56%, which is of the same magnitude as the fatality of Ebola 2014 in Africa. Looking at Table II, we can observe a strong correlation between the median age of a continent/region and the fatality rate. Europe, the oldest of all continents with a median age of 42 years, has the highest fatality rates, mainly in Southern and Western Europe. At the region level, we observed that the deadly impact of Ebola 2014 on Western Africa is the second most severe after the deadly impact of COVID-19 in Southern Europe. At the country level, the impact of COVID-19 is the highest in Italy, followed by the impact of Ebola 2014 in
As far as SARS 2003 is concerned, its fatality rate is much lower than those of Ebola 2014 and COVID-19. Figure 9 and Table 8 present the average confirmed/death cases per continent for each of the epidemics. The results confirm the heatmap and packed bubbles presented above, and provide the average distribution of deaths in each continent.
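The percentage shares and per-continent averages used in this comparison can be sketched as follows. This is a hedged illustration rather than the computation behind the dashboards: the records are placeholders with made-up counts and anonymous country labels, the function names are hypothetical, and the average is assumed to be a mean over the countries of a continent, which the text does not specify.

```python
from collections import defaultdict

# Placeholder (epidemic, continent, country, deaths) records.
# All counts are made up for illustration and are NOT taken from the datasets.
RECORDS = [
    ("COVID-19", "Europe", "Country A", 60),
    ("COVID-19", "Europe", "Country B", 20),
    ("COVID-19", "Asia", "Country C", 20),
    ("Ebola 2014", "Africa", "Country D", 100),
]


def percent_shares(records, key_index):
    """Share (in %) of the grand total held by each distinct key value."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[-1]
    grand_total = sum(totals.values())
    return {key: 100.0 * value / grand_total for key, value in totals.items()}


def average_per_group(records):
    """Mean count per country within each (epidemic, continent) group."""
    grouped = defaultdict(list)
    for epidemic, continent, _country, count in records:
        grouped[(epidemic, continent)].append(count)
    return {group: sum(counts) / len(counts) for group, counts in grouped.items()}


continent_shares = percent_shares(RECORDS, key_index=1)  # share per continent
epidemic_shares = percent_shares(RECORDS, key_index=0)   # share per epidemic
group_averages = average_per_group(RECORDS)
```

Because the shares are computed against the combined total of all epidemics, they play the same role as the areas in the heatmap and packed-bubble charts, where every epidemic competes for the same 100%.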
The highest average of confirmed cases
Average Death Cases Table (Note: COVID-19 as of
Chinese Center for Disease Control and Prevention
Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE
Characterizing the propagation of situational information in social media during COVID-19 epidemic: A case study on Weibo
Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
Age-structured impact of social distancing on the COVID-19 epidemic in India
Visual data analysis and simulation prediction for COVID-19
Predicting the ultimate outcome of the COVID-19 outbreak in Italy
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
Rapid AI development cycle for the coronavirus (COVID-19) pandemic: Initial results for automated detection and patient monitoring using deep learning CT image analysis
Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection
Coronavirus COVID-19 Global Cases Dashboard
Novel Corona Virus MERS Outbreak Dataset
Ebola 2014-2016 Outbreak Complete Dataset