key: cord-0730225-oib1ctue
authors: Tuli, S.; Tuli, R.; Gill, S. S.
title: Predicting the Growth and Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing
date: 2020-05-11
journal: nan
DOI: 10.1101/2020.05.06.20091900
sha: 69c0eeaa38509841e5a2300d1604dba556e87e89
doc_id: 730225
cord_uid: oib1ctue

The outbreak of COVID-19, caused by the coronavirus SARS-CoV-2, has created a calamitous situation throughout the world. The cumulative incidence of COVID-19 is rapidly increasing day by day. Machine Learning (ML) and Cloud Computing can be deployed very effectively to track the disease, predict the growth of the epidemic and design strategies and policies to manage its spread. This study applies an improved mathematical model to analyse and predict the growth of the epidemic. An ML-based improved model has been applied to predict the potential threat of COVID-19 in countries worldwide. We show that using iterative weighting for fitting the Generalized Inverse Weibull distribution, a better fit can be obtained to develop a prediction framework. This can be deployed on a cloud computing platform for more accurate and real-time prediction of the growth behavior of the epidemic. A data-driven approach with higher accuracy, as presented here, can be very useful for a proactive response from the government and citizens. Finally, we propose a set of research opportunities and set up grounds for further practical applications.

The novel Coronavirus disease was first reported on 31 December 2019 in Wuhan, Hubei Province, China. It started spreading rapidly across the world [1]. The cumulative incidence of the causative virus (SARS-CoV-2) is rapidly increasing and has affected 196 countries and territories, with the USA, Spain, Italy, the U.K. and France being the most affected [2]. The World Health Organization (WHO) has declared the coronavirus outbreak a pandemic, while the virus continues to spread [3].
As of 4 May 2020, a total of 3,581,884 confirmed positive cases have been reported, leading to 248,558 deaths [2]. The major difference between the pandemic caused by CoV-2 and related viruses, like SARS and MERS, is the ability of CoV-2 to spread rapidly through human contact and leave nearly 20% infected subjects as

Motivation and Our Contributions: ML [11] can be utilized to handle large data and intelligently predict the spread of the disease. Cloud computing [12] can be used to rapidly enhance the prediction process using high-speed computations [7]. Novel energy-efficient edge systems can be used to procure data, in order to bring down power consumption. In this paper, we present a prediction model deployed using the FogBus framework [13] for accurate prediction of the number of COVID-19 cases, the rise and fall of the number of cases in the near future and the date when various countries may expect the pandemic to end. We also provide a detailed comparison with a baseline model and show how catastrophic the effects can be if poorly fitting models are used. We present a prediction scheme based on the ML model, which can be used in remote cloud nodes for real-time prediction, allowing governments and citizens to respond proactively. Finally, we summarize this work and present various research directions.

Article structure: The rest of the paper is organized as follows: Section 2 presents the prediction model and performance comparison. Section 3 concludes the work and describes the future research opportunities. Section 4 provides details of open repositories for the dataset, code and results.

The Machine Learning (ML) and Data Science communities are striving hard to improve the forecasts of epidemiological models and to analyze the information flowing over Twitter for the development of management strategies and the assessment of the impact of policies to curb the spread of the disease. Various datasets in this regard have been openly released to the public.
Yet, there is a need to capture, develop and analyse more data as COVID-19 grows worldwide [14, 15]. The novel coronavirus is leaving a deep socio-economic impact globally. Due to the ease of virus transmission, primarily through droplets of saliva or discharge from the nose when an infected person coughs or sneezes, countries which are densely populated need to be on a higher alert [16]. To gain more insight into how COVID-19 is impacting the world population, and to predict the number of COVID-19 cases and the dates when the pandemic may be expected to end in various countries, we propose a Machine Learning model that can be run continuously on Cloud Data Centers (CDCs) for accurate spread prediction and proactive development of a strategic response by the government and citizens.

Dataset: The dataset used in this case study is the Our World in Data dataset by Hannah Ritchie 1. The dataset is updated daily from the World Health Organization (WHO) situation reports 2. More details about the dataset are available at: https://ourworldindata.org/coronavirus-source-data.

Cloud framework: The ML models are built to make a good advance prediction of the number of new cases and the dates when the pandemic might end. To provide fail-safe computation and quick data analysis, we propose a framework to deploy these models on cloud datacenters, as shown in Figure 1. In a cloud-based environment, the government hospitals and private health centers continuously send their positive patient counts. Population density, average and median age, weather conditions, health facilities etc. are also to be integrated to enhance the accuracy of the predictions. For this case study, we used three instances of single-core Azure B1s virtual machines with 1 GiB RAM, SSD storage and 64-bit Microsoft Windows Server 2016 1.
We used the HealthFog [11] framework leveraging FogBus [13] for deploying multiple analysis tasks in an ensemble learning fashion to predict various metrics, like the number of anticipated facilities to manage patients and the hospitals. We found that, for tracking patients on a daily basis, the amortized CPU consumption is 37% and the Cloud execution cost is only 1.2 USD per day. As the dataset size increases, computationally more powerful resources would be needed.

ML model: Many recent works have suggested that the COVID-19 spread follows an exponential distribution [17, 18, 19]. As per empirical evaluations and previous datasets on the SARS-CoV-1 virus pandemic, many sources have shown that data corresponding to new cases over time has a large number of outliers and may or may not follow a standard distribution like Gaussian or Exponential [20, 21, 22, 23]. In a recent study by the Data-Driven Innovation Laboratory, Singapore University of Technology and Design (SUTD) 3, the regression curves were drawn using the Susceptible-Infected-Recovered model [24] and a Gaussian distribution was deployed to estimate the number of cases over time. However, in the previously reported studies on the earlier version of the virus, namely SARS-CoV-1, the data was reported to follow the Generalized Inverse Weibull (GIW) distribution [25] better than the Gaussian, as shown in Figure 2 (details of Robust Weibull fitting follow in the next section). A detailed comparison for SARS-CoV-2 is described in the next section. This fits the following function to the data: Here, f(x) denotes the number of cases at x, where x > 0 is the time in number of days from the first case, and α, β, γ > 0 (α, β, γ ∈ R) are parameters of the model. Now, we can find the appropriate values of the parameters α, β and γ to minimize the error between the predicted cases (y = f(x)) and the actual cases (ŷ). This can be done using the popular Machine Learning technique of Levenberg-Marquardt (LM) for curve fitting [26].
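As a minimal sketch of the curve family discussed above, the following Python evaluates a Generalized Inverse Weibull curve. The displayed equation is not available in this text, so the sketch assumes the standard GIW density from the literature, f(x) = γ α β^α x^(-(α+1)) exp(-γ (β/x)^α); the authors' exact parameterization (the roles of α and β, or any extra scale factor for raw case counts) may differ. The function names giw_pdf, giw_cdf and giw_mode are illustrative only.

```python
import math


def giw_pdf(x, alpha, beta, gamma):
    """Assumed GIW density at day x > 0 (standard literature form)."""
    if x <= 0:
        raise ValueError("x must be positive (days since the first case)")
    return (gamma * alpha * beta ** alpha * x ** (-(alpha + 1))
            * math.exp(-gamma * (beta / x) ** alpha))


def giw_cdf(x, alpha, beta, gamma):
    """Cumulative form matching giw_pdf: exp(-gamma * (beta / x) ** alpha)."""
    if x <= 0:
        raise ValueError("x must be positive (days since the first case)")
    return math.exp(-gamma * (beta / x) ** alpha)


def giw_mode(alpha, beta, gamma):
    """Day at which the assumed curve peaks.

    Setting d/dx log f(x) = 0 gives x**alpha = gamma*alpha*beta**alpha/(alpha+1).
    """
    return beta * (gamma * alpha / (alpha + 1)) ** (1.0 / alpha)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the curve shape.
    a, b, g = 3.0, 40.0, 1.0
    print("peak day:", giw_mode(a, b, g))
    print("density at peak:", giw_pdf(giw_mode(a, b, g), a, b, g))
```

Under these assumptions the curve rises quickly, peaks at giw_mode, and then decays with a long right tail, which is the shape a Gaussian cannot reproduce. Fitting α, β and γ to daily counts would then be handed to a Levenberg-Marquardt solver; that step is not shown here.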
However, as various sources have suggested, in the initial stages of COVID-19 the data has many outliers and noise. This makes it hard to accurately predict the number of cases. Thus, we propose an iterative weighting strategy and call our fitting technique "Robust Fitting". A diagrammatic representation of the iterative weighting process is shown in Figure 3. The main idea is as follows. We maintain weights for all data points (i) in every iteration (n, starting from 0) as w_i^n. First, we fit a curve using the LM technique with the weights of all data points set to 1, thus w_i^0 = 1 ∀ i. Second, we find the weight corresponding to every point for the next iteration (w_i^{n+1}) as:

$$w_i^{n+1} = \frac{\exp\left(1 - \frac{\mathrm{tanhshrink}(d_i)}{\max_j \mathrm{tanhshrink}(d_j)}\right)}{\sum_k \exp\left(1 - \frac{\mathrm{tanhshrink}(d_k)}{\max_j \mathrm{tanhshrink}(d_j)}\right)}$$

Simply, in the above equation, we first take the tanhshrink function, defined as tanhshrink(x) = x − tanh(x), of the distances of all points along the y axis from the curve (d_i). This gives a higher value for points far from the curve and a value near 0 for closer points. This is then normalized by dividing by the maximum value over all points and subtracted from 1 to get a weight corresponding to each point. This weight is then normalized using the softmax function so that the sum of all weights is 1. The curve is fit again using the LM method, now with the new weights w_i^{n+1}. The algorithm converges when the sum total deviation of all weights becomes lower than a threshold value.

1 Azure Cloud VMs: https://azure.microsoft.com/en-au/pricing/calculator/
3 When Will COVID-19 End, DDI Lab, SUTD: https://ddi.sutd.edu.sg/when-will-covid-19-end

All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 11, 2020. . https://doi.org/10.1101/2020.05.06.20091900 doi: medRxiv preprint
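The weight update described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the names update_weights and total_deviation are hypothetical, d_i is taken as the absolute vertical distance between a data point and the fitted curve, and the case where every point lies exactly on the curve (maximum of 0) is handled by giving all points equal weight.

```python
import math


def tanhshrink(x):
    """tanhshrink(x) = x - tanh(x): near 0 for small x, close to x - 1 for large x."""
    return x - math.tanh(x)


def update_weights(residuals):
    """Return next-iteration weights from the vertical distances to the curve.

    Steps, following the description in the text:
      1. apply tanhshrink to each absolute distance,
      2. divide by the maximum over all points and subtract from 1,
      3. apply softmax so that the weights sum to 1.
    """
    shrunk = [tanhshrink(abs(d)) for d in residuals]
    peak = max(shrunk)
    if peak == 0.0:
        # Assumption: a perfect fit gives every point the same score.
        scores = [1.0 for _ in shrunk]
    else:
        scores = [1.0 - s / peak for s in shrunk]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def total_deviation(old_weights, new_weights):
    """Sum of absolute weight changes, compared against a threshold to stop."""
    return sum(abs(a - b) for a, b in zip(old_weights, new_weights))


if __name__ == "__main__":
    # Two points close to the curve and one outlier (made-up distances).
    weights = update_weights([0.5, -1.0, 20.0])
    print(weights)  # the outlier receives the smallest weight
```

In a full loop these weights would be passed to the next weighted LM fit, and iteration would stop once total_deviation between successive weight vectors drops below the chosen threshold. With SciPy's curve_fit, one common way to apply per-point weights is through its sigma argument (for example sigma proportional to 1/sqrt(w)); whether the authors do this is not stated here.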
Distribution Model Selection: To find the best fitting distribution model for the data corresponding to COVID-19, we studied the data on daily new confirmed COVID-19 cases. Five sets of global data on daily new COVID-19 cases were used to fit parameters of different types of distributions. Finally, we identified the 5 best performing distributions. The results are shown in Table 1. We observe that, using the iteratively weighted approach, the Inverse Weibull function fits the COVID-19 dataset best, as compared to the iterative versions of the Gaussian, Beta (4-parameter), Fisher-Tippet (Extreme Value distribution), and Log Normal functions. When applied to the same dataset, Iterative Weibull showed an average MAPE 12% lower than non-iteratively weighted Weibull. A step-by-step algorithm for iteratively weighted curve fitting using the GIW distribution (called "Robust Weibull") is given in Algorithm 1.

[Algorithm 1: Robust Weibull fitting (iteratively weighted LM fit; the loop breaks when the summed weight deviation falls below a threshold)]

Analysis and Interpretation: To evaluate the proposed "Robust Weibull fitting" model, we use the baseline proposed by Jianxi Luo from SUTD 3. The comparison metrics include Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE) and the coefficient of determination (R^2). Table 2 shows the model predictions of the spread of COVID-19 for every major country for which sufficient data was available and model fits had R^2 > 0.5 using the proposed model. As shown in the table, the proposed model performs significantly better than the baseline.

As shown in Figure 4 (footnote 4), the predictions of the baseline Gaussian model deployed by SUTD are over-optimistic. Following such models could lead to premature lifting of the lockdown, adversely affecting management of the epidemic.
Having better fit models, as proposed here, could help plan a better strategy, based on more accurate predictions and future scenarios. Figure 5 shows the total predicted number of cases for all countries across the globe. Here we have neglected those countries where the data is insufficient for making predictions, or where fewer than 30 days of data are available. As shown in Figure 4 and explained in the model section, the fit curve can be used to predict the number of cases that will have to be dealt with by the country, assuming the same trend continues. The figure illustrates that the maximum number of total cases will be in the North America region. The number of cases will also be high in the European continent, Russia and eastern Asia, including China, the original epicenter of the disease.

4 Curves and predictions of all countries have been given in the Appendix.

There is a need to explore social media and apply machine learning to analyse social media data. In this study, we have discussed how improved mathematical modelling, Machine Learning and cloud computing can help to predict the growth of the epidemic proactively. Further, a case study has been presented which shows the severity of the spread of CoV-2 in countries worldwide. Using the proposed Robust Weibull model based on iterative weighting, we show that our model is able to make statistically better predictions than the baseline. The baseline Gaussian model shows an over-optimistic picture of the COVID-19 scenario. A poorly fitting model could lead to non-optimal decision making, worsening the public health situation. We propose the future directions as follows.
Firstly, other important parameters like population density, distribution of age, individual and community movements, level of healthcare facilities available, strain type and virulence of the virus etc. need to be included in the regression model to further enhance the prediction accuracy. Secondly, models like ARIMA [27] can be integrated with the Weibull function for further time series analysis and predictions. Thirdly, ML can be utilized to predict the structure and function of various proteins associated with CoV-2 and their interaction with the host human proteins and cellular environment. The contribution of various socio-economic variables that determine the vulnerability, spread and progression of the epidemic can be predicted by developing suitable algorithms. AI-based proactive measures can be taken to prevent the spread of the virus to sensitive groups in society. Real-time sensors can be used, for example in traffic cameras or surveillance, which track COVID-19 symptoms based on visual imaging and tracking apps, and inform the respective hospitals and administrative authorities for punitive action [28]. Tracking needs to cover all stages, from ports of entry to public places and hospitals [29]. The research directions and challenges are summarized in Figure 6.

References:
A novel coronavirus outbreak of global health concern
What the cruise-ship outbreaks reveal about COVID-19
Clinical features of COVID-19 in elderly patients: A comparison with young and middle-aged patients
Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak
Next generation technologies for smart healthcare: Challenges, vision, model, trends and future directions
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet
Automated classification of usual interstitial pneumonia using regional volumetric texture analysis in high-resolution CT
Machine learning applications in genetics and genomics
HealthFog: An ensemble deep learning based smart healthcare system for automatic diagnosis of heart diseases in integrated IoT and fog computing environments
Transformative effects of IoT, blockchain and artificial intelligence on cloud computing: Evolution, vision, trends and open challenges
FogBus: A blockchain-based lightweight framework for edge and fog computing
COVID-19
COVID-19 image data collection
Presumed asymptomatic carrier transmission of COVID-19
Effective containment explains sub-exponential growth in confirmed cases of recent COVID-19 outbreak in mainland China
COVID-19 epidemic outside China: 34 founders and exponential growth. medRxiv
COVID-19, exponential growth, and the power of showing up in social solidarity: The math behind the virus
Prediction of SARS epidemic by BP neural networks with online prediction strategy
Monitoring the SARS epidemic in China: a time series analysis
Simulating the SARS outbreak in Beijing with limited data
The SIR model for spread of disease: The differential equation model. Loci (originally Convergence)
The generalized inverse Weibull distribution
The Levenberg-Marquardt algorithm: implementation and theory
Application of the ARIMA model on the COVID-2019 epidemic dataset. Data in Brief
A novel AI-enabled framework to diagnose coronavirus COVID 19 using smartphone embedded sensors: Design study
A deep learning algorithm using CT images to screen for corona virus disease (COVID-19). medRxiv

He is a national level Kishore Vaigyanic Protsahan Yojana (KVPY) scholarship holder for excellence in science and innovation. He has worked as a visiting researcher at the Cloud Computing and Distributed Systems He is the founder and CEO of Qubit Inc. He has worked remotely with the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, the University of Melbourne, Australia in the realization of the FogBus framework. He has also worked at the Embedded Systems Laboratory, EPFL, Switzerland in the design of low-power and physics-optimized Edge devices made from emerging Non-Volatile Memories Before this, he was executive Director, National Agri-Food Biotech Institute, Mohali; and Director, National Botanical Research Institute, Lucknow. He has many publications in reputed journals and conferences including Nature Biotechnology

Our prediction model is available online at https://github.com/shreshthtuli/covid-19-prediction. The dataset used for this work is the Our World in Data dataset, available at https://github.com/owid/covid-19-data/tree/master/public/data/.
A few interactive graphs can be seen at https://collaboration.coraltele.com/covid/.

Appendix: Real data from WHO with predicted curves of proposed and baseline models. All data up to 4 May 2020 has been used to generate results.