key: cord-0866253-rcfmc442
authors: Pourghasemi, Hamid Reza; Pouyan, Soheila; Heidari, Bahram; Farajzadeh, Zakariya; Shamsi, Seyed Rashid Fallah; Babaei, Sedigheh; Khosravi, Rasoul; Etemadi, Mohammad; Ghanbarian, Gholamabbas; Farhadi, Ahmad; Safaeian, Roja; Heidari, Zahra; Tarazkar, Mohammad Hassan; Tiefenbacher, John P.; Azmi, Amir; Sadeghian, Faezeh
title: Spatial modelling, risk mapping, change detection, and outbreak trend analysis of coronavirus (COVID-19) in Iran (days between 19 February to 14 June 2020)
date: 2020-06-20
journal: Int J Infect Dis
DOI: 10.1016/j.ijid.2020.06.058
sha: cee38b31360b6a08c207b8d1300c3ee136ca6708
doc_id: 866253
cord_uid: rcfmc442

Abstract

Objectives: Coronavirus disease 2019 (COVID-19) represents a major pandemic threat that has spread to more than 212 countries and 2 international conveyances, with more than 432,902 recorded deaths and 7,898,442 confirmed cases worldwide as of June 14, 2020. It is crucial to investigate the spatial drivers of COVID-19 in order to prevent and control the epidemic.

Methods: This is the first comprehensive study of COVID-19 in Iran, and it undertakes spatial modeling, risk mapping, change detection, and outbreak trend analysis of the disease spread. Four main steps were taken: comparison of Iranian coronavirus data with the global trends; prediction of mortality trends using regression modelling; spatial modelling, risk mapping, and change detection using the random forest (RF) machine learning technique (MLT); and validation of the modelled risk map.

Results: The results show that from February 19 to June 14, 2020, the average growth rates (GR) of COVID-19 deaths and of the total number of COVID-19 cases in Iran were 1.08 and 1.10, respectively. Based on World Health Organisation (WHO) data, Iran's fatality rate (deaths/0.1 M pop) is 10.53.
Other countries' fatality rates were, for comparison: Belgium – 83.32, UK – 61.39, Spain – 58.04, Italy – 56.73, Sweden – 48.28, France – 45.04, USA – 35.52, Canada – 21.49, Brazil – 20.10, Peru – 19.70, Chile – 16.20, Mexico – 12.80, and Germany – 10.58. The fatality rate for China is 0.32 (deaths/0.1 M pop). The heatmap of the infected areas over time identified two critical time intervals for the COVID-19 outbreak in Iran. In terms of disease and death rates, the provinces were classified into a large primary group and three provinces whose critical outbreaks set them apart from the others. The heatmap of countries of the world shows that China and Italy were distinguished from other countries in terms of nine viral infection-related parameters. The regression models for death cases showed an increasing trend, but with some evidence of turning points. A polynomial relationship was identified between the coronavirus infection rate and province population density. In addition, a third-degree polynomial regression model for deaths showed a recent increasing trend, indicating that the subsequent measures taken to cope with the outbreak have been insufficient and ineffective. The general trend of deaths in Iran is similar to the world's, but it shows lower volatility. Change detection of COVID-19 risk maps with a random forest model for the period from March 11th to March 18th showed an increasing trend in COVID-19 in Iran's provinces. Evaluating variable importance with the LASSO MLT indicated that the most important variables were distance from bus stations, bakeries, hospitals, mosques, ATMs (automated teller machines), and banks, and the minimum temperature of the coldest month.

Conclusions: We believe that the risk maps provided by this study are the primary, fundamental step for managing and controlling COVID-19 in Iran and its provinces.
Phylogenetic analysis suggests that coronaviruses (CoV) belong to the Coronavirinae subfamily, which comprises the Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus genera 1, 2 . Of the four genera, alpha and beta coronaviruses result in respiratory illness in humans and gastroenteritis in animals, and gamma and delta affect birds 3 reported in other countries 5, 6 . SARS-CoV infected more than 8000 people and caused 774 deaths 7 , but MERS-CoV showed higher infection rates and lower fatality rates 6 . The pathogenic behaviour of SARS-CoV and MERS-CoV was not limited to human communities, and severe respiratory syndrome was reported in animals 8 . Analysis of the genetics of coronaviruses demonstrated that the receptor-binding motif of SARS-CoV has mutated in humans and wild animals (bats and civets) 9 . A recombinant SARS-CoV was detected in bats and transmitted to people through civets in Guangdong Province 3 11 . These three emerging infectious diseases (SARS-CoV, MERS-CoV, and COVID-19) spreading globally are caused by β-coronaviruses, which mainly infect bats but are also found in camels and rabbits 7 . There is currently neither a vaccine against COVID-19 nor any specific, proven antiviral medication 12, 13 , and this makes it a severe global threat. Coronaviruses are a continuous pandemic threat that has spread to more than 212 countries and 2 Therefore, health boards, governments, and public services need to co-operate globally to prevent its spread. Many publications have addressed the new coronavirus in terms of the clinical characteristics, immunological studies, and international spread of COVID-19 [5] [6] [7] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] .
Only a few studies have investigated the spread or produced risk maps, whether during the COVID-19 outbreak, for similar coronavirus respiratory syndromes 6, 26 , or even as risk assessments of other viral diseases globally [27] [28] [29] [30] [31] [32] . Given the clinical severity of COVID-19 infections, the extent of the outbreak, and public concern, it is important to establish accurate epidemiological limitations, so robust information is essential for inputs into models. In general, many factors influence each pandemic disease, including environmental, agro-ecological, and meteorological variables. Each disease, and the risk mapping related to it, has its own effective factors, which can be connected to climate, water, animals, humans, and even soils [33] [34] [35] . It is therefore important to select the most influential factors for each pandemic disease. It is also vital to rapidly develop robust information using unbiased and reliable methods to provide situational awareness and to improve the response to the pandemic 6 . The aim of this study is to conduct spatial modeling, risk mapping, change detection, and outbreak trend analysis of COVID-19 in Iran (Fig. 1) to produce the first comprehensive investigation to assist with the management and control of the COVID-19 crisis.

The study area is the country of Iran, with an area of 1,648,195 square kilometres, located between 25°3′50" and 39°46′16" N and between 44°2′19" and 63°19′18" E. Iran borders Armenia, Azerbaijan, and Turkmenistan on its north, Afghanistan and Pakistan on its east, Turkey and Iraq on its west, and the Persian Gulf and Oman Sea to the south. At various times, Iran has faced a number of serious infectious diseases, many of which have been successfully controlled.
Plague is a pandemic disease that still breaks out in some regions of the world; it affected Iran at various times between 1829 and 1966, with an estimated two million deaths 36 . Between January 2000 and September 2010, 738 cases of Crimean-Congo Haemorrhagic Fever (CCHF) with 108 deaths were described in Iran 37 . Tuberculosis was well controlled in Iran. The incidence of tuberculosis in Iran is 17/100,000 population, and its prevalence was 23/100,000 population. It has a mortality rate of 1.8/100,000 people 37, 38 . The annual incidence of malaria in Iran was estimated to be 0.14-8.74/1000 people in 2010. According to the WHO, there were nearly 70,000 confirmed cases of malaria in Iran in 2010 37, 39 .

This study undertakes four steps: 1) comparing Iranian coronavirus data with other countries; 2) predicting the trends of deaths from COVID-19 using regression; 3) spatial modelling, risk mapping, and change detection of COVID-19 using the random forest (RF) machine learning technique (MLT); and 4) validation of the modelled risk maps (Fig. 2).

First, we assess the growth rates (GRs) of active cases and deaths in Iran. To do this, the data were extracted from the daily reports produced by Iran's Ministry of Health and Medical Education (IMHME). All active cases, deaths, and recoveries were compiled in Excel. The correlation between the number of active cases and the number of deaths was analysed using Spearman's Rho. Another analysis examined the relationships between the ages and sexes of active cases and deaths. Following the IMHME approach, nine age classes were used: 1) <=9, 2) 10-19, 3) 20-29, 4) 30-39, 5) 40-49, 6) 50-59, 7) 60-69, 8) 70-79, and 9) >=80. All analyses were done using Excel, SPSS, SAS, and ArcGIS.
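The Spearman's Rho correlation between active cases and deaths described above can be illustrated with a minimal sketch. The study itself used Excel, SPSS, and SAS; the pure-Python version below is only an assumption-laden illustration, and the daily counts are hypothetical values, not IMHME data. Spearman's Rho is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical daily counts for illustration only (not IMHME data).
active_cases = [120, 180, 260, 410, 390, 520]
deaths = [4, 6, 9, 15, 13, 21]
rho = spearman_rho(active_cases, deaths)
```

Because both hypothetical series rise and fall on the same days, their ranks coincide and the coefficient is at its maximum; real data would generally give a value strictly between -1 and 1.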
The data for Iran were compared to data describing the experiences of other countries, particularly the 24 countries with the highest confirmed COVID-19 death rates. The data for the other countries were collected from the "Worldometer" page (https://www.worldometers.info/coronavirus/) 40 . Also, the coronavirus cases of six continents (Europe, North America, Asia, South America, Africa, and Oceania) were compared in terms of total active cases, total recovered, and total deaths 40 . Simple correlation coefficients for the number of infected cases in Iran's 31 provinces were calculated with SAS software. A heatmap was created using cluster analysis and shiny heatmap tools 41 . The data used for heatmap construction were normalized based on Z-scores (Eq. 1):

$$Z = \frac{X_i - \bar{X}}{\sigma} \quad (1)$$

where $X_i$, $\bar{X}$, and $\sigma$ are the raw data, mean, and standard deviation for each tested trait, respectively. The number of infected cases (y) and the number of days (x) after the first report of coronavirus in each province were subjected to regression to identify the best model parameters. The number of infected cases (y) was also regressed on province population (x).

The trend of deaths was captured by a cubic or third-degree polynomial specification of the form:

$$h(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3$$

where $h(t)$ represents the death cases in a day, $t$ denotes the days starting from the 19th of February, and $\beta_0, \ldots, \beta_3$ are the coefficients to be estimated. This model was extracted from the equation applied to estimate total deaths as the accumulation of daily cases. Other specifications, including quadratic or fourth-degree polynomial forms, were examined, but it was determined that the cubic form produced the most accurate predictions 42 . We also used an autoregressive moving-average (ARMA) model to compare the autoregressive (AR) and moving-average (MA) processes that generated the data. An ARMA model of order (p, q) can be written in its standard form as 43 :

$$y(t) = c + \sum_{i=1}^{p} \phi_i\, y(t-i) + \varepsilon(t) + \sum_{j=1}^{q} \theta_j\, \varepsilon(t-j)$$

where $y$ is the dependent variable, $\varepsilon$ is the white noise stochastic error term, $c$ is a constant, and $\phi_i$ and $\theta_j$ are the AR and MA coefficients.
In this model, y denotes the total number of deaths and t is the number of days since the date on which the first death was reported. In the case of significant fluctuations in the numbers of deaths per day, a volatility model, the autoregressive conditional heteroskedasticity (ARCH) model, was applied. For order p, it can be presented in its standard form as:

$$\sigma^2(t) = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, \varepsilon^2(t-i)$$

where $\sigma^2(t)$ is the one-period-ahead forecast variance, obtained from past information, $\varepsilon(t-i)$ are the lagged error terms, and $\alpha_0$ and $\alpha_i$ are the coefficients to be estimated.

One of the most important steps in spatial modelling and risk mapping is to prepare data matrices that include the dependent variable (the total number of cases) and sets of independent (or predictor) variables. In this study, data describing several anthropogenic and climatic factors were compiled and mapped. These variables included: distances from bus stations, bakeries, hospitals, mosques, automated teller machines (ATMs), banks, fuel stations, and attraction sites; the density of cities; human footprint; distance from roads; the density of villages; minimum temperature of the coldest month (MTCM); maximum temperature of the warmest month (MTWM); precipitation of the wettest month (PWM); and precipitation of the driest month (PDM) (Fig. 3 (a-p)). from WorldClim datasets (https://www.worldclim.org/data/index.html) 47 , 48 to spatially model COVID-19. To account for the effects of anthropogenic factors that may increase infection risk through contact with contaminated surfaces or person-to-person contact, public spaces of primary importance were included in the modelling. These data were acquired from Open Street Map (https://www.openstreetmap.org/) 49 . The Euclidean distances to the most critical social concentration spaces and facilities, including banks, bakeries, ATMs, bus stations, fuel stations, hospitals, and mosques, were measured using ArcGIS Spatial Analyst Tools.
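The Euclidean distance layers described above were produced with ArcGIS Spatial Analyst Tools. As a rough illustration of what such a layer contains, the following sketch computes, for every cell of a small grid, the straight-line distance from the cell centre to the nearest facility. The grid size, cell size, and facility coordinates are hypothetical, the coordinates are assumed to be planar (projected), and the brute-force search is chosen for clarity rather than speed.

```python
import math


def nearest_distance(cell, facilities):
    """Straight-line distance from one cell centre to the closest facility."""
    return min(math.dist(cell, f) for f in facilities)


def distance_raster(n_rows, n_cols, cell_size, facilities):
    """Build a row-major grid of distances to the nearest facility."""
    raster = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Cell centre in the same planar units as the facility coordinates.
            centre = ((c + 0.5) * cell_size, (r + 0.5) * cell_size)
            row.append(nearest_distance(centre, facilities))
        raster.append(row)
    return raster


# Hypothetical hospital locations in kilometres on a projected grid.
hospitals = [(0.5, 0.5), (3.5, 2.5)]
raster = distance_raster(n_rows=3, n_cols=4, cell_size=1.0, facilities=hospitals)
```

Each predictor layer (banks, bakeries, mosques, and so on) would be a separate raster of this kind; cells that contain a facility receive a distance of zero.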
To evaluate the effects of settlements and of infrastructure at a broader scale, we calculated geographic distances to road networks, as well as the densities of cities and towns. Also, to factor in the effects of other human features on the outbreak, the human footprint layer was integrated into the models. This layer is derived from variables related to human development (e.g., population, electric power) combined into one seamless layer at a spatial resolution of ~1 km 50 . Because of the coarse precision of the human footprint map, we also included the village density across the country, using a kernel density function applied to a village locality layer obtained from a topographic map of the country at a scale of 1:50,000.

RF 51 is a widely used, robust MLT that combines many trees to attain superior classification accuracy, since single trees produce weak results owing to overfitting and bias 52 . RF is highly efficient for handling obscure and unknown data and functions well even if a dataset is large and intricate 53 . RF can be run using the 'randomForest' package 66 in R software.

LASSO is based on ridge regression 67 and the non-negative Garrote 68 . It regularizes, manages collinearity, and performs feature selection 67, 69 . In regularization, the model sets an upper limit on the sum of the absolute values of the coefficients of the regression model. If the sum exceeds the limit, the model shrinks the coefficients by penalizing them (l1 norm) with a shrinkage factor, making some coefficients equal to 0 70 . In particular, it minimizes the residual sum of squares (RSS) subject to an l1 penalty term, which in its standard form is:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \right\}$$

where $y_i$ is the response for observation $i$, $x_{ij}$ is the value of predictor $j$, $\beta_j$ are the regression coefficients, and λ refers to the tuning parameter determined by cross-validation. However, the nonzero coefficients are also chosen in the variable-selection procedure to reduce the prediction error.
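To make the shrinkage behaviour of LASSO concrete, the sketch below fits a two-predictor model by coordinate descent with soft-thresholding. This is not the procedure used in the study, whose software for LASSO is not stated; it is a minimal illustration under several assumptions: the predictors and response are centred so that no intercept is needed, the objective is scaled as (1/2n)·RSS + λ·Σ|β_j| (a common rescaling of the penalized RSS), and the variable names and data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=100):
    """Minimise (1/2n)*RSS + lam*sum(|beta_j|) for centred data, no intercept."""
    n = len(y)
    beta = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, xj in enumerate(columns):
            # Partial residual: response minus all predictors except j.
            partial = [
                y[i] - sum(beta[k] * columns[k][i]
                           for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(xj[i] * partial[i] for i in range(n)) / n
            scale = sum(v * v for v in xj) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Hypothetical centred predictors: one drives the response, one is unrelated.
distance_to_bus = [-1.5, -0.5, 0.5, 1.5]
unrelated = [1.0, -1.0, -1.0, 1.0]
cases = [2.0 * v for v in distance_to_bus]
beta = lasso_coordinate_descent([distance_to_bus, unrelated], cases, lam=0.1)
```

In this toy setting the informative predictor keeps a coefficient slightly below its true value of 2 (the shrinkage), while the unrelated predictor is set exactly to zero, which is the feature-selection property used to rank variable importance.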
The model is suitable for several statistical problems and multi-dimensional problems owing to its accurate predictions and ease of interpretation 71 .

Data for the 24 countries with the highest confirmed total cases (more than 50,000), deaths, total recovered, active cases, serious/critical, and test rates per 0. The number of infected patients per 100,000 inhabitants and the cumulative curves for Iranian provinces (Fig. 7) are fairly high, particularly for residents of Qom County. The rate of infection per 100,000 was highest in Semnan Province. In terms of the outbreak among age groups and by sex (Fig. 8), the highest number of active cases was among 50-59-year-olds. In this cohort, the percentages of women (23.7%) and men (21.6%) were similar. The group with the second highest risk was the 60 to 69-year-old cohort. The percentages of active cases in this class were 20.2% and 18.2% for women and men, respectively. The age cohort with the fewest cases was children younger than 9 years old. There was no discernible difference between the sexes in this age group. Women are at greater risk than men (Fig. 9). There are more deaths among the elderly than among youth. The age group with the greatest number of deaths among women was the younger-than-39 group. However, the number of men who died in this group exceeded the number of women.

The number of infections in Isfahan Province was strongly correlated (above 0.80) with the number of infections in Semnan, Zanjan, and Yazd provinces (Table 3). The heat map groups the provinces by similarity using color intensity (COVID-19 contamination rate) (Fig. 10). The regression models suggested a cubic trend for the relationship between province population and infections (Fig. 13). A regression model was used to determine trends in death rates in Iran (Table 4). The explanatory variables, which are powers of the number of days elapsed since the outbreak began, explain more than 99% of the variation in the patterns of deaths.
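The cubic regression of deaths on days since the outbreak, and the share of variation it explains, can be sketched as an ordinary least-squares polynomial fit. The authors' own estimation software and data are not reproduced here; the sketch below solves the normal equations with plain Gaussian elimination, and the daily death series is synthetic, generated from a known cubic purely so that the recovered coefficients can be checked.

```python
def fit_polynomial(t, y, degree=3):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    n = degree + 1
    # Normal equations A b = rhs with A[i][j] = sum t^(i+j), rhs[i] = sum y*t^i.
    a = [[sum(ti ** (i + j) for ti in t) for j in range(n)] for i in range(n)]
    rhs = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (rhs[i] - tail) / a[i][i]
    return coeffs


def predict(coeffs, ti):
    """Evaluate the fitted polynomial at day ti."""
    return sum(b * ti ** k for k, b in enumerate(coeffs))


def r_squared(t, y, coeffs):
    """Share of the variation in y explained by the fitted polynomial."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(coeffs, ti)) ** 2 for ti, yi in zip(t, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic daily deaths from a known cubic, for illustration only.
days = list(range(1, 13))
daily_deaths = [2 + 3 * d + 0.5 * d ** 2 - 0.02 * d ** 3 for d in days]
coeffs = fit_polynomial(days, daily_deaths)
fit_quality = r_squared(days, daily_deaths, coeffs)
```

The normal equations become poorly conditioned for long series and high powers of t, so for a real 116-day series a QR-based routine from a numerical library would be the safer choice.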
The Ljung-Box Q-statistics (Table 4) indicated that the residuals are not significantly correlated. In addition, the Jarque-Bera test of normality shows that the residuals are normally distributed. Evaluation of prediction accuracy (Fig. 14) determined that the predicted values were, to a great extent, able to keep pace with the actual values. The main point is that deaths are increasing over the prediction horizon; however, the trend shows two turning points. One turning point is observed around mid-April. In other words, based on the data applied to estimate the regression model, the death trend slowly tends to reveal the turning point. Regarding the global context, for instance, for the incidence of SARS 78 , HAV 79 , ARI 80 however; the current situation is far from that, indicating that there has been an inadequate impact and there is a lot of room for improvement.

Both the global and Iranian views are presented with a fourth-degree polynomial specification (Fig. 15). For both, deaths are increasing exponentially, but the rate of increase is steeper for Iran at the beginning. This has been examined more deeply and quantitatively using ARMA time-series estimation (Table 5). These models may show the behaviour of the variables over a specific time horizon. To ensure comparable models, a 116-day time horizon was used for both. This is the time period for which data are available for Iran (19th of February to 14th of June). The world model is generated by an AR (31,2,8) process (Table 5), while the Iranian model of the death trend was obtained with a moving average, resulting in an ARMA (2, 1). However, the absolute values of the AR terms for Iran are higher than those of the world model, indicating a trend of faster increase. Also, the death trend is characterized by volatility, since a moving average process (Iran) and an autoregressive conditional heteroskedasticity (ARCH) process (World) were determined.
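The ARCH process identified for the world series models volatility by letting the forecast variance depend on recent squared errors. A minimal sketch of that recursion is given below, assuming the textbook ARCH(p) form in which the one-period-ahead variance equals a constant plus a weighted sum of the p most recent squared residuals. The residuals and coefficients are hypothetical and are not the estimates reported in Table 5.

```python
def arch_variance(residuals, alpha0, alphas):
    """One-period-ahead conditional variances for an ARCH(p) process.

    The value at position t - p uses residuals t-1, ..., t-p; the last
    value is the forecast for the period just after the sample.
    """
    p = len(alphas)
    variances = []
    for t in range(p, len(residuals) + 1):
        lagged_squares = [residuals[t - i] ** 2 for i in range(1, p + 1)]
        variances.append(
            alpha0 + sum(a * e2 for a, e2 in zip(alphas, lagged_squares))
        )
    return variances


# Hypothetical residuals and ARCH(1) coefficients, for illustration only.
residuals = [1.0, -2.0, 0.5]
variances = arch_variance(residuals, alpha0=0.2, alphas=[0.5])
```

A large shock (here the residual of -2.0) raises the next period's forecast variance, which is how the model captures clusters of volatile days.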
These processes model volatility. Given the significant coefficients for these variables, deaths in the World tend to fluctuate more than those occurring in Iran. This is not easily captured in the trends (Fig. 15). The diagnostic statistics indicate that the models are acceptable, since the Q-statistics indicate that the residuals are not significantly correlated. The Jarque-Bera statistic supports the normality of residuals at a conventional significance level for the Iran specification. However, for the World specification the residuals are not normally distributed; thus, one needs to exercise caution. All coefficients are significant at the 99 percent level.

As stated in the methods section, the LASSO MLT was used to determine the relative importance of the variables and the impact of each factor on the COVID-19 outbreak in Iran (Fig. 16). Accordingly, the most important variables were the distances from bus stations, bakeries, hospitals, mosques, ATMs, and banks, and MTCM. On the other hand, the outbreak was least related to village density, PWM, and PDM. In the current study, the RF MLT was used for spatial modelling and mapping of COVID-19 in Iran twice, for March 11th and March 18th. To run the RF, the numbers of trees and variables were set to 1000 and 4, respectively. The results of the confusion matrix (Table 6) Among that set, almost 4,940 cases were predicted in the correct class, whereas 323 cases were predicted in the wrong class. The out-of-bag (OOB) error rate was 21.14%, meaning that the accuracy on the training dataset (the modelling process) is 78.86%. Finally, the COVID-19 risk map ( Fig. 17 (a-b) Results of the validation of the COVID-19 risk maps ( Fig. 18 The main policy implication is that deaths, to some extent, may be limited by using more effective measures. These measures may be more important and more effective, particularly when the measures taken by countries that experienced the outbreak early are taken into consideration.
We speculate that the risk map and analyses provided in this study may be the first and most important step in the future management and control of COVID-19 in Iran and in its provinces.

We declare no competing interests.

Shiraz University, Iran, Grant No. 98GRC1M271143.

All data and materials used in this work were publicly available. Ethical approval and individual consent were not applicable.

Table 1 Top 20 countries with high death percentage and active cases of > 100
Table 2 Correlation of total cases and deaths using Spearman's Rho calculator
Table 3 Correlation of number of virus infection among various Iran's Provinces
Table 4 Regression results for Covid-19 death cases of Iran
Table 5 The results of ARMA model for Covid-19 death cases of world and Iran
Table 6 Confusion matrix of the random forest model (On March 11th, 2020)
Table 7 The AUC value of COVID-19 risk maps using RF MLT

Table 4 Regression results for Covid-19 death cases of Iran is the significance level of the Ljung-Box statistics in which the first p of the residual autocorrelations is jointly equal to zero.

Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, People's Republic of China Aetiology: Koch's postulates fulfilled for SARS virus Origin and evolution of pathogenic coronaviruses Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia Articles nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study Coronavirus Disease 2019: Coronaviruses and blood safety S.
in Fields Virology Bat origin of human coronaviruses A comparative study on the clinical features of COVID-19 pneumonia to other pneumonias Severe acute respiratory syndrome-related coronavirus: The species and its viruses-a statement of the R&D Blueprint: Coronavirus disease (COVID-2019) R&D; accessed 23rd CEPI launches new call for proposals to develop vaccines against novel coronavirus, 2019-nCoV; accessed 23rd The COVID-19 epidemic Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV Immunological Studies 2020; 1-15 Middle east respiratory syndrome-corona virus (MERS-CoV) associated stress among medical students at a university teaching hospital in Saudi Arabia Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study Breakthrough: chloroquine phosphate has shown apparent efficacy in treatment of COVID-19 associated pneumonia in clinical studies Asymptomatic carrier state, acute respiratory disease, and pneumonia due to severe acute respiratory syndrome coronavirus 2 (SARSCoV-2): Facts and myths CoV-2) and coronavirus disease-2019 (COVID-19): The epidemic and the challenges China bans cash rewards for publishing papers Clinical findings in a group of patients infected with the 2019 novel coronavirus (SARS-Cov-2) outside of Wuhan, China: retrospective case series Landslide susceptibility modelling using GIS-based machine learning techniques for Chongren County Application of geographical information system-based analytical hierarchy process as a tool for dengue risk assessment Asian Application of the analytic hierarchy approach to the risk assessment of Zika virus disease transmission in Guangdong Forecasting incidence of dengue and selecting best method for prevention A cooperative approach to animal disease response activities: Analytical hierarchy process (AHP) and vvIBD in California
poultry Mapping urban and periurban breeding habitats of Aedes mosquitoes using a fuzzy analytical hierarchical process based on climatic and physical parameters Application of the analytic hierarchy process to a risk assessment of emerging infectious diseases in Shaoxing city in Southern China. (i) Connecting the public with soil to improve human health Global evaluation of heavy metal content in surface water bodies: a meta-analysis using heavy metal pollution indices and multivariate statistical analyses Effect of seasonal variation on bacterial inhabitants and diversity in drinking water of an office building Plague in Iran: its history and current status Infectious diseases in Iran: a bird's eye view World Health Organization. Regional health observatory: malaria, reported confirmed cases World Health Organization Ultra-fast low memory heatmap web interface for big data genomics Climate variability and salmonellosis in Singapore -A time series analysis Applied Econometric Times Series Absolute humidity and pandemic versus epidemic influenza Absolute humidity, temperature, and influenza mortality: 30 years of county-level evidence from the United States The role of absolute humidity on transmission rates of the COVID-19 outbreak Very high-resolution interpolated climate surfaces for global land areas Global terrestrial Human Footprint maps for Random forests Spatial modelling of gully erosion using GIS and R programing: a comparison among three data mining algorithms Land subsidence modelling using tree-based machine learning algorithms Congestive heart failure detection using random forest classifier Landslide susceptibility assessment in Lianhua County (China): A comparison between a random forest data mining technique and bivariate and multivariate statistical models Application and comparison of decision tree-based machine learning methods in landside susceptibility assessment at Pauri Garhwal area Breast cancer recurrence prediction using random forest 
model Modelling daily dissolved oxygen concentration using least square support vector machine, multivariate adaptive regression splines and M5 model tree Prediction of the landslide susceptibility: Which algorithm, which precision? Landslide susceptibility mapping using random forest and boosted tree models in Pyeong-Chang An application of ensemble random forest classifier for detecting financial statement manipulation of Indian listed companies High-resolution mapping of daily climate variables by aggregating multiple spatial data sets with the random forest algorithm over the conterminous United States Variable selection using random forests Newer classification and regression tree techniques: Bagging and Random Forests for ecological prediction Kögel-Knabner I. Digital mapping of soil organic matter stocks using Random Forest modelling in a semi-arid steppe ecosystem Breiman and Cutler's random forests for classification and regression A statistical view of some chemometrics regression tools Better subset regression using the nonnegative Garrote Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: simulation and application Overview of LASSO-related penalized regression methods for quantitative trait mapping and genomic selection Lasso for linear models The application of computational intelligence to landslide susceptibility mapping in Turkey Is multi-hazard mapping effective in assessment of natural hazards and integrated watershed management? 
Assessing and mapping multi-hazard risk susceptibility using a machine learning technique Application of learning vector quantization and different machine learning techniques to assessing forest fire influence factors and spatial modelling An overview of the epidemiological, ecological and preventive hallmarks of Argentine Hemorrhagic Fever (Junin virus) Reasons for the increase in emerging and re-emerging viral infectious diseases Has SARS infected the property market? evidence from Hong Kong Hepatitis A incidence, seroprevalence, and vaccination decision among MSM in Amsterdam, the Netherlands A computational approach to investigate patterns of acute respiratory illness dynamics in the regions with distinct seasonal climate transitions Different transmission patterns in the early stages of the influenza A (H1N1)v pandemic: a comparative analysis of 12 European countries Mapping Spread and Risk of Avian Influenza A (H7N9) in China Influenza A H5N1 and H7N9 in China: A spatial