key: cord-0835609-tckjdea9 authors: Shrivastav, Lokesh Kumar; Jha, Sunil Kumar title: A gradient boosting machine learning approach in modeling the impact of temperature and humidity on the transmission rate of COVID-19 in India date: 2020-11-04 journal: Appl Intell DOI: 10.1007/s10489-020-01997-6 sha: 58bb0845a6e0991ecb2b5868b8bbc2218b99fa5a doc_id: 835609 cord_uid: tckjdea9
Meteorological parameters were crucial and effective factors in past infectious diseases such as influenza and severe acute respiratory syndrome (SARS). The present study aims to explore the association between coronavirus disease 2019 (COVID-19) transmission rates and meteorological parameters. For this purpose, meteorological parameters and COVID-19 infection data from 28th March 2020 to 22nd April 2020 for different states of India were compiled and analyzed. A gradient boosting model (GBM) was implemented to explore the effect of the minimum temperature, maximum temperature, minimum humidity, and maximum humidity on the infection count of COVID-19. The optimal performance of the GBM model was achieved after tuning its parameters. The GBM achieved its best accuracy of R² = 0.95 for the prediction of active cases in Maharashtra and R² = 0.98 for the prediction of recovered cases of COVID-19 in Kerala and Rajasthan, India.
COVID-19 has spread very rapidly since its first reported case in December 2019 in Wuhan, China. According to the World Health Organization, it had infected over 3,181,642 people in 215 countries worldwide and resulted in 224,301 deaths by 1st May 2020 [1]. The common symptoms of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection identified so far from recognized cases include fever, tiredness, dry cough, sore throat, and diarrhea [2, 3].
However, the increasing number of asymptomatic patients in some countries poses a danger to society and a challenge for doctors and the health care system [4]. Some studies identify the local seafood market of Wuhan, China as a source of COVID-19, resulting in its transmission from bats to humans [1, 3, 5, 6]. Research on the transmission route of COVID-19 is still ongoing. In most cases of human-to-human transmission, the virus spreads through the respiratory tract via human contact at gatherings, in meetings with relatives and friends, and between patients and healthcare workers [7]. Besides surfaces, the presence of the coronavirus in blood and fecal swabs [8] and in the air [9] around hospital areas indicates transmission through multiple routes, which is another challenge for the healthcare system. Several approaches are in use for the detection of COVID-19, but an exact treatment approach is still lacking. Many drugs are being tested and several vaccines are still under development [10]; therefore, social distancing, isolation, following the instructions of the respective government organizations and doctors, and personal hygiene are among the precautions that reduce the spread of COVID-19. In India, 26,167 active cases, 9950 recovered cases, 1218 deaths, and 1 migrated case were reported up to 1st May 2020 [11].
It is commonly observed that the health of most people is affected by changes in climate conditions, as with seasonal colds and flu at the beginning of winter and summer. This is because such changes affect the transmission of most viruses. Similarly, climate conditions also affect the transmission rate of epidemic viruses. This has already been established in studies of previous epidemics.
Severe acute respiratory syndrome (SARS), which started in November 2002 and had mostly ended by July 2003, has a genetic sequence quite similar to that of COVID-19 and was affected by warm weather conditions [12]. The transmission rate of influenza correlates strongly with atmospheric conditions and increases significantly at low daily temperature and humidity [13]. Some recent studies [14] [15] [16] [17] [18] [19] [20] [21] established the effect of climate conditions on the transmission rate of COVID-19. However, it is hard to find any study of the impact of atmospheric factors, including temperature and humidity, on the transmission rate of COVID-19 in different states of India during the lockdown period.
Machine learning-based approaches have been widely implemented in health care systems for disease diagnosis, monitoring, and prediction to reduce the workload of doctors and hospital workers [22] [23] [24]. In some recent research reports, machine learning approaches have been implemented successfully in the identification of COVID-19 [25] [26] [27] [28] [29]. However, the use of machine learning with atmospheric factors for the prediction of COVID-19 has not been reported. With this motivation, the gradient boosting machine (GBM) approach has been implemented to establish the relationship between atmospheric factors (temperature and humidity) and the daily spread rate of COVID-19 in different states of India. The present study makes the following contributions: (a) it explores the correlation between atmospheric parameters and the transmission rate of COVID-19 in different states of India, (b) it predicts the active and recovered cases of COVID-19, and (c) it establishes an efficient tree-based machine learning approach to explore the effect of temperature and humidity on the transmission rate of COVID-19.
The meteorological data of all states of India were collected daily from the Indian Central Pollution Control Board (CPCB) [30] and the Indian Meteorological Department (IMD) [31] from 28th March 2020 to 22nd April 2020. The COVID-19 data were collected from the Ministry of Health and Family Welfare, Government of India [11], and an open-access source [32]. The meteorological parameters include the minimum temperature, maximum temperature, minimum humidity, and maximum humidity of all states of India. In addition, minimum pressure, maximum pressure, minimum wind speed, maximum wind speed, PM10, and PM2.5 were collected but not used in the analysis because they were less correlated with the COVID-19 information. The COVID-19 information includes daily new infection cases, active cases (accumulated total cases up to the previous day minus recovered cases minus deceased cases), recovered cases to date, and mortality to date. Finally, the meteorological parameters and COVID-19 information were combined for further analysis. Missing values of the meteorological parameters were imputed with median values. The variations of the imputed values of minimum and maximum temperature and of minimum and maximum humidity are shown in Figs. 1 and 2, respectively. The measurement sample represents the total number of measurements of temperature and humidity in different states of India over 26 days. The four meteorological parameters were used as inputs, and the active and recovered cases of COVID-19 were used, independently, as the outputs of the GBM approach. The 26 days of collected data comprise a total of 702 instances, of which 467 (about 2/3 of the instances of all states) were used in training and 235 (about 1/3 of the instances of all states) were used in testing.
Fig. 2 The variation in the minimum and the maximum humidity
3 Gradient boosting machine (GBM) learning approach
GBM is a forward learning ensemble model used to solve regression as well as classification problems. Rather than relying on a single predictor, it combines many weak predictors sequentially into a stronger one. It improves on the single decision tree: each successive tree is evaluated comparatively, using the structure score, gain calculation, and increasingly refined approximations, to build an optimal set of tree structures. The prediction performance of GBM can be boosted by invoking an additional classifier. This modification optimizes the accuracy of the tree without affecting its speed. It also provides easily distributable and parallelizable features with an effortless environment for model tuning and selection. This version of GBM is capable of handling big data with optimal accuracy. It has rarely been used in COVID-19 prediction modeling. The H2O package in R [33] is used in the present study to implement the GBM approach. The GBM model was optimized for the number of trees (k) = 1, 2, 3,…50. The maximum number of trees K = 50 was selected arbitrarily. The algorithm table of GBM is as follows [34].
Algorithm Table of Gradient Boosting Machine (GBM)
Four atmospheric parameters, including minimum temperature, maximum temperature, minimum humidity, and maximum humidity, were used in the analysis. Specifically, to reduce the computational complexity, the average of the maximum and minimum temperature and the average of the maximum and minimum humidity were used as inputs to the GBM model to predict the number of recovered and active cases in all states and also in some individual states of India. The ANOVA results for the atmospheric parameters and the active and recovered cases of COVID-19 are shown in Table 2.
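The boosting idea described in this section can be illustrated with a minimal sketch. It is not the H2O implementation used in the study: it assumes squared-error loss, depth-1 regression trees (stumps), and a learning rate of 0.1, and the function names and toy numbers are hypothetical. Only the two averaged inputs (temperature, humidity) and the 50-tree limit follow the text.

```python
from statistics import mean


def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: the single split minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = mean(left), mean(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no split possible: use one constant leaf
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_gbm(X, y, n_trees=50, learning_rate=0.1):
    """Squared-error gradient boosting: each stump fits the current residuals."""
    f0 = mean(y)  # initial constant prediction
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return f0, learning_rate, stumps


def gbm_predict(model, row):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Hypothetical toy rows: [average temperature, average humidity] -> case count.
X_toy = [[20.0, 30.0], [22.0, 35.0], [25.0, 40.0],
         [28.0, 45.0], [30.0, 50.0], [33.0, 55.0]]
y_toy = [5.0, 8.0, 12.0, 20.0, 26.0, 40.0]
model = fit_gbm(X_toy, y_toy, n_trees=50)
print(gbm_predict(model, [26.0, 42.0]))
```

Because every stump is a least-squares fit to the current residuals, the training error of this sketch cannot increase as trees are added; a production library adds deeper trees, other loss distributions, and regularization on top of the same loop.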
The results of the ANOVA analysis indicate that the atmospheric and COVID-19 data sets included in the present study are significant and can be used for further processing. The GBM model was tuned over the number of trees, learning rate, number of folds, and distribution function (Gaussian, Tweedie, Huber, Laplace, Poisson, Quantile, and Gamma). The training prediction performance of the GBM model for the active and recovered cases of COVID-19 is summarized in Table 3. The performance of GBM is evaluated based on mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), mean residual deviance (MRD), and coefficient of determination (R²). The optimal prediction performance of the GBM was achieved for the Poisson distribution (R² = 0.99) with the number of trees = 50, learning rate = 0. Table 4 . The detailed statewise prediction results of the GBM using different distribution functions for Delhi, Maharashtra, and Gujarat are summarized in Tables 5, 6 and 7.
Tree-based machine learning approaches have proved quite useful in modeling small as well as big datasets in past studies [35, 36]. The GBM can be used for pandemic prediction and is highly efficient [36]. For this reason, the GBM approach was selected for modeling the transmission rate of COVID-19 in India using the atmospheric factors. India covers a large geographical region, which leads to a huge variation in the weather parameters (Table 1, Figs. 1 and 2). This is evident from the statistical description of the weather parameters: the minimum temperature fluctuates between −18°C and 41°C, the maximum temperature between −8°C and 44°C, the minimum humidity between 1% and 77%, and the maximum humidity between 7% and 99%. In addition, a huge variation in the number of COVID-19 cases (both active and recovered) has been noticed in different parts of India: from 0 to 4591 for active cases and from 0 to 789 for recovered cases (Fig. 3).
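The evaluation measures named above (MSE, RMSE, MAE as mean absolute error, and R²) follow standard definitions and can be computed as in the sketch below. The function name is hypothetical, and the mean residual deviance is left out because its formula depends on the distribution chosen for the model.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R^2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    # R^2 is undefined when the actual values are all identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Hypothetical values, for illustration only.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```

An R² close to 1 means the predictions explain most of the variance in the observed counts, while RMSE and MAE stay in the units of the case counts themselves.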
Considering the variations in the weather parameters and the number of COVID-19 cases mentioned earlier, a total of 702 instances from 27 different states of India over 26 days was used in the GBM analysis. The statistical analysis of the parameters of the dataset suggests that they are unequally distributed. The ANOVA test results (Table 2) reject the null hypothesis and suggest that all parameters of the datasets are significant. For those states in which the GBM predicts active and recovered cases of COVID-19 with high accuracy using the average temperature and humidity, the results indicate a minor effect of atmospheric factors on the transmission rate of COVID-19. For the remaining states, in which the GBM predicts active and recovered cases less accurately, the atmospheric factors might have a major effect on the transmission rate of COVID-19. The role of atmospheric factors such as temperature and humidity in the transmission rate of COVID-19 is still uncertain and may vary by location. Nevertheless, a negative correlation between the transmission rate of COVID-19 and temperature and humidity has been discussed in some recent studies. Ahmadi et al. [20] found a high transmission rate of COVID-19 in Iran at low humidity and temperature; Wang et al. [16] reported a low transmission rate of COVID-19 in China at high temperature and humidity; Qi et al. [19] described a negative correlation between the transmission rate of COVID-19 and the average temperature and average humidity; and Tosepu et al. [21] established a positive Spearman rank correlation (r = 0.392) between average temperature and cases of COVID-19 in Indonesia. In addition, the outcomes of the ARIMA model and polynomial function analysis [37] suggested future scope for humidity and other atmospheric factors in the prediction of COVID-19 cases in different geographic regions.
More data sets need to be combined and analyzed to reach a concrete conclusion about the impact of the weather parameters on the transmission rate of COVID-19. The GBM achieved high accuracy in predicting both active and recovered cases for some states of India, specifically Delhi, Maharashtra, and Gujarat, the three states worst hit by the pandemic, which have the maximum number of active cases compared with the rest of the states of India. Tables 5, 6 and 7 and Figs. 7, 8, 9 and 10 demonstrate the detailed performance of GBM using different distribution functions. The test results for Delhi are significant with the Poisson and Gaussian distributions, which reflect the actual recovery rate. The active cases captured by the Poisson and Huber distributions also reflect the real data, but some peaks show the spreading tendency. The prediction results for Maharashtra and Gujarat have high variability and show sudden peaks at irregular and short intervals, which also match the real conditions in these two states. The performance of the GBM approach implemented in the present study is comparable to or better than that of some previously implemented approaches for predicting COVID-19 transmission rates from weather parameters. The weather data were not available (NA) for some states at the time of collection. The NA data were replaced by the median value of the instances during the analysis, which may also be a cause of the poor performance of the GBM model in the prediction of COVID-19 cases for these states. The transmission speed of COVID-19 was very low in some states of the country before 15th April 2020, which resulted in the non-availability of COVID-19 data for them. This also affects the prediction performance of the GBM model. The performance of the Gaussian distribution-based GBM is compared with deep neural network and random forest (RF) models using a similar dataset.
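The median replacement of unavailable (NA) weather values mentioned above can be sketched as follows. The sketch assumes one median per parameter, computed over all observed instances of that parameter; the text does not say whether the median was taken per state or over the whole dataset. The function name and sample values are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None (NA) entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a median from")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical maximum-temperature readings with one missing day.
print(impute_median([30.0, None, 34.0, 32.0]))
```

Every filled entry carries the same value, so a parameter with many NA readings loses variance; this is consistent with the remark that imputation may weaken predictions for the affected states.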
The comparative experimental results (Table 8) suggest that GBM performs better than the other models. The deep neural network performs poorly in the prediction of active and recovered cases (R² equal to 0.22 and 0.02, respectively). The RF approach has average performance in the prediction of active and recovered cases (R² equal to 0.59 and 0.33, respectively). The comparative performance of the three models in terms of R² is as follows: GBM > RF > deep neural network. In addition, the GBM performs better than the other two approaches in terms of the other evaluation measures: MSE, RMSE, MAE, and MRD. The deep neural network is one of the most useful techniques in image processing and has achieved better performance in several past studies, such as emotion recognition using a combination of a deep convolutional neural network and a kernel learning classifier [38]. Moreover, strategies to improve the performance of deep neural networks on data from different experimental domains have been discussed in past studies, such as a training approach [39], generalized maxout networks [40], and transfer learning [41]. Deep learning performs better on categorical feature datasets, whereas tree-based learning performs better on dense numerical feature datasets [42]. Moreover, the performance of deep learning approaches varies with the nature and dimensionality of the dataset [43]. On relational datasets, its performance is inferior to that of tree-based learning algorithms [44]. This may be because a tree-based algorithm is prone to overfitting and gives better results in the case of high dimensionality. The COVID-19 dataset used in the present analysis is high-dimensional and relational in nature, which may be the reason for the poor performance of the deep neural network method.
The poor performance of the deep neural network in the present analysis, even after optimization of its parameters, may also be due to the small size, randomness, noise, and missing values of the dataset.
The present study established an association between the number of cases of COVID-19 and meteorological parameters in different states of India. The study implemented an efficient method of predictive modeling using the GBM-based machine learning approach. The experimental results suggest that the GBM model is capable of capturing the correlation between the cases of COVID-19 and atmospheric parameters. The maximum achieved values of R² and the minimum values of the errors of the GBM suggest a certain association between the atmospheric factors and the transmission rates of COVID-19 in some states of India, specifically Delhi, Maharashtra, and Gujarat. Future research will include additional meteorological parameters, using an efficient and robust machine learning approach, for a better understanding of the dependence of the transmission rate of COVID-19 on atmospheric conditions. The performance of the deep neural network in handling pandemic data also needs to be improved.
World Health Organization (2020) Coronavirus disease (COVID-19) Pandemic
COVID-19 and the cardiovascular system
A review of coronavirus disease-2019
Clinical characteristics of 24 asymptomatic infections with COVID-19 screened among close contacts in Nanjing
Epidemiology, causes, clinical manifestation and diagnosis, prevention and control of coronavirus disease (COVID-19) during the early outbreak period: a scoping review
Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission
The origin, transmission and clinical therapies on coronavirus disease 2019 (COVID-19) outbreak-an update on the status
Molecular and serological investigation of 2019-nCoV infected patients: implication of multiple shedding routes
Air, surface environmental, and personal protective equipment contamination by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) from a symptomatic patient
Research and development on therapeutic agents and vaccines for COVID-19 and related human coronavirus diseases
Ministry of Health and Family Welfare Government of India (2020) COVID-19 India
Environmental factors on the SARS epidemic: air temperature, passage of time and multiplicative effect of hospital infection
Influenza virus transmission is dependent on relative humidity and temperature
Effects of temperature variation and humidity on the death of COVID-19 in Wuhan
Role of temperature and humidity in the modulation of the doubling time of COVID-19 cases
Temperature significant change COVID-19 transmission in 429 cities
Association between ambient temperature and COVID-19 infection in 122 cities from China
Effects of temperature and humidity on the spread of COVID-19: a systematic review
COVID-19 transmission in mainland China is associated with temperature and humidity: a time-series analysis
Investigation of effective climatology parameters on COVID-19 outbreak in Iran
Correlation between weather and Covid-19 pandemic in Jakarta
Artificial intelligence in medicine
Artificial intelligence in healthcare: past, present and future
A comprehensive search for expert classification methods in disease diagnosis and prediction
Artificial intelligence (AI) and big data for coronavirus (COVID-19) pandemic: a survey on the state-of-the-arts
Artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct
Identification of COVID-19 can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine
COVID-19 and artificial intelligence: protecting health-care workers and curbing the spread
On the coronavirus (COVID-19) outbreak and the smart city network: universal data sharing standards coupled with artificial intelligence (AI) to benefit urban health monitoring and management
Central Pollution Control Board, Ministry of Environment, Forest and Climate Change, Government of India (2020) Air pollution
Government of India (2020) Meteorological Data
The H2O.ai Team (2015) h2o: R Interface for H2O
The elements of statistical learning: data mining, inference, and prediction
Supervised learning with decision tree-based methods in computational and systems biology
EGBMMDA: extreme gradient boosting machine for MiRNA-disease association prediction
Forecasting of COVID19 per regions using ARIMA models and polynomial functions
Convolutional MKL based multimodal emotion recognition and sentiment analysis
Performance improvement of deep neural network classifiers by a simple training strategy
Improving deep neural network acoustic models using generalized maxout networks
Transfer learning using rotated image data to improve deep neural network performance. In: International Conference Image Analysis and Recognition
DeepGBM: a deep learning framework distilled by GBDT for online prediction tasks
An empirical comparison of supervised learning algorithms
Xgboost: a scalable tree boosting system
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Acknowledgments The authors acknowledge the editor and eminent reviewers for their valuable comments and suggestions.
Availability of data and material Not applicable.
Conflicts of interest/competing interests The authors declare no conflict of interest.
Code availability Not applicable.