key: cord-0855729-0aa3j81z
authors: Ogunjo, S. T.; Fuwape, I. A.; Rabiu, A. B.
title: Predicting COVID‐19 Cases From Atmospheric Parameters Using Machine Learning Approach
date: 2022-04-01
journal: Geohealth
DOI: 10.1029/2021gh000509
sha: 3e45bd5b0a18d15cdc5d5485620d418f32ba7b62
doc_id: 855729
cord_uid: 0aa3j81z

The dynamical nature of COVID‐19 cases in different parts of the world requires robust mathematical approaches for prediction and forecasting. In this study, we aim to (a) forecast future COVID‐19 cases based on past infections and (b) predict current COVID‐19 cases from PM2.5, temperature, and humidity data, using four different machine learning classifiers (Decision Tree, K‐nearest neighbor, Support Vector Machine, and Random Forest). Based on RMSE values, the k‐nearest neighbor and support vector machine algorithms were found to be the best for predicting future incidences of COVID‐19 from past histories. From the RMSE values obtained, temperature was found to be the best predictor of the number of COVID‐19 cases, followed by relative humidity. Decision tree models were found to perform poorly in the prediction of COVID‐19 cases when particulate matter and atmospheric parameters were used as predictors. Our results suggest the possibility of predicting virus infection using machine learning. This will guide policy makers in proactive monitoring and control.

The long short-term memory model was found to show the best performance. A novel support vector regression model was developed to predict the spread, growth rate, and end of COVID-19 across different countries (Yadav et al., 2020). The bidirectional long short-term memory model was also compared with other state-of-the-art forecasting techniques and found to be more effective (Said et al., 2021).
Three artificial neural network algorithms, Radial Basis-Function, Fuzzy Cluster-Means, and Non-linear Autoregressive-Network with Exogenous Inputs, were used for the spatial forecasting of COVID-19 cases in Iraq (Yahya et al., 2021). A deep neural network was developed to predict COVID-19 cases across the United States and European countries using Gini coefficients, percentage of tested population, and urban population (Hashim et al., 2021).

The performance of machine learning algorithms is heavily dependent on the model predictors. A bidirectional relationship exists between COVID-19 infections and atmospheric parameters and aerosols. On the one hand, the spread of the virus and the associated non-pharmaceutical interventions have led to changes in atmospheric weather conditions and aerosol propagation in many parts of the world (Fuwape et al., 2021). On the other hand, changing weather has been reported to have a significant health impact on humans and animals (Orimoloye et al., 2019; Ropo et al., 2017). Temperature and particulate matter at lags of up to 15 days have been found to be associated with an increase in COVID-19 cases in Italy (Stufano et al., 2021). The initial outbreak of the pandemic in India has reportedly been associated with an increase in temperature and humidity (A. Pandey et al., 2022). A nonlinear relationship was observed between atmospheric parameters (temperature and humidity) and COVID-19 in several cities within the United States of America (Runkle et al., 2020). Associations between atmospheric parameters and aerosols and the incidence of COVID-19 have been confirmed in Algeria (Rahal et al., 2021), Egypt (Anis, 2020), Turkey (Şahin, 2020), and Indonesia (Tosepu et al., 2020).

The studies highlighted above considered 1-day lag cases as inputs to the machine learning model. However, this might not yield the best results for forecasting, as the virus is known to have a latency period of between 7 and 14 days.
Also, 1-day lag cases might not give sufficient information to the models. It is essential to consider n-day lag cases for better prediction and forecasting. Furthermore, these studies did not consider any other predictors. It has been reported that atmospheric parameters, including aerosols, are responsible for the spread of the virus. Hence, it is pertinent to investigate the prediction of COVID-19 cases using air pollutants and atmospheric parameters. In this study, we aim to investigate the performance of different machine learning classifiers in forecasting COVID-19 cases using the number of cases from the previous n days. We also investigate the performance of the machine learning classifiers with atmospheric parameters and air pollutants as predictors.

For this study, six locations within Nigeria were considered based on their geographical locations. The locations are classified into two groups: northern stations (Kebbi, Kano, and Abuja) and southern stations (Delta, Edo, and Osun). The northern and southern stations have different climatic regimes. Air pollution in the northern region is largely driven by dust from the Bodele region in Chad (Sunnu et al., 2013). The dust system, driven by large-scale oscillations, reaches the southern parts of the country (Anuforom, 2007) and is transported as far as the Amazon basin in South America (Koren et al., 2006). The weather dynamics of the southern part are driven largely by the Atlantic Ocean (Ogunjo et al., 2019). This is largely responsible for the low temperature range throughout the year (Eludoyin et al., 2014). The major sources of pollution in the southern part are biomass burning and gas flaring (Ezeh et al., 2017; Ologunorisa, 2001).
Daily COVID-19 cases for each of the locations were obtained from the Nigeria Centre for Disease Control (www.ncdc.gov.ng), while the atmospheric data (temperature and relative humidity) and particulate matter (PM2.5, PM1.0, PM10.0) were obtained from the ongoing campaign of the Centre for Atmospheric Research, National Space Research and Development Agency, using PurpleAir sensors. The PurpleAir sensors were provided courtesy of the Alliance for Education, Science, Engineering and Design in Africa (AESEDA), Penn State University, USA. The data for particulate matter and atmospheric variables were retrieved from the PurpleAir network (www.purpleair.com). The research covered the harmattan season in Nigeria, from 1 November 2020 to 31 March 2021.

A machine learning approach was chosen because of its wide application across fields, its ability to identify patterns, its independence from any specific distribution of the underlying data, and the reliability of its results. Four machine learning algorithms (Decision Tree, Random Forest, Support Vector Machine, and k Nearest Neighbor) were considered in this study. For all of the algorithms, 80% of the data was used for training and 20% for testing. The root mean square error (RMSE) was used as the test statistic because it penalizes large errors and has the same unit as the dependent variable.

In a Decision Tree (DT), a series of decisions based on given conditions is used to arrive at a conclusion. The internal nodes are the available choices at a particular point in the tree. The decision that subdivides the tree into n subsets is called the root node. The results from the root and internal nodes culminate in branches. Branching, or splitting, is based on a set of conditions. In this study, we used the Gini index for splitting.
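The Gini index named above as the splitting criterion can be illustrated with a minimal sketch. This is not the authors' code: the Gini index is defined over class labels, and the text does not state how case counts were mapped to classes, so the labels below (for example, binned case counts) and the function names `gini_impurity` and `split_impurity` are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c ** 2), where p_c is the share of class c."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


# Hypothetical labels: days binned into "low" and "high" case counts.
mixed_node = ["low", "low", "high", "high"]
print(gini_impurity(mixed_node))                         # 0.5
print(split_impurity(["low", "low"], ["high", "high"]))  # 0.0
```

A tree builder would evaluate `split_impurity` for each candidate condition and keep the one with the lowest value, so a split that separates the classes perfectly scores 0.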
The DT algorithm has been used to diagnose COVID-19 from chest X-ray imaging (Yoo et al., 2020), to quantify the impact of mandatory lockdown on COVID-19 cases (Karnon, 2020), and to predict COVID-19 cases and fatality based on age and gender. The DT approach has been found to be better than k Nearest Neighbor and other methods in predicting the recovery of infected patients from COVID-19 (Muhammad et al., 2020; Pourhomayoun & Shakibi, 2021).

k Nearest Neighbor (KNN) involves creating a sample space from the training data set. When a new sample is introduced into the sample space, its distance to the nearest neighbors in that space is estimated. The status of the sample is then determined by the neighbors in its vicinity. In this study, the nearest neighbors were estimated using the kd tree approach with a total of 48 neighbors. An enhanced version of KNN has been proposed for the improved detection of COVID-19 infections (Shaban et al., 2020). An improved COVID-19 detection method based on genome sequences was implemented using KNN (Arslan & Arslan, 2021). Using age and gender, the KNN classification method was found to be superior in predicting the recovery of infected patients (Romadhon & Kurniawan, 2021).

Support Vector Machines (SVM) are non-parametric approaches to the classification of data points. In SVM, a boundary line is drawn for the classification. Points close to this boundary are called support vectors. The classification is then performed by the linear combination of the boundaries (Yadav et al., 2020). In this study, the radial basis function was used as the kernel with a polynomial function of degree 3. The SVM has been coupled with particle swarm optimization for the detection of the COVID-19 virus from chest X-ray images (Dixit et al., 2021). The spread of COVID-19 across different regions of the world has been predicted using SVM (Yadav et al., 2020).
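For reference, the radial basis function kernel mentioned in the SVM description has the standard form K(x, z) = exp(-gamma * ||x - z||^2). The sketch below only illustrates that formula. The width parameter `gamma` and the function name are assumptions, since the text gives no kernel settings beyond the kernel type, and this form of the kernel has no polynomial degree, so the "degree 3" setting is omitted here.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2).

    gamma is a hypothetical width parameter; the source does not report it.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Identical points have similarity 1; similarity decays with distance.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))        # 1.0
print(rbf_kernel([0.0], [1.0]) > rbf_kernel([0.0], [2.0]))  # True
```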
The SVM algorithm has been deployed for the real-time prediction of COVID-19 infections, recoveries, and fatalities (V. Singh, Poonia, et al., 2020). For better performance, the SVM method has been coupled with least squares for the prediction of the COVID-19 trajectory (S. Singh, Parmar, et al., 2020).

Random Forest (RF) is a DT-based algorithm. Each decision is made from a randomly selected subset of the training data. The decisions from the various trees are then combined to produce the final output. In this study, 40 "trees" were used with the Gini measure. The RF algorithm was fine-tuned with AdaBoost for the prediction of infected patients' health (Iwendi et al., 2020). Spatio-temporal near-future prediction of COVID-19 was implemented worldwide using random forest with good results (Yeşilkanat, 2020). RF has been found to outperform other machine learning algorithms in the prediction of COVID-19 (Prakash et al., 2020). In India, RF was found to outperform other algorithms for the prediction of cases, fatalities, and recoveries (Gupta et al., 2021).

Using n-day lags as predictors, step-ahead predictions of COVID-19 cases were made at the different locations (Figure 1). The lag with the minimum RMSE value represents the number of previous days of case counts that the machine learning algorithm needs to make an informed prediction. In Kebbi State (Figure 1a), all the models showed the same lag value of 7 days. This means that values for the last 7 days are required by the models to make the best predictions in Kebbi State. In terms of RMSE, KNN and SVM presented identical values, while the largest error, 4 infections, was shown by RF. In Kano State (Figure 1b), DT and KNN were observed to have the same number of lags (5 days). However, KNN presented the lowest RMSE value, 6 cases, among the four models, while RF exhibited the worst estimate at 10 cases. All the models agree on the lag (4 days) at Abuja (Figure 1c).
The RMSE values were observed to be 67, 87, 86, and 65 for the DT, KNN, SVM, and RF models respectively. Thus, RF outperformed all the other models in Abuja. In Delta State, all the models have their minimum RMSE values at a 5-day lag except RF, which showed a 14-day lag. This implies that the RF algorithm needs much more information than the other algorithms for the best prediction in Delta State. In Delta State, the best RMSE value was obtained for KNN. In Edo State, KNN and SVM showed identical 6-day lags, while DT and RF showed identical 14-day lags. KNN and SVM also showed an identical 9-day lag in Osun State, while DT and RF showed 14-day and 8-day lags respectively. In Edo and Osun States, the RMSE values for KNN and SVM were found to be identical. Overall, KNN and SVM showed identical performance in three of the locations considered, KNN showed superior performance in two of the locations, and RF outperformed the other models in only one location (Abuja). Ranked by increasing RMSE, the locations are Kebbi, Edo, Kano, Delta, Osun, and Abuja.

The possibility of estimating COVID-19 cases from particulate matter (PM2.5) was considered at zero lag. In Figure 2, the linear relationship between the predicted and measured values is presented for the various locations under consideration. The RMSE values are shown in each plot. The greatest error was observed for Edo in all the machine learning algorithms considered. In Kebbi, KNN outperformed the other models with an error of about 9 cases, while DT had the worst performance. Kebbi was found to have the least error among all the locations. In Kano, DT exhibited the lowest RMSE value among the models, similar to the results in the number of cases. With an RMSE of about 25 cases, RF was the best performing model in Abuja. SVM showed similar performance, while the worst performing model was KNN.
The best performing models were observed to be KNN, RF, and RF in Delta, Edo, and Osun States respectively. Generally, RF was seen to have the greatest performance at zero lag, with superior results in four locations.

To determine the effect of lag on the performance of the models, the number of COVID-19 cases was predicted under different lags for the three predictors. The results are shown in Figure 3 and Table 1. The highest and lowest lags found were 14 days and 1 day respectively. In Kebbi State, all the models showed the same 7-day lag with PM2.5 as predictor, except RF, which showed a 2-day lag. The RMSE values were in the range 8.16-8.73, with KNN and DT reporting the best and worst performance respectively. This implies that KNN will give the best prediction of COVID-19 infections in Kebbi State, with an error of about 8 cases, given values for the 7 previous days. All the models were in agreement on the number of lags for temperature and humidity, at 3 days and 13 days respectively. RF and DT were the best performing models, with RMSE values of 3.08 and 4.25 for temperature and humidity respectively, while SVM had the worst performance for both parameters.

In Kano State, DT and RF are in agreement on the required number of lags for both PM2.5 and temperature. The highest and lowest lags with PM2.5 as predictor were observed in KNN and SVM respectively. With temperature as predictor, KNN showed the highest number of lags but the lowest RMSE value, emerging as the best performing model. In the case of humidity, KNN and SVM showed lags of 5 days, while RF had the highest lag.

There are a number of agreements in the number of lags for Abuja. The DT/KNN pair agrees for PM2.5, the KNN/SVM and DT/RF pairs each showed the same number of lags for temperature, and the KNN/SVM pair agrees for humidity. SVM, KNN, and RF were the best performing models for PM2.5, temperature, and humidity respectively in Abuja, while DT showed the worst performance for all the predictors.
In Abuja, temperature was found to be the best predictor, with an error of about 3 cases.

The performance of the models in the southern locations (Delta, Edo, and Osun States) was also considered. In Delta State, the models were in agreement on a lag of 5 when PM2.5 was considered as the only predictor, except DT, which showed a lag of 8. The RMSE values were found to be in the range 14.27-17.73, with SVM and DT having the lowest and highest values respectively. In the case of temperature in Delta State, KNN has the lowest lag while SVM has the highest lag. This implies that KNN requires information from 5 days to make the best predictions, while SVM needs 12 days of information. The best and worst performing models are SVM and DT respectively. For humidity, KNN and SVM need 4 days of information, while DT and RF require 7 days of information. In this case, KNN has the best performance, while DT has the worst performance of the four models considered.

DT, KNN, and SVM require PM2.5 information from the prior 7 days to effectively predict COVID-19 cases in Edo State, while RF requires 6 days' worth. The best result was obtained with SVM, with an error of about 24 cases. When predictions were made with temperature, DT and KNN were found to have a 1-day lag, while SVM and RF require lags of 4 and 3 days respectively. The RMSE values were between 2 and 3 cases for all the models, with KNN outperforming the others. All the models were in agreement with respect to humidity in Edo State, giving a lag of 4 days.

In Osun State, the models agree on the number of days for PM2.5 and temperature, at lags of 14 and 8 days respectively. In the case of humidity, DT reports the highest lag, 14 days, while KNN reports a 1-day lag.

Overall, the best performing model in the southern locations was SVM, which performed best in five of the nine cases, closely followed by KNN with four of the nine cases. In all the southern locations, DT had the worst performance of all the models considered.
However, in the northern locations, the best performing model was RF, closely followed by KNN. Furthermore, based on RMSE values, temperature is the better predictor, given the low values reported. It is closely followed by relative humidity, while PM2.5 is the worst predictor, with consistently high RMSE values in all locations considered.

The global COVID-19 pandemic requires several approaches to understand, mitigate, and curtail its impact on the world's population and economy. Both pharmaceutical and non-pharmaceutical approaches have been used to limit the spread of the virus within populations. Despite the lax enforcement of non-pharmaceutical interventions such as travel restrictions, the transmission level and fatality rate of the virus within Africa remain low compared to the rest of the world. This has been attributed to the youthful population (Njenga et al., 2020) and experience with pandemics (Musa et al., 2021). Merow and Urban (2020) posited that seasonality will drive the spread of COVID-19 globally. Considering the uncertainty surrounding the driving force of infections globally, it is pertinent to explore the role of atmospheric parameters and aerosols. Furthermore, the potential to predict future occurrences of infection from past information about infection numbers or atmospheric conditions will help in mitigating the spread within a location. This study has shown that, using machine learning algorithms, the number of infections can be predicted with minimal error from the case numbers of the previous 3-5 days. Furthermore, it was shown that information about atmospheric parameters and particulate matter from several previous days is a better predictor of COVID-19 cases than 1-day data. This information is important for better management and mitigation of the virus within the locations considered in this study.
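The lag-selection procedure behind these results (predict from the previous n days, score on held-out data by RMSE, and keep the lag with the lowest RMSE) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the 80/20 split is taken in time order without shuffling, the regressor is a brute-force nearest-neighbor average rather than the kd tree search with 48 neighbors reported in the text, and all function names and the toy series are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the target variable."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def make_lagged(series, n_lags):
    """Pair each value with the n_lags values that precede it."""
    X = [series[i - n_lags:i] for i in range(n_lags, len(series))]
    y = [series[i] for i in range(n_lags, len(series))]
    return X, y


def knn_predict(X_train, y_train, x, k):
    """Brute-force k-nearest-neighbor regression: mean target of the k closest rows."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)


def best_lag(series, max_lag=14, k=3, train_fraction=0.8):
    """Return (lag, rmse) for the lag in 1..max_lag with the lowest test RMSE."""
    scores = {}
    for n_lags in range(1, max_lag + 1):
        X, y = make_lagged(series, n_lags)
        split = int(len(X) * train_fraction)  # assumption: unshuffled 80/20 split
        X_train, y_train = X[:split], y[:split]
        X_test, y_test = X[split:], y[split:]
        if not X_train or not X_test:
            continue  # series too short for this lag
        predictions = [knn_predict(X_train, y_train, x, k) for x in X_test]
        scores[n_lags] = rmse(y_test, predictions)
    lag = min(scores, key=scores.get)  # ties resolve to the smallest lag
    return lag, scores[lag]


# Toy series (not real case data): after a 2 comes either 3 or 1, so one
# previous day is ambiguous while two previous days determine the next value.
toy_series = [1, 2, 3, 2] * 15
print(best_lag(toy_series))  # (2, 0.0)
```

In the toy series a single lag cannot separate the two continuations of the value 2, so its RMSE is positive, whereas two lags remove the ambiguity. This mirrors the point made above that a 1-day lag may carry too little information for the models.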
In this study, we have examined the potential of machine learning approaches in predicting COVID-19 cases using atmospheric parameters within selected locations in Nigeria. Four machine learning techniques were considered: decision tree, k nearest neighbor, support vector machine, and random forest. First, we determined the effect of the previous n days of COVID-19 cases on the forecast performance of the machine learning techniques. We found that the models reported the same number of lags in some locations; in other locations, different lags were obtained. Both KNN and SVM were found to have superior performance in this scenario. Furthermore, we evaluated the forecast capabilities of the machine learning techniques in predicting COVID-19 cases using atmospheric parameters as predictors at different lags. The decision tree method was found to have the worst performance of the four methods considered in this study. Our results present a new approach to the study of the COVID-19 virus by showing the amount of information required for effective prediction by machine learning algorithms. This is particularly important for the planning and management of the pandemic in tropical Nigeria. This research can be extended to consider predictors such as human mobility data. Furthermore, the suitability of other machine learning algorithms for the prediction of COVID-19 can be explored. Studying this approach in other locations across the world will create global synergy in the fight against the virus.
The effect of temperature upon transmission of Covid-19: Australia and Egypt case study
Stability analysis and numerical simulation of seir model for pandemic Covid-19 spread in Indonesia
Spatial distribution and temporal variability of harmattan dust haze in sub-sahel west Africa
A new Covid-19 detection method from human genome sequences using cpg island features and knn classifier
Forecasting of Covid-19 using deep layer recurrent neural networks (rnns) with gated recurrent units (grus) and long short-term memory (lstm) cells
Descriptive analysis of Covid-19 patients in the context of India
On forecasting the community-level Covid-19 cases from the concentration of sars-cov-2 in wastewater
Analysis of infectious disease problems (Covid-19) and their global impact. In (Chap. Dynamics of inter-community spread of COVID-19)
Cov2-detect-net: Design of Covid-19 prediction model based on hybrid de-pso with svm using chest x-ray images
Air temperature, relative humidity, climate regionalization and thermal comfort of Nigeria
Elemental analyses and source apportionment of PM2.5 and PM2.5-10 aerosols from nigerian urban cities
Covid-19 lethality in brazilian states using information theory quantifiers
Impact of Covid-19 pandemic lockdown on distribution of inorganic pollutants in selected cities of Nigeria
Prediction of Covid-19 confirmed, death, and cured cases in India using random forest model
Machine learning model for predicting number of Covid19 cases in countries with low number of tests. medRxiv
Covid-19 patient health prediction using boosted random forest algorithm
A simple decision analysis of a mandatory lockdown response to the Covid-19 pandemic. Applied Health Economics and Health Policy
Comparative analysis and forecasting of Covid-19 cases in various European countries with arima, narnn and lstm approaches
The bodélé depression: A single spot in the sahara that provides most of the mineral dust to the amazon forest
An evaluation of Covid-19 transmission control in wenzhou using a modified seir model
A modified seir model to predict the Covid-19 outbreak in Spain and Italy: Simulating control scenarios and multiscale epidemics
Seasonality and uncertainty in global Covid-19 growth rates
Predictive data mining models for novel coronavirus (Covid-19) infected patients' recovery
Addressing Africa's pandemic puzzle: Perspectives on Covid-19 transmission and mortality in sub-saharan Africa
Why is there low morbidity and mortality of Covid-19 in Africa?
Impact of large scale climate oscillation on drought in west africa
A review of the effects of gas flaring on the Niger delta environment
Implications of climate variability and change on urban and human health: A review
Analyzing effects of temperature, humidity, and urban population in the initial outbreak of Covid19 pandemic in India
Seir and regression model based Covid-19 outbreak predictions in India
An seir model for assessment of current Covid-19 pandemic situation in the UK. medRxiv
Predicting mortality risk in patients with Covid-19 using machine learning to help medical decision-making
Analysis, prediction and evaluation of Covid-19 datasets using machine learning algorithms
Impact of meteorological parameters on the Covid-19 incidence: The case of the city of oran
A comparison of naive bayes methods, logistic regression and knn for predicting healing of Covid-19 patients in Indonesia
Climate variability and heat stress index have increasing potential ill-health and environmental impacts in the east London, south Africa
Short-term effects of specific humidity and temperature on Covid-19 morbidity in select us cities
Impact of weather on Covid-19 pandemic in Turkey
Predicting Covid-19 cases using bidirectional lstm on multivariate time series. Environmental Science and Pollution Research
A new Covid-19 patients detection strategy (cpds) based on hybrid feature selection and enhanced knn classifier. Knowledge-Based Systems
Study of arima and least square support vector machine (ls-svm) models for the prediction of sars-cov-2 confirmed cases in the most affected countries
Prediction of Covid-19 corona virus pandemic based on time series data using support vector machine
Identifying mortality factors from machine learning using shapley values-a case of Covid19
Covid19 outbreak in lombardy, Italy: An analysis on the short-term relationship between air pollution, climatic factors and the susceptibility to sars-cov-2 infection
Back-trajectory model of the saharan dust flux and particle mass distribution in west
Correlation between weather and Covid-19 pandemic in jakarta, Indonesia. The Science of the Total Environment
When will the battle against novel coronavirus end in wuhan: A seir modeling analysis
Analysis on novel coronavirus (Covid-19) using machine learning methods
Covid-19 prediction analysis using artificial intelligence procedures and gis spatial analyst: A case study for
Spatio-temporal estimation of the daily cases of Covid-19 in worldwide using random forest machine learning algorithm
Deep learning-based decision-tree classifier for Covid-19 diagnosis from chest x-ray imaging
Deep learning methods for forecasting Covid-19 time-series data: A comparative study

The authors declare no conflicts of interest relevant to this study.

(1) [Dataset] The particulate matter (PM1.0, 2.5, 10.0) and atmospheric parameters (temperature and humidity) were obtained from purpleair.com (https://www.purpleair.com/sensorlist?exclude=true&nwlat=9.37944961738998&selat=5.6830568335046365&nwlng=1.5013524736630188&selng=9.784479996099975&sensorsActive2=604800).
(2) [Dataset] The COVID-19 statistics were downloaded from https://covid19.ncdc.gov.ng/state/.

We acknowledge the Centre for Atmospheric Research and their partners for promoting high standards of atmospheric observatory practice as well as the Federal Government of Nigeria for continuous funding of the Nigerian Space programme (www.carnasrda.com). All data used in this study are publicly available.