Fuzzy Association Analysis for Identifying Climatic and Socio-Demographic Factors Impacting the Spread of COVID-19
Sujoy Chatterjee, Deepmala Chakrabarty, Anirban Mukhopadhyay
Methods, 2021-08-22. DOI: 10.1016/j.ymeth.2021.08.005

Recently, the whole world witnessed the fatal outbreak of the COVID-19 epidemic originating in Wuhan, Hubei province, China. The World Health Organization (WHO) declared COVID-19 a pandemic due to its rapid spread across different countries within a few days. Several research works are being performed to understand the various influential factors responsible for spreading COVID-19. However, limited studies have examined how climatic and socio-demographic conditions may impact the spread of the virus. In this work, we aim to find the relationship of climatic and socio-demographic conditions, such as temperature, humidity, and population density of the regions, with the spread of COVID-19. The COVID-19 data for different countries, along with the corresponding socio-demographic data, are collected. Fuzzy association rule mining is then employed to infer the various relationships from the data. Moreover, to examine the seasonal effect, a streaming setting is also considered. The experimental results demonstrate various interesting insights into the impact of different factors on the spread of COVID-19.

Keywords: Association Rules, Climatic Factors, Socio-Demographic Factors

Coronavirus Disease 2019 (COVID-19) is an acute respiratory disease caused by a highly virulent novel coronavirus strain, SARS-CoV-2, which is a single-stranded RNA virus [1]. Due to its highly contagious nature, it rapidly propagates from person to person, causing a pandemic situation worldwide. Since its first appearance in late 2019, the ongoing COVID-19 pandemic has resulted in approximately 2,500,000 confirmed cases and more than 170,000 deaths in over 200 countries worldwide (https://coronavirus.jhu.edu/). In India, it has already caused more than 940,705 confirmed cases and about 98,678 deaths (https://www.mohfw.gov.in/). Most of the initial efforts in analyzing the spread and outcome of COVID-19 focus on fitting complex mathematical models to pandemic data for predicting the spread and peak of disease transmission [2]. These works have mainly used the numbers of cases reported daily on different COVID-19 tracker websites. It may be noted that the spread and vulnerability of COVID-19 vary from one place to another and from one person to another. Therefore, it is expected that certain climatic and socio-demographic factors, which vary from place to place and person to person, are likely to have a considerable effect on determining the outbreak intensity and outcomes of the COVID-19 pandemic. However, no systematic effort has been reported in the literature that deals with this issue. The proposed study addresses the problem of identifying important factors that explain the spread and outcome of COVID-19 through data-driven association analysis. Association analysis is a rule-based machine learning technique used to discover interesting associations between the variables or attributes of a large data set [3, 4]. It has been successfully utilized in biological information processing [5] and disease outbreak prediction [6].
In this proposed study, we mainly consider the problem of identifying important climatic and socio-demographic factors responsible for the intensity of the COVID-19 outbreak in a particular country, state, or region. The intensity can be measured in terms of the cumulative number of cases of the disease at a particular point in time. The climatic and socio-demographic factors, such as the average temperature of the region, humidity, and population density, are considered as the potential predictor variables. We develop customized association rule learning techniques based on the available pandemic data from various trackers of COVID-19 cases and publicly available patient details of India and other countries to infer possible associations among the above variables and their influence in predicting the intensity and outcomes of COVID-19. The extracted association rules can also be used for predicting future COVID-19 outbreaks and patients' survival chances.

Traditional association rule mining methods like the Apriori algorithm [3, 4] deal with categorical datasets, where each attribute-value pair is considered as an item. However, the attributes of the COVID-19 datasets, except a few, are mostly numeric and continuous. A possible solution to make these datasets ready for traditional rule mining is to categorize the quantitative attributes by defining some intervals at a coarser granularity. For example, the temperature attribute is continuous. It can be categorized by selecting two thresholds l and h so that a temperature t falls in interval low if t ≤ l, in interval medium if l < t ≤ h, and in interval high if t > h. However, it is difficult to determine these thresholds universally, as the choice is very subjective and the accuracy of the obtained rules is very sensitive to these threshold values. Therefore, a better alternative is to define these intervals as linguistic variables in terms of fuzzy sets, where different intervals may overlap.

In this proposed study, the work is performed in a two-fold manner. Initially, the COVID-19 data are collected for various countries with diverse characteristics over a particular time period. Then the corresponding socio-demographic data are collected for these countries. Finally, these data are combined into the raw dataset for our experiments. The values of different parameters (like temperature and humidity) change due to the seasonal effect. Therefore, to capture the behavior of the different attributes, various membership functions need to be defined. Most of the research conducted on COVID-19 data treats the dataset as static or time-varying data. As the data about death rates and recovery rates are generated each day, this situation is treated here as a streaming one. There is minimal research that deals with the dynamic behavior of COVID-19 data. This dynamic behavior arises from different preventive actions, like lockdowns and vaccination drives, taken by governments. Since the relationships between the different attributes can vary with time, the rules generated from them also vary. As time changes, the different climatic conditions like temperature, humidity, etc., also change. Therefore, the effect of various seasonal changes can be captured if a streaming setting is considered. In this method, a weighted average approach is used to update the rules in a streaming manner. While considering the streaming condition, the most persistent rules over different seasons are also observed.
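To make the fuzzification idea concrete, the sketch below defines overlapping membership functions for the three linguistic temperature classes. The breakpoints roughly follow the ranges quoted later for the New York experiment (T.low up to 12, T.mid from 10 to 19, T.high from 18 degrees Celsius); the trapezoidal shape and the exact slopes are assumptions for illustration, not the membership functions of Fig. 2.

```python
# Illustrative trapezoidal membership functions for the linguistic classes of
# temperature (T.low, T.mid, T.high). The breakpoints follow the ranges quoted
# later for the New York experiment; the trapezoidal shape is an assumption.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], flat on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def fuzzify_temperature(t_celsius):
    """Return the membership degree of a temperature in each fuzzy class."""
    return {
        "T.low":  trapezoid(t_celsius, -1e9, -1e9, 10.0, 12.0),  # full membership up to 10
        "T.mid":  trapezoid(t_celsius, 10.0, 12.0, 18.0, 19.0),
        "T.high": trapezoid(t_celsius, 18.0, 19.0, 1e9, 1e9),    # full membership above 19
    }

if __name__ == "__main__":
    # 18.5 C belongs partially to both T.mid and T.high (0.5 each), which is
    # exactly the overlap that crisp thresholds l and h cannot express.
    print(fuzzify_temperature(18.5))
```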
Moreover, the goal is to discard those rules that are no longer effective as new attribute values arrive with the seasonal changes. This work aims to develop customized association analysis techniques based on COVID-19 pandemic data for understanding the impact of various climatic and socio-demographic factors. The main contributions of the research are summarized below.
• Identifying interesting associations among climatic and socio-demographic factors responsible for the region-specific outbreak intensity.
• Predicting possible region-specific future outbreaks and understanding the relationships between various attributes in the fuzzy context.
• Identifying the relationships among different rules over different time-spans in the streaming mode and recognizing the persistent rules across different time windows.
• Providing a model that can be used as a customized tool to study both the static and the dynamic behavior of different socio-demographic conditions on COVID-19 data.

The rest of the paper is organized as follows. Section 2 describes the state-of-the-art approaches dealing with various COVID-19 data to understand the behavior of the virus spread. Section 3 depicts the proposed methodology. The experimental design and results are explained in Section 4. Section 5 concludes the article by providing some future directions.

A spectrum of research works is being carried out to combat the COVID-19 outbreak and forecast its possible spread [6, 7, 8]. One important task is the prediction of the actual spread of the pandemic. To deal with this issue, a set of classifiers, namely, SVM, logistic regression, neural network-based models, and two variants of Bayesian network classifiers, has been applied to a dataset of STEMI (ST-elevation myocardial infarction) patients [9]. In [10], the authors used the ARIMA model for forecasting the outbreak in 15 countries. The COVID-19 data, including the cumulative numbers of cases, deaths, and recoveries of the top 15 affected countries in April 2020, were considered, and a 30-day forecast of the COVID-19 outbreak was produced; the predictions indicated alarming outcomes, especially for European countries like Italy, Spain, and France. In [11], an objective-based approach was proposed for live forecasting of the continuation of COVID-19, assuming that the past pattern will continue in the future. Exponential smoothing models were adopted to forecast confirmed COVID-19 cases, because the exponential smoothing family provides good forecast accuracy, especially for short series. Among other studies, a modified SEIR model was used in [4] to model the COVID-19 pandemic considering quarantine and treatment. There, the authors also applied the particle swarm optimization (PSO) algorithm to the data of Hubei province to estimate the parameters of the SEIR model. Additionally, many research works introduced different mathematical models, such as SIR and SEIR models, for prediction and tested their performance on real data collected from different countries [12, 13]. Several analytical approaches to SIR models have been introduced in the literature. As different countries follow different strategies to control the spread of the epidemic, SIR-based models can be adapted with local assumptions specific to each country.
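As a point of reference for the compartmental models discussed in this section, the following is a minimal sketch of the basic SIR dynamics integrated with a simple Euler scheme. The transmission rate, recovery rate, population size, and initial conditions are illustrative placeholders only, not parameters estimated in any of the cited works; the cited variants (SEIR, SIRD, MSIR, etc.) add further compartments on top of this skeleton.

```python
# Minimal SIR sketch (Euler integration). beta, gamma, n_pop and the initial
# conditions are illustrative placeholders, not fitted values from the cited studies.

def simulate_sir(n_days, n_pop=1_000_000, i0=10, beta=0.3, gamma=0.1, dt=1.0):
    """Return day-wise (S, I, R) trajectories for the basic SIR model."""
    s, i, r = n_pop - i0, float(i0), 0.0
    history = [(s, i, r)]
    for _ in range(n_days):
        new_infections = beta * s * i / n_pop * dt   # dS/dt = -beta * S * I / N
        new_recoveries = gamma * i * dt              # dR/dt = gamma * I
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

if __name__ == "__main__":
    peak_day, peak = max(enumerate(t[1] for t in simulate_sir(300)), key=lambda p: p[1])
    print(f"Illustrative epidemic peak: ~{peak:.0f} infectious on day {peak_day}")
```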
It has been noticed that the success of SIR models largely depends on the context of the application and the adoption of proper assumptions [14, 15]. Hence, a large number of variants of the SIR model, namely SIS (susceptible-infectious-susceptible), SIRD (susceptible-infected-recovered-deceased), MSIR (maternally-derived-immunity-susceptible-infected-recovered), SEIR (susceptible-exposed-infected-recovered), SEIS (susceptible-exposed-infected-susceptible), etc., have become popular methods for predicting the spread of COVID-19. Another advanced version of the SIR model, namely the SIR-d model, takes into account two further important characteristics, namely vital dynamics and a constant population [15]. Zhang et al. [16] proposed a segmented Poisson model using a power law and an exponential law to study the COVID-19 outbreak in six major countries. In another study, a parsimonious model was proposed that identified the infected individuals and fixed various measures for the containment policy. Apart from this, deep learning-based techniques like Long Short-Term Memory (LSTM) models and curve fitting have also been proposed [17] for predicting month-wise COVID-19 cases; the impact of measures such as social isolation and lockdown duration during that time is considered there. Meanwhile, using the epidemiological SIR model, Khrapov et al. [18] developed a mathematical model for forecasting the epidemic development of COVID-19 in China. Another group of researchers extended the SEIR model to understand the importance of testing and quarantine policy [19]. A machine learning forecasting model for the COVID-19 pandemic in India was presented in [20], and a study predicting the growth and trend of the pandemic in countries worldwide was presented in [21]. Another study reported in [22] utilized the Support Vector Regression (SVR) model to foresee the spread of the novel coronavirus along with the number of patients who would recover. The authors also used Pearson's correlation measure to find the correlation between coronavirus cases and different weather conditions like temperature, humidity, and wind. Similarly, another recent work uses socio-economic indicators of a large number of countries to predict COVID-19 spread with various machine learning techniques [23]. In this work, the authors employed a univariate feature selection method, based on ANOVA, to choose the most relevant features (i.e., indicators). Thereafter, based on the spread of COVID-19, the countries were classified into four categories, and different traditional classifiers were used to classify the countries. However, the number of classes based on COVID-19 statistics may not always be four, and due to several external effects, COVID-19 cases can rise or fall abruptly, which makes the classification task more difficult. Along the same lines, another recent study [24] examines the environmental effects on COVID-19 spread in different regions of India and in New York City. In that work, a number of statistical analyses are performed to understand the effect of different environmental factors, such as temperature and relative humidity, on the COVID-19 cases per day, and the pairwise correlation between any two features is computed. Importantly, preventive measures like lockdowns and vaccinations can impose a dynamic nature on the system, and identifying those patterns over different time instances, along with their stability, is important.
However, as the dataset is treated as static, the complex relationships among a large number of environmental indicators in a streaming situation cannot be identified in that work [24]. Although most research works are aligned toward predicting the growth or the final size of the spread, very limited work has examined the effect of different socio-demographic factors on the spread of COVID-19. In the present study, we try to understand the complex relationships of various socio-demographic attributes affecting COVID-19 spread. Seasonal attributes like temperature and humidity, as well as the various measures taken by the governments of the respective countries (such as lockdowns, imposing various restrictions, and designating containment zones), may change over time. During this process, some old rules fade away and new rules are generated. So identifying the most stable rules across different time windows is important for the authority or decision maker to make proper decisions. However, finding this kind of relationship in a streaming situation, so as to infer persistent associations among various environmental factors in COVID-19 data, has not been studied in other research works. Therefore, in this work, in addition to understanding the complex relationships among various socio-demographic factors, the static and dynamic behaviors of the COVID-19 data are also captured by considering association rule mining in both situations, viz., static rule mining and streaming rule mining. This is expected to help adopt various measures to prevent the spread of the epidemic and take requisite healthcare initiatives.

In standard association rule mining methods like the Apriori algorithm, the items, i.e., attribute-value pairs, are categorical rather than numeric and continuous. Therefore, traditional association rule mining requires quantizing continuous attributes into crisp intervals (like low, medium, and high temperature) at a coarser granularity. As an alternative, the intervals can be expressed as linguistic variables using fuzzy sets, where the overlap between the different linguistic variables removes this discrepancy. A fuzzy association rule [25] then takes the form (X is A) ⇒ (Y is B), where A and B are sets of linguistic variables defined by fuzzy membership functions for the corresponding items in X and Y, respectively. An example of a fuzzy rule is (Temperature is high) ⇒ (Intensity is high). We aim to develop efficient fuzzy association rule learning algorithms to discover interesting fuzzy association rules from the region-wise COVID-19 pandemic data, where the different regions are considered as the transactions.

As mentioned above, let I be the set of items, let X, Y ⊆ I be itemsets, let X ⇒ Y be a rule, and let T be the set of transactions. The support quantifies how frequently an itemset occurs in the database: the support of an itemset X with respect to T is the proportion of transactions in the dataset that contain X, i.e., Supp(X) = |{t ∈ T : X ⊆ t}| / |T|. The confidence of X ⇒ Y is the proportion of transactions containing X in which Y also occurs, i.e., Conf(X ⇒ Y) = Supp(X ∪ Y) / Supp(X). For the region-wise data, we primarily consider the association rules in which the consequent is the intensity of the outbreak in a region.
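The two measures above can be sketched directly. In the snippet below, each region is represented as a set of fuzzy-class items; the item names and the toy transactions are invented purely to illustrate the support and confidence formulas, not taken from the paper's dataset.

```python
# Support and confidence as defined above, computed over a toy set of
# region-wise "transactions". The items and the data are hypothetical,
# used only to illustrate the two formulas.

def support(itemset, transactions):
    """Supp(X) = |{t in T : X subset of t}| / |T|."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conf(X => Y) = Supp(X union Y) / Supp(X)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

if __name__ == "__main__":
    regions = [
        {"T.high", "H.wet", "I.large"},
        {"T.high", "H.dry", "I.large"},
        {"T.mid", "H.wet", "I.small"},
        {"T.high", "H.wet", "I.large"},
    ]
    print(support({"T.high"}, regions))                 # 0.75
    print(confidence({"T.high"}, {"I.large"}, regions))  # 1.0
```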
Fuzzy rules obtained from these data explain the relationships among the different factors and the intensity of the COVID-19 outbreak in a particular region. The factors or variables in the antecedent of the rules are identified as the most relevant factors responsible for the intensity of the outbreak. Different symbols are utilized to represent the different attributes, as shown in Table 1. The COVID-19 data, including the socio-demographic data, are organized as shown in Table 2. The main steps of the static fuzzy association rule mining procedure are as follows (an illustrative sketch of the first few steps is given at the end of this subsection).
• Step 1: From Table 2, consider 903 in the first record of attribute D as an example; it is converted into a fuzzy set using the fuzzy membership function. The membership values of 903 are (0.2425|mid + 0.2575|large), as 903 lies in both the "mid" and "large" classes (as demonstrated in Fig. 3). This step is repeated for all the attributes of the dataset.
• Step 2: After converting all the values into fuzzy sets, fuzzy normalization is performed using the ratio of two components. The first component is the membership value of attribute X in one of its fuzzy classes, and the second is the sum of the membership values of attribute X in all of its fuzzy classes. Applying this ratio to the first record yields the normalized values; while Table 3 shows a set of raw membership values of attribute D, Table 4 shows the normalized equivalents.
• Step 3: After normalization, we add all the normalized values of the same fuzzy class. The sum is then divided by the total number of records in the dataset to obtain the support of each fuzzy class, and these classes are collected in the set Y of itemsets. Consider attribute D as an example: support values are determined individually for the "small", "mid", and "large" fuzzy classes of D, as they are three different fuzzy classes of D. To demonstrate the calculation, we choose the "small" class of attribute D. We take the values of D.small from each record in Table 4 and add them together, i.e., (0+0+0+1+0+1+1+1+1+1) = 6. The sum (here 6) is then divided by the number of records (here 10) to obtain the support of the "small" class of attribute D. So (6/10) = 0.6 becomes the support of D.small.
• Step 4: Next, we compare each fuzzy class's support with the predefined minimum support value, which is 0.2 here. Taking the support of D.small from Step 3, it is greater than the minimum support, so D.small is retained as a frequent fuzzy itemset. The intermediate steps (Steps 5 to 8) generate the candidate itemsets from the frequent fuzzy classes and compute the confidence value of each candidate rule; for the example rule considered here, the confidence is obtained as 70.03%.
• Step 9: After obtaining the confidence value of each rule, it is compared with the minimum confidence value defined earlier. In this experiment, we set the minimum confidence value to 60%. Therefore, only those association rules whose confidence values are greater than or equal to 60% are considered reliable and qualified; otherwise, the rules are rejected.
• Step 10: Finally, only those rules with D and I in the consequent part are treated as important rules, because the effect of the different attributes on deaths or infections can be better realized when these attributes appear in the consequent.

The proposed method for the streaming setting is described in a nutshell here. In this scenario, the data concerning COVID-19 statistics arrive continuously; that is, the numbers of deaths, infected persons, and recovered persons are gathered day by day. The climatic and socio-demographic attributes, like temperature, population density, and humidity, may differ from region to region within a particular country. Hence, to study this behavior, we focus on a specific region, and the day-wise data of COVID-19 patients, including the climatic and socio-demographic data, are collected.
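A minimal sketch of Steps 1 to 4 for a single attribute is given below. The membership degrees for the value 903 (0.2425 for "mid" and 0.2575 for "large") are taken from the worked example above; since the membership functions of Fig. 3 are not reproduced here, the degrees are passed in directly, and the remaining records are hypothetical placeholders.

```python
# Sketch of Steps 1-4 for one attribute (D). Membership degrees are assumed to
# come from the membership functions of Fig. 3, which are not reproduced here;
# the degrees for 903 (0.2425 mid, 0.2575 large) match the worked example, and
# the other records are hypothetical.

CLASSES = ("small", "mid", "large")

def normalize(memberships):
    """Step 2: divide each class membership by the sum over all classes."""
    total = sum(memberships.values())
    return {c: (memberships.get(c, 0.0) / total if total else 0.0) for c in CLASSES}

def class_supports(records):
    """Step 3: support of each fuzzy class = sum of normalized values / #records."""
    supports = {c: 0.0 for c in CLASSES}
    for memberships in records:
        for c, v in normalize(memberships).items():
            supports[c] += v
    return {c: v / len(records) for c, v in supports.items()}

if __name__ == "__main__":
    # Step 1 output for the first record (value 903) plus hypothetical records.
    d_records = [
        {"mid": 0.2425, "large": 0.2575},  # 903 -> 0.485 mid, 0.515 large after Step 2
        {"small": 1.0},
        {"small": 0.6, "mid": 0.4},
    ]
    supports = class_supports(d_records)
    print(supports)
    # Step 4: keep the classes whose support meets the minimum support (0.2 here).
    print({c for c, s in supports.items() if s >= 0.2})
```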
In this context, motivated by the work in [26], we introduce a streaming model for the COVID-19 data in the fuzzy scenario, although that previous work does not consider the fuzzy setting. In this work, a particular chunk of data, termed a window, is considered at a time. At each time point, the frequent itemset algorithm is applied iteratively, and the frequent itemset database is updated each time. This frequent itemset database contains those itemsets that have been present for a long time. At the same time, the support of each itemset is calculated for each time window. Therefore, if any of the old itemsets appears again in the next time window, its support value is computed using a weighted average approach, and the frequent itemset database is updated accordingly. The step-by-step approach of the proposed model is described in the next few subsections.

Here, all the attributes of the dataset are numeric. So, to find the frequent itemsets in the current window, we follow Steps 1 to 7 as described before in the methodology of association rule mining in the static setting (Section 3.1). After collecting the frequent itemsets (with support value greater than or equal to the minimum support) for the current window, we update the frequent itemset database (FID). The FID contains the itemsets that are either frequent for the current window, or not frequent for the current window but frequent for a long time (for a number of windows greater than or equal to a predefined minimum), together with their corresponding "stored support value". We calculate the stored support value of each itemset in the FID with a simple weighted average strategy. The stored support value of an itemset at the i-th time instant, denoted S^i_FID, is given by

S^i_FID = ((n − 1)/n) · S^(i−1)_FID + S_current/n,    (1)

where n is the window size, S^(i−1)_FID denotes the stored support value of the same itemset in the previous FID, and S_current is the support value of the frequent itemset (FI) in the current window. Considering the minimum support α, we apply this weighted average technique to every frequent itemset of the current window, i.e., every itemset with support greater than or equal to α. For a new itemset, the first part, ((n − 1)/n) · S^(i−1)_FID, becomes zero, as the itemset was not present in the FID before, and only the second part, S_current/n, is stored in the FID as the stored support value of the FI. In this way, an itemset that appears for the first time does not receive the full weight of its support in the current window; the support of a new itemset that has appeared only once cannot be highly weighted. At the same time, if an itemset suddenly becomes infrequent but has been frequent for a long time, instead of immediately removing it from the FID, we decrease its stored support value using the same weighted average technique: in this case S_current/n is zero and only the ((n − 1)/n) · S^(i−1)_FID part is stored in the FID as the stored support value of the itemset. Let the minimum number of occurrences of an itemset in the FID be θ. If the total number of occurrences of an itemset becomes less than θ (counted from the window where it first occurred as an FI), that itemset is deleted from the FID.
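A minimal sketch of this FID maintenance, using Eqn. (1), is given below. The dictionary layout, the way occurrences are counted, and the pruning test (θ interpreted as a fraction of the windows elapsed since the itemset first became frequent, as in the 60% rule used later) are one plausible reading of the description, not the authors' exact implementation; the window size n = 15 matches the value used in the experiments.

```python
# Sketch of the streaming FID update using the weighted average of Eqn. (1).
# The bookkeeping (first window, occurrence counts) and the pruning rule are a
# plausible reading of the description above, not the authors' exact code.

def update_fid(fid, current_frequent, window_index, n=15, theta=0.6):
    """Update the frequent-itemset database for one window.

    fid maps itemset -> dict(stored=float, first=int, occurrences=int).
    current_frequent maps itemset -> support in the current window (>= min support).
    """
    # Itemsets frequent in the current window: blend old stored support with the new one.
    for itemset, s_current in current_frequent.items():
        entry = fid.setdefault(itemset, {"stored": 0.0, "first": window_index, "occurrences": 0})
        entry["stored"] = ((n - 1) / n) * entry["stored"] + s_current / n   # Eqn. (1)
        entry["occurrences"] += 1

    # Itemsets absent from the current window: decay the stored support, prune if too rare.
    for itemset in list(fid):
        if itemset in current_frequent:
            continue
        entry = fid[itemset]
        entry["stored"] = ((n - 1) / n) * entry["stored"]                   # S_current = 0
        windows_since_first = window_index - entry["first"] + 1
        if entry["occurrences"] < theta * windows_since_first:
            del fid[itemset]
    return fid

if __name__ == "__main__":
    fid = {}
    # I.small frequent in the first window with actual support 1 -> stored 1/15 ~ 0.0667.
    update_fid(fid, {frozenset(["I.small"]): 1.0}, window_index=1)
    print(round(fid[frozenset(["I.small"])]["stored"], 4))   # 0.0667
```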
When we update the FID for each window, the association rules extracted from the corresponding FID itemsets are simultaneously kept up to date. For each FID, all the possible association rules, along with their corresponding confidence values, are generated. Each rule's confidence value is then compared with the predefined minimum confidence value (say, β), and only the rules having confidence greater than or equal to β are kept. In the final step, the stable rules that are consistently present in different time windows are identified. Considering a stability threshold γ, if the fraction of windows from which the same rule is obtained is greater than or equal to γ, the rule is considered stable. This means that even if a rule is not generated in some successive windows, it is not removed immediately; instead, it remains in the database for a certain time, which reflects the long-term dependency among the attributes and shows the utility of the streaming approach.

In this section, the two kinds of datasets used for the experiments (static and streaming scenarios) are described. The experiments were performed in MATLAB 2015 on an Intel(R) 2.4 GHz CPU machine with 8 GB RAM running Windows 10. Thirteen countries worldwide are selected to prepare the dataset for mining static associations. To prepare the data for the analysis in the streaming scenario, we collected the day-wise COVID-19 data of a specific region along with the corresponding climatic and socio-demographic data.

In the static experimental setting, we apply the fuzzy association rule mining technique to extract different important rules connecting the diverse climatic and socio-demographic factors with the COVID-19 pandemic data. For this experiment, we use 0.2 as the minimum support and 60% as the minimum confidence. Keeping the rules with confidence values greater than or equal to 60%, subsets of interesting rules are generated, and these are reported in Table 6. Among the various rules that evolved from the experimental analysis, some important rules are demonstrated in this table. It can be noticed, for instance, how the number of deaths behaves in the regions where the temperatures are medium.

For the streaming experiment, we consider a sliding window size of 15 and a minimum support value of 0.5; the results over different windows are reported in Tables 7-11. The itemset I.small has a support value of 1 for the first sliding window (computed according to Step 3 of Section 3.1; this support value is termed the actual support value hereafter), and it is greater than the minimum support value, i.e., 0.5. Thus, to compute the weighted support, we need to consider two factors, namely, the stored support value and the actual support value (as mentioned in Eqn. 1). The stored support value means the support of an already existing itemset present in the FID. I.small becomes a frequent itemset (FI) for this sliding window and is inserted into the current FID after evaluating its stored support value using Eqn. 1. Here, the window size is 15, and as it is the first sliding window, there was no past FID. Therefore, the S^(i−1)_FID term contributes zero, and the stored support value of I.small is (1/15) or 0.0667, which is recorded in the current FID as shown in Table 7. If the itemset is consistently found as an FI for some time period, then the stored support values in the corresponding FIDs gradually increase. However, these values never exceed 1, because the maximum actual support value of any FI is 1. Table 8 shows the itemsets in the FID at a later instant, corresponding to the 30th sliding window.
It can be observed that the stored support value of I.small is equal to 0.8127, because I.small constantly remained frequent from the first sliding window up to the 30th sliding window. In this approach, if an itemset had been frequent for a long time but has now become infrequent, we do not remove it immediately from the FID. Here, we set the limit as 60%: if the total number of occurrences of an FI becomes less than 60% of the number of windows, counted from the first window in which it was found to be frequent, the FI is deleted from the FID. In Table 9, the stored support value of I.small has decreased to 0.2259, as I.small is not frequent in the current sliding window, but its total number of occurrences as an FI is still greater than or equal to 60% of the windows. However, I.small is no longer present in Table 10; at this instant, it has been purged from the current FID. Some interesting rules generated from different windows, having confidence values greater than or equal to 70%, are shown in Tables 12-14.

Here, we consider 0.5 (50%) as the stability factor. If the number of windows from which the same rule is obtained is greater than or equal to 50% of the total number of windows, the rule is considered a stable rule. Some interesting stable rules, along with their stability percentages, evolved over the different instances (with different window sizes) are reported in Tables 15-17, keeping only the rules with the output attributes (D and R in this case) in the consequent part. We also performed the experiments with the window size varied to 10, and the interesting frequent rules obtained for window numbers 50 and 100 are demonstrated in Tables 18 and 19. In these experiments, we choose 50% as the threshold value for selecting the stable rules across different windows. However, this can be customized, and any threshold value can be selected as per the requirement of the decision maker. To exemplify, in Tables 15 and 16, as we fixed the threshold value at 50%, all the rules having a stability value greater than 50% are retrieved. Here, by the nature of the data, all these rules have equal stability values, i.e., 55% in Table 15. However, this may not always be the case; for example, different stability factors of the rules, like 69% and 51%, can be seen in Table 17. Besides this, if the decision maker wishes to choose more stable rules, he or she may increase the threshold value; this eventually generates more restrictive rules, and the number of rules becomes smaller, but these rules are more persistent throughout the whole period. In Table 18, it can be observed that the 4th, 5th and 6th rules have a confidence of 100%, which denotes the importance of those rules. Similarly, in Table 19, the 1st, 3rd and 4th rules have a confidence value of 100%.

In a normal situation, the support values of an itemset at different time points are calculated, and these support values change abruptly depending on the presence of the itemset in specific time windows. We consider a specific time-span to observe the behavior of the support value of a particular temperature itemset (i.e., T.mid) and notice the changes in the support value across different time durations, as demonstrated in Fig. 7. The behavior of the stored support value in the streaming situation, when the support values are weighted based on the previous window and the current window, is demonstrated in Fig. 8.
Fig. 7 shows that the support value remains constant for a specific time period and becomes zero from window number 47 to 71. This is due to the absence of that itemset in the original data during that time-span: the immediate effect on the support value is reflected, while the long-term effect is ignored. A sudden fall of the support value after window number 25 can also be noticed in Fig. 9. In contrast, Fig. 10 exhibits the window-wise behavior of the stored support value of I.small. Here, the stored support value of I.small gradually increases as long as this itemset is present across different time windows, and its sudden absence in a particular time window does not remove this itemset from the database immediately; rather, the stored support value decreases slowly.

Comparing the static rules in Table 6 with the stable streaming rules in Table 17, an interesting rule can be noticed in the latter: the intensity of recovery becomes medium if the temperature, humidity, and population density are high, wet, and moderate, respectively. Interestingly, this rule is identified as a stable rule due to its persistence over the whole period of time, considering all the time windows in the streaming situation. As mentioned before, understanding these types of complex associations between the different intensity levels of climatic and socio-demographic factors is otherwise difficult. Thus, various preventive measures can be adopted based on the different outcomes observed from the derived fuzzy association rules. Importantly, as the streaming situation is also considered, policy makers may become aware of the seasonal effect as well. In this way, the proposed methodology can assist decision makers and governments in choosing appropriate preventive measures to fight against COVID-19.

In the previous analysis for the static situation, mentioned in Section 4.1, we considered the COVID-19 statistics of the persons who died after 50 days of the first reported instance (Table 6). However, the confidence of {T.mid, P.moderate} ⇒ D.small is high in this situation. To compare the performance with the recent work in [24], we conducted the same experiment on some common geographical locations, e.g., New York City and Mumbai, considering the streaming case. In this experiment, we employed the same dataset available in [24], and the window size was kept fixed at 15. At first, we carried out our analysis for New York City, and some important rules extracted from this experiment, with their corresponding stability percentages, are shown in Table 21. In this first experiment, the temperature of New York City is within the range 5.17-27.5 degrees Celsius. As shown in Fig. 2, the temperature attribute is divided into three fuzzy sets, namely T.high, T.mid and T.low; here we consider T.low to cover temperatures less than or equal to 12, T.mid to cover the range from 10 to 19, and T.high to start from 18 degrees Celsius. From Table 21, one interesting stable rule can be noticed, showing that infections become high when the temperature is low, with a stability of 56.8%. The threshold value of the stability factor is chosen as 0.5 (50%), as in the previous experiment. It can be observed that another stable rule appears in the database, which implies {T.low, P.high} ⇒ I.large. A similar observation is also made in [24], where a negative correlation exists between temperature and COVID-19 cases per day. One more interesting rule reveals that infection becomes low when the temperature remains high, the humidity is within the comfort level, and the population density is high.
As the stability of this rule is 26.3% (i.e., below the chosen 50% threshold), it is not a stable rule, even though it appears in the database. Thus, the negative correlation between temperature and COVID-19 cases can be inferred from these rules for New York City. We also performed another set of experiments on a different geographical location, i.e., Mumbai, and the corresponding rules obtained from the experiment are reported in Table 22. In this analysis, it can be observed that the joint effect of {T.high, H.wet, P.high} implies that the infection level will be large. For this city also, similar characteristics are observed in [24], i.e., infections become higher when the temperature is high. Hence, the positive correlation between temperature and COVID-19 cases per day is also observed in this experiment. Here, the stability percentage of all these important rules is 41.1% (i.e., less than 50%), so 40% is considered as the threshold value in this case. In such a situation, the decision makers need to be less restrictive in choosing the threshold value in order to obtain some associations. The streaming condition, which captures the dynamic behaviour of the data, is not considered in [24]. Additionally, it is difficult to understand the joint effect of a large number of environmental characteristics for different instances of a streaming situation if only the pairwise correlation is computed; pairwise correlation alone is therefore not convenient for finding the most stable relationships/rules throughout the whole period. This demonstrates the utility and effectiveness of the proposed research on the socio-demographic data for adopting appropriate decisions.

Over the last many months, the whole world has been affected in different ways due to the COVID-19 pandemic.

References
[1] A new coronavirus associated with human respiratory disease in China.
[2] Early dynamics of transmission and control of COVID-19: a mathematical modelling study.
[3] Fast algorithms for mining association rules. In: Proc. Very Large Data Bases (VLDB), vol. 1215.
[4] SEIR modeling of the COVID-19 and its dynamics.
[5] A novel biclustering approach to association rule mining for predicting HIV-1-human protein interactions.
[6] A data-driven epidemiological prediction method for dengue outbreaks using local and remote sensing data.
[7] Genomic variance of the 2019-nCoV coronavirus.
[8] Phase-adjusted estimation of the number of coronavirus disease 2019 cases in Wuhan.
[9] Using machine learning models to predict in-hospital mortality for ST-elevation myocardial infarction patients.
[10] Forecasting the dynamics of COVID-19 pandemic in top 15 countries in April 2020: ARIMA model with machine learning approach.
[11] Forecasting the novel coronavirus COVID-19.
[12] Analysis of a fractional SEIR model with treatment.
[13] A simple SEIR mathematical model of malaria transmission.
[14] Effective containment explains sub-exponential growth in confirmed cases of recent COVID-19 outbreak in mainland China.
[15] Mathematical models of SIR disease spread with combined non-sexual and sexual transmission routes.
[16] Predicting turning point, duration and attack rate of COVID-19 outbreaks in major western countries.
[17] Prediction for the spread of COVID-19 in India and effectiveness of preventive measures.
[18] Mathematical modelling of the dynamics of the coronavirus COVID-19 epidemic development in China.
[19] An SEIR infectious disease model with testing and conditional quarantine.
[20] A machine learning forecasting model for COVID-19 pandemic in India.
[21] Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing.
[22] Analysis on novel coronavirus (COVID-19) using machine learning methods.
[23] Predicting COVID-19 spread level using socio-economic indicators and machine learning techniques.
[24] Exploring dependence of COVID-19 on environmental factors and spread prediction in India.
[25] Using a fuzzy association rule mining approach to identify the financial data association.
[26] Frequent set mining for streaming mixed and large data.