key: cord-1044201-cjcftceo authors: Chakraborty, S. title: Monitoring COVID-19 Cases and Vaccination in Indian States and Union Territories Using Unsupervised Machine Learning Algorithm date: 2022-05-04 journal: Ann DOI: 10.1007/s40745-022-00404-w sha: c14ed4936f1ed53a5e6d6d46de42fb6871ecad7d doc_id: 1044201 cord_uid: cjcftceo

The worldwide spread of the novel coronavirus originating from Wuhan, China, led to the ongoing COVID-19 pandemic. The disease, being contagious, was transmitted rapidly in India through people with travel histories to the affected countries and their contacts who tested positive. Millions of people across all states and union territories (UTs) were affected, leading to serious respiratory illness and deaths. In the present study, two unsupervised clustering algorithms, namely k-means clustering and hierarchical agglomerative clustering, are applied to the COVID-19 dataset in order to group the Indian states/UTs based on the effect of the pandemic and the vaccination program for the period from March, 2020 to early June, 2021. The aim of the study is to observe the plight of each state and UT of India in combating the novel coronavirus infection and to monitor their vaccination status. The study will be helpful to the government and to the frontline workers striving to restrict the transmission of the virus in India. The results will also provide a source of information for future research regarding the COVID-19 pandemic in India.

COVID-19 [1] [2] [3] [4], which originated in Wuhan, China, in December 2019, is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The cause of the respiratory disease was confirmed by WHO [4] as a novel coronavirus on 12th January 2020. The variable symptoms observed in COVID-19 patients are fever, cough, fatigue, breathing difficulties, and loss of smell and taste.
Severe consequences were observed in patients, such as cytokine storms, multi-organ failure, septic shock, blood clots, and damage to the lungs and heart. The transmission increased rapidly in all countries through people with travel histories to the affected countries and their contacts. In India, the first case of COVID-19 was reported in Kerala on 30th January 2020, and the count rose to three cases by 3rd February. After a gap of almost one month, 22 new cases were reported on 4th March 2020. The transmission rate grew during March 2020, and India's first COVID-19 fatality, a 76-year-old man who had returned from Saudi Arabia, was reported on 12th March. The rapid spread of the disease led to a nationwide lockdown for 21 days starting from 25th March, 2020, which was extended by 14 days and thereafter by a further two weeks with certain relaxations. Considering the population of India [3] of about 121,05,69,573, controlling the extensive consequences of the pandemic was quite challenging for the administrative officials. Initially, the authorities decided to test only those people who had returned from high-risk countries or who had come in contact with positive cases. Owing to the substantial increase in the number of cases, the government later decided to test all people with pneumonia, irrespective of travel or contact history. With strict restrictions and guidelines imposed by the government, the number of cases decreased to 9000 per day by February, 2021. A second wave was observed throughout India from early April, 2021, and by the end of the month over 4 lakh cases and more than 3500 deaths were being reported in a day. India began its vaccination program on 16th January, 2021 with two vaccines approved by the DCGI (Drug Controller General of India). The frontline workers, i.e., doctors, nurses, hospital staff and policemen, were the first to receive doses of the vaccine.
Thereafter, the vaccination drive was extended to all residents over the age of 45 and later to all residents over the age of 18. In the present study, two unsupervised clustering algorithms [5] [6] [7] [8], namely k-means clustering and hierarchical agglomerative clustering, are applied to the Indian states/UTs to group them on the basis of their demographic characteristics, their numbers of confirmed cases and deaths due to COVID-19, and their vaccination status. The use of data science in monitoring COVID-19 cases and vaccination status helps in gaining insight from the dataset and extracting meaningful information, which can be further used for predicting future patterns and behaviours. Unsupervised clustering algorithms are used because the available COVID-19 dataset for India is unlabelled. The number of clusters to be formed is not known in advance, so it is desirable that the clustering algorithm divides the dataset into groups based on their similarities. Wide-ranging research is being carried out on the COVID-19 pandemic; a brief review of the limited available literature is therefore presented in Sect. 2. Sections 3.3 and 3.4 discuss the confirmed cases and deaths due to COVID-19 in India as a whole and in each state and UT, respectively. The vaccination status of India and of its states/UTs, considering their population, is discussed in Sects. 3.5 and 3.6. Section 4 performs cluster analysis on the COVID-19 dataset using the k-means and hierarchical agglomerative clustering algorithms. The concluding remarks and the future scope of the research are discussed in Sect. 5.

Extensive research, both pathological and statistical, is being carried out on COVID-19 datasets across the world in order to observe the trend of infection transmission and to combat the spread of the novel coronavirus. In India, as on 8th June, 2021, more than two crore people had been infected with the novel coronavirus and approximately 3.5 lakh people had succumbed to COVID-19 [9].
As on 8th June, 2021, the total number of people affected by the coronavirus across the world was more than 18 crore, and more than 39 lakh deaths had been reported [4]. Apart from the loss of precious lives, the pandemic has had a severe impact on the Indian economy and led to a negative growth rate for the first time in decades. The pathological research aims to study the evolution, replication and pathogenesis [10] of the novel coronavirus, its transmission trend [11], its clinical features, diagnosis and treatment [12], and to observe the impact of the pandemic based on parameters such as air temperature, relative humidity [13], age and gender. To perform statistical research on COVID-19 datasets, various statistical models are being used by researchers [14] [15] [16] [17] [18] [19] [20] [21], and artificial intelligence techniques [22] are being suggested for predicting the further spread of the pandemic. Gondauri et al. [23] use Bailey's model to study and analyse the cases based on coronavirus spread in different countries. Based on the experimental results, the authors drew conclusions on the state of the virus spread and recovery up to 30th March, 2020. A spline-based time series with a Bayesian model is used by Kumar et al. [24] to identify the transmission stages of COVID-19 infection in India. The Susceptible-Exposed-Infectious-Recovered (SEIR) model is used by Pai et al. [25] to forecast the active COVID-19 cases in India, considering the effect of the nationwide lockdown and the possible rise in active cases after unlocking the nation.

The research data included in the present study are for India, the second most populated country in the world, which has 28 states and 8 union territories. The total population [3] of the country is 121,05,69,573, with a density of 382/km². The rural and the urban population of India are 69% and 31% respectively.
The COVID-19 dataset for India and the demographic details of each state and UT used in this study are extracted from the official websites administered by the government of India [1, 2, 9]. The analysis includes approximately 16,000 records in comma separated values (CSV) format containing day-wise information on the number of confirmed cases, active cases, cured cases and deaths from March, 2020 to 8th June, 2021. The vaccination details contain the total number of people vaccinated with the first and the second dose, starting from 16th January, 2021, in all states and UTs of the country for different age groups. The computations are performed using Microsoft Excel 2010, and RStudio Desktop 1.3.1093 is used for implementing the clustering algorithms.

The trend observed in Table 1 and Figs. 1 and 2 shows that the number of COVID-19 cases started rising in India from March, 2020. The cases peaked during September, 2020 and thereafter declined towards February, 2021. This period of twelve months can be considered the first wave of the pandemic in India. From March, 2021 the number of cases started increasing again, reaching its maximum during May, 2021. The cases started decreasing by the onset of June, 2021. This period of four months can be considered the second wave of the pandemic in India. Although the two waves differ in duration, the total number of cases and the peak observed during the second wave, with 92,84,558 confirmed cases in May, 2021, are much higher than those observed during the first wave. The second wave in India was more fatal than the first wave. It is observed that, among all the states and UTs, Maharashtra reported the highest number of COVID-19 cases and deaths during both the first and the second wave.
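The wave definitions above can be turned into per-state totals, which are the kind of inputs the later cluster analysis uses. The following is a minimal Python sketch, not the author's workflow (the study used Microsoft Excel and RStudio). It assumes the day-wise records hold cumulative counts per state; the record layout, the function name `wave_totals` and the exact cut-off days are illustrative assumptions, and the sample records are made up.

```python
from datetime import date

# Wave boundaries follow the text (first wave to February 2021, second wave
# to 8 June 2021); the exact cut-off days are an assumption.
FIRST_WAVE_END = date(2021, 2, 28)
SECOND_WAVE_END = date(2021, 6, 8)


def wave_totals(records):
    """Return {state: (cases_w1, deaths_w1, cases_w2, deaths_w2)}.

    records: iterable of (day, state, cumulative_confirmed, cumulative_deaths).
    Counts are assumed to be cumulative, so a wave total is the difference
    between the cumulative values at the end and at the start of the wave.
    """
    last_w1 = {}  # latest cumulative counts on or before FIRST_WAVE_END
    last_w2 = {}  # latest cumulative counts on or before SECOND_WAVE_END
    for day, state, confirmed, deaths in sorted(records):
        if day <= FIRST_WAVE_END:
            last_w1[state] = (confirmed, deaths)
        if day <= SECOND_WAVE_END:
            last_w2[state] = (confirmed, deaths)

    totals = {}
    for state, (c2, d2) in last_w2.items():
        c1, d1 = last_w1.get(state, (0, 0))
        totals[state] = (c1, d1, c2 - c1, d2 - d1)
    return totals


if __name__ == "__main__":
    # Hypothetical records for a single made-up state.
    sample = [
        (date(2020, 3, 10), "State A", 5, 0),
        (date(2021, 2, 28), "State A", 100, 2),
        (date(2021, 6, 8), "State A", 250, 7),
    ]
    print(wave_totals(sample))
```

If the source file instead stores daily new cases, the subtraction step would be replaced by a plain sum over each date range.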
Looking into the severity of the second wave, the foremost priority of the government is to speed up the vaccination process among the residents. Vaccines were unavailable for almost the entire first wave. The vaccination process started in India from 16th January, 2021. Table 3 depicts the vaccination status of India as on 8th June, 2021 (Fig. 7). It is observed from Table 3 that, out of the total population of the country, 15.44% of people have received the first dose, whereas only 3.73% have received both doses. These are mainly the frontline workers and residents aged 45 and above. Table 4 depicts the demographics and the vaccination status of each state and UT of India. The demographic details include the total population of the state/UT, its population density and the rural and urban population percentages (Figs. 8, 9). The following observations are made from Table 4:
• In the states Assam, Bihar, Jharkhand, Meghalaya, Nagaland, Tamil Nadu, Uttar Pradesh and West Bengal, less than 15% of the total population has been vaccinated with the first dose
• The states/UTs Dadra and Nagar Haveli, Goa, Himachal Pradesh, Ladakh and Lakshadweep have run very good vaccination programs, with more than 30% of residents vaccinated with the first dose

Cluster analysis is a statistical data mining technique [5] used for grouping data that have similar parameter values. In the present study, two data mining clustering techniques, namely k-means clustering and hierarchical agglomerative clustering, are applied to the COVID-19 dataset for grouping the states/UTs based on their demographics, their numbers of confirmed COVID-19 cases and deaths, and their vaccination status.

k-means clustering [5] is an unsupervised clustering technique in which the dataset is partitioned into k clusters such that the variance among the data points within each cluster is minimised. Each data point belongs to the cluster with the nearest mean.
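In standard textbook notation (the symbols below are not taken from the original text), the quantity that k-means seeks to minimise is the total within-cluster sum of squares, where $S_1, \dots, S_k$ are the clusters and $\mu_i$ is the mean of the points in $S_i$:

```latex
\min_{S_1, \dots, S_k} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```

This is the same total within-cluster sum of squares that the elbow method plots against k.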
For each cluster, the k-means algorithm returns the mean value of each parameter. In the present study, k-means clustering is applied to group the states/UTs based on three cases:
• Case A_kclust Clustering the states/UTs based on the total number of COVID-19 cases and deaths during the first and the second wave
• Case B_kclust Clustering of states/UTs to observe the vaccination status with respect to their population
• Case C_kclust Clustering the states/UTs with respect to the number of COVID-19 cases and deaths together with their vaccination status

The following steps are performed while implementing the k-means clustering algorithm:
1. Determination of the parameters used for clustering the dataset into groups
2. Implementation of the elbow method [2, 5] for determining the optimal number of clusters. The elbow method, also known as the knee-of-curve method, is a heuristic approach used to determine the number of clusters in a dataset.
3. Application of k-means clustering using the optimal number of clusters determined by the elbow method.

Case A_kclust: Clustering the states/UTs based on the number of COVID-19 cases and deaths. The four parameters considered while performing the clustering operation are: total confirmed cases during the first wave, total deaths during the first wave, total confirmed cases during the second wave and total deaths during the second wave. Figure 10 depicts the result of the elbow method applied to the dataset. It is observed from Fig. 10 that the total within-cluster sum of squares does not vary much beyond 3 clusters, and so 3 is considered the optimal number of clusters. Applying the k-means algorithm with k = 3 gives clusters of sizes 28, 7 and 1. The result and the plot of the k-means clustering are shown in Table 5 and Fig. 11 respectively.
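The three steps above (choose parameters, run the elbow method, apply k-means with the chosen k) can be sketched as follows. This is an illustrative Python sketch rather than the author's RStudio code: it uses plain Lloyd iterations with a deterministic farthest-first initialisation, both of which are assumptions made here so that the example is reproducible, and the data points are made up rather than taken from the study's tables.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k):
    """Farthest-first initialisation (an assumption made for determinism)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest chosen centre is farthest away.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest centre."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]


def kmeans(points, k, max_iter=100):
    """Return (labels, centers, total within-cluster sum of squares)."""
    centers = init_centers(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centers[j])  # keep the centre of an empty cluster
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss


def elbow_curve(points, k_max):
    """Within-cluster sum of squares for k = 1..k_max (input to an elbow plot)."""
    return {k: kmeans(points, k)[2] for k in range(1, k_max + 1)}


if __name__ == "__main__":
    # Made-up two-parameter data with three visible groups.
    data = [(1, 1), (1, 2), (2, 1), (2, 2),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (30, 1), (30, 2), (31, 1), (31, 2)]
    for k, wss in elbow_curve(data, 5).items():
        print(k, round(wss, 2))
    labels, centers, _ = kmeans(data, 3)
    print([labels.count(j) for j in range(3)])
```

The value of k at which the printed sum of squares stops dropping sharply plays the role of the "elbow" read off Fig. 10. With parameters on very different scales (case counts versus percentages), standardising the columns before clustering is a common choice; the text does not state whether this was done.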
The states belonging to each cluster are depicted in Table 6. The following inferences are made from the results of k-means clustering depicted in Tables 5 and 6:
• Maharashtra is the worst-hit state, with the maximum number of confirmed COVID-19 cases and deaths during both waves
• Although there is a substantial increase in the number of COVID-19 cases in the states belonging to cluster 2, overall they are moderately affected
• The states belonging to cluster 1 have fewer cases than those in cluster 2

Case B_kclust: Clustering of states/UTs to observe the vaccination status with respect to their population. The parameters considered while performing the clustering operation are the percentages of residents vaccinated with the first and the second dose out of the total population. The elbow method is used to determine the optimal number of clusters, as depicted in Fig. 12. The elbow method suggests 3 as the optimal number of clusters. The k-means algorithm applied to the dataset with k = 3 gives clusters of sizes 18, 4 and 14. Table 7 depicts the mean percentage of people vaccinated with the first and the second dose in each cluster. Figure 13 shows the plot of the k-means clustering for Case B_kclust. The states belonging to each cluster are depicted in Table 8. The following observations are noted from the results in Tables 7 and 8:
• The states belonging to cluster 1 have more than 50% and 10% of residents vaccinated with the first and second dose respectively
• The states belonging to cluster 2 are the least vaccinated states
• The first dose vaccination rate of the states/UTs in this cluster is less than 20%, which is quite alarming
• The average vaccination rate of the states/UTs belonging to cluster 3 is moderate, with 27% for the first dose and 6% for the second dose

Case C_kclust: Clustering the states/UTs based on their COVID-19 cases and deaths together with their vaccination status.
The parameters considered while applying k-means clustering are: total confirmed cases and deaths during the first and the second wave, and the percentage of residents receiving the first and the second dose of the vaccine. As in Case A_kclust and Case B_kclust, the elbow method shows 3 as the optimal number of clusters, as depicted in Fig. 14. Applying the k-means algorithm with k = 3, three clusters of sizes 7, 1 and 28 are formed. The result and the cluster plot for Case C_kclust are depicted in Table 9 and Fig. 15. The states belonging to each cluster are depicted in Table 10. The following observations are made considering Tables 9 and 10:
• Maharashtra, the only state belonging to cluster 2, has the highest number of COVID-19 cases and deaths in both waves, with quite a small number of residents vaccinated
• The states/UTs in cluster 1 are moderately affected by COVID-19 cases and deaths and have a low vaccination rate
• The states/UTs belonging to cluster 3 have the fewest COVID-19 cases compared to the other two clusters, with the highest vaccination rate

Hierarchical clustering is an unsupervised data mining statistical approach [5] used for grouping data with similar characteristics by building a hierarchy of clusters. Two types of hierarchical clustering can be performed, i.e., agglomerative and divisive. The type of hierarchical clustering performed in the present study is agglomerative, which creates groups from the bottom up. In this method, each observation of the dataset is initially considered a cluster of its own, which merges with other clusters while moving up the hierarchy. The squared Euclidean distance is used to measure similarity between observations, and the average linkage method is used to evaluate the distance between clusters.
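The bottom-up merging described above can be illustrated with a short Python sketch. It is a naive implementation written for clarity, not the author's RStudio code: every observation starts as its own cluster, and at each step the two clusters with the smallest average pairwise squared Euclidean distance are merged until the requested number of clusters remains. The function names and the sample points are hypothetical.

```python
def sq_euclidean(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def average_distance(points, cluster_a, cluster_b):
    """Average linkage: mean pairwise distance between members of two clusters."""
    total = sum(sq_euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Merge clusters bottom-up until n_clusters remain.

    Returns (clusters, merges): clusters is a list of lists of point indices;
    merges records (members_a, members_b, distance) for each merge, which is
    the information a dendrogram is drawn from.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return clusters, merges


if __name__ == "__main__":
    # Made-up points: two tight pairs and one distant outlier.
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    clusters, merges = agglomerate(pts, 2)
    print(sorted(sorted(c) for c in clusters))
```

In this made-up example the distant point stays in a cluster of its own, which mirrors how a single extreme observation can remain a singleton cluster in the dendrograms.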
Given two points p (p1, p2) and q (q1, q2), the Euclidean distance between p and q is measured as:

$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$

The squared Euclidean distance is the square of this quantity. The algorithm is applied to group the states/UTs based on the following three cases:
• Case A_hclust Clustering the states/UTs hierarchically based on the total number of COVID-19 cases and deaths during the first and the second wave
• Case B_hclust Clustering of states/UTs to observe the vaccination status with respect to their population
• Case C_hclust Clustering the states/UTs based on their COVID-19 cases and deaths together with their vaccination status

The following steps are performed while implementing the hierarchical agglomerative clustering algorithm:
1. Determination of the parameters used for clustering the dataset into groups
2. Implementation of the elbow method [26] for determining the optimal number of clusters
3. Application of hierarchical agglomerative clustering using the optimal number of clusters determined by the elbow method
4. Creating the cluster dendrograms and highlighting individual clusters
5. Creating the phylogenetic tree representation for better understanding of the clusters

Case A_hclust: Hierarchical clustering is applied with the parameters total confirmed cases and deaths during the first and the second wave. The dendrogram showing the clustering of states/UTs for Case A_hclust is depicted in Fig. 16. Each cluster is marked by a red border. The following observations are made from Fig. 16:
• Maharashtra is the only state belonging to cluster 1, with more than twenty lakh cases in both waves
• Twenty-eight states/UTs group together to form cluster 2
• Cluster 2 is formed by merging three sub-clusters:
  • The first sub-cluster contains eight states having less than 4 lakh cases during the first wave and less than 7 lakh cases during the second wave. The total number of cases over both waves is less than 10 lakh
  • The second sub-cluster contains 15 states having less than 90,000 cases during the first wave and less than 2 lakh cases during the second wave
  • The third sub-cluster has 5 states having less than 3 lakh cases during the first wave and less than 6 lakh total cases over both waves
• Cluster 3 contains 7 states having more than 5 lakh confirmed cases and more than 4000 deaths during both waves

A phylogenetic tree for Case A_hclust using the "Radial" representation is depicted in Fig. 17.

Case B_hclust: The clustering of states/UTs to observe the vaccination status is done using the percentage of residents receiving the first and the second dose of the vaccine out of their total population. The dendrogram showing the results is depicted in Fig. 18. The following observations are made from the dendrogram representing the vaccination status of the states/UTs depicted in Fig. 18:
• The union territories Ladakh and Lakshadweep, belonging to cluster 1, have more than 50% of residents vaccinated with the first dose
• The 15 states/UTs grouped into 7 sub-clusters merge to form cluster 2
• The first dose vaccination percentage in cluster 2 is less than 20% of the total population
• The states/UTs in this cluster have the lowest share of vaccinated residents
• Cluster 3 contains 19 states/UTs grouped by sub-clusters having vaccination percentages between 20 and 50%

Fig. 18 Dendrogram showing clustering of states/UTs for Case B_hclust

A radial representation of the phylogenetic tree for Case B_hclust is depicted in Fig. 19.

Case C_hclust: The clustering of the states/UTs based on their COVID-19 cases and deaths together with their vaccination status has the following parameters: total confirmed cases and deaths during the first and the second wave, and the percentage of residents receiving the first and the second dose of the vaccine out of the total population.
The dendrogram and the radial representation of the phylogenetic tree showing the clustering of states/UTs for Case C_hclust are depicted in Figs. 20 and 21 respectively. The following observations are made from the dendrogram depicted in Fig. 20:
• Maharashtra is the only state belonging to cluster 1, with the highest number of COVID-19 cases during both waves and less than 20% of residents vaccinated with the first dose
• 28 states/UTs grouped into sub-clusters merge to form cluster 2. The sub-clusters are formed by the total number of cases during both waves and the vaccination percentage:
  • The first 2 states in the sub-cluster have total cases between 3 and 7 lakh with more than 20% vaccination
  • The next 6 states/UTs have up to 3 lakh cases during the first wave and more than 3 lakh cases during the second wave, with more than 8% vaccination
  • Sub-cluster 3, formed by grouping 15 states/UTs, has less than 60,000 cases during the first wave and up to 1.5 lakh cases during the second wave, with more than 10% vaccination
  • Sub-cluster 4 comprises 5 states with cases between 90,000 and 3 lakh during both waves and more than 10% vaccination
• The 7 states/UTs belonging to cluster 3 have more than 6 lakh COVID-19 cases during both waves, with less than 30% vaccination

The present study performs cluster analysis on the COVID-19 dataset of each state/UT of India from March, 2020 to 8th June, 2021. Two unsupervised clustering algorithms are applied to the COVID-19 and vaccination datasets in order to group and monitor the states and UTs based on the increase or decrease in their COVID-19 cases and deaths and on their vaccination status. The situation in the states having the maximum number of COVID-19 cases and deaths during both the first and the second wave, with less than 20% vaccination, is quite alarming. The worst-affected states during both waves are Maharashtra, Andhra Pradesh, Delhi, Karnataka, Kerala, Tamil Nadu, Uttar Pradesh and West Bengal.
Looking into the severity of the second wave, a faster vaccination process in the densely populated states may help in reducing the rapid transmission of the infection. The vaccination drive in the rural parts of the country is a major challenge, as myths regarding the vaccines prove to be a major obstacle. The results of the analysis presented in this study will provide useful information about the pandemic to the frontline workers combating the spread of the infection in the country. The results can also be used for pursuing further research for the betterment of government policies. The analysis can be carried out further at the district level of each state/UT to explore and identify useful information about the spread of the disease and the vaccination drive.

Ministry of Health and Family Welfare
Data mining-concepts and techniques
Introduction to business data mining
Optimization based data mining: theory and applications
Internet of things, real-time decision making, and artificial intelligence
Corona viruses: an updated overview of their replication and pathogenesis
Understanding epidemic data and statistics: a Case Study of COVID-19
Coronavirus disease 2019: What we know
High temperature and high humidity reduce the transmission of covid-19
Epidemic parameters for COVID-19 in several regions of India
The age and sex distribution of COVID-19 cases and fatalities in India
Inferring epidemic parameters for COVID-19 from fatality counts in Mumbai
Estimating the number of COVID-19 infections in Indian hot-spots using fatality data
Monitoring novel corona virus (COVID-19) infections in India by cluster analysis
Culture vs policy: more global collaboration to effectively combat COVID-19
What are the underlying transmission patterns of COVID-19 outbreak? An age-specific social contact characterization
Joint modeling of longitudinal CD4 count and time-to-death of HIV/TB co-infected patients: a case of jimma university specialized hospital
AI techniques for COVID-19
Research on covid-19 virus spreading statistics based on the examples of the cases from different countries
Study of the trend pattern of COVID-19 using spline-based time series model: a Bayesian paradigm
Investigating the dynamics of COVID-19 pandemic in India under lockdown