key: cord-0755443-inlcfxtx
authors: Ahmed, Imran; Ahmad, Misbah; Jeon, Gwanggil
title: Social distance monitoring framework using deep learning architecture to control infection transmission of COVID-19 pandemic
date: 2021-02-17
journal: Sustain Cities Soc
DOI: 10.1016/j.scs.2021.102777
sha: 6dc7c6b33e1799471d8e1a9cb76180e31163e16a
doc_id: 755443
cord_uid: inlcfxtx

The recent outbreak of COVID-19 has affected millions of people worldwide, and the number of infected people continues to rise. To cope with the global pandemic situation and prevent the spread of the virus, various unprecedented precautionary measures have been adopted by different countries. One of the crucial practices to prevent the spread of viral infection is social distancing. This paper presents a social distance framework based on a deep learning architecture as a precautionary step that helps to maintain, monitor, manage, and reduce the physical interaction between individuals in a real-time top view environment. We use Faster-RCNN for human detection in the images. As human appearance varies significantly in a top view perspective, the architecture is trained on a top view human data set. Moreover, taking advantage of transfer learning, a newly trained layer is fused with the pre-trained architecture. After detection, the pair-wise distance between people in an image is estimated using Euclidean distance. The detected bounding box information is utilized to measure the central point of each individual's bounding box. A violation threshold is defined that uses distance-to-pixel information and determines whether two people violate social distance or not. Experiments are conducted using various test images; the results demonstrate that the framework effectively monitors the social distance between people. The transfer learning technique enhances the overall performance of the framework, achieving an accuracy of 96% with a False Positive Rate of 0.6%.

The novel COVID-19 virus was initially reported in Wuhan, China, in late December 2019, and within only a few months it had affected millions of people worldwide [1]. The contagious virus is a form of Severe Acute Respiratory Syndrome (SARS) that spreads through the respiratory system. It usually spreads through the air, by direct exposure of healthy humans and animals to infected ones. The virus is transmitted through cough or sneeze droplets of infected humans or animals and can travel up to 2 meters (6 feet). The World Health Organization (WHO) declared it a pandemic in March 2020 [2]. To date, this deadly virus has infected almost 8 million people around the globe. The recent number of confirmed cases around the globe [3] is presented in Figure 1. Medical experts, scientists, and researchers have been working intensively to develop a vaccine and medicine for this lethal virus. To limit the spread of the virus, the global community is looking for alternative measures and precautions. In the current situation, social distance management is affirmed as one of the best practices to limit the spread of this infection worldwide. It involves decreasing the physical contact of individuals in crowded environments (such as offices, shopping marts, hospitals, schools, colleges, universities, parks, airports, etc.) and sustaining enough distance between people [4], [5].
By reducing close physical interaction between individuals, we can flatten the curve of reported cases and slow down the possibilities of virus transmission. Social distance management is crucial for those individuals who are at higher risk of severe sickness from COVID-19. It can significantly reduce the risk of the virus' spread and the disease's severity, as explained in Figure 2 (adopted from [6]). If a proper social distance mechanism is executed at the primary stages, it might play a central part in preventing the infection transmission and limiting the pandemic's peak (number of sick people), as evidenced in Figure 2. Because the large number of confirmed cases arising on a daily basis leads to a shortage of pharmaceutical essentials, social distancing is recognized as one of the most reliable measures to limit the transmission and spread of the contagious infection. As Figure 2 shows, virus transmission across the globe slows down; the graphs clearly indicate that virus transmission is controlled by social distance management, as the number of sick people decreases with an increasing number of recovered people. With proper social distance mechanisms in public places, the number of infected people/cases and the burden on healthcare organizations can be reduced and controlled. It decreases fatality rates by ensuring that the number of infected cases does not exceed the capacity of healthcare organizations [7].

Researchers have made different efforts [8], [9], [10] to develop efficient methods for social distance monitoring. They utilized different machine and deep learning-based approaches to monitor and measure social distancing among people. Authors use various clustering and distance-based approaches to determine the distance between people; however, they mostly concentrate on side or frontal view perspectives, as shown in Figure 3. In such perspectives, specific camera calibration is required to map pixel information to practical and measurable distance units (such as meters or feet). The developed approaches also suffer from occlusion problems, as they mostly concentrate on side and frontal view camera perspectives. Conversely, if the same scene is captured from a top view, the distance calculations become better, and occlusion problems are also overcome with a wide coverage of the scene. Researchers, such as [11], [12], [13], [14], [15], [16], and [17], adopted a top view perspective for detection, counting, and tracking of people in different applications. A top view perspective gives a wide field of coverage and handles occlusion problems much better than a side or frontal view.

The main contributions of the presented work are as follows:
• To detect individuals in a top view environment using a deep learning architecture (Faster-RCNN), further trained on a top view human data set using transfer learning.
• To utilize the Euclidean distance approach to compute the distance between each pair of detected bounding boxes.
• To define a distance-based violation threshold for social distance, utilizing a pixel-to-distance estimation approach to monitor people's interaction.
• To evaluate the architecture's performance by assessing it on a top view data set. The results of the presented framework are evaluated with transfer learning.

The remaining work presented in the article is organized as follows. The various methods developed for monitoring of social distance are elaborated in Section 2. In Section 3, the deep learning framework developed for monitoring of social distance is discussed. The data set utilized throughout the experimentation is concisely presented in Section 4.
The comprehensive summary of the outcomes and the evaluation performance of the framework is likewise explained in Section 4. Section 5 addresses the conclusion of the presented work with future trends.

Social distancing has been identified as the most important practice since late December 2019, after the growth of the COVID-19 pandemic. It was adopted as a standard practice to stop the transmission of the infectious virus on January 23, 2020 [18]. During the first week of February 2020, the number of cases was increasing at an exceptional rate, with reported cases ranging from 2,000 to 4,000 per day. Later, there was a sign of relief for the first time, with no new positive cases for five successive days until March 23, 2020 [19]. This was attributed to the social distancing exercise, which was started in China and later utilized globally to control the spread of COVID-19. Kylie et al. [20] examined the correlation between the strictness of social distancing and the region's economic condition and reported that modest steps of this exercise could be adopted to avoid a massive outbreak. They also performed monitoring of facial masks using facial images. Pouw et al. [38] proposed an effective graph-based system for distance monitoring and crowd management. The authors of [39] produced a human detection model for a crowded environment; the system is developed for monitoring of social distance, and they considered a robot with an RGB depth camera and a 2-D lidar to perform collision-free navigation in crowd gatherings. Some researchers presented social distance monitoring systems for public situations; however, most techniques concentrate on side and frontal camera perspectives, e.g., [10] and [9]. In this work, we introduce a deep learning framework to monitor social distance in a top view environment.

Many researchers have utilized the top view for different surveillance applications and achieved good results. Ahmad et al. [12] presented a deep learning model for top view person detection and tracking. Ahmed et al. [13] applied different deep learning models for top view multiple object detection. The authors of [14] introduced a feature-based detector for person detection in different overhead views. Further, in [15], the authors proposed a robust feature-based method for overhead view person tracking. In [16], a background subtraction-based person counting system for an overhead view is presented. Migniot et al. [17] proposed a hybrid 3D and 2D human tracking system for a top view surveillance environment. Therefore, inspired by these works, we also developed a system that allows a better view of the scene and overcomes occlusion concerns, playing a pivotal role in social distance monitoring.

In this work, a deep learning framework is introduced for top view social distance monitoring. The general overview of the work presented in the paper is shown in Figure. For human detection, a deep learning paradigm is applied; different kinds of object detection algorithms are available, such as [40], [41], [42], [43], and [44]. Owing to its best performance, in this work we use Faster-RCNN [45]. Compared to other RCNN-based architectures, in Faster-RCNN the object detection and region proposal generation tasks are performed by the same convolutional network; for this reason, object detection is much faster. The architecture applies a two-stage network to estimate object class probabilities and bounding boxes. The architecture is pre-trained on the MS-COCO data set [46].
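To make this transfer learning setup concrete, the sketch below shows one way to load a Faster-RCNN detector with a ResNet-50 backbone pre-trained on MS-COCO and replace its head for a single person class, assuming the torchvision implementation (v0.13+ API). It is a minimal illustration of the general approach, not the authors' released code; the class count, input size, and weights name are assumptions.

```python
# Minimal sketch (assumed setup, not the authors' code): load a pre-trained
# Faster R-CNN and replace its head so it can be fine-tuned for a single
# "person" class on a top-view data set.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_top_view_detector(num_classes: int = 2):  # 1 "person" class + background
    # Backbone and RPN weights come from MS-COCO pre-training (transfer learning).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box predictor (classification + regression head) with a new,
    # randomly initialised layer for the top-view person class.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

if __name__ == "__main__":
    model = build_top_view_detector()
    model.eval()
    # Dummy top-view frame (3 x H x W, values in [0, 1]) just to check the forward pass.
    dummy = [torch.rand(3, 480, 640)]
    with torch.no_grad():
        detections = model(dummy)[0]
    print(detections["boxes"].shape, detections["labels"], detections["scores"])
```

In this sketch, only the new box predictor is randomly initialised; during fine-tuning on the top view data set, the pre-trained backbone and RPN weights serve as the starting point, which mirrors the transfer learning strategy described above.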
The architecture is additionally trained for top view human detection. The detection algorithm detects humans in the image and initializes their bounding box color to green. In the next step, after human detection, the center point, also called the centroid, of each detected bounding box is computed and used to monitor social distancing. The detail of the detection and monitoring modules is given in the following subsections.

A deep learning architecture, i.e., Faster RCNN [45], is adopted to perform human detection from the top view perspective. The detection paradigm has a two-stage architecture. The general architecture of Faster RCNN applied for top view human detection is given in Figure 6. At the first stage, a Region Proposal Network (RPN) [45] is employed to produce region proposals or feature maps for the sample image; convolution layers are used at this stage to produce the feature maps. The next stage utilizes a Fast-RCNN architecture [47], a deep convolutional network that selects specific object features from the various region proposals, as described in Figure. In our case, as discussed earlier, the visual appearance of the human body varies in the scene, and due to the highly flexible characteristics of CNNs, the RPN is tuned for region proposals of various sizes. We use ResNet-50 as the backbone architecture for RPN generation, which utilizes a sliding window approach over the last layer's feature maps. Unlike conventional deep learning architectures such as VGG, which usually stack convolution and fully connected layers without any shortcut/skip connections and are therefore defined as plain networks (for more explanation, readers are referred to [48]), ResNet-50 utilizes skip connections, also termed shortcut connections, which allow a deeper network to map the preceding layer's input to a subsequent layer's input without any modification or alteration of the input.

The RPN produces region proposals, also termed k anchors, at every window location, with various aspect ratios and scale sizes, and provides a reliable bounding box regression, parameterized as:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},$$
$$t^*_x = \frac{x^* - x_a}{w_a}, \quad t^*_y = \frac{y^* - y_a}{h_a}, \quad t^*_w = \log\frac{w^*}{w_a}, \quad t^*_h = \log\frac{h^*}{h_a} \tag{1}$$

In Equation 1, $x$ and $y$ express the bounding box coordinates, and $h$ and $w$ determine the height and width of the detected bounding box. The ground truth, anchor box, and predicted bounding box are indicated with $x^*$, $x_a$, and $x$, respectively (and analogously for $y$, $w$, and $h$). The object class score, i.e., person or human, is obtained from the class score layer. It can be seen in Figure 6 that top view images are passed through convolutional layers that produce feature maps, also shown with k anchors. The anchors are divided into two classes (person and no-person (background)) for anchor selection. The Intersection over Union (IoU) is applied to describe how the anchor or proposed regions overlap with the ground truth bounding boxes $G_t$. The IoU ratio is defined as [45], [12]:

$$\text{IoU} = \frac{\text{area}(A \cap G_t)}{\text{area}(A \cup G_t)} \tag{2}$$

where $A$ denotes the anchor or proposed region. Finally, a loss function is determined at the output of the RPN. The regression function is used to represent the predicted bounding box positions. The function combines the losses of bounding box regression and classification [45] as:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p^*_i) + \lambda \frac{1}{N_{reg}} \sum_i p^*_i \, L_{reg}(t_i, t^*_i) \tag{3}$$

In Equation 3, $L_{cls}$ indicates the classification loss and $L_{reg}$ the bounding box regression loss. The normalization coefficients are denoted by $N_{cls}$ and $N_{reg}$, and the parameter used as the weight between the two losses is determined by $\lambda$. The index of an anchor is $i$; the predicted probability of a human or person is described as $p_i$, while $p^*_i$ is the ground truth for classification. The anchor region is assigned to the positive class if $p^*_i = 1$, and to the negative class if $p^*_i = 0$ (a minimal sketch of this IoU-based anchor labeling is given below).
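The snippet below is a minimal sketch of this anchor selection step (an illustration, not the authors' code): it computes the IoU of Equation 2 between an anchor and the ground truth boxes and returns the label $p^*_i$ used in Equation 3. The 0.7/0.3 thresholds follow the original Faster-RCNN formulation [45] and are assumptions here.

```python
# Minimal sketch (illustrative, not the authors' code): compute IoU between an anchor
# and the ground-truth boxes, then assign the RPN class label used in Equation (3).
# Boxes are (x_min, y_min, x_max, y_max); the 0.7 / 0.3 thresholds are assumed
# following the original Faster-RCNN paper [45].

def iou(box_a, box_g):
    """Intersection over Union between two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_g[0]), max(box_a[1], box_g[1])
    ix2, iy2 = min(box_a[2], box_g[2]), min(box_a[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_a + area_g - inter
    return inter / union if union > 0 else 0.0

def anchor_label(anchor, ground_truths, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (person), 0 (background), or None (ignored during training)."""
    best = max((iou(anchor, g) for g in ground_truths), default=0.0)
    if best >= pos_thr:
        return 1          # positive anchor: p*_i = 1
    if best < neg_thr:
        return 0          # negative anchor: p*_i = 0
    return None           # neither positive nor negative: not used in the loss

if __name__ == "__main__":
    gt = [(100, 80, 180, 220)]                    # one ground-truth person box
    print(anchor_label((110, 90, 185, 230), gt))  # high overlap -> 1
    print(anchor_label((300, 300, 360, 420), gt)) # no overlap   -> 0
```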
In Equation 3, $t_i$ represents the predicted bounding box vector, and $t^*_i$ represents the corresponding ground truth vector. For $L_{cls}$, the classification loss between person and no-person, a logarithmic loss is determined, provided as:

$$L_{cls}(p_i, p^*_i) = -\left[ p^*_i \log p_i + (1 - p^*_i) \log(1 - p_i) \right] \tag{4}$$

The bounding box regression loss is provided as [45]:

$$L_{reg}(t_i, t^*_i) = \sum_{j \in \{x, y, w, h\}} \text{smooth}_{L1}(t_{i,j} - t^*_{i,j}), \qquad \text{smooth}_{L1}(z) = \begin{cases} 0.5\,z^2 & \text{if } |z| < 1 \\ |z| - 0.5 & \text{otherwise} \end{cases} \tag{5}$$

The model outputs the bounding box information for the detected class (person), shown as green boxes in Figure 6. This detected bounding box information is further processed by the social distance monitoring framework.

After human detection in top view images, the center point, also called the centroid, of each detected bounding box is computed, as shown with green boxes in Figure 7(a). The estimated centroid of the bounding box is displayed in Figure 7; Figure 7(b) illustrates a set of bounding box coordinates with its center point information. The centroid of the bounding box is computed as:

$$c_x = \frac{x_{min} + x_{max}}{2}, \qquad c_y = \frac{y_{min} + y_{max}}{2} \tag{6}$$

In Equation 6, $x_{min}$ and $x_{max}$ represent the minimum and maximum values of the bounding box along the width, and $y_{min}$ and $y_{max}$ the minimum and maximum values along the height. After computing the centroids, the distance between each pair of detected centroids is measured by applying the Euclidean distance approach. For every detected bounding box in the image, we first calculate the centroid of the bounding box, presented in Figure 7(b); then the distance is measured (represented with red lines) between every pair of detected bounding boxes, Figure 7(c). The distance between two centroids is mathematically computed as:

$$d(c_i, c_j) = \sqrt{(c_{x,i} - c_{x,j})^2 + (c_{y,i} - c_{y,j})^2} \tag{7}$$

A threshold value T is defined using the calculated distance values. This value is used to check whether any two or more people are at a smaller distance than the defined threshold of T pixels:

$$\text{violation}(i, j) = \begin{cases} 1 & \text{if } d(c_i, c_j) < T \\ 0 & \text{otherwise} \end{cases}$$

If two detected bounding boxes are too close, their color is updated and changed to red. The developed framework presents the total number of social distancing violations at the output, which is further passed to the surveillance unit (a minimal code sketch of this violation check is given at the end of this section).

This section explains the different experiments conducted in this work. For monitoring of social distance, a top view human data set [14], [49] is utilized. It can be seen that both training and testing accuracy rise considerably at the beginning of the 11th epoch, because we studied only a particular object class, i.e., person. From Figure 10, it is also important to note that both training and testing loss decrease after the 10th epoch. Different parameters have been utilized for evaluation purposes, i.e., true negatives, false positives, and false negatives. Furthermore, these parameters are applied to estimate Accuracy, Recall, F1-score, Precision, True Positive Rate (TPR), and False Positive Rate (FPR). The Accuracy, Recall, F1-score, and Precision results of both the pre-trained and the additionally trained detection architecture are reflected in Figure 12. We used the standard error method to show the average values. It can be observed that the accuracy of the detection architecture is 96%, Recall is 92%, the F1-score is 94%, and the Precision is 95%. These findings provide evidence that transfer learning and additional training enhance detection results. The TPR and FPR of the detection architecture after training on the top view human data set are depicted in Figure 13 and Figure 14. The TPR of the framework increased from 91% to 96%. Furthermore, the FPR of the framework improved, reducing from 0.9% to 0.6%, which reveals the efficiency of the deep learning architecture.
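As a concrete illustration of the monitoring module described in this section, the following is a minimal sketch of the violation check: it computes the centroids of Equation 6, the pairwise Euclidean distances of Equation 7, and flags pairs closer than the pixel threshold T. The bounding box format and the value of T are assumptions (in practice, T would come from the scene's pixel-to-distance calibration), and the code is illustrative rather than the authors' implementation.

```python
# Minimal sketch (illustrative, not the authors' code) of the social distance
# monitoring step: compute centroids (Eq. 6), pairwise Euclidean distances (Eq. 7),
# and flag pairs closer than the pixel threshold T as violations.
from itertools import combinations
from math import hypot

def centroid(box):
    """Center point of a bounding box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def find_violations(boxes, threshold_px):
    """Return index pairs whose centroids are closer than threshold_px pixels."""
    centers = [centroid(b) for b in boxes]
    violations = []
    for i, j in combinations(range(len(boxes)), 2):
        dist = hypot(centers[i][0] - centers[j][0], centers[i][1] - centers[j][1])
        if dist < threshold_px:          # too close: mark both boxes as violating
            violations.append((i, j))
    return violations

if __name__ == "__main__":
    detected = [(50, 40, 110, 200), (130, 45, 190, 210), (400, 60, 460, 220)]
    T = 120  # assumed pixel threshold corresponding to the minimum safe distance
    pairs = find_violations(detected, T)
    red = {k for pair in pairs for k in pair}   # people violating the distance
    print(f"violating pairs: {pairs}, total violations: {len(pairs)}")
    print(f"boxes drawn in red: {sorted(red)}; in green: "
          f"{sorted(set(range(len(detected))) - red)}")
```

Flagged indices would then be drawn in red and the total violation count reported to the surveillance unit, matching the output described above.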
For the pre-trained architecture, the accuracy of the framework is 91%, and after additional training with transfer learning, it achieves an accuracy of 96%. In the future, the work may be enhanced for various outdoor and indoor situations, and various detection architectures might be explored for social distance monitoring.

Highlights:
• This paper intends to present a social distance framework based on deep learning.
• We used Faster-RCNN for human detection in the images.
• The architecture is trained on the top view human data set.
• Taking advantage of transfer learning, a new trained layer is fused with a pre-trained architecture.
• After detection, the pair-wise distance between two people is estimated in an image.